Modelling company events

I'm working on a project to open up information about companies. Think of it as a cross between Crunchbase and OpenCorporates.

I am now looking at how best to store the data I've collected. A graph database is the obvious choice, and, given my interest in open data, I'm really keen to organise the data in a way that will be easy for others to use and understand. I've started by looking at RDF (really loved playing with neosemantics).

Broadly speaking I've got the following things to model:

  • Organizations
  • People
  • Roles
  • Locations
  • TargetEntities
  • Activities (e.g. an appointment of a person to a role, the acquisition of a target entity by an organization, the opening of an office in a new location). Each activity has a date/date range (e.g. happened on or happened between X and Y) and some information about it.

The sorts of things I'd love to get some guidance from the community on are:

  • What standard ontologies should I be leveraging for this sort of data?
  • Should activities be modelled as edges or as their own nodes?
  • How best to handle dates so that I can do queries like "show me the lifecycle events of company X over the past 3 years"?
  • Is RDF the best choice for making something super easy for other people to link up with their data, or is there a better way these days?

Thanks in advance.

There aren't really standard ontologies for data modeling. It all depends on what kinds of questions you want to ask of the data (what are your necessary queries), and then modeling the data in a way that answers those questions with maximum performance.

That sounds more complicated than it is. For instance, if you model CompanyA and Activity as generic nodes and create a unique relationship for each activity, you might end up with thousands or more relationships between two nodes. Traversing all of those each time you run a query isn't as efficient as creating more specific nodes and relationship types, so that queries are more targeted and filter out unwanted results earlier. I would initially gravitate towards creating each activity as a separate node, and then adjust as needed. Dates could be separated out into nodes (one for each date OR one for each year + one for each month + one for each day), though I might start with the date as the relationship type and see what kinds of queries you run to see if that needs to be optimized.
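To make the "activity as its own node" idea a bit more concrete, here is a minimal in-memory sketch in plain Python. It is not Neo4j code. The class names, the `INVOLVED_IN` relationship mentioned in the comments, and the sample events are all made up for illustration, and it assumes a single-day event can be treated as a date range of length one.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Activity:
    """An activity stored as its own node rather than as an edge (illustrative)."""
    kind: str                    # e.g. "APPOINTMENT", "ACQUISITION"
    start: date                  # the day it happened, or the start of a range
    end: Optional[date] = None   # None means a single-day event
    details: str = ""

    def overlaps(self, window_start: date, window_end: date) -> bool:
        # Treat a single-day event as a range that starts and ends on the same day.
        last = self.end or self.start
        return self.start <= window_end and last >= window_start


@dataclass
class Organization:
    name: str
    # Stands in for (org)-[:INVOLVED_IN]->(activity); the relationship name is hypothetical.
    activities: List[Activity] = field(default_factory=list)


def lifecycle_events(org: Organization, window_start: date, window_end: date) -> List[Activity]:
    """Return the organization's activities that touch the window, oldest first."""
    hits = [a for a in org.activities if a.overlaps(window_start, window_end)]
    return sorted(hits, key=lambda a: a.start)


# Made-up sample data, purely for illustration.
company = Organization("CompanyX")
company.activities += [
    Activity("OFFICE_OPENING", date(2015, 3, 1), details="New office opened"),
    Activity("ACQUISITION", date(2019, 6, 10), date(2019, 9, 30), "Acquired a target entity"),
    Activity("APPOINTMENT", date(2020, 1, 15), details="Person appointed to a role"),
]

recent = lifecycle_events(company, date(2018, 1, 1), date(2020, 12, 31))
print([a.kind for a in recent])
```

Because each activity carries its own dates and details, the "lifecycle over the past 3 years" question becomes a filter on activity nodes attached to one company, and the relationships themselves stay simple.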

As far as data format goes, I'd naturally lean towards converting to JSON or CSV because those formats are more widely usable. However, I'm not an expert in RDF, so I would be less comfortable approaching data in that format. :slight_smile:

Cheers,
Jennifer

Thanks so much for taking the time to respond @jennifer_reif

What I'm taking away from this is that it's better to have more content in the nodes and keep the relationships as simple as possible. Is that right?

That would be my natural tendency, yes. You can have some properties on relationships and/or nodes where it makes sense, but Neo4j is optimized for hopping across nodes and relationships, so paths are better. :slight_smile:
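As a rough illustration of "paths over properties", here is a small plain-Python sketch of the year/month date-node idea mentioned earlier in the thread. It is only a toy: the nested dictionaries stand in for year and month nodes, the `attach` and `activities_in_years` helpers and the activity ids are hypothetical, and real Neo4j traversal works differently under the hood.

```python
from collections import defaultdict
from typing import DefaultDict, List

# year -> month -> activity ids, standing in for year nodes linked to month
# nodes, with activities attached to the month they occurred in.
date_tree: DefaultDict[int, DefaultDict[int, List[str]]] = defaultdict(
    lambda: defaultdict(list)
)


def attach(activity_id: str, year: int, month: int) -> None:
    """Hang an activity off the month node it occurred in."""
    date_tree[year][month].append(activity_id)


def activities_in_years(first_year: int, last_year: int) -> List[str]:
    """Hop year -> month -> activity instead of scanning every activity's date."""
    found: List[str] = []
    for year in range(first_year, last_year + 1):
        # .get avoids creating empty year entries for years with no activity.
        months = date_tree.get(year, {})
        for month in sorted(months):
            found.extend(months[month])
    return found


# Made-up sample data, purely for illustration.
attach("office-opening", 2015, 3)
attach("acquisition", 2019, 6)
attach("appointment", 2020, 1)

print(activities_in_years(2018, 2020))
```

The lookup only visits the years it was asked about, which is the same intuition as starting a query from a small set of date nodes and following paths outwards.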