Which is the better data model: creating more nodes, or using more properties?

Hello, I'm building out a data model in Neo4j and trying to decide which of two options is better. See the attached image:

In Option 1, Person is connected to two EmailAddress nodes (as they have two emails) and three Address nodes, each with a distinct relationship type depending on whether it's a billing, mailing, or living address.

In Option 2, all of the data is instead properties on the Person node. The emails they have are stored as a list.

Multiple Person nodes may share data. For example, if two people have Boston as their living city, then in Option 1 both people would have LIVES_AT relationships to the same Boston node, while in Option 2 each Person would just have the property living_city set to "Boston".
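To make the two shapes concrete, here is a rough sketch in plain Python rather than Cypher, so it runs without a database. The names, ids, and email addresses are made up for illustration, and the dict/list layout is only an assumption about how the two options could be mimicked, not how Neo4j stores anything.

```python
# Option 1: the city is a node of its own and people point at it.
# All ids, names, and addresses below are hypothetical.
nodes = {
    "p1": {"label": "Person", "name": "Alice"},
    "p2": {"label": "Person", "name": "Bob"},
    "c1": {"label": "City", "name": "Boston"},
}
relationships = [
    ("p1", "LIVES_AT", "c1"),
    ("p2", "LIVES_AT", "c1"),
]

# Option 2: everything is a property on the Person; emails are a list.
people = [
    {"name": "Alice", "living_city": "Boston",
     "emails": ["alice@example.com", "alice@example.org"]},
    {"name": "Bob", "living_city": "Boston",
     "emails": ["bob@example.com"]},
]


def residents_option1(city_id):
    """Follow LIVES_AT relationships that end at the given city node."""
    return sorted(
        nodes[src]["name"]
        for src, rel_type, dst in relationships
        if rel_type == "LIVES_AT" and dst == city_id
    )


def residents_option2(city_name):
    """Compare the living_city property on every person."""
    return sorted(p["name"] for p in people if p["living_city"] == city_name)
```

Both functions return the same people here; the difference is that in Option 1 "Boston" exists once and is shared, while in Option 2 it is a string repeated on each Person.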

We are interested in running some of the graph algorithms on our data, and I understand that Option 1 would be better suited to that use case. But would it make more sense to use properties, as in Option 2, so that we don't clutter our model with too many nodes?

Thank you!

You described what I was going to tell you. Graph databases are for analyzing relationships between entities, and they are especially powerful with a network of connected data. Analyzing such relationships with a relational database is impractical.

You should use nodes for information that can be used to relate multiple entities. Properties are good for metadata.

Sometimes I feel people want to use a graph database because it is new and in fashion, when they could do just fine with a relational database using parent and child tables.

I like Option 1 as it presents a good pictorial description of the issues. It's like seeing is believing!

Technically you could implement both. If you are looking for people living in a specific location, then having the location as a node would be faster and more efficient, since you can start from the location node and follow its relationships instead of checking the property on every Person node. I would not include the specific street address in the location node, as that is specific to the person, not the location. It could instead be stored as a property on the LIVES_AT relationship.
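As a rough illustration of that idea, here is a small Python sketch (not Cypher, and not how Neo4j is implemented internally). It keeps the LIVES_AT relationships grouped by location, with the street address stored on the relationship rather than on the location. The people, cities, and streets are invented for the example.

```python
from collections import defaultdict

# LIVES_AT relationships grouped by the location they point to.
# The street belongs to the relationship, not to the shared location.
lives_at = defaultdict(list)


def add_lives_at(person, city, street):
    """Record that `person` lives in `city`, at `street`."""
    lives_at[city].append({"person": person, "street": street})


# Hypothetical sample data.
add_lives_at("Alice", "Boston", "1 Example St")
add_lives_at("Bob", "Boston", "2 Sample Ave")
add_lives_at("Carol", "Denver", "3 Placeholder Rd")


def residents(city):
    """Start at the location and read its relationships; no scan over people."""
    return [rel["person"] for rel in lives_at.get(city, [])]


def street_of(person, city):
    """Look up the relationship property for one person, or None if absent."""
    for rel in lives_at.get(city, []):
        if rel["person"] == person:
            return rel["street"]
    return None
```

Here `residents("Boston")` only touches the two relationships attached to Boston, however many other people exist.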

Nodes are the smarter choice. Don't worry about cluttering the db with too many nodes; Neo4j is designed to efficiently handle a very large number of them.

If your model is purely a list of Person nodes, like an address book, then it doesn't make much difference. In that case there would be no reason to prefer a graph db over a relational one, either.

However, if your model includes interactions between people and/or organisations, and email is involved, there's additional information available here:

  • The fact that the communication was by email, not by post.
  • Which address was involved.

The same thing goes for the physical addresses, and this is where graph databases start to shine.

When the model is small and simple, this might look like hair-splitting. However, as you accumulate a large number of entities, and a large number of interactions between them, there's an increasing amount of information in these distinctions. One of the terms for this kind of thing is "traffic analysis" - discovering relationships between people/organisations, and the nature of those relationships, just by looking at who communicated with whom, by what address, and when, without ever needing to know the content of the communications.
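A toy version of that kind of analysis, sketched in plain Python: it assumes each communication is recorded only as sender address, recipient address, and date, plus a mapping from address to owner. All of the addresses, names, and dates are hypothetical.

```python
from collections import Counter

# Which person owns which address (hypothetical data).
owner = {
    "alice@work.example": "Alice",
    "alice@home.example": "Alice",
    "bob@work.example": "Bob",
    "carol@home.example": "Carol",
}

# Each message: (sender address, recipient address, date). No content needed.
messages = [
    ("alice@work.example", "bob@work.example", "2024-01-03"),
    ("alice@work.example", "bob@work.example", "2024-01-04"),
    ("alice@home.example", "carol@home.example", "2024-01-05"),
]


def contact_counts(messages, owner):
    """Count contacts per (sender person, recipient person, sender address)."""
    counts = Counter()
    for sender, recipient, _date in messages:
        counts[(owner[sender], owner[recipient], sender)] += 1
    return counts


counts = contact_counts(messages, owner)
```

Even in this tiny sample, the address tells you something the person alone does not: Alice reaches Bob from her work address and Carol from her home address.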

You can always start with one of these approaches and switch to the other later, but this additional information is only available if you use nodes to represent addresses. If you start with nodes and later discover that you really don't need that extra information, you can collapse the address nodes into attributes in the Person nodes, and throw it away. However, if you start with attributes, you're throwing that information away at the start, and you can only add it in later by going back to the original source data.
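The "collapse" direction is the easy one. A minimal sketch in Python, assuming the graph is held as a dict of people plus a list of (person, address) pairs standing in for the email relationships; the function name and the sample data are made up for illustration.

```python
def collapse_emails(people, has_email):
    """Fold separate email address nodes into an `emails` list on each person."""
    # Copy each person's properties and start with an empty list of emails.
    collapsed = {name: dict(props, emails=[]) for name, props in people.items()}
    for person, address in has_email:
        collapsed[person]["emails"].append(address)
    return collapsed


# Hypothetical sample data.
people = {
    "Alice": {"living_city": "Boston"},
    "Bob": {"living_city": "Boston"},
}
has_email = [
    ("Alice", "alice@work.example"),
    ("Alice", "alice@home.example"),
    ("Bob", "bob@work.example"),
]

flat = collapse_emails(people, has_email)
```

After collapsing, each address is just a string in a list, so anything that was attached to the address node itself (for example, which messages went through it) has nowhere to live. That is the information you would have to rebuild from the source data if you went the other way.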