Heavy relationships vs Node hops. Which is better?

Assume two products, P1 and P2, are related. Our end response should return all the details of how these products are related in a specific country.

Let's say P2 is related to P1 only in certain countries and via a specific platform.

Countries: C1,C2,C3
Platforms: iOS, Android, Web

The current RDBMS model, translated as-is to a graph, looks like the structure below.

[Image: graph]

The detail node has around 10 properties, and each Country node contains some more properties plus the platform it's sold on.

MATCH (P1:Product)-[:saleRelation]->(D)-[:relatedTo]->(P2:Product),
      (D)-[:soldAt]->(C:Country)
WHERE P1.Name = "iPhone" AND C.Name = "C1"
RETURN P1, D, C, P2

In the above query, the number of node hops, filters, and db hits is higher.

If the same data is modeled differently, as a direct relationship between products with heavy relationships (many properties) as shown below, the node hops, filtering, and db hits are much lower by comparison.

[Image: graph_newModeling]

MATCH (P1:Product)-[relation:soldAtC1]->(P2:Product)
WHERE P1.Name = "iPhone"
RETURN P1, relation, P2

Surprisingly, while testing both approaches, the first approach gave the better query response time, which wasn't what I expected, since it has more db hits.

More importantly, the question is: while modeling, should relationships be light or heavy? Or should we introduce nodes just to hold data and connect them via more relationships?

Should more hops and more nodes be preferred over multiple relationships between the same nodes?

Are db hits directly proportional to the runtime of a query?

There isn't a good general answer to your question but I'll give you a couple of principles that I use. Ultimately the right model is one that balances a couple of different needs:

  • The model should be simple and close to how you think about your domain, for example when you're whiteboarding the problem. Models that are far away from how you think about your domain will be hard to query.
  • The model should be designed with some of your queries in mind. Models exist to help you answer questions about your data. It's exceedingly hard to come up with a model in isolation; you need to know what kinds of questions you want to ask. If you don't start with some idea of the questions you'll ask, it's unlikely you'll end up with a good or convenient model for the questions you come up with later.

So, to give you some more specific answers: one thing to consider is that Neo4j doesn't let you index relationship properties at present. If you put in piles of relationships with lots of properties on them, and later want to issue queries that do a lot of selection based on relationship property values, you're not going to be positioning yourself well. Nodes can have indexed properties, and often a better way is to start from a set of nodes, cut that down drastically with filtering/indexing, and then traverse a tiny subset of your graph, rather than traversing a lot of edges and then filtering them.
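As a rough sketch of the "filter on indexed nodes first, then traverse" idea, using the labels and properties from the question: the index names are made up for illustration, and the syntax assumes Neo4j 4.x or later (3.x releases use the older `CREATE INDEX ON :Product(Name)` form).

```cypher
// Index names (product_name, country_name) are hypothetical.
CREATE INDEX product_name FOR (p:Product) ON (p.Name);
CREATE INDEX country_name FOR (c:Country) ON (c.Name);

// Anchor on the indexed node properties, so only the few matching
// nodes are expanded rather than every relationship in the graph.
MATCH (p1:Product {Name: "iPhone"})-[:saleRelation]->(d)-[:relatedTo]->(p2:Product),
      (d)-[:soldAt]->(c:Country {Name: "C1"})
RETURN p1, d, c, p2;
```

Whether the planner actually starts from an index is something to check in the query plan rather than assume.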

Another thing to keep in mind is that an extra node in between shouldn't be much of a performance hit, because Neo4j fundamentally makes traversals cheap (that's a good thing).

dbhits is in some ways proportional to the runtime of the query, yes. The more data your database has to consider to answer your query, the more work it has to do, and the slower it will be. This is why we use techniques like filtering & indexes -- to explicitly cut down on how much data needs to be considered to answer the query. In the absolute worst case, dbhits basically equals how much data you have in the database, and you're doing a full database scan for a query, which is how you get maximally bad performance.
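If you want to compare the two models yourself, prefixing a query with `PROFILE` runs it and reports db hits per operator in the plan (`EXPLAIN` shows the plan without running it). A sketch using the two queries from the question:

```cypher
// First model: intermediate detail node plus Country node.
PROFILE
MATCH (p1:Product)-[:saleRelation]->(d)-[:relatedTo]->(p2:Product),
      (d)-[:soldAt]->(c:Country)
WHERE p1.Name = "iPhone" AND c.Name = "C1"
RETURN p1, d, c, p2;

// Second model: one heavy relationship per country.
PROFILE
MATCH (p1:Product)-[relation:soldAtC1]->(p2:Product)
WHERE p1.Name = "iPhone"
RETURN p1, relation, p2;
```

Looking at where in each plan the db hits accumulate usually tells you more than the totals alone.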

You can read more about what dbhits are here:

Some other modeling things you might consider. In your "RDBMS model" you break out all of the individual 10 details into their own nodes. This is fine -- keep in mind, though, that with graphs it's usually good practice to break categorical variables into their own nodes. Say you have a field "color" which can be red/blue/green; that's a categorical variable. Turn that into its own node and link to it (don't make it a node property), so that you can navigate to other things that share that color.
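A minimal sketch of the color example. The `:Color` label, the `HAS_COLOR` relationship type, the `name` property, and the second product are all assumptions made up for illustration:

```cypher
// One shared node per color value, linked from each product.
MERGE (red:Color {name: "red"})
MERGE (a:Product {Name: "iPhone"})
MERGE (b:Product {Name: "iPad"})
MERGE (a)-[:HAS_COLOR]->(red)
MERGE (b)-[:HAS_COLOR]->(red);

// Navigate through the color node to find other products sharing it.
MATCH (p:Product {Name: "iPhone"})-[:HAS_COLOR]->(c:Color)<-[:HAS_COLOR]-(other:Product)
RETURN other.Name, c.name;
```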

Sometimes you want grouping instance nodes too. In your graph you say an iPhone has "details". Well there's an abstract iPhone but there's also say a product instance (space grey 16gb iPhone 7, or whatever). So keep in mind you can do (:ProductClass)<-[:instance_of]-(:ProductInstance)-[:has_detail]->(:Color) or whatever. This allows the ProductClass to have certain details, while allowing the product instance to differ on those details, or add extras.
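Spelled out, that pattern might look something like this. The labels and relationship types come from the pattern above; the `name` property and its values are assumptions for the example:

```cypher
// An abstract product class, one concrete instance, and one detail.
CREATE (cls:ProductClass {name: "iPhone 7"})
CREATE (inst:ProductInstance {name: "iPhone 7 16GB Space Grey"})
CREATE (grey:Color {name: "space grey"})
CREATE (inst)-[:instance_of]->(cls)
CREATE (inst)-[:has_detail]->(grey);

// List each instance of the class along with its color detail.
MATCH (cls:ProductClass {name: "iPhone 7"})<-[:instance_of]-(inst:ProductInstance)-[:has_detail]->(c:Color)
RETURN inst.name, c.name;
```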


@david_allen gave a great response, this is just my two cents on when I start to model a new graph.

I follow the rule of thumb that nouns are nodes and verbs are relationships, and I actually say aloud or write down in a sentence what my domain model is. Personally, I lean heavily toward creating nodes and try to limit the number of properties. This is usually because in a lot of my queries I'm looking for shared commonalities between entities, which are easier to find when those commonalities are nodes and I'm looking for relationships between nodes.

When I model I also look for "lazy speech" in my model. A popular example is the email model. You could model (user)-[:emails]->(user), but "email" is not a real verb; it's lazy speech. You really send someone an email. An email is a noun. So the model would be (user)-[:sends]->(message)-[:to]->(user). So I look for those lazy speech patterns that could turn out to be a pitfall.

At #graphconnect there was an excellent presentation on how the query engine works, which has helped me think about how to tune my model for the queries I will be executing.
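A small sketch of the email model with the message as its own node. The `:User` and `:Message` labels and the `name` and `subject` properties are assumptions for the example:

```cypher
// The message is a noun, so it becomes a node with its own properties.
CREATE (alice:User {name: "Alice"})
CREATE (bob:User {name: "Bob"})
CREATE (m:Message {subject: "Hello"})
CREATE (alice)-[:sends]->(m)-[:to]->(bob);

// Everything Alice has sent to Bob.
MATCH (:User {name: "Alice"})-[:sends]->(m:Message)-[:to]->(:User {name: "Bob"})
RETURN m.subject;
```

Because the message is a node, it can carry properties and be linked to further recipients, which a single (user)-[:emails]->(user) relationship can't express as easily.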


One overlooked trick is that nodes can have multiple labels.

So, if it's supercritical to have fast access to Color, for instance, you can do:

CREATE (phone:SmartPhone:IPhone:Red)

This creates a node with the labels SmartPhone, IPhone, and Red.

So if you want to match a red smartphone:

MATCH (n:SmartPhone:Red) RETURN n

Because Neo4j is very efficient with nodes of the same label (they're kept like a set), this is a quick intersection of two (or more) sets.

If color is an attribute, then you have to scan the attributes in an index. If it's on a hop, you have to do a hop to get to a Node and then examine the node.

Multiple labels can be very useful, but you should probably consider how you plan on filtering on them. With categories and types that is very obvious. When a label is used in place of a property value, though, it can be troublesome depending on how you want to use it.

Labels (and relationship types) can't be parameterized, so using labels for colors probably wouldn't be a good idea if the color to filter on is dynamic, such as from user input. Colors work a bit more naturally as properties, or even as :Color nodes with their own property that you create relationships to. With each color as its own label, you can't easily ask "what color is this smartphone?" in Cypher, since the concept of "multiple items which are all types of colors" is missing when each individual color is a separate label. There's no grouping. You would have to know all the color labels ahead of time and write somewhat more complex Cypher to find the intersection of the node's labels with the known color labels. It just isn't as easy as simply returning a property value, or expanding to a :Color node and returning its property.

Likewise for "what kind of smartphone is it", you would have to know all the phone types that exist ahead of time so you could do an intersection against the node's labels, since you have no way of knowing which labels on the node correspond to some kind of grouping (Red being a color, IPhone being a kind of smartphone).
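To make the difference concrete, here is a sketch of the three options. The list of color labels, the `color` property, the `HAS_COLOR` relationship type, and the `$color` parameter are all assumptions for illustration:

```cypher
// Color as a label: the set of color labels has to be hardcoded,
// then intersected with the labels actually on the node.
MATCH (n:SmartPhone)
RETURN n, [l IN labels(n) WHERE l IN ["Red", "Blue", "Green"]] AS colors;

// Color as a property: the value can come from a query parameter.
MATCH (n:SmartPhone)
WHERE n.color = $color
RETURN n;

// Color as a :Color node: also works with a parameter, and the
// color can be returned by expanding to the node.
MATCH (n:SmartPhone)-[:HAS_COLOR]->(c:Color {name: $color})
RETURN n, c.name;
```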

You can always combine the two approaches, however, keeping the labels for either manual querying or for cases where the labels you want are hardcoded and not dynamic, and keeping properties (or connected nodes like :Color) for queries that work better with that approach.


I also found this useful guideline to Node Label Design
