Heavy relationships vs Node hops. Which is better?

performance

(Naren V) #1

Assume two products P1 and P2 are related. Our end response should give back all the details about how these products are related in a specific country.

Let's say P2 is related to P1 only in certain countries and via a specific platform.

Countries: C1,C2,C3
Platforms: iOS, Android, Web

The current RDBMS model, translated as-is to a graph, looks like the structure below.

[image: graph]

The Detail node has around 10 properties, and each Country node contains some more properties plus the platform the product is sold on.
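As a rough sketch, the data behind this model could be created like this (the Detail label and all property names other than Name are placeholders for the real ones):

CREATE (p1:Product {Name: "iPhone"})
CREATE (p2:Product {Name: "P2"})
CREATE (d:Detail {detailProp1: "...", detailProp2: "..."})
CREATE (c1:Country {Name: "C1", platform: "iOS"})
CREATE (p1)-[:saleRelation]->(d)
CREATE (d)-[:relatedTo]->(p2)
CREATE (d)-[:soldAt]->(c1)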

MATCH (P1:Product)-[:saleRelation]->(D)-[:relatedTo]->(P2:Product),
      (D)-[:soldAt]->(C:Country)
WHERE P1.Name = "iPhone" AND C.Name = "C1"
RETURN P1, D, C, P2

In the above query, the number of node hops, the amount of filtering, and the db hits are all higher.

If the same data is modeled with direct relationships between products instead, using heavy relationships (many properties) as shown below, the node hops, filtering, and db hits are comparatively much lower.

[image: graph_newModeling]
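Again as a rough sketch, the heavy-relationship model would be created like this (the relationship properties are placeholders):

CREATE (p1:Product {Name: "iPhone"})
CREATE (p2:Product {Name: "P2"})
CREATE (p1)-[:soldAtC1 {platform: "iOS", detailProp1: "...", detailProp2: "..."}]->(p2)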

MATCH (P1:Product)-[relation:soldAtC1]->(P2:Product)
WHERE P1.Name = "iPhone"
RETURN P1, relation, P2

Surprisingly, when testing both approaches, the first approach gave the better query response time, which wasn't what I expected, since it incurs more db hits.

More importantly, the questions are:
When modeling, should relationships be light or heavy? Or should we introduce nodes just to hold data and connect them via more relationships?

Should more hops and more nodes be preferred over multiple relationships between the same nodes?

Is the number of db hits directly proportional to the runtime of a query?


(M. David Allen) #2

There isn't a good general answer to your question, but I'll give you a couple of principles that I use. Ultimately, the right model is one that balances a few different needs:

  • The model should be simple and close to how you think about your domain, for example when you're whiteboarding the problem. Models that are far away from how you think about your domain will be hard to query.
  • The model should be designed with some of your queries in mind. Models exist to help you answer questions about your data. It's exceedingly hard to come up with a model in isolation; you need to know what kinds of questions you want to ask. If you don't start with some idea of the questions you'll ask, it's unlikely you'll end up with a good/convenient model for asking the questions you later come up with.

So trying to give you some more specific answers - a thing for you to consider is that neo4j doesn't let you index relationship properties at present. So if you put in piles of relationships and lots of properties on those relationships, and then you later want to issue queries that do a lot of selection based on relationship property values, you're not going to be positioning yourself well. Nodes can have indexed properties, and often a better way is to have a set of nodes, cut that down drastically with filtering/indexing, and then traverse a tiny subset of your graph, rather than traversing a lot of edges and then filtering them.
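For example, under your intermediate-node model you can index the node properties you filter on, so the planner starts from a small, indexed set of nodes. A sketch using the Neo4j 3.x index syntax, with the labels and properties from your question:

CREATE INDEX ON :Product(Name);
CREATE INDEX ON :Country(Name);

// The planner can now seek P1 and C via the indexes and traverse
// only the small subgraph between them, rather than scanning and
// filtering lots of relationships.
MATCH (P1:Product {Name: "iPhone"})-[:saleRelation]->(D)-[:soldAt]->(C:Country {Name: "C1"})
MATCH (D)-[:relatedTo]->(P2:Product)
RETURN P1, D, C, P2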

Another thing to keep in mind is that an extra node in between shouldn't be much of a performance hit, because neo4j fundamentally makes traversals cheap (that's a good thing).

dbhits is in some ways proportional to the runtime of the query, yes. The more data your database has to consider to answer your query, the more work it has to do, and the slower it will be. This is why we use techniques like filtering & indexes -- to explicitly cut down on how much data needs to be considered to answer the query. In the absolute worst case, dbhits basically equals how much data you have in the database, and you're doing a full database scan for a query, which is how you get maximally bad performance.
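To see this concretely, you can prefix each of your two queries with PROFILE and compare the db hits and rows reported for each operator, for example:

PROFILE
MATCH (P1:Product)-[:saleRelation]->(D)-[:relatedTo]->(P2:Product),
      (D)-[:soldAt]->(C:Country)
WHERE P1.Name = "iPhone" AND C.Name = "C1"
RETURN P1, D, C, P2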

You can read more about what dbhits are here:

https://neo4j.com/docs/cypher-manual/current/execution-plans/#execution-plans-dbhits

Some other modeling things you might consider. In your "RDBMS model" you break out all of the individual 10 details into their own nodes. This is fine -- keep in mind with graphs though, it's usually a good practice to break categorical variables into their own nodes. Say you have a field "color" which can be red/blue/green; that's a categorical variable. Turn that into its own node and link it (don't make it a node property) so that you can navigate to other things that share that color.
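As a sketch of that refactoring (the Product label, color property, and relationship type here are all illustrative):

// Lift the categorical "color" property into its own node
MATCH (p:Product)
WHERE p.color IS NOT NULL
MERGE (c:Color {name: p.color})
MERGE (p)-[:has_color]->(c)
REMOVE p.color

Afterwards, products sharing a color are one traversal away through the shared Color node, instead of requiring a scan over a property.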

Sometimes you want grouping instance nodes too. In your graph you say an iPhone has "details". Well, there's an abstract iPhone, but there's also, say, a product instance (space grey 16gb iPhone 7, or whatever). So keep in mind you can do (:ProductClass)<-[:instance_of]-(:ProductInstance)-[:has_detail]->(:Color) or whatever. This allows the ProductClass to have certain details, while allowing the product instance to differ on those details, or add extras.
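A minimal sketch of that class/instance pattern (all names are illustrative, not a fixed schema):

CREATE (cls:ProductClass {Name: "iPhone 7"})
CREATE (inst:ProductInstance {storage: "16gb"})
CREATE (col:Color {name: "space grey"})
CREATE (inst)-[:instance_of]->(cls)
CREATE (inst)-[:has_detail]->(col)

Queries can then read a detail from the instance and fall back to the class when the instance doesn't override it.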


(Michael Black) #3

@david.allen gave a great response; this is just my two cents on how I start to model a new graph.

I follow the rule of thumb that nouns are nodes and verbs are relationships, and I actually say aloud, or write down in a sentence, what my domain model is. Personally, I tend to lean heavily on creating nodes, and I try to limit the properties I use. This is usually because, in a lot of my queries, I'm looking for shared commonalities between entities, which are easier to find when those commonalities are nodes and I'm looking for relationships between nodes.

When I model I also look for "lazy speech" in my model. A popular example is the email model. You could model (user)-[:emails]->(user), but "email" is not a real verb; it's lazy speech. You really send someone an email, and an email is a noun. So the model would be (user)-[:sends]->(message)-[:to]->(user). I look for those lazy speech patterns that could turn out to be a pitfall.
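A quick sketch of the two models side by side (labels and property names are illustrative):

// Lazy-speech model: the email itself has nowhere to live
CREATE (:User {name: "Alice"})-[:emails]->(:User {name: "Bob"})

// Promoting the message to a node gives it a home for subject, date, etc.
CREATE (a:User {name: "Alice"})
CREATE (b:User {name: "Bob"})
CREATE (m:Message {subject: "Hello"})
CREATE (a)-[:sends]->(m)
CREATE (m)-[:to]->(b)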

At #graphconnect there was an excellent presentation on how the query engine works, which has helped me think about tuning my model for the queries I will be executing.