Graph Data Modeling Question

neo4j specifications:

dbms.memory.heap.initial_size=24G
dbms.memory.heap.max_size=24G
dbms.memory.pagecache.size=8G

neo4j version: Community 4.2.0
desktop version: 1.3.11

Hello,

How can I improve on my data model to reduce the number of relationships?

Currently, my graph database contains a total of 300,000 nodes connected by about 550,000,000 relationships. After debating the graph data model with my colleagues for several weeks and performing numerous refactorings, I can't figure out a way to diminish the number of relationships between a subset of nodes on the part of my knowledge graph that is illustrated below.

I'm hoping that by reducing the number of relationships I can speed up my Cypher queries, which are beginning to take more than an hour to finish as they become more complex.

Here's a proxy example of my current use case:

In this example, here are the counts for each of the nodes and relationships:

Nodes
Person: 100,000 nodes with 5 properties
A_Law: 5,000 nodes with 4 properties
B_Law: 3,000 nodes with 4 properties
C_Law: 2,000 nodes with 4 properties
Business: 7,000 nodes with 3 properties

Relationships
A_Access: 500,000,000 relationships with 0 properties (every Person node is connected to every A_Law node, so 100,000 × 5,000 = 500,000,000)
B_Access: 2,000 relationships with 0 properties
C_Access: 3,000 relationships with 0 properties
A_Rel: 5,000 relationships with 2 properties
B_Rel: 3,000 relationships with 2 properties
C_Rel: 2,000 relationships with 2 properties

As you can see above, the problem here is the number of A_Access relationships, which has exploded as a result of the cross product between all of the Person nodes and all of the A_Law nodes. Our business domain requires that each unique Person node have one relationship to each unique A_Law node. This results in 500,000,000 relationships, and this is causing our Cypher queries to take more than an hour to finish.
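As a quick sanity check on the counts listed above, here is a small Python sketch of the arithmetic. The variable names are only for illustration; the numbers are the ones from the proxy example.

```python
# Node counts taken from the proxy example.
person_nodes = 100_000
a_law_nodes = 5_000

# Every Person is connected to every A_Law, so A_Access is the full cross product.
a_access = person_nodes * a_law_nodes

# The remaining relationship counts, as listed in the proxy example.
other_rels = {
    "B_Access": 2_000,
    "C_Access": 3_000,
    "A_Rel": 5_000,
    "B_Rel": 3_000,
    "C_Rel": 2_000,
}

total_rels = a_access + sum(other_rels.values())
a_access_share = a_access / total_rels

print(f"A_Access relationships: {a_access:,}")
print(f"All relationships in the proxy example: {total_rels:,}")
print(f"A_Access share of all relationships: {a_access_share:.4%}")
```

In other words, practically all of the relationships in this part of the graph come from that one relationship type.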

Another approach that I'll implement here is to add more indices to the properties of the nodes or relationships, but I would like to be careful when adding various indices. Here's a warning I read about adding too many indices: https://neo4j.com/blog/dark-side-neo4j-worst-practices/

I've tried profiling and explaining our queries and I see that there are billions of rows being accessed.

I also read and watched the posts and videos listed below multiple times, but I still don't know how else to improve on the data model:

I also read the entire O'Reilly book on Graph Databases, but I still don't know how to improve my data model.

Thanks for any ideas or suggestions.

-Tony

Hi, I may not yet fully understand this domain and context, but I have a few questions:

  1. What query are you trying to run here? The best way to design any data model depends on the business question you are trying to answer.

  2. From your business need, I understand you need a unique relationship for each node. That makes me think: there is a unique ID for each person, right? If so, it may be better to keep this as a property on the node itself instead of as relationships, which would remove all these millions of paths.

Based on my experience, I would suggest:

  1. Check where a property can be turned into a node, or a node into a property, based on your business question.

  2. When a dense node connects millions of relationships, the best approach is usually to split the relationships by month or category (the fan-out technique).
    For example:
    Orders-[:ORDER]->Product becomes Orders-[:ORDER_JAN|ORDER_FEB...]->(:Product)

Hope this gives you some ideas for redefining the model.

With Smiles,
Senthil Chidambaram

Hi Senthil,

Thank you for your advice. You're right that we have densely connected nodes on A_Law and Person, and quite frankly, I don't know what to do about that. These articles I found seem promising in helping with these super node issues. I'll give these a shot:

Still, we must have that A_Access connection there with all 500,000,000 relationships as this is giving my team important insights as to how the Person nodes can access the A_Law node. Otherwise, this Neo4j effort is going to fail because we can't have queries taking hours to finish.

Here are my answers:

For example, let's say we want to count the number of A_Law's being used by a person in the sales department. Then that query will be:

MATCH(p:Person {dept: 'sales'})-[:A_Access]->(a:A_Law)
RETURN COUNT(DISTINCT a.title)

This query alone takes about 20 minutes to finish. Additionally, this query forms part of a bigger query that we use to answer more complex business questions and this is causing the bigger queries to take more than an hour to finish.
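One thing that may be worth checking, assuming the proxy example is literal and every Person really is connected to every A_Law: in that case the A_Law nodes reachable from any non-empty group of people are always the full set of A_Law nodes, so the distinct count does not depend on the traversal at all. Here is a toy sketch in plain Python (made-up names and tiny counts, not Neo4j code) of that observation:

```python
from itertools import product

# Toy data: tiny, made-up stand-ins for Person and A_Law nodes.
people = [
    {"id": 1, "dept": "sales"},
    {"id": 2, "dept": "sales"},
    {"id": 3, "dept": "hr"},
]
a_laws = [
    {"id": 10, "title": "T1"},
    {"id": 11, "title": "T2"},
    {"id": 12, "title": "T2"},
]

# A_Access as a full cross product: every person is linked to every A_Law.
a_access = [(p["id"], a["id"]) for p, a in product(people, a_laws)]


def distinct_titles_via_traversal(dept):
    """Mimics COUNT(DISTINCT a.title) by walking every A_Access edge."""
    person_ids = {p["id"] for p in people if p["dept"] == dept}
    title_by_id = {a["id"]: a["title"] for a in a_laws}
    return len({title_by_id[a_id] for p_id, a_id in a_access if p_id in person_ids})


def distinct_titles_direct():
    """Counts distinct titles on the A_Law nodes alone, without touching A_Access."""
    return len({a["title"] for a in a_laws})


print(distinct_titles_via_traversal("sales"), distinct_titles_direct())
```

If the real data has people who are missing some A_Access relationships, this shortcut does not apply and the traversal is still needed.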

Yes, there are unique IDs for each person. We loaded this graph using the neo4j-admin import tool, which forced us to use unique IDs on the nodes.

Agreed. We discussed this internally at length to agree on the node entities and properties based on the business questions we wanted to answer. We iterated through this as well to arrive at a working data model. Queries on all of the other relationships outside of A_Access actually perform really well, within seconds.

This is a good point. I guess I could shuffle some of the properties from A_Rel into A_Access to further segregate the relationships that we need, but I still think we're going to end up with 500,000,000. It would also be useful if I could place an index on these relationship properties for faster look up but I don't think that's supported yet as shown in the docs or this thread:

Or do I need more memory on my Neo4j instance? I thought having 24G in heap would be more than plenty for a graph database this size, no?

Thanks

Hi @TonyOD27,
I really like the careful way you explained each and every point. Let me go through it in detail and follow up, but I think I have the core of it:

1. Query execution time is your concern, and you suspect the number of relationships on the super node may be the cause.

Here are 3 quick pointers to share:
[1]. Aggregations always carry some performance cost (graph vs. RDBMS). When you run COUNT(DISTINCT(node)), the database has to load all of the matched nodes into memory and needs more computation because of the DISTINCT.

[2]. Since you mentioned a heap size of 24 GB: we faced a similar issue (a blind mistake on our part) because we ignored the memory configuration when setting up our initial instance (on AWS), so I wanted to confirm whether it is the same on your side.

Please re-check your server's RAM. It should be greater than 24 GB.
There is a recommended formula:
Memory = Heap Size + Page Cache + 2-3 GB for OS

We set the heap size and page cache so that the sum of the two values equaled the server's RAM, so the OS had no memory left for its own I/O and processes, and even a simple MATCH query took far too long. Then we understood the importance of checking heap size + page cache + OS memory (at least 2-3 GB should be left for the OS).

For example, for an instance with 24 GB of RAM, a recommended configuration would be:
Initial heap size: 8 GB, max heap size: 12 GB (50% of server memory)
Page cache: 8 GB
That leaves 4 GB for the OS
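The rule of thumb above can be sketched as a small Python check. The function names are made up for illustration, and the 2 GB default reserve is simply the low end of the "2-3 GB for OS" figure mentioned above, not an official Neo4j setting.

```python
def os_headroom_gb(ram_gb, heap_max_gb, page_cache_gb):
    """Memory left for the OS once the max heap and page cache are reserved."""
    return ram_gb - (heap_max_gb + page_cache_gb)


def minimum_ram_gb(heap_max_gb, page_cache_gb, os_reserve_gb=2):
    """Smallest server RAM that satisfies: heap + page cache + OS reserve."""
    return heap_max_gb + page_cache_gb + os_reserve_gb


def leaves_room_for_os(ram_gb, heap_max_gb, page_cache_gb, os_reserve_gb=2):
    """True if the configuration leaves at least the OS reserve free."""
    return os_headroom_gb(ram_gb, heap_max_gb, page_cache_gb) >= os_reserve_gb


# The 24 GB server example: 12 GB max heap, 8 GB page cache.
print(os_headroom_gb(24, 12, 8))       # 4
print(leaves_room_for_os(24, 12, 8))   # True

# The settings from the original question: 24G heap, 8G page cache.
print(minimum_ram_gb(24, 8))           # 34
```

By this rule, the settings in the original question (24G heap plus 8G page cache) would need a machine with roughly 34 GB of RAM or more. The thread does not say how much RAM that machine actually has.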

I was trying to find the link to the actual memory estimation technique, but I forgot to bookmark it.
Please check this below anyway, though it might account for just 1% of the issue.

One more point from my own learning, since it may help you in other scenarios:
[3]. As we know, Neo4j stores data together with its relationships. I remember a library example: the books are sorted and placed in racks, and each rack is labeled, like 'Graph Books', 'DB Books', 'OS', 'Server', 'Programming', and so on. When a user wants a Neo4j book, he or she does not need to look from rack 1 to rack 10 but can go directly to 'Graph Books, rack 6' and take out the books related to graphs. This example helped me see the value of having multiple relationship names: having multiple relationship types actually helps fetch the data much more quickly than having just ONE relationship type. This is the principle behind fan-out.

Example: (:Customer)-[:ORDER]->(:Product) over the last 3 years would be a lot of relationships, but
I modeled the same thing in 2 ways based on my business need:

1. (:Customer:GuestUser)-[:ORDER]->(:Product)
Here, instead of searching the whole customer base, I use a label to get a subset of customers, so the query load stays balanced.
2. (:Customer:MonthlyCustomer)-[:ORDER_JAN_1WEEK]->(:Product)

Basically, these are subsets of customers aligned to some property or category, which helped the business get the context.

So this is what I tried, and it worked well for a Customer360 view. What I learned is that it is fine to have multiple relationship types between 2 nodes when there are millions of connections. Of course, each business scenario is different, and what we want to query is the key differentiator here.
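To make the fan-out idea concrete, here is a toy sketch in plain Python. It is not how Neo4j stores anything; dictionaries keyed by (customer, relationship type) just stand in for relationships grouped by type, and the sample orders are made up.

```python
from collections import defaultdict

# Made-up orders: (customer, product, month).
orders = [
    ("c1", "p1", "JAN"),
    ("c1", "p2", "FEB"),
    ("c1", "p3", "FEB"),
    ("c2", "p1", "JAN"),
]

# One relationship type: a single list of ORDER entries per customer.
single = defaultdict(list)
# Fan-out: one list per (customer, relationship type), e.g. ORDER_JAN, ORDER_FEB.
fanned = defaultdict(list)

for customer, product_id, month in orders:
    single[customer].append((product_id, month))
    fanned[(customer, f"ORDER_{month}")].append(product_id)


def feb_products_single(customer):
    """With one ORDER type, every order of the customer has to be inspected."""
    return [p for p, month in single.get(customer, []) if month == "FEB"]


def feb_products_fanned(customer):
    """With ORDER_FEB as its own type, only the February bucket is touched."""
    return list(fanned.get((customer, "ORDER_FEB"), []))


print(feb_products_single("c1"), feb_products_fanned("c1"))
```

Both functions return the same products; the difference is how many entries have to be looked at to get there, which is the point of splitting by month or category.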

Instead of 'dept' as a property on the Person node, would it be OK to have DEPARTMENT as a new node (Sales, HR, Finance, Operations)? Grouping by segment like this may work for you, if it is aligned with your business scope.

Hope it helps!

With Smiles,
Senthil Chidambaram

I am a rank beginner - but was wondering how long the query takes without the aggregate command? Also is there some way to get a table output - and pull the distinct values from there and use that in a subsequent query?

Hi,
in my experience, sometimes you can go faster by

  1. adding more nodes and relationships
  2. precomputing

Let's take Person and A_Law. You say you have 500 000 000 possible combinations.
Idea 1)
For each Person.department, create a single node, it means

MATCH(p:PersonByDept{dept: 'sales'})

matches one node.
For each A_Law.title, create a single node. It means

MATCH (a:A_LawByName {name: 'aname'})

matches one node.
Use these nodes to query what you need. So :

MATCH(p:PersonByDept{dept: 'sales'})-[:something]->(a)
RETURN COUNT(a)

Idea 2:
Precompute what you will query. Attach a node counter to whatever entity you need and keep a count of what you want to count in this node counter.
For instance create a node (x:Department) and attach a node (x:Department)<-[]-(s:Stats) and keep a property s.ALawTitlesCount = 6
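A minimal sketch of Idea 2 in plain Python, with a small class standing in for the (:Stats) node. All names here (DeptStats, add_access, a_law_titles_count) are hypothetical, and the sample data is made up; the point is only that the count is maintained at write time and read back without any traversal.

```python
from collections import defaultdict


class DeptStats:
    """Stand-in for a (:Stats) node attached to a (:Department) node."""

    def __init__(self):
        self._seen_titles = set()      # titles already counted for this department
        self.a_law_titles_count = 0    # the precomputed counter

    def record_access(self, title):
        # Only bump the counter the first time a title is seen.
        if title not in self._seen_titles:
            self._seen_titles.add(title)
            self.a_law_titles_count += 1


stats_by_dept = defaultdict(DeptStats)


def add_access(dept, a_law_title):
    """Called whenever an access is written; keeps the counter up to date."""
    stats_by_dept[dept].record_access(a_law_title)


for dept, title in [("sales", "T1"), ("sales", "T2"), ("sales", "T1"), ("hr", "T1")]:
    add_access(dept, title)

# Reading the answer is a single lookup, not a traversal.
print(stats_by_dept["sales"].a_law_titles_count)
```

The trade-off is that every write path has to keep the counter correct. This sketch does not handle deletions, which would need extra bookkeeping.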

Does it work ?

Decreasing the relationships by increasing the number of properties might make performance worse. It depends on your queries and the nature of your data.

For example, suppose a Person has a property "JobTitle", and there are different Law types, and Business types. Now if a lot of your queries are:

MATCH (p:Person {JobTitle:"Paralegal"})-[:B_Access]->(l:B_Law {Name:"PatentLaw"})-[:A_Rel]->(b:Business {Type:"HighTech"})

The more properties you have to access during the query, the slower it will get.

I'm making a conjecture as to the type of query you're trying to make... but it would be better to have the types represented as different Node Labels:

MATCH (p:Paralegal)-[:REL]->(l:PatentLaw)-->(b:BusinessHighTech)

This is because the number of Paralegal nodes is far smaller than the number of Person nodes, so Cypher has to scan a lot fewer nodes when scanning for a Paralegal than Person.

Similarly, there are a lot fewer PatentLaw nodes than all Law nodes, and a lot fewer BusinessHighTech nodes than all Business nodes.

I believe MATCHing based on a Property is slower. This is because all Nodes of a certain Label are already collected in a Set (behind the scenes in Neo4J), so all the work of winnowing down which Nodes to look at is already done!

If you try to MATCH Nodes by Property and the Properties are indexed, then Neo4J has to hit a B-Tree to collect all the nodes into a temporary set. If the Properties are not indexed, then Cypher has to do a linear search (expensive!) through all the Nodes of the more general Label type (e.g. Person).
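As a toy illustration of the unindexed case only, here is a plain Python sketch that contrasts a ready-made set per label with a linear scan over a property. It is a simplified model with made-up data, not a description of Neo4j's actual storage, and it says nothing about how an indexed property lookup would compare.

```python
# Toy model: 100,000 made-up Person records, 1 in 50 of them a Paralegal.
people = [
    {"id": i, "job_title": "Paralegal" if i % 50 == 0 else "Other"}
    for i in range(100_000)
]

# "Label" approach: membership is decided once, at write time, and kept as a set.
paralegal_label = {p["id"] for p in people if p["job_title"] == "Paralegal"}


def find_by_property_scan(title):
    """Unindexed property match: every Person record has to be inspected."""
    inspected = 0
    hits = []
    for person in people:
        inspected += 1
        if person["job_title"] == title:
            hits.append(person["id"])
    return hits, inspected


def find_by_label():
    """Label match: only the members of the ready-made set are visited."""
    return sorted(paralegal_label), len(paralegal_label)


scan_hits, scan_inspected = find_by_property_scan("Paralegal")
label_hits, label_inspected = find_by_label()
print(scan_inspected, label_inspected, scan_hits == label_hits)
```

Both approaches find the same nodes; the scan looks at every Person to get there, while the label set only contains the nodes of interest.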

One advantage of Neo4J is that a node can have multiple Labels (e.g. Types), so a Business could have two labels, HighTech and Retailer (e.g. Amazon), or a person could be Partner and Lawyer and Litigator, so you don't have to worry about painting yourself into a corner when choosing the Labels. That is, in a regular relational DB, the categories could be too fine-grained, resulting in a many-to-many relationship, which is a PITA.

Similarly, if you can split up the *_Law nodes into a finer set of types, you'd also be better off.

Since access must be done in memory, repeated queries are faster: once the data (the Set of Nodes of a Label type) has been loaded into memory by a previous query, subsequent queries can reuse it.

I recommend taking a subset of your data and experimenting with performance on that subset, so that you don't have to wait an hour to see whether a tweak makes an improvement.

If you gave us some sample queries you are trying to make, it would help us understand the nature of the problem better.

I hope that helps.

This is a great reply with lots of information to think about. It would be interesting to mock this data and try what was suggested above. Tangentially, I wonder if other products like DGraph and TigerGraph might be faster, and whether the languages that Neo4j is written in (Java, Scala) might be slower than DGraph (Go) and TigerGraph (C++).

It’s certainly worth trying to see which system performs better.

BUT remember, the data is probably stored on disk, which typically means the bottleneck is going to be I/O, not CPU (where C++ would have an advantage). In addition, with HotSpot, Java might not perform significantly worse than C++. It may be that more memory will have a greater impact on performance than anything else, as more memory means more data can be kept in memory (cached) rather than having to be read off of disk.

So, I don't think you can a priori claim one system will be better based solely on the language of implementation. There are too many factors that will affect performance, some of which won't be readily visible to you.

Neo4J has been around for a long time, so it might be better (and more stable) because it is a more mature product.

DGraph is claiming all kinds of performance capabilities here: Neo4j vs Dgraph - The numbers speak for themselves - Dgraph Blog but the implementation does look far more basic than Neo4j.

Note that the Blog is making a comparison with version 3 of Neo4J. It’s now on 4.2.1.

You’ll need to benchmark it on your data. Benchmarks created by a company to show off their product may or may not be valid when applying to your situation.

And then there is this URL. I’m not sure how accurate it is but it’s another data point:

I have been looking at Azure Cosmos/Gremlin but I’m not happy with it, as it’s been a bit painful to get up and running for me.

I’m not well versed in the Graph DB competition, so I’m hesitant to say for sure which is the best Graph DB out there.

It’s just from my long work experience, I’ve learned to take benchmarks with a grain of salt.

From the standpoint of tools - based on a cursory review - Neo4j looks the best so far.

I've been exploring TigerGraph.

It's been a bit rough... All the intuition that I've developed with Cypher and SQL has led me astray a number of times with TG.

They claim better performance (which I believe is true), but TG definitely has a steeper learning curve than with Neo4J, which I got up and running pretty easily.

I think the Cypher language is more intuitive too and the documentation is better. That is, I pretty much figured out most of what I needed with Cypher on my own (and I strongly believe that I could train semi-technical people to do their own Cypher queries), but I feel like a total noob with TG.