Hello everyone! I am very new to Neo4j and Cypher. I have been evaluating Neo4j for the past several weeks as a replacement for our existing DB, so that we can run more efficient queries for our use case. We have come up with a DB schema that fits our needs very well, and the data-fetching queries work very well. The problem I am facing is with inserting data into Neo4j.
MERGE queries are super slow, take a lot of CPU horsepower, and keep getting slower as the number of Nodes increases. With only a couple million Nodes, the MERGE queries take several minutes.
I have used MERGE everywhere because we do not want duplicate Nodes created for every request: we have similar data coming in from multiple sources and might get the same data multiple times.
I am using Python and the Neo4j Python library to interact. I have a lot of JSON data coming in via APIs, which is then filtered and normalized in Python before inserting into Neo4j.
Details of what I am doing:
- The main (Base) Node will be connected to all the data points directly. The data points come in different varieties, so they have to be different kinds of Nodes.
- There are a few Nodes (BaseConnectedNodes) that have more than a few properties and occur only once per Base Node.
- Then there are other Nodes (ConnectedNodes) that generally come in multiples and have very few properties. These Nodes are passed as lists and, using FOREACH or UNWIND, are created or matched using MERGE, then linked to the BaseNode.
// Base node: created on first sight, otherwise its properties are updated.
MERGE (b:BaseNode {k_id: $b.k_id, value: $b.value})
ON CREATE
  SET b = $b
ON MATCH
  SET b += $b
// One-per-base nodes: 'source' is appended to sources only if it is missing.
MERGE (b1:BaseConnectedNodeType1 {k_id: $b.k_id, value: $b1.value})
ON CREATE
  SET b1 = $b1
ON MATCH
  SET b1.sources = CASE WHEN NOT 'source' IN b1.sources THEN b1.sources + 'source' ELSE b1.sources END
MERGE (b)-[:baseConnectionType1]->(b1)
MERGE (b2:BaseConnectedNodeType2 {k_id: $b.k_id, value: $b2.value})
ON CREATE
  SET b2 = $b2
ON MATCH
  SET b2.value1 = $b2.value1,
      b2.value2 = $b2.value2,
      b2.value3 = $b2.value3,
      b2.sources = CASE WHEN NOT 'source' IN b2.sources THEN b2.sources + 'source' ELSE b2.sources END
MERGE (b)-[:baseConnectionType2]->(b2)
MERGE (b3:BaseConnectedNodeType3 {k_id: $b.k_id, value: $b3.value})
ON CREATE
  SET b3 = $b3
ON MATCH
  SET b3.value1 = $b3.value1,
      b3.value2 = $b3.value2,
      b3.sources = CASE WHEN NOT 'source' IN b3.sources THEN b3.sources + 'source' ELSE b3.sources END
MERGE (b)-[:baseConnectionType3]->(b3)
// List-valued nodes: one MERGE per list element, each linked to the base node.
FOREACH (x IN $x_list |
  MERGE (x1:ConnectedNodeType1 {
    k_id: $b.k_id,
    value: x
  })
  ON CREATE
    SET x1.sources = ['source']
  ON MATCH
    SET x1.sources = CASE WHEN NOT 'source' IN x1.sources THEN x1.sources + 'source' ELSE x1.sources END
  MERGE (b)-[:connectionType1]->(x1))
FOREACH (y IN $y_list |
  MERGE (y1:ConnectedNodeType2 {
    k_id: $b.k_id,
    value: y
  })
  ON CREATE
    SET y1.sources = ['source']
  ON MATCH
    SET y1.sources = CASE WHEN NOT 'source' IN y1.sources THEN y1.sources + 'source' ELSE y1.sources END
  MERGE (b)-[:connectionType2]->(y1))
FOREACH (z IN $z_list |
  MERGE (z1:ConnectedNodeType3 {
    k_id: $b.k_id,
    value: z
  })
  ON CREATE
    SET z1.sources = ['source']
  ON MATCH
    SET z1.sources = CASE WHEN NOT 'source' IN z1.sources THEN z1.sources + 'source' ELSE z1.sources END
  MERGE (b)-[:connectionType3]->(z1))
The above is a sample query; we have several such queries.
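For context, here is a minimal sketch of how a query like this can be called from Python. It is not our exact code: the record keys ("k_id", "value", "xs") and the helper names are hypothetical, the query is shortened to one FOREACH, and `session` is assumed to be a session from the Neo4j Python library.

```python
from typing import Any

# Shortened stand-in for the full query; only the parameter shape matters here.
QUERY = """
MERGE (b:BaseNode {k_id: $b.k_id, value: $b.value})
ON CREATE SET b = $b
ON MATCH SET b += $b
FOREACH (x IN $x_list |
  MERGE (x1:ConnectedNodeType1 {k_id: $b.k_id, value: x})
  MERGE (b)-[:connectionType1]->(x1))
"""


def dedupe(values: list[Any]) -> list[Any]:
    """Drop repeated values while keeping first-seen order."""
    return list(dict.fromkeys(values))


def build_params(record: dict[str, Any]) -> dict[str, Any]:
    """Turn one normalized API record into the parameter map the query expects.

    The input keys ("k_id", "value", "xs") are hypothetical.
    """
    return {
        "b": {"k_id": record["k_id"], "value": record["value"]},
        "x_list": dedupe(record.get("xs", [])),
    }


def insert_record(session: Any, record: dict[str, Any]) -> None:
    """Run the query once for one record and wait for it to finish."""
    session.run(QUERY, build_params(record)).consume()
```

Deduplicating the lists in Python keeps MERGE from being asked about the same value twice within one request; whether that matters for speed here is something I have not measured.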
I also posted this on the Neo4j Discord, and someone suggested that I eliminate Eager operations from my query. So I broke the query down into multiple smaller queries, making sure none of them had any Eager operations. But they are still very slow, actually slower than the original query.
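A minimal sketch of how the Eager check can be scripted from Python rather than read off the plan by eye. It assumes the plan comes back as a nested dict with "operatorType" and "children" keys, which is how I understand the Neo4j Python library exposes it on the result summary; the function names are hypothetical.

```python
from typing import Any


def find_eager_operators(plan: dict[str, Any]) -> list[str]:
    """Return the operator names in a plan tree that contain 'Eager'.

    Note that this also reports operators such as EagerAggregation,
    which are eager by nature and are not the plain Eager operator.
    """
    found = []
    stack = [plan]
    while stack:
        node = stack.pop()
        name = node.get("operatorType", "")
        if "Eager" in name:
            found.append(name)
        stack.extend(node.get("children", []))
    return found


def explain_plan(session: Any, query: str, params: dict[str, Any]) -> dict[str, Any]:
    """Ask the database for the plan of a query without executing it."""
    summary = session.run("EXPLAIN " + query, params).consume()
    return summary.plan
```

Usage would be something like `find_eager_operators(explain_plan(session, QUERY, params))`, where an empty list means no operator with "Eager" in its name was planned.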