MERGE Performance Extremely Slow in 4.x Due to Unexpectedly Large Node Scanning

jwillis · October 27, 2022, 5:44pm

I am finally upgrading from 3.5.14 to 4.4.11, and I'm running into some pretty significant performance issues in 4.4 when running the same queries with the same data as I have in 3.5. The issue seems to be with different query plans for MERGE statements, wherein 4.4 devises a query plan that begins with an extremely large AllNodesScan that I cannot debug for the life of me.

I have created around 28K (t:Trans) nodes and now need to create (p:Person) nodes from a subset of properties in those original Trans nodes. I have created indexes for prop1 and prop2, but I've also tried running with indexes on every property in Trans nodes and had the same issue, so I don't think a lack of indexing is to blame.

Here is the query I'm running:

MATCH(t:Trans)
WHERE t.prop2 IS NOT NULL
WITH t
MERGE (p:Person {cluster_id: t.prop2})
ON CREATE SET
    p.prop_a = t.prop3,
    p.prop_b = t.prop4,
    p.prop_c = t.prop5,
    p.prop_d = t.prop6,
    p.prop_e = t.prop7,
    p.prop_f = t.prop8,
    p.prop_g = t.prop9,
    p.prop_h = t.prop10,
    p.prop_i = t.prop11,
    p.prop_j = t.prop12,
    p.prop_k = t.prop13,
    p.prop_l = t.prop14,
    p.prop_m = t.prop15;

After running PROFILE on the above query in both versions, I can see that the execution plan in Neo4j 3.5 results in 1,007,473,423 total db hits, while the execution plan in Neo4j 4.4 results in 3,021,108,640 total db hits. This is with only 28,966 nodes in the graph — a much smaller subset of the actual data I need to work with, which is around 100K nodes and is large enough that this query is unusable. I'm not sure how such large db hits are happening with such a small data set — especially with the node scans for [t, p] which which happen before any p nodes even exist and there are only 28K t nodes.

Here is the execution plan in 3.5:

Here is the execution plan in 4.4:

Why are these query plans so different, and why is 4.4 choosing a query plan that performs several orders of magnitude slower than the one in 3.5? (For what it's worth, I've also tried this in 4.1 and have the same performance issue).

Cobra · October 28, 2022, 7:49am

Hello @jwillis

Do you have a unique constraint on Person nodes for the cluster_id property?

Moreover, you should use apoc.periodic.iterate() procedure to create these nodes.

Regards,
Cobra

jwillis · November 18, 2022, 9:07pm

Hi @Cobra, thanks so much for your reply. My apologies on the delay in responding. In terms of the performance issue that brought me here, the unique constraint (as opposed to the index I was using) seems to have solved the issue. I'm looking into apoc.periodic.iterate() as well, but I think we'll end up implementing that in another sprint since the constraint solves the most pressing problem.

Just for my own understanding of cypher query performance, do you know why the query plan without the unique constraint involves an AllNodesScan of billions of rows at the outset? There's got to be some kind of cardinality issue, but I don't know where it is or why it's happening since it happens on the first step when p doesn't yet exist.

Cobra · November 20, 2022, 11:10am

Hello, without a unique constraint, the database has to check if the node exists in the database before creating/updating it. With a unique constraint, the database doesn't have to do this check. In the end, without a unique constraint, the more nodes there are, the longer it will take because the label scan to be done will be bigger and bigger.

Topic		Replies	Views
Very slow merges neo4j Neo4j Graph Platform performance , cypher	12	531	April 4, 2022
Performance issue after upgrading neo4j v4 to v5 Aura & Cloud apoc , cypher	14	100	July 7, 2025
Improving the performance of a cypher query Neo4j Graph Platform	15	725	October 26, 2020
Merging on relationships slows down as super-nodes have greater degrees Cypher relationship , merge	3	289	May 9, 2022
Query Optimization for Parallel Connection in Neo4j Community Version Procedures & APOC apoc , cypher	3	404	May 3, 2021

July Summer Fun!

MERGE Performance Extremely Slow in 4.x Due to Unexpectedly Large Node Scanning

Related topics