cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! Site migration is underway. Phase 2: migrate recent content

MERGE Performance Extremely Slow in 4.x Due to Unexpectedly Large Node Scanning

jwillis
Node

I am finally upgrading from 3.5.14 to 4.4.11, and I'm running into some pretty significant performance issues in 4.4 when running the same queries with the same data as I have in 3.5. The issue seems to be with different query plans for MERGE statements, wherein 4.4 devises a query plan that begins with an extremely large AllNodesScan that I cannot debug for the life of me.

I have created around 28K (t:Trans) nodes and now need to create (p:Person) nodes from a subset of properties in those original Trans nodes. I have created indexes for prop1 and prop2, but I've also tried running with indexes on every property in Trans nodes and had the same issue, so I don't think a lack of indexing is to blame.

Here is the query I'm running:

MATCH(t:Trans)
WHERE t.prop2 IS NOT NULL
WITH t
MERGE (p:Person {cluster_id: t.prop2})
ON CREATE SET
    p.prop_a = t.prop3,
    p.prop_b = t.prop4,
    p.prop_c = t.prop5,
    p.prop_d = t.prop6,
    p.prop_e = t.prop7,
    p.prop_f = t.prop8,
    p.prop_g = t.prop9,
    p.prop_h = t.prop10,
    p.prop_i = t.prop11,
    p.prop_j = t.prop12,
    p.prop_k = t.prop13,
    p.prop_l = t.prop14,
    p.prop_m = t.prop15;

After running PROFILE on the above query in both versions, I can see that the execution plan in Neo4j 3.5 results in 1,007,473,423 total db hits, while the execution plan in Neo4j 4.4 results in 3,021,108,640 total db hits. This is with only 28,966 nodes in the graph — a much smaller subset of the actual data I need to work with, which is around 100K nodes and is large enough that this query is unusable. I'm not sure how such large db hits are happening with such a small data set — especially with the node scans for [t, p] which which happen before any p nodes even exist and there are only 28K t nodes.

Here is the execution plan in 3.5:

Screen Shot 2022-10-27 at 1.13.15 PM.png

 

Here is the execution plan in 4.4:

Screen Shot 2022-10-27 at 1.09.41 PM.png

 

Why are these query plans so different, and why is 4.4 choosing a query plan that performs several orders of magnitude slower than the one in 3.5? (For what it's worth, I've also tried this in 4.1 and have the same performance issue).

1 ACCEPTED SOLUTION

Cobra
Ninja
Ninja

Hello @jwillis 🙂

Do you have a unique constraint on Person nodes for the cluster_id property?

Moreover, you should use apoc.periodic.iterate() procedure to create these nodes.

Regards,
Cobra

View solution in original post

3 REPLIES 3

Cobra
Ninja
Ninja

Hello @jwillis 🙂

Do you have a unique constraint on Person nodes for the cluster_id property?

Moreover, you should use apoc.periodic.iterate() procedure to create these nodes.

Regards,
Cobra

jwillis
Node

Hi @Cobra, thanks so much for your reply. My apologies on the delay in responding. In terms of the performance issue that brought me here, the unique constraint (as opposed to the index I was using) seems to have solved the issue. I'm looking into apoc.periodic.iterate() as well, but I think we'll end up implementing that in another sprint since the constraint solves the most pressing problem.

Just for my own understanding of cypher query performance, do you know why the query plan without the unique constraint involves an AllNodesScan of billions of rows at the outset? There's got to be some kind of cardinality issue, but I don't know where it is or why it's happening since it happens on the first step when p doesn't yet exist.

Hello, without a unique constraint, the database has to check if the node exists in the database before creating/updating it. With a unique constraint, the database doesn't have to do this check. In the end, without a unique constraint, the more nodes there are, the longer it will take because the label scan to be done will be bigger and bigger.