Speeding up apoc.refactor.mergeNodes query

space_sam · April 27, 2023, 6:32pm

Hello,

I have the following cypher statement that I will be running hundreds of thousands of times:

UNWIND $list_of_lists_ids_batch as row
MATCH (c:Item)
WHERE id(c) IN row
WITH row, c ORDER BY size([(c)<-[:ITEM_IS_IN]-() | c]) DESC
WITH collect(DISTINCT c) as nodes, row
CALL apoc.refactor.mergeNodes(nodes, {properties:\'discard\'})
YIELD node
RETURN 'success'

The list_of_lists_ids_batch normally has 200-400 lists of node ids inside of it. I preprocess this list in Python so that every nested list is unique and there are no duplicated node ids inside of the list (i.e. no deadlock scenarios). Here is a small sample of a list:
[[388519361, 386406966], [318427627, 388519327], [263243312, 386153013, 384477766], [388516057, 388516046, 388500214]]

If I run this without the apoc.refactor.mergeNodes, it runs in about 0.2s with 5,664 db hits. When I add in the apoc.refactor.mergeNodes, it takes about 4 seconds.
I would think this should run fairly quickly (<1 second) so I'm at a bit of a loss of why it is taking so long each time. I have all the needed indexes and there's nothing obvious in the profile that I could optimize.

The only thing I can think of is that I see about 40-80 page faults in the log every time this query is run, and I can see pageFaults increasing over time in our monitoring page. We have 4TB of Ram on our machine, and we currently have 31G allocated to Heap and the rest allocated to pageCache size (current setting is 4053900m which I believe is larger than our current memory). We are running the enterprise version of neo4j.

Misc Items:
Python driver version: 5.7.0
Python version: 3.10
Neo4j Version: 5.4.0
Apoc Version: 5.4.0

Any help or advice is greatly appreciated.

space_sam · April 28, 2023, 1:26am

I've done a lot more digging into this today, and I'm fairly convinced that the apoc.refactor.mergeNodes call is what's taking the most time. I also modified the last little bit to return COUNT(*) instead of returning success. I profiled my query inside of python and have the following query execution plan:

Result available after 0
Result consumed after: 3803
Planner COST

Runtime PIPELINED

Runtime version 5.4

Batch size 128

Total database accesses: 968 + ?, total allocated memory: 107816
I feel pretty stuck at this point since there's nothing I can really do about a slow apoc procedure...

Topic		Replies	Views
Improve performance of apoc.refactor.mergeNodes Conferences, Meetups, & Events migrated	6	157	December 21, 2022
Performance of apoc.refactor.mergeNodes on Windows vs Linux Procedures & APOC	4	1219	October 18, 2018
Performance Issues Merging Nodes Cypher apoc , performance , cypher	3	348	March 13, 2022
Multiple Node Merge Speed Question Procedures & APOC apoc , performance , cypher	10	527	March 29, 2021
Merging nodes on multiple fields is very slow Cypher apoc , performance	8	375	March 9, 2023

Get Certified in June!

Speeding up apoc.refactor.mergeNodes query

Related topics