Hello,
I have the following Cypher statement that I will be running hundreds of thousands of times:
```cypher
UNWIND $list_of_lists_ids_batch AS row
MATCH (c:Item)
WHERE id(c) IN row
WITH row, c ORDER BY size([(c)<-[:ITEM_IS_IN]-() | c]) DESC
WITH collect(DISTINCT c) AS nodes, row
CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard'})
YIELD node
RETURN 'success'
```
`list_of_lists_ids_batch` normally contains 200-400 lists of node ids. I preprocess this list in Python so that every nested list is unique and no node id appears more than once anywhere in the batch (i.e., no deadlock scenarios). Here is a small sample:
```
[[388519361, 386406966], [318427627, 388519327], [263243312, 386153013, 384477766], [388516057, 388516046, 388500214]]
```
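For reference, the preprocessing is roughly this (a simplified sketch; `dedupe_batch` is just an illustrative name, and the real code builds the id groups elsewhere):

```python
def dedupe_batch(list_of_lists):
    """Drop ids already seen so no node id appears in more than one group."""
    seen = set()
    batch = []
    for group in list_of_lists:
        # dict.fromkeys dedupes within a group while preserving order
        unique = [node_id for node_id in dict.fromkeys(group) if node_id not in seen]
        seen.update(unique)
        if len(unique) > 1:  # merging only makes sense with 2+ nodes
            batch.append(unique)
    return batch
```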
If I run this without the `apoc.refactor.mergeNodes` call, it completes in about 0.2 s with 5,664 db hits. With `apoc.refactor.mergeNodes` included, it takes about 4 seconds.
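For context, each batch is submitted from Python roughly like this (URI, credentials, and database name are placeholders):

```python
from neo4j import GraphDatabase

MERGE_QUERY = """
UNWIND $list_of_lists_ids_batch AS row
MATCH (c:Item)
WHERE id(c) IN row
WITH row, c ORDER BY size([(c)<-[:ITEM_IS_IN]-() | c]) DESC
WITH collect(DISTINCT c) AS nodes, row
CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard'})
YIELD node
RETURN 'success'
"""

batch = [[388519361, 386406966], [318427627, 388519327]]  # sample ids

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

with driver.session(database="neo4j") as session:
    # execute_write runs in a managed transaction and retries on transient errors
    session.execute_write(
        lambda tx: tx.run(MERGE_QUERY, list_of_lists_ids_batch=batch).consume()
    )

driver.close()
```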
I would expect this to run fairly quickly (under 1 second), so I'm at a bit of a loss as to why it takes so long each time. I have all the needed indexes, and there's nothing obvious in the profile that I could optimize.
The only thing I can think of is that I see about 40-80 page faults in the log every time this query runs, and I can see pageFaults increasing over time on our monitoring page. The machine has 4 TB of RAM; we currently allocate 31 GB to the heap and the rest to the page cache (the current setting is 4053900m, which I believe is larger than our current memory). We are running Neo4j Enterprise Edition.
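For reference, the relevant memory lines in our neo4j.conf look roughly like this (Neo4j 5.x setting names; I'm assuming heap initial and max sizes are both 31g):

```
server.memory.heap.initial_size=31g
server.memory.heap.max_size=31g
server.memory.pagecache.size=4053900m
```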
Misc Items:
- Python driver version: 5.7.0
- Python version: 3.10
- Neo4j version: 5.4.0
- APOC version: 5.4.0
Any help or advice is greatly appreciated.