The following is the Cypher I used to merge duplicate nodes:
###################
MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes)
YIELD node
RETURN node
######################
Question: How can I run the above query faster? I tried the following:
###################
CALL apoc.periodic.iterate(
  'MATCH (n:Process) RETURN n.pid AS repeatpid',
  'MATCH (n:Process {pid: repeatpid})
   WITH repeatpid, collect(n) AS nodes
   WHERE size(nodes) > 1
   CALL apoc.refactor.mergeNodes(nodes) YIELD node
   RETURN node',
  {batchSize: 10000, parallel: true}
) YIELD total
###################
Although it works, it still takes a long time (180,000 nodes in total, on a machine with 500 GB of RAM and a 44-core/88-thread CPU).
Do you need to return the whole node, or anything at all? If not, try removing the RETURN clause. If Cypher complains that you can't end with a CALL without returning anything, return a constant or a limited set of node properties.
You could wrap the apoc procedure in a 'CALL { ... } IN TRANSACTIONS' subquery, importing 'nodes' with a 'WITH' clause. This would batch the updates.
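A sketch of that approach, assuming the same User/user schema as the first query (the batch size is illustrative):

###################
MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  // Merge each group of duplicates; return a count rather than the node
  CALL apoc.refactor.mergeNodes(nodes) YIELD node
  RETURN count(node) AS merged
} IN TRANSACTIONS OF 10000 ROWS
RETURN sum(merged) AS groupsMerged
###################

Note that this statement must run in an implicit transaction.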
In your implementation using 'apoc.periodic.iterate', you are matching twice to get the same nodes. I would suggest the first (driving) query create the collections and return them; the second query then calls the apoc procedure on each collection produced by the first. This would be similar to using 'CALL subquery'.
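Based on that suggestion, the iterate call could be restructured so the driving query builds each duplicate group once (a sketch reusing the Process/pid names from the question):

###################
CALL apoc.periodic.iterate(
  // Driving query: collect each group of duplicates a single time
  'MATCH (n:Process)
   WITH n.pid AS repeatpid, collect(n) AS nodes
   WHERE size(nodes) > 1
   RETURN nodes',
  // Per-batch query: merge each group; return a constant, not the node
  'CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN 1',
  {batchSize: 10000, parallel: false}
) YIELD total
RETURN total
###################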
You should not need the CALL subquery here; I suggested 'CALL { } IN TRANSACTIONS' as an alternative to apoc.periodic.iterate.
I assume the nodes you are merging have relationships, which will be merged too. As such, you may get record-locking contention, so try not running it in parallel. Also, try increasing the batch size; you could try 10,000, and decrease it if you experience memory issues.
###################
  'MATCH (n:Process {pid: repeatpid})
   WITH repeatpid, collect(n) AS nodes
   WHERE size(nodes) > 1
   CALL apoc.refactor.mergeNodes(nodes) YIELD node
   RETURN 5',
  {batchSize: 10000, parallel: false}
) YIELD total
###################
It took 9 minutes and almost 50 minutes for (A) and (B), respectively. Is there any way to add a batch size to method (A), or anything else that could improve the performance further? In fact, my database has about 1 billion nodes that need to be merged...
How do you plan on running this?
'CALL { } IN TRANSACTIONS' only works with implicit transactions. This requires prepending ':auto' when executing the query in the Neo4j Browser.
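For example, in the Neo4j Browser the whole statement would be prefixed with ':auto' (a minimal sketch reusing the User query from the top of the thread):

###################
:auto
MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes) YIELD node
  RETURN count(node) AS merged
} IN TRANSACTIONS
RETURN sum(merged)
###################

Drivers and cypher-shell similarly need the statement to run as an implicit (auto-commit) transaction rather than inside an explicit one.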