Improve performance of apoc.refactor.mergeNodes

The following is the Cypher that I used to merge the duplicate nodes:

###################

MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes)
YIELD node
RETURN node

######################

Question: How can I run the above query faster? I tried the following:

CALL apoc.periodic.iterate(
  'MATCH (n:Process) RETURN n.pid AS repeatpid',
  'MATCH (n:Process {pid: repeatpid}) WITH repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN node',
  {batchSize: 10000, parallel: true}
) YIELD total

Although it works, it still takes a lot of time (180,000 nodes in total, on a machine with 500 GB RAM and a 44-core / 88-thread CPU).

Thanks.

Do you need to return the whole node or anything for that matter? If not, try removing the return statement. If it complains you can’t end with a call without returning anything, return a constant or a limited number of node properties.
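
For example, something along these lines (just a sketch, reusing your original User query; count(node) stands in for whatever small value you decide to return):

MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes)
YIELD node
// Return a single aggregate instead of streaming every merged node back
RETURN count(node) AS merged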

You could wrap the apoc procedure in a ‘call subquery in transaction’ clause, importing ‘nodes’ using ‘with’. This would batch the updates.
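
Roughly along these lines (a sketch only; the 10,000-row batch size is just a placeholder to tune):

MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  // Import the collection into the subquery, then merge it
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes)
  YIELD node
  RETURN count(node) AS merged
} IN TRANSACTIONS OF 10000 ROWS
RETURN sum(merged) AS totalMerged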

In your implementation using ‘apoc.periodic.iterate’, you are matching twice to get the same nodes. I would suggest the first query create the collections and return them. The second query calls the apoc method for each collection of nodes created in the first query. This would be similar to using ‘call subquery’.
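
Something like this (a sketch, reusing your Process/pid naming and your batch settings):

CALL apoc.periodic.iterate(
  // Driving statement: build each collection of duplicate nodes exactly once
  'MATCH (n:Process) WITH n.pid AS repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 RETURN nodes',
  // Batched statement: merge each collection it receives
  'CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN count(node)',
  {batchSize: 10000, parallel: true}
) YIELD total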

Thanks, the "call subquery" and "remove return" suggestions work. But for the last one (your suggestion to build the collections in the first query), I tried the following:

CALL {
  CALL apoc.periodic.iterate(
    'MATCH (n:Process) WITH n.pid AS repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 RETURN nodes',
    'CALL apoc.refactor.mergeNodes(nodes) yield node return 5',
    {batchSize: 10, parallel: true}
  ) yield total
}

It just keeps running and never reaches the end. Could you give me a hint as to where I went wrong? Thanks.

You should not need the call subquery. I suggested using ‘call subquery with transactions’ as an alternative to apoc.periodic.iterate.

I assume the nodes you are merging have relationships, which will be merged too. As such, you may get record locking contention. Try not running it in parallel. Also, try increasing the batch size. You could try 10,000. Decrease it if you experience memory issues.
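
For example, the same iterate call as above with those settings (just a sketch; tune batchSize to your memory):

CALL apoc.periodic.iterate(
  'MATCH (n:Process) WITH n.pid AS repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 RETURN nodes',
  'CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN count(node)',
  // Sequential to avoid lock contention on shared relationships; lower batchSize if memory becomes an issue
  {batchSize: 10000, parallel: false}
) YIELD total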

Excuse me, now I have 483,000 nodes (all labeled "process") with 2,950,000 relationships (all of type "fork"). I tried the following:

(A): Call subquery with transactions

CALL {
  MATCH (n:Process)
  WITH n.pid AS repeatpid, collect(n) AS nodes
  WHERE size(nodes) > 1
  CALL {
    WITH nodes
    CALL apoc.refactor.mergeNodes(nodes, {properties:'combine'})
    YIELD node
    RETURN 5
  }
}

(B): apoc.periodic.iterate (no parallel)

CALL apoc.periodic.iterate(
  'MATCH (n:Process) RETURN n.pid AS repeatpid',
  'MATCH (n:Process {pid: repeatpid}) WITH repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 CALL apoc.refactor.mergeNodes(nodes) yield node RETURN 5',
  {batchSize: 10000, parallel: false}
) yield total

It took 9 minutes for (A) and almost 50 minutes for (B). Is it possible to add a batch size to method (A), or is there anything else that could improve the performance further? In fact, my real database will have about 1 billion nodes that need to be merged...

Thanks a lot.

How do you plan on running this?
‘Call {} in transactions’ only works with implicit transactions. This requires prepending ‘:auto’ when executing in the browser.

:auto
MATCH (n:Process)
WITH n.pid AS repeatpid, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes, {properties:'combine'})
  YIELD node
  RETURN 5
} IN TRANSACTIONS OF 10000 ROWS

Can you remove the ‘return’, or both the ‘yield’ and ‘return’, or does it complain that neither is allowed?

https://neo4j.com/docs/cypher-manual/current/clauses/call-subquery/#_batching

How do you have so many duplicates?

I tried what you showed, i.e.

:auto MATCH (n:Process)
WITH n.pid AS repeatpid, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes, {properties:'combine'})
  YIELD node
} IN TRANSACTIONS OF 10000 ROWS

To my surprise, it took about 30 minutes, which is slower than without "in transactions of 10000 rows" (9 minutes). Why isn't batching faster...?

By the way, may I ask what the difference is, and in which situations I should choose apoc.periodic.iterate versus call subquery with transactions?

Moreover, the duplicates are there because I'm analyzing security log data, which contains lots of repeated information (like user, internet name, computer...).

I appreciate it.