Improve performance of apoc.refactor.mergeNodes

The following is the Cypher that I used to merge the duplicate nodes:

###################

MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes)
YIELD node
RETURN node

######################

Question: How can I run the above query faster? I tried the following:

CALL apoc.periodic.iterate(
  'MATCH (n:Process) RETURN n.pid AS repeatpid',
  'MATCH (n:Process {pid: repeatpid}) WITH repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN node',
  {batchSize: 10000, parallel: true}
) YIELD total

Although it works, it still takes a lot of time (180,000 nodes in total, on a machine with 500 GB RAM and a 44-core / 88-thread CPU).

Thanks.

Do you need to return the whole node or anything for that matter? If not, try removing the return statement. If it complains you can’t end with a call without returning anything, return a constant or a limited number of node properties.
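
For example, something along these lines (just a sketch, reusing your original User query; count(node) stands in for whatever small value you decide to return):

MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes)
YIELD node
// Return a single aggregate instead of streaming every merged node back
RETURN count(node) AS merged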

You could wrap the apoc procedure in a ‘call subquery in transaction’ clause, importing ‘nodes’ using ‘with’. This would batch the updates.
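
Roughly along these lines (a sketch only; the 10,000-row batch size is just a placeholder to tune):

MATCH (n:User)
WITH n.user AS repeatuser, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  // Import the collection into the subquery, then merge it
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes)
  YIELD node
  RETURN count(node) AS merged
} IN TRANSACTIONS OF 10000 ROWS
RETURN sum(merged) AS totalMerged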

In your implementation using ‘apoc.periodic.iterate’, you are matching twice to get the same nodes. I would suggest the first query create the collections and return them. The second query calls the apoc method for each collection of nodes created in the first query. This would be similar to using ‘call subquery’.
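
Something like this (a sketch, reusing your Process/pid naming and your batch settings):

CALL apoc.periodic.iterate(
  // Driving statement: build each collection of duplicate nodes exactly once
  'MATCH (n:Process) WITH n.pid AS repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 RETURN nodes',
  // Batched statement: merge each collection it receives
  'CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN count(node)',
  {batchSize: 10000, parallel: true}
) YIELD total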

Thanks, the "call subquery" and "remove return" suggestions work. But for the last one (your suggestion to build the collections in the first query), I tried the following:

CALL {
  CALL apoc.periodic.iterate(
    'MATCH (n:Process) WITH n.pid AS repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 RETURN nodes',
    'CALL apoc.refactor.mergeNodes(nodes) yield node return 5',
    {batchSize: 10, parallel: true}
  ) yield total
}

It just keeps running and never reaches the end. Could you give me a hint as to where I went wrong? Thanks.

You should not need the call subquery. I suggested using ‘call subquery with transactions’ as an alternative to apoc.periodic.iterate.

I assume the nodes you are merging have relationships, which will be merged too. As such, you may get record locking contention. Try not running it in parallel. Also, try increasing the batch size. You could try 10,000. Decrease it if you experience memory issues.
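
For example, the same iterate call as above with those settings (just a sketch; tune batchSize to your memory):

CALL apoc.periodic.iterate(
  'MATCH (n:Process) WITH n.pid AS repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 RETURN nodes',
  'CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN count(node)',
  // Sequential to avoid lock contention on shared relationships; lower batchSize if memory becomes an issue
  {batchSize: 10000, parallel: false}
) YIELD total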

Excuse me, now I have 483,000 nodes (all labeled "process") with 2,950,000 relationships (all of type "fork"). I tried the following:

(A): Call subquery with transactions

CALL {
  MATCH (n:Process)
  WITH n.pid AS repeatpid, collect(n) AS nodes
  WHERE size(nodes) > 1
  CALL {
    WITH nodes
    CALL apoc.refactor.mergeNodes(nodes, {properties:'combine'})
    YIELD node
    RETURN 5
  }
}

(B): apoc.periodic.iterate (no parallel)

CALL apoc.periodic.iterate(
  'MATCH (n:Process) RETURN n.pid AS repeatpid',
  'MATCH (n:Process {pid: repeatpid}) WITH repeatpid, collect(n) AS nodes WHERE size(nodes) > 1 CALL apoc.refactor.mergeNodes(nodes) yield node RETURN 5',
  {batchSize: 10000, parallel: false}
) yield total

It took 9 minutes for (A) and almost 50 minutes for (B). Is it possible to add a batch size to method (A), or is there anything else that could improve the performance further? In fact, my real database will have about 1 billion nodes that need to be merged...

Thanks a lot.

How do you plan on running this?
‘Call {} in transactions’ only works with implicit transactions. This requires prepending ‘:auto’ when executing in the browser.

:auto
MATCH (n:Process)
WITH n.pid AS repeatpid, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes, {properties:'combine'})
  YIELD node
  RETURN 5
} IN TRANSACTIONS OF 10000 ROWS

Can you remove the ‘return’, or both the ‘yield’ and ‘return’, or does it complain that neither is allowed?

https://neo4j.com/docs/cypher-manual/current/clauses/call-subquery/#_batching

How do you have so many duplicates?

I tried what you showed, i.e.

:auto MATCH (n:Process)
WITH n.pid AS repeatpid, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes, {properties:'combine'})
  YIELD node
} IN TRANSACTIONS OF 10000 ROWS

To my surprise, it took about 30 minutes, which is slower than without "in transactions of 10000 rows" (9 minutes). Why isn't batching faster...?

By the way, may I ask what the difference is, and in which situations I should choose apoc.periodic.iterate versus call subquery with transactions?

Moreover, the duplicates are there because I'm analyzing security log data, which contains lots of repeated information (like user, internet name, computer...).

I appreciate it.