Merging nodes on multiple fields is very slow

I have a database with 8 million nodes, with initial and max heap set to 5 GB and page cache to 7 GB. I am trying to merge duplicate nodes based on some fields:
CALL apoc.periodic.iterate(
"MATCH (a1:Person)
 WITH a1.name AS name, a1.registryNumber AS registryNumber, a1.registryKadaa AS kadaa, a1.registryMouhafaza AS moh, a1.registryTown AS town, collect(a1) AS nodes
 WHERE size(nodes) > 1
 RETURN nodes",
"CALL apoc.refactor.mergeNodes(nodes, {properties: 'combine', mergeRels: true}) YIELD node SET node.id = randomUUID()",
{batchSize: 5000});

name: String
registryNumber: Int
registryKadaa: String
registryMouhafaza: String
registryTown: String

I ran a query to get the count of the duplicate nodes and it is 731,804.
The merge takes several hours and keeps running; it never completes.
Explain plan

@mbourjeily

Doesn't this query return every :Person node, and doesn't each record report size(nodes) = 1? The collect(a1) is effectively aggregating on a1, and each a1 is unique.
Are we sure this is the correct query?


Well, your query is retrieving all 8 million nodes and then grouping them into lists of nodes with the same values for five properties. This may require a lot of memory, depending on how the aggregation is performed by Neo4j.

Anyway, you may want to batch the work manually instead. You can figure out the range of registryNumber and break that range into smaller, equal-size buckets, then execute the query for each individual bucket of registryNumber values. It will also help in this case to have an index on the property so the match is fast. You just have to add a range condition to your match to limit the results to the specific bucket.
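For the index part, a minimal sketch (the index name here is just illustrative; this is the Neo4j 4.x+ syntax):

CREATE INDEX person_registry_number IF NOT EXISTS
FOR (p:Person) ON (p.registryNumber);

With this in place, a range predicate on a1.registryNumber in the MATCH can be served by an index seek instead of a full label scan.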


Hi. Yes, I have tested it with some custom filtering; it returns only the Person nodes that have the same values.

Thank you. Is there a way to batch it automatically? I am trying to get the distinct combinations of the unique fields (each combination does not seem to have more than 10,000 records) as a list, then unwind the list and, for each row, do a match and find the duplicates, but it's still not completing:

MATCH (p1:Person)
WITH p1.registryMouhafaza AS moh, p1.registryKadaa AS kad, p1.registryTown AS town, count(*) AS cnt
WITH collect({moh: moh, kad: kad, town: town, cnt: cnt}) AS list
UNWIND list AS row
MATCH (a1:Person)
WHERE a1.registryMouhafaza = row.moh AND a1.registryKadaa = row.kad AND a1.registryTown = row.town
WITH a1.name AS name, a1.registryNumber AS registryNumber, a1.registryKadaa AS kadaa, a1.registryMouhafaza AS moh, a1.registryTown AS town, collect(a1) AS nodes
WHERE size(nodes) > 1
RETURN count(nodes)

I am suggesting you do something like this, where you manually execute the query through the full range of registryNumber values. It will greatly help to have an index on registryNumber as well.

CALL apoc.periodic.iterate(
"MATCH (a1:Person)
 WHERE $minRegNum <= a1.registryNumber <= $maxRegNum
 WITH a1.name AS name, a1.registryNumber AS registryNumber, a1.registryKadaa AS kadaa, a1.registryMouhafaza AS moh, a1.registryTown AS town, collect(a1) AS nodes
 WHERE size(nodes) > 1
 RETURN nodes",
"CALL apoc.refactor.mergeNodes(nodes, {properties: 'combine', mergeRels: true}) YIELD node SET node.id = randomUUID()",
{batchSize: 10000, params: {minRegNum: 0, maxRegNum: 100000}});

(Variables from an outer WITH are not visible inside the quoted queries of apoc.periodic.iterate, so the range bounds are passed in through the params option and referenced as $minRegNum and $maxRegNum.)

Thanks again for your reply. My problem is that our range of registryNumber is very low, between 0 and 150, with most Person nodes concentrated between 0 and 50, so we would still end up with a large number of nodes for each registryNumber. That's why I was trying the approach with combinations, each of which has only several thousand nodes.

Yeah, that will not help. Can you partition on any of the other fields you are grouping on?

Hi Gary,
Yes, on registryKadaa, registryTown, and registryMouhafaza, but I don't have explicit values; they need to be fetched using a match. That's what I was trying to do in my previous comment: getting a list of unique combinations, unwinding it, and working each batch for a unique combination.
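One way to wire that idea up without materializing the whole list yourself (a sketch only, reusing the property names from this thread and untested against your data): let apoc.periodic.iterate fetch the distinct combinations in its first query, and do the per-combination grouping and merge in its second, so each combination becomes its own unit of work:

CALL apoc.periodic.iterate(
"MATCH (p:Person)
 RETURN DISTINCT p.registryMouhafaza AS moh, p.registryKadaa AS kad, p.registryTown AS town",
"MATCH (a1:Person)
 WHERE a1.registryMouhafaza = moh AND a1.registryKadaa = kad AND a1.registryTown = town
 WITH a1.name AS name, a1.registryNumber AS registryNumber, collect(a1) AS nodes
 WHERE size(nodes) > 1
 CALL apoc.refactor.mergeNodes(nodes, {properties: 'combine', mergeRels: true}) YIELD node
 SET node.id = randomUUID()",
{batchSize: 1, parallel: false});

With batchSize: 1, each combination is committed in its own transaction, so memory stays bounded by the size of one group (a few thousand nodes, per your numbers). Indexes on the three partition properties would speed up each inner MATCH.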