I have a database with 8 million Nodes. Max and Initial Heap 5G. PageCache 7G. I am trying to merge duplicate nodes based on some fields
CALL apoc.periodic.iterate(
"match (a1:Person) with a1.name as name, a1.registryNumber as registryNumber, a1.registryKadaa as kadaa,a1.registryMouhafaza as moh,a1.registryTown as town, collect(a1) as nodes where size(nodes)>1 return nodes",
"CALL apoc.refactor.mergeNodes(nodes,{ properties:'combine', mergeRels:true}) YIELD node set node.id=randomUUID()",
{batchSize:5000});
doesnt this query return every :Person node and doesnt each record report a size(nodes)=1. the collect(a1) is effectively aggregating on a1 and each a1 is unqiue.
Are we sure this is the correct query?
Well, your query is retrieving all 8 million nodes and then grouping them into lists of nodes with the same value for five properties. This may require a lot of memory, depending how the aggregation is performed by neo4j.
Anyway, you may want to manually batch the work instead. You can figure out the range of the registeryNumber and break the ranging into smaller equal size buckets, so you can execute the query for each individual bucket of registryIdentifer values. It will also help for this case to have an index on the property so the match is fast. You just have to add a range condition to your match to limit the results to the specific batch.
Thank you. Is there a way to batch it automatically. I am trying to get a distinct combination of the unique fields (each combination does not seem to have more than 10 000 record) ,as a list then unwind the list and for each row, do a match and find duplicates, but it's still not completing
match(p1:Person) with p1.registryMouhafaza as moh ,p1.registryKadaa as kad,p1.registryTown as town, count(*) as cnt
with collect({moh:moh,kad:kad,town:town,cnt:cnt}) as list with list unwind(list) as row match (a1:Person) where a1.registryMouhafaza=row.moh and a1.registryKadaa=row.kad and a1.registryTown=row.town
with a1.name as name, a1.registryNumber as registryNumber, a1.registryKadaa as kadaa,a1.registryMouhafaza as moh,a1.registryTown as town,
collect(a1) as nodes where size(nodes)>1 return count(nodes)
I am suggesting you do something like this, where you manually execute the query through the full range of registryNumbers. It will greatly help if you had an index on registryNumber as well.
WITH 0 as minRegNum, 100000 as maxRegNum
CALL apoc.periodic.iterate(
"match (a1:Person)
where minRegNum <= a1.registryNumber <= maxRegNum
with a1.name as name, a1.registryNumber as registryNumber, a1.registryKadaa as kadaa, a1.registryMouhafaza as moh, a1.registryTown as town, collect(a1) as nodes
where size(nodes)>1
return nodes",
"CALL apoc.refactor.mergeNodes(nodes,{ properties:'combine', mergeRels:true}) YIELD node set node.id=randomUUID()",
{batchSize:10000});
Thanks again for your reply. My problem is that our range of registryNumber is very low, between 0 and 150 with most Person concentrated between 0 and 50, so we would still end up with large number for each registryNumber. That's why i was trying the approach of the combination, which have only several thousands for each combination.
Hi Gary,
Yes on the registryKadaa, registryTown, RegistryMouhafaza, but i dont have explicit values, they need to be fetched using match. That's what i was trying to do in my previous comment, getting a list of unique combinations and unwinding and working each batch for a unique combination