I am trying to merge duplicate nodes using a combination of apoc.periodic.iterate and apoc.refactor.mergeNodes, but I get a strange result. The code runs and correctly merges the clusters it touches, but when it finishes some clusters of duplicates are still left. When I run the code again, some more duplicates are merged correctly, but it still does not cover the whole database.
The principle is as follows:
A central parent node (p) can have several child nodes (c) that are sometimes duplicates and should be merged. As merge criteria I am using a combination of:
a link to the same parent node (p), identified by its unique number (p.number)
identical property values on the linked child nodes (c.idstring)
"WITH p
MATCH (p:Parent)-(r:HAS_LINK)-(c:Child)
WITH c.idstring AS idstring, p.number AS number,
COLLECT p AS nodes
CALL apoc.refactor.mergeNodes (nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN node",
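The two criteria above can also be checked with a read-only query, which makes it easier to see how many duplicate clusters are left after each run. This is only a sketch: it assumes the HAS_LINK relationship points from Parent to Child, which the post does not state.

```cypher
// Read-only check: list groups of children that hang off the same
// parent number and carry the same idstring.
// Assumption: HAS_LINK is directed (:Parent)-[:HAS_LINK]->(:Child).
MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH p.number AS number, c.idstring AS idstring, count(DISTINCT c) AS copies
WHERE copies > 1
RETURN number, idstring, copies
ORDER BY copies DESC
LIMIT 25;
```

Running it before and after the merge shows whether the leftover duplicates belong to particular parents or are spread evenly across the database.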
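Your statement cannot work as written; please run each query with EXPLAIN first to catch the syntax errors.
For example, the aggregation must be a function call, e.g. collect(c) AS nodes for the children you want to merge, and the relationship needs the bracket syntax -[r:HAS_LINK]->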
I wouldn't run that in parallel, because concurrent batches can step on each other.
You must also make sure that a batch boundary doesn't split parents that share a child.
It is probably better to do the match in the driving query and pass the collected nodes to the executing query.
Something like:
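// Collect the duplicate children (c), since they are the nodes to merge.
// The config key is batchSize (camel case); parallel is off so that
// concurrent merges cannot step on each other.
CALL apoc.periodic.iterate(
"MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH c.idstring AS idstring, p.number AS number, collect(c) AS nodes
WHERE size(nodes) > 1
RETURN nodes",
"CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node RETURN count(*)",
{batchSize: 1000, parallel: false});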
The code I have been using works fine and does the job, but not all parent-child clusters are processed.
They won't step on each other, since no child has more than one parent. Since the batch focusses on the parents only, I thought that all clusters in the database would be handled, 1000 clusters in every iteration. Or does the batch size count both parents and children, so that some duplicate children are left unmerged?
I only thought so because you aggregate on both parent and child information: if a p.number is shared between parents, you'd get the effect I mentioned.
If your tree is clearly separated and no parent.number repeats, then your query should produce isolated batches of parents, which are then aggregated in your query and merged.
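Both assumptions (no repeating parent number, no child with more than one parent) can be verified rather than taken for granted. A minimal sketch, again assuming the relationship is directed from Parent to Child:

```cypher
// Parent numbers that occur on more than one Parent node.
MATCH (p:Parent)
WITH p.number AS number, count(*) AS parents
WHERE parents > 1
RETURN number, parents
ORDER BY parents DESC
LIMIT 25;

// Children that are attached to more than one distinct Parent node.
// Assumption: HAS_LINK is directed (:Parent)-[:HAS_LINK]->(:Child).
MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH c, count(DISTINCT p) AS parents
WHERE parents > 1
RETURN c.idstring AS idstring, parents
LIMIT 25;
```

If both queries return no rows, the clusters really are isolated and the cause of the leftover duplicates has to be somewhere else.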
Perhaps you can try to reproduce it with one of the clusters that was not merged, in an example that doesn't use periodic iterate?
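Such a reproduction could look roughly like the following. It is a sketch, not a tested fix: $number is a hypothetical parameter holding the p.number of one parent whose children were left unmerged, and the relationship direction is assumed.

```cypher
// Merge the duplicate children of a single parent, without
// apoc.periodic.iterate, to see whether the merge itself works.
// $number is a hypothetical parameter: the p.number of one affected parent.
MATCH (p:Parent {number: $number})-[:HAS_LINK]->(c:Child)
WITH p, c.idstring AS idstring, collect(c) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN p.number AS number, idstring, node;
```

Set $number first (for example with :param in Neo4j Browser). If this merges the cluster correctly, the problem is more likely in how the batches are formed than in the merge call.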