Apoc iterated MergeNodes will not merge all matching nodes

jhellquist · August 29, 2019, 1:16pm

I am trying to merge duplicate nodes using a combination of apoc.periodic.iterate and apoc.refactor.mergeNodes but get a strange result. The code runs and seems to do the job with the clusters touched upon, but when finished there are still some clusters of duplicates left. When I run the code again some more duplicates are merged correctly but it does not go over the whole database.

The principle is as follows:
A central parent node (p) can have several child nodes (c) that sometimes are duplicates and should be merged. As merging criteria I am using a combination of:

- a link to the same parent node (p), identified by a unique number (p.number)
- identical property values on the linked child nodes (c.idstring)
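
For reference, the two criteria above can be written as a read-only query that lists the duplicate groups without changing anything. This is only a minimal sketch: it assumes the relationship is directed from parent to child as (:Parent)-[:HAS_LINK]->(:Child), which the thread does not state explicitly, and the aliases `duplicates` and `copies` are made up for illustration.

```cypher
// Read-only check: groups of Child nodes that hang off the same parent
// (same p.number) and carry the same idstring.
// Assumption: HAS_LINK points from Parent to Child.
MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH p.number AS number, c.idstring AS idstring, collect(c) AS duplicates
WHERE size(duplicates) > 1
RETURN number, idstring, size(duplicates) AS copies
ORDER BY copies DESC
LIMIT 25;
```

Running this before and after a merge pass shows which clusters, if any, were left behind.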

Neo4j 3.5.4 Enterprise edition
Apoc 3.5.0.3
Code:
CALL apoc.periodic.iterate( "Match (p:Parent) RETURN p",

"WITH p
MATCH (p:Parent)-(r:HAS_LINK)-(c:Child)
WITH c.idstring AS idstring, p.number AS number,
COLLECT p AS nodes
CALL apoc.refactor.mergeNodes (nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN node",

{batchsize: 1000, parallel: true})
;

michael.hunger · August 30, 2019, 10:07am

Your statement cannot work as written; please run each query with EXPLAIN.

For example, it should be collect(p) AS nodes, and the relationship syntax should be -[r:HAS_LINK]->.

I wouldn't run it in parallel, because the batches can step on each other.
Also, you must make sure that a batch boundary doesn't split parents that share a child.
It is probably better to do the match in the driving query and pass the parent collection to the executing query.

something like:

// Note: the config key is batchSize (camel case); parallel is off so merges cannot collide.
CALL apoc.periodic.iterate(
"MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
WITH c.idstring AS idstring, p.number AS number, collect(p) AS nodes
RETURN nodes",

"CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node RETURN count(*)",

{batchSize: 1000, parallel: false})
;
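
Since the stated goal is to merge duplicate children, a variant of the pattern above might collect the Child nodes instead and return the procedure's own statistics, so that failed batches become visible. Treat this as a sketch under assumptions rather than a confirmed fix: it assumes HAS_LINK is directed from Parent to Child, groups per parent node rather than per p.number (equivalent only if p.number is unique, as described in the question), and relies on the result columns failedBatches, failedOperations and errorMessages, which should be checked against the installed APOC version.

```cypher
// Driving query: one row per group of duplicate Child nodes under one parent.
// Executing query: merge each group into its first node.
// Assumptions: HAS_LINK is directed Parent -> Child; p.number is unique per parent.
CALL apoc.periodic.iterate(
  "MATCH (p:Parent)-[:HAS_LINK]->(c:Child)
   WITH p, c.idstring AS idstring, collect(c) AS nodes
   WHERE size(nodes) > 1
   RETURN nodes",
  "CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
   YIELD node
   RETURN count(*)",
  {batchSize: 1000, parallel: false})
YIELD batches, total, failedBatches, failedOperations, errorMessages
RETURN batches, total, failedBatches, failedOperations, errorMessages;
```

If failedBatches is greater than zero, errorMessages may explain why some clusters were skipped, for example lock contention when running with parallel: true.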

jhellquist · August 30, 2019, 11:10am

The code I have been using works fine and does the job, but not all parent-child clusters are processed.

They won't step on each other, since no child has more than one parent. Since the batch focusses on the parents only, I thought that all clusters in the database would be handled, 1000 clusters in every iteration. Or will a batch include both parents and children, thus leaving some duplicate children unmerged?

michael.hunger · August 31, 2019, 12:37am

I only mentioned it because you aggregate on both parent and child information: if a p.number is shared between parents, you'd get the effect I described.

jhellquist · August 31, 2019, 9:15am

Ok, thanks. Any ideas about the batch content; will it only include parents or a mix of parents and linked child nodes?

michael.hunger · August 31, 2019, 10:25am

If your tree is clearly separated and there is no repeated parent.number, then your query should produce isolated batches of parents, which are then aggregated in your query and merged.

Perhaps you can try to reproduce with one of the missing merged nodes in a non-periodic-iterate example?
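
A reproduction without apoc.periodic.iterate could look roughly like the following. It is a sketch only: the value 12345 is a hypothetical placeholder for the p.number of a cluster that was left unmerged, and the query assumes HAS_LINK is directed from Parent to Child and that the Child nodes are the ones to merge.

```cypher
// Merge the duplicate children of a single parent in one plain transaction.
// 12345 is a placeholder; substitute the p.number of an unmerged cluster.
MATCH (p:Parent {number: 12345})-[:HAS_LINK]->(c:Child)
WITH p, c.idstring AS idstring, collect(c) AS nodes
WHERE size(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {properties: 'discard', mergeRels: true})
YIELD node
RETURN p.number AS number, idstring, node;
```

If this merges the cluster while the iterated version skips it, the difference most likely lies in how the batches are executed rather than in the merge criteria.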

jhellquist · September 4, 2019, 7:25am

I have repeated the mergeNodes code without iteration on some of the remaining clusters, and then they are merged as they should be. This is a mystery to me.
