Merge all nodes with the same property name

Hi, I've a problem that I do not know how to code in cypher. I have duplicate nodes with the same property name, (n.name) and they have their own relationships. I wanted to match these nodes, merges the properties and relationships of the 2nd through last nodes onto the first node, and deletes the 2nd through last nodes. I have the code below and it works.

MATCH (n:name)
WHERE n.name = "john"
WITH COLLECT(n) AS ns
CALL apoc.refactor.mergeNodes(ns) YIELD node
RETURN node;

However, I do not want to input the name manually. Is it possible to do a for loop all the nodes that have the same property name and do the above code? like john, jack, jane ...

to find nodes with the same property value

MATCH (n1:name),(n2:name)
WHERE n1.name = n2.name and id(n1) < id(n2)
WITH [n1,n2] as ns
CALL apoc.refactor.mergeNodes(ns) YIELD node
RETURN node

id(n1) < id(n2)
don't quite get this and got syntax error
does the id refer to the internal id that neo4j created? i do not have id as property in the node

Yes, this is internal id of Neo4j. I fixed the syntax error

works! thanks alot!!!!!

Hello, I have actually been looking for a solution like that as well. For me it however leads to a Cartesian product? Is there a way to avoid that?

Yes, it will look similar to the query in the first post, except we'll collect with respect to name (for each row/name, we'll get the collection of nodes with that name) and filter to only rows where there are multiple nodes for that name:

MATCH (n:name)
WITH n.name as name, COLLECT(n) AS ns
WHERE size(ns) > 1
CALL apoc.refactor.mergeNodes(ns) YIELD node
RETURN node;
3 Likes

Brilliant, thanks a lot!!!

Hi, I've an error if the property of my nodes are uniquely indexed. Are there any work around with it?

New data does not satisfy CONSTRAINT ON ( label[0]:label[0] ) ASSERT label[0].property[0] IS UNIQUE.

Nvm solved it using this

CALL apoc.refactor.mergeNodes(nodes,{properties:"combine", mergeRels:true}) yield node

Hi andrew, is it possible to create case insensitive merge query like this one?

You mean case insensitive on the name property (or the common property used to identify duplicates), or case insensitive on the properties being merged?

If you want case insensitive on the name property, then it should be enough to use toLower() or toUpper() on that property at the time of the collection:

MATCH (n:name)
WITH toLower(n.name) as name, COLLECT(n) AS ns
WHERE size(ns) > 1
CALL apoc.refactor.mergeNodes(ns) YIELD node
RETURN node;
1 Like

Worked like a charm, thanks.

Hi, I've got a very similar question, but I am unsure on how to solve it.
I have a pretty big database (>3 000 000 nodes) and I'm trying to merge nodes but those who have multiple similar properties only.

First option works ok, but it's doing a cartesian product, and I fear of running out of memory, or that it will take ages to complete.

I want to use second option, but I don't quite understand how "WITH ... as ... COLLECT" works.
Is this query correct ?

MATCH (n:Word) WITH toLower(n.spelling) as spelling AND toLower(n.pos) as pos AND toLower(n.language) as spelling, COLLECT(n) AS ns
WHERE size(ns) > 1
CALL apoc.refactor.mergeNodes(ns) YIELD node
RETURN node;

I hope this is an appropriate place to ask my question. Thank you for ur help
Please excuse my fragile english

This is my solution that avoids the Cartesian product.

The trick is to use CALL to force a first MATCH to completion without involving a second MATCH in a Cartesian Product. This generates the duplicate nodes and a list of the property (names) values of those nodes that have been duplicated. The list of properties that are duplicate makes searching for the duplicates much faster.

The 3rd MATCH (outside of the CALL) is highly filtered so it's much faster. It gets the two duplicate nodes (that are different because they have different internal id's).

CALL {MATCH (n:Label)
WITH COLLECT(n.name) AS names  // I have to admit I don't know why this works.
WHERE size(names) > 1 
WITH collect(names[0]) as bnames // Makes a list of names that were duplicated but without duplicates
MATCH (b:Label) WHERE b.name in bnames
return b, bnames}
WITH b, bnames // b is duplicate nodes, bnames is list of names that were dupped

// b2 is the node that is a duplicate of b (by name)
MATCH (b2:Label) // get a second node
WHERE b2.name in bnames AND b2.name = b.name AND id(b) > id(b2) // fast match a duplicate
CALL apoc.refactor.mergeNodes([b, b2],
     {properties:"combine", mergeRels:true}) // merging details
YIELD node
RETURN node

I'm at the intermediate level, so this query could be improved.... For instance, I suspect that this test isn't needed:

b2.name in bnames

Anyway, this query ran faster than the other versions above.

I also Cloned my DB before running the query just in case something went awry...

I hope this helps.

1 Like