Understanding how MERGE works

andrew_bowman · August 23, 2018, 2:43am

What is MERGE, and how does it work?

The MERGE clause ensures that a pattern exists in the graph.
Either the pattern already exists, or it needs to be created.

In this way, it's helpful to think of MERGE as attempting a MATCH on the pattern, and if no match is found, a CREATE of the pattern.

When the specified pattern is not present and needs to be created, any variables previously bound to existing graph elements will be reused in the pattern. All other elements of the pattern will be created.

It's important to know which pattern elements will use existing graph elements and which will be created instead.

For the following examples, we'll use a very simplistic graph of :Student, :Class, :ReportCard, and :Term nodes as follows.

(:Student)-[:ENROLLED_IN]->(:Class)
(:Student)-[:EARNED]->(:ReportCard)
(:Class)-[:FOR_TERM]->(:Term)

A MERGE without bound variables can create duplicate elements

The most common MERGE mistake is attempting to MERGE a pattern with no bound variables when you want to use existing graph elements.

For example, attempting to enroll an existing student in an existing class.

MERGE (student:Student{id:123})-[:ENROLLED_IN]->(class:Class{name:'Cypher101'})

In the above query, student and class haven't been previously bound to any node, this is their first use in the query.

If the given student is already enrolled in the given class, the variables will be bound to the existing nodes in the graph as expected.

However, if the pattern doesn't already exist, all unbound elements of the pattern will be created. In this case, all of them;
a new :Student node will be created with the given id, and a new :Class node will be created with the given name, and a new :ENROLLED_IN relationship will be created between these brand new nodes.

If there is a unique constraint for :Student(id), then an error will be thrown. Otherwise, duplicate nodes will be created, which might escape notice, especially for novice users.

A MERGE with bound variables reuses existing graph elements

To use the existing nodes and relationships in the graph, MATCH or MERGE on the nodes or relationships first, and then MERGE in the pattern using the bound variables.

A correct version of the enrollment query from above will MATCH on the student and class first, and then MERGE the relationship.

MATCH (student:Student{id:123})
MATCH (class:Class{name:'Cypher101'})
MERGE (student)-[:ENROLLED_IN]->(class)

MERGE using combinations of bound and unbound variables for different use cases

While the above approach is correct for that particular use case, it's not the right approach for all use cases.
We may need to use combinations of bound and unbound variables in the MERGE for correct behavior.

Consider a query creating report cards for students.

If we reused the above approach, the query may look like this.

MATCH (student:Student{id:123})
MERGE (reportCard:ReportCard{term:'Spring2017'})
MERGE (student)-[:EARNED]->(reportCard)

The problem in this query is that the same :ReportCard node is being reused for all students.
If the query also needed to add grades to the :ReportCard, each subsequent entry would overwrite what was added before.
If not caught, this approach would end up with all students having the exact same grades, the grades entered by the last student processed.

What we really need is a separate :ReportCard per student. We can achieve this by binding the :Student node, but not the :ReportCard node.

MATCH (student:Student{id:123})
MERGE (student)-[:EARNED]->(reportCard:ReportCard{term:'Spring2017'})

Since the student variable is bound to a node, that node will be used if the pattern needs to be created, and the :ReportCard will be created just for this student, not shared among all students.

Remember unbound relationships will be created too

The above examples should be easy to understand for nodes, but remember they apply to relationships too. Unbound relationships in a pattern will be created if the entire pattern doesn't exist.

Consider if we needed to enroll a :Student in a :Class for a specific :Term.

MATCH (student:Student{id:123})
MATCH (spring:Term{name:'Spring2017'})
MATCH (class:Class{name:'Cypher101'})
MERGE (student)-[:ENROLLED_IN]->(class)-[:FOR_TERM]->(spring)

This may look correct, and will behave correctly for the very first run, or for repeated runs for the same student, term and class, but not in the case where the query is run for different students but the same class and term.

This is because even if the the :FOR_TERM relationship exists between the class and the term, if the entire pattern doesn't match, such as when enrolling students in the class, the entire pattern will be created. If the query was run 30 times to enroll 30 students for the same class and term, there would be 30 :FOR_TERM relationships between the class and the term.

To fix this, if it's known that the :FOR_TERM relationship already exists between each :Class and its :Term, then move that part of the pattern into the MATCH.

MATCH (student:Student{id:123})
MATCH (class:Class{name:'Cypher101'})-[:FOR_TERM]->(spring:Term{name:'Spring2017'})
MERGE (student)-[:ENROLLED_IN]->(class)

Use ON MERGE and ON CREATE after MERGE to SET properties according to MERGE behavior

Often after a MERGE we need to SET properties on elements of the pattern, but we may want to conditionally set these properties depending on if MERGE matched to an existing pattern, or had to create it.
For example, if we have default property values we want to set on creation.

The ON MERGE and ON CREATE clauses give us the control we need. This also makes it possible to re-run queries and ensure we aren't overriding existing properties with defaults.

Here's an example when setting student report cards and grades, assuming {grades} is a map parameter of grades we want to set when creating a new :ReportCard node.

MATCH (student:Student{id:123})
MERGE (student)-[:EARNED]->(reportCard:ReportCard{term:'Spring2017'})
ON CREATE SET reportCard += {grades}

MERGE acquires locks on nodes and relationships in the pattern when resulting in pattern creation

When MERGE fails to find an existing pattern, it will acquire locks on all bound nodes and relationships in the pattern before creating the missing elements of the pattern.

This is to ensure a MERGE or CREATE can't concurrently create the pattern (or alter the properties of an existing pattern to make it identical to the desired pattern) while the MERGE is executing, which would cause pattern duplication.

MERGE uses double-checked locking to avoid race conditions where the pattern might get created in the time gap between when MERGE determines the pattern doesn't exist, and when locks are acquired.

Note for Neo4j < 3.0.9 for 3.0.x versions, and < 3.1.2 for 3.1.x versions

A bug in the cost planner for affected versions prevented double-checked locking on MERGE. This allowed race conditions which could result in duplicate patterns being created by concurrent MERGE operations, or other write operation which caused the previously non-existent pattern to exist.

We recommend upgrading to a version which includes this bug fix.

1708krishna · December 11, 2018, 9:34am

Amazing and very helpful post. It really helped me to understand Neo4j better. Thanks

jeffreylm · March 13, 2019, 6:02pm

The post is worth goal. I have read the online manual, have been watching hours of tutorial, and nothing covered this essential way of using merge and/or the effect of combining it.

Now I only have to find more information on what bounded and unbounded variables are.

andrew_bowman · March 13, 2019, 7:44pm

Hi Jeffrey,

This just refers to variables which were used previously in the query (and are thus bound to graph elements from previous operations), or variables that are being used for the first time in the MERGE operation, and are not yet bound to anything in the graph.

This is relevant because if you're using variables that are already bound to graph elements, then in the MERGE those elements will be used.

jeffreylm · March 13, 2019, 8:15pm

Hello Andrew,

As simple as that explanation is. It would help a great many newcomers to see that wording in the help file with the create, match and Merge entries.

Once again, thank you.

andrew_bowman · March 13, 2019, 8:37pm

That's fair, I'll update accordingly.

hung.le112143 · October 10, 2019, 10:26am

Hi,

Thanks for the post.

I'm having troubles with MERGE lately. According to what I'd read, race conditions are solved in Neo4j 3.2.x versions. However, I'm fighting it in Neo4j 3.2.6.

I'm having multiple sessions running concurrently, each of which has a transaction with MERGE. This is the MERGE clause:
MERGE (n:NODE {name:$sname})
MERGE (m:NODE {name:$oname})
MERGE (n)-[r:RELATION {relation:$relation}]->(m)

Is it the expecting result for this clause to create multiple nodes with the same name when running concurrently? Because that's what I'm experiencing.

Thanks for any help!

andrew_bowman · October 12, 2019, 3:59pm

Maybe :NODE(name) needs to have a unique constraint? Give that a try.

Some additional info, without a unique constraint, there are no locks taken with respect to some label lock when adding or removing nodes of a label (otherwise we could have lock contention simply by adding or removing nodes of a label in all cases, that would have a big impact to performance).

Because we don't have locks at this level, it is possible for race conditions when doing MERGEs for single nodes, because a MERGE is like first doing a MATCH, and if that MATCH fails then a CREATE. Two concurrently-executing queries (with the same name for the node) could both process the MATCH portion, see that the node with that name doesn't exist, and then they both create a new node with that name.

If there is a unique constraint present, then this scenario isn't possible, as node insert for a labeled node with a unique constraint does need to take locks to ensure uniqueness.

So you can add a unique constraint if :NODE(name) is supposed to be unique. If you absolutely need serialized isolation here, and if :NODE(name) isn't supposed to be unique, then you may need to either not do these operations in parallel, or you may need to go with a pessimistic locking strategy, locking upon some specific node before you do any creation of nodes of that label (and adhering to this in all such operations). This serializes your insert queries, so only one such query will be able to execute at a time.

As for the MERGE of relationships as in your last line, the above concerns do NOT apply, since the nodes for that relationship merge already exist and will be locked upon in case the relationship needs to be created (and we execute the MATCH a second time after taking locks, following the double-checked locking pattern, to ensure we avoid a race condition here).

andrew_bowman · November 12, 2019, 1:00am

A small typo in the last paragraph above, this should be:

As for the MERGE of relationships as in your last line, the above concerns do not apply...

Topic		Replies	Views
Merge: Bounding Relationships Cypher	1	651	August 19, 2019
How do I create a cypher query for the following use case? Cypher	2	344	May 11, 2020
Cypher UPSERT Cypher merge	2	4529	September 20, 2018
Question about creating an intermediate node Cypher	2	30	September 5, 2024
Graph Academy: MERGE Newbie Questions	1	379	July 5, 2021

Submit Your Talk by June 15