I'm writing some nodes and relations to a Neo4j graph, and the round trip seems to take about 1s for a single insert. Is there a way to batch-insert nodes in one query, e.g. by passing an array of values?
My basic Cypher is like this:
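(The original query isn't shown here, so the sketch below is a reconstruction: the Entity label, name key, and RELATED type are placeholders, but the shape matches the t1/start and t2/end nodes described below.)

```cypher
// Placeholder sketch of a single-row insert: merge both endpoint nodes,
// then merge the relation between them. Labels and keys are assumed.
MERGE (t1:Entity {name: $entity1})
MERGE (t2:Entity {name: $entity2})
MERGE (t1)-[:RELATED {type: $relation}]->(t2)
```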
I can't insert these in parallel, since the rows share nodes: the t2/end node of one insert is the t1/start node of the next. For some reason, even when using transactions, I seem to get duplicated nodes.
You can possibly get duplicates when merging if you simultaneously merge the same node for the first time and don't have a uniqueness constraint.
You can pass in an array of data via your driver and use it in the query. I suggest an array of maps, so you can access the properties of the map with dot notation.
Assuming you pass the data as “rows”, which is an array of maps with properties entity1, entity2, and relation, you can try the following:
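A sketch of that query, reusing the placeholder Entity label and RELATED type from above (plain Cypher can't parameterize a relationship type, so the relation name is stored as a property here):

```cypher
// Batch version: one round trip processes the whole $rows array.
UNWIND $rows AS row
MERGE (t1:Entity {name: row.entity1})
MERGE (t2:Entity {name: row.entity2})
MERGE (t1)-[:RELATED {type: row.relation}]->(t2)
```

Because this runs as a single transaction, later rows see the nodes merged by earlier rows, so shared start/end nodes within one batch are no longer a problem, and you pay the round-trip cost once per batch instead of once per row.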
Sure. If you have multiple threads merging on the same label and property value very close together, you can experience a race condition where neither thread matches, so each creates a node thinking one does not already exist. Once the node(s) are created and set up in the db, the merge will match and will not create a new node.
I have simulated this behavior by submitting the same task (which merged the same node) to a thread pool with no latency between submissions. I observed duplicate nodes created by the first few threads. I was able to eliminate this behavior by adding a uniqueness constraint on the label/property.
I am not sure why you are "worried" about adding a uniqueness constraint. It also creates a backing index, which will greatly improve your matches/merges on that label/property.
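For reference, creating one looks like this (Neo4j 4.4+ syntax; the label and property are assumed from the sketches above):

```cypher
// Guarantees at most one Entity per name, and creates a backing index
// that speeds up MATCH/MERGE lookups on Entity.name.
CREATE CONSTRAINT entity_name_unique IF NOT EXISTS
FOR (e:Entity) REQUIRE e.name IS UNIQUE
```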
Interesting. But even with threads, separate sessions, or multiple async callbacks:
if these are transactions, I would think the existence check is part of the same TX, so there shouldn't be a duplicate created?
I'm running these with JS: a bunch of async TXs with a Promise.all() to wait for them to complete, roughly the pattern sketched below. Is the transaction just safe within its own process? That can't be right.
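(An illustrative sketch of that pattern, not the actual code; connection details are placeholders. Each session.run() here is its own auto-commit transaction, so two concurrent MERGEs on the same value can race before either commits.)

```javascript
const neo4j = require('neo4j-driver');
const driver = neo4j.driver('bolt://localhost:7687',
  neo4j.auth.basic('neo4j', 'password')); // placeholder credentials

async function insertAll(rows) {
  // One session per concurrent task (sessions are not thread/task safe),
  // all fired at once and awaited together.
  await Promise.all(rows.map(async (row) => {
    const session = driver.session();
    try {
      await session.run(
        'MERGE (t1:Entity {name: $entity1}) ' +
        'MERGE (t2:Entity {name: $entity2}) ' +
        'MERGE (t1)-[:RELATED {type: $relation}]->(t2)',
        row);
    } finally {
      await session.close();
    }
  }));
}
```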
Re the constraint: my worry is that if I have one long sequence of inserts with the UNWIND and one node turns out to be a duplicate, then the whole sequence (transaction) would fail.
It would be better to figure out where the dupe is coming from...
I am referring to multiple threads, which should always use their own session, as sessions are not thread-safe. As such, each thread has its own transaction. Transactions are not aware of other transactions' writes until those are committed. This can cause a race condition with a merge when multiple transactions are merging the same node and it doesn't already exist. The race condition is eliminated with a uniqueness constraint; I am sure there is some locking in that case to make sure the merges are sequential.
But if the constraint throws an error, would that mean the whole transaction would fail?
So if I had a big UNWIND in there, all nodes would fail if one were a duplicate...
Then maybe I need a way within Cypher to create a set of individual transactions?
But I think the TX handling is at the driver level, not within Cypher itself.
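(For what it's worth, newer Neo4j does have an in-Cypher option: since 4.4, CALL { ... } IN TRANSACTIONS commits in inner batches, and batches that have already committed stay committed even if a later one fails. It must run as an implicit/auto-commit query, not inside a driver transaction function. A sketch, reusing the placeholder names from above:)

```cypher
// Commits every 100 rows in its own inner transaction (Neo4j 4.4+).
// Run as an implicit (auto-commit) query; in Browser, prefix with :auto.
UNWIND $rows AS row
CALL {
  WITH row
  MERGE (t1:Entity {name: row.entity1})
  MERGE (t2:Entity {name: row.entity2})
  MERGE (t1)-[:RELATED {type: row.relation}]->(t2)
} IN TRANSACTIONS OF 100 ROWS
```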
If I make lots of individual calls, the latency (or whatever the issue is) makes each one take around a second, even though my dataset is still tiny.
An exception will be thrown. If you are using auto-commit transactions, the whole thing will be rolled back. You could use a transaction function so you can handle the exception.
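A sketch with the JS driver (executeWrite is the transaction-function API in driver 5.x; in 4.x it was writeTransaction; `driver` and `rows` are assumed to be in scope from the earlier sketches):

```javascript
// Run the batched merge in a transaction function and catch a
// uniqueness-constraint violation instead of letting Promise.all() reject.
const session = driver.session();
try {
  await session.executeWrite(async (tx) => {
    await tx.run(
      'UNWIND $rows AS row ' +
      'MERGE (t1:Entity {name: row.entity1}) ' +
      'MERGE (t2:Entity {name: row.entity2}) ' +
      'MERGE (t1)-[:RELATED {type: row.relation}]->(t2)',
      { rows });
  });
} catch (err) {
  if (err.code === 'Neo.ClientError.Schema.ConstraintValidationFailed') {
    // A duplicate violated the constraint; this transaction rolled back.
    console.error('constraint violation:', err.message);
  } else {
    throw err;
  }
} finally {
  await session.close();
}
```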
You should try not to insert duplicate data if you know you have a uniqueness constraint, so this should not happen frequently.
The point of the uniqueness constraint is to avoid duplicates. It's necessary if this is important to you. I would rather have an exception than create duplicates when uniqueness is a requirement or expected.
You should have a uniqueness constraint on any property you are using to merge nodes on. You should consider these properties as primary keys.
Keep in mind the situation we are discussing is a race condition where you merge the same node concurrently, for the first time, and very close together in time. Once the node is set up in the db, this is not an issue. There is locking you may contend with if you try to update the same node concurrently, but that is standard stuff to deal with in a multiuser application.