I'm writing some nodes and relations to a Neo4j graph, and the round trip seems to take about 1s for a single insert. Is there a way to batch-insert nodes in one query, e.g. by passing an array of values?
My basic Cypher is like this:
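(The original query isn't shown here, so the sketch below is a reconstruction: the Entity label, name key, and RELATED type are placeholders, but the shape matches the t1/start and t2/end nodes described below.)

```cypher
// Placeholder sketch of a single-row insert: merge both endpoint nodes,
// then merge the relation between them. Labels and keys are assumed.
MERGE (t1:Entity {name: $entity1})
MERGE (t2:Entity {name: $entity2})
MERGE (t1)-[:RELATED {type: $relation}]->(t2)
```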
I can't insert these in parallel, since the rows share nodes: the t2/end node of one insert is the t1/start node of the next. For some reason, even when using transactions, I seem to get duplicated nodes.
You can possibly get duplicates when merging if you simultaneously merge the same node for the first time and don't have a uniqueness constraint.
You can pass in an array of data via your driver and use it in the query. I suggest an array of maps, so you can access the properties of the map with dot notation.
Assuming you pass the data as “rows”, which is an array of maps with properties entity1, entity2, and relation, you can try the following:
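A sketch of that query, reusing the placeholder Entity label and RELATED type from above (plain Cypher can't parameterize a relationship type, so the relation name is stored as a property here):

```cypher
// Batch version: one round trip processes the whole $rows array.
UNWIND $rows AS row
MERGE (t1:Entity {name: row.entity1})
MERGE (t2:Entity {name: row.entity2})
MERGE (t1)-[:RELATED {type: row.relation}]->(t2)
```

Because this runs as a single transaction, later rows see the nodes merged by earlier rows, so shared start/end nodes within one batch are no longer a problem, and you pay the round-trip cost once per batch instead of once per row.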
Sure. If you have multiple threads merging on the same label and property value very close together, you can experience a race condition where neither thread matches, so each creates a node thinking one does not already exist. Once the node(s) are created and set up in the db, the merge will match and will not create a new node.
I have simulated this behavior by submitting the same task (which merged the same node) to a thread pool with no latency between submissions. I observed duplicate nodes created by the first few threads. I was able to eliminate this behavior by adding a uniqueness constraint on the label/property.
I am not sure why you are "worried" about adding a uniqueness constraint. It also creates a backing index, which will greatly improve your matches/merges on that label/property.
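For reference, creating one looks like this (Neo4j 4.4+ syntax; the label and property are assumed from the sketches above):

```cypher
// Guarantees at most one Entity per name, and creates a backing index
// that speeds up MATCH/MERGE lookups on Entity.name.
CREATE CONSTRAINT entity_name_unique IF NOT EXISTS
FOR (e:Entity) REQUIRE e.name IS UNIQUE
```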
Interesting. But even with threads, separate sessions, or multiple async callbacks:
if these are transactions, I would think the existence check is part of the same TX, so there shouldn't be a duplicate created?
I'm running these with JS: a bunch of async TXs with a Promise.all() to wait for them to complete, roughly the pattern sketched below. Is the transaction just safe within its own process? That can't be right.
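(An illustrative sketch of that pattern, not the actual code; connection details are placeholders. Each session.run() here is its own auto-commit transaction, so two concurrent MERGEs on the same value can race before either commits.)

```javascript
const neo4j = require('neo4j-driver');
const driver = neo4j.driver('bolt://localhost:7687',
  neo4j.auth.basic('neo4j', 'password')); // placeholder credentials

async function insertAll(rows) {
  // One session per concurrent task (sessions are not thread/task safe),
  // all fired at once and awaited together.
  await Promise.all(rows.map(async (row) => {
    const session = driver.session();
    try {
      await session.run(
        'MERGE (t1:Entity {name: $entity1}) ' +
        'MERGE (t2:Entity {name: $entity2}) ' +
        'MERGE (t1)-[:RELATED {type: $relation}]->(t2)',
        row);
    } finally {
      await session.close();
    }
  }));
}
```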
Re the constraint: my worry is that if I have one long sequence of inserts with the UNWIND and one node turns out to be a duplicate, then the whole sequence (transaction) would fail.
It would be better to figure out where the dupe is coming from...
I am referring to multiple threads, which should always use their own session, as sessions are not thread-safe. As such, each thread has its own transaction. Transactions are not aware of other transactions' writes until those are committed. This can cause a race condition with a merge when multiple transactions are merging the same node and it doesn't already exist. The race condition is eliminated with a uniqueness constraint; I am sure there is some locking in that case to make sure the merges are sequential.
But if the constraint throws an error, would that mean the whole transaction would fail?
So if I had a big UNWIND in there, all nodes would fail if one were a duplicate...
Then maybe I need a way within Cypher to create a set of individual transactions?
But I think the TX handling is at the driver level, not within Cypher itself.
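(For what it's worth, newer Neo4j does have an in-Cypher option: since 4.4, CALL { ... } IN TRANSACTIONS commits in inner batches, and batches that have already committed stay committed even if a later one fails. It must run as an implicit/auto-commit query, not inside a driver transaction function. A sketch, reusing the placeholder names from above:)

```cypher
// Commits every 100 rows in its own inner transaction (Neo4j 4.4+).
// Run as an implicit (auto-commit) query; in Browser, prefix with :auto.
UNWIND $rows AS row
CALL {
  WITH row
  MERGE (t1:Entity {name: row.entity1})
  MERGE (t2:Entity {name: row.entity2})
  MERGE (t1)-[:RELATED {type: row.relation}]->(t2)
} IN TRANSACTIONS OF 100 ROWS
```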
If I make lots of individual calls, the latency (or whatever the issue is) makes each one take around a second, even though my dataset is still tiny.
An exception will be thrown. If you are using auto-commit transactions, the whole thing will be rolled back. You could use a transaction function so you can handle the exception.
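A sketch with the JS driver (executeWrite is the transaction-function API in driver 5.x; in 4.x it was writeTransaction; `driver` and `rows` are assumed to be in scope from the earlier sketches):

```javascript
// Run the batched merge in a transaction function and catch a
// uniqueness-constraint violation instead of letting Promise.all() reject.
const session = driver.session();
try {
  await session.executeWrite(async (tx) => {
    await tx.run(
      'UNWIND $rows AS row ' +
      'MERGE (t1:Entity {name: row.entity1}) ' +
      'MERGE (t2:Entity {name: row.entity2}) ' +
      'MERGE (t1)-[:RELATED {type: row.relation}]->(t2)',
      { rows });
  });
} catch (err) {
  if (err.code === 'Neo.ClientError.Schema.ConstraintValidationFailed') {
    // A duplicate violated the constraint; this transaction rolled back.
    console.error('constraint violation:', err.message);
  } else {
    throw err;
  }
} finally {
  await session.close();
}
```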
You should try not to insert duplicate data if you know you have a uniqueness constraint, so this should not happen frequently.
The point of the uniqueness constraint is to avoid duplicates. It's necessary if this is important to you. I would rather have an exception than create duplicates when uniqueness is a requirement or expected.
You should have a uniqueness constraint on any property you are using to merge nodes on. You should consider these properties as primary keys.
Keep in mind the situation we are discussing is a race condition where you merge the same node concurrently, for the first time, and very close together in time. Once the node is set up in the db, this is not an issue. There is locking you may contend with if you try to update the same node concurrently, but that is standard stuff to deal with in a multiuser application.