Pass ConstraintError and keep creating nodes

Hi all,

I am trying to increase my import speed and at the same time avoid duplicate nodes. Since MERGE is slower than CREATE, I thought it might be useful to simply CREATE nodes while having a UNIQUE constraint in place. This obviously produces a lot of ConstraintErrors, one every time I try to create a duplicate node. Is there any way to catch these errors in the Python interface and simply keep going? A simple try/except doesn't seem to work.

Especially since I am processing in batches using UNWIND, is it possible to catch these errors for every single node individually and not lose the whole batch it occurred in?
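Roughly, my import looks like this (a simplified sketch; the label `X`, property `Y`, and connection details stand in for the real ones):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def commit_batch(tx, batch):
    # One transaction per batch: UNWIND the parameter list and CREATE one node per row.
    tx.run("UNWIND $batch AS row CREATE (n:X {Y: row.Y})", batch=batch)

with driver.session() as session:
    for batch in batches:  # batches: an iterable of lists of dicts like {"Y": "ABC"}
        session.write_transaction(commit_batch, batch)
```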

Thanks a lot for your help!

Cheers,

Florin

What's the entire error message that the duplicate node is throwing? You should be able to except the error, or possibly import errors from the Neo4j Python driver. If you're the fly-by-the-seat-of-your-pants type and you're looping through objects to create nodes, you can always just leave a blank except and simply continue.
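Something along these lines, assuming you create the nodes one at a time in a loop (the connection details and object names are just placeholders):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    for obj in objects:  # objects: whatever iterable you are looping over
        try:
            # consume() fetches the result so any error surfaces inside the try block
            session.run("CREATE (n:X {Y: $y})", y=obj["Y"]).consume()
        except Exception:
            continue  # duplicate (or anything else): skip it and keep going
```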

Whatever you're importing, it is likely faster to preprocess the dataset and remove duplicates prior to import :slight_smile:
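For example, if each batch is a list of dicts keyed on the unique property (a made-up structure, but something similar presumably applies), you can drop duplicates before anything reaches Neo4j:

```python
batch = [{"Y": "ABC"}, {"Y": "DEF"}, {"Y": "ABC"}]  # hypothetical input batch

seen = set()
deduped = []
for row in batch:
    if row["Y"] not in seen:  # keep only the first occurrence of each Y value
        seen.add(row["Y"])
        deduped.append(row)

# deduped == [{"Y": "ABC"}, {"Y": "DEF"}]
```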

Thanks for answering. The complete error traceback is shown below:

Traceback (most recent call last):
  File "graphBuilder.py", line 242, in <module>
    CorrelationBuilder3(args.r2).build()
  File "graphBuilder.py", line 109, in build
    self.feeder()
  File "graphBuilder.py", line 86, in feeder
    results = session.write_transaction(commit_batch, batch)
  File "/home/user/anaconda3/envs/builder/lib/python3.6/site-packages/neo4j/__init__.py", line 714, in write_transaction
    return self._run_transaction(WRITE_ACCESS, unit_of_work, *args, **kwargs)
  File "/home/user/anaconda3/envs/builder/lib/python3.6/site-packages/neo4j/__init__.py", line 686, in _run_transaction
    tx.close()
  File "/home/user/anaconda3/envs/builder/lib/python3.6/site-packages/neo4j/__init__.py", line 828, in close
    self.sync()
  File "/home/user/anaconda3/envs/builder/lib/python3.6/site-packages/neo4j/__init__.py", line 793, in sync
    self.session.sync()
  File "/home/user/anaconda3/envs/builder/lib/python3.6/site-packages/neo4j/__init__.py", line 538, in sync
    detail_count, _ = self._connection.sync()
  File "/home/user/.local/lib/python3.6/site-packages/neobolt/direct.py", line 531, in sync
    detail_delta, summary_delta = self.fetch()
  File "/home/user/.local/lib/python3.6/site-packages/neobolt/direct.py", line 422, in fetch
    return self._fetch()
  File "/home/user/.local/lib/python3.6/site-packages/neobolt/direct.py", line 464, in _fetch
    response.on_failure(summary_metadata or {})
  File "/home/user/.local/lib/python3.6/site-packages/neobolt/direct.py", line 759, in on_failure
    raise CypherError.hydrate(**metadata)
neobolt.exceptions.ConstraintError: Node(41663539) already exists with label `X` and property `Y` = 'ABC'
Failed to write data to connection Address(host='localhost', port=7687) (Address(host='127.0.0.1', port=7687)); ("0; 'Underlying socket connection gone (_ssl.c:2084)'")
Failed to write data to connection Address(host='localhost', port=7687) (Address(host='127.0.0.1', port=7687)); ("0; 'Underlying socket connection gone (_ssl.c:2084)'")

(I altered the label and property names.)

I tried to import that error, but I don't know exactly how.

I also tried the suggestion with a simple except: pass, since I am looping anyway. It doesn't break completely, but the import simply crashes on every single batch without importing anything, while spamming

FIXME: should always disconnect before connect

on stdout.

@Thomas_Silkjaer Even though this might be true, I think it is very cumbersome to manually ensure that no node I am trying to import is already present in the graph. Especially when we have hundreds of millions of nodes and are still adding more from different sources, this becomes infeasible. Of course it is possible to manage all the IDs, properties, adjacencies and so on in huge CSV files and then just use neo4j-admin import, but I don't see the value of a graph database if you have to hold all its contents in massive static tabular formats anyway.

Please correct me if I'm wrong; I am always happy to hear that some things are actually easier than I thought :slight_smile:

This should be the error that you're trying to except. You should be able to import:

from neo4j import exceptions

Give that a shot.
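For your batched import that would look something like this (a sketch around the write_transaction call from your traceback; depending on the driver version the class may instead live in neobolt.exceptions, which is where your traceback raises it from):

```python
from neo4j import exceptions

# session, batches and commit_batch are the ones from your existing import code
for batch in batches:
    try:
        session.write_transaction(commit_batch, batch)
    except exceptions.ConstraintError:
        pass  # skip the failing batch and keep going
```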

Out of curiosity, how much does using MERGE affect your process vs. CREATE? Because if it's not incredibly different, it's probably just safer. Or, as @Thomas_Silkjaer suggested, you can use Python to pre-process the duplicates out with something like a list comprehension. I know you've said you're averse to it, but it would certainly make your uploads to Neo4j faster.
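For reference, the batched MERGE variant only swaps the clause (a sketch; the function name is made up, and it assumes the uniqueness constraint on :X(Y) is in place so MERGE can use its index):

```python
def commit_batch_merge(tx, batch):
    # MERGE matches on the constrained property first and only creates the node
    # if it does not exist yet, so duplicates are silently absorbed.
    tx.run("UNWIND $batch AS row MERGE (n:X {Y: row.Y})", batch=batch)
```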

I'd call it an initial execution. Source the data you need, organize it in CSVs without duplicates, and import into a fresh database with neo4j-admin import. Depending on your dataset you could save anything from days to weeks to months.
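A rough sketch of producing such a de-duplicated node CSV (file names and the property are made up; the :ID / :LABEL header fields are the ones neo4j-admin import expects):

```python
import csv

seen = set()
with open("raw_nodes.csv", newline="") as src, open("nodes.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["Y:ID", ":LABEL"])  # header row for neo4j-admin import
    for row in reader:
        if row["Y"] not in seen:  # keep only the first occurrence of each Y
            seen.add(row["Y"])
            writer.writerow([row["Y"], "X"])
```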

Of course it doesn’t work if the data you ingest post init is growing faster than you can “stream” it :)

Thanks guys, your answers have been extremely helpful. I tested the exception that @MuddyBootsCode suggested and it works, but the database still complains and loses the whole batch just the same. I think I will resort to neo4j-admin import, as @Thomas_Silkjaer suggested. This way I will have more issues to manage myself, but at least I know that I am using the fastest option.

I found some interesting information that might be worth your time, if you haven't already seen it: Efficient Neo4j Data Import Using Cypher-Scripts | by Andrea Santurbano | Neo4j Developer Blog | Medium