Looking for some advice on speeding up database creation. The database is built from thousands of small JSON records. Right now each node and arc is created as a separate transaction, and unsurprisingly, it's about as fast as pouring molasses. I'm using C#/Mono and the Bolt driver.
Any suggestions on how to speed up this process? I did check out PERIODIC COMMIT, but this seems to be limited to the IMPORT statement.
Can you provide an example snippet of your code? As far as I know there isn't an IMPORT keyword in Cypher, though there is LOAD CSV. If you can, run EXPLAIN on your import query and post the expanded query plan here.
I haven't set up any constraints because the input data is high quality. All integrity is enforced on the source side, so it seemed to me that constraints would be redundant.
Oops, yes, I did mean LOAD CSV.
It isn't the query, it's that there are thousands of transactions. I'm looking for ways to process those more efficiently.
Unique constraints or indexes are almost mandatory if you are using MERGE during your import. It's not about the quality of your data, it's about speed.
Specifically, this is about the MATCH and MERGE operations in the query. Without indexes or unique constraints, each of these (per operation and per row, not just once!) may perform an entire label scan to find the node. So if you had a 1,000-row input processed by a query with two MERGEs, that's 2,000 label scans for the query, which is horrible for performance. Adding indexes turns those into index lookups instead, which is where you'll see the big speedup.
There's also a question of Eager operators in the plan, as that can cause heap issues if you're not careful. So getting details about the query, and supporting your import with indexes, is essential for speed.
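For illustration, creating the index is a one-off statement. The label and property here are placeholders (I don't know your model yet), and the exact syntax depends on your Neo4j version:

```
// Neo4j 3.x syntax; on 4.x+ it's CREATE INDEX FOR (n:Node) ON (n.id)
CREATE INDEX ON :Node(id)
```

With that in place, the plan for your MATCH/MERGE lookups changes from a NodeByLabelScan to an index seek.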
The database gets created with thousands of CREATE node statements, followed by thousands of arc creations: MATCH (n1…) MATCH (n2…) MERGE (n1)-[:whatever]-(n2)
So if I understand correctly, with constraints in place those MATCH statements are going to execute much more quickly.
I hadn't thought about creating any indexes because I assumed the database was small enough that it wouldn't make a difference, but from the numbers you're quoting it sounds like I'm already way past that point.
You can even create a unique constraint instead of a simple index; it's slightly better.
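Something like this, again with a placeholder label and property (3.x syntax shown; 4.4+ uses CREATE CONSTRAINT ... FOR ... REQUIRE ...):

```
// Creates the backing index and also enforces uniqueness of the id
CREATE CONSTRAINT ON (n:Node) ASSERT n.id IS UNIQUE
```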
Careful, even with unique constraints, double MATCH clauses are often deadly.
We could help you better if you shared part of your import query.
There's no import query; the database is created by a C# application from huge lists of data. First there's a loop issuing CREATE (:Thought { …some attributes, including a string ID… })
for a few thousand elements, followed by a loop of a few thousand MATCH (thought1 by ID) MATCH (thought2 by ID) MERGE (thought1)-[:whatever]-(thought2),
where :whatever comes from the source data. The source application is mature and well trusted not to duplicate an ID or a link.
If by "loop" you mean a separate query per iteration, with one update per query, then that's not optimized. You should batch many creations (10k-100k at a time, depending on your heap) into each query.
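As a rough sketch of what that looks like on the Cypher side (the property names id/fromId/toId and the relationship type RELATES_TO are placeholders; $batch is just a list of maps that your C# code builds and passes as a query parameter, one driver transaction per chunk of rows):

```
// Batched node creation: one query per chunk of 10k-100k rows instead of one per node
UNWIND $batch AS row
CREATE (:Thought { id: row.id })

// Batched relationship creation, run as a separate query per chunk.
// With a unique constraint on :Thought(id), both MATCHes are index seeks.
UNWIND $batch AS row
MATCH (a:Thought { id: row.fromId })
MATCH (b:Thought { id: row.toId })
MERGE (a)-[:RELATES_TO]->(b)
```

One caveat: a relationship type can't be a Cypher parameter, so if the type really varies per row you'd either group rows by type and run one such query per type, or use something like APOC's apoc.create.relationship.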
Yes, that's why I was asking for advice, and this page you referenced has a ton of good stuff. Thanks! I'll be going through it when I'm back at work next week.