Looking for some advice on speeding up database creation. The database is built from thousands of small JSON records. Right now each node and arc is created as a separate transaction, and unsurprisingly, it's about as fast as pouring molasses. I'm using C#/Mono and the Bolt driver.
Any suggestions on how to speed up this process? I did check out PERIODIC COMMIT, but this seems to be limited to the IMPORT statement.
Can you provide an example snippet of your code? As far as I know there isn't an IMPORT keyword in Cypher, though there is LOAD CSV. If you can, run EXPLAIN on your import query and post the expanded query plan here.
I haven't set up any constraints because the input data is high quality. All integrity is enforced on the source side, so it seemed to me that constraints would be redundant.
Oops, yes, I did mean LOAD CSV.
It isn't the query, it's that there are thousands of transactions. I'm looking for ways to process those more efficiently.
Unique constraints or indexes are almost mandatory if you are using MERGE during your import. It's not about the quality of your data, it's about speed.
Specifically, this is about the MATCH and MERGE operations in the query. Without indexes or unique constraints, each of these (per operation and per row, not just once!) may perform an entire label scan to find the node. So if you had a 1,000-row input processed by a query with two MERGEs, that's 2,000 label scans for the query, which is horrible for performance. Adding indexes turns those into index lookups instead, which is where you'll see the big speedup.
There's also a question of Eager operators in the plan, as that can cause heap issues if you're not careful. So getting details about the query, and supporting your import with indexes, is essential for speed.
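For illustration, creating the index is a one-off statement. The label and property here are placeholders (I don't know your model yet), and the exact syntax depends on your Neo4j version:

```
// Neo4j 3.x syntax; on 4.x+ it's CREATE INDEX FOR (n:Node) ON (n.id)
CREATE INDEX ON :Node(id)
```

With that in place, the plan for your MATCH/MERGE lookups changes from a NodeByLabelScan to an index seek.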
The database gets created with thousands of CREATE node statements, followed by thousands of arc creations: MATCH (n1…) MATCH (n2…) MERGE (n1)-[:whatever]-(n2)
So if I understand correctly, with constraints in place those MATCH statements are going to execute much more quickly.
I hadn't thought about creating any indexes because I assumed the database was small enough that it wouldn't make a difference, but from the numbers you're quoting it sounds like I'm already way past that point.
You can even create a unique constraint instead of a simple index; it's slightly better.
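Something like this, again with a placeholder label and property (3.x syntax shown; 4.4+ uses CREATE CONSTRAINT ... FOR ... REQUIRE ...):

```
// Creates the backing index and also enforces uniqueness of the id
CREATE CONSTRAINT ON (n:Node) ASSERT n.id IS UNIQUE
```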
Careful, even with unique constraints, double MATCH clauses are often deadly.
We could help you better if you shared part of your import query.
There's no import query; the database is created by a C# application from huge lists of data. First there's a loop issuing CREATE (:Thought { …some attributes, including a string ID… })
for a few thousand elements, followed by a loop of a few thousand MATCH (thought1 by ID) MATCH (thought2 by ID) MERGE (thought1)-[:whatever]-(thought2),
where :whatever comes from the source data. The source application is mature and well trusted not to duplicate an ID or a link.
If by "loop" you mean a separate query per iteration, with one update per query, then that's not optimized. You should batch many creations (10k-100k at a time, depending on your heap) into each query.
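As a rough sketch of what that looks like on the Cypher side (the property names id/fromId/toId and the relationship type RELATES_TO are placeholders; $batch is just a list of maps that your C# code builds and passes as a query parameter, one driver transaction per chunk of rows):

```
// Batched node creation: one query per chunk of 10k-100k rows instead of one per node
UNWIND $batch AS row
CREATE (:Thought { id: row.id })

// Batched relationship creation, run as a separate query per chunk.
// With a unique constraint on :Thought(id), both MATCHes are index seeks.
UNWIND $batch AS row
MATCH (a:Thought { id: row.fromId })
MATCH (b:Thought { id: row.toId })
MERGE (a)-[:RELATES_TO]->(b)
```

One caveat: a relationship type can't be a Cypher parameter, so if the type really varies per row you'd either group rows by type and run one such query per type, or use something like APOC's apoc.create.relationship.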
Yes, that's why I was asking for advice, and this page you referenced has a ton of good stuff. Thanks! I'll be going through it when I'm back at work next week.