Improve admin-import performance

I'm running the latest community edition of Neo4j and trying to bulk-import data into an empty database with the following command.

.\bin\neo4j-admin.ps1 database import full --overwrite-destination neo4j `
    --nodes="type_1.tsv.gz" `
    --nodes="type_2_header.tsv.gz,\.*type_2.tsv.gz" `
    --delimiter "\t" --array-delimiter ";" --verbose `

This command works as expected, but my data is over 2 TB and the import, running on NVMe, takes over 3 weeks. Is there anything I can do to speed it up?

Given the extensive runtime, I can't afford to experiment with options, so I'm only interested in solutions that are proven to improve performance.

It probably runs very slowly because the nodes and relationships aren't sorted. You can sort the nodes in the files by ID in Python. Don't use pandas for that, because it's too slow; I suggest modin https://pypi.org/project/modin/ or polars https://docs.pola.rs/ for the preprocessing.

Are you suggesting sorting 2 TB of data?

You can try sorting the data within the separate files; of course, 2 TB is too big to fit in memory. Polars has a lazy API with a streaming engine for exactly this purpose: https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.sort.html

I use this library to work with huge files, and I hope it will help you too.
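
For example, something along these lines (just a sketch, not tested on your data). The file name, the ":ID" column and the tab separator are assumptions based on your command, and it would have to run on a decompressed copy of each file, since I don't think scan_csv reads .gz input directly.

import polars as pl

# Lazily scan one node file, sort it by the ID column used for import,
# and stream the sorted rows back to disk. The streaming engine is built
# for data that does not fit in memory.
(
    pl.scan_csv("type_1.tsv", separator="\t")        # placeholder file name
      .sort(":ID")                                   # assumed ID column name
      .sink_csv("type_1.sorted.tsv", separator="\t") # write the sorted copy
)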

Thanks! While sorting might help, I can't invest in it, since it would take both additional runtime and dev time. For instance, it wouldn't be worth it if it took 1 week to sort and then 2 weeks to import. Sorting with the bash sort tool and an on-disk cache already takes a long time on this data, so I don't expect this library to be significantly faster.

Other than sorting, what else can I do, relying on Neo4j itself?

You haven't described your hardware, so I can only assume that you are running this Community Edition on a laptop.

So it is very likely that any of your queries will also take forever to execute, since you probably won't have enough RAM to hold the datasets, the temporary structures built during execution, and so on.

For example, if you have 32 GB of RAM available to Neo4j, then in an ideal scenario (everything happens to be organised exactly as the query needs) you will swap between memory and disk a minimum of roughly 60 times (2 TB / 32 GB ≈ 64 passes). In a practical scenario you are looking at hundreds of swaps, maybe thousands.

An RDBMS or a graph database has much the same limitation. This has nothing to do with the database - it is about the physical limits of the hardware versus the size of the dataset.

You should size your data accordingly ...

I'm running on a desktop computer with 64 GB of RAM and an NVMe drive.

If I'm reading you correctly, you're assuming the database operations will work efficiently if the system has enough memory to accommodate all the data. I doubt that because if you have enough memory to accommodate all the data, why would you need a database solution in the first place?

This is not about running queries, though; it is about importing data into the database.

What I mean is that performance is going to be slower depending on the available resources ... "twice as much memory" means, at a theoretical best, "half the time".

Memory is just a bucket; a database is an engine for making sense of the contents of that memory.

You are building relationships as you import the data; if it were just reading files, you would be constrained by the disks' speed / PCIe bandwidth / memory ...

Your files are also compressed, meaning they are first decompressed and then interpreted.

If you are using the same disk for everything, you are reading (2 TB?) -> decompressing -> writing (+2 TB?) -> analysing -> importing (+2 TB?), and your NVMe's writes are roughly half the speed of its reads. If you are instead using multiple disks over a network, you are then constrained by the network's speed.
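
As a rough back-of-the-envelope sketch of that traffic - the sizes and drive speeds below are placeholder assumptions, not measurements of your setup:

# Tally the bytes pushed through a single drive in the pipeline above and
# estimate the best-case sequential I/O time. All figures are assumptions.
TB = 1024 ** 4

read_compressed    = 2 * TB    # reading the ~2 TB of .tsv.gz input
write_decompressed = 2 * TB    # hypothetical decompressed copy written to the same disk
write_store        = 2 * TB    # hypothetical size of the final Neo4j store files

read_speed  = 5.0e9    # assumed NVMe sequential read, ~5 GB/s
write_speed = 2.5e9    # assumed writes at roughly half the read speed

seconds = (read_compressed / read_speed
           + (write_decompressed + write_store) / write_speed)
print(f"best-case sequential I/O alone: ~{seconds / 3600:.1f} hours")

Even with optimistic numbers, the raw sequential I/O is a small fraction of three weeks; the rest goes to decompression, building the relationships, and the memory-to-disk swapping described earlier.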