Improve admin-import performance

I'm running the latest community edition of Neo4j and trying to bulk-import data into an empty database with the following command.

.\bin\neo4j-admin.ps1 database import full --overwrite-destination neo4j `
    --nodes="type_1.tsv.gz" `
    --nodes="type_2_header.tsv.gz,\.*type_2.tsv.gz" `
    --delimiter "\t" --array-delimiter ";" --verbose `

This command works as expected, but my data is over 2 TB and the import, running on NVMe, takes over 3 weeks. Is there anything I can do to speed it up?

Given the extensive runtime, I can't afford to experiment with options, so I'm only interested in solutions that are proven to improve performance.

It probably runs very slowly because the nodes and relationships aren't sorted. You can sort the nodes in the files by ID in Python. Don't use pandas for that, because it's too slow; I suggest modin https://pypi.org/project/modin/ or polars https://docs.pola.rs/ for the preprocessing.

Are you suggesting sorting 2 TB of data?

You can try sorting the data within the separate files; of course, 2 TB is too big to fit in memory. Polars has a lazy API with a streaming engine for exactly this purpose: https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.sort.html

I use this library to work with huge files, and I hope it will help you too.
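
For example, something along these lines (just a sketch, not tested on your data). The file name, the ":ID" column and the tab separator are assumptions based on your command, and it would have to run on a decompressed copy of each file, since I don't think scan_csv reads .gz input directly.

import polars as pl

# Lazily scan one node file, sort it by the ID column used for import,
# and stream the sorted rows back to disk. The streaming engine is built
# for data that does not fit in memory.
(
    pl.scan_csv("type_1.tsv", separator="\t")        # placeholder file name
      .sort(":ID")                                   # assumed ID column name
      .sink_csv("type_1.sorted.tsv", separator="\t") # write the sorted copy
)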

Thanks! While sorting might help, I can't invest in it, since it would take both additional runtime and dev time. For instance, it wouldn't be worth it if it took 1 week to sort and then 2 weeks to import. Sorting with the bash sort tool and an on-disk cache already takes a long time on this data, so I don't expect this library to be significantly faster.

Other than sorting, what else can I do, relying on Neo4j itself?

You haven't described your hardware, so I can only assume that you are running this Community Edition on a laptop.

So it is very likely that any of your queries will also take forever to execute, since you probably won't have enough RAM to hold the datasets, the temporary structures built during execution, and so on.

For example, if you have 32 GB of RAM available to Neo4j, then in an ideal scenario (everything happens to be organised exactly as the query needs) you will swap between memory and disk a minimum of roughly 60 times (2 TB / 32 GB ≈ 64 passes). In a practical scenario you are looking at hundreds of swaps, maybe thousands.

An RDBMS or a graph database has much the same limitation. This has nothing to do with the database - it is about the physical limits of the hardware versus the size of the dataset.

You should size your data accordingly ...

I'm running on a desktop computer with 64 GB of RAM and an NVMe drive.

If I'm reading you correctly, you're assuming the database operations will work efficiently if the system has enough memory to accommodate all the data. I doubt that because if you have enough memory to accommodate all the data, why would you need a database solution in the first place?

This is not about running queries, though; it is about importing data into the database.

What I mean is that performance is going to be slower depending on the available resources ... "twice as much memory" means, at a theoretical best, "half the time".

Memory is just a bucket; a database is an engine for making sense of the contents of that memory.

You are building relationships as you import the data; if it were just reading files, you would be constrained by the disks' speed / PCIe bandwidth / memory ...

Your files are also compressed, meaning they are first decompressed and then interpreted.

If you are using the same disk for everything, you are reading (2 TB?) -> decompressing -> writing (+2 TB?) -> analysing -> importing (+2 TB?), and your NVMe's writes are roughly half the speed of its reads. If you are instead using multiple disks over a network, you are then constrained by the network's speed.
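
As a rough back-of-the-envelope sketch of that traffic - the sizes and drive speeds below are placeholder assumptions, not measurements of your setup:

# Tally the bytes pushed through a single drive in the pipeline above and
# estimate the best-case sequential I/O time. All figures are assumptions.
TB = 1024 ** 4

read_compressed    = 2 * TB    # reading the ~2 TB of .tsv.gz input
write_decompressed = 2 * TB    # hypothetical decompressed copy written to the same disk
write_store        = 2 * TB    # hypothetical size of the final Neo4j store files

read_speed  = 5.0e9    # assumed NVMe sequential read, ~5 GB/s
write_speed = 2.5e9    # assumed writes at roughly half the read speed

seconds = (read_compressed / read_speed
           + (write_decompressed + write_store) / write_speed)
print(f"best-case sequential I/O alone: ~{seconds / 3600:.1f} hours")

Even with optimistic numbers, the raw sequential I/O is a small fraction of three weeks; the rest goes to decompression, building the relationships, and the memory-to-disk swapping described earlier.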