Importing multiple CSV files is very slow. Would combining all the files into one big file and zipping it make the import faster?
Probably getting a faster SSD/NVMe would help more? Did you look at disk / CPU utilization? How many CPUs do you have?
You can provide your individual CSV files as gz files.
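As a minimal sketch of the gz suggestion, each CSV can be compressed on its own rather than merged into one archive. The helper name `gzip_csv` is hypothetical, and the sketch only shows how to produce the `.gz` files; it makes no claim about how much import time this saves.

```python
import gzip
import shutil
from pathlib import Path


def gzip_csv(src_path):
    """Compress one CSV file to <name>.csv.gz next to the original.

    Hypothetical helper: the original file is left untouched.
    """
    src = Path(src_path)
    dst = src.with_name(src.name + ".gz")
    # Stream the bytes so large files are never loaded fully into memory.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Calling `gzip_csv` once per file in the import directory gives one `.gz` per CSV, which can then be passed to the import command in place of the plain files.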
How many files do you have?
If you have duplicate nodes in there, the import will be slow. Fixing that would help.
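One way to remove duplicate nodes before the import is a small preprocessing pass over the node CSVs. This is only a sketch under assumptions: the function name `dedupe_node_rows` is hypothetical, the node ID is assumed to be in the first column (`id_index=0`), each file is assumed to have a header row, and all IDs are assumed to fit in memory as a Python set.

```python
import csv


def dedupe_node_rows(src_path, dst_path, seen, id_index=0, has_header=True):
    """Copy src_path to dst_path, keeping only the first row for each node ID.

    `seen` is a set shared across calls, so duplicates that are spread over
    several files are dropped as well. Returns the number of rows written.
    """
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        if has_header:
            header = next(reader, None)
            if header is not None:
                writer.writerow(header)
        for row in reader:
            if not row:
                continue  # skip blank lines
            node_id = row[id_index]
            if node_id in seen:
                continue  # duplicate ID: drop this row
            seen.add(node_id)
            writer.writerow(row)
            kept += 1
    return kept
```

Passing the same `seen` set to every call matters when the data is spread over many files, because a node ID may be repeated in a different file from the one where it first appears.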
Available resources:
Total machine memory: 31.10GiB
Free machine memory: 5.687GiB
Max heap memory : 6.914GiB
Max worker threads: 8
Configured max memory: 2.694GiB
High parallel IO: true
Nodes, started 2023-04-04 12:27:43.147+0000
[*Nodes:0B/s 1.421GiB-------------------------------------------------------------------------]56.5M ∆ 590K
Done in 1m 739ms
Prepare node index, started 2023-04-04 12:28:43.895+0000
[*:2.358GiB-----------------------------------------------------------------------------------] 340M ∆ 950K
Done in 1h 30m 48s 78ms
DEDUP, started 2023-04-04 13:59:31.985+0000
[*DEDUP---------------------------------------------------------------------------------------] 0 ∆ 0
Done in 1m 42s 697ms
Relationships, started 2023-04-04 14:01:14.775+0000
[*Relationships:0B/s 2.358GiB-----------------------------------------------------------------]56.8M ∆ 120K
Done in 2h 53m 2s 779ms
Node Degrees, started 2023-04-04 16:54:18.130+0000
[>(2)========================================|*CALCULATE:1.522GiB(2)==========================]55.5M ∆17.7M
Done in 2s 648ms
Relationship --> Relationship 1-11/11, started 2023-04-04 16:54:20.930+0000
[*>-----------------------------------|LINK(3)================|v:338.6MiB/s-------------------]55.5M ∆15.8M
Done in 5s 52ms
RelationshipGroup 1-11/11, started 2023-04-04 16:54:25.984+0000
[>---------|*v:??-----------------------------------------------------------------------------]14.2M ∆14.2M
Done in 331ms
Node --> Relationship, started 2023-04-04 16:54:26.325+0000
[>:431.1MiB/s---|>(2)========================|LINK---|*v:301.5MiB/s---------------------------]19.9M ∆19.9M
Done in 1s 778ms
Relationship <-- Relationship 1-11/11, started 2023-04-04 16:54:28.120+0000
[*>--------------------------------|LINK(3)========================|v:338.6MiB/s--------------]55.5M ∆21.1M
Done in 5s 15ms
Count groups, started 2023-04-04 16:54:36.009+0000
[>|*>------------------------------------------------------------------------------|COUNT:1.10]2.06M ∆2.06M
Done in 131ms
Gather, started 2023-04-04 16:54:36.552+0000
[>-----------------|*CACHE:1.575GiB-----------------------------------------------------------]2.06M ∆2.06M
Done in 333ms
Write, started 2023-04-04 16:54:36.887+0000
[*>:??------------------------------------------------------------------------------------||v:]2.03M ∆2.03M
Done in 251ms
Node --> Group, started 2023-04-04 16:54:37.146+0000
[*>---------------------------------------|FIRST----------------------|v:??-------------------] 389K ∆ 389K
Done in 203ms
Node counts and label index build, started 2023-04-04 16:54:37.917+0000
[*>(4)======================================|LABEL INDEX---------------------------|COUNT:1.41]56.5M ∆43.7M
Done in 2s 470ms
Relationship counts and relationship type index build, started 2023-04-04 16:54:40.446+0000
[>(2)=======================|*RELATIONSHIP TYPE INDEX------------------|COUNT(2)==============]55.4M ∆ 2.1M
Done in 3s 123ms
IMPORT DONE in 4h 27m 48s 188ms.
Imported:
20149254 nodes
55476552 relationships
181771503 properties
Peak memory usage: 2.358GiB
This is the verbose output of the import for a portion (21 GB) of the data that I am dealing with.
The data is spread across multiple CSV files (100-150 of them).
As you can see, the report shows that the import spent most of its time in:
- Prepare node index
- Relationships
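To check which stages dominate without reading the report by eye, the "Done in ..." lines can be summed per stage. The following is a rough sketch that assumes the log layout shown above (a "<stage>, started <timestamp>" line followed later by a "Done in ..." line); the function names are hypothetical.

```python
import re

_STAGE = re.compile(r"^(?P<name>.+), started \d{4}-\d{2}-\d{2}")
_DONE = re.compile(r"^Done in (?P<dur>.+)$")
# "ms" must come before "m" so that "739ms" is not read as minutes.
_PART = re.compile(r"(\d+)\s*(ms|h|m|s)")
_MS = {"h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}


def parse_duration_ms(text):
    """Convert a duration such as '1h 30m 48s 78ms' to milliseconds."""
    return sum(int(n) * _MS[unit] for n, unit in _PART.findall(text))


def stage_durations(log_text):
    """Return (stage name, milliseconds) pairs, slowest stage first."""
    stages, current = [], None
    for line in log_text.splitlines():
        line = line.strip()
        m = _STAGE.match(line)
        if m:
            current = m.group("name")
            continue
        m = _DONE.match(line)
        if m and current is not None:
            stages.append((current, parse_duration_ms(m.group("dur"))))
            current = None
    return sorted(stages, key=lambda s: s[1], reverse=True)
```

For the report above, the two stages listed account for roughly 4h 24m of the 4h 28m total.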
I also observed that during the 'Prepare node index' stage the import slows down dramatically for a significant amount of time. Before the slowdown the process uses all 8 threads, but during it only 1 or 2 threads are in use.
I feel that there is some kind of bottleneck in the process, but I am not sure what it is.
Right now there are duplicates in the CSVs, but are we sure that this drop in speed is caused by the duplicates?
I also wondered whether the import tool concatenates the CSVs internally or reads them file by file, in the order we specify them in the command. If it concatenates them behind the scenes, RAM might become a bottleneck.
I am using neo4j 5.6.
@michael.hunger Thanks for suggesting the removal of duplicates: the import went from 5 hours down to 9 minutes.
But I would like to know how the import tool works internally, and why duplicates are such a big hurdle during import.
I asked the team. Yes, sorry for the experience :(
Yes, please share the details.
Thanks in advance.