Problem is very similar to what is described here:
I have done a lot of research, and everyone mentions deduplication, but I am fairly sure the ids are unique: they were generated from a dataset with unique ID values per type, and each :ID column in the header is marked with the node type it belongs to:
The dataset has around 900M nodes and 5B relationships.
Note that ids are strings.
We can generate globally unique numeric ids for nodes, and re-generate nodes and relationships to use them if that would help.
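For illustration, here is a minimal sketch of what "globally unique numeric ids" could look like. The class name `NumericIdAllocator` and the ID space names are made up for the example, and an in-memory dict like this would not fit 900M string ids comfortably, so treat it as a description of the mapping rather than something to run at full scale.

```python
from typing import Dict, Tuple


class NumericIdAllocator:
    """Assigns one dense integer per (id_space, original_id) pair.

    Hypothetical helper: names are illustrative, not part of any Neo4j tool.
    """

    def __init__(self) -> None:
        self._assigned: Dict[Tuple[str, str], int] = {}

    def numeric_id(self, id_space: str, original_id: str) -> int:
        key = (id_space, original_id)
        # First sighting gets the next free integer; repeats reuse it.
        if key not in self._assigned:
            self._assigned[key] = len(self._assigned)
        return self._assigned[key]


allocator = NumericIdAllocator()
person = allocator.numeric_id("Person", "abc")
movie = allocator.numeric_id("Movie", "abc")
# The same string id in two ID spaces maps to two different integers.
assert person != movie
# Asking again for the same pair returns the same integer.
assert allocator.numeric_id("Person", "abc") == person
```

The same mapping would have to be applied to the :START_ID and :END_ID columns of the relationship files so they keep pointing at the right nodes.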
Can someone please clearly specify what is meant by deduplication (id/namespace combos are already unique), and how exactly node ids need to be unique?
How can neo4j-admin import be made to perform reasonably well?
Also, I have noticed a huge slowdown from 3.3.5 to later versions, so I cannot even try the import on neo4j 4+. I am sure I am doing something wrong again, but I cannot find out what, as the commands have exactly the same options and run on the same machine with the same memory and disk.
I am currently using neo4j-3.3.5 community for the import, but would like to try 4.1.3 if its import speed can be made comparable to 3.3.5.
An EC2 instance with 256GB RAM and an EBS storage volume is used for testing.
As you've obscured the details, I assume this isn't a dataset you could share with me; if you could, I'd take a crack at loading it just for fun. It sounds like it has 10x the nodes and 100x the rels I'm working with currently.
I use headers like you do (but with a tab character instead of a pipe). A few thoughts and notes:
The identifiers need to be unique only within the (NodeType) scope you've specified.
In my opinion data cleaning is always required. In my experience, every large dataset that humans were involved in creating will have every conceivable data issue, plus a few you couldn't have thought up if you tried. These include, but are not limited to: unprintable characters; the delimiter and line feed characters you are using appearing inside data fields; alpha characters in numeric fields; and unique identifiers that are not unique.
Moving through 3.x to 3.5 was seamless for me. Moving to 4.0 required more than a few changes to my scripts: some configuration parameters were renamed, and some config input values changed (e.g. true/false changed to on/off, but only for some parameters).
I use options that should enable a load to complete while logging issues, then I examine the importreport.txt.
Of course, if there is a bug we're trying to avoid, these options may not help (e.g. the duplicates issue?).
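To make the "unique within the (NodeType) scope" point concrete, here is a rough sketch of a pre-import duplicate check. It assumes each node file carries its header on the first line, is pipe-delimited, and has an ID column written like `id:ID(Person)`; the function name `find_duplicate_ids` is made up for this example. It counts in memory, so for hundreds of millions of rows you would more likely sort the id column on disk instead.

```python
import csv
import io
import re
from collections import Counter
from typing import Dict, Iterable, TextIO, Tuple

# Matches header columns such as "id:ID(Person)" or ":ID" (no ID space).
ID_COLUMN = re.compile(r"^[^:]*:ID(?:\((?P<space>[^)]*)\))?$")


def find_duplicate_ids(
    files: Iterable[TextIO], delimiter: str = "|"
) -> Dict[Tuple[str, str], int]:
    """Return {(id_space, id): count} for ids seen more than once."""
    counts: Counter = Counter()
    for handle in files:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        for index, column in enumerate(header):
            match = ID_COLUMN.match(column)
            if match:
                id_index = index
                id_space = match.group("space") or ""
                break
        else:
            raise ValueError("no :ID column found in header")
        for row in reader:
            if not row:  # skip blank lines
                continue
            counts[(id_space, row[id_index])] += 1
    return {key: n for key, n in counts.items() if n > 1}


people = io.StringIO("id:ID(Person)|name\n1|Ann\n2|Bob\n1|Ann again\n")
movies = io.StringIO("id:ID(Movie)|title\n1|Alien\n")
# Id "1" repeats inside Person; the Movie with id "1" is not a clash.
print(find_duplicate_ids([people, movies]))
```

Note that the key is the (ID space, id) pair, so the same id string in two different spaces is not reported.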
Thank you for confirming that removing duplicates will work.
I went back and re-checked all of the data files, and there were indeed a few duplicates within an ID space. I made the necessary changes and re-tested the import with neo4j 3.3.5, and it worked fine. Here is the complete import of a fairly large dataset:
Available resources:
Total machine memory: 249.70 GB
Free machine memory: 249.18 GB
Max heap memory : 21.33 GB
Processors: 32
Configured max memory: 205.06 GB
Nodes, started 2020-11-04 18:34:57.645+0000
[*>:39.92 MB/s------------------------------------------------------------|N|PROPER||v:101.84 ] 788M ∆92.8K
Done in 15m 5s 591ms
Prepare node index, started 2020-11-04 18:50:03.367+0000
[*DETECT:8.82 GB------------------------------------------------------------------------------] 788M ∆15.9M310000
Done in 2m 38s 934ms
Relationships, started 2020-11-04 18:52:42.313+0000
[*>:29.69 MB/s--------------------------|TYP|PREPARE(2)=============|REC|PROPERTIES--|v:52.11 ]4.22B ∆1.57M
Done in 1h 10m 19s 842ms
Node Degrees, started 2020-11-04 20:03:06.447+0000
[*>(7)================================================|CALCULATE(30)==========================]4.22B ∆11.7M
Done in 7m 35s 358ms
Relationship --> Relationship 1-77/77, started 2020-11-04 20:10:43.567+0000
[*>----------------------------------------|LINK(30)============================|v:183.63 MB/s]4.22B ∆6.27M
Done in 12m 26s 4ms
RelationshipGroup 1-77/77, started 2020-11-04 20:23:09.593+0000
[*>:103.31 MB/s---------------------------------------------|v:103.31 MB/s--------------------] 134M ∆5.53M
Done in 31s 168ms
Node --> Relationship, started 2020-11-04 20:23:40.773+0000
[>:51.41 M|*>-----------------------------------------------------------|L|v:95.30 MB/s-------] 779M ∆ 251K
Done in 1m 57s 113ms
Relationship --> Relationship 1-77/77, started 2020-11-04 20:25:39.041+0000
[*>-------------------------------------|LINK(7)========================|v:169.73 MB/s(2)=====]4.22B ∆8.76M
Done in 13m 26s 166ms
Count groups, started 2020-11-04 20:39:06.364+0000
[>|*>(11)===========================================|COUNT------------------------------------] 134M ∆63.1M
Done in 5s 682ms
Gather, started 2020-11-04 20:39:23.941+0000
[>----------------------|*CACHE---------------------------------------------------------------] 134M ∆1.79M
Done in 1m 4s 543ms
Write, started 2020-11-04 20:40:28.506+0000
[*>:-95703614.00 B/s-----------------------------|ENCODE----|v:227.49 MB/s(7)=================] 132M ∆11.1M
Done in 14s 853ms
Node --> Group, started 2020-11-04 20:40:43.709+0000
[*>---------------------------------------------|FIRST---|v:7.10 MB/s(2)======================]8.93M ∆ 289K
Done in 18s 898ms
Node counts, started 2020-11-04 20:41:02.980+0000
[>(2)=========================================|*COUNT:5.89 GB---------------------------------] 788M ∆28.3M
Done in 36s 241ms
Relationship counts, started 2020-11-04 20:41:39.242+0000
[*>(13)=========================================|COUNT(24)====================================]4.22B ∆2.05M
Done in 2m 26s 831ms
IMPORT DONE in 2h 9m 8s 957ms.
Imported:
788406015 nodes
4224169706 relationships
7282106075 properties
Peak memory usage: 12.59 GB
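When comparing a run like this against a slower one on another version, it can help to pull the per-stage durations out of the log. Below is a small sketch that does this for the "<stage>, started ..." and "Done in ..." lines as they appear in the output above; the function names are invented for the example, and the log layout may differ in other versions.

```python
import re
from typing import Iterable, List, Tuple

STARTED = re.compile(r"^(?P<stage>.+), started \d{4}-\d{2}-\d{2} ")
DONE = re.compile(r"^Done in (?P<duration>.+)$")
# "ms" must come before "m" and "s" so that "591ms" is read as milliseconds.
PART = re.compile(r"(\d+)(ms|h|m|s)")
UNIT_MS = {"h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}


def duration_ms(text: str) -> int:
    """Convert e.g. '1h 10m 19s 842ms' to milliseconds."""
    return sum(int(value) * UNIT_MS[unit] for value, unit in PART.findall(text))


def stage_durations(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Pair each '<stage>, started ...' line with the next 'Done in ...' line."""
    stages: List[Tuple[str, int]] = []
    current = None
    for line in lines:
        line = line.strip()
        started = STARTED.match(line)
        if started:
            current = started.group("stage")
            continue
        done = DONE.match(line)
        if done and current is not None:
            stages.append((current, duration_ms(done.group("duration"))))
            current = None
    return stages


sample = [
    "Nodes, started 2020-11-04 18:34:57.645+0000",
    "Done in 15m 5s 591ms",
    "Prepare node index, started 2020-11-04 18:50:03.367+0000",
    "Done in 2m 38s 934ms",
]
for stage, ms in stage_durations(sample):
    print(f"{stage}: {ms / 1000:.1f}s")
```

In the 3.3.5 run above, the Relationships stage (1h 10m of the 2h 9m total) dominates, so that is the first stage to compare between versions.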
neo4j 4.1.3 is taking much longer, and I am trying to find out what the issue is.
Just to update: I managed to find the issue with the import on neo4j 4.1.3. It turns out the "--multiline-fields=true" parameter was causing the problem: with it, the number of nodes and relationships the import reported was inflated by a factor of 100. It was a carryover from the 3.3.5 import, which needed it because of some earlier errors, and I assumed that since I don't have any multiline fields it wouldn't cause trouble.
So to recap: this command did not work on 4.1.3. The import would have taken days and would most likely have produced corrupt data:
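One way to confirm that a file really has no multiline fields before dropping the flag is to compare the number of physical lines with the number of parsed CSV records. This is only a sketch: it uses Python's csv module with a pipe delimiter and double-quote quoting as assumptions, and its quoting rules may not match neo4j-admin's parser exactly. The function name `count_lines_and_records` is made up.

```python
import csv
import io
from typing import TextIO, Tuple


def count_lines_and_records(
    handle: TextIO, delimiter: str = "|", quotechar: str = '"'
) -> Tuple[int, int]:
    """Return (physical_lines, csv_records) for one file.

    If the two numbers differ, at least one quoted field spans a line break
    (or a stray quote makes the parser think it does).
    """
    reader = csv.reader(handle, delimiter=delimiter, quotechar=quotechar)
    records = 0
    for _ in reader:
        records += 1
    # line_num is the number of source lines the reader consumed.
    return reader.line_num, records


plain = io.StringIO("id|text\n1|a\n2|b\n")
multiline = io.StringIO('id|text\n1|"one\nline"\n2|plain\n')
print(count_lines_and_records(plain))      # equal counts: no multiline fields
print(count_lines_and_records(multiline))  # more lines than records
```

When reading real files, open them with `newline=""` as the csv module documentation recommends, so embedded line breaks are passed through untouched.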