Extremely slow import for large graph database using neo4j-admin import

Problem is very similar to what is described here:

I have done a lot of research, and everyone mentions deduplication, but I am pretty sure the ids are unique: they were generated from a dataset with unique ID values per type, and each :ID in the header is marked with the node type it belongs to:

Sample Node Header:
idProp:ID(NodeType)|valueVector:double

Sample Relationship Header:
:START_ID(NodeType)|:END_ID(NodeType1)|relationshipScore:double

The dataset has around 900M nodes and 5B relationships.

Note that ids are strings.

We can generate globally unique numeric ids for nodes, and re-generate nodes and relationships to use them if that would help.
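
If remapping to numeric ids turns out to be necessary, a minimal sketch of the idea could look like the following. It assumes the ids of each ID group fit in memory as plain lists, which would probably not hold at 900M nodes without an on-disk mapping. `build_id_map` and `remap_relationship` are hypothetical helper names, not part of any Neo4j tooling.

```python
def build_id_map(ids_by_group):
    """Assign one globally unique integer to every (ID group, string id) pair."""
    id_map = {}
    for group, string_ids in ids_by_group.items():
        for string_id in string_ids:
            # setdefault keeps the first number if an id repeats within a group
            id_map.setdefault((group, string_id), len(id_map))
    return id_map


def remap_relationship(row, start_group, end_group, id_map):
    """Replace the string start/end ids of one relationship row with numeric ids."""
    start_id, end_id, *properties = row
    return [id_map[(start_group, start_id)], id_map[(end_group, end_id)], *properties]


# The same string id may appear in two groups and still gets two distinct numbers.
ids_by_group = {"NodeType": ["a1", "a2"], "NodeType1": ["a1", "b7"]}
id_map = build_id_map(ids_by_group)
print(remap_relationship(["a1", "b7", "0.42"], "NodeType", "NodeType1", id_map))
```

The relationship row layout (start id, end id, then properties) follows the sample relationship header above.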

Can someone please clearly specify what is meant by deduplication (id/namespace combos are already unique), and how exactly node ids need to be unique?
And how can neo4j-admin import be made to perform reasonably well?

Also, I have noticed a huge slowdown going from 3.3.5 to later versions, so I cannot even try the import on neo4j 4+. I am sure I am doing something wrong again, but cannot find out what, as the commands have the exact same options and run on the same machine with the same memory and disk.

I am currently using neo4j-3.3.5 community for the import, but would like to try 4.1.3 if it is possible to make the import speed comparable to 3.3.5.

An EC2 instance with 256GB RAM and an EBS storage volume is used for testing.

As you've obscured the details I assume this isn't a dataset you could share with me; if you could, I'd take a crack at loading it just for fun. It sounds like it has 10x the nodes and 100x the rels I'm working with currently.

I use headers like you do (but with a tab character instead of a pipe). A few thoughts and notes:

  • the identifiers need to be unique only within the (NodeType) scope you've specified
  • In my opinion data cleaning is always required. In my experience every large dataset that humans were involved in creating will have every conceivable data issue, plus a few you couldn't have thought up if you tried. These include, but are not limited to: unprintable characters; the delimiter and line feed characters you are using turning up inside data fields; alpha characters in numeric fields; and unique identifiers that are not unique.
  • moving through 3.x to 3.5 was seamless for me, but moving to 4.0 required more than a few changes to my scripts: some of the configuration parameters have been renamed, and some of the config input values changed (e.g. true/false changed to on/off, but just for some parameters)
  • I use options that should enable a load to complete, but log issues. Then I examine the importreport.txt
  • of course if there is a bug we're trying to avoid these options may not help (e.g. duplicates issue?)
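
To check the "unique identifiers are not unique" case before importing, something like the following sketch could be run per node file. It assumes a header line in the same style as the sample above, with an optional ID group in parentheses, and it counts ids in memory. For files with hundreds of millions of rows an external sort would likely be needed instead. `find_id_column` and `duplicate_ids` are hypothetical helper names.

```python
import csv
import re
from collections import Counter

# Matches an ID column such as "idProp:ID(NodeType)" or ":ID"; the group is optional.
ID_COLUMN = re.compile(r"^(?P<name>[^:]*):ID(?:\((?P<group>[^)]*)\))?$")


def find_id_column(header_fields):
    """Return (column index, ID group name) for the :ID column of a node header."""
    for index, field in enumerate(header_fields):
        match = ID_COLUMN.match(field)
        if match:
            return index, match.group("group") or ""
    raise ValueError("no :ID column found in header")


def duplicate_ids(lines, delimiter="|"):
    """Return {id: count} for ids that occur more than once in one node file.

    `lines` is any iterable of text lines whose first line is the header.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    id_index, _group = find_id_column(next(reader))
    counts = Counter(row[id_index] for row in reader if row)
    return {node_id: n for node_id, n in counts.items() if n > 1}


sample = [
    "idProp:ID(NodeType)|valueVector:double",
    "a1|0.5",
    "a2|0.7",
    "a1|0.9",
]
print(duplicate_ids(sample))
```

Since ids only need to be unique within one ID group, files that share a group would have to be checked together, while files in different groups can be checked separately.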

neo4j 3.5.20

  --mode=csv \
  --ignore-missing-nodes \
  --delimiter "\t" \
  --report-file=importreport.txt \

neo4j 4.x

  --verbose \
  --skip-bad-relationships=true \
  --skip-duplicate-nodes=true \
  --ignore-empty-strings=true \
  --normalize-types=true \
  --trim-strings=false \
  --delimiter "\t" \
  --high-io=true \
  --report-file=importreport.txt \

Thank you for confirming that removing duplicates will work.
I went back and re-checked all of the data files, and there were indeed a few duplicates within an ID space. I made the necessary changes and re-tested the import with neo4j 3.3.5, and it worked fine. Here is the complete import of a fairly large dataset:

Available resources:
  Total machine memory: 249.70 GB
  Free machine memory: 249.18 GB
  Max heap memory : 21.33 GB
  Processors: 32
  Configured max memory: 205.06 GB

Nodes, started 2020-11-04 18:34:57.645+0000
[*>:39.92 MB/s------------------------------------------------------------|N|PROPER||v:101.84 ] 788M ∆92.8K
Done in 15m 5s 591ms
Prepare node index, started 2020-11-04 18:50:03.367+0000
[*DETECT:8.82 GB------------------------------------------------------------------------------] 788M ∆15.9M310000
Done in 2m 38s 934ms
Relationships, started 2020-11-04 18:52:42.313+0000
[*>:29.69 MB/s--------------------------|TYP|PREPARE(2)=============|REC|PROPERTIES--|v:52.11 ]4.22B ∆1.57M
Done in 1h 10m 19s 842ms
Node Degrees, started 2020-11-04 20:03:06.447+0000
[*>(7)================================================|CALCULATE(30)==========================]4.22B ∆11.7M
Done in 7m 35s 358ms
Relationship --> Relationship  1-77/77, started 2020-11-04 20:10:43.567+0000
[*>----------------------------------------|LINK(30)============================|v:183.63 MB/s]4.22B ∆6.27M
Done in 12m 26s 4ms
RelationshipGroup 1-77/77, started 2020-11-04 20:23:09.593+0000
[*>:103.31 MB/s---------------------------------------------|v:103.31 MB/s--------------------] 134M ∆5.53M
Done in 31s 168ms
Node --> Relationship, started 2020-11-04 20:23:40.773+0000
[>:51.41 M|*>-----------------------------------------------------------|L|v:95.30 MB/s-------] 779M ∆ 251K
Done in 1m 57s 113ms
Relationship --> Relationship 1-77/77, started 2020-11-04 20:25:39.041+0000
[*>-------------------------------------|LINK(7)========================|v:169.73 MB/s(2)=====]4.22B ∆8.76M
Done in 13m 26s 166ms
Count groups, started 2020-11-04 20:39:06.364+0000
[>|*>(11)===========================================|COUNT------------------------------------] 134M ∆63.1M
Done in 5s 682ms
Gather, started 2020-11-04 20:39:23.941+0000
[>----------------------|*CACHE---------------------------------------------------------------] 134M ∆1.79M
Done in 1m 4s 543ms
Write, started 2020-11-04 20:40:28.506+0000
[*>:-95703614.00 B/s-----------------------------|ENCODE----|v:227.49 MB/s(7)=================] 132M ∆11.1M
Done in 14s 853ms
Node --> Group, started 2020-11-04 20:40:43.709+0000
[*>---------------------------------------------|FIRST---|v:7.10 MB/s(2)======================]8.93M ∆ 289K
Done in 18s 898ms
Node counts, started 2020-11-04 20:41:02.980+0000
[>(2)=========================================|*COUNT:5.89 GB---------------------------------] 788M ∆28.3M
Done in 36s 241ms
Relationship counts, started 2020-11-04 20:41:39.242+0000
[*>(13)=========================================|COUNT(24)====================================]4.22B ∆2.05M
Done in 2m 26s 831ms

IMPORT DONE in 2h 9m 8s 957ms. 
Imported:
  788406015 nodes
  4224169706 relationships
  7282106075 properties
Peak memory usage: 12.59 GB

neo4j 4.1.3 is taking much longer, and I am trying to find out what the issue is.

Just to update: I managed to find the issue with the import on neo4j 4.1.3. It turns out that the "--multiline-fields=true" parameter was causing the problem. With this parameter, the number of nodes and relationships the import reported was inflated by a factor of 100. It was a carryover from the 3.3.5 import, which had needed it because of some earlier errors, and I thought that since I don't have any multiline fields it wouldn't cause trouble.
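
Before dropping the parameter, it may be worth confirming that the files really contain no fields spanning several lines. The sketch below uses Python's `csv` module and assumes standard double-quote quoting, which may differ in detail from how neo4j-admin import parses the files. `multiline_records` is a hypothetical helper name.

```python
import csv


def multiline_records(lines, delimiter="|", quotechar='"'):
    """Yield (first physical line number, row) for records spanning several lines."""
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar)
    previous_line = 0
    for row in reader:
        # line_num counts physical lines consumed so far; a jump of more than
        # one means this record contained a quoted line break.
        if reader.line_num - previous_line > 1:
            yield previous_line + 1, row
        previous_line = reader.line_num


sample = [
    'idProp:ID(NodeType)|note\n',
    'a1|plain value\n',
    'a2|"value with\n',
    'a line break"\n',
    'a3|another plain value\n',
]
for line_number, row in multiline_records(sample):
    print(line_number, row)
```

If this reports nothing for any input file, no record spans more than one line under the stated quoting assumption.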

So to recap: this command did not work on 4.1.3. The import would have taken days, and would most likely have produced corrupt data:

neo4j-admin import --database=graph.db --id-type=STRING --delimiter='|' --skip-duplicate-nodes=true --skip-bad-relationships=true --ignore-extra-columns=true --multiline-fields=true

This command worked fine (--multiline-fields parameter removed):

neo4j-admin import --database=graph.db --id-type=STRING --delimiter='|' --skip-duplicate-nodes=true --skip-bad-relationships=true --ignore-extra-columns=true

The import on 4.1.3 then finished even faster than on 3.3.5 (about 1h 56m versus 2h 9m):

Import starting 2020-11-04 23:53:31.526+0000
  Estimated number of nodes: 793.46 M
  Estimated number of node properties: 3.13 G
  Estimated number of relationships: 4.91 G
  Estimated number of relationship properties: 5.50 G
  Estimated disk space usage: 350.8GiB
  Estimated required memory usage: 10.75GiB

(1/4) Node import 2020-11-04 23:53:31.570+0000
  Estimated number of nodes: 793.46 M
  Estimated disk space usage: 89.87GiB
  Estimated required memory usage: 10.75GiB
.......... .......... .......... .......... ..........   5% ∆53s 118ms
.......... .......... .......... .......... ..........  10% ∆53s 860ms
.......... .......... .......... .......... ..........  15% ∆2m 5s 905ms
.......... .......... .......... .......... ..........  20% ∆43s 248ms
.......... .......... .......... .......... ..........  25% ∆42s 644ms
.......... .......... .......... .......... ..........  30% ∆55s 471ms
.......... .......... .......... -......... ..........  35% ∆14s 14ms
.......... .......... .......... .......... ..........  40% ∆0ms
.......... .......... .......... .......... ..........  45% ∆0ms
.......... .......... .......... .......... ..........  50% ∆28s 622ms
.......... .......... .......... .......... ..........  55% ∆30s 35ms
.......... .......... .......... .......... ..........  60% ∆31s 235ms
.......... .......... .......... .......... ..........  65% ∆30s 428ms
.......... .......... .......... .......... ..........  70% ∆28s 27ms
.......... .......... .......... .......... ..........  75% ∆15s 17ms
.......... .......... .......... .......... ..........  80% ∆4s 5ms
.......... .......... .......... .......... ..........  85% ∆4s 7ms
.......... .......... .......... .......... ..........  90% ∆6s 205ms
.......... .......... .......... .......... ..........  95% ∆6s 2ms
.......... .......... .......... .......... .......... 100% ∆5s 205ms

(2/4) Relationship import 2020-11-05 00:04:12.242+0000
  Estimated number of relationships: 4.91 G
  Estimated disk space usage: 260.9GiB
  Estimated required memory usage: 9.812GiB
.......... .......... .......... .......... ..........   5% ∆3m 57s 787ms
.......... .......... .......... .......... ..........  10% ∆3m 53s 418ms
.......... .......... .......... .......... ..........  15% ∆2m 57s 248ms
.......... .......... .......... .......... ..........  20% ∆1m 53s 961ms
.......... .......... .......... .......... ..........  25% ∆2m 10s 861ms
.......... .......... .......... .......... ..........  30% ∆2m 674ms
.......... .......... .......... .......... ..........  35% ∆2m 13s 340ms
.......... .......... .......... .......... ..........  40% ∆2m 21s 563ms
.......... .......... .......... .......... ..........  45% ∆2m 26s 723ms
.......... .......... .......... .......... ..........  50% ∆2m 31s 42ms
.......... .......... .......... .......... ..........  55% ∆2m 10s 966ms
.......... .......... .......... .......... ..........  60% ∆1m 53s 912ms
.......... .......... .......... .......... ..........  65% ∆2m 3s 149ms
.......... .......... .......... .......... ..........  70% ∆2m 9s 937ms
.......... .......... .......... .......... ..........  75% ∆2m 28s 932ms
.......... .......... .......... .......... ..........  80% ∆2m 4s 118ms
.......... .......... .......... .......... ..........  85% ∆2m 9s 383ms
.......... .......... .......... .......... ..........  90% ∆27s 731ms
.......... .......... .......... .......... ..........  95% ∆0ms
.......... .......... .......... .......... .......... 100% ∆0ms

(3/4) Relationship linking 2020-11-05 00:46:06.988+0000
  Estimated required memory usage: 9.073GiB
.......... .......... .......... .......... ..........   5% ∆3m 14s 487ms
.......... .......... .......... .......... ..........  10% ∆3m 19s 480ms
.......... .......... .......... .......... ..........  15% ∆3m 19s 311ms
.......... .......... .......... .......... .........-  20% ∆401ms
.......... .......... .......... .......... ..........  25% ∆2m 14s 464ms
.......... .......... .......... .......... ..........  30% ∆2m 19s 85ms
.......... .......... .......... .......... ..........  35% ∆2m 16s 479ms
.......... .......... .......... .......... ..........  40% ∆2m 16s 476ms
.......... .......... .......... .......... ..........  45% ∆2m 17s 509ms
.......... .......... .......... .......... ..........  50% ∆2m 31s 485ms
.......... .......... .......... .......... ..........  55% ∆2m 23s 664ms
.......... .......... .......... .......... .........-  60% ∆802ms
.......... .......... .......... .......... ..........  65% ∆2m 51s 934ms
.......... .......... .......... .......... ..........  70% ∆2m 30s 482ms
.......... .......... .......... .......... ..........  75% ∆2m 35s 881ms
.......... .......... .......... .......... ..........  80% ∆2m 19s 678ms
.......... .......... .......... .......... ..........  85% ∆2m 10s 263ms
.......... .......... .......... .......... ..........  90% ∆2m 32s 929ms
.......... .......... .......... .......... ..........  95% ∆2m 31s 482ms
.......... .......... .......... .......... .......... 100% ∆2m 37s 842ms

(4/4) Post processing 2020-11-05 01:43:08.060+0000
  Estimated required memory usage: 1020MiB
.......... .......... ....-..... .......... ........-.   5% ∆24s 18ms
-......... .......... .......... .......... ..........  10% ∆19s 9ms
.......... .......... .......... .......... ..........  15% ∆21s 11ms
.......... .......... .......... .......... .....-....  20% ∆3s 242ms
.......... .......... .......... .......... ..........  25% ∆14s 411ms
.......... .......... .......... .......... ..........  30% ∆15s 21ms
.......... .......... .......... .......... ..........  35% ∆10s 817ms
.......... .......... .......... .......... ..........  40% ∆12s 15ms
.......... .......... .......... .......... ..........  45% ∆10s 226ms
.......... .......... .......... .......... ..........  50% ∆11s 424ms
.......... .......... .......... .......... ..........  55% ∆10s 828ms
.......... .......... .......... .......... ..........  60% ∆12s 11ms
.......... .......... .......... .......... ..........  65% ∆13s 20ms
.......... .......... .......... .......... ..........  70% ∆10s 811ms
.......... .......... .......... .......... ..........  75% ∆11s 8ms
.......... .......... .......... .......... ..........  80% ∆11s 23ms
.......... .......... .......... .......... ..........  85% ∆11s 805ms
.......... .......... .......... .......... ..........  90% ∆10s 809ms
.......... .......... .......... .......... ..........  95% ∆10s 807ms
.......... .......... .......... .......... .......... 100% ∆4s 977ms


IMPORT DONE in 1h 56m 6s 71ms. 
Imported:
  788406015 nodes
  4221522494 relationships
  7282779583 properties
Peak memory usage: 13.12GiB