Extremely slow import for large graph database using neo4j-admin import

Problem is very similar to what is described here:

I have done a lot of research, and everyone mentions deduplication, but I am fairly sure the ids are unique: they were generated from a dataset with unique ID values per type, and each :ID column in the header is marked with the node type it belongs to:

Sample Node Header:

Sample Relationship Header:

The dataset has around 900M nodes and 5B relationships.

Note that ids are strings.

We can generate globally unique numeric ids for nodes, and re-generate nodes and relationships to use them if that would help.
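
A minimal sketch of what that regeneration could look like, assuming the string ids are unique per node type as described above. `IdAllocator` and `numeric_id` are hypothetical names used only for illustration. The in-memory dictionary is an assumption that is unlikely to scale to 900M nodes unchanged, so treat this as a statement of the idea rather than a finished tool:

```python
from itertools import count


class IdAllocator:
    """Assign one sequential integer to each (id_space, string_id) pair."""

    def __init__(self, start=0):
        self._next = count(start)
        self._ids = {}  # (id_space, string_id) -> numeric id

    def numeric_id(self, id_space, string_id):
        # The same pair always maps to the same number, so node files and
        # relationship files can be rewritten independently of each other.
        key = (id_space, string_id)
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]


if __name__ == "__main__":
    alloc = IdAllocator()
    # The same string id in two different id spaces gets two different numbers.
    print(alloc.numeric_id("Person", "42"))
    print(alloc.numeric_id("Movie", "42"))
```

If numeric ids are used, the importer's `--id-type=INTEGER` option is the setting to look at. Whether it changes import speed for this dataset is untested here.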

Can someone please clearly specify what is meant by deduplication (id/namespace combos are already unique), and how exactly node ids need to be unique?
How can I make neo4j-admin import perform reasonably well?

Also, I have noticed a huge slowdown from 3.3.5 to later versions, so I cannot even attempt the import on Neo4j 4+. I am sure I am doing something wrong again, but I cannot find out what, as the commands have exactly the same options and run on the same machine with the same memory and disk.

I am currently using neo4j-3.3.5 community for the import, but would like to try 4.1.3 if its import speed can be made comparable to 3.3.5.

An EC2 instance with 256GB RAM and an EBS storage volume is used for testing.

As you've obscured the details, I assume this isn't a dataset you could share with me; if you could, I'd take a crack at loading it just for fun. It sounds like it has 10x the nodes and 100x the rels I'm working with currently.

I use headers like you do (but with a tab character instead of a pipe). A few thoughts and notes:

  • The identifiers need to be unique only within the (NodeType) scope you've specified.
  • In my opinion data cleaning is always required. In my experience, every large dataset that humans were involved in creating has every conceivable data issue, plus a few you couldn't have thought up if you tried, including but not limited to: unprintable characters, the delimiter and line feed characters you are using appearing inside data fields, alpha characters in numeric fields, and unique identifiers that are not unique.
  • Moving through 3.x to 3.5 was seamless for me. Moving to 4.0 required more than a few changes to my scripts: some of the configuration parameters were renamed, and some of the config input values changed (e.g. true/false changed to on/off, but only for some parameters).
  • I use options that should enable a load to complete but log issues. Then I examine the importreport.txt.
  • Of course, if there is a bug we're trying to avoid, these options may not help (e.g. the duplicates issue?).
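
As a rough illustration of the kind of pre-import check the data-cleaning note points at, here is a minimal Python sketch that scans one node file for ids repeated within that file and for rows whose field count does not match the header (usually a sign of a stray delimiter or line feed inside a field). The function names are hypothetical, and it assumes the header is the first line of the file and that a single file covers one id space. For hundreds of millions of rows the in-memory counter would likely need replacing with a sort-based approach.

```python
import csv
import io
from collections import Counter


def id_column_index(header):
    """Find the column declared as name:ID or name:ID(Space)."""
    for i, field in enumerate(header):
        _, _, kind = field.partition(":")
        if kind == "ID" or kind.startswith("ID("):
            return i
    raise ValueError("no :ID column in header")


def scan_nodes(handle, delimiter="\t"):
    """Return (duplicate_ids, bad_rows) for a node file with a header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    id_col = id_column_index(header)
    counts = Counter()
    bad_rows = []
    for record_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            # Wrong field count: likely a delimiter or newline inside a field.
            bad_rows.append(
                (record_no, f"expected {len(header)} fields, got {len(row)}")
            )
            continue
        counts[row[id_col]] += 1
    duplicates = {node_id: n for node_id, n in counts.items() if n > 1}
    return duplicates, bad_rows


if __name__ == "__main__":
    sample = io.StringIO(
        "personId:ID(Person)\tname\n"
        "p1\tAnn\n"
        "p2\tBob\n"
        "p1\tAnn again\n"
    )
    print(scan_nodes(sample))
```

Running something like this per node file before the import gives a quick answer to whether the ids really are unique within their scope, independent of what the importer reports.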

neo4j 3.5.20

  --mode=csv \
  --ignore-missing-nodes \
  --delimiter "\t" \
  --report-file=importreport.txt \

neo4j 4.x

  --verbose \
  --skip-bad-relationships=true \
  --skip-duplicate-nodes=true \
  --ignore-empty-strings=true \
  --normalize-types=true \
  --trim-strings=false \
  --delimiter "\t" \
  --high-io=true \
  --report-file=importreport.txt \