When importing data using neo4j-import, review the required CSV file structure and considerations before moving on.
ACTUAL vs. String (default) or Integer:
Each node in the CSV must have an :ID, which can be in the format String, Integer, or ACTUAL.
By default it is String, and one can specify Integer explicitly in the header.
In practice, String and Integer are about equal in both performance and memory use, so using the default is fine.
Using Integer or String is a bit more memory intensive than using ACTUAL, so one may be tempted to go for that option.
However, ACTUAL is usually not the best option unless you have a specific use case for it.
What is ACTUAL, and why is it special?
ACTUAL refers to the actual node ID, which in Neo4j means the location of that record on disk.
When using ACTUAL with neo4j-import, all :IDs must be ordered, and they must be ordered across ALL CSV files being imported during the load.
This is generally difficult to achieve, particularly for an existing data set that is large and complex.
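The ordering requirement above can be checked before running the import. The following is a minimal sketch, not part of neo4j-import: it assumes that "ordered" means strictly ascending integer IDs, that the files are passed in the same order they will be given to the import, and that each node file starts with a header row containing a field such as `:ID` or `personId:ID`. The function names and the sample file names are hypothetical.

```python
import csv
import io


def id_column(header):
    """Return the index of the :ID column in a node-file header row."""
    for index, field in enumerate(header):
        # Matches ":ID", "personId:ID", "personId:ID(Person)", but not ":START_ID".
        if ":" in field and field.split(":", 1)[1].startswith("ID"):
            return index
    raise ValueError("no :ID column found in header")


def first_out_of_order(files):
    """Scan (name, file object) pairs in import order.

    Returns (file name, row number, id) for the first ID that is not
    greater than the ID before it, or None if every ID is ascending.
    Assumption: IDs are integers and must be strictly ascending.
    """
    previous = None
    for name, handle in files:
        reader = csv.reader(handle)
        column = id_column(next(reader))
        # The header is row 1, so data rows start at row 2.
        for row_number, row in enumerate(reader, start=2):
            current = int(row[column])
            if previous is not None and current <= previous:
                return (name, row_number, current)
            previous = current
    return None


# Hypothetical sample data held in memory instead of real files.
persons = io.StringIO("personId:ID,name\n0,Ann\n1,Bob\n")
movies = io.StringIO("movieId:ID,title\n3,Up\n2,Heat\n")
print(first_out_of_order([("persons.csv", persons), ("movies.csv", movies)]))
# → ('movies.csv', 3, 2)
```

With real files, each entry would be an opened file such as `open(path, newline="")` rather than an in-memory string.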
Assuming you have ordered all of the nodes across all CSV files to be imported, you must also consider whether there are gaps in those IDs.
Any gap leaves an area on disk that is reserved but never used for storage, which can reduce the amount of storage space available for the graph data store.
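Gaps can be found with a single pass over the ordered IDs. This is a small sketch under two assumptions that do not come from the tool itself: the IDs are integers already in ascending order, and the first expected ID is 0. The function names are hypothetical.

```python
def find_gaps(ids):
    """Return (first_missing, last_missing) ranges for ascending integer IDs.

    Assumption: the lowest ID that should exist is 0.
    """
    gaps = []
    expected = 0
    for current in ids:
        if current > expected:
            gaps.append((expected, current - 1))
        expected = current + 1
    return gaps


def unused_slots(gaps):
    """Count how many IDs fall inside the gaps and are therefore never used."""
    return sum(last - first + 1 for first, last in gaps)


gaps = find_gaps([0, 1, 4, 5, 9])
print(gaps)                # → [(2, 3), (6, 8)]
print(unused_slots(gaps))  # → 5
```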
Lastly, avoid using large IDs with ACTUAL: because the ID is the record's location on disk, the store files must be large enough to reach the highest ID, which greatly increases their size.
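A rough calculation shows why large IDs are costly. The sketch below assumes that the node store reserves one fixed-size record for every ID from 0 up to the highest ID used, and it takes the record size as a parameter. The default of 15 bytes is an assumption for illustration, so check the record size for your Neo4j version and record format. The function name is hypothetical.

```python
def node_store_bytes(highest_id, record_size=15):
    """Estimate node store size when IDs are record positions.

    Assumptions: one fixed-size record per ID from 0 to highest_id,
    and a record size of 15 bytes unless another value is given.
    """
    return (highest_id + 1) * record_size


# One million nodes with IDs 0..999,999.
print(node_store_bytes(999_999))        # → 15000000
# The same one million nodes with IDs starting at 1,000,000,000.
print(node_store_bytes(1_000_999_999))  # → 15015000000
```

Under these assumptions, the same one million nodes need roughly a thousand times more node store space when their IDs start at one billion instead of zero.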
Take home message:
If you use low, consecutive, ordered :IDs, ACTUAL should work fine for you, but it does require knowledge of the internal storage architecture, and it can be challenging to keep all of the nodes in order across CSV files.
neo4j-import is intended to populate a new, empty database.
It cannot be used to import into an existing database.