Using the ACTUAL data type with neo4j-import

When importing data using neo4j-import, make sure to review the required CSV file structure and considerations before moving on.

http://neo4j.com/docs/stable/import-tool.html[]

ACTUAL vs. String (default) or Integer:

Each node in the CSV must have an :ID, which can be treated as an Integer, a String, or ACTUAL.
The default is String; Integer can be requested explicitly with the `--id-type` option.
In practice, String and Integer perform about equally in both speed and memory use, so the default is fine.

Using Integer or String is a bit more memory intensive than using ACTUAL, so one may be tempted to go for that option.
However, ACTUAL is usually not the best choice unless you have a specific use case for it and can meet the constraints described below.

What is ACTUAL, and why is it special?

ACTUAL refers to the actual node ID, which in Neo4j means the location of that node's record on disk.
When using ACTUAL with neo4j-import, all :IDs must be ordered, and they must be ordered across ALL CSV files being imported during the load.
This is generally difficult to achieve, particularly for an existing data set that is large and complex.
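The ordering requirement can be checked before running the import. The following is a minimal sketch, not part of neo4j-import itself: the function name `check_ids_ordered`, the sample file names, and the assumption that the ID sits in the first column of each file are all hypothetical. It also assumes that "ordered" means strictly increasing, since node IDs must be unique.

```python
import csv
import io

def check_ids_ordered(sources, id_column=0, has_header=True):
    """Check that integer IDs strictly increase across all CSV sources.

    `sources` is an iterable of (name, file-like) pairs, given in the same
    order the files would be passed to the import. Returns (ok, message).
    """
    previous = None
    for name, handle in sources:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row, e.g. "personId:ID,name"
        first_row = 2 if has_header else 1
        for row_no, row in enumerate(reader, start=first_row):
            if not row:
                continue  # ignore blank lines
            current = int(row[id_column])
            if previous is not None and current <= previous:
                return False, f"{name}:{row_no}: id {current} follows {previous}"
            previous = current
    return True, "ids are strictly increasing across all files"

# Hypothetical sample data: the second file breaks the global ordering.
people = io.StringIO("personId:ID,name\n0,Ann\n1,Bob\n")
movies = io.StringIO("movieId:ID,title\n2,Alien\n1,Brazil\n")
print(check_ids_ordered([("people.csv", people), ("movies.csv", movies)]))
```

For real files, pass open file handles instead of the `io.StringIO` objects. The check streams each file once and keeps only the last ID seen, so it does not need to hold the data set in memory.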

Even if you have ordered all of the nodes across all CSV files to be imported, you must also consider whether there are gaps in those IDs.
Because each ID corresponds to a location on disk, every gap leaves a region of the store file that is allocated but never used, which reduces the storage space available for the graph data store.

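Lastly, avoid using large IDs with ACTUAL. Because the ID determines where the record is written, the store files must extend up to the highest ID in use, so a few large IDs can greatly increase the size of your store files even when the data set itself is small.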
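To get a feel for how gaps and large IDs affect store size, here is a rough back-of-the-envelope sketch. It assumes that the node store needs one fixed-size record slot for every ID from 0 up to the highest ID used, and that a node record is 15 bytes; both the record size and the function name `node_store_estimate` are assumptions for illustration, so check the figures against your Neo4j version.

```python
NODE_RECORD_BYTES = 15  # assumed fixed node record size; verify for your version

def node_store_estimate(ids, record_bytes=NODE_RECORD_BYTES):
    """Estimate node store size and wasted space for a set of ACTUAL IDs."""
    unique_ids = set(ids)
    if not unique_ids:
        return {"slots": 0, "unused_slots": 0, "store_bytes": 0, "unused_bytes": 0}
    slots = max(unique_ids) + 1            # one slot per ID in 0..max
    unused = slots - len(unique_ids)       # slots that fall into gaps
    return {
        "slots": slots,
        "unused_slots": unused,
        "store_bytes": slots * record_bytes,
        "unused_bytes": unused * record_bytes,
    }

# Four nodes with a gap between 2 and 10.
print(node_store_estimate([0, 1, 2, 10]))

# A single node whose ID is 2**32.
print(node_store_estimate([2**32]))
```

Under these assumptions, a single node with ID 2**32 would already imply a node store of roughly 64 GB, almost all of it unused, which is why low, consecutive IDs matter.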

Take home message:

If you use low, consecutive, ordered :IDs, ACTUAL should work fine for you. It does, however, require knowledge of the internal storage architecture, and keeping all of the nodes in order across CSV files can be challenging.

[Note]
neo4j-import is intended to populate a new, empty database.
It cannot be used to import into an existing database.

@david_gordon1 I have a graph database that contains 1.7 billion nodes. I'd like to create an index on those nodes, for one particular property, but am having little to no success doing so. I think that using "ACTUAL" IDs could help, assuming that I'm setting the "ACTUAL" Neo4j ID (returned by ID()).

In your post here, you mention not using "large" IDs. What do you mean by "large"? 2**32 = Large?