Using the ACTUAL data type with neo4j-import

david_gordon1 · August 23, 2018, 2:44am

When importing data using neo4j-import, make sure to review the required CSV file structure and considerations before moving on.

http://neo4j.com/docs/stable/import-tool.html

ACTUAL vs. String (default) or Integer:

Each node in the CSV must have an :ID, whose type can be Integer, String, or ACTUAL.
The default is String, and one can specify Integer explicitly in the header.
In practice, String and Integer are about equal in performance and memory use, so the default is fine.

Using Integer or String is a bit more memory intensive than using ACTUAL, so one may be tempted to go for ACTUAL.
However, ACTUAL is usually not the best option unless you have a specific use case for it.

What is ACTUAL, and why is it special?

ACTUAL refers to the actual node ID, which in Neo4j corresponds to the location of that node's record on disk.
When using ACTUAL with neo4j-import, all :IDs must be ordered, and they must be ordered across ALL CSV files being imported during the load.
This is generally difficult to achieve, particularly for an existing data set that is large and complex.
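
To make the ordering requirement concrete, here is a minimal Python sketch that scans node CSV files in the order they would be passed to the import and reports the first ID that breaks the sequence. It is a pre-flight check, not part of neo4j-import. It assumes each file starts with its own header row, that the ID column is the header field containing ":ID", and that "ordered" means strictly increasing. The function name is hypothetical.

```python
import csv
from typing import Iterable, Optional, TextIO, Tuple


def find_first_out_of_order(
    sources: Iterable[Tuple[str, TextIO]],
) -> Optional[Tuple[str, int, int]]:
    """Return (source name, row number, id) for the first ID that is not
    strictly greater than the previous one, or None if all IDs are ordered.

    The comparison carries over from one file to the next, because the
    ordering has to hold across all files in the load.
    """
    previous = None
    for name, handle in sources:
        reader = csv.reader(handle)
        header = next(reader)
        # Assumption: the ID column is the header field containing ":ID".
        id_index = next(i for i, field in enumerate(header) if ":ID" in field)
        for row_number, row in enumerate(reader, start=2):  # row 1 is the header
            if not row:
                continue  # skip blank lines
            current = int(row[id_index])
            if previous is not None and current <= previous:
                return (name, row_number, current)
            previous = current
    return None
```

Pass it open file handles in import order, for example `find_first_out_of_order([("people.csv", open("people.csv", newline=""))])`, where the file name is only an illustration.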

Assuming you have ordered all of the nodes across all CSV files to be imported, you must also consider whether there are gaps in those IDs.
Each gap leaves a slot on disk that is never used for storage, which can reduce the amount of storage space you have available for the graph data store.
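
As a rough illustration of how gaps add up, the following Python sketch counts the IDs that are skipped in an ordered sequence. It assumes node IDs start at 0, so any IDs missing before the first one are counted as unused too. The function name is hypothetical.

```python
from typing import Iterable


def count_unused_ids(ordered_ids: Iterable[int]) -> int:
    """Count the ID slots skipped between 0 and the highest ID.

    Assumption: IDs start at 0 and arrive in strictly increasing order.
    """
    expected_next = 0
    unused = 0
    for node_id in ordered_ids:
        if node_id < expected_next:
            raise ValueError(f"ID {node_id} is out of order or duplicated")
        unused += node_id - expected_next  # slots skipped before this ID
        expected_next = node_id + 1
    return unused
```

For the sequence 0, 1, 5, 6, 10 the function returns 6: IDs 2 to 4 and 7 to 9 are never used.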

Lastly, avoid using large IDs with ACTUAL. Because the ID determines where the record sits on disk, large IDs will greatly increase the size of your store files.
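
A back-of-the-envelope Python sketch of why large IDs are costly: if the ID is the record's position, the node store has to reach as far as the highest ID, however few nodes exist. The 15-byte node record size is an assumption (it matches the standard store format of Neo4j 3.x as far as I know), so check it against your version. The function name is hypothetical.

```python
# Assumption: 15 bytes per node record (standard store format, Neo4j 3.x).
NODE_RECORD_BYTES = 15


def estimated_node_store_bytes(
    highest_id: int, record_bytes: int = NODE_RECORD_BYTES
) -> int:
    """Estimate the node store size implied by the highest node ID.

    IDs start at 0, so the store needs room for highest_id + 1 records,
    whether or not every slot below the highest ID is used.
    """
    return (highest_id + 1) * record_bytes
```

Under that assumption, a highest ID of 999,999 implies about 15 MB of node store, while a highest ID just under 2**32 implies roughly 64 GB, even if only a handful of nodes are imported.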

Take home message:

If you use low, consecutive, ordered :IDs, ACTUAL should work fine for you. It does, however, require knowledge of the internal storage architecture, and keeping all of the nodes in order across CSV files can be challenging.
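
One way to arrive at low, consecutive, ordered IDs is to renumber existing keys before writing the CSV files. The Python sketch below is only an illustration of that idea, not something neo4j-import does for you. The function name and the sample keys are hypothetical, and relationship files would need to be rewritten with the same mapping.

```python
from typing import Dict, Hashable, Iterable


def assign_dense_ids(
    original_keys: Iterable[Hashable], start: int = 0
) -> Dict[Hashable, int]:
    """Map each distinct key to a consecutive integer, in order of first
    appearance, so the resulting IDs are low, ordered, and free of gaps."""
    mapping: Dict[Hashable, int] = {}
    for key in original_keys:
        if key not in mapping:
            mapping[key] = start + len(mapping)
    return mapping
```

Calling `assign_dense_ids(["u-900", "u-17", "u-900", "m-3"])` yields `{"u-900": 0, "u-17": 1, "m-3": 2}`. Writing the nodes out in that same order keeps the IDs ordered across files.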

Note:
neo4j-import is intended to populate a new, empty database.
It cannot be used to import into an existing database.

jherna6 · March 4, 2020, 5:01pm

@david_gordon1 I have a graph database that contains 1.7 billion nodes. I'd like to create an index on those nodes, for one particular property, but am having little to no success doing so. I think that using "ACTUAL" IDs could help, assuming that I'm setting the "ACTUAL" Neo4j ID (returned by ID()).

In your post here, you mention not using "large" IDs. What do you mean by "large"? 2**32 = Large?
