APOC Function apoc.load.csv() Slowing Down After Importing ~20 Million Nodes

I am using apoc.load.csv() to load in a batch of CSVs as nodes into Neo4j. I have ~263.5 million nodes to load, and I'm loading them 100K at a time using this CALL:

CALL apoc.load.csv('vs.txt',
 {nullValues: ['']}) 
YIELD map AS row
WITH row
CREATE (p: Entity) SET p = row

wrapped in a Python loop via the neo4j driver, to generate a new batch of 100K on each iteration.

The first ~20 million loaded quickly, ~1.5 seconds to load each 100K nodes. But after ~20 million nodes were loaded, the load times jumped way up, at the extremes taking 500 to 700 seconds to load 100K nodes.

Just to double check, I reran the import starting on the same file that gave the first slow-loads, just to see if it was something in the data, but everything loads quickly from the start, so I don't think it was something about the earlier versus later batches of CSVs.

Is there something in Neo4j that would cause apoc.load.csv() to slow down appreciably as more nodes are added?

Other Details:

I have a constraint on the nodes to ensure that their identifier is unique:

CREATE CONSTRAINT IF NOT EXISTS ON (n: Entity) ASSERT n.bvd_id IS UNIQUE;

I have the:

dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=1G

lines in the config file commented out, so Java dynamically calculates the resource size, and this amounts to ~ 5 GB of RAM when the java process runs.

If you create everything in one transaction, then Neo4j needs roughly 2-3G heap for 1M updates.
As the intermediate state of a tx is held in memory. That's why transaction batching helps.

You can just use apoc.periodic.iterate for that. The first statement provides the stream of data to operate on, the 2nd does the actual updates.

CALL apoc.periodic.iterate("
CALL apoc.load.csv('vs.txt', {nullValues: ['']}) YIELD map AS row RETURN row
","
CREATE (p: Entity) SET p = row
", {batchSize:50000, parallel:true})

Thanks for your help, Michael!

I'm trying the apocalypse.periodic.iterate() approach now, and it's running slightly faster. But it's still slowing down the more nodes I add.

I tried running an experiment over the weekend where I didn't impose this constraint:

CREATE CONSTRAINT IF NOT EXISTS ON (n: Entity) ASSERT n.bvd_id IS UNIQUE;

and everything loaded in quickly. But then when I tried to add the relationships based on bvd_id, the load statement crapped out.

Is the constraint somehow slowing things down as more and more nodes are added? And if so, is there a better way to handle the constraint?

can you share the statement you use for creating the relationships?

Sure thing!

CALL apoc.load.csv('es.txt',
{mapping:{
   direct: {name: 'direct_ownership_proportion'},
   total: {name: 'total_ownership_proportion'}
  }, nullValues: ['']})
YIELD map AS row
WITH row
MATCH (s:Entity {bvd_id: row.shareholder_bvd_id})
MATCH (d:Entity {bvd_id: row.subsidiary_bvd_id})
CREATE (s)-[rel:OWNS]->(d)
SET rel = row;

Can you also address my question about why the CONSTRAINT would cause the node loading to slow over time? After loading ~50 files (all of approximately the same size), the load time has jumped up significantly:

https://github.com/ddarmon/master/blob/master/images/time_for_file.png

again it's down to batching, so you can use the same apoc.periodic.iterate for your relationships.

call apoc.periodic.iterate("
CALL apoc.load.csv('es.txt',
{mapping:{
   direct: {name: 'direct_ownership_proportion'},
   total: {name: 'total_ownership_proportion'}
  }, nullValues: ['']})
YIELD map AS row
RETURN row
","
MATCH (s:Entity {bvd_id: row.shareholder_bvd_id})
MATCH (d:Entity {bvd_id: row.subsidiary_bvd_id})
CREATE (s)-[rel:OWNS]->(d)
SET rel = row;
",{batchSize:50000})

something you probably want to do is not set the id's on the relationship but only the properties you really want to put onto it.

Why do you think that the constraint slows the node-loading?
Could also be an I/O issue, not sure what your page-cache size is and the database size in comparison.

Because without the constraint, the nodes loaded in about twice as fast (600 ms per query, versus 1.2 s per addition of 10,000 nodes), and didn't get bogged down like they are now (with the huge 2+ hour load times for files that usually load in < 2 minutes). i.e. the load times look like:

https://github.com/ddarmon/master/blob/master/images/time_for_file--fast.png

e.g. small and steady throughout.

The only difference between these two types of load times was the presence or absence of

CREATE CONSTRAINT IF NOT EXISTS ON (n: Entity) ASSERT n.bvd_id IS UNIQUE;

immediately after initializing the database.

Is it inadvisable to add a constraint on the nodes at the end, after you have already loaded them into the database? I tried to do that this morning, but gave up after 30 minutes of Neo4j trying to process the query.

It seems really like a memory or i/o issue that you're running into.

Loading 263M nodes and probably many more rels definitely needs more memory than 5G for page-cache and heap, not sure how much memory your machine has but I think what you're seeing is swapping data in and out from disk, and writing the tx log.

Gotcha. On the machine I'm prototyping this on, I have 16 GB of RAM. The java process spawned by Neo4j maxes out around 10 GB of RAM. I thought this might be a memory issue, but thought it was weird that the memory issue only arises in the presence of the constraint.

FWIW, the network is fairly sparse, and has only 194 M relationships.

Does a constraint on a node force Neo4j to check that uniqueness of a new node each time the CREATE clause is called? If so, I could see why it would slow down over the course of the load, since it would have to look through more and more nodes to check for uniqueness, having a harder time fitting those into RAM, and then having to swap to disk using virtual memory.

Would using neo4j-admin make this any faster? Or is it a matter of running this on a machine with more RAM?

Thanks, by the way, for your help with this! I really appreciate it.

2 Likes

Yes the constraints checks each time for uniqueness.

If you have 16G I recommend to do the following:

Use 10G for heap and 5G for page-cache.
Import the nodes first and then create the constraints.
Them import the relationships.

Yes neo4j-admin import would be significantly faster at lower resources as it doesn't use the transactional APIs but creates the store directly from the input CSVs. You'd still want to create the constraint afterwards to be safe.

Sorry I originally only thought you were importing only 20M nodes didn't read it carefully enough.

Great! I'll give both of those a shot and see if they speed things up.

No worries. I really appreciate your help with this!