Load-CSV very slow with millions of nodes

Dear Community,

We are creating a knowledge graph using data processed through Spark. Spark output includes a number of CSV files that were formatted to load in neo4j for the various nodes and relationships required. We have a large number of files for both nodes and relationships separately of varying sizes. For example one of the 'Author' node type the data is split in 110 files of approx 180Mb each.

We have manged to load all the nodes on neo4 which turned out to be 570 million nodes. At the moment we are trying to build the relationships between the nodes using simple queries such as the one below.

USING PERIODIC COMMIT 5000
LOAD CSV FROM {csvfile} AS line
MATCH (a:Author { authorid: line[0] })
MATCH (b:Author { authorid: line[1] })
CREATE (a)-[r:co_author { collaborations: line[2] }]->(b);

The relationship files are also large; for example for the above relationship we have 200 files of around 70MB each (3.4 million relationships per file). We are encountering an issue with loading these relationships in neo4j where one file requires several hours to load. We have an index on the node ids such as authorids for the above example. Below is a sample of the data we are trying to load:

1246478160,2110912591,2
1246488368,2298908021,3
124649346,2077283241,5
1246495591,2156716603,8

We are using Neo4j community edition version 3.5.5 on an Ubuntu machine using a containerized environment. The machine is run on Azure and has 8vcpus and 64RAM. Also we have configured a pagecache=30G, and heap_max=30G.

When loading the CSV file the CPU of the container remains very low ( <4%) and it does not run out of memory. It seems strange that the CPU remains so low as if something else is preventing the loading to run faster. Do you have any insight on this behaviour? and do you have any suggestions on how we an can improving the load time of these relationships?

Thank you for you help!

Regards,

Joseph

Two things:

1.) Using Neo4j admin import seen here https://neo4j.com/docs/operations-manual/current/tools/import/
We load 6 Billions of nodes and 10 Billion rels in 2-3 hours. You want to break the files down into distinct nodes and distinct pairs of nodes. if your numerical id's are unique you can maybe use the flag "id-type =integer" which will give you a boost on memory used.

2.)If you have to use somekind of load csv, Load in Parallel. Consider apoc.periodic.iterate, seen here https://neo4j-contrib.github.io/neo4j-apoc-procedures/#commit-batching .Relationships cannot load in parallel unless you are sure that the way the file is built that no two cpus will try and grab the same nodes, so it looks like you may be stuck with slow loading, if you can be smart about sorting the data such that for each batch, you never have a race condition to the same nodes I think it could be parallelized, i.e. if you have a batch size of 1000 and you have 8 cpu's then the first 1-1000 can have the same nodes, but the next 2000-6000 must have different nodes, It might be tough but could give lift if the sort was figured out. Also I bet neo4j would appreciate it xD! I would suggest maybe increasing the the commit size to 100k, 5k is small, you have 30 GB RAM to use when you load relationships, you might be able to get away with higher, like 300k per batch.

Hi Benjamin,

Thank you for your quick reply and suggestions.

  1. Can you explain what setup you had to achieve such fast loading of 16B items in 3 hrs? Also does the tool required enterprise edition? The files are currently formatted to create separate nodes, and seperate relationships between two nodes. Yes the ids are unique so we can try that.

Did you load the data from local disk or using url like LOAD CSV. At the moment our data is in blob storage so ideally we get data directly from blob storage. Do you have an sample query/command used to load through neo4j-admin?

How did you find running queries with billions of nodes? One aspect we are concerned is the fact that it will be difficult to run some more complex queries.

  1. we can try to increase batch size if (1) fails. It might be more difficult to sort the data. We can investigate that a bit further.

Thanks,

Joseph

1.) Neo4j-admin import is super quick, you feed it a bunch of csv's and it does the rest. The call is not overly complicated:
bin/neo4j-admin import --nodes "/import/uids-header.csv,/import/uid_no.*" --relationships:CO_AUTHOR "/import/rels-header.csv,/import/rels.*" --ignore-missing-nodes --delimiter="~" --high-io=true --id-type=integer --additional-config=conf/neo4j.conf
Choose what ever delimiter works for you, ignore-missing-nodes affects when the rels load, if it can't find the matching nodes which load first then it doesn't make that rel. high-io =true is for NVMe and fast SSD devices. Note - It is neo4j syntax for all capital letters for rels, i.e. - CO_AUTHOR
I normally load from disk, we considered using makefifo seen here: https://linux.die.net/man/3/mkfifo, the syntax would have been: makefifo /tmp/fake.csv
aws s3 cp s3://bucket/file.csv.gz - | gunzip -cd > /tmp/fake.csv &
neo4j-admin import --nodes /tmp/fake.csv but we decided to stick to just loading from disk.

Running queries on billions of nodes doesn't work without HUGE amounts of ram. Our current machine is a 488 GB ram NVMe storage beast. In order to work with that kind of data size it is usually best to really think through what you are doing with a particular dataset, I often find I have to relabel data as ":Of_Interest" or something like that to flag it for further analysis. It is best to think of ways to interact with parts of graph and then piece it together if possible. Apoc.periodic.Iterate is a life saver. Learn APOC or figure out how to work with unmanaged java extensions on the server it self interacting with the data layer itself.

Thank you for the suggestions, much appreciated. We will try the neo4j-admin to see how it goes. Ok that is a huge machine, it is far from what we have :frowning: Unfortunately we do not have such resources available.

Any insight why with LOAD CSV, the CPU remains very low, almost idle?

Joseph

No problem. Copy 5,000 rows into a csv and look how big it is, there is almost no memory to consume that. Secondly, if you are on Community Edition they have a strict limit to 1 CPU (apart from apoc.periodic) so you can't get any lift as compared to Enterprise version.

You may want to run an EXPLAIN of your query, just to double-check that it's actually using the index lookups on :Author(authorid) that you expect. If it's instead using NodeByLabelScan then something is wrong with your indexes, and you should double-check the case and that the label and properties are correct.

1 Like

I suspect I've got this problem. Thanks for providing the EXPLAIN tip
https://neo4j.com/docs/cypher-manual/current/query-tuning/how-do-i-profile-a-query/

I've created a constraint on Cid and Org name, by using the toString() am I slowing things down? (i.e. removing the value of indexing in the first place)

CREATE CONSTRAINT ON (o:Organisation) ASSERT o.orgID IS UNIQUE;
CREATE CONSTRAINT ON (o:Organisation) ASSERT o.name IS UNIQUE;
CREATE CONSTRAINT ON (c:Cid) ASSERT c.contentID IS UNIQUE;

// Create HAS_ORGANISATIONS relationship
// from Content Store API
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///cid_has_organisations_org.csv" AS csvLine
FIELDTERMINATOR '\t'
MATCH (cid:Cid { name: toString(csvLine.content_id)}), (org:Organisation { name: tostring(csvLine.organisations)}) CREATE (cid)-[:HAS_ORGANISATIONS]->(org)
;

While the toString() shouldn't be a problem here, it's also not necessary. When doing a LOAD CSV all fields are addressed as strings by default, so you can remove the toString() usage.

The constraints you showed show an index on :Cid(contentID), not :CID(name), again make sure you EXPLAIN the plan and ensure that these are doing expected index lookups.

Good spot - that checks out, as PROFILE reveals NodeByLabelScan on Cid