I'm using the latest Neo4j Community Docker container to load a graph from a directory that contains 1330 ndjson files. The graph has roughly 29 million nodes and about 1 billion edges, so I'm trying to use apoc.periodic.iterate to speed up graph creation. For creating the nodes, I use an apoc.periodic.iterate query that loads them in parallel, which is pretty fast. Once the node-creation query finishes, a separate apoc.periodic.iterate query creates the edges. I tried to create the edges in parallel as well, but the files seem to have dependencies between them, so the query always failed (i.e., it ends, but it does not create all the expected edges). I therefore disabled parallel execution and started playing with the batchSize to see if I can speed up the process. However, there is something I don't understand, and I wanted to ask for your help.
When I set the batchSize to 100 or above (the number of files is 1330), the operation ends up not creating all the required edges. Shouldn't setting parallel to false avoid potential deadlocks or conflicts? It only works when I set the batchSize to very low values, e.g., 1 (effectively serial execution) or 2, but then it still takes a lot of time. Does anyone have an idea what's wrong with the query, or with my expectations of it?
Here is the query for loading the edges:
CALL apoc.periodic.iterate("
CALL apoc.load.directory() YIELD value as files
WITH files AS file
RETURN file
",
"
CALL apoc.load.json(file) YIELD value as fjson
UNWIND fjson as rows
MATCH (c1:Concept {ocid:rows.ocid})
UNWIND rows.ancestors as ancestor_ocid
MATCH (c2:Concept {ocid:ancestor_ocid})
MERGE (c1)-[:DESCENDENT_OF]->(c2)
",
{batchSize:4, parallel:false})
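For what it's worth, apoc.periodic.iterate does not abort when individual batches fail; it only reports failures in its summary columns, so a run can finish while silently dropping relationships. A minimal sketch of the same call with those columns yielded (only the YIELD/RETURN at the end differs from the query above):

CALL apoc.periodic.iterate("
CALL apoc.load.directory() YIELD value as files
WITH files AS file
RETURN file
",
"
CALL apoc.load.json(file) YIELD value as fjson
UNWIND fjson as rows
MATCH (c1:Concept {ocid:rows.ocid})
UNWIND rows.ancestors as ancestor_ocid
MATCH (c2:Concept {ocid:ancestor_ocid})
MERGE (c1)-[:DESCENDENT_OF]->(c2)
",
{batchSize:4, parallel:false})
YIELD batches, total, failedBatches, failedOperations, errorMessages
RETURN batches, total, failedBatches, failedOperations, errorMessages

A non-zero failedBatches or a populated errorMessages map would explain why the edge count comes up short.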
Do you have an index defined on Concept(ocid)? That will speed up the match.
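If not, one can be added as a uniqueness constraint, which is backed by an index. A sketch (the constraint name concept_ocid_unique is just an example; the ON/ASSERT form is the Neo4j 4.x syntax, newer versions use FOR/REQUIRE):

// Neo4j 4.x syntax
CREATE CONSTRAINT concept_ocid_unique IF NOT EXISTS
ON (c:Concept) ASSERT c.ocid IS UNIQUE

// Neo4j 5.x equivalent
// CREATE CONSTRAINT concept_ocid_unique IF NOT EXISTS
// FOR (c:Concept) REQUIRE c.ocid IS UNIQUE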
You are batching files. Are these huge files, so that batch sizes over 2 result in a very large number of rows being processed in one batch? You could switch to batching rows by moving the first two lines of the second query into the first. That should give you better control over the number of rows per batch, since the batch size is then specified in rows rather than in files that each contain a varying number of rows. Worth a try to see if it helps.
The documentation states that apoc.load.json yields each value in the file as a map, so there should be no need to unwind.
CALL apoc.periodic.iterate("
CALL apoc.load.directory() YIELD value as file
CALL apoc.load.json(file) YIELD value as row
RETURN row
",
"
MATCH (c1:Concept {ocid:row.ocid})
UNWIND row.ancestors as ancestor_ocid
MATCH (c2:Concept {ocid:ancestor_ocid})
MERGE (c1)-[:DESCENDENT_OF]->(c2)
",
{batchSize:10000, parallel:false})
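After a run, a quick sanity check is to compare the actual relationship count against the expected total. For a pattern with a single relationship type and nothing else, Cypher should answer this from the count store, so it stays cheap even at a billion edges:

MATCH ()-[r:DESCENDENT_OF]->()
RETURN count(r) AS descendent_of_count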
Thanks @glilienfield for the ideas -- Yes, there is an index on Concept (more precisely, a uniqueness constraint).
The files contain 100k rows each, and each row creates edges between a node and its ancestors (~5 nodes), so yes, there is a lot of processing here. I'll try working on rows instead of whole files. -- The batchSize in your query is set to 10000; were you just giving an example, or did you choose that value intentionally? I've seen this value used in various Neo4j online resources. Can you tell me on what basis the batchSize value should be chosen?
I've noticed one more thing I'd like your opinion on: while the query is running, a large number of neo4j processes sit in the "S"/idle state most of the time. Any idea what's happening here? Would increasing the transaction memory or disabling transaction logging have a positive impact?
Michael Hunger has stated that it is a good number to use in general. Of course, it also depends on other circumstances, but you can start with it. He has a blog with some older articles he posted on the topic.
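On the transaction-memory question above: I can't say whether memory is the bottleneck here, but the usual knobs live in neo4j.conf. A sketch assuming Neo4j 4.x setting names (Neo4j 5 moves these under server.memory.*); the values are placeholders to adapt to the available RAM:

# Heap used for query execution and transaction state
dbms.memory.heap.initial_size=8G
dbms.memory.heap.max_size=8G
# Page cache for the store files; ideally large enough to hold the hot part of the graph
dbms.memory.pagecache.size=16G
# Upper bound on memory used by all running transactions combined
dbms.memory.transaction.global_max_size=4G

As far as I know, transaction logging itself cannot be switched off (the logs are needed for recovery); only the retention can be reduced.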