I'm new to Cypher, and I'm using Neo4j Browser to run and test apoc.periodic.iterate queries on the latest version of the Neo4j Community Docker container. The machine I'm using has 48 cores and 200 GB of RAM. I have some questions about its performance, and I'd like to ask for your help and opinions.
The use case is as follows: I have a directory with a lot of JSON files, and I wrote a query using apoc.periodic.iterate to count the number of rows/lines in each of these files in parallel (the number of files is limited to 48 for simplicity). Here is the query:
CALL apoc.load.directory() YIELD value AS files
WITH files AS files
LIMIT 48
UNWIND files AS file
CALL apoc.periodic.iterate(
  "
  CALL apoc.load.json($file) YIELD value AS row
  RETURN row
  ",
  "
  UNWIND row AS r
  RETURN count(r)
  ",
  {batchSize: 1, parallel: true, params: {file: file}}
)
YIELD timeTaken, batches, total
RETURN timeTaken, batches, total
The timeTaken reported for each file is close to 0 for almost all of the 48 files (1 second for 2 of them); however, the final result is returned only after about 35 seconds. Does anyone know why it takes about 35 seconds to return the results? I don't think it should take this long. Or is it not really running in parallel?
Another thing that puzzles me is that the batches column shows the same value as the total column, and these values are actually the number of rows in each file. Shouldn't the value of batches be 1, since there are only 48 files and the batchSize is 1?
The count of batches is the same as the total number of rows because the first statement of apoc.periodic.iterate builds the set over which the iteration happens, and the second statement executes once for each element of this set. In your query, the first statement reads a file and builds the set from that file's rows, so with a batchSize of 1 you get one batch per row, and this is repeated for each of the 48 files. To get the results you want, you should move the call to apoc.load.directory (and the rest of the file selection) into the first statement of apoc.periodic.iterate, so that the iteration is over files rather than rows.
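A rough sketch of what that restructuring might look like is below. It assumes a recent APOC version in which values returned by the first statement (here `file`) can be referenced by name in the second statement; on older versions it may need to be `$file`. It has not been run against your data, so treat it as a starting point rather than a verified query.

```cypher
CALL apoc.periodic.iterate(
  "
  // First statement: builds the set to iterate over, one element per file.
  CALL apoc.load.directory() YIELD value AS file
  RETURN file
  LIMIT 48
  ",
  "
  // Second statement: runs once per file, so batches should equal the number of files.
  // Its RETURN is not passed back to the caller; apoc.periodic.iterate only yields statistics.
  CALL apoc.load.json(file) YIELD value AS row
  RETURN count(row)
  ",
  {batchSize: 1, parallel: true}
)
YIELD timeTaken, batches, total, errorMessages
RETURN timeTaken, batches, total, errorMessages
```

With this shape, `batches` and `total` should both be 48 (or fewer, if the directory holds fewer files), because each batch now covers one file instead of one row.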
After I modified the query to run in parallel, I can't get the count results that are returned by the second statement. Is this how it is supposed to work?
I've also been reading about how to load data and create a graph using apoc.periodic.iterate (that is, given all the files, create relationships between nodes in parallel). However, some online resources advise against this in order to avoid deadlocks (or other performance issues). Is this really the case? How could this result in a deadlock, given that the database supports ACID transactions?
I'm not sure I understand your first point correctly. Could you share the code you executed and the problem you're facing?
As for the second one: yes, Neo4j is ACID compliant, which means that no successfully executed transaction will violate any of the ACID properties. A transaction can still fail, though, and one possible cause is a deadlock, which occurs when two transactions each wait for a lock held by the other, for example when both try to modify the same nodes. To avoid this situation, you can either run without parallel set to true, or ensure that, across batches and concurrent workers, no element is modified more than once, e.g. by setting a property on each node exactly once.
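To make that a bit more concrete, here is a minimal sketch of the sequential option for creating relationships. The file name, the `Person` label, the `KNOWS` relationship type and the `id`, `source` and `target` fields are all made up for illustration; substitute your own. Creating a relationship takes locks on both of its end nodes, which is why concurrent batches that touch the same nodes can end up waiting on each other.

```cypher
// Hypothetical file, label, relationship type and properties, for illustration only.
CALL apoc.periodic.iterate(
  "
  CALL apoc.load.json('file:///edges.json') YIELD value AS row
  RETURN row
  ",
  "
  // Each relationship write locks both end nodes.
  MATCH (a:Person {id: row.source})
  MATCH (b:Person {id: row.target})
  MERGE (a)-[:KNOWS]->(b)
  ",
  // parallel: false runs the batches one after another, so they cannot deadlock each other.
  {batchSize: 1000, parallel: false}
)
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages
```

If you do want parallel: true, checking `failedBatches` and `errorMessages` in the result is a simple way to see whether any batch was rolled back because of a deadlock.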
I hope this helps! Please let me know if I haven't explained myself clearly.