Understanding `apoc.periodic.iterate` parallel performance

Hatem · October 19, 2023, 4:25pm

Hello,

I'm new to Cypher, and I'm using Neo4j Browser to run and test apoc.periodic.iterate queries on the latest version of the Neo4j Community Docker container. The machine I'm using has 48 cores and 200 GB of RAM. I have some questions about its performance, and I'd like to ask for your help and opinions.

The use case is as follows: I have a directory with a lot of JSON files, and I wrote a parallel query using apoc.periodic.iterate to count the number of rows/lines in each of these files in parallel (the number of files is limited to 48 for simplicity). Here is the query:

CALL apoc.load.directory() YIELD value AS files
WITH files AS files
LIMIT 48
UNWIND files AS file
CALL apoc.periodic.iterate(
    "
        CALL apoc.load.json($file) YIELD value AS row
        RETURN row
    ",
    "
        UNWIND row AS r
        RETURN count(r)
    ",
    {batchSize: 1, parallel: true, params: {file: file}}
)
YIELD timeTaken, batches, total
RETURN timeTaken, batches, total

Here is a snippet of the output:

The timeTaken value is close to 0 for almost all of the 48 files (1 second for 2 files); however, the final result is returned after about 35 seconds. Does anyone know why it takes about 35 seconds to return the results? I don't think it should take this long. Or is it not really running in parallel?

Another thing that puzzles me is that the batches column shows the same value as the total column; these values are actually the number of rows in each file. Shouldn't the value of batches be 1, since there are only 48 files and the batchSize is 1?

Thanks!

luiseduardo · October 19, 2023, 9:06pm

Hi @Hatem !

The count of batches is the same as the total rows because the first statement of apoc.periodic.iterate builds the set over which the iteration happens, and the second statement executes for each element of this set. In your query, the first statement reads a file and builds the set from the rows of that file, so with a batchSize of 1 every row becomes its own batch. To get the result you want, use the call to apoc.load.directory (with the LIMIT) as the first statement of apoc.periodic.iterate, so that the set contains one element per file.
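
A minimal sketch of that restructuring, assuming apoc.load.directory yields one file path per row (as in the original query) and that each path can be passed directly to apoc.load.json. It is an illustration under those assumptions, not a tested or tuned solution:

```cypher
CALL apoc.periodic.iterate(
    // First statement: builds the set to iterate over, one element per file.
    "
        CALL apoc.load.directory() YIELD value AS file
        RETURN file
        LIMIT 48
    ",
    // Second statement: runs once per file; `file` comes from the first statement.
    "
        CALL apoc.load.json(file) YIELD value AS row
        RETURN count(row)
    ",
    {batchSize: 1, parallel: true}
)
YIELD timeTaken, batches, total
RETURN timeTaken, batches, total
```

With this shape, batches and total should both reflect the number of files rather than the number of rows.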

Hope this helps!

Hatem · October 20, 2023, 10:56am

Hi @luiseduardo -- Thanks a lot for the insight!

Hatem · October 20, 2023, 3:30pm

Hi Again :))

Can you please help me with the following doubts?

1. After I modified the query to run in parallel, I can't get the count results returned by the second statement. Is this how it is supposed to work?
2. I've been reading about how to load data and create a graph using apoc.periodic.iterate (that is, given all the files, create relationships between nodes in parallel). However, some online resources advise against it to avoid deadlocks or other performance issues. Is this really the case? How would this result in a deadlock, given that the database supports ACID transactions?

Thanks!

luiseduardo · November 1, 2023, 4:19am

Hi @Hatem !

I'm not sure I understand your first point correctly. Could you share the code you executed and the problem you're facing?
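
For context on the first point: apoc.periodic.iterate yields execution statistics (such as batches, total, and timeTaken) rather than the rows returned by its second statement, so counts computed there do not appear in its output. If only the per-file counts are needed, one possible alternative is a plain query without apoc.periodic.iterate. The sketch below processes the files one after another rather than in parallel, and assumes the paths yielded by apoc.load.directory can be passed to apoc.load.json:

```cypher
// Sequential sketch: one result row per file with its row count.
CALL apoc.load.directory() YIELD value AS file
WITH file
LIMIT 48
CALL apoc.load.json(file) YIELD value AS row
RETURN file, count(row) AS rows
```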

As for the second one: yes, Neo4j is ACID compliant, which means that no successfully executed transaction will break any of the ACID guarantees. A transaction can still fail, though, and one possible cause is a deadlock, which can occur when concurrent transactions wait on locks held by each other, for example when two batches try to modify the same nodes at the same time. To avoid this, you can either run with parallel set to false, or ensure that, for the given batch size and concurrency settings, no element is modified by more than one batch at a time, e.g. by setting a property on each node exactly once.

I hope this helps! Please let me know if I haven't explained myself clearly.
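
As a rough illustration of the first option, the sketch below creates relationships with parallel set to false, so batches are committed one after another and do not compete for locks on the same nodes. The labels Person and Company, the relationship type WORKS_AT, and the properties name and employer are hypothetical placeholders, not taken from the original data:

```cypher
CALL apoc.periodic.iterate(
    // Hypothetical data model: Person nodes with an `employer` property.
    "MATCH (p:Person) RETURN p",
    // Many Person nodes may point to the same Company, so parallel batches
    // could contend for locks on that Company node.
    "MATCH (c:Company {name: p.employer}) MERGE (p)-[:WORKS_AT]->(c)",
    {batchSize: 1000, parallel: false}
)
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages
```

Yielding failedBatches and errorMessages makes it easier to see whether any batch was rolled back.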
