Creating a Large Number of Edges with `apoc.periodic.iterate`

I'm using the latest Neo4j Community Docker container to load a graph from a directory that contains 1330 ndjson files. The graph has ~29 million nodes and ~1 billion edges, so I'm trying to use `apoc.periodic.iterate` to speed up graph creation. For the nodes, I use an `apoc.periodic.iterate` query that loads them in parallel, which is pretty fast. After the node-creation query finishes, a separate `apoc.periodic.iterate` query creates the edges. I tried to create the edges in parallel too, but the files seem to have dependencies between them, so that query always failed (i.e., it finishes, but does not create all the expected edges). I therefore disabled parallel execution and started experimenting with `batchSize` to see if I could speed up the process. However, there is something I don't understand, and I wanted to ask for your help.

When I set `batchSize` to 100 or above (the total number of files is 1330), the operation ends up not creating all the required edges. Doesn't setting `parallel` to false avoid potential deadlocks or conflicts? It only works when I set `batchSize` to very low values, e.g., 1 (serial execution) or 2, but then it takes a very long time. Does anyone have any idea what's wrong with the query, or with my expectations of it?

Here is the query for loading the edges:

 CALL apoc.periodic.iterate("
            CALL apoc.load.directory() YIELD value as files
             WITH files AS file
             RETURN file
          ", 
          "
           CALL apoc.load.json(file) YIELD value as fjson
           UNWIND fjson as rows
           MATCH (c1:Concept {ocid:rows.ocid})
           UNWIND rows.ancestors as ancestor_ocid
           MATCH (c2:Concept {ocid:ancestor_ocid})
           MERGE (c1)-[:DESCENDENT_OF]->(c2)
       ",
       {batchSize:4, parallel:false}

Thanks!

Do you have an index defined on Concept(ocid)? That will speed up the match.
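
If not, a uniqueness constraint (which is backed by an index) is the usual way to get one. A minimal sketch, assuming Neo4j 4.4+ syntax (older versions use `ASSERT ... IS UNIQUE` instead of `REQUIRE`), with `concept_ocid` just an illustrative name:

    CREATE CONSTRAINT concept_ocid IF NOT EXISTS
    FOR (c:Concept) REQUIRE c.ocid IS UNIQUE;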

You are batching files. Are these huge files, such that a batch size over 2 results in a very large number of rows being processed in one batch? You could switch to batching rows by moving the first two lines of the second query into the first. That gives you direct control over the number of rows per batch, instead of batching files that each contain a varying number of rows. Worth a try to see if it helps.

The documentation states that `apoc.load.json` already yields each JSON value as a map, so there should be no need to unwind it.

CALL apoc.periodic.iterate("
          CALL apoc.load.directory() YIELD value as file
          CALL apoc.load.json(file) YIELD value as row
          RETURN row
          ", 
          "
           MATCH (c1:Concept {ocid:row.ocid})
           UNWIND row.ancestors as ancestor_ocid
           MATCH (c2:Concept {ocid:ancestor_ocid})
           MERGE (c1)-[:DESCENDENT_OF]->(c2)
       ",
       {batchSize:10000, parallel:false}
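
One more diagnostic that may help with the missing edges: `apoc.periodic.iterate` returns statistics about its batches, so you can yield them and inspect failures directly rather than only counting relationships afterwards. A minimal sketch (outer and inner queries elided for brevity):

    CALL apoc.periodic.iterate("...", "...",
        {batchSize: 10000, parallel: false})
    YIELD batches, total, failedBatches, failedOperations, errorMessages
    RETURN batches, total, failedBatches, failedOperations, errorMessages;

If `failedBatches` is non-zero, `errorMessages` should point at the deadlocks or other errors behind the missing edges.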

Thanks @glilienfield for the ideas -- yes, there is an index on `Concept(ocid)` (more precisely, a uniqueness constraint).

The files contain 100k rows each, and each row creates an edge between a node and each of its ancestors (~5 nodes), so yes, there is a lot of processing here. I'll try working on the rows instead of whole files. :+1: -- The `batchSize` in your query is set to 10000; were you just giving an example, or did you choose this value intentionally? I've seen this value used in various Neo4j online resources. Can you tell me on what basis the value of `batchSize` should be decided?

I've noticed one thing I'd like your opinion on: while the query is running, a large number of neo4j processes sit in the "S"/idle state most of the time. Any idea what's happening there? Would increasing the transaction memory or disabling transaction logging have a positive impact?

Thanks again!

Michael Hunger has stated that 10,000 is a good number to use in general. Of course, it also depends on other circumstances, but you can start with it. He has a blog with some older articles on the topic:

http://www.jexp.de/blog/

Sorry, I don't have any insight into idle processes. You could try increasing heap memory if you have memory to spare.
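
For reference, heap and page cache are set in `neo4j.conf`. A minimal sketch, assuming Neo4j 4.x setting names (Neo4j 5 renames these to `server.memory.*`), with sizes that are pure placeholders for whatever your machine can spare; in the official Docker image the same settings can be passed as environment variables, e.g. `NEO4J_dbms_memory_heap_max__size`:

    # neo4j.conf -- example values only, tune to your hardware
    dbms.memory.heap.initial_size=8G
    dbms.memory.heap.max_size=8G
    dbms.memory.pagecache.size=16G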