Optimizing the writing of large amounts of data in neo4j with apoc Parquet, periodic iterate

tlaucournet · November 24, 2023, 10:03am

Hi,
I need to import hundreds of millions of nodes and relationships into my database while avoiding duplicates.
I created indexes for most of my nodes.
I'm using Neo4j 5.13 and APOC.
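
As an illustration only (the actual index definitions are not shown in this post), indexes supporting the MERGE lookups below could be sketched like this. The index names are hypothetical; the labels and properties are the placeholders used in the query.

    // Hypothetical index names; one index per MERGE key used in the import
    CREATE INDEX label1_property IF NOT EXISTS FOR (n:Label1) ON (n.property);
    CREATE INDEX label2_info IF NOT EXISTS FOR (n:Label2) ON (n.info);
    CREATE INDEX label3_info IF NOT EXISTS FOR (n:Label3) ON (n.info);
    CREATE INDEX label4_info IF NOT EXISTS FOR (n:Label4) ON (n.info);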

This is what my Cypher query looks like:

    CALL apoc.periodic.iterate('
        CALL apoc.load.parquet("file:///my_file.parquet") YIELD value RETURN value
    ','
        MERGE (n1:Label1 {property: value.property1})
        ON CREATE SET n.property2 = value.property2

        FOREACH (_ IN CASE WHEN value.info1 IS NOT NULL AND value.info1 <> \\'\\' THEN [1] ELSE [] END |
            MERGE (n2:Label2 {info: value.info1})
            MERGE (n)-[:HAS_INFO]->(n2)
        )

        FOREACH (_ IN CASE WHEN value.info2 IS NOT NULL AND value.info2 <> \\'\\' THEN [1] ELSE [] END |
            MERGE (n3:Label3 {info: value.info2})
            MERGE (n)-[:HAS_INFO]->(n3)
        )

        WITH value.unknown_values AS uvalues, n
        UNWIND uvalues AS uvalue
        MERGE (n4:Label4 {info: uvalue})
        MERGE (n)-[:HAS_UNKNOWN_INFO]->(n4)
    ', {batchSize: 10000, parallel: true})

I'd like to know whether this approach looks well optimized to you.

Thanks

glilienfield · November 24, 2023, 10:47am

It does look typical. Two comments though. First, I don’t see where the variable ‘n’ that is referenced throughout the code is defined. The first MERGE binds its result to ‘n1’. Should ‘n1’ be ‘n’ instead?

Second, running the batches in parallel can become problematic when the code creates relationships. Creating a relationship requires locking the nodes at both ends, so batches can block each other, or even deadlock, when the data creates multiple relationships to the same node. Just a heads up.
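
To illustrate, here is a minimal sketch of the same import with the batches run sequentially, which avoids lock contention between batches at the cost of throughput. It assumes ‘n’ is meant to be ‘n1’, shows only the first conditional MERGE, and compares against "" rather than an escaped empty string. I have not run it against your data.

    // Same shape as the original call, but batches run one after another
    // (parallel: false), so no two batches compete for locks on the same node.
    CALL apoc.periodic.iterate('
        CALL apoc.load.parquet("file:///my_file.parquet") YIELD value RETURN value
    ','
        MERGE (n1:Label1 {property: value.property1})
        ON CREATE SET n1.property2 = value.property2

        FOREACH (_ IN CASE WHEN value.info1 IS NOT NULL AND value.info1 <> "" THEN [1] ELSE [] END |
            MERGE (n2:Label2 {info: value.info1})
            MERGE (n1)-[:HAS_INFO]->(n2)
        )
    ', {batchSize: 10000, parallel: false})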

tlaucournet · November 24, 2023, 11:03am

Thanks for your answer.
Indeed, I made a mistake: 'n' should be 'n1'.
Regarding locks between batches, I do have some issues. I am using the 'retries' parameter of apoc.periodic.iterate, which I forgot to include in the example.
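
For reference, a sketch of the call with that parameter added. The value 3 is only a placeholder, not the setting from the real import, and the body is shortened to the first conditional MERGE. The YIELD columns are among those returned by apoc.periodic.iterate and make it possible to check whether any batches still failed after retrying.

    // retries: 3 is a placeholder value; a failed batch is retried up to that many times
    CALL apoc.periodic.iterate('
        CALL apoc.load.parquet("file:///my_file.parquet") YIELD value RETURN value
    ','
        MERGE (n1:Label1 {property: value.property1})
        ON CREATE SET n1.property2 = value.property2

        FOREACH (_ IN CASE WHEN value.info1 IS NOT NULL AND value.info1 <> "" THEN [1] ELSE [] END |
            MERGE (n2:Label2 {info: value.info1})
            MERGE (n1)-[:HAS_INFO]->(n2)
        )
    ', {batchSize: 10000, parallel: true, retries: 3})
    YIELD batches, total, failedBatches, retries, errorMessages
    RETURN batches, total, failedBatches, retries, errorMessages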
