Creating relationships between millions of nodes and running out of heap memory

Hi Neo4j community,

I'm trying to create relationships between two groups of nodes. The first group is Listing, with 10 million nodes. The second group is Picture, with 50 million nodes (5 pictures per listing); each Picture node should be connected to exactly one Listing node.
First I loaded the listings csv and created the 10 million Listing nodes. Then I wrapped my next query in apoc.periodic.iterate: as it loads the pictures csv to create the Picture nodes, it finds the Listing node each picture belongs to and creates that relationship. With a batch size of 10k, the heap memory runs out after about 30k relationships are created.
Any help would be much appreciated. I'm super new to Neo4j and would love to learn anything I can!

My query to load the listings csv and create the Listing nodes:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('file:///listings.csv',{
  mapping:{
    id: {type:'int'},
    beds: {type:'int'},
    price: {type: 'int'},
    score: {type: 'float'},
    reviews: {type: 'int'}
  }
}) YIELD map AS row RETURN row
","
CREATE (l:Listing) SET l = row
", {batchSize:10000, iterateList:true, parallel:true});

My query to load pictures csv, create Picture nodes, and create relationships between Picture nodes and Listing nodes:

CALL apoc.periodic.iterate("
  CALL apoc.load.csv('file:///pictures.csv',{
    mapping:{
      id: {type:'int'},
      listing: {type:'int'}
    }
  }) YIELD map AS row RETURN row
"," 
  CREATE (p:Picture) SET p = row
  WITH p
  MATCH (l:Listing)
  WHERE p.listing = l.id
  CREATE (p)-[:PICTURE_OF]->(l)
", {batchSize:10000, parallel:true, iterateList:true});

Hi, welcome to the community :slight_smile:. You can try a couple of things first:

  1. Increase the heap size in the conf file
  2. Set the parallel parameter to false

Also, create an index on Listing(id) before running the second query for better performance.
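For example (a sketch, assuming you're on Neo4j 4.x; the older 3.5 syntax is CREATE INDEX ON :Listing(id)):

CREATE INDEX FOR (l:Listing) ON (l.id);

With that index in place, the MATCH on l.id in the second query becomes an index seek instead of a full label scan per row.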

Thank you! I increased the heap size to 8G and the page cache to 8G (my laptop's RAM is 16G).

dbms.memory.heap.initial_size=8G
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=8G

I created the constraint on the Listing id:

CREATE CONSTRAINT ON (listing:Listing) ASSERT listing.id IS UNIQUE

And changed parallel to false:

CALL apoc.periodic.iterate("
  CALL apoc.load.csv('file:///pictures.csv',{
    mapping:{
      id: {type:'int'},
      listing: {type:'int'}
    }
  }) YIELD map AS row RETURN row
"," 
  CREATE (p:Picture) SET p = row
  WITH p
  MATCH (l:Listing)
  WHERE p.listing = l.id
  CREATE (p)-[:PICTURE_OF]->(l)
", {batchSize:10000, parallel:false, iterateList:true});

It created 760k Picture nodes and relationships this time but still ran out of heap memory. I'm confused why it would run out of memory, since apoc.periodic.iterate is supposed to commit each batch separately?

Which version of Neo4j and which version of APOC?

Depending on the versions used, you could try prefixing your outer query with CYPHER runtime=INTERPRETED or CYPHER runtime=SLOTTED.
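The prefix goes in front of the outer call, something like this (a sketch, untested):

CYPHER runtime=SLOTTED
CALL apoc.periodic.iterate("
  <your read query>
","
  <your write query>
", {batchSize:10000, parallel:false});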

You can also try decreasing your batchSize, maybe try 5000 or even 1000.
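You could also try rewriting the inner query so the index lookup is explicit and the Picture is only created once its Listing is found (a sketch, assuming your csv columns are id and listing):

CALL apoc.periodic.iterate("
  CALL apoc.load.csv('file:///pictures.csv',{
    mapping:{
      id: {type:'int'},
      listing: {type:'int'}
    }
  }) YIELD map AS row RETURN row
","
  MATCH (l:Listing {id: row.listing})
  CREATE (p:Picture {id: row.id})
  CREATE (p)-[:PICTURE_OF]->(l)
", {batchSize:1000, parallel:false});

Matching with an inline property map (l:Listing {id: row.listing}) makes the index seek obvious to the planner, and doing the MATCH before the CREATE avoids leaving orphan Picture nodes when a listing id is missing from the csv.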

Thank you for the suggestions! I'm using Neo4j Desktop 1.2.4 running Neo4j 4.0.0, and the APOC version is 4.0.0.3. I prefixed my outer query with CYPHER runtime=INTERPRETED and it capped out at 625k. I also tried CYPHER runtime=SLOTTED and it capped out at 710k.

Can you clarify what you mean by "capped out at"? Did you run out of heap at this point?

If so, please raise an issue on the APOC GitHub issues page; it might be something new caused by the 4.0 changes.

Yes, I ran out of heap at this point (sorry for using an inaccurate word). I will raise an issue on the APOC GitHub page. Thank you again for your help!

An INDEX and a CONSTRAINT are not the same. A constraint ensures that a property is unique; an index improves the performance of MATCH or MERGE queries.

Thank you for pointing that out! I created the constraint because I read this in the documentation (3.5 Defining a schema):

Adding the unique constraint will implicitly add an index on that property. If the constraint is dropped, but the index is still needed, the index will have to be created explicitly.

I wanted to make sure each of my Listing nodes has a unique id and is indexed to improve performance. Would you say it's better if I just create the index without adding the unique constraint?
Thank you again for the help!

That's great. Somehow I missed this part while going through the documentation. Thanks for sharing.
No, it's better to have the constraint when you want to make sure you are creating only one node per id.