Creating relationships between millions of nodes and running out of heap memory

jdeng · February 19, 2020, 6:44pm

Hi Neo4j community,

I'm trying to create relationships between two groups of nodes. The first group is Listing, with 10 million nodes, and each Listing should be connected to its own Picture nodes. There are 50 million Picture nodes (5 pictures for each listing).
First I loaded the listings CSV and created the 10 million Listing nodes. Then I wrapped my next query in apoc.periodic.iterate. As it loads the pictures CSV and creates each Picture node, it finds the Listing node that the picture should be connected to and creates that relationship. With a batch size of 10k, the heap memory runs out after 30k relationships are created.
Any help would be much appreciated. I'm super new to Neo4j and would love to learn anything I can!

My query to load the listings CSV and create Listing nodes:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('file:///listings.csv',{
  mapping:{
    id: {type:'int'},
    beds: {type:'int'},
    price: {type: 'int'},
    score: {type: 'float'},
    reviews: {type: 'int'}
  }
}) YIELD map AS row RETURN row
","
CREATE (l:Listing) SET l = row
", {batchSize:10000, iterateList:true, parallel:true});

My query to load pictures csv, create Picture nodes, and create relationships between Picture nodes and Listing nodes:

CALL apoc.periodic.iterate("
  CALL apoc.load.csv('file:///pictures.csv',{
    mapping:{
      id: {type:'int'},
      listing: {type:'int'}
    }
  }) YIELD map as row RETURN row
"," 
  CREATE (p:Picture) SET p = row
  WITH p
  MATCH (l:Listing)
  WHERE p.listing = l.id
  CREATE (p)-[:PICTURE_OF]->(l)
", {batchSize:10000, parallel:true, iterateList:true});

ganesanmithun323 · February 19, 2020, 6:54pm

Hi, welcome to the community. You can try a couple of things first:

Increase the heap size in the conf file.
Set the parallel parameter to false.

Also, create an index on Listing(id) before running the second query for better performance.
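
For reference, a minimal sketch of the index suggested above. Which syntax applies depends on the Neo4j version, which has not been stated at this point in the thread; the index name listing_id in the commented variant is a hypothetical name chosen for illustration. Run only one of the two forms.

```cypher
// Legacy syntax, accepted by Neo4j 3.x and 4.x.
CREATE INDEX ON :Listing(id);

// Alternative syntax introduced in Neo4j 4.0, with an explicit
// (hypothetical) index name. Use this instead of the statement above, not in addition:
// CREATE INDEX listing_id FOR (l:Listing) ON (l.id);
```

With an index on Listing(id), a lookup of one Listing by id can be answered by an index seek instead of a scan over all 10 million Listing nodes, which matters because the import performs that lookup once per Picture row.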

jdeng · February 19, 2020, 8:04pm

Thank you! I increased the heap size to 8G and the page cache to 8G (my laptop has 16G of RAM).

dbms.memory.heap.initial_size=8G
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=8G

I created a uniqueness constraint on the Listing id:

CREATE CONSTRAINT ON (listing:Listing) ASSERT listing.id IS UNIQUE

And changed parallel to false:

CALL apoc.periodic.iterate("
  CALL apoc.load.csv('file:///pictures.csv',{
    mapping:{
      id: {type:'int'},
      listing: {type:'int'}
    }
  }) YIELD map as row RETURN row
"," 
  CREATE (p:Picture) SET p = row
  WITH p
  MATCH (l:Listing)
  WHERE p.listing = l.id
  CREATE (p)-[:PICTURE_OF]->(l)
", {batchSize:10000, parallel:false, iterateList:true});

This time it created 760k relationships and Picture nodes but still ran out of heap memory. I'm confused about why it would run out of memory, since apoc.periodic.iterate should execute the inner query separately for each batch?
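
One way to rule out a missing index as a factor is to confirm that the uniqueness constraint above produced an index and that a lookup by id uses it. The following is a sketch only: the id value 12345 is a placeholder, and the exact plan output varies by version.

```cypher
// List the indexes known to the database. The index backing the
// uniqueness constraint on :Listing(id) should appear here.
CALL db.indexes();

// Show the execution plan for a single lookup without running it.
// 12345 is a placeholder id, not a value from the data set.
EXPLAIN
MATCH (l:Listing)
WHERE l.id = 12345
RETURN l;
```

If the index is used, the plan typically contains an index seek operator such as NodeUniqueIndexSeek rather than NodeByLabelScan. This checks lookup cost only; it does not by itself explain the heap exhaustion.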

andrew_bowman · February 19, 2020, 8:20pm

Which version of Neo4j and which version of APOC?

Depending on the versions used, you could try prefixing your outer query with CYPHER runtime=INTERPRETED or CYPHER runtime=SLOTTED.

You can also try decreasing your batchSize, maybe try 5000 or even 1000.
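
A sketch of what the two suggestions above could look like when combined. It assumes that "outer query" means the whole CALL apoc.periodic.iterate statement, which is one reading of the advice; the batchSize of 1000 is one of the values mentioned above. The query body is otherwise the one posted earlier in the thread.

```cypher
// Assumption: the runtime prefix is applied to the outer CALL statement.
// Swap SLOTTED for INTERPRETED to try the other suggested runtime.
CYPHER runtime=SLOTTED
CALL apoc.periodic.iterate("
  CALL apoc.load.csv('file:///pictures.csv',{
    mapping:{
      id: {type:'int'},
      listing: {type:'int'}
    }
  }) YIELD map AS row RETURN row
","
  CREATE (p:Picture) SET p = row
  WITH p
  MATCH (l:Listing)
  WHERE p.listing = l.id
  CREATE (p)-[:PICTURE_OF]->(l)
", {batchSize:1000, parallel:false, iterateList:true});
```

A smaller batch means each inner transaction holds fewer created nodes and relationships before it commits, at the cost of more commits overall.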

jdeng · February 19, 2020, 10:04pm

Thank you for the suggestions! I'm using Neo4j Desktop 1.2.4 running Neo4j 4.0.0. The APOC version is 4.0.0.3. I prefixed my outer query with CYPHER runtime=INTERPRETED and it capped out at 625k. I also tried CYPHER runtime=SLOTTED and it capped out at 710k.

andrew_bowman · February 19, 2020, 10:53pm

Can you clarify what you mean by "capped out at"? Did you run out of heap at this point?

If so, please raise an issue on the APOC GitHub issues page; it might be something new caused by the 4.0 change.

jdeng · February 19, 2020, 11:05pm

Yes, I ran out of heap at this point (sorry for using an inaccurate word). I will raise an issue on the APOC GitHub page. Thank you again for your help!

ganesanmithun323 · February 20, 2020, 5:28am

INDEX and CONSTRAINT are not the same. A constraint ensures that a property is unique. An index improves the performance of MATCH or MERGE queries.

jdeng · February 20, 2020, 6:01am

Thank you for pointing it out! I created the constraint because I read this in the documentation (3.5 Defining a schema):

Adding the unique constraint will implicitly add an index on that property. If the constraint is dropped, but the index is still needed, the index will have to be created explicitly.

I wanted to make sure each of my Listing nodes has a unique id and is indexed to improve performance. Would you say it's better to create just the index, without adding the unique constraint?
Thank you again for the help!

ganesanmithun323 · February 20, 2020, 6:07am

That's great. Somehow I missed this part while going through the documentation. Thanks for sharing.
No, it's better to have the constraint when you want to make sure you are creating only one node per id.
