Updating many nodes in large graph consumes all memory and crashes

m_hess · October 19, 2020, 2:21pm

Hi,

After a lot of research and trial and error, I still can't figure out what I'm doing wrong, so I ended up here writing my first post.

Here's what I'm trying to achieve. I have a graph with ~200M nodes and ~260M relationships. I want to introduce an inferred property like this:

CALL apoc.periodic.iterate(
  'MATCH (n)-[:HAS_LOCATION]->(t) WHERE n.coordinates IS NULL RETURN n,t',
  "SET n.coordinates=t.coordinates",
  {batchSize:10000, parallel:true})

This query crashes Neo4j 4.1.1 Community Edition (with apoc-4.1.0.0-all.jar) every time after a few minutes. In the debug logs I can see entries like

2020-10-18 21:35:40.856+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=123, gcTime=174, gcCount=1}

at the beginning, ramping up to

2020-10-18 21:42:40.558+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=15462, gcTime=15561, gcCount=4}

right before the crash.

The machine has 64GB of RAM. We configured it like this:

dbms.memory.heap.initial_size=24100m
dbms.memory.heap.max_size=24100m
dbms.memory.pagecache.size=30100m
dbms.memory.transaction.global_max_size=10000m
dbms.memory.transaction.max_size=5000m

I tried lowering the batch size (to 500), but it did not help at all. Is there any chance that this is data-related? Any idea what is actually consuming so much memory? Will a node with two outgoing 'HAS_LOCATION' relationships cause issues?

I also tried to count the nodes that the SET operation would apply to:

MATCH (n)-[:HAS_LOCATION]->() WHERE n.coordinates IS NULL RETURN count(*)

This also crashes neo4j. I don't expect it to return fast, as it's scanning all the nodes, but I don't see why it should consume so much memory.
When I rephrase this query as

MATCH (n) WHERE (n)-[:HAS_LOCATION]->() AND n.coordinates IS NULL RETURN count(n)

it completed once and returned 3724 (after 15 minutes). Another invocation of it crashed, too.

Any hints would be appreciated.
-- matt

terryfranklin82 · October 20, 2020, 6:58am

I haven't used a graph with that many nodes & relationships before, but as a starting point have you tried including some labels in your match statement, so that the query doesn't check all 200 million nodes?

nghia71 · October 20, 2020, 12:31pm

Hi,

I think using apoc.periodic.iterate is the right approach.
1/ Perhaps you need to specify which kind of nodes you want. Otherwise the match covers too many nodes.
2/ There may be many MyNNode nodes linked to the same MyTNode, and they cannot all be updated in parallel at the same time.

How about trying this one first:

CALL apoc.periodic.iterate(
'MATCH (n:MyNNode)-[:HAS_LOCATION]->(t:MyTNode) WHERE NOT EXISTS(n.coordinates) RETURN n, t',
"SET n.coordinates=t.coordinates",
{batchSize:100, parallel:false})
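
To illustrate what batchSize:100 with parallel:false means conceptually, here is a minimal Python sketch of sequential batching. It is not APOC's implementation: `batched`, `run_sequentially`, and `apply_batch` are hypothetical names, and the rows stand in for whatever the first statement returns.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_sequentially(rows: Iterable[T], batch_size: int,
                     apply_batch: Callable[[List[T]], None]) -> int:
    """Apply apply_batch to one batch at a time; return the number of batches."""
    count = 0
    for batch in batched(rows, batch_size):
        apply_batch(batch)  # hypothetical stand-in for one transaction per batch
        count += 1
    return count
```

In this sketch only one batch is materialized at a time, and no two batches are ever processed concurrently, so two rows touching the same node cannot compete with each other.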

m_hess · October 21, 2020, 1:36pm

I managed to update all nodes by using a batchSize of 10 and by using labels at both ends of the path. I had to pass all possible combinations of labels manually, but it worked.
In the end, I did not try these things out systematically, and I did not rerun all the Cypher statements multiple times to verify whether the behavior is consistent. So it's hard to tell what actually causes these issues.
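
Writing out every label combination by hand could be scripted. The following is a minimal Python sketch that only builds the query strings; `build_iterate_calls` and the example labels are hypothetical, the labels are assumed to be plain identifiers that need no backtick escaping, and parallel:false is an assumption, since the post does not say which setting was used.

```python
from itertools import product
from typing import Iterable, List


def build_iterate_calls(source_labels: Iterable[str],
                        target_labels: Iterable[str],
                        batch_size: int = 10) -> List[str]:
    """Build one apoc.periodic.iterate call per (source, target) label pair."""
    calls = []
    for src, dst in product(source_labels, target_labels):
        calls.append(
            "CALL apoc.periodic.iterate(\n"
            f"  'MATCH (n:{src})-[:HAS_LOCATION]->(t:{dst}) "
            "WHERE n.coordinates IS NULL RETURN n, t',\n"
            "  'SET n.coordinates = t.coordinates',\n"
            # parallel:false is an assumption, not taken from the post
            f"  {{batchSize:{batch_size}, parallel:false}})"
        )
    return calls


# Hypothetical labels, for illustration only.
for call in build_iterate_calls(["Shop", "Office"], ["City", "Street"]):
    print(call, end="\n\n")
```

Each generated statement can then be run one after another, so every run only touches nodes carrying one specific pair of labels.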

Thanks for your input!

nghia71 · October 29, 2020, 2:42pm

Hi @m_hess,

Michael Hunger wrote a wonderful article that I think can help you: "5 Tips & Tricks for Fast Batched Updates of Graph Structures with Neo4j and Cypher" (Neo4j Developer Blog on Medium).

From my own perspective, the problem occurs when lots of nodes and relationships have to be created or updated simultaneously. The best approach, for me, is to break the graph that needs to be persisted into connected components and then use apoc.periodic.iterate to update those disjoint components in parallel. Without conflicts over shared nodes or relationships, the update/create operation should work.

The question is how to break the graph into connected components. If you already have it in Neo4j, then some algorithms from GDS (Neo4j Graph Data Science) can help. If you only have raw data, I suggest NetworkX (https://networkx.org), which can help identify these components.
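
For raw edge data, the idea can be sketched without any library. The following minimal Python example groups nodes into connected components with a union-find structure; it is an illustration using only the standard library, not the GDS or NetworkX implementation, and the edge list is made up.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def connected_components(
        edges: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Group the endpoints of undirected edges into connected components."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Made-up edges: a-b-c form one component, d-e another.
components = connected_components([("a", "b"), ("b", "c"), ("d", "e")])
print(sorted(sorted(c) for c in components))
```

Because no node belongs to two components, each component could serve as an independent unit of work for a parallel update.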

Hope that helps.

Nghia Doan
