Process 155 million nodes without running out of memory

This should have been a simple query, but I cannot run it without exhausting the Java heap. I have a database of 155 million nodes of type WaterNode. Each node has an integer property 'osm_id' and a unique integer property 'node_id'. I would like to process all nodes and create a NEXT_TO relationship between every pair of nodes that share the same osm_id and have adjacent node_id values. I tried this and similar queries:

CALL {
    MATCH (n1:WaterNode), (n2:WaterNode)
    WHERE n1 <> n2
      AND n1.osm_id = n2.osm_id
      AND n1.node_id - n2.node_id = 1
      AND NOT EXISTS((n1)-[:NEXT_TO]-(n2))
    WITH n1, n2
    MERGE (n1)-[:NEXT_TO]->(n2)
} IN TRANSACTIONS OF 1000 ROWS;

I am using Neo4j Community Edition v5.13.0 with APOC Core and APOC Extended v5.13.0 loaded, and I am running the queries in cypher-shell.

I loaded the nodes from a CSV file where nodes were ordered by node_id like this:
LOAD CSV WITH HEADERS FROM 'file:///path/europe.csv' AS row
CALL {
    WITH row
    CREATE (n:WaterNode {
        node_id: toInteger(row.node_id),
        osm_id: toInteger(row.osm_id),
        name: row.name,
        latitude: toFloat(row.latitude),
        longitude: toFloat(row.longitude)
    })
} IN TRANSACTIONS OF 100000 ROWS;

The 155 million nodes were created in 30 minutes. I can add the relationship creation to the CSV load step if that helps. For the NEXT_TO query I also tried apoc.periodic.iterate, but got the same memory overrun. Can you please advise how to write a query that processes all nodes in such a large database? Thank you.
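
In case it is useful, the apoc.periodic.iterate attempt was along these lines (sketched from memory, so the exact batch size and query text may have differed slightly):

CALL apoc.periodic.iterate(
    // driving statement: find adjacent pairs and return them to the action statement
    "MATCH (n1:WaterNode), (n2:WaterNode)
     WHERE n1 <> n2
       AND n1.osm_id = n2.osm_id
       AND n1.node_id - n2.node_id = 1
       AND NOT EXISTS((n1)-[:NEXT_TO]-(n2))
     RETURN n1, n2",
    // action statement: executed in batches of batchSize rows
    "MERGE (n1)-[:NEXT_TO]->(n2)",
    {batchSize: 1000, parallel: false});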

Your query as written, with the "MATCH a, b" structure, is going to produce up to one to two times 155 million rows of data to process, assuming each node has zero or one 'adjacent' nodes. That must be putting a burden on your memory.

Try refactoring as follows:

MATCH (n1:WaterNode)
CALL {
    WITH n1
    MATCH (n2:WaterNode)
    WHERE n1 <> n2
      AND n1.osm_id = n2.osm_id
      AND n1.node_id - n2.node_id = 1
      AND NOT EXISTS((n1)-[:NEXT_TO]-(n2))
    MERGE (n1)-[:NEXT_TO]->(n2)
} IN TRANSACTIONS OF 1000 ROWS;

If this still doesn't work, we can try refactoring the query into batches.

This may break the work into smaller batches:

:auto
MATCH (n:WaterNode)
WITH max(n.node_id) AS maxId, 100000 AS groupSize
WITH toInteger(maxId / groupSize) + 1 AS numOfBatches, groupSize
UNWIND range(0, numOfBatches - 1) AS batchIndex
CALL {
    WITH batchIndex, groupSize
    UNWIND range(batchIndex * groupSize, (batchIndex + 1) * groupSize - 1) AS id
    MATCH (n1:WaterNode {node_id: id})
    MATCH (n2:WaterNode)
    WHERE n1.osm_id = n2.osm_id
      AND n1.node_id - n2.node_id = 1
      AND NOT EXISTS((n1)-[:NEXT_TO]-(n2))
    MERGE (n1)-[:NEXT_TO]->(n2)
} IN TRANSACTIONS OF 10000 ROWS;

Thank you @glilienfield. The first approach worked and completed the query in 17 hours. This is expected, I guess, for a complexity of O(n²).

Just for the record, I will try a different approach: load the CSV file twice, once to create the nodes and once more to create the relationships between row[i] and row[i+1] by matching on node_id. I hope to process the database faster that way.
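
As a sketch, assuming the same europe.csv file, the property names above, and an index on node_id so the per-row MATCH is a lookup rather than a label scan (the index name water_node_id below is just a placeholder), that second pass could look something like this, keeping the same relationship direction and same-osm_id condition as the working query:

CREATE INDEX water_node_id IF NOT EXISTS FOR (n:WaterNode) ON (n.node_id);

LOAD CSV WITH HEADERS FROM 'file:///path/europe.csv' AS row
CALL {
    WITH row
    // n1 is the node for the current row, n2 the node with the preceding node_id
    MATCH (n1:WaterNode {node_id: toInteger(row.node_id)})
    MATCH (n2:WaterNode {node_id: toInteger(row.node_id) - 1})
    // only link nodes that share the same osm_id
    WHERE n1.osm_id = n2.osm_id
    MERGE (n1)-[:NEXT_TO]->(n2)
} IN TRANSACTIONS OF 100000 ROWS;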
