Is it possible to use apoc.refactor.cloneSubgraphFromPaths (or similar) to clone a large graph in batches to prevent OOM?



I am attempting to clone a fairly large subgraph (the approach needs to support cloning millions of nodes).
Currently I am using apoc.refactor.cloneSubgraphFromPaths in the following manner:

MATCH path=(doc:Document)-[*]->(:Node)
WHERE id(doc) = 23398
WITH doc, collect(path) as paths
CALL apoc.refactor.cloneSubgraphFromPaths(paths) YIELD input, output, error
SET output:Temp
WITH output 
WHERE output:Document
RETURN id(output)

This works great, except that as the subgraph size increases, the memory usage of the db grows to the point of hitting an OOM error, and the db has to be restarted.
Currently I'm running the database within Docker and giving it just 2GB of RAM. The plan is to try things out on smaller data sets (a few hundred thousand nodes) and get them cloning successfully without exceeding that 2GB. The reasoning is that I'd like to be able to clone a subgraph without worrying about it potentially running out of memory.

I have tried copying a method I found in this presentation (slide 45) to do the cloning in batches (the presentation uses the method to delete nodes):

WITH range(2, 54) AS highr
UNWIND highr AS i1
CALL apoc.periodic.commit(
    'MATCH (doc:Document) WHERE ID(doc) = 541714
     WITH range(0, 9999) AS lowr, doc
     UNWIND lowr AS i2
     WITH '+i1+' * 10000 + i2 AS id WHERE id < 542330
     MATCH path=(doc)-[*]->(n:Node)
     WITH doc, COLLECT(path) AS paths LIMIT 10000
     CALL apoc.refactor.cloneSubgraphFromPaths(paths) YIELD output, error
     SET output:Temp'
) YIELD updates

Firstly, this just didn't work. After about 8s the query finishes and no nodes have been cloned.
Secondly, I realised a flaw in this approach (even if I got it working): relationships between nodes cloned in separate batches would not be created, and I see no obvious way to rectify that.
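
For illustration only, the cross-batch relationship problem described above might be sidestepped by splitting the clone into two passes: first copy the nodes in batches while recording each original id on its clone, then recreate the relationships in batches by looking the clones up through that recorded id. The following is a minimal, untested sketch under several assumptions: `sourceId` and the index name `temp_source_id` are hypothetical names, the `CREATE INDEX ... IF NOT EXISTS` syntax assumes a reasonably recent Neo4j version, and the traversal clones everything reachable from the document over outgoing relationships rather than only paths ending in a :Node. It swaps apoc.refactor.cloneSubgraphFromPaths for apoc.periodic.iterate, apoc.path.subgraphNodes, apoc.create.node and apoc.create.relationship.

```cypher
// Run each statement separately. `sourceId` is a hypothetical temporary
// property holding the id of the original node on each clone.
CREATE INDEX temp_source_id IF NOT EXISTS FOR (t:Temp) ON (t.sourceId);

// Pass 1: clone the nodes in batches of 10000, without relationships.
CALL apoc.periodic.iterate(
  'MATCH (doc:Document) WHERE id(doc) = $docId
   CALL apoc.path.subgraphNodes(doc, {relationshipFilter: ">"}) YIELD node
   RETURN node AS n',
  'CALL apoc.create.node(
     labels(n) + ["Temp"],
     apoc.map.setKey(properties(n), "sourceId", id(n))
   ) YIELD node AS clone
   RETURN count(*)',
  {batchSize: 10000, params: {docId: 23398}}
) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;

// Pass 2: recreate the relationships in batches, finding both endpoints
// through the recorded sourceId so batch boundaries no longer matter.
CALL apoc.periodic.iterate(
  'MATCH (doc:Document) WHERE id(doc) = $docId
   CALL apoc.path.subgraphNodes(doc, {relationshipFilter: ">"}) YIELD node
   MATCH (node)-[r]->(m)
   RETURN id(node) AS fromId, id(m) AS toId, type(r) AS relType, properties(r) AS props',
  'MATCH (a:Temp {sourceId: fromId})
   MATCH (b:Temp {sourceId: toId})
   CALL apoc.create.relationship(a, relType, props, b) YIELD rel
   RETURN count(*)',
  {batchSize: 10000, params: {docId: 23398}}
) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;
```

Because each batch commits in its own transaction, a clone made this way is not atomic: writes to the source subgraph between the two passes could leave the copy inconsistent, and a second clone would need the earlier `Temp` label and `sourceId` values cleaned up first. Whether this stays within a 2GB limit at millions of nodes would have to be measured.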

Is anybody aware of any graph algorithms / procedures I could utilize to try and achieve my goal?

I should note that taking the database down and using offline tools is not a feasible option; we'd like to do it "live".

Thanks in advance.
