Is it possible to use apoc.refactor.cloneSubgraphFromPaths (or similar) to clone a large graph in batches to prevent OOM?

Hi,

I am attempting to clone a fairly large subgraph (it needs to support cloning millions of nodes).
Currently I have been using apoc.refactor.cloneSubgraphFromPaths in the following manner:

MATCH path = (doc:Document)-[*]->(:Node)
WHERE id(doc) = 23398
WITH doc, collect(path) AS paths
CALL apoc.refactor.cloneSubgraphFromPaths(paths) YIELD input, output, error
// Tag every clone so the copies can be found later
SET output:Temp
// Return only the clone of the root document
WITH output
WHERE output:Document
RETURN id(output)

This works great, except that as the subgraph grows, the database's memory usage climbs until it hits an OOM error and the database has to be restarted.
Currently I'm running the database in Docker with just 2GB of RAM, the plan being to experiment on smaller data sets (a few hundred thousand nodes) and get the clone to succeed without exceeding that 2GB. The reasoning is that I'd like to be able to clone a subgraph without worrying about it running out of memory.

I have tried copying a method I found in this presentation (slide 45) to do the cloning in batches (the presentation uses the method to delete nodes):

WITH range(2, 54) AS highr
UNWIND highr AS i1
CALL apoc.periodic.commit(
    'MATCH (doc:Document) WHERE ID(doc) = 541714
     WITH range(0, 9999) AS lowr, doc
     UNWIND lowr AS i2
     WITH '+i1+' * 10000 + i2 AS id WHERE id < 542330
     MATCH path=(doc)-[*]->(n:Node)
     WITH doc, COLLECT(path) AS paths LIMIT 10000
     CALL apoc.refactor.cloneSubgraphFromPaths(paths) YIELD output, error
     SET output:Temp',
     {batchSize:10000}
) YIELD updates
RETURN []

Firstly, this just didn't work: after about 8 seconds the query finished and there were no cloned nodes.
Secondly, I realised a flaw in this approach (even if I got it working): relationships whose endpoints are cloned in separate batches would never be created, and I see no obvious way to rectify that beyond something like the (completely untested) two-pass sketch below.
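
For concreteness, here is the kind of two-pass bookkeeping I imagine it would take. This is a rough sketch only, untested; the _cloneOf property is a name I've made up for tracking which original each copy came from, and I have no idea whether it actually keeps memory bounded:

// Pass 1: clone the subgraph's nodes in batches, recording the
// original's internal id on each copy in the (made-up) _cloneOf property.
CALL apoc.periodic.iterate(
  'MATCH (doc:Document) WHERE id(doc) = 23398
   CALL apoc.path.subgraphNodes(doc, {}) YIELD node
   RETURN node',
  'CALL apoc.refactor.cloneNodes([node], false) YIELD input, output
   SET output:Temp, output._cloneOf = input',
  {batchSize: 10000});

// Pass 2: recreate the relationships between the copies, also in batches,
// by joining each clone back to its original via _cloneOf.
CALL apoc.periodic.iterate(
  'MATCH (src:Temp)
   MATCH (origSrc) WHERE id(origSrc) = src._cloneOf
   MATCH (origSrc)-[r]->(origDst)
   MATCH (dst:Temp {_cloneOf: id(origDst)})
   RETURN src, r, dst',
  'CALL apoc.create.relationship(src, type(r), properties(r), dst) YIELD rel
   RETURN count(*)',
  {batchSize: 10000});

It also leaves the _cloneOf properties to clean up afterwards, so I'd still appreciate pointers to anything proven.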

Is anybody aware of any graph algorithms or procedures I could use to try to achieve my goal?

I should note that taking the database down and using offline tools is not a feasible option; we'd like to do it "live".

Thanks in advance.

Did you get any solution? I am also facing the same issue.

The code in the presentation that you reference implements an update of a collection of entities based on their range of internal identifiers. It splits the range into groups of 1000 and processes each batch with apoc.periodic.commit. I don't see how that applies to your scenario: you are cloning a single subgraph that originates from one root node, and since the cloneSubgraphFromPaths procedure takes the entire collection of paths, I don't see how to batch it using apoc.periodic.commit.
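
For reference, apoc.periodic.commit keeps re-running its statement in separate transactions until the statement returns 0, so the inner query has to end by RETURNing a count of the work done. Your attempt above never RETURNs anything, which is likely why it finished after a few seconds with nothing cloned. The canonical pattern (a batched delete, roughly what the presentation does, per the APOC documentation) looks like:

// commit re-runs the statement in fresh transactions
// until the final RETURN count(*) comes back as 0
CALL apoc.periodic.commit(
  'MATCH (n:Temp)
   WITH n LIMIT $limit
   DETACH DELETE n
   RETURN count(*)',
  {limit: 10000})
YIELD updates, executions
RETURN updates, executions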

Your memory issue as the graph grows may be due to the size of the path data. Keep in mind that the collection of paths originating from a single root node and traversing to a terminal node will contain a lot of redundant data: paths share common segments, and a single long path will be accompanied by matches of length 1, 2, 3, and so on, all the way to the end of the path.
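
To see the blow-up concretely, consider a toy three-node chain (hypothetical data, safe to run in a scratch database):

// Build a small chain hanging off a Document.
CREATE (:Document {name: 'toy'})-[:R]->(:Node {i: 1})-[:R]->(:Node {i: 2})-[:R]->(:Node {i: 3});

// The variable-length pattern matches every reachable :Node, so this
// returns 3 paths, of lengths 1, 2 and 3; collect(path) would store the
// leading segment of the chain once per path.
MATCH path = (:Document {name: 'toy'})-[*]->(:Node)
RETURN length(path);

For a chain of n nodes that is on the order of n²/2 node references held in memory at once, which adds up quickly at the scale you describe.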

Anyway, maybe using apoc.refactor.cloneSubgraph will help, as it just needs the collection of unique nodes that comprise the subgraph, and the nodes can be determined with one of the apoc.path procedures. I got the following to work with my own test data. Maybe it will help with your memory footprint, assuming the results are the same. Worth a try?

MATCH (doc:Document WHERE id(doc) = 23398)
CALL apoc.path.subgraphNodes(doc, {labelFilter: "Node"}) YIELD node
WITH collect(node) AS nodes
CALL apoc.refactor.cloneSubgraph(nodes) YIELD output
SET output:Temp
WITH output
WHERE output:Document
RETURN id(output)

What is the purpose of line 7 ('WHERE output:Document')? Do the subgraph nodes have labels 'Node' and 'Document'?