Hi,
I am trying to clone a large subgraph (the approach needs to scale to millions of nodes).
Currently I am using apoc.refactor.cloneSubgraphFromPaths in the following manner:
MATCH path=(doc:Document)-[*]->(:Node)
WHERE id(doc) = 23398
WITH doc, collect(path) as paths
CALL apoc.refactor.cloneSubgraphFromPaths(paths) YIELD input, output, error
SET output:Temp
WITH output
WHERE output:Document
RETURN id(output)
This works well, except that as the subgraph grows, the database's memory usage climbs until it hits an OOM error and the database has to be restarted.
Currently I'm running the database in Docker with just 2GB of RAM. The plan is to experiment on smaller data sets (a few hundred thousand nodes) and get the clone to complete without exceeding that 2GB, because I'd like to be able to clone a subgraph of any size without worrying about it running out of memory.
I have tried copying a method I found in this presentation (slide 45) to do the cloning in batches (the presentation uses the method to delete nodes):
WITH range(2, 54) AS highr
UNWIND highr AS i1
CALL apoc.periodic.commit(
'MATCH (doc:Document) WHERE ID(doc) = 541714
WITH range(0, 9999) AS lowr, doc
UNWIND lowr AS i2
WITH '+i1+' * 10000 + i2 AS id WHERE id < 542330
MATCH path=(doc)-[*]->(n:Node)
WITH doc, COLLECT(path) AS paths LIMIT 10000
CALL apoc.refactor.cloneSubgraphFromPaths(paths) YIELD output, error
SET output:Temp',
{batchSize:10000}
) YIELD updates
RETURN []
Firstly, this just didn't work: the query finishes after about 8 seconds and no nodes have been cloned.
Secondly, I realised this approach has a flaw even if I get it working: relationships between nodes cloned in separate batches would not be created, and I see no obvious way to rectify that.
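The closest thing I can think of for the cross-batch relationship problem is a two-pass clone: copy the nodes first, recording on each copy which node it came from, then recreate the relationships in a second batched pass. The following is an untested sketch, not a known-good solution. The `_cloneOf` property is a name made up for illustration, and the sketch assumes apoc.path.subgraphNodes, apoc.refactor.cloneNodes, apoc.create.relationship and apoc.periodic.iterate are available and behave as documented in the installed APOC version. Whether the first pass itself stays within the memory limit is also an open question:

```cypher
// Pass 1: clone every node reachable from the document, without relationships,
// committing in batches. cloneNodes yields `input` (the id of the source node)
// and `output` (the new node), so each copy can remember its origin.
// `_cloneOf` is a hypothetical bookkeeping property.
CALL apoc.periodic.iterate(
  'MATCH (doc:Document) WHERE id(doc) = 23398
   CALL apoc.path.subgraphNodes(doc, {relationshipFilter: ">"}) YIELD node
   RETURN node',
  'CALL apoc.refactor.cloneNodes([node], false) YIELD input, output
   SET output:Temp, output._cloneOf = input',
  {batchSize: 10000}
) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;

// An index on :Temp(_cloneOf) would be needed here to keep the lookup in
// pass 2 cheap (the index syntax depends on the Neo4j version).

// Pass 2: for each copy, walk the outgoing relationships of its source node
// and recreate them between the corresponding copies. Relationships whose
// target was not cloned find no match for `tc` and are skipped.
CALL apoc.periodic.iterate(
  'MATCH (c:Temp) WHERE c._cloneOf IS NOT NULL RETURN c',
  'MATCH (s)-[r]->(t) WHERE id(s) = c._cloneOf
   MATCH (tc:Temp {_cloneOf: id(t)})
   CALL apoc.create.relationship(c, type(r), properties(r), tc) YIELD rel
   RETURN count(rel)',
  {batchSize: 10000}
) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;
```

These would be run as two separate statements. Because every copy carries the id of its source, the second pass does not care which batch created a given node, which is the property the batched cloneSubgraphFromPaths attempt lacks. It does assume no :Temp nodes are left over from an earlier clone of the same subgraph, and that node ids are not reused between the two passes.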
Is anybody aware of any graph algorithms or procedures I could use to achieve this?
I should note that taking the database down and using offline tools is not a feasible option; we'd like to do it "live".
Thanks in advance.