Hi, I'm currently using Neo4j Desktop 1.5.9 with APOC 5.17.1.
I've got a dataset of ~50,000 "Gene" nodes and I'm trying to create weighted edges with the following Cypher snippet:
CALL apoc.periodic.iterate(
"MATCH (n1:Gene)
WITH collect(n1) AS selected_n1, collect(n1) AS selected_n2
UNWIND selected_n1 AS n1_new
UNWIND selected_n2 AS n2_new
WITH n1_new, n2_new, [sample IN n1_new.samplesArray WHERE sample IN n2_new.samplesArray | sample] AS commonSamples
WHERE n1_new <> n2_new AND size(commonSamples) > 0
RETURN id(n1_new) as n1_id, id(n2_new) as n2_id, commonSamples",
"MATCH (n1_new), (n2_new)
WHERE id(n1_new) = n1_id AND id(n2_new) = n2_id
MERGE (n1_new)-[r:NUM_SHARED_SAMPLES]-(n2_new)
ON CREATE SET r.weight = size(commonSamples)
ON MATCH SET r.weight = size(commonSamples)",
{batchSize:50000, parallel:false})
YIELD batches, total, timeTaken, errorMessages
RETURN batches, total, timeTaken, errorMessages;
The weights are determined using the "samplesArray" property (a string array) on the two nodes chosen in the first part of the query: the weight is the number of samples shared between the two genes.
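To make the weight definition concrete, here is a toy version using two made-up sample lists (the sample names are invented for illustration only, not taken from my data):

```cypher
// Toy example: two hypothetical samplesArray values
WITH ['s1', 's2', 's3'] AS samplesA,
     ['s2', 's3', 's4'] AS samplesB
// Keep only the samples present in both lists, then count them
RETURN size([s IN samplesA WHERE s IN samplesB]) AS weight;
// weight = 2 (s2 and s3 are shared)
```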
This command took several days to complete, even though I indexed the nodes (CREATE INDEX FOR (n:Gene) ON (n.samplesArray)) and maximized the heap. I set parallel to false because of conflicts, and to be honest I'm unsure what the optimal batchSize should be. But what I'm most worried about is the output: every time I've run a version of this call, I get batches = 19244 and total = 962158054 (no error messages). Overall, ~481 million edges have been created, but I don't know if the command maxed out.
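As a rough sanity check on those numbers, here is the arithmetic as a small query. This is only a sketch, and it rests on assumptions: that total is the number of rows returned by the first statement, that batches is total divided by batchSize and rounded up, and that the node count is exactly 50,000 (it is really only approximate).

```cypher
// Assumed inputs: total and batchSize from the run above; geneCount is approximate
WITH 962158054 AS total, 50000 AS batchSize, 50000 AS geneCount
RETURN toInteger(ceil(toFloat(total) / batchSize)) AS expectedBatches, // 19244
       total / 2 AS unorderedPairs,                                    // 481079027
       geneCount * (geneCount - 1) AS maxOrderedPairs;                 // 2499950000
```

If that reading is right, total counts every pair twice (once per ordering), which would line up with the ~481 million edges, and it sits well below the ceiling for 50,000 nodes, so the run does not look like it hit a limit.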
I also know that a bunch of commands within this are expensive (MERGE, ON CREATE/MATCH SET, etc.), but I'd love any pointers I can get on optimizing this, and an explanation for the output parameters. Thanks!
Addendum: does anybody know if there is a way to create edges without directionality too? Supposedly, this code snippet should run through every pairwise permutation and create the weighted edge if there are matching samples between the gene nodes. But the relationships that have been created are automatically given a direction on creation, rather than being bidirectional.
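For reference, my understanding is that Neo4j always stores a relationship with a direction, and that a pattern written without an arrow matches it either way. Below is a minimal sketch of what I think an alternative could look like: visit each unordered pair once (id(a) < id(b)) and ignore direction when reading. The variable names a, b and shared and the batchSize of 10000 are my own placeholders, not tested recommendations.

```cypher
// Sketch: one row per unordered pair, so each edge is merged once
CALL apoc.periodic.iterate(
"MATCH (a:Gene), (b:Gene)
 WHERE id(a) < id(b)
 WITH a, b, size([s IN a.samplesArray WHERE s IN b.samplesArray]) AS shared
 WHERE shared > 0
 RETURN a, b, shared",
"MERGE (a)-[r:NUM_SHARED_SAMPLES]-(b)
 SET r.weight = shared",
{batchSize: 10000, parallel: false})
YIELD batches, total, timeTaken, errorMessages
RETURN batches, total, timeTaken, errorMessages;

// Reading without an arrow matches the stored edge from either end
MATCH (g:Gene)-[r:NUM_SHARED_SAMPLES]-(other:Gene)
RETURN g, other, r.weight
LIMIT 10;
```

With this shape, total should be roughly half of what the original call reports, since each pair is produced once instead of twice.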