Creating weighted edges with apoc.periodic.iterate

Hi -- currently using Neo4j Desktop 1.5.9, APOC 5.17.1

I've got a dataset of ~50,000 "Gene" nodes and I'm trying to create weighted edges with the following Cypher snippet:

CALL apoc.periodic.iterate(
    "MATCH (n1:Gene)
    WITH collect(n1) AS selected_n1, collect(n1) AS selected_n2
    UNWIND selected_n1 AS n1_new
    UNWIND selected_n2 AS n2_new
    WITH n1_new, n2_new, [sample IN n1_new.samplesArray WHERE sample IN n2_new.samplesArray | sample] AS commonSamples
    WHERE n1_new <> n2_new AND size(commonSamples) > 0
    RETURN id(n1_new) as n1_id, id(n2_new) as n2_id, commonSamples",

    "MATCH (n1_new), (n2_new)
    WHERE id(n1_new) = n1_id AND id(n2_new) = n2_id
    MERGE (n1_new)-[r:NUM_SHARED_SAMPLES]-(n2_new)
    ON CREATE SET r.weight = size(commonSamples)
    ON MATCH SET r.weight = size(commonSamples)",

    {batchSize:50000, parallel:false})
YIELD batches, total, timeTaken, errorMessages
RETURN batches, total, timeTaken, errorMessages;

The weights are determined using the "samplesArray" property (a string array) on the two nodes chosen in the first part of the query: the weight is the number of samples shared between the two genes.
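For intuition, the weight computation is just an intersection of the two sample arrays. A minimal Python sketch (the gene sample lists here are made up for illustration):

```python
# Hypothetical samplesArray values for two Gene nodes
gene_a_samples = ["S1", "S2", "S3", "S4"]
gene_b_samples = ["S2", "S4", "S5"]

# Mirrors the Cypher list comprehension:
# [sample IN n1.samplesArray WHERE sample IN n2.samplesArray | sample]
common = [s for s in gene_a_samples if s in gene_b_samples]
weight = len(common)  # 2 ("S2" and "S4" are shared)
```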

This command took a long time to run. I set parallel to false because of conflicts, and I'm honestly unsure what the optimal batchSize should be. It took several days to complete, even though I indexed the nodes (CREATE INDEX FOR (n:Gene) ON (n.samplesArray)) and maximized the heap. What I'm most worried about is that every version of this call I've run reports batches = 19244 and total = 962158054 (no error messages). Overall ~481 million edges have been created, but I don't know whether the command actually covered every pair.
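Those output numbers are internally consistent, which can be checked with a little arithmetic: total is the number of rows the first statement returned (ordered pairs with at least one shared sample), batches is that row count divided by batchSize and rounded up, and since the double UNWIND produces every pair in both orders while the undirected MERGE matches an existing relationship regardless of direction, the edge count comes out to half of total:

```python
import math

total = 962_158_054   # rows returned by the first statement (ordered pairs)
batch_size = 50_000

# batches = ceil(total / batchSize)
batches = math.ceil(total / batch_size)
print(batches)        # 19244, matching the reported output

# Each unordered pair appears twice (once per ordering), and the
# undirected MERGE collapses both orderings into one relationship:
edges = total // 2
print(edges)          # 481079027, i.e. the ~481 million edges observed
```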

I also know that a bunch of commands within this are expensive (MERGE, ON CREATE/MATCH SET, etc.), but I'd love any pointers I can get on optimizing this, and an explanation for the output parameters. Thanks!

Addendum: does anybody know if there is a way to create edges without directionality? In theory, this snippet should run through every pairwise permutation and create the weighted edge if there are matching samples between the gene nodes. But the relationships that have been created get an automatic direction upon creation; no bidirectional edges are made.

Does this give you the same results, and is it any faster?

CALL apoc.periodic.iterate(
    "
     MATCH (n1:Gene), (n2:Gene)
     WHERE id(n1) < id(n2)
     WITH n1, n2, apoc.coll.intersection(n1.samplesArray, n2.samplesArray) as commonSamples
     WHERE size(commonSamples) > 0
     RETURN n1, n2, commonSamples
    ",
    "
     MERGE (n1)-[r:NUM_SHARED_SAMPLES]->(n2)
     ON CREATE SET r.weight = size(commonSamples)
     ON MATCH SET r.weight = size(commonSamples)
    ",
    {batchSize:50000, parallel:false})
YIELD batches, total, timeTaken, errorMessages
RETURN batches, total, timeTaken, errorMessages;
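The id(n1) < id(n2) predicate is what saves the work: each unordered pair is produced exactly once instead of twice, halving the driving rows compared with the double-UNWIND version. A quick sketch of the pair counts for 50,000 nodes (pure combinatorics; the rows actually returned will be fewer once the shared-sample filter is applied):

```python
n = 50_000

# Both orderings of each pair, as in the original double-UNWIND query
ordered_pairs = n * (n - 1)         # 2499950000

# id(n1) < id(n2) keeps exactly one ordering per pair
unordered_pairs = n * (n - 1) // 2  # 1249975000
```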

BTW - your approach should have resulted in bidirectional relationships for NUM_SHARED_SAMPLES, since you allowed the Cartesian product to produce both orderings of each pair, (n1, n2) and (n2, n1), and a MERGE without a direction will create the relationship "to the right". Did you want bidirectional relationships? My solution avoids them.

Hi @glilienfield -- thanks for the response. Currently running some downstream analyses using the edges I already created but will try your snippet as soon as that finishes.

Upon visual inspection, it doesn't seem that bidirectional relationships were created -- see image below

which is why I am worried that not all possible edges were created. Having them bidirectional wasn't a necessity, though, because I end up creating graph projections and doing random walks over non-directional relationships anyway. But based on what you mentioned, I'm guessing the merge did make the direction "to the right" anyway and just made the run more computationally intensive.

I did want to point out that the number of batches in the output ends up being the same across several calls (e.g., parallel vs. non-parallel, and some earlier variants of the command) -- should I be concerned about this? Also, any tips on batch size?