Consider the model (a1:Account)-->(t:Transfer {amount})-->(a2:Account)
where, say, :Account represents a bank account and :Transfer represents a bank transfer containing the numeric property "amount".
What is the best option to create additional relationships representing the total amount flown between all accounts pairs?
And how to make it scale to a graph containing hundreds of millions of :Transfers?
Naively one could do:
MATCH (a1:Account)-->(t:Transfer {amount})-->(a2:Account)
WITH a1, a2, sum(t.amount) as total
CREATE (a1)-[f:total_flow]->(a2)
SET f.amount = total
But the above does not use any parallelism and on a very large graph it will take ages and use up lots of heap space.
I have been thinking of using apoc.periodic.iterate() in the following way:
CALL apoc.periodic.iterate(
"MATCH RETURN a",
"MATCH (a)-->(t:Transfer)-->(:a2)
WITH a, a2, sum(t.amount) as total
CREATE (a)-[f:total_flow]->(a2)
SET f.amount = total
)
However because of the locking acquired by the db on the node a2 some CREATE operations (concurrently trying to add relationships to a2) will fail, right?
I guess the failure rate could be very high if the node a2 has a lot of incoming flows from many other accounts (i.e., there will be many queries that try to add relationships to the node a2 in parallel).
Setting an arbitrarily high retry value in apoc.periodic.iterate() does not seem a clean option.
Has anyone tackle this kind of problem or has any better strategy to do this large scale aggregation?
Is cypher not the best way to go for this kind of operations? Would it be any better to create a user defined procedure that accumulates the total flows on a concurrent java data structure while scanning the graph and then writes all the new edges at the end of the accumulation?