 Aggregation of node triplets into tuples in parallel

Consider the model (a1:Account)-->(t:Transfer {amount})-->(a2:Account)

where, say, :Account represents a bank account and :Transfer represents a bank transfer containing the numeric property "amount".

What is the best way to create additional relationships representing the total amount flowing between each pair of accounts?
And how can it be made to scale to a graph containing hundreds of millions of :Transfer nodes?

Naively one could do:

```
MATCH (a1:Account)-->(t:Transfer)-->(a2:Account)
WITH a1, a2, sum(t.amount) AS total
CREATE (a1)-[f:total_flow]->(a2)
SET f.amount = total
```

But the above does not use any parallelism, and on a very large graph it will take ages and use up a lot of heap space.

I have been thinking of using apoc.periodic.iterate() in the following way:
```
CALL apoc.periodic.iterate(
  "MATCH (a:Account) RETURN a",
  "MATCH (a)-->(t:Transfer)-->(a2)
   WITH a, a2, sum(t.amount) AS total
   CREATE (a)-[f:total_flow]->(a2)
   SET f.amount = total",
  {batchSize:10000, parallel:true})
```
However, because of the locks the database acquires on node a2, some CREATE operations (concurrently trying to add relationships to a2) will fail, right?
I guess the failure rate could be very high if node a2 has a lot of incoming flows from many other accounts (i.e., many queries will try to add relationships to node a2 in parallel).
Setting an arbitrarily high retry value in apoc.periodic.iterate() does not seem a clean option.
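For reference, apoc.periodic.iterate exposes this as the `retries` entry of its config map; this is a sketch of what the (admittedly unclean) retry-heavy variant would look like:

```
CALL apoc.periodic.iterate(
  "MATCH (a:Account) RETURN a",
  "MATCH (a)-->(t:Transfer)-->(a2)
   WITH a, a2, sum(t.amount) AS total
   CREATE (a)-[f:total_flow]->(a2)
   SET f.amount = total",
  {batchSize:10000, parallel:true, retries:10})
```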

Has anyone tackled this kind of problem, or does anyone have a better strategy for doing this large-scale aggregation?

Is Cypher not the best tool for this kind of operation? Would it be better to create a user-defined procedure that accumulates the total flows in a concurrent Java data structure while scanning the graph, and then writes all the new edges at the end of the accumulation?
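Something along these lines, just as a sketch of the accumulation step: the class and method names are illustrative, and the Neo4j traversal and the final write phase are omitted. A `ConcurrentHashMap` keyed by the (source, target) pair with a `DoubleAdder` per entry would let many scanner threads sum amounts without contention on the database itself:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

// Sketch: thread-safe accumulation of per-(source, target) transfer totals.
// In a real procedure the (sourceId, targetId, amount) records would come
// from a parallel scan of :Transfer nodes; here they are plain arguments.
public class FlowAccumulator {
    // Key is "sourceId->targetId"; DoubleAdder gives low-contention sums.
    private final Map<String, DoubleAdder> totals = new ConcurrentHashMap<>();

    public void record(long sourceId, long targetId, double amount) {
        totals.computeIfAbsent(sourceId + "->" + targetId, k -> new DoubleAdder())
              .add(amount);
    }

    // After the scan, iterate over the map entries and create one
    // total_flow relationship per key with this value.
    public double totalFor(long sourceId, long targetId) {
        DoubleAdder adder = totals.get(sourceId + "->" + targetId);
        return adder == null ? 0.0 : adder.sum();
    }
}
```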

No, it will not fail, as by default it's not running in parallel.

Interesting approach on the statement. I like the idea of having the accounts be the driving point for the aggregation, as you would otherwise need to do global aggregation queries, which are more expensive.

And add the relationship-types if possible!

```
CALL apoc.periodic.iterate(
  "cypher runtime=slotted MATCH (a:Account) RETURN a",
  "MATCH (a)-[:TRANSFER]->(t:Transfer)-[:TRANSFER]->(a2)
   WITH a, a2, sum(t.amount) AS total
   CREATE (a)-[f:total_flow]->(a2)
   SET f.amount = total",
  {batchSize:10000, parallel:false})
```

You could further group by source or target nodes or run a clustering algo if you wanted to identify independent clusters.
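For instance, the same statement can be driven from the target side instead; a sketch, again assuming a :TRANSFER relationship type:

```
CALL apoc.periodic.iterate(
  "MATCH (a2:Account) RETURN a2",
  "MATCH (a)-[:TRANSFER]->(t:Transfer)-[:TRANSFER]->(a2)
   WITH a, a2, sum(t.amount) AS total
   CREATE (a)-[f:total_flow]->(a2)
   SET f.amount = total",
  {batchSize:10000, parallel:false})
```

Whether source- or target-driven batching works better depends on which side of the flow graph has the higher fan-out.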

What is the reason for creating those relationships? Just curious because for graph algorithms you don't necessarily need them.