Hi, I have a challenge optimising a query that currently takes far too long to run.
Some background: the full database has 90m nodes and 700m relationships, although this query is being tested on a much smaller sample database of 1m nodes and 20m relationships, where the issue is the same.
What I am trying to achieve is an aggregate count of a specific relationship between two already related nodes. This actually works - the issue is that it takes an extremely long time, even on the smaller sample database.
For example, a Person has opened a Campaign 5 times, so there are 5 :OPENED relationships between that Person and Campaign. The objective is to create a new summary relationship, TOTAL_OPENED, that carries this total count as a property.
// Starting with this
(p:Person)-[:OPENED]->(c:Campaign) (x5 times on the same Person and Campaign node pairs)
// Ending with this
(p:Person)-[newRel:TOTAL_OPENED]->(c:Campaign) SET newRel.total = 5
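In plain Cypher, without any batching, the transformation being applied is roughly this (a simplified sketch of the full query shown further below):

```cypher
// Count the OPENED relationships per (Person, Campaign) pair,
// then materialise that count as a TOTAL_OPENED summary relationship.
MATCH (p:Person)-[r:OPENED]->(c:Campaign)
WITH p, c, count(r) AS total
CREATE (p)-[newRel:TOTAL_OPENED]->(c)
SET newRel.total = total
```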
Here is the shape of the model I am starting with (note that this is just one sample node pair being returned - the query actually creates a new relationship for 1m node pairs):
Here is how it ends up (note the new TOTAL_OPENED relationship created):
Here is the query being run:
PROFILE MATCH (p:Person)-[rel:OPENED]->(c:Campaign)
WHERE NOT (p)-[:TOTAL_OPENED]->(c)
WITH p, rel, c LIMIT 2000
WITH COUNT(c) AS total, p, c
CREATE (p)-[totalRel:TOTAL_OPENED]->(c)
SET totalRel.total = total
RETURN p.emailId, total, c.campaignId
This query is being run with apoc.periodic.commit in batches of 2,000.
I have also tried putting the COUNT(c) AS total before the LIMIT, but that seems to increase the number of DB hits rather than reduce it.
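For reference, the apoc.periodic.commit call looks roughly like this (reconstructed from the query above; per the APOC docs, the inner statement must use $limit and return a count, which apoc.periodic.commit re-runs until it reaches 0 - so the inner RETURN here is the batch row count rather than the per-row values shown above):

```cypher
CALL apoc.periodic.commit(
  "MATCH (p:Person)-[rel:OPENED]->(c:Campaign)
   WHERE NOT (p)-[:TOTAL_OPENED]->(c)
   WITH p, rel, c LIMIT $limit
   WITH count(c) AS total, p, c
   CREATE (p)-[totalRel:TOTAL_OPENED]->(c)
   SET totalRel.total = total
   RETURN count(*)",
  {limit: 2000}
)
```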
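The variant with the aggregation before the LIMIT looked roughly like this (reconstructed - only the middle WITH clauses changed, the rest of the statement was the same):

```cypher
MATCH (p:Person)-[rel:OPENED]->(c:Campaign)
WHERE NOT (p)-[:TOTAL_OPENED]->(c)
// Aggregate first, then limit the number of (p, c) pairs per batch
WITH p, c, count(rel) AS total LIMIT 2000
CREATE (p)-[totalRel:TOTAL_OPENED]->(c)
SET totalRel.total = total
```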
Here is a profile summary of the query:
Using apoc.periodic.commit in 2k batches, it took 1 hour 6 minutes to complete on the 1m-node sample database.
Appreciate any help and all suggestions, thanks.