Neo4J: Multiple Aggregations without Repeating Match

How do I do multiple aggregations in Neo4j without repeating my MATCH query?

Let's say I have a database consisting of customers, the restaurants they go to, and the record of the bills for each customer/restaurant for a particular year. Note that a customer may frequent different restaurants. I want to get a list of customer-restaurant pair where the total bill amount for the year is at the 90th percentile and above.

I've tried something like the following, but Neo4j does not allow multiple aggregations to be used within a single query:

MATCH (c:Customer)-[:PAID]->(b:Bill)-[:AT]->(r:Restaurant)
WITH c, r, sum(b.amount) AS total_amount
WHERE total_amount > percentileCont(total_amount, 0.9)
RETURN c, r

The above syntax will return an error. I could try the following alternative instead, which does work:

MATCH (c:Customer)-[:PAID]->(b:Bill)-[:AT]->(r:Restaurant)
WITH c, r, sum(b.amount) AS total_amount
WITH percentileCont(total_amount, 0.9) as percentile

MATCH (c:Customer)-[:PAID]->(b:Bill)-[:AT]->(r:Restaurant)
WITH c, r, sum(b.amount) AS total amount, per
WHERE total_amount > percentile
RETURN c,r

However, this is rather inefficient as I have to run the exact same query twice. Is there a more elegant way to do this?

I think you might be looking for something like the below? You can chain multiple aggregations together. The issue is that you're passing a single value into the percentileCont function, rather than letting it calculate on a list of values (passing in b.amount instead).

MATCH (c:Customer)-[:PAID]->(b:Bill)-[:AT]->(r:Restaurant)
WITH c.name as customer, r.name as restaurant, sum(b.amount) as total_amount, percentileCont(b.amount, 0.9) as percentile
WHERE total_amount > percentile
RETURN customer, restaurant, total_amount, percentile

Hope this helps!

Cheers,
Jennifer

Jennifer's advice is good, you can sum the bills per customer/restaurant combo at the same time you calculate the percentile.

Some other things to help: you can defer property projection (such as names) until after the aggregations and filtering, as that allows you to project the property only once per customer/restaurant, and only after you're sure the restaurant passes the filter.

Also, to minimize heap issues, you may want to start at customers and use subqueries, as that allows the aggregations to be performed per customer, and that requires materializing fewer results than needing to aggregate across all results.

MATCH (c:Customer)
CALL {
 WITH c
 MATCH (c)-[:PAID]->(b:Bill)-[:AT]->(r:Restaurant)
 WITH r, sum(b.amount) as total_amount, percentileCont(b.amount, 0.9) as percentile
 WHERE total_amount > percentile
 RETURN r.name as restaurant, total_amount, percentile
}
RETURN customer.name as customer, restaurant, total_amount, percentile
1 Like

Thanks a lot, Jennifer, for your help.

However, what I'm trying to do right here is more like

MATCH (c:Customer)-[:PAID]->(b:Bill)-[:AT]->(r:Restaurant)
WITH c, r, sum(b.amount), percentileCont(sum(b.amount), 0.9)
WHERE sum(b.amount) > percentileCont(sum(b.amount), 0.9)
...

Given that sum() is an aggregation and percentileCont() is also an aggregation, I ran into problem in line 2 and 3 in trying to nest two aggregations together in a single query (i.e. percentileCont(sum(b.amount), 0.9))

(A customer can visit a restaurant multiple time, and get multiple bills in return. The customer may spend very little on each visit, but have so many visits that he becomes the top spender. I'm trying to find out, of the total amount spent by a particular customer at a particular restaurant for the year, which customer-restaurant pair is at the 90th and above percentile)

Thanks Andrew for the advice. However, there's something I don't quite understand here.

When we do this:

WITH c, r, sum(b.amount)

We are trying to do a 'group by' c, r and then sum up the bill

However, with your sub-query:

WITH c
MATCH (c)-[:PAID]->(b:Bill)-[:AT]->(r:Restaurant)
WITH r, sum(b.amount) as total_amount

Are we trying to achieve the same effect here? Will sum(b.amount) still be 'grouped by' both c and r?

Yes. Within a subquery, whatever variables were in scope just before the call remain, even if you are ignoring them within the subquery call. They will still be there after the call is finished for that row, and as such those variables act as a pseudo grouping key.

While the aggregation within the query doesn't need to know about it or consider it, the effect is the same, because the subquery was called for a particular c node.

If the logic in our subquery didn't require matching from the c node at all, we could even omit the WITH c and anything to do with c in the subquery, and any aggregation within the subquery would still effectively come out as if it was grouped with respect to c, because a particular c was in scope at the point of the call.