SUM on node property SLOW

Neo4j Version: 3.5.0

Trying to compute a sum over a node property:

MATCH (a:Author)-[p:Published]->(b:Book) 
WITH a.author_id AS author_id, SUM(b.read_count) AS total_read_count 
WHERE total_read_count < 500 
RETURN author_id, total_read_count

I have indexes on:

  • Author node author_id property
  • Book node read_count property

Number of Author Nodes: 7696051
Number of Book Nodes: 1099438

The Query takes 30 seconds to give the complete result.
Any way to get it in less time by optimizing the query?
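
One standard first step (not mentioned above, just a suggestion): run the query under PROFILE to see which operators dominate the runtime and how many db hits each produces:

```cypher
// Prefix the query with PROFILE to get the execution plan
// annotated with row counts and db hits per operator
PROFILE
MATCH (a:Author)-[p:Published]->(b:Book)
WITH a.author_id AS author_id, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
RETURN author_id, total_read_count
```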

If I understand correctly, a single Author publishes books, and we are looking for all authors whose books have fewer than 500 reads in total, across all of the author's books? Is this correct?

If that's the case, then the first thing we should do is throw out any book that has more than 500 reads.
For example if

   Author:  Stephen King publishes  Book: The Green Mile    read_count: 4,029,232
   Author:  Stephen King publishes  Book: The Green Mile Sequel    read_count: 372

then this author would not appear in the result set, since the sum of all read counts for this author is
4,029,604.

So we should be able to

MATCH (b:Book) WHERE b.read_count < 500
WITH b
MATCH (b)<-[p:Published]-(a:Author)
WITH a.author_id AS author_id, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
RETURN author_id, total_read_count

What @dana.canzano said, plus create an index on :Book(read_count).
Also, don't group by a.author_id but by a, and access author_id later.

Also what is your memory config?

@dana.canzano Looking for Authors whose total_read_count including all their books is still less than 500 reads.

@michael.hunger
Index is already there
INDEX ON :Book(read_count) │ [Book] │ [read_count] │ ONLINE │ node_label_property │ 100.0 │ {"version":"1.0","key":"lucene"}

Grouping by a instead of a.author_id didn't improve the response time.

Memory Config
Pagecache: 50g
heap: 40g

Which Neo4j version? Please upgrade to the latest.
Can you drop and recreate the index so you get a native index?

And can you show the PROFILE output of Dana's suggestion?
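
A sketch of the drop/recreate, assuming the 3.5 index syntax; in 3.5, a newly created index should use the native provider (native-btree-1.0) unless dbms.index.default_schema_provider was changed:

```cypher
// Drop the old lucene-backed index, then recreate it
DROP INDEX ON :Book(read_count);
CREATE INDEX ON :Book(read_count);

// Verify the provider afterwards
CALL db.indexes();
```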

Yes, and

MATCH (b:Book) WHERE b.read_count < 500
WITH b
MATCH (b)<-[p:Published]-(a:Author)
WITH a.author_id AS author_id, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
RETURN author_id, total_read_count

should satisfy your question, no?

@michael.hunger neo4j version 3.5.0

Will try creating a native index

Attaching the profile of @dana.canzano's suggestion.


This query took this long: Started streaming 110570 records after 35529 ms and completed after 35659 ms, displaying first 1000 rows.

@dana.canzano Yes, my bad. This should satisfy my question, but the response time is still above 30 seconds.

@dana.canzano Is there anything that can be done to improve this?

Although I can't say whether an upgrade would help, you are running 3.5.0, which is not the most recent 3.5.x release. Have you tried a newer release, for example 3.5.11?

Also, I'm interested to know how you arrived at the configuration settings of

Memory Config
Pagecache: 50g
heap: 40g

Typically we do not see customers with such a large heap. Can you provide more details on how these values were determined?
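
For reference, a more typical split (illustrative values only, not a recommendation for this specific dataset) would keep the heap at or below roughly 31g so the JVM can use compressed object pointers, and give the rest of the RAM to the page cache in neo4j.conf:

```
dbms.memory.heap.initial_size=16g
dbms.memory.heap.max_size=16g
dbms.memory.pagecache.size=60g
```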

Assuming author_id is unique on :Author nodes, do your aggregation on the node itself, not its property:

...
WITH a, SUM(b.read_count) AS total_read_count 
...

After the aggregation you can project out a.author_id AS author_id. That should help as you'll only perform property access once author is distinct in your query.
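
Putting that together, the full query would look something like this (a sketch of the suggestion above, aggregating on the node and projecting the property afterwards):

```cypher
MATCH (a:Author)-[:Published]->(b:Book)
// aggregate on the node itself, not on a.author_id
WITH a, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
// property access happens once per distinct author, after the aggregation
RETURN a.author_id AS author_id, total_read_count
```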

As for the approach of filtering out books with a read count of 500 or more, I don't think that's the way to go; or rather, if you want to do that kind of filtering, it's incomplete. Although it does get us authors who have published books with a read count under 500, it only performs the aggregation over those books. So if an author has two books, one with a read count of 1000 and one with a read count of 200, the book with the higher read count is thrown out. It is then as if the author only has the 200-read-count book, and the author will be incorrectly returned by the query.

The approach that I think was intended here was to match authors with books of 500 or more reads, and filter out the authors, not the books, so that when we continue the query to find authors of books we do not count the authors who have already been filtered out:

MATCH (b:Book)<-[:Published]-(a:Author)
WHERE b.read_count >= 500
WITH collect(DISTINCT a) AS filteredOutAuthors
MATCH (b)<-[:Published]-(a:Author)
WHERE NOT a IN filteredOutAuthors
WITH a, SUM(b.read_count) AS total_read_count
WHERE total_read_count < 500
RETURN a.author_id AS author_id, total_read_count

That said, the resulting list of filteredOutAuthors may be too large for the heap, and may be inefficient for filtering, so this may not be faster. Give it a try though and see how it fares.

There's an eager aggregation here for the sum as well.

You might see if using a slightly different approach using pattern comprehensions and APOC functions (for summing values across a list) might perform any better, as this should allow streaming of the query, as it avoids the eager aggregation from sum():

MATCH (a:Author)
WITH a, apoc.coll.sum([(a)-[:Published]->(b:Book) | b.read_count]) as total_read_count
WHERE total_read_count < 500 
RETURN a.author_id AS author_id, total_read_count

This query took more time than the previous ones, 58 seconds to be precise.
I will go ahead with the first one, since that takes almost 28-30 seconds, and will cache the response.

If I create a total_read_count property on the Author node, will that help in some way, instead of doing the aggregation each time?
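
Precomputing would trade write-time work for read-time speed, and the property would need to be refreshed whenever read counts change. A sketch of how it might be materialized, using APOC's apoc.periodic.iterate to batch the updates (assuming APOC is installed):

```cypher
// One-off (or periodic) job: store the per-author sum on the node,
// processing authors in batches of 10k to keep transactions small
CALL apoc.periodic.iterate(
  "MATCH (a:Author) RETURN a",
  "WITH a, [(a)-[:Published]->(b:Book) | b.read_count] AS counts
   SET a.total_read_count = apoc.coll.sum(counts)",
  {batchSize: 10000, parallel: false});
```

The read query then becomes a simple property lookup, which can also be indexed:

```cypher
CREATE INDEX ON :Author(total_read_count);

MATCH (a:Author) WHERE a.total_read_count < 500
RETURN a.author_id AS author_id, a.total_read_count AS total_read_count;
```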

Thanks for all the help @dana.canzano @michael.hunger @andrew.bowman

@dana.canzano Sure will see if we can update

I don't know why we have such a large heap size, since it was already configured that way. I will let you know when I find out. :slight_smile: