Very slow performance when aggregate on node property

Hello,

I'm experimenting with Neo4j for the past 4 months and I'm progressively migrating my production environment from MySQL to Neo4j. I think Neo4j is a great database and is very good to model relationships in my data. But for some simpler queries, it is much slower than MySQL. I'm having a specific hard time when I try to query aggregated data, for some queries, it is 40x slower than the corresponding one in MySQL. I'll provide an example.

I have this data model in Neo4j:

.

And I want to get the aggregated students effort (questions answered and time spent studying) of all schools from a school manager to display in a dashboard for him. So I make this query:

match (:SchoolManager{schoolManagerId:"3aeb3009-07ca-48ae-b0c8-78b9f8f96bd9"})-[:MANAGER_OF_SCHOOL]->()<-[:CLASS_OF_SCHOOL]-()<-[:STUDENT_OF_CLASS]-()-[:ANSWERED]->(sa:StudentAnswer)
where sa.createdAt >= datetime("2020-01-01") and sa.createdAt <= datetime("2020-11-20")
return {questionsAnswered: count(sa), timeSpent: sum(sa.timeSpent)}

This query takes almost 16K ms. This is the PROFILE for the query:

In MySQL, the corresponding query is something like this:

select count(*) as questionsAnswered, sum(qa.timeSpent) as timeSpent from school_managers.users as sm left join schools.schools_members as sm on sm.user_id = sm.id left join schools.schools_classes as sc on sc.school_id = sm.school_id left join schools.schools_classes_members as scm on scm.school_class_id = sc.id left join quizzes.quizzes_attempts as qa on qa.user_id = scm.user_id where sm.id = 1 and qa.created_at >= "2020-01-01" and qa.created_at <= "2020-11-20";

It's much uglier, but much faster too, it runs in about 600 ms.

I already searched a lot online and know that the slowness is aggravated by the aggregation on a node property in sum(sa.timeSpent). If I don't access this node property, the query time decreases by half (which is still slow).

I little more info on my data:

  • :SchoolManager ~ 200 nodes
  • :School ~ 200 nodes
  • :SchoolClass ~ 1K nodes
  • :Student ~ 20K nodes
  • :StudentAnswer ~ 6M

I have indexes on :SchoolManager.schoolManagerId and :StudentAnswer.createdAt.

I'm using Neo4j Aura instances with 2GB RAM and 1 CPU. My browser and server configs are:

  • Neo4j Browser version: 4.1.4
  • Neo4j Server version: 4.0.0 (aura)

Is there something that I can do to improve the performance of this Neo4j query? Or for this kind of aggregated data on a property MySQL will always be a better choice?

My suspicion is by NOT specifying your nodes labels (node types) is causing you your speed problem.

If you do NOT specify a label, then Neo4J has to scan ALL nodes.

The RED blocks in the query explanation is what's you need to avoid (or decrease)

So, I believe this fragment would be an improvement:

-[:MANAGER_OF_SCHOOL]->(:School)<-[:CLASS_OF_SCHOOL]-(:SchoolClass)<-[:STUDENT_OF_CLASS]-(:Student)-[:ANSWERED]->(sa:StudentAnswer)

Also.... I suspect the datatime value has to be evaluated each time the subquery is run.

I think you need to add:
WITH datetime("2020-11-20") as enddate, datetime("2020-01-01") as startdate and use the startdate and enddate

If the relationships in that pattern can only connect nodes of those labels, then adding the labels here likely won't make much difference.

Precalculating dates is a good idea, though likely won't yield much savings.

It's worth seeing if performing the matching from the other direction (starting at answers in the range and checking for a connection to the manager) would be better or worse.

Give this a try and provide the PROFILE plan:

WITH datetime("2020-01-01") as startDate, datetime("2020-11-20") as endDate
MATCH (manager:SchoolManager{schoolManagerId:"3aeb3009-07ca-48ae-b0c8-78b9f8f96bd9"})
MATCH (sa:StudentAnswer)
WHERE startDate <= sa.createdAt <= endDate 
 AND (manager)-[:MANAGER_OF_SCHOOL]->()<-[:CLASS_OF_SCHOOL]-()<-[:STUDENT_OF_CLASS]-()-[:ANSWERED]->(sa)
RETURN {questionsAnswered: count(sa), timeSpent: sum(sa.timeSpent)}

Thank you for the response, @clem! It doesn't make much difference specify the node labels in my case, given that the relationships types between the nodes are unique. Actually, in this case, specify the unnecessary nodes labels makes the query slower, because the query plan will have to add a step to filtering the nodes by label. :slightly_frowning_face:

Thank you for the response, @andrew.bowman! As you said, precalculating the dates doesn't make much difference in the query response time in this case.

I tried switching the starting point to the StudentAnswered filtered by date and then going to the filtered SchoolManager, but it was far worse. Because there are more than 4M StudentAnswers in the given date range and the majority of them doesn't relate with the filtered SchoolManager (only ~0.7M). This way I get a lot of data that is not relevant in the beginning of the query and has to filter it out later.

Try this:

MATCH (manager:SchoolManager{schoolManagerId:"3aeb3009-07ca-48ae-b0c8-78b9f8f96bd9"})
MATCH (manager)-[:MANAGER_OF_SCHOOL]->(s:School)
where s.user_id = manager.schoolManagerId

with manager, s
MATCH (sc:SchoolClass)-[:CLASS_OF_SCHOOL]]->(s)
WHERE sc.school_id = s.school_id

with manager, s, sc
MATCH (st:Student)-[:STUDENT_OF_CLASS]->(sc)
WHERE st.school_class_id = sc.id

with manager, s, sc, st, datetime("2020-01-01") as startDate, datetime("2020-11-20") as endDate

MATCH (st)-[:ANSWERED]->(sa:StudentAnswer)
WHERE startDate <= sa.createdAt <= endDate 
and sa.user_id = st.user_id

RETURN {questionsAnswered: count(sa), timeSpent: sum(sa.timeSpent)}

I am guessing the node properties based on your SQL query.

In this case I would think it makes sense to force the index usage for StudentAnswer too
so it can come from two sides.

can you try to add an USING JOIN ON student ?

Also if you are on aura, did you talk to the aura support folks? Could be that this 1G instance is a bit small.

Did you try to download the dump and run it locally in desktop for comparison?

You can also send me the dump via email/pm and I can have a look.

1 Like

Hi Costa,

I'd like to drill in and get more detail about what's going and try to correlate what you're seeing with events in the Aura logs. I've opened a ticket on your behalf in our Support portal. Please respond there at your earliest convenience.

You can respond either by replying to the email you'll receive, or by logging into the portal using the same credentials you use to access Aura.

Thanks,
Cory

Thank you for the response, @ameyasoft! I tried this, with a few adjustments, but it didn't make much difference. I think it is doing the same thing as my original query but with the "MATCH" statements separated.

Thank you for the caring support, @cory.waddingham! I just responded in the Support Portal.

Thank you for the enlightenment, @michael.hunger!

Adding USING JOIN ON student to the query made it 3x faster, it took 5K ms (instead of 16K ms). This is the query PROFILE: https://imgur.com/a/h0ETiwC.

profile match (:SchoolManager{schoolManagerId:"3aeb3009-07ca-48ae-b0c8-78b9f8f96bd9"})-[:MANAGER_OF_SCHOOL]->()<-[:CLASS_OF_SCHOOL]-()<-[:STUDENT_OF_CLASS]-(student:Student)-[:ANSWERED]->(sa:StudentAnswer)
USING JOIN ON student
where sa.createdAt >= datetime("2020-01-01") and sa.createdAt <= datetime("2020-11-20")
return {questionsAnswered: count(sa), timeSpent: sum(sa.timeSpent)}

The funny fact is that with USING JOIN ON student the query makes much more db hits, ~19M db hits vs ~6M db hits in the original query. My hypothesis is that this filter operation in the original query is the major bottle neck: image . I think that the filter operation with large amount of data is expensive in Neo4j.

So, given that, the db hits is not necessarily correlated with the query time and I should try to avoid "filter" operations in Neo4j with high amount of data, even if I have to opt for more db hits. Is that line of thinking plausible?

Even with this great optimization, it would be really good to get this query around 1K ms, as I was used to see with MySQL. I'll mimic the database with dummy data, dump it and send to your email, so you can have a better ideia.

Regarding the Support of Aura folks, @cory.waddingham gave me great attention and opened a ticket for me in Aura Support!

It seems you have a number of pagecache misses.
Not sure if this is a memory constraint both for the graph data and the aggregation.

What memory size does the instance have that you test it on?

The other option would be to change your model and substructure your relationships by date or at least month. So eg. ANSWERED_2020_10_01 and then use those in the query

1 Like

The Aura instance I'm testing on has 2GB of RAM.

Regarding your suggestion to insert the date as a relationship type, in the form of "ANSWERED_2020_10_01", I have two questions:

  • Is probable that I hit some limit of the quantity of relationships types? Because each year would create more than 360 new types of relationships.
  • How can I make a query using this paradigm of the date in the relationship to filter answers in a given date range?

The limit is really high. I can’t remember exactly either 64k or several million.

You can use an enumeration: -[:A|B|C]->

Can you please provide an example that replaces the original query with this new logic?

match (:SchoolManager{schoolManagerId:"3aeb3009-07ca-48ae-b0c8-78b9f8f96bd9"})-[:MANAGER_OF_SCHOOL]->()<-[:CLASS_OF_SCHOOL]-()<-[:STUDENT_OF_CLASS]-()-[:ANSWERED]->(sa:StudentAnswer)
where sa.createdAt >= datetime("2020-01-01") and sa.createdAt <= datetime("2020-11-20")
return {questionsAnswered: count(sa), timeSpent: sum(sa.timeSpent)}

I found another interesting video. What is very interesting is at minute 15:30.

If your query has TWO things that you are filtering and that has an index, then it's better to force Neo4J to use both indexes.

I'm surprised by this! . (I tried this with version 4.2)

Their example using the Movie DB. Instead of this:

PROFILE MATCH p = (p1:Person)-[:ACTED_IN*6]-(p5:Person)
WHERE p1.name='Tom Cruise' and p5.name='Kevin Bacon' 
RETURN [n in nodes(p) | coalesce(n.title, n.name)]

do this:

PROFILE MATCH p = (p1:Person)-[:ACTED_IN*6]-(p5:Person) 
USING INDEX p1:Person(name)
USING INDEX p5:Person(name)
WHERE p1.name='Tom Cruise' and p5.name='Kevin Bacon' 
RETURN [n in nodes(p) | coalesce(n.title, n.name)]

I wonder if you were to use the two USING INDEX statements for the schoolManagerId and the date

Thank you for the response and detailed video, @clem! He gave great insights.

Unfortunately, I tried:

USING INDEX sm:SchoolManager(schoolManagerId)
USING INDEX sa:StudentAnswer(createdAt)

But that made the query time to go up, since it quadruplicated the db hits. :slightly_frowning_face:

Hi guys,

To give you a feedback. I tested with the 8gb/2 CPU instance on Neo4j Aura and the time of the query decreased from 20K ms to 3K ms. It looks like it was a CPU bottle neck (I was using a 1 CPU instance).

The sad news is that the 2 CPU instance is really expensive for us (our brazilian currency is devalued 5 times before the dollar). I will probably maintain only my production environment on Aura with 2 CPU and pass my staging env to AWS :(.

In any case, thank you all for your attention and your help!