Very slow performance when aggregate on node property

Hello,

I'm experimenting with Neo4j for the past 4 months and I'm progressively migrating my production environment from MySQL to Neo4j. I think Neo4j is a great database and is very good to model relationships in my data. But for some simpler queries, it is much slower than MySQL. I'm having a specific hard time when I try to query aggregated data, for some queries, it is 40x slower than the corresponding one in MySQL. I'll provide an example.

I have this data model in Neo4j:

.

And I want to get the aggregated students effort (questions answered and time spent studying) of all schools from a school manager to display in a dashboard for him. So I make this query:

match (:SchoolManager{schoolManagerId:"3aeb3009-07ca-48ae-b0c8-78b9f8f96bd9"})-[:MANAGER_OF_SCHOOL]->()<-[:CLASS_OF_SCHOOL]-()<-[:STUDENT_OF_CLASS]-()-[:ANSWERED]->(sa:StudentAnswer)
where sa.createdAt >= datetime("2020-01-01") and sa.createdAt <= datetime("2020-11-20")
return {questionsAnswered: count(sa), timeSpent: sum(sa.timeSpent)}

This query takes almost 16K ms. This is the PROFILE for the query:

In MySQL, the corresponding query is something like this:

select count(*) as questionsAnswered, sum(qa.timeSpent) as timeSpent from school_managers.users as sm left join schools.schools_members as sm on sm.user_id = sm.id left join schools.schools_classes as sc on sc.school_id = sm.school_id left join schools.schools_classes_members as scm on scm.school_class_id = sc.id left join quizzes.quizzes_attempts as qa on qa.user_id = scm.user_id where sm.id = 1 and qa.created_at >= "2020-01-01" and qa.created_at <= "2020-11-20";

It's much uglier, but much faster too, it runs in about 600 ms.

I already searched a lot online and know that the slowness is aggravated by the aggregation on a node property in sum(sa.timeSpent). If I don't access this node property, the query time decreases by half (which is still slow).

I little more info on my data:

  • :SchoolManager ~ 200 nodes
  • :School ~ 200 nodes
  • :SchoolClass ~ 1K nodes
  • :Student ~ 20K nodes
  • :StudentAnswer ~ 6M

I have indexes on :SchoolManager.schoolManagerId and :StudentAnswer.createdAt.

I'm using Neo4j Aura instances with 2GB RAM and 1 CPU. My browser and server configs are:

  • Neo4j Browser version: 4.1.4
  • Neo4j Server version: 4.0.0 (aura)

Is there something that I can do to improve the performance of this Neo4j query? Or for this kind of aggregated data on a property MySQL will always be a better choice?

I think 20 seconds down to 3 seconds with better hosting option is great. Currently my workload is taking 10 seconds to complete, so it sounds like once I start paying for the Aura DB, that it should be useable as is. I have a plan though, if it ultimately is too slow. Keep for dev purposes and slowly migrate to a custom drop-in-replacement for Neo4j. That way all the application code stays exactly the same, I would just need to custom write perhaps the python Neomodel wrapper. Just the functions that I use.

Hii, try rephrasing your query to filter first.
something like this
match (sa:StudentAnswer) where sa.createdAt >= datetime("2020-01-01") and sa.createdAt <= datetime("2020-11-20")
with sa
match (:SchoolManager{schoolManagerId:"3aeb3009-07ca-48ae-b0c8-78b9f8f96bd9"})-[:MANAGER_OF_SCHOOL]->()<-[:CLASS_OF_SCHOOL]-()<-[:STUDENT_OF_CLASS]-()-[:ANSWERED]->(sa)
return {questionsAnswered: count(sa), timeSpent: sum(sa.timeSpent)}