Hello,
I'm getting a weird behavior in a query. If I filter the relationships by date, I get much less dbhits, as I would expect, but the query takes a lot more time to complete than if I don't filter by date.
I'll detail my use case and technical specifications, so you can have a complete picture. We are a brazilian education startup, Jovens Gênios, we offer more fun and personalized learning for kids aged 10-15. On the personalized learning pillar we make recommendations of the best topic a student needs to practice in the moment, based on: her difficulties, interests and what other students like her are doing.
In this context, I think Neo4j has much potential for our use case. But the query that gets the most popular topics in a given school year is taking too long (about 30s).
This is a subgraph involved in the given query: https://i.imgur.com/QEY0vkS.png.
The slow query I'm talking about is this, the goal is to get the most popular topics among students in 7th grade in the last two months:
match (:Student {schoolYear: 7})-[a:ANSWERED]->(:Question)-[:ACTIVITY_OF]->(t:Topic)
where a.createdAt >= datetime("2020-07-01T00:00:00")
return t.name, count(*) as attempts
order by attempts desc
limit 25
The statistics of the database for the subgraph: 14.000 Student nodes, 40.000 Question nodes, 2.000 Topic nodes, 30.000 "ACTIVITY_OF" relationships and 4.000.000 "ANSWERED" relationships (a student can answer the same question multiple times).
The profile for the query filtering by date (6.984.788 total db hits in 29392 ms):
profile match (:Student {schoolYear: 7})-[a:ANSWERED]->(:Question)-[:ACTIVITY_OF]->(t:Topic)
where a.createdAt >= datetime("2020-07-01T00:00:00")
return t.name, count(*) as attempts
order by attempts desc
limit 25
The profile query without filtering by date (17.256.907 total db hits in 2675 ms):
profile match (:Student {schoolYear: 7})-[a:ANSWERED]->(:Question)-[:ACTIVITY_OF]->(t:Topic)
return t.name, count(*) as attempts
order by attempts desc
limit 25
I'm running this on Neo4j Aura in a 1GB memory/0.2 CPU/2GB storage instance. Neo4j Browser version: 4.1.0. Neo4j Server version: 4.0.0 (aura).
Do you know what I'm missing here? How can I improve this query so I can filter the relationships "ANSWERED" by date?
I suspect that this could be related to the fact that the property "createdAt" in "ANSWERED" relationships is not indexed, as its not possible to index properties in relationships in Neo4j right now. But I imagine if I create some intermediate node between Student and Question the data modeling would be stranger.