Slow query performance on filter by date on relationship

Hello,

I'm getting a weird behavior in a query. If I filter the relationships by date, I get much less dbhits, as I would expect, but the query takes a lot more time to complete than if I don't filter by date.

I'll detail my use case and technical specifications, so you can have a complete picture. We are a brazilian education startup, Jovens Gênios, we offer more fun and personalized learning for kids aged 10-15. On the personalized learning pillar we make recommendations of the best topic a student needs to practice in the moment, based on: her difficulties, interests and what other students like her are doing.

In this context, I think Neo4j has much potential for our use case. But the query that gets the most popular topics in a given school year is taking too long (about 30s).

This is a subgraph involved in the given query: https://i.imgur.com/QEY0vkS.png.

The slow query I'm talking about is this, the goal is to get the most popular topics among students in 7th grade in the last two months:

match (:Student {schoolYear: 7})-[a:ANSWERED]->(:Question)-[:ACTIVITY_OF]->(t:Topic)
where a.createdAt >= datetime("2020-07-01T00:00:00")
return t.name, count(*) as attempts
order by attempts desc
limit 25

The statistics of the database for the subgraph: 14.000 Student nodes, 40.000 Question nodes, 2.000 Topic nodes, 30.000 "ACTIVITY_OF" relationships and 4.000.000 "ANSWERED" relationships (a student can answer the same question multiple times).

The profile for the query filtering by date (6.984.788 total db hits in 29392 ms):

profile match (:Student {schoolYear: 7})-[a:ANSWERED]->(:Question)-[:ACTIVITY_OF]->(t:Topic)
where a.createdAt >= datetime("2020-07-01T00:00:00")
return t.name, count(*) as attempts
order by attempts desc
limit 25

The profile query without filtering by date (17.256.907 total db hits in 2675 ms):

profile match (:Student {schoolYear: 7})-[a:ANSWERED]->(:Question)-[:ACTIVITY_OF]->(t:Topic)
return t.name, count(*) as attempts
order by attempts desc
limit 25

I'm running this on Neo4j Aura in a 1GB memory/0.2 CPU/2GB storage instance. Neo4j Browser version: 4.1.0. Neo4j Server version: 4.0.0 (aura).

Do you know what I'm missing here? How can I improve this query so I can filter the relationships "ANSWERED" by date?

I suspect that this could be related to the fact that the property "createdAt" in "ANSWERED" relationships is not indexed, as its not possible to index properties in relationships in Neo4j right now. But I imagine if I create some intermediate node between Student and Question the data modeling would be stranger.

I recommend having a separate node for date time at the root level.
(date)-[:STUDENT]->(student)-[:ANSWERED]->(Question)-[:ACTIVITY_OF]->(topic)

This way you can select all the students on a given date or selected students on a given date.

Without changing your existing model, you can add date node and query on date node instead of 'ANSWERED' relationship.

2 Likes

Thank you for the reply, @ameyasoft! Can you give me some examples of this modeling? For example, how would I store this 3 cases:

  • Student "Sophia" answered a question in "2020-05-01T10:00:00".
  • Student "Sophia" answered a question in "2020-05-01T14:03:00".
  • Student "Sophia" answered a question in "2020-09-01T14:00:00".

With my current modeling, I would create 3 new relationships of type "ANSWERED" and store the datetime in the createdAt property.

In your suggestion would I create a node for each datetime response? In my examples, would I create 3 different "datetime nodes" for each of 3 Sophia's answers? In other words, do I have to create a "datetime node" and a "ANSWERED" relationship for each student's answer? In this case, even if I filter the datetime nodes, I would have to filter the ANSWERED relationships the same way, right?

Can you clarify with a concrete example?

Okay. Will create an example and post the code.

1 Like

Hi,

from your point of view it can be useful to have the node date in ISO number format, ex: 20200501 and to leave the hours in the relationship named [STUDENT]?

thank you

Alessio

Here it is:

MERGE (c:StartDate {date: "2020-05-01"})
CREATE (s:Student {name: "student1"})
CREATE (s1:Student {name: "student2"})
CREATE (q:Question {qstn: "Q1"})
CREATE (q1:Question {qstn: "Q2"})

CREATE (t:StartTime {time: "10:00:00"})
CREATE (t1:StartTime {time: "14:00:00"})

MERGE (c1:StartDate {date: "2020-09-01"})
CREATE (t2:StartTime {time: "14:00:00"})

MERGE (c)-[:STUDENT]->(s)
MERGE (s)-[:START_TIME]->(t)
MERGE (t)-[:ANSWERED]->(q)
MERGE (s)-[:START_TIME]->(t1)
MERGE (t1)-[:ANSWERED]->(q1)

MERGE (c)-[:STUDENT]->(s1)

CREATE (s2:Student {name: "student1"})
CREATE (q2:Question {qstn: "Q3"})
MERGE (c1)-[:STUDENT]->(s2)
MERGE (s2)-[:START_TIME]->(t2)
MERGE (t2)-[:ANSWERED]-(q2)

Added 11 labels, CREATEd 11 nodes, set 11 properties, CREATEd 9 relationships, completed after 803 ms.

You can remove createAt property from :ANSWERED relationship.

Result:

2 Likes

Thank you for the detailed response, @ameyasoft! I'll try your solution with my data.