Hep Optimizing A Sorted Min Aggregate Query


(Erikmiller) #1

I need help optimizing the following query.

MATCH (p:Program)-[ao:AIRS_ON]->(t:Timeslot)
WITH p, min(t.gmtDateTime) as minTs
MATCH (p1:Program)-[ao1:AIRS_ON]->(t1:Timeslot)
WHERE t1.gmtDateTime = minTs 
RETURN p1,t1,minTs ORDER BY t1.gmtDateTime LIMIT 100

There is an index on :Timeslot.gmtDateTime, but the query is still taking a couple minutes to run on a 16G 4 core EC2 instance.

In laymans terms I'm trying to get a list of Programs and the Timeslot with the earliest airDate for that program.
My database currently has about 1.6 million (:Program)-[:AIRS_ON]->(:Timeslot) relationships so it is somewhat large but not huge, and currently only contains about 5% of the final database size.

Any help would be appreciated.

(Andrew Bowman) #2

This is kind of a confusing query.

You initially get, for every :Program, it's min gmtDateTime across all airing timeslots, so you will get as many rows at this point as there are :Program nodes.

Then for EACH of those :Program nodes, you match to any :Timeslots with those min times and find any other :Program nodes airing on those timeslots.

You basically throw out the p :Program nodes you matched to in the first query.

Up to your LIMIT, you are again going to have to have at least as many rows as there are :Program nodes, and all of those will have to be in memory for the sort to take place, so that's another place where there could be slowdown.

Is this the intended behavior? You may also want to PROFILE the query, expand all elements, and add the visual query plan to your question, that will show the flow of rows between operations. You can confirm if the number of rows seems out of proportion to what you are intending this to do.

(Erikmiller) #3

So I think you are correct. The query wsan't doing what I wanted it to. I recreated the query using the same program nodes in both MATCH's and performance is better.

MATCH (p:Program)-[AIRS_ON]->(t:Timeslot)
WITH p, min(t.gmtDateTime) as minTs
MATCH (p:Program)-[AIRS_ON]->(t1:Timeslot)
WHERE t1.gmtDateTime = minTs 
RETURN p,t1,minTs ORDER BY t1.gmtDateTime LIMIT 100

It now returning in 5 seconds.

(Andrew Bowman) #4

Okay, now this we can work with.

We can improve this one, ordering the timeslots and getting the minimum per program, then ordering the remaining, without needing to rematch out:

MATCH (p:Program)-[AIRS_ON]->(t:Timeslot) 
WITH p, t
ORDER BY t.gmtDateTime ASC
WITH p, head(collect(t)) as t1
RETURN p, t1, t1.gmtDateTime as minTs

(Erikmiller) #5

Hmm, that's actually a little slower, took 8 seconds. I'll keep playing around with it.