-
I discovered today that alias length can affect query in a huge way. I'm doing a count query with some date filters, with indexes on date property (
NodeIndexSeekByRange). Profiling this query i discovered that using an alias with 4 or less characters makes the query 25x faster than using alias with more than 4 characters. TheNodeIndexSeekByRangewent from ~100 000 db hits to ~2700 db hits and total query db hits went from ~500 000 to ~21 000 by changing alias from 5 characters to 4 characters.(these are not my actual queries, but to illustrate my point)
MATCH (aaaaa:Node) WHERE aaaaa.date < datetime($someDate) < aaaaa.date RETURN COUNT(*) AS count = 500 000 total db hits MATCH (aaaa:Node) WHERE aaaa.date < datetime($someDate) < aaaa.date RETURN COUNT(*) AS count = 21 000 total db hitsSo apparently alias length can affect query planner in a huge way, is this covered in the docs?
-
This is neo4j 4.4 with BTREE index btw
-
Would love some under the hood explanation of this, I'm assuming this goes down to byte level
After some more digging it seems that this is related to predicate order in the where statement.
My original query was something like this:
MATCH (n:Node)
WHERE (datetime($date1) <= n.date1 <= datetime($date2)
OR datetime($date1) <= n.date2 <= datetime($date2))
OR (datetime($date1) > n.date1 AND datetime($date2) < n.date2)
RETURN COUNT(*) AS count
Changing the predicate order to this seems to have fixed the problem.
MATCH (n:Node)
WHERE (datetime($date1) > n.date1 AND datetime($date2) < n.date2)
OR (datetime($date1) <= n.date1 <= datetime($date2)
OR datetime($date1) <= n.date2 <= datetime($date2))
RETURN COUNT(*) AS count
So predicate order is really important for large count queries, though it is still strange that changing the alias would change the query planners interpretation of predicate order.
Changing predicate order again seems to fix the problem so that n.date2 index is seeked first. Though this query feels unreliable to me, since I don't really understand the query planner decisions, and can't find any docs on it.
MATCH (n:Node)
WHERE (datetime($date2) < n.date2 AND datetime($date1) > n.date1)
OR (datetime($date1) <= n.date2 <= datetime($date2)
OR datetime($date1) <= n.date1 <= datetime($date2))
RETURN COUNT(*) AS count
Also the indexes must be created in order such that the n.date2 index id > n.date1 index id
I am having issues again with this query
MATCH (n:Node)
WHERE (datetime($date1) > n.date1 AND datetime($date2) < n.date2)
OR (datetime($date1) <= n.date1 <= datetime($date2)
OR datetime($date1) <= n.date2 <= datetime($date2))
RETURN COUNT(*) AS count
I have a b-tree index on properties n.date1 and n.date2 (both separate and composite).
In this predicate the query planner will sometimes seek n.date1 index first, and sometimes seek n.date2 index first.
(datetime($date1) > n.date1 AND datetime($date2) < n.date2)
seeking n.date1 index first has huge performance loss vs seeking n.date2 index first, and i can't understand how the query planner decides what to use here, and how can i force it to use n.date2 index first? I have tried query hints like below, but with it still sometimes chooses to use n.date1 index first. Does creation order of index affect how the planner decides to use them?
MATCH (n:Node)
USING INDEX n:Node(date2)
USING INDEX n:Node(date1)
WHERE (datetime($date1) > n.date1 AND datetime($date2) < n.date2)
OR (datetime($date1) <= n.date1 <= datetime($date2)
OR datetime($date1) <= n.date2 <= datetime($date2))
RETURN COUNT(*) AS count
My main questions:
How does the query planner decide which index to seek first?
How can i force it to seek n.date2 index before n.date1 index?
This thread is kind of all over the place so I want to make a post to summarize the issue I'm having.
My original problem is that I want to filter nodes with a datetime period (n.dateStart, n.dateEnd) with a datetime period filter ($date.from, $date.to).

I want to get the nodes that has a "green line" time period.
The following query works, but can potentially be very slow on a large dataset, depending on filter datetimes, indexes and execution plan.
MATCH (t:Test)
WHERE (datetime($date.from) > t.dateStart AND datetime($date.to) < t.dateEnd)
OR (datetime($date.from) <= t.dateStart <= datetime($date.to)
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count
Lets create some test data, indexes and params to start testing:
CREATE (:Test {id: 1, dateStart: datetime("2023-02-04T00:00:00Z"), dateEnd: datetime("2023-02-04T00:00:00Z")});
CREATE (:Test {id: 2, dateStart: datetime("2023-02-05T00:00:00Z"), dateEnd: datetime("2023-02-05T00:00:00Z")});
CREATE (:Test {id: 3, dateStart: datetime("2023-02-06T00:00:00Z"), dateEnd: datetime("2023-02-06T00:00:00Z")});
CREATE (:Test {id: 4, dateStart: datetime("2023-02-07T00:00:00Z"), dateEnd: datetime("2023-02-07T00:00:00Z")});
CREATE (:Test {id: 5, dateStart: datetime("2023-02-08T00:00:00Z"), dateEnd: datetime("2023-02-08T00:00:00Z")});
CREATE (:Test {id: 6, dateStart: datetime("2023-02-09T00:00:00Z"), dateEnd: datetime("2023-02-09T00:00:00Z")});
CREATE (:Test {id: 7, dateStart: datetime("2023-02-10T00:00:00Z"), dateEnd: datetime("2023-02-10T00:00:00Z")});
CREATE (:Test {id: 8, dateStart: datetime("2023-02-11T00:00:00Z"), dateEnd: datetime("2023-02-11T00:00:00Z")});
CREATE (:Test {id: 9, dateStart: datetime("2023-02-12T00:00:00Z"), dateEnd: datetime("2023-02-12T00:00:00Z")});
CREATE (:Test {id: 10, dateStart: datetime("2023-02-13T00:00:00Z"), dateEnd: datetime("2023-02-13T00:00:00Z")});
CREATE INDEX test_dateStart_dateEnd IF NOT EXISTS
FOR (t:Test)
ON (t.dateStart, t.dateEnd);
CREATE INDEX test_dateStart IF NOT EXISTS
FOR (t:Test)
ON (t.dateStart);
CREATE INDEX test_dateEnd IF NOT EXISTS
FOR (t:Test)
ON (t.dateEnd);
:params {
"date": {
"from": "2023-02-13T00:00:00Z",
"to": "2023-02-13T23:59:59Z"
}
}
Profiling the previously mentioned query gives poor performance since t.dateStart index is used first which results in more db hits than it would if using t.dateEnd index first:
PROFILE MATCH (t:Test)
WHERE (datetime($date.from) > t.dateStart AND datetime($date.to) < t.dateEnd)
OR (datetime($date.from) <= t.dateStart <= datetime($date.to)
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count
Planner COST
Runtime SLOTTED
Runtime version 5.3
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| Operator | Id | Details | Estimated Rows | Rows | DB Hits | Memory (Bytes) | Page Cache Hits/Misses |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +ProduceResults | 0 | count | 1 | 1 | 0 | | 0/0 |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +EagerAggregation | 1 | count(*) AS count | 1 | 1 | 0 | 24 | 0/0 |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter | 2 | (cache[t.dateEnd] > datetime($date.to) OR cache[t.dateStart] >= datetime($date.from) AND cache[t.dat | 0 | 1 | 40 | | 0/0 |
| | | | eStart] <= datetime($date.to) OR cache[t.dateEnd] >= datetime($date.from) AND cache[t.dateEnd] <= da | | | | | |
| | | | tetime($date.to)) | | | | | |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Distinct | 3 | t | 0 | 10 | 0 | 192 | 0/0 |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union | 4 | | 0 | 11 | 0 | | 0/0 |
| |\ +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange | 5 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd >= datetime($date.from) AND dateEnd <= datetime($date.to), | 0 | 1 | 4 | | 0/0 |
| | | | cache[t.dateEnd] | | | | | |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union | 6 | | 0 | 10 | 0 | | 0/0 |
| |\ +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange | 7 | RANGE INDEX t:Test(dateStart) WHERE dateStart >= datetime($date.from) AND dateStart <= datetime($dat | 0 | 1 | 4 | | 0/0 |
| | | | e.to), cache[t.dateStart] | | | | | |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +NodeIndexSeekByRange | 8 | RANGE INDEX t:Test(dateStart) WHERE dateStart < datetime($date.from), cache[t.dateStart] | 0 | 9 | 11 | | 0/0 |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
Total database accesses: 59, total allocated memory: 280
So I switch up the t.dateStart and t.dateEnd predicates which gives better performance since t.dateEnd index is used first instead:
PROFILE MATCH (t:Test)
WHERE (datetime($date.to) < t.dateEnd AND datetime($date.from) > t.dateStart)
OR (datetime($date.from) <= t.dateStart <= datetime($date.to)
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count
Planner COST
Runtime SLOTTED
Runtime version 5.3
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| Operator | Id | Details | Estimated Rows | Rows | DB Hits | Memory (Bytes) | Page Cache Hits/Misses |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +ProduceResults | 0 | count | 1 | 1 | 0 | | 0/0 |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +EagerAggregation | 1 | count(*) AS count | 1 | 1 | 0 | 24 | 0/0 |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter | 2 | (cache[t.dateStart] < datetime($date.from) OR cache[t.dateStart] >= datetime($date.from) AND cache[t | 0 | 1 | 4 | | 0/0 |
| | | | .dateStart] <= datetime($date.to) OR cache[t.dateEnd] >= datetime($date.from) AND cache[t.dateEnd] < | | | | | |
| | | | = datetime($date.to)) | | | | | |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Distinct | 3 | t | 0 | 1 | 0 | 192 | 0/0 |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union | 4 | | 0 | 2 | 0 | | 0/0 |
| |\ +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange | 5 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd >= datetime($date.from) AND dateEnd <= datetime($date.to), | 0 | 1 | 4 | | 0/0 |
| | | | cache[t.dateEnd] | | | | | |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union | 6 | | 0 | 1 | 0 | | 0/0 |
| |\ +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange | 7 | RANGE INDEX t:Test(dateStart) WHERE dateStart >= datetime($date.from) AND dateStart <= datetime($dat | 0 | 1 | 4 | | 0/0 |
| | | | e.to), cache[t.dateStart] | | | | | |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +NodeIndexSeekByRange | 8 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd > datetime($date.to), cache[t.dateEnd] | 0 | 0 | 2 | | 0/0 |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
Total database accesses: 14, total allocated memory: 280
However if I switch the filter to an earlier datetime period, then using t.dateStart index first would be better again and the latest query performs worse again.
:params {
"date": {
"from": "2023-02-04T00:00:00Z",
"to": "2023-02-04T23:59:59Z"
}
}
PROFILE MATCH (t:Test)
WHERE (datetime($date.to) < t.dateEnd AND datetime($date.from) > t.dateStart)
OR (datetime($date.from) <= t.dateStart <= datetime($date.to)
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count
Planner COST
Runtime SLOTTED
Runtime version 5.3
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| Operator | Id | Details | Estimated Rows | Rows | DB Hits | Memory (Bytes) | Page Cache Hits/Misses |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +ProduceResults | 0 | count | 1 | 1 | 0 | | 0/0 |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +EagerAggregation | 1 | count(*) AS count | 1 | 1 | 0 | 24 | 0/0 |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter | 2 | (cache[t.dateStart] < datetime($date.from) OR cache[t.dateStart] >= datetime($date.from) AND cache[t | 0 | 1 | 67 | | 0/0 |
| | | | .dateStart] <= datetime($date.to) OR cache[t.dateEnd] >= datetime($date.from) AND cache[t.dateEnd] < | | | | | |
| | | | = datetime($date.to)) | | | | | |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Distinct | 3 | t | 0 | 10 | 0 | 192 | 0/0 |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union | 4 | | 0 | 11 | 0 | | 0/0 |
| |\ +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange | 5 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd >= datetime($date.from) AND dateEnd <= datetime($date.to), | 0 | 1 | 4 | | 0/0 |
| | | | cache[t.dateEnd] | | | | | |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union | 6 | | 0 | 10 | 0 | | 0/0 |
| |\ +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange | 7 | RANGE INDEX t:Test(dateStart) WHERE dateStart >= datetime($date.from) AND dateStart <= datetime($dat | 0 | 1 | 4 | | 0/0 |
| | | | e.to), cache[t.dateStart] | | | | | |
| | +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +NodeIndexSeekByRange | 8 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd > datetime($date.to), cache[t.dateEnd] | 0 | 9 | 11 | | 0/0 |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
Total database accesses: 86, total allocated memory: 280
So what I would really like is for the query planner to understand when to use t.dateStart and when to use t.dateEnd index first for optimal performance.
Also in my real problem scenario the query planner will sometimes still use the t.dateStart index first even for the query with t.dateEnd predicate first (when t.dateEnd index would be better). So I don't know what I can do to make this query always be fast.
Can you please turn this into a new question because this is completely unrelated to your original question about the identifier lengths.
I agree with @michael.hunger, but I am going to give a short response. First, the following query should produce the same results:
PROFILE MATCH (t:Test)
WHERE t.dateEnd > (datetime($date.from) and t.dateStart < datetime($date.to)
RETURN COUNT(*) AS count
Second, which index is better (dateStart vs dateEnd) is highly dependent on where the 'to' and 'from' dates are within your data set. If your 'to' and 'from' overlap the lower end of your data, then the index on startDate is going to be more selective, since much less data will meet the criteria that 't.dateStart < datetime($date.to) '. In converse, if 'to' and 'from' are high up in your data set, then the index based on dateEnd will be more selective, since much less data will meet the criteria 't.dateEnd > (datetime($date.from) '.
If you have historical data and you are generally query for results over the more current values of the data, then the endDate will be the better index for your more 'live' queries. Performance will degrade as you query more historical data.
Everything in this thread revolves around the same problem, same query and same use case so I would not say it is completely unrelated. The thread is unstructured because the problem has evolved and I apologize for that.
The problem started with the identifier length affecting the planner to chose 'dateStart' index instead of 'dateEnd', which was strange and undocumented.
I have tried to fix this query many times, but the planner is not consistent in what index it chooses to use.
We currently have two environments with the same query, same indexes, same data and same filters. In one environment the query takes 1 second because it starts with 'dateStart' index, in the other it takes 50ms because it starts with 'dateEnd' index.
PROFILE MATCH (t:Test)
WHERE (datetime($date.to) < t.dateEnd AND datetime($date.from) > t.dateStart)
OR (datetime($date.from) <= t.dateStart <= datetime($date.to)
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count
I would just like a clear documentation of what parameters affects the planner to choose one index over the other so I can make this query stable and consistent.
Feel free to rename the thread "How does the planner choose index order?" because that is my question for the problem I'm having with this query and use case.
Currently the planner almost never makes decisions based on parameter values. We only sometimes use string/list parameter length, e.g. when query contains predicates like ENDS WITH or IN. For your queries that doesn't apply.
To determine which index to use here, we only consider index size and the number of unique values in the index. Given your sample data, :Test(dateStart) and :Test(dateEnd) are equivalent from the planner perspective. In that case the choice of an index depends mostly on the implementation details of various data structures and sorting algorithms used by the planner. You cannot meaningfully control that by changing variable names or reordering predicates.
To get more control over your query, I suggest rewriting it in a more verbose way. First, split your MATCH into a UNION. Then, add desired index hints to each branch. Finally, to COUNT the results, wrap it into a CALL subquery:
CALL {
MATCH (t:Test) USING INDEX t:Test(dateEnd)
WHERE datetime($date.from) > t.dateStart AND datetime($date.to) < t.dateEnd
RETURN t
UNION
MATCH (t:Test)
WHERE datetime($date.from) <= t.dateStart <= datetime($date.to)
RETURN t
UNION
MATCH (t:Test) USING INDEX t:Test(dateEnd)
WHERE datetime($date.from) <= t.dateEnd <= datetime($date.to)
RETURN t
}
RETURN COUNT(*) AS count
Hope this helps.