Cypher: Alias length >4 chars makes query 25x slower

Bertel · November 30, 2022, 7:47am

I discovered today that alias length can affect query in a huge way. I'm doing a count query with some date filters, with indexes on date property (NodeIndexSeekByRange). Profiling this query i discovered that using an alias with 4 or less characters makes the query 25x faster than using alias with more than 4 characters. The NodeIndexSeekByRange went from ~100 000 db hits to ~2700 db hits and total query db hits went from ~500 000 to ~21 000 by changing alias from 5 characters to 4 characters.

(these are not my actual queries, but to illustrate my point)
```
MATCH (aaaaa:Node)
WHERE aaaaa.date < datetime($someDate) < aaaaa.date
RETURN COUNT(*) AS count

= 500 000 total db hits


MATCH (aaaa:Node)
WHERE aaaa.date < datetime($someDate) < aaaa.date
RETURN COUNT(*) AS count

= 21 000 total db hits
```
So apparently alias length can affect query planner in a huge way, is this covered in the docs?
This is neo4j 4.4 with BTREE index btw
Would love some under the hood explanation of this, I'm assuming this goes down to byte level

Bertel · November 30, 2022, 11:20am

After some more digging it seems that this is related to predicate order in the where statement.

My original query was something like this:

MATCH (n:Node)
WHERE (datetime($date1) <= n.date1 <= datetime($date2) 
OR datetime($date1) <= n.date2 <= datetime($date2)) 
OR (datetime($date1) > n.date1 AND datetime($date2) < n.date2)
RETURN COUNT(*) AS count

Changing the predicate order to this seems to have fixed the problem.

MATCH (n:Node)
WHERE (datetime($date1) > n.date1 AND datetime($date2) < n.date2) 
OR (datetime($date1) <= n.date1 <= datetime($date2) 
OR datetime($date1) <= n.date2 <= datetime($date2))
RETURN COUNT(*) AS count

So predicate order is really important for large count queries, though it is still strange that changing the alias would change the query planners interpretation of predicate order.

Bertel · January 23, 2023, 12:11pm

Changing predicate order again seems to fix the problem so that n.date2 index is seeked first. Though this query feels unreliable to me, since I don't really understand the query planner decisions, and can't find any docs on it.

MATCH (n:Node)
WHERE (datetime($date2) < n.date2 AND datetime($date1) > n.date1) 
OR (datetime($date1) <= n.date2 <= datetime($date2)
OR datetime($date1) <= n.date1 <= datetime($date2))
RETURN COUNT(*) AS count

Also the indexes must be created in order such that the n.date2 index id > n.date1 index id

Bertel · January 23, 2023, 11:34am

I am having issues again with this query

MATCH (n:Node)
WHERE (datetime($date1) > n.date1 AND datetime($date2) < n.date2) 
OR (datetime($date1) <= n.date1 <= datetime($date2) 
OR datetime($date1) <= n.date2 <= datetime($date2))
RETURN COUNT(*) AS count

I have a b-tree index on properties n.date1 and n.date2 (both separate and composite).

In this predicate the query planner will sometimes seek n.date1 index first, and sometimes seek n.date2 index first.

(datetime($date1) > n.date1 AND datetime($date2) < n.date2)

seeking n.date1 index first has huge performance loss vs seeking n.date2 index first, and i can't understand how the query planner decides what to use here, and how can i force it to use n.date2 index first? I have tried query hints like below, but with it still sometimes chooses to use n.date1 index first. Does creation order of index affect how the planner decides to use them?

MATCH (n:Node)
USING INDEX n:Node(date2)
USING INDEX n:Node(date1)
WHERE (datetime($date1) > n.date1 AND datetime($date2) < n.date2) 
OR (datetime($date1) <= n.date1 <= datetime($date2) 
OR datetime($date1) <= n.date2 <= datetime($date2))
RETURN COUNT(*) AS count

My main questions:
How does the query planner decide which index to seek first?
How can i force it to seek n.date2 index before n.date1 index?

Bertel · February 13, 2023, 4:53pm

This thread is kind of all over the place so I want to make a post to summarize the issue I'm having.

My original problem is that I want to filter nodes with a datetime period (n.dateStart, n.dateEnd) with a datetime period filter ($date.from, $date.to).

datePeriodFilter

I want to get the nodes that has a "green line" time period.

The following query works, but can potentially be very slow on a large dataset, depending on filter datetimes, indexes and execution plan.

MATCH (t:Test)
WHERE (datetime($date.from) > t.dateStart AND datetime($date.to) < t.dateEnd) 
OR (datetime($date.from) <= t.dateStart <= datetime($date.to) 
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count

Lets create some test data, indexes and params to start testing:

CREATE (:Test {id: 1, dateStart: datetime("2023-02-04T00:00:00Z"), dateEnd: datetime("2023-02-04T00:00:00Z")});
CREATE (:Test {id: 2, dateStart: datetime("2023-02-05T00:00:00Z"), dateEnd: datetime("2023-02-05T00:00:00Z")});
CREATE (:Test {id: 3, dateStart: datetime("2023-02-06T00:00:00Z"), dateEnd: datetime("2023-02-06T00:00:00Z")});
CREATE (:Test {id: 4, dateStart: datetime("2023-02-07T00:00:00Z"), dateEnd: datetime("2023-02-07T00:00:00Z")});
CREATE (:Test {id: 5, dateStart: datetime("2023-02-08T00:00:00Z"), dateEnd: datetime("2023-02-08T00:00:00Z")});
CREATE (:Test {id: 6, dateStart: datetime("2023-02-09T00:00:00Z"), dateEnd: datetime("2023-02-09T00:00:00Z")});
CREATE (:Test {id: 7, dateStart: datetime("2023-02-10T00:00:00Z"), dateEnd: datetime("2023-02-10T00:00:00Z")});
CREATE (:Test {id: 8, dateStart: datetime("2023-02-11T00:00:00Z"), dateEnd: datetime("2023-02-11T00:00:00Z")});
CREATE (:Test {id: 9, dateStart: datetime("2023-02-12T00:00:00Z"), dateEnd: datetime("2023-02-12T00:00:00Z")});
CREATE (:Test {id: 10, dateStart: datetime("2023-02-13T00:00:00Z"), dateEnd: datetime("2023-02-13T00:00:00Z")});

CREATE INDEX test_dateStart_dateEnd IF NOT EXISTS
FOR (t:Test)
ON (t.dateStart, t.dateEnd);
CREATE INDEX test_dateStart IF NOT EXISTS
FOR (t:Test)
ON (t.dateStart);
CREATE INDEX test_dateEnd IF NOT EXISTS
FOR (t:Test)
ON (t.dateEnd);

:params {
    "date": {
        "from": "2023-02-13T00:00:00Z",
        "to": "2023-02-13T23:59:59Z"
    }
}

Profiling the previously mentioned query gives poor performance since t.dateStart index is used first which results in more db hits than it would if using t.dateEnd index first:

PROFILE MATCH (t:Test)
WHERE (datetime($date.from) > t.dateStart AND datetime($date.to) < t.dateEnd) 
OR (datetime($date.from) <= t.dateStart <= datetime($date.to) 
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count

Planner COST

Runtime SLOTTED

Runtime version 5.3

+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| Operator                | Id | Details                                                                                              | Estimated Rows | Rows | DB Hits | Memory (Bytes) | Page Cache Hits/Misses |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +ProduceResults         |  0 | count                                                                                                |              1 |    1 |       0 |                |                    0/0 |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +EagerAggregation       |  1 | count(*) AS count                                                                                    |              1 |    1 |       0 |             24 |                    0/0 |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter                 |  2 | (cache[t.dateEnd] > datetime($date.to) OR cache[t.dateStart] >= datetime($date.from) AND cache[t.dat |              0 |    1 |      40 |                |                    0/0 |
| |                       |    | eStart] <= datetime($date.to) OR cache[t.dateEnd] >= datetime($date.from) AND cache[t.dateEnd] <= da |                |      |         |                |                        |
| |                       |    | tetime($date.to))                                                                                    |                |      |         |                |                        |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Distinct               |  3 | t                                                                                                    |              0 |   10 |       0 |            192 |                    0/0 |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union                  |  4 |                                                                                                      |              0 |   11 |       0 |                |                    0/0 |
| |\                      +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange |  5 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd >= datetime($date.from) AND dateEnd <= datetime($date.to), |              0 |    1 |       4 |                |                    0/0 |
| |                       |    | cache[t.dateEnd]                                                                                     |                |      |         |                |                        |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union                  |  6 |                                                                                                      |              0 |   10 |       0 |                |                    0/0 |
| |\                      +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange |  7 | RANGE INDEX t:Test(dateStart) WHERE dateStart >= datetime($date.from) AND dateStart <= datetime($dat |              0 |    1 |       4 |                |                    0/0 |
| |                       |    | e.to), cache[t.dateStart]                                                                            |                |      |         |                |                        |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +NodeIndexSeekByRange   |  8 | RANGE INDEX t:Test(dateStart) WHERE dateStart < datetime($date.from), cache[t.dateStart]             |              0 |    9 |      11 |                |                    0/0 |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+

Total database accesses: 59, total allocated memory: 280

So I switch up the t.dateStart and t.dateEnd predicates which gives better performance since t.dateEnd index is used first instead:

PROFILE MATCH (t:Test)
WHERE (datetime($date.to) < t.dateEnd AND datetime($date.from) > t.dateStart) 
OR (datetime($date.from) <= t.dateStart <= datetime($date.to) 
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count

Planner COST

Runtime SLOTTED

Runtime version 5.3

+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| Operator                | Id | Details                                                                                              | Estimated Rows | Rows | DB Hits | Memory (Bytes) | Page Cache Hits/Misses |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +ProduceResults         |  0 | count                                                                                                |              1 |    1 |       0 |                |                    0/0 |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +EagerAggregation       |  1 | count(*) AS count                                                                                    |              1 |    1 |       0 |             24 |                    0/0 |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter                 |  2 | (cache[t.dateStart] < datetime($date.from) OR cache[t.dateStart] >= datetime($date.from) AND cache[t |              0 |    1 |       4 |                |                    0/0 |
| |                       |    | .dateStart] <= datetime($date.to) OR cache[t.dateEnd] >= datetime($date.from) AND cache[t.dateEnd] < |                |      |         |                |                        |
| |                       |    | = datetime($date.to))                                                                                |                |      |         |                |                        |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Distinct               |  3 | t                                                                                                    |              0 |    1 |       0 |            192 |                    0/0 |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union                  |  4 |                                                                                                      |              0 |    2 |       0 |                |                    0/0 |
| |\                      +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange |  5 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd >= datetime($date.from) AND dateEnd <= datetime($date.to), |              0 |    1 |       4 |                |                    0/0 |
| |                       |    | cache[t.dateEnd]                                                                                     |                |      |         |                |                        |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union                  |  6 |                                                                                                      |              0 |    1 |       0 |                |                    0/0 |
| |\                      +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange |  7 | RANGE INDEX t:Test(dateStart) WHERE dateStart >= datetime($date.from) AND dateStart <= datetime($dat |              0 |    1 |       4 |                |                    0/0 |
| |                       |    | e.to), cache[t.dateStart]                                                                            |                |      |         |                |                        |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +NodeIndexSeekByRange   |  8 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd > datetime($date.to), cache[t.dateEnd]                     |              0 |    0 |       2 |                |                    0/0 |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+

Total database accesses: 14, total allocated memory: 280

However if I switch the filter to an earlier datetime period, then using t.dateStart index first would be better again and the latest query performs worse again.

:params {
    "date": {
        "from": "2023-02-04T00:00:00Z",
        "to": "2023-02-04T23:59:59Z"
    }
}

PROFILE MATCH (t:Test)
WHERE (datetime($date.to) < t.dateEnd AND datetime($date.from) > t.dateStart) 
OR (datetime($date.from) <= t.dateStart <= datetime($date.to) 
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count

Planner COST

Runtime SLOTTED

Runtime version 5.3

+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| Operator                | Id | Details                                                                                              | Estimated Rows | Rows | DB Hits | Memory (Bytes) | Page Cache Hits/Misses |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +ProduceResults         |  0 | count                                                                                                |              1 |    1 |       0 |                |                    0/0 |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +EagerAggregation       |  1 | count(*) AS count                                                                                    |              1 |    1 |       0 |             24 |                    0/0 |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Filter                 |  2 | (cache[t.dateStart] < datetime($date.from) OR cache[t.dateStart] >= datetime($date.from) AND cache[t |              0 |    1 |      67 |                |                    0/0 |
| |                       |    | .dateStart] <= datetime($date.to) OR cache[t.dateEnd] >= datetime($date.from) AND cache[t.dateEnd] < |                |      |         |                |                        |
| |                       |    | = datetime($date.to))                                                                                |                |      |         |                |                        |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Distinct               |  3 | t                                                                                                    |              0 |   10 |       0 |            192 |                    0/0 |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union                  |  4 |                                                                                                      |              0 |   11 |       0 |                |                    0/0 |
| |\                      +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange |  5 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd >= datetime($date.from) AND dateEnd <= datetime($date.to), |              0 |    1 |       4 |                |                    0/0 |
| |                       |    | cache[t.dateEnd]                                                                                     |                |      |         |                |                        |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +Union                  |  6 |                                                                                                      |              0 |   10 |       0 |                |                    0/0 |
| |\                      +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| | +NodeIndexSeekByRange |  7 | RANGE INDEX t:Test(dateStart) WHERE dateStart >= datetime($date.from) AND dateStart <= datetime($dat |              0 |    1 |       4 |                |                    0/0 |
| |                       |    | e.to), cache[t.dateStart]                                                                            |                |      |         |                |                        |
| |                       +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+
| +NodeIndexSeekByRange   |  8 | RANGE INDEX t:Test(dateEnd) WHERE dateEnd > datetime($date.to), cache[t.dateEnd]                     |              0 |    9 |      11 |                |                    0/0 |
+-------------------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+

Total database accesses: 86, total allocated memory: 280

So what I would really like is for the query planner to understand when to use t.dateStart and when to use t.dateEnd index first for optimal performance.

Also in my real problem scenario the query planner will sometimes still use the t.dateStart index first even for the query with t.dateEnd predicate first (when t.dateEnd index would be better). So I don't know what I can do to make this query always be fast.

michael.hunger · February 16, 2023, 12:35am

Can you please turn this into a new question because this is completely unrelated to your original question about the identifier lengths.

glilienfield · February 16, 2023, 4:18am

I agree with @michael.hunger, but I am going to give a short response. First, the following query should produce the same results:

PROFILE MATCH (t:Test)
WHERE t.dateEnd > (datetime($date.from) and t.dateStart < datetime($date.to) 
RETURN COUNT(*) AS count

Second, which index is better (dateStart vs dateEnd) is highly dependent on where the 'to' and 'from' dates are within your data set. If your 'to' and 'from' overlap the lower end of your data, then the index on startDate is going to be more selective, since much less data will meet the criteria that 't.dateStart < datetime($date.to) '. In converse, if 'to' and 'from' are high up in your data set, then the index based on dateEnd will be more selective, since much less data will meet the criteria 't.dateEnd > (datetime($date.from) '.

If you have historical data and you are generally query for results over the more current values of the data, then the endDate will be the better index for your more 'live' queries. Performance will degrade as you query more historical data.

Bertel · February 17, 2023, 8:56am

Everything in this thread revolves around the same problem, same query and same use case so I would not say it is completely unrelated. The thread is unstructured because the problem has evolved and I apologize for that.

The problem started with the identifier length affecting the planner to chose 'dateStart' index instead of 'dateEnd', which was strange and undocumented.

I have tried to fix this query many times, but the planner is not consistent in what index it chooses to use.

We currently have two environments with the same query, same indexes, same data and same filters. In one environment the query takes 1 second because it starts with 'dateStart' index, in the other it takes 50ms because it starts with 'dateEnd' index.

PROFILE MATCH (t:Test)
WHERE (datetime($date.to) < t.dateEnd AND datetime($date.from) > t.dateStart) 
OR (datetime($date.from) <= t.dateStart <= datetime($date.to) 
OR datetime($date.from) <= t.dateEnd <= datetime($date.to))
RETURN COUNT(*) AS count

I would just like a clear documentation of what parameters affects the planner to choose one index over the other so I can make this query stable and consistent.

Feel free to rename the thread "How does the planner choose index order?" because that is my question for the problem I'm having with this query and use case.

georgiy.kargapolov · April 27, 2023, 3:22pm

@Bertel

Currently the planner almost never makes decisions based on parameter values. We only sometimes use string/list parameter length, e.g. when query contains predicates like ENDS WITH or IN. For your queries that doesn't apply.

To determine which index to use here, we only consider index size and the number of unique values in the index. Given your sample data, :Test(dateStart) and :Test(dateEnd) are equivalent from the planner perspective. In that case the choice of an index depends mostly on the implementation details of various data structures and sorting algorithms used by the planner. You cannot meaningfully control that by changing variable names or reordering predicates.

To get more control over your query, I suggest rewriting it in a more verbose way. First, split your MATCH into a UNION. Then, add desired index hints to each branch. Finally, to COUNT the results, wrap it into a CALL subquery:

CALL {
  MATCH (t:Test) USING INDEX t:Test(dateEnd)
  WHERE datetime($date.from) > t.dateStart AND datetime($date.to) < t.dateEnd
  RETURN t
 UNION
  MATCH (t:Test)
  WHERE datetime($date.from) <= t.dateStart <= datetime($date.to)
  RETURN t
 UNION
  MATCH (t:Test) USING INDEX t:Test(dateEnd)
  WHERE datetime($date.from) <= t.dateEnd <= datetime($date.to)
  RETURN t
}
RETURN COUNT(*) AS count

Hope this helps.

Topic		Replies	Views
Query optimalisation Cypher	25	2500	February 25, 2019
Upgrade Kills Query Cypher	9	1215	November 13, 2018
Extremely slow query when profile looks very good? Cypher	11	4120	October 3, 2019
An effort to better understand the underlying mechanism of query filters and profiling Graph Data Science / Graph Analytics performance , cypher	12	996	July 27, 2020
Tuning Cypher queries by understanding cardinality Cypher performance , cypher , knowledge-base	0	1035	August 23, 2018

Cypher: Alias length >4 chars makes query 25x slower

Related topics