Query optimalisation


(Tomswinkels) #1

There are a lot of difference option to make a query faster. But in my case sometimes is an other query faster than the other query.

But i need help, what is the best way!

I use Neo4j version 3.5.2.

In my case i have a big note with 4 miljoen messages and sometimes i want the last messages, and sometimes i want filter the messages. The messages has some relation to a hourly timetree and we have split all the words to a new note with relations between the word en message.

In some cases it is faster to use that and in some cases it is better to filter the message note by itself.

We want only filter or display the messages in the last month to make the performance better.

The examples are only word filters but there are more filters like a group-code (this code are not in the message but it is always a relation to a other note).

Some queries and results


PROFILE
MATCH (startleaf:Hour{hash: '2018/04/01/05'}), (endleaf:Hour{hash: '2018/04/30/05'}), p = shortestPath((startleaf)-[:NEXT*0..]->(endleaf))
UNWIND nodes(p) AS leaf
MATCH (leaf)<-[:SENDED]-(message:TS_P2000Message)
WITH distinct message
MATCH (message)-[:HAS_WORD]->(:TS_Word { name:'someren'})
WITH distinct message AS message
MATCH (message)-[:HAS_WORD]->(:TS_Word { name:'kruisbaan'})
WITH distinct message AS message
WITH count(message) AS results, collect(message) AS messages
UNWIND(messages) AS message
WITH results, message AS message
SKIP 0 LIMIT 15
RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 848193 total db hits in 2099 ms.
Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 848176 total db hits in 763 ms.


PROFILE
MATCH (startleaf:Hour{hash: '2018/04/01/05'}), (endleaf:Hour{hash: '2018/04/30/05'}), p = shortestPath((startleaf)-[:NEXT*0..]->(endleaf))
UNWIND nodes(p) AS leaf
MATCH (leaf)<-[:SENDED]-(message:TS_P2000Message)
WHERE message.message =~ '(?i).*someren.*' AND message.message =~ '(?i).*kruisbaan.*'
WITH count(message) AS results, collect(message) AS messages
UNWIND(messages) AS message
WITH results, message AS message
SKIP 0 LIMIT 15
RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 115168 total db hits in 3732 ms.

Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 115168 total db hits in 338 ms.


PROFILE 
MATCH  p = shortestPath((startleaf:Hour{hash: '2018/04/01/05'})-[:NEXT*0..]->(endleaf:Hour{hash: '2018/04/30/05'}))
WITH NODES(p) AS dates
  MATCH (message:TS_P2000Message)-[:SENDED]->(leaf),
        (message)-[:HAS_WORD]->(word:TS_Word)
  WHERE leaf IN dates AND
        word.name IN ['kruisbaan', 'someren']
  WITH distinct message AS message
  WITH count(message) AS results, collect(message) AS messages
  UNWIND(messages) AS message
  WITH results, message AS message
  SKIP 0 LIMIT 15
  RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 4694 total db hits in 1086 ms.

Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 4694 total db hits in 36 ms.

But this query is very long when i use this without word filter or with other words (popular words) this query is faster, check the next result

PROFILE 
MATCH  p = shortestPath((startleaf:Hour{hash: '2018/04/01/05'})-[:NEXT*0..]->(endleaf:Hour{hash: '2018/04/30/05'}))
WITH NODES(p) AS dates
  MATCH (message:TS_P2000Message)-[:SENDED]->(leaf),
        (message)-[:HAS_WORD]->(word:TS_Word)
  WHERE leaf IN dates AND
        word.name IN ['brand']
  WITH distinct message AS message
  WITH count(message) AS results, collect(message) AS messages
  UNWIND(messages) AS message
  WITH results, message AS message
  SKIP 0 LIMIT 15
  RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 149648 total db hits in 21066 ms.

Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 149652 total db hits in 1679 ms.


When we use without the other notes

PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
WITH count(message) AS results, collect(message) AS messages
WITH results AS results, messages AS messages
UNWIND(messages) AS message
WITH results, message AS message
ORDER BY message.sended desc
SKIP 0 LIMIT 15
RETURN results, message

First: Started streaming 15 records after 3098 ms and completed after 3100 ms.

Second: Started streaming 15 records after 303 ms and completed after 303 ms.


Other way to filter.

PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
WITH message
WHERE message.message =~ "(?i).*\\bsomeren\\b.*"
OR message.message =~ "(?i).*\\bbrand\\b.*"
WITH count(message) AS results, collect(message) AS messages
WITH results AS results, messages AS messages
UNWIND(messages) AS message
WITH results, message AS message
ORDER BY message.sended desc
SKIP 0 LIMIT 10
RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 342324 total db hits in 3346 ms.

Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 342324 total db hits in 1114 ms.


(Kees Vegter) #3

Hi Tom,

Did you also try CONTAINS?
And is TS_P2000Message.message and TS_P2000Message.sended indexed?

It is all about on controlling the size of the data set for each cypher step.
So assuming that there is also an index on message than you may start with the evaluation of the message first and after that filtering on time. If that results in a 'smaller' intermediate data set to work with for the next query step...

regards


(Tomswinkels) #4

Hi Kees,

Yes, have try that also but in the cases that i have test the results are not better.

I have a lot of indexes, see the list of my indexes

description,indexName,tokenNames,properties,state,type,progress,provider,id,failureMessage

INDEX ON :TS_P2000Message(message),index_1465,[TS_P2000Message],[message],ONLINE,node_label_property,100.0,"{version:1.0,key:native-btree}",1465,""
INDEX ON :TS_P2000Message(sended),index_1655,[TS_P2000Message],[sended],ONLINE,node_label_property,100.0,"{version:1.0,key:native-btree}",1655,""
INDEX ON :TS_P2000Message(message, sended)",index_1458,[TS_P2000Message],"[message,sended]",ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1458,""
INDEX ON :TS_P2000Message(uuid),index_1444,[TS_P2000Message],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1444,""

INDEX ON :Hour(hash),index_1670,[Hour],[hash],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1670,""
INDEX ON :Hour(uuid),index_1677,[Hour],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1677,""

INDEX ON :TS_P2000Capcode(capcode),index_1691,[TS_P2000Capcode],[capcode],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1691,""
INDEX ON :TS_P2000Capcode(uuid),index_1684,[TS_P2000Capcode],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1684,""


INDEX ON :TS_Word(name),index_1663,[TS_Word],[name],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1663,""
INDEX ON :TS_Word(uuid),index_1656,[TS_Word],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1656,""

New result for filter first on the message:

Query:

PROFILE
MATCH (message:TS_P2000Message)
WHERE message.message =~ "(?i).*\\bsomeren\\b.*"
OR message.message =~ "(?i).*\\bbrand\\b.*"
WITH message
WHERE 1546281754000 <= message.sended <= 1548960154000
WITH count(message) AS results, collect(message) AS messages
WITH results AS results, messages AS messages
UNWIND(messages) AS message
WITH results, message AS message
ORDER BY message.sended desc
SKIP 0 LIMIT 10
RETURN results, message

Result: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 342325 total db hits in 5378 ms.


(Michael Hunger) #5

Can you share the profile output?


(Michael Hunger) #6

I think if you simplify your query you should benefit from the new index backed order by operations in Neo4j 3.5
Esp if message.sended is indexed in a native index.

EXPLAIN
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message.message =~ ("(?i).*\\b"+"(someren|brand)"+"\\b.*")
RETURN message
ORDER BY message.sended desc
SKIP 0 LIMIT 10

Leads to an quite optimal plan for me:


(Tomswinkels) #7

Sure!


(Tomswinkels) #8

I have running the query that you give me.


(Michael Hunger) #9

Sorry please run it with PROFILE


(Tomswinkels) #10

No problem :slight_smile:

Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 864 total db hits in 68 ms.


(Tomswinkels) #11

I think the count(results) is the problem in my case. But how can i get than the count of de total results of a filter?


(Michael Hunger) #12

You mean the total count? Yep that requires pulling through all data, which is not effective.
Esp. with the regexp filters which are very inefficient as they require each record to be touched.

Do you really need the total counts, or would be "more than xxx" be enough?


(Michael Hunger) #13

If you instead used a Neo4j 3.5 fulltext index on your messages, you could try to use this:

PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node as message1
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message1 = message
RETURN count(*)

(Tomswinkels) #14

I use the total counts now for pagination.

The fulltext index takes very long time, there are 4 mil TS_P2000Message notes.

Maybe the best way is to use

PROFILE
MATCH (message:TS_P2000Message) 
WHERE 1546281754000 <= message.sended <= 1548960154000 
AND message.message =~ ("(?i).*\\b"+"(someren|brand)"+"\\b.*") 
RETURN message
ORDER BY message.sended desc 
SKIP 0 LIMIT 10

But in this case i get all the messages with someren or brand. But in some cases i want say only if someren and rosdaal. And in some cases i even want filter on codes (that is a relation between the message and code). When the word match is ok than i make some new examples.

Another thing what we do is give the relationships of something directly back (in one query). But is it better to do that, or a new request? I think it is better to do a new query but than you have with a limit of 50 results sometimes 1+50*(4 subqueries) = 201 queries for one request.

Now the indexes not used, this takes more time.

PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
OPTIONAL MATCH (message)-[:HAS_URGENCY]->(urgency:P2000Urgency)
WITH {urgency: {urgency: urgency}, message: message} AS message
RETURN message
ORDER BY message.sended desc
SKIP 0 LIMIT 10

(Michael Hunger) #15

That last statement won't work as you need to use message.message.sended but then you loose all benefits of the index backed ordering.

So you should not shadow the name.

I can't believe that fetching results from an index with only 4M entries takes so long.
How big are your messages usually?


(Tomswinkels) #16

First i have create the fulltext index

CALL db.index.fulltext.createNodeIndex("messages",["TS_P2000Message"],["message"])

PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node as message1
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message1 = message
RETURN count(*)

It takes more than 1035732 ms after that time i have killed the query.


(Michael Hunger) #17

What does the EXPLAIN look like?

And what does this look like:

PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node 
RETURN count(*)

(Michael Hunger) #18

Would it be possible to share your db with us for testing?
How big are your message texts usually?


(Tomswinkels) #19

I have run that query:


Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 1 total db hits in 3049 ms.

I will share my db with Neo4j, how can i do that? I can send a download link.

The messages texts are not big, max 200 characters.


(Michael Hunger) #20

Thanks I also meant the explain from the other statement that didn't finish.
You can send me the link to michael at neo4j.com or via PM.

Thanks so much.


(Tomswinkels) #21

Sorry, see here the explain.

I have send the link to you're mail.