Query Optimization - We want to drop the "distinct" part

performance
cypher
operations

(Amaier) #1

Hello community!

We are optimizing some Cypher queries, and so far everything is going well except for one thing. Here is our Cypher query:

MATCH path=(s:Source)-[:Link]->(:A1)-[:Link]->(:A2)-[:Link]->(:Sink)
WITH [s IN nodes(path) | id(s)] AS node_traces
RETURN DISTINCT node_traces

If we remove the "distinct" part of the return statement...

MATCH path=(s:Source)-[:Link]->(:A1)-[:Link]->(:A2)-[:Link]->(:Sink)
WITH [s IN nodes(path) | id(s)] AS node_traces
RETURN node_traces

We expected to receive the same output, but some paths are matched twice. We have large datasets and would like to avoid "distinct".

On the other hand, for smaller TIMs (<4) we got the same output. For example, the following two queries return the same number of paths:

MATCH path=(s:Source)-[:Link]->(:A1)-[:Link]->(:Sink)
WITH [s IN nodes(path) | id(s)] AS node_traces
RETURN node_traces

MATCH path=(s:Source)-[:Link]->(:A1)-[:Link]->(:Sink)
WITH [s IN nodes(path) | id(s)] AS node_traces
RETURN DISTINCT node_traces

Can anyone explain that phenomenon?


(Michael Hunger) #2

There might be several Link relationships between the same two nodes. That would produce different paths over the same nodes, because uniqueness is enforced on relationships, not on nodes.
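
To make the point about relationship uniqueness concrete, here is a small Python sketch (not Cypher, and not how Neo4j works internally). The toy graph, its ids, and the helper `node_traces` are all hypothetical. Labels are left out; the shape of the toy graph stands in for them. It shows that one extra parallel relationship is enough to make the same node list appear twice.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy graph: each relationship is (rel_id, start_node, end_node).
# Node ids: 0 = Source, 1 = A1, 2 = A2, 3 = Sink.
RELS: List[Tuple[int, int, int]] = [
    (10, 0, 1),  # Source -> A1
    (11, 1, 2),  # A1 -> A2
    (12, 2, 3),  # A2 -> Sink
    (13, 2, 3),  # a second, parallel relationship between the same A2 and Sink
]


def node_traces(rels: List[Tuple[int, int, int]], start: int, hops: int) -> List[List[int]]:
    """Return the node-id list of every path with `hops` relationships from `start`.

    A relationship is never reused within one path (relationship uniqueness),
    but nothing stops two different paths from visiting the same nodes.
    """
    outgoing: Dict[int, List[Tuple[int, int]]] = {}
    for rel_id, a, b in rels:
        outgoing.setdefault(a, []).append((rel_id, b))

    traces: List[List[int]] = []

    def walk(node: int, trace: List[int], used: FrozenSet[int]) -> None:
        if len(used) == hops:
            traces.append(trace)
            return
        for rel_id, nxt in outgoing.get(node, []):
            if rel_id not in used:
                walk(nxt, trace + [nxt], used | {rel_id})

    walk(start, [start], frozenset())
    return traces


three_hop = node_traces(RELS, start=0, hops=3)  # crosses the doubled relationship
two_hop = node_traces(RELS, start=0, hops=2)    # stops before it
print(three_hop)
print(two_hop)
```

In this toy graph the three-hop walk yields the same node list twice, once per parallel relationship, while the two-hop walk yields it once. That is one possible reason (an assumption, not something confirmed in this thread) why a shorter pattern could show no duplicates on the same data.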

How much does the distinct really affect your query time?

Did you try:

MATCH path=(s:Source)-[:Link]->(:A1)-[:Link]->(:A2)-[:Link]->(:Sink)
WITH DISTINCT nodes(path) AS nodes
RETURN [s IN nodes | id(s)] AS node_traces

I don't think there is a built-in path-uniqueness operation right now, since it would still require keeping past paths in a data structure to compare against.
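
As a rough illustration of why any deduplication has to remember earlier results, here is a minimal Python sketch of streaming de-duplication. It is not Neo4j's implementation of DISTINCT; `distinct_traces` and the sample data are hypothetical. The point is only that the `seen` set grows with the number of unique traces.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def distinct_traces(
    traces: Iterable[List[int]], seen: Set[Tuple[int, ...]]
) -> Iterator[List[int]]:
    """Yield each node trace the first time it appears.

    `seen` holds one entry per unique trace emitted so far, so its memory
    use grows with the size of the distinct result, not with the input.
    """
    for trace in traces:
        key = tuple(trace)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            yield trace


# Hypothetical input: the first trace arrives twice (e.g. via parallel relationships).
incoming = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 4, 5, 3]]
seen: Set[Tuple[int, ...]] = set()
unique = list(distinct_traces(incoming, seen))
print(unique, len(seen))
```

However it is done, removing duplicate traces needs some bookkeeping of this kind, which is why it helps to measure how much the distinct actually costs before trying to avoid it.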

Are you using Enterprise Edition with the slotted runtime?