Performance of query with DISTINCT and unique index

Hi Neo4j Community,

I have a question about query performance and the use of indices. I have performed the query

PROFILE MATCH (p:Provider) WHERE
EXISTS(p.service_provider) AND EXISTS(p.sourceID) AND EXISTS(p.valid_until)
RETURN DISTINCT(p.service_provider)

and I have got a uniqueness constraint on p.service_provider. I would have thought that the query planner picks up on this and makes use of the unique index and notices that the DISTINCT is actually redundant here. The query planner looks as follows:

However, when I got rid of the uniqueness constraint and just created a (NONUNIQUE) index on p.service_provider the query planner seems to do exactly the same.

Do I maybe have to rewrite the query somehow that in the situation where I have the uniqueness constraint (and hence a UNIQUE index) that the query planner picks up on this or does this just not happen in Neo4j (so far)?

I found similar questions about indexing and query performance online, but nothing that seems to address exactly this.

Thanks for letting me know.

Best,
Philipp

Hello, Philipp! Thanks for reaching out.

I think the issue here is that your query is checking whether the property exists, not whether the value is unique or not. So your query will not hit the index because it's not checking the value of the property. An existence constraint would be more appropriate here, although then your query isn't needed because the constraint will ensure the property exists and fail if it doesn't. Documentation on each constraint type is here: Constraints - Cypher Manual.

Let me know if you have more questions on this!

Cheers,
Jennifer Reif

Hi Jennifer,

Thanks a lot for getting back to me. I am still left a little confused I have to say.

Even when just executing

PROFILE MATCH (p:Provider)
RETURN DISTINCT(p.service_provider)

then the query plan looks the same, whether I have a unique constraint on p.service_provider or not. How would I have to go about things that the redundant check for DISTINCT will be omitted, i.e. the query could be sped up?

I know that in this instance I could just leave out the DISTINCT as the property has to be unique due to the constraint that is enforced. However, say I have a database and I am not aware of which constraints are enforced, etc. how would such a query have to be written so the redundant DISTINCT check won't be executed?

Thanks for letting me know, much appreciated :slightly_smiling_face:

Got a bit more detail on this.....Cypher planner doesn't eliminate Distinct in this case. It does in some other cases in Neo4j 5.x, but nothing in 4.x.

However, your query's plan is using OrderedDistinct, which is much more efficient (improving query performance). For unique values coming from an index, OrderedDistinct will only need to keep a single row in memory, whereas plain Distinct must accumulate all rows in memory. The easiest way to get rid of OrderedDistinct is not to use DISTINCT in the query when you know that the values are unique. This method does require some knowledge of constraints and db operations, though.

Hi Jennifer,

Thanks a lot for the detailed explanation, and in particular about the difference between OrderedDistinct and (regular) Distinct in the query planner.

That helps a lot :slight_smile:

1 Like