When use DISTINCT after ORDER BY clause, is the order guaranteed to be preserved?

Hi there, I have a question regarding DISTINCT and ORDER BY clause. When I have a cypher like this:

MATCH (node2:person)-[:primary_residence]-(node2_primary_residence:address)
WITH node2, node2_primary_residence
ORDER BY node2_primary_residence.city, node2.record_number
RETURN DISTINCT node2 
SKIP $offset LIMIT $limit

where a node2 has multiple connected node2_primary_residence nodes. Will the order be preserved for DISTINCT? I read the official document says "Unless ORDER BY is used, Neo4j does not guarantee the row order of a query result." ORDER BY - Cypher Manual I'm wondering if the DISTINCT will guarantee the previous row orders?

I believe it will be. Here is an example:

1 Like

If the RETURN statement does not have an ORDER BY, then there is no order guarantee, as stated in the documentation.

The fact that it works now is an implementation artifact. The implementation might never change, or it might. If you rely on RETURN preserving the order enforced by previous statements, you might one day regret it.

1 Like

May I ask what should be the correct way to handle this scenario? From my understanding even if I do DISTINCT before the ORDER BY, there is no guarantee that which rows of the duplicates node2 will be removed. Do you mean there is a way to embed the ORDER BY inside RETURN DISTINCT? Thanks in advance.

Do you mean there is a way to embed the ORDER BY inside RETURN DISTINCT?

Looking at the example above

unwind [1,1,3,7,3,4,7,1,10] as item
return distinct item order by item

Looking at your query: you want to order for something that is not part of your returned results (node2_primary_residence.city). That is an impossibility.

Two options:

  1. You do not order for node2_primary_residence.city
MATCH (node2:person)-[:primary_residence]-(node2_primary_residence:address)
RETURN DISTINCT node2 ORDER BY node2.record_number
SKIP $offset LIMIT $limit
  1. you add node2_primary_residence.city to your returned values
MATCH (node2:person)-[:primary_residence]-(node2_primary_residence:address)
RETURN DISTINCT node2, node2_primary_residence 
ORDER BY node2_primary_residence.city, node2.record_number
SKIP $offset LIMIT $limit

Hope it helps.

Hi, thanks for the explanation! Our use case requires ordering by a specific attribute (city in the example) first, followed by record_number, so option 1 is not viable for us.

Regarding option 2, while we can return node2_primary_residence.city if needed, the issue arises when there are multiple node2_primary_residence rows. The RETURN DISTINCT node2, node2_primary_residence clause removes duplicate rows arbitrarily, which means the final results may not preserve the desired city order.

Given this, it seems that achieving the intended sorting using DISTINCT + ORDER BY alone is not feasible then. Do you know if there are any alternative approach I can look into? Thanks!

the issue arises when there are multiple node2_primary_residence rows

Are you saying that there can be several :Person at the same :Address? I am not sure I understand the problem. It would help if you could supply some simple data showing your problem.

I had a go:

create (p1:Person {record_number: 1})
create (p2:Person {record_number: 2})
create (p3:Person {record_number: 3})
create (p4:Person {record_number: 4})
create (p5:Person {record_number: 5})
create (p6:Person {record_number: 6})

create (a1:Address {city: 'City 1'})
create (a2:Address {city: 'City 2'})
create (a3:Address {city: 'City 3'})
create (a4:Address {city: 'City 4'})

create (p1)-[:PRIMARY_RESIDENCE]->(a1)
create (p5)-[:PRIMARY_RESIDENCE]->(a1)
create (p2)-[:PRIMARY_RESIDENCE]->(a2)
create (p3)-[:PRIMARY_RESIDENCE]->(a3)
create (p4)-[:PRIMARY_RESIDENCE]->(a4)
create (p6)-[:PRIMARY_RESIDENCE]->(a4)

p1 and p5 share address a1, p4 and p6 share address a4.

Your original query (I have taken out skip and limit), with added return value:

neo4j@order> MATCH (node2:Person)-[:PRIMARY_RESIDENCE]-(node2_primary_residence:Address)
             WITH node2, node2_primary_residence
             ORDER BY node2_primary_residence.city, node2.record_number
             RETURN DISTINCT node2, node2_primary_residence;
+------------------------------------------------------------+
| node2                        | node2_primary_residence     |
+------------------------------------------------------------+
| (:Person {record_number: 1}) | (:Address {city: "City 1"}) |
| (:Person {record_number: 5}) | (:Address {city: "City 1"}) |
| (:Person {record_number: 2}) | (:Address {city: "City 2"}) |
| (:Person {record_number: 3}) | (:Address {city: "City 3"}) |
| (:Person {record_number: 4}) | (:Address {city: "City 4"}) |
| (:Person {record_number: 6}) | (:Address {city: "City 4"}) |
+------------------------------------------------------------+

The query with ORDER BY on RETURN:

neo4j@order> MATCH (node2:Person)-[:PRIMARY_RESIDENCE]-(node2_primary_residence:Address)
             RETURN DISTINCT node2, node2_primary_residence
             ORDER BY node2_primary_residence.city, node2.record_number;
+------------------------------------------------------------+
| node2                        | node2_primary_residence     |
+------------------------------------------------------------+
| (:Person {record_number: 1}) | (:Address {city: "City 1"}) |
| (:Person {record_number: 5}) | (:Address {city: "City 1"}) |
| (:Person {record_number: 2}) | (:Address {city: "City 2"}) |
| (:Person {record_number: 3}) | (:Address {city: "City 3"}) |
| (:Person {record_number: 4}) | (:Address {city: "City 4"}) |
| (:Person {record_number: 6}) | (:Address {city: "City 4"}) |
+------------------------------------------------------------+

It would help if you could modify the example to show the problem.

Hi, our data structure is one person can have multiple addresses e.g.

create (p1:Person {record_number: 1})
create (p2:Person {record_number: 2})

create (a1:Address {city: 'City 1'})
create (a2:Address {city: 'City 2'})
create (a3:Address {city: 'City 3'})
create (a4:Address {city: 'City 4'})

create (p1)-[:PRIMARY_RESIDENCE]->(a1)
create (p1)-[:PRIMARY_RESIDENCE]->(a3)
create (p2)-[:PRIMARY_RESIDENCE]->(a2)
create (p2)-[:PRIMARY_RESIDENCE]->(a4)

e.g. if p1 has address a1 and a3; p2 has address a2 and a4. And we want to allow user to sort by address.city first so in this example we would expect to have a cypher to give result sets with p1 first and then p2.

If we use

MATCH (node2:Person)-[:PRIMARY_RESIDENCE]-(node2_primary_residence:Address)
             RETURN DISTINCT node2, node2_primary_residence
             ORDER BY node2_primary_residence.city, node2.record_number;

From my understanding this DISTINCT might have a chance to give

+------------------------------------------------------------+
| node2                        | node2_primary_residence     |
+------------------------------------------------------------+
| (:Person {record_number: 1}) | (:Address {city: "City 3"}) |
| (:Person {record_number: 2}) | (:Address {city: "City 2"}) |
+------------------------------------------------------------+

and lead to the order by result with p2 first and p1 second which is not as expected.

DISTINCT on a tuple determines the distinct tuples, not the distinct on just the first attribute in the return or with statement. As such, your suggestion will not work.

This seems to work. Here I collect each user's addresses individually, then sort them and keep the first one in the sort result. Now the result is rows of distinct Persons, but with their city that is first in the sort order. You can then sort this list to get the result you want.

Implementation for Ascending order:

MATCH (node2:Person)-[:PRIMARY_RESIDENCE]-(node2_primary_residence:Address)
WITH node2, collect(node2_primary_residence) as userAddresses
Call (node2, userAddresses) {
    unwind userAddresses as address
    order by address.city
    limit 1
    return address.city as usersAddress
}
return node2.record_number
order by usersAddress

Descending order requires you to set the order for both Order By clauses:

MATCH (node2:Person)-[:PRIMARY_RESIDENCE]-(node2_primary_residence:Address)
WITH node2, collect(node2_primary_residence) as userAddresses
Call (node2, userAddresses) {
    unwind userAddresses as address
    order by address.city desc
    limit 1
    return address.city as usersAddress
}
return node2.record_number
order by usersAddress desc