I do not understand why I have this result

Hi everyone,

I'm trying to get the result of the following query :

match (n:Student)-[r:WORKS_WITH]->(m:Tutor)
with collect(n.name) as studs,m
return studs, collect(m.name) as tutors

I have some students who works with some tutors. 1 or more students can have 1 or more tutors.
I had issues with aggregating the whole thing.

When I tried

match (n:Student)-[r:WORKS_WITH]->(m:Tutor)
return collect(n.name) as studs, collect(m.name) as tutors

It only gives me all students in one column and all tutors in the other one instead of giving me all student-tutor pairs.
If someone could explain me why.

But the interesting thing is the following :

Instead of having my 2 students on the left and my 3 tutors on the right in 1 line, I have these 2 lines. There is no duplicate nodes in db which could explain why I get this.
Can someone help me get it the right way ?

Try this:

match (n:Student)-[r:WORKS_WITH]->(m:Tutor)
with distinct n.name as student, collect(distinct m.name) as tutors
return student, tutors, size(tutors) as tutorCnt

1 Like

To answer why -

When you use an aggregating function like collect, the aggregation (group-by) has to be done by some group by field. In cypher, this group by is implicit based on the fields specified in the WITH or RETURN statement. Any non-aggregated fields used along with an aggregation field become the group-by fields.

So, In your first query, you had -

with collect(n.name) as studs,m

Studs is an aggregated field, m is not. m becomes the default group-by field. Here, it would collect i.e. form a list of students for the value of m who is a tutor. So, you group-by field is the tutor m and it returns a list of students in the studs variable like this - ['stud1', 'stud2','stud3'].

then you had - return studs, collect(m.name) as tutors

Now, the collect would construct a list of tutors using the studs list as the group-by field. Remember, studs is of the form ['stud1', 'stud2','stud3'] which means if there are two rows from the WITH statement where
stud = ['stud1', 'stud2','stud3'] and m = 'tutor1'
stud = ['stud1', 'stud2','stud3'] and m = 'tutor2'
then you final RETURN should give
stud = ['stud1', 'stud2','stud3'] and tutors =[ 'tutor1','tutor2']

Remember, the group-by values have to be exactly the same values, same order.

In your second query,

match (n:Student)-[r:WORKS_WITH]->(m:Tutor)
return collect(n.name) as studs, collect(m.name) as tutors

It only gives me all students in one column and all tutors in the other one instead of giving me all student-tutor pairs.

This is correct. It has no specific group-by fields, so it should simply group all values for both students and tutors and display it.

But the interesting thing is the following :
bug|690x288

For this are you able to display your graph here? Do both students works with all 3 tutors? If not, that might be why.

Also, can you post a sample of the output you want to see?

1 Like

Thank you for your answer. But it doesn't fit. The tutors are properly aggregated but there is duplicate as you can see here

The students are not group by same tutors

Thank you for explanation, I always get troubles when aggregating, it looks more clear to me now.

Blockquote
For this are you able to display your graph here? Do both students works with all 3 tutors? If not, that might be why.
Also, can you post a sample of the output you want to see?

As you said earlier :

Blockquote
Remember, the group-by values have to be exactly the same values, same order.

I have the same values but not the same order and I don't know why because I have no duplicate nodes.

Here is the graph :

The students are in blue and there tutors are in pink, the 2 interesting students are in the white circle.

The output I expect is the same as the image in my first post but with my 2 lines becoming only 1 line because my students "Imane and Julien" are the same nodes as "Julien and Imane"

Thanks

Edit :
Thanks to @poornima, I tried other way of aggregation and by aggregating my tutors first, I don't have my students in a different order.

I still don't understand why this way works but the other doesn't. Maybe if you or someone else can enlighten me I'd be grateful.

The query

match (n:Student)-[r:WORKS_WITH]->(m:Tutor)
with n,collect(m.name) as tutors
return collect(n.name) as studs, tutors

so :slight_smile:, I re-created partial data and ran both queries myself and both your original query and the latest one yield the same results for me!!!!

Your first query in original post -

Your latest query -

So, I re-ran the queries with the PROFILE keyword which tells me the steps the query would go through to get to the result.

Those Ordered By lines only occur in the first query but not in the second. Also, the ordering is by Student only even when grouping is by Tutor. However even with this, I got the same results for both queries. I re-created data with just the first names and I only created 5 nodes - 2 students and 3 tutors. Perhaps, last names make a difference in the sorting process?

I think it would help you to add the PROFILE keyword and analyze what's happening between the queries.

1 Like

Thank you for your help. I ran both queries with PROFILE and I got the same thing. The 1st query is Ordered by but the last is not.
If last name can change the sorting, I don't understand how is this possible because the last name is in the name property.

Indeed, it shouldn't. Ok, first - may I say how interesting this problem is? :slight_smile:

Initially I was going to give a long explanation, I need to confirm somethings. Can you simply run this query -

MATCH (m)-[r]->(n) 
RETURN m,r,n

and tell me in what order the value of the 'identity' is assigned, especially for the relationships? Must look like the below -

Tell me if the value of identity for the relationship WORKS_WITH between Steven and Julien is higher or lower than the one that connects Steven and Imane and similarly, for the other tutors for these 2 students.

You see just by changing the order of creation of the nodes and particularly the relationships, I was able to get your good query to Fail!!!! :grinning:

I believe the issue is because of the sorting + eager operation (the collect function).

I will also post here a resource and some keywords -

Keywords - sorting order, eager operators and anchoring.
Resource - How Queries Work in Neo4j - Cypher Query Tuning in Neo4j 4.x

Let's come back to this after I get your reply.

Try this:
match (n:Student)-[r:WORKS_WITH]->(m:Tutor)
with distinct m.name as tutor, collect(n.name) as studs
return tutor, studs, size(studs) as studentCnt

@ameyasoft hold on, I suspect this query will need a sort somewhere in between to be absolutely foolproof.
Also, Still doesn't give OP's expected result.

This is the query we should use -


CALL { 
MATCH (n:Student)-[r:WORKS_WITH]->(m:Tutor) 
RETURN n,m 
ORDER BY n.name,m.name
}
WITH n,collect(m.name) as tutors
RETURN collect(n.name) as studs, tutors

You said :

Blockquote
Also, the ordering is by Student only even when grouping is by Tutor

But when I use PROFILE in my queries and only the 1st is ordered by but not the last one. I forgot to tell you that for me, Students and tutors are both ordered by, not only students.

Anyway, when I look at the identity of the relationships for Imane, Julien and there tutors I got something like:
Imane, 83, Nassim
Imane, 81, Arnaud
Imane, 82, Steven
Julien, 78, Arnaud
Julien, 80, Nassim
Julien, 79, Steven

As you can see, the order is not exactly the same but Steven is the last tutor to appear for both students but his relationship with both Students has the identity number in the middle.

I quite don't understand this :

Blockquote
You see just by changing the order of creation of the nodes and particularly the relationships, I was able to get your good query to Fail!!!! :grinning:

When I created my db, I created all relationships between Julien and his tutors (Arnaud, Steven then Nassim) and then Imane and her tutors (same order Arnaud, Steven then Nassim).
It is quite interesting why order is not the same for these students. I have 2 other students who are related to 3 tutors but they appear in the exact same order (tutor 1, 2, 3 for both).

Thank you again for looking for explanation !

I also created that small part and get results as expected, one line with 2 studs in one column and 3 tutors in another. But have no ordering statements in PROFILE anywhere. Did you set any indexes on any of those Labels?

What happens if you just try the first part of the query:

match (n:Student)-[r:WORKS_WITH]->(m:Tutor)
with collect(n.name) as studs, m
return studs, m.name

should result in 3 lines with studs always the same array of 2 and each tutor name per row

Hi, So what I meant was without changing anything in the good query i.e. the one that worked for you, I was able to produce a different (And very wrong) result just by changing the order of creating nodes and relationships.

In your PROFILE/EXPLAIN queries, the ORDER BY statement which you see for your first query sorts by the node itself i.e. the internally assigned node identifier and not the name property of the node. The name property will only be used if explicitly stated to be used in a sort, as shown in the query below -

As to why this is so, I am still trying to figure it out.

The big takeaway here is: if you want results ordered in some way, you've got to explicitly use ORDER BY for the fields/variables you want for the ordering.

Otherwise there is no guarantee about the order. Even if you see an apparent order, Cypher is under no guarantees to keep that ordering if the planner decides to plan the query differently, or on later versions of Neo4j.