In Exercise 4.5, Why do I get duplicates?

patrick.monnoire · October 3, 2019, 9:35am

Hi all,
In Exercise 4.5 (Retrieve all people that wrote movies by testing the relationship between two nodes), I tried this query:

MATCH (p:Person) -- (m:Movie) WHERE ((p)-[:WROTE]->(m)) RETURN p.name, m.title

and I get

p.name	m.title
"Aaron Sorkin"	"A Few Good Men"
"Aaron Sorkin"	"A Few Good Men"
"Jim Cash"	"Top Gun"
"Cameron Crowe"	"Jerry Maguire"
"Cameron Crowe"	"Jerry Maguire"
"Cameron Crowe"	"Jerry Maguire"

But the exercise solution returns no duplicates:

MATCH (a)-[rel]->(m)
WHERE a:Person AND type(rel) = 'WROTE' AND m:Movie
RETURN a.name as Name, m.title as Movie

What am I doing wrong?

elaine_rosenber · October 3, 2019, 12:36pm

Hello Patrick,

The clause MATCH (p:Person)--(m:Movie) returns one row for every relationship between a person and a movie. So there is one row where Aaron Sorkin WROTE the movie and another row where Aaron Sorkin ACTED_IN the movie. The WHERE clause then tests, for each of these rows, whether the p node has a WROTE relationship to the m node. That test passes for both rows, so both are returned.
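One way to see where the extra rows come from is to bind the matched relationship to a variable and return its type. This is a minimal sketch, assuming the course's Movie database; the variable r and the alias rel_type are names introduced here for illustration only.

```cypher
// Bind the relationship matched by the undirected pattern to r,
// keep only person/movie pairs where the person also WROTE the movie,
// and show which relationship produced each row.
MATCH (p:Person)-[r]-(m:Movie)
WHERE (p)-[:WROTE]->(m)
RETURN p.name, m.title, type(r) AS rel_type
ORDER BY p.name, m.title
```

Each apparent duplicate should now differ in rel_type. For Aaron Sorkin and "A Few Good Men", for example, one row would show WROTE and the other ACTED_IN, as described above.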

Later in the course, you will learn about DISTINCT which removes duplicate rows. For example, this query would remove the duplicates:

MATCH (p:Person) -- (m:Movie)
WHERE ((p)-[:WROTE]->(m))
RETURN DISTINCT p.name, m.title

Elaine

patrick.monnoire · October 3, 2019, 12:45pm

Hi Elaine,

Thanks for this explanation.

In terms of performance, which query is better: the one from the exercise solution or the one with DISTINCT?

Patrick

andrew_bowman · October 16, 2019, 10:38pm

Well it depends on what you want out of the query.

The original query in the exercise:

MATCH (a)-[rel]->(m)
WHERE a:Person AND type(rel) = 'WROTE' AND m:Movie
RETURN a.name as Name, m.title as Movie

DISTINCT isn't required, because we're specifying paths with :WROTE relationships, and provided that our model only has a single :WROTE relationship between a person and movie (it does), there won't be duplicate rows.
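If you want to check that assumption against your own data, a query along these lines should list any person/movie pair connected by more than one :WROTE relationship. It is a sketch only; the alias wrote_count is a hypothetical name, and an empty result would mean the assumption holds.

```cypher
// Count :WROTE relationships per person/movie pair and keep only
// the pairs that have more than one.
MATCH (p:Person)-[r:WROTE]->(m:Movie)
WITH p, m, count(r) AS wrote_count
WHERE wrote_count > 1
RETURN p.name, m.title, wrote_count
```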

Your query is different:

MATCH (p:Person) -- (m:Movie)
WHERE ((p)-[:WROTE]->(m))
RETURN p.name, m.title

While you do filter out any paths where the person didn't write the movie, you're still getting a path per separate relationship type that exists between the person and the movie, but you don't make any use of these extra rows (no categorization by the type, no counting or aggregations). So while you could add a DISTINCT here to get rid of the duplicates, the bigger issue is that you've created a query that asks for more data than you need, requiring you to filter out the excess data at the end. It's better practice to fix your query such that you only get the exact data you need and nothing extra:

MATCH (p:Person)-[:WROTE]->(m:Movie)
RETURN p.name, m.title
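To compare the two approaches yourself, you can prefix each query with PROFILE and look at the rows and db hits reported for each operator in the plan. The sketch below is meant to be run one statement at a time; the actual numbers depend on your data and Neo4j version, so none are quoted here.

```cypher
// Run each statement separately and compare the profiled plans.

// Variant 1: match every relationship, filter, then deduplicate.
PROFILE
MATCH (p:Person)--(m:Movie)
WHERE (p)-[:WROTE]->(m)
RETURN DISTINCT p.name, m.title;

// Variant 2: match only the :WROTE relationships in the first place.
PROFILE
MATCH (p:Person)-[:WROTE]->(m:Movie)
RETURN p.name, m.title;
```

The first variant would be expected to expand more relationships before filtering, which is the "extra data" described above.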
