Help me write a Cypher query that returns nodes with duplicate outgoing relationships

malik · March 15, 2020, 9:59am

Hello Cypher professionals,

I found myself in a situation where there's a discrepancy between the number of nodes and relationships coming out of these nodes. According to my business logic, the following should always be true:

Every node labeled 'Episode' should have exactly one outgoing relationship ':EPISODE_OF'.

Yet, right now, #'Episode' < #':EPISODE_OF' which makes me think that some episodes have several of these outgoing relationships. I know this because by checking node types on both ends of the path, I get the expected values. My goal is to identify these exceptional 'Episode' nodes.

I need help constructing a query that returns all distinct nodes which have more than one outgoing :EPISODE_OF relationship. It has to run on a modest machine against a large graph (think ~25M relationships) without killing the server.

So, only return (e:Episode)-[r:EPISODE_OF]->(p:Podcast) if e has more than one r. Notice that, at the other end of this path, p can be the same node or a bunch of distinct nodes of the same type. I'm interested in both cases.

Both native and APOC-based solutions are acceptable.

Here's what I tried (and killed the server):

MATCH (e:Episode)-[:EPISODE_OF]->(p:Podcast)
WITH e,COUNT(p) as rels, collect(p) as podcasts
WHERE rels > 1
RETURN e,podcasts, rels

Thank you in advance.

ameyasoft · March 15, 2020, 6:05pm

Try this.

MATCH (e:Episode)-[r:EPISODE_OF]->(p:Podcast)
WITH e, p, collect(r) as rels
WHERE size(rels) > 1
RETURN e, p

MuddyBootsCode · March 15, 2020, 6:42pm

You might put the collect step after your WHERE clause. It would probably reduce the size of your collection by quite a bit.

malik · March 16, 2020, 5:29am

@ameyasoft, @MuddyBootsCode,

Thanks for chipping in. I highly appreciate your help. So, I tried both approaches and both of them yield the same result: query running indefinitely until the server dies. Here's the EXPLAIN result which happens to be identical for both approaches:

Is there a way to do some sort of periodic commit equivalent or something?

ameyasoft · March 16, 2020, 6:26am

Change the return statement to RETURN e, p limit 20 and see if you are getting the desired output.

malik · March 16, 2020, 6:41am

No luck:

Neo.TransientError.General.OutOfMemoryError

There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.

intouch_vivek · March 16, 2020, 6:49am

@malik,

Could you please try the query on one specific Episode and test the result?
If it works, then check whether you have any loops in the graph.

malik · March 16, 2020, 6:59am

Tried this:

MATCH (e:Episode)-[:EPISODE_OF]->(p:Podcast)
WITH DISTINCT e, COUNT(p) as rels
WHERE rels > 1
RETURN DISTINCT e LIMIT 10

and this:

MATCH (e:Episode)-[r:EPISODE_OF]->(p:Podcast)
WITH DISTINCT e, p, collect(r) as rels
WHERE size(rels) > 1
RETURN DISTINCT e, p LIMIT 10

To no avail... Still running indefinitely until I run out of memory.

intouch_vivek · March 16, 2020, 7:54am

What I meant by a distinct Episode is:

MATCH (e:Episode {name: "some specific value"})-[r:EPISODE_OF]->(p:Podcast)
WITH e, collect(r) as rels
WHERE size(rels) > 1
RETURN DISTINCT e, rels

malik · March 16, 2020, 8:11am

Oh, I see,

Yes, that works well, but it always did.
I was trying to identify loops by doing the following:

MATCH p=(e1:Episode)-[*]->(e2:Episode)
WHERE e1.episodeId = e2.episodeId
RETURN NODES(p) as episodes

But the query didn't return any results. I guess I can safely assume there are no loops involving Episodes, since the pattern is fairly broad: it places no restriction on relationship type.

Here's another test I did to determine if there are any duplicates between e and p:

MATCH (e:Episode)-[r1:EPISODE_OF]->(p:Podcast)
WITH e,r1,p
MATCH (e)-[r2:EPISODE_OF]->(p)
WHERE ID(r2) <> ID(r1)
RETURN e,r1,r2,p

This query returned no results. I assume this means that no Episode/Podcast pair has duplicate EPISODE_OF relationships.

malik · March 16, 2020, 8:51am

Friends,

I got it!
For those looking for something similar (or for future me), do the following:

To find duplicate links to distinct nodes

MATCH (p1:Podcast)<-[:EPISODE_OF]-(e)-[:EPISODE_OF]->(p2:Podcast)
WHERE p1.podcastId <> p2.podcastId
RETURN e LIMIT 10

To find duplicate links between pairs of nodes

MATCH (e:Episode)-[r1:EPISODE_OF]->(p:Podcast)
WITH e,r1,p
MATCH (e)-[r2:EPISODE_OF]->(p)
WHERE ID(r2) <> ID(r1)
RETURN e,r1,r2,p

Now for the subsequent question:
Is there a way to set up a DB constraint that will prevent something like this from happening? A uniqueness constraint on a path of some sort? So far I have only worked with UUID or String uniqueness constraints, but can it be more sophisticated than that?

MuddyBootsCode · March 16, 2020, 11:14am

There's currently no way to create a uniqueness constraint on relationships. If you're interacting with your instance via one of the officially supported drivers you could add some of your own code to perform checks that would probably be able to prevent this from happening.
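
As a rough illustration of such a check, here is a minimal sketch done in Cypher itself rather than in driver code. It only creates the relationship when the episode has no outgoing EPISODE_OF yet. The parameters $episodeId and $podcastId are hypothetical values the application would supply; episodeId and podcastId are the properties used earlier in this thread.

```cypher
// $episodeId and $podcastId are hypothetical parameters passed in by the application.
MATCH (e:Episode {episodeId: $episodeId})
MATCH (p:Podcast {podcastId: $podcastId})
// Only proceed if the episode has no outgoing EPISODE_OF relationship at all.
WHERE NOT (e)-[:EPISODE_OF]->()
CREATE (e)-[:EPISODE_OF]->(p)
```

If the only concern is duplicates between the same Episode/Podcast pair, MERGE (e)-[:EPISODE_OF]->(p) on the two matched nodes would cover that case, but it would not stop an episode from being linked to a second, different podcast. Neither form is a real constraint, so concurrent writers may still need extra care.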

malik · March 16, 2020, 12:30pm

Got it. Makes sense.
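
For later readers: another option that may be lighter on memory is to filter on each episode's relationship count instead of aggregating over every matched path. This is a sketch only and has not been run against the dataset in this thread. It assumes Neo4j 3.5/4.x syntax, and unlike the queries above it counts every outgoing EPISODE_OF relationship regardless of the label on the target node.

```cypher
// Neo4j 3.5/4.x: size() over a pattern expression counts matching relationships per node.
// On Neo4j 5, the equivalent predicate is: COUNT { (e)-[:EPISODE_OF]->() } > 1
MATCH (e:Episode)
WHERE size((e)-[:EPISODE_OF]->()) > 1
RETURN e
LIMIT 10
```

Because nothing is collected per episode, LIMIT can stop the scan as soon as enough offending nodes have been found. Drop the LIMIT to list all of them.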
