Help me write a Cypher query that returns nodes with duplicate outgoing relationships

Hello Cypher professionals,

I found myself in a situation where there's a discrepancy between the number of nodes and relationships coming out of these nodes. According to my business logic, the following should always be true:

Every node labeled 'Episode' should have exactly one outgoing relationship ':EPISODE_OF'.

Yet, right now, #'Episode' < #':EPISODE_OF' which makes me think that some episodes have several of these outgoing relationships. I know this because by checking node types on both ends of the path, I get the expected values. My goal is to identify these exceptional 'Episode' nodes.

I need help constructing a query that I would be able to run on a modest machine but a large graph (think ~25M relationships) (without killing it) that would return all distinct nodes which have more than one outgoing :EPISODE_OF relationship.

So, only return (e:Episode)-[r:EPISODE_OF]->(p:Podcast) if e has more than one r. Notice that, at the other end of this path, p can be the same node or a bunch of distinct nodes of the same type. I'm interested in both cases.

Both native and APOC-based solutions are acceptable.

Here's what I tried (and killed the server):

MATCH (e:Episode)-[:EPISODE_OF]->(p:Podcast)
WITH e,COUNT(p) as rels, collect(p) as podcasts
WHERE rels > 1
RETURN e,podcasts, rels

Thank you in advance.

Try this.

MATCH (e:Episode)-[r:EPISODE_OF]->(p:Podcast)
WITH e, p, collect(r) as rels
WHERE size(rels) > 1
RETURN e, p

You might put the collect step after your WHERE clause. It would probably reduce the size of your collection by quite a bit.

@ameyasoft, @MuddyBootsCode,

Thanks for chipping in. I highly appreciate your help. So, I tried both approaches and both of them yield the same result: query running indefinitely until the server dies. Here's the EXPLAIN result which happens to be identical for both approaches:

Is there a way to do some sort of periodic commit equivalent or something?

Change the return statement to RETURN e, p limit 20 and see if you are getting the desired output.

No luck:

Neo.TransientError.General.OutOfMemoryError

There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.

@malik,

Could you please try give a distinct Episode and test the result .
if it work then try to check if you do not have any loop in the graph

Tried this:

MATCH (e:Episode)-[:EPISODE_OF]->(p:Podcast)
WITH DISTINCT e, COUNT(p) as rels
WHERE rels > 1
RETURN DISTINCT e LIMIT 10

and this:

MATCH (e:Episode)-[r:EPISODE_OF]->(p:Podcast)
WITH DISTINCT e, p, collect(r) as rels
WHERE size(rels) > 1
RETURN DISTINCT e, p LIMIT 10

To no avail... Still running indefinitely until I run out of memory.

What I meant distinct Episode is

MATCH (e:Episode{name:"some specific value")-[r:EPISODE_OF]->(p:Podcast)
WITH e, collect(r) as rels
WHERE size(rels) > 1
RETURN DISTINCT e, rels

Oh, I see,

Yes, that works well but it always was.
I was trying to identify loops by doing the following:

MATCH p=(e1:Episode)-[*]->(e2:Episode)
WHERE e1.episodeId = e2.episodeId
RETURN NODES(p) as episodes

But the query didn't return any results. I guess, I can safely assume there are no loops involving Episodes as the query is relatively broad relationship-wise.

Here's another test I did to determine if there are any duplicates between e and p:

MATCH (e:Episode)-[r1:EPISODE_OF]->(p:Podcast)
WITH e,r1,p
MATCH (e)-[r2:EPISODE_OF]->(p)
WHERE ID(r2) <> ID(r1)
RETURN e,r1,r2,p

This query returned no results. I assume it means no pairs Episode+Podcast have duplicate EPISODE_OF relationships.

Friends,

I got it!
For those looking for something similar (or for future me), do the following:

To find duplicate links to distinct nodes

MATCH (p1:Podcast)<-[:EPISODE_OF]-(e)-[:EPISODE_OF]->(p2:Podcast)
WHERE p1.podcastId <> p2.podcastId
RETURN e LIMIT 10

To find duplicate links between pairs of nodes

MATCH (e:Episode)-[r1:EPISODE_OF]->(p:Podcast)
WITH e,r1,p
MATCH (e)-[r2:EPISODE_OF]->(p)
WHERE ID(r2) <> ID(r1)
RETURN e,r1,r2,p

Now for the subsequent question:
Is there a way to set up a DB constraint that will prevent something like this from happening? A uniqueness constraing on a path of some sort? So far I was only working with UUID or String uniqueness but can it be more sophisticated than this?

There's currently no way to create a uniqueness constraint on relationships. If you're interacting with your instance via one of the officially supported drivers you could add some of your own code to perform checks that would probably be able to prevent this from happening.

1 Like

Got it. Makes sense.