I found myself in a situation where there's a discrepancy between the number of nodes and relationships coming out of these nodes. According to my business logic, the following should always be true:
Every node labeled 'Episode' should have exactly one outgoing relationship ':EPISODE_OF'.
Yet, right now, #'Episode' < #':EPISODE_OF' which makes me think that some episodes have several of these outgoing relationships. I know this because by checking node types on both ends of the path, I get the expected values. My goal is to identify these exceptional 'Episode' nodes.
I need help constructing a query that I would be able to run on a modest machine but a large graph (think ~25M relationships) (without killing it) that would return all distinct nodes which have more than one outgoing :EPISODE_OF relationship.
So, only return (e:Episode)-[r:EPISODE_OF]->(p:Podcast) if e has more than one r. Notice that, at the other end of this path, p can be the same node or a bunch of distinct nodes of the same type. I'm interested in both cases.
Both native and APOC-based solutions are acceptable.
Here's what I tried (and killed the server):
MATCH (e:Episode)-[:EPISODE_OF]->(p:Podcast)
WITH e,COUNT(p) as rels, collect(p) as podcasts
WHERE rels > 1
RETURN e,podcasts, rels
Thanks for chipping in. I highly appreciate your help. So, I tried both approaches and both of them yield the same result: query running indefinitely until the server dies. Here's the EXPLAIN result which happens to be identical for both approaches:
There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
Yes, that works well but it always was.
I was trying to identify loops by doing the following:
MATCH p=(e1:Episode)-[*]->(e2:Episode)
WHERE e1.episodeId = e2.episodeId
RETURN NODES(p) as episodes
But the query didn't return any results. I guess, I can safely assume there are no loops involving Episodes as the query is relatively broad relationship-wise.
Here's another test I did to determine if there are any duplicates between e and p:
MATCH (e:Episode)-[r1:EPISODE_OF]->(p:Podcast)
WITH e,r1,p
MATCH (e)-[r2:EPISODE_OF]->(p)
WHERE ID(r2) <> ID(r1)
RETURN e,r1,r2,p
This query returned no results. I assume it means no pairs Episode+Podcast have duplicate EPISODE_OF relationships.
I got it!
For those looking for something similar (or for future me), do the following:
To find duplicate links to distinct nodes
MATCH (p1:Podcast)<-[:EPISODE_OF]-(e)-[:EPISODE_OF]->(p2:Podcast)
WHERE p1.podcastId <> p2.podcastId
RETURN e LIMIT 10
To find duplicate links between pairs of nodes
MATCH (e:Episode)-[r1:EPISODE_OF]->(p:Podcast)
WITH e,r1,p
MATCH (e)-[r2:EPISODE_OF]->(p)
WHERE ID(r2) <> ID(r1)
RETURN e,r1,r2,p
Now for the subsequent question: Is there a way to set up a DB constraint that will prevent something like this from happening? A uniqueness constraing on a path of some sort? So far I was only working with UUID or String uniqueness but can it be more sophisticated than this?
There's currently no way to create a uniqueness constraint on relationships. If you're interacting with your instance via one of the officially supported drivers you could add some of your own code to perform checks that would probably be able to prevent this from happening.