Query for transitive relations, excluding nodes based on another relation type

Hi, I'm completely new to the graph database world. I started looking at Neo4j and have created a small database.
It has only one type of node and multiple relations to itself. The purpose of the DB is to group together possible duplicates.

It does so by creating relationships between one node and another that describe which data is duplicated between them (matches_email, matches_first_name, etc.).

A node is considered a duplicate of another if it matches on at least two fields, and the nature of a duplicate is transitive: if A is a duplicate of B and B is a duplicate of C, then A is a duplicate of C.

There is one type of relationship (is_not_duplicate_of) that breaks the transitive dependency. Here is an example:
A is a duplicate of B
B is a duplicate of C
C is a duplicate of D
A has a relationship of "is_not_duplicate_of" C
Then C and D are not duplicates of A,
but C and D remain duplicates of B.
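
The rule in this example can be sketched in plain Python as a breadth-first walk that never enters a node marked as not a duplicate of the starting node. This is only an illustration, not Neo4j code: the function name `duplicates_of` and the pair lists are made up, and duplicate links are assumed to be symmetric.

```python
from collections import deque


def duplicates_of(source, duplicate_pairs, not_duplicate_pairs):
    """Return every node transitively linked to `source` by duplicate pairs,
    skipping nodes explicitly marked as not a duplicate of `source`."""
    # Build an undirected adjacency map (assumption: duplicate links are symmetric).
    neighbors = {}
    for a, b in duplicate_pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Nodes with an is_not_duplicate_of link to the source, in either direction.
    blocked = {b for a, b in not_duplicate_pairs if a == source}
    blocked |= {a for a, b in not_duplicate_pairs if b == source}

    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            # A blocked node is neither reported nor expanded, so the branch ends here.
            if nxt in seen or nxt in blocked:
                continue
            seen.add(nxt)
            queue.append(nxt)
    return seen - {source}


duplicate_pairs = [("A", "B"), ("B", "C"), ("C", "D")]
not_duplicate_pairs = [("A", "C")]

print(sorted(duplicates_of("A", duplicate_pairs, not_duplicate_pairs)))  # ['B']
print(sorted(duplicates_of("B", duplicate_pairs, not_duplicate_pairs)))  # ['A', 'C', 'D']
```

Starting from A, the walk stops at C, so D is never reached through that branch. Starting from B, nothing is blocked and the whole chain is returned.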

To better illustrate, I have this graph:
[Image: screenshot of the example graph]
In this example the black lines are relationships that mark a node as having a match on at least two fields, and the blue line is the special relationship "is_not_duplicate_of". If I request the duplicates for "A" I should get:
AE, AG, AF, G

I built this query:

MATCH
(source:Contact{id:"01eqjjdjv5svz6r5xvrzrh0btz"}),
(source)-[:matches_first_name*5]->(duplicate:Contact),
(source)-[:matches_email|matches_phone*5]->(duplicate)
RETURN distinct duplicate.displayName

which returns:
AF, AE, A, AG, G, B, U, T, L
But of course this one does not exclude anything. I could limit the relationships to a depth of 1 and get the expected result, but then the transitive nature would be ignored. I tried a few variations but have not obtained the expected result: I have not found a way to build a path that starts at A, reaches every node that matches on at least two fields, and stops following a branch as soon as it reaches a node that has the relationship "is_not_duplicate_of" with the initial node.

Could I get some help as to how to achieve the expected result?

I think I understand. You have two paths in your query because you want matches on at least two attributes. You also want to eliminate a match if there exists an is_not_duplicate_of relationship with any node along a path and the original node, source in this case.

Try this. I didn't know whether the direction of the is_not_duplicate_of relationships mattered, so I left the pattern undirected in the exists() calls, which means it matches either direction.

MATCH (source:Contact{id:"01eqjjdjv5svz6r5xvrzrh0btz"})
MATCH p1=(source)-[:matches_first_name*5]->(duplicate:Contact)
WHERE none(n in nodes(p1) where exists( (n)-[:is_not_duplicate_of]-(source) ))
MATCH p2=(source)-[:matches_email|matches_phone*5]->(duplicate)
WHERE none(n in nodes(p2) where exists( (n)-[:is_not_duplicate_of]-(source) ))
RETURN distinct duplicate.displayName

Note that this may not execute quickly.

This may execute faster if you have a lot of duplicates, as it removes repeated candidate nodes before checking for the second path.

MATCH (source:Contact{id:"01eqjjdjv5svz6r5xvrzrh0btz"})
MATCH p1=(source)-[:matches_first_name*5]->(duplicate:Contact)
WHERE none(n in nodes(p1) where exists( (n)-[:is_not_duplicate_of]-(source) ))
WITH source, collect(distinct duplicate) as dups
UNWIND dups as duplicate
MATCH p2=(source)-[:matches_email|matches_phone*5]->(duplicate)
WHERE none(n in nodes(p2) where exists( (n)-[:is_not_duplicate_of]-(source) ))
RETURN distinct duplicate.displayName

Actually, I think this is the same:

MATCH (source:Contact{id:"01eqjjdjv5svz6r5xvrzrh0btz"})
MATCH p1=(source)-[:matches_first_name*5]->(duplicate:Contact)
WHERE none(n in nodes(p1) where exists( (n)-[:is_not_duplicate_of]-(source) ))
WITH distinct source, duplicate
MATCH p2=(source)-[:matches_email|matches_phone*5]->(duplicate)
WHERE none(n in nodes(p2) where exists( (n)-[:is_not_duplicate_of]-(source) ))
RETURN distinct duplicate.displayName

Sorry, I am not at my computer to test.

This may be more efficient:

MATCH (source:Contact{id:"01eqjjdjv5svz6r5xvrzrh0btz"})
MATCH p1=(source)-[:matches_first_name*5]->(duplicate:Contact)
WHERE none(n in nodes(p1) where exists( (n)-[:is_not_duplicate_of]-(source) ))
WITH distinct source, duplicate
WHERE exists {
  MATCH p2=(source)-[:matches_email|matches_phone*5]->(duplicate)
  WHERE none(n in nodes(p2) where exists( (n)-[:is_not_duplicate_of]-(source) ))
}
RETURN distinct duplicate.displayName
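
For checking results by hand on a small sample, the intent of these queries can be mirrored in a few lines of Python. This is only a sketch under assumptions: `reachable` and `duplicates` are hypothetical helpers, edges are given as directed (start, end) pairs per relationship type, and the hop limit is treated as "up to 5", whereas `*5` in Cypher asks for paths of exactly 5 hops (`*1..5` would be the ranged form). A contact counts as a duplicate only if it can be reached both through matches_first_name edges and through matches_email or matches_phone edges, with neither path touching a node linked to the source by is_not_duplicate_of.

```python
def reachable(source, edges, blocked, max_hops):
    """Nodes reachable from `source` in at most `max_hops` directed hops,
    never entering a blocked node."""
    adjacency = {}
    for start, end in edges:
        adjacency.setdefault(start, set()).add(end)

    found = set()
    frontier = {source}
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            for nxt in adjacency.get(node, ()):
                # Skip the source itself, blocked nodes, and nodes already found.
                if nxt != source and nxt not in blocked and nxt not in found:
                    found.add(nxt)
                    next_frontier.add(nxt)
        frontier = next_frontier
    return found


def duplicates(source, edges_by_type, not_duplicate_pairs, max_hops=5):
    # Nodes linked to the source by is_not_duplicate_of, in either direction.
    blocked = {b for a, b in not_duplicate_pairs if a == source}
    blocked |= {a for a, b in not_duplicate_pairs if b == source}

    by_name = reachable(source, edges_by_type["matches_first_name"], blocked, max_hops)
    by_contact = reachable(
        source,
        edges_by_type["matches_email"] + edges_by_type["matches_phone"],
        blocked,
        max_hops,
    )
    # Keep only nodes found through both groups of relationships.
    return by_name & by_contact


# Made-up sample data, not the graph from the screenshot.
edges_by_type = {
    "matches_first_name": [("A", "B"), ("B", "C"), ("C", "D")],
    "matches_email": [("A", "B"), ("B", "C")],
    "matches_phone": [("C", "D")],
}
not_duplicate_pairs = [("A", "C")]

print(sorted(duplicates("A", edges_by_type, not_duplicate_pairs)))  # ['B']
print(sorted(duplicates("B", edges_by_type, not_duplicate_pairs)))  # ['C', 'D']
```

In the second call A is absent only because the sample edges are followed in their stored direction, as with `->` in the queries. If your relationships are stored in one direction only, you may need to add the reverse pairs or drop the arrow in the Cypher pattern.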

This worked beautifully and got me on the right track for a more complex scenario I will be using. Thanks a bunch @glilienfield