Searching for Duplicates with Cypher match on properties

I am new to Cypher and work for a non-profit looking into financial crime. Most of our graph consists of persons and entities, and I want to check for duplicates. I tried the following simple query, but it returned everything:

MATCH (a), (b)
WHERE a.name = b.name

How do I match for nodes with the exact same name property? How do I match for nodes where one name is contained in the other? For example one node is Mike Green and the other Mike Green Smith (Mike Green is contained completely in Mike Green Smith).

I really appreciate any advice you have! Just getting started and learning my way around.

Hello @mkretsch and welcome to the Neo4j community :slight_smile:

We will need a full-text index on name for a quick search. I suppose you are using Person and Entity as node labels:

CALL db.index.fulltext.createNodeIndex("node_name", ["Person", "Entity"], ["name"])
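
To check that the index returns what you expect, you can query it directly. This is only a minimal sketch: it assumes the index name node_name from above and uses the Mike Green example from the question as the search phrase. The double quotes inside the string ask Lucene to match the whole phrase rather than the individual words.

```cypher
// Query the full-text index directly and show the best matches first
CALL db.index.fulltext.queryNodes("node_name", '"Mike Green"')
YIELD node, score
RETURN node.name AS name, labels(node) AS labels, score
ORDER BY score DESC
```

Both Mike Green and Mike Green Smith should show up here if they exist in the graph, with different scores.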

This query uses a subquery to collect, for each name, the nodes that have the same name (ignoring case) together with the nodes that have a similar name according to the full-text index:

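MATCH (a)
CALL {
  // An importing WITH may only list outer variables, so import a first
  WITH a
  WITH a.name AS name
  MATCH (b)
  WHERE name =~ '(?i)' + b.name
  // Carry name through the aggregation so it stays in scope
  WITH name, collect(b) AS nodes
  CALL db.index.fulltext.queryNodes("node_name", name) YIELD node
  WITH name, nodes, collect(node) AS similar
  RETURN name, similar + nodes AS nodes
}
RETURN DISTINCT name, nodes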
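
As a side note on the original question: MATCH (a), (b) WHERE a.name = b.name returns everything because every node also matches itself. If you only need the two checks you asked about, a plain-Cypher sketch without the index could look like the following. It assumes every relevant node stores its name in the name property, and the two statements are meant to be run separately. The second one compares every pair of nodes, so it may be slow on a large graph.

```cypher
// 1) Exact duplicates: group nodes by name, keep names used more than once
MATCH (n)
WHERE n.name IS NOT NULL
WITH n.name AS name, collect(n) AS nodes
WHERE size(nodes) > 1
RETURN name, nodes;

// 2) Containment: the shorter name appears inside the longer one,
//    e.g. "Mike Green" inside "Mike Green Smith"
MATCH (a), (b)
WHERE a <> b                       // skip a node matching itself
  AND b.name CONTAINS a.name
RETURN a.name AS shorter, b.name AS longer;
```

CONTAINS is case-sensitive, so you may want to wrap both sides in toLower() if the capitalization of names varies.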

Regards,
Cobra

Thank you Cobra for your help. Unfortunately I am getting an error when I run this in Neo4j. It appears unhappy with the curly brackets and how the "name" was defined. Any thoughts on how to avoid these errors?

Can you show the error?

Invalid input '{': expected whitespace, comment, namespace of a procedure or a procedure name (line 3, column 6 (offset: 99))
"CALL {"
      ^
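
MATCH (a)
CALL {
  WITH a
  MATCH (b)
  WHERE a.name =~ '(?i)' + b.name
  // Keep a in scope across the aggregation
  WITH a, collect(b) AS nodes
  CALL db.index.fulltext.queryNodes("node_name", a.name) YIELD node
  WITH a, nodes, collect(node) AS similar
  // Alias the name so the outer RETURN can refer to it
  RETURN a.name AS name, similar + nodes AS nodes
}
RETURN DISTINCT name, nodes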

Still getting an error, any suggestions?

Invalid input '{': expected whitespace, comment, namespace of a procedure or a procedure name (line 2, column 6 (offset: 15))
"CALL {WITH a"
      ^

Which version of Neo4j are you using?
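
If you are not sure, the server can report it. A small sketch using the built-in dbms.components procedure, which you can run from the Browser:

```cypher
// Report the server name, version and edition
CALL dbms.components()
YIELD name, versions, edition
RETURN name, versions, edition
```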

Neo4j Browser version: 4.0.8

Neo4j Server version: 3.5.18 (community)

That's why it's not working: this query only works on Neo4j 4.1 or later :slight_smile:

Can you upgrade or do you want another query for your current version?

This query does not need subquery support, so it should also run on your 3.5 server, but it requires APOC:

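MATCH (a)
WITH a.name AS name
CALL apoc.cypher.run('
    MATCH (b)
    WHERE $name =~ "(?i)" + b.name
    WITH collect(b) AS nodes
    CALL db.index.fulltext.queryNodes("node_name", $name) YIELD node
    WITH nodes, collect(node) AS similar
    RETURN $name AS name, similar + nodes AS nodes
', {name: name})
YIELD value
// apoc.cypher.run returns each row as a map called value
RETURN DISTINCT value.name AS name, value.nodes AS nodes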
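
Before running the query above, it may help to confirm that APOC is actually installed on the server. A quick check, assuming the plugin is loaded under its usual apoc namespace:

```cypher
// Returns the installed APOC version, or fails if APOC is not available
RETURN apoc.version() AS apoc_version
```

If this fails with an unknown function error, the plugin still needs to be added to the server's plugins folder.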

Regards,
Cobra

Thank you, that seems to work now! I really appreciate you reaching out to help.

No problem, I'm happy to hear it :slight_smile: