Exact match - Check for duplicate nodes / check for duplicate relationships


(Joel Duerksen) #1

I am writing a few Cypher queries to check for previously encountered issues as new datasets are added (the issues may be caused by the data and/or the load scripts). These checks are meant to be generic and of general use for us, though they might not apply to everyone's situation. I expect I'm not the first to write tests like these, so I'd like to know whether they can or should be improved: are there situations where they won't work as expected, is there a better way, and can they be made to perform better?

Environment: Neo4j 3.5-Enterprise

Test 1: Count duplicate nodes.
Definition: Two nodes have exactly the same labels and exactly the same properties (keys and values match exactly)
Expected: 0

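// count duplicate nodes
// group all nodes by label list and property map; each group of size n contributes n-1 duplicates
MATCH (a)
with labels(a) as la, properties(a) as p, count(properties(a)) as cpr
where cpr>1
return sum(cpr-1) as numDuplicateNodes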
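
When the count is not 0, it may help to see which nodes are involved. The following is a minimal sketch of a listing variant, assuming the same definition of "duplicate" as above; the alias `ids` and the `LIMIT 25` are illustrative choices, not part of the original test.

```cypher
// list groups of duplicate nodes instead of only counting them
// (sketch: same grouping keys as the count query; `ids` is an illustrative alias)
MATCH (a)
WITH labels(a) AS la, properties(a) AS p, collect(id(a)) AS ids
WHERE size(ids) > 1
RETURN la, p, ids
LIMIT 25
```

Each returned row is one group of identical nodes, with the internal ids of every member, so a non-zero result from the count query can be traced back to specific nodes.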

Test 2: Count duplicate relationships
Definition: For any (a)-[r]->(b), there are two relationships r between the same a and b, in the same direction, with the same type and the same properties (keys and values match exactly)
Expected: 0

// count duplicate relationships
MATCH (a)-[r]->(b)
with a, b, type(r) as tr, properties(r) as pr, count(properties(r)) as cpr
where cpr>1
return sum(cpr-1) as numDuplicateRelationships
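
As with the node test, a listing variant can make a non-zero count easier to investigate. This is a sketch under the same definition as above; the aliases `startId`, `endId`, and `ids`, and the `LIMIT 25`, are illustrative assumptions.

```cypher
// list groups of duplicate relationships instead of only counting them
// (sketch: same grouping keys as the count query; aliases are illustrative)
MATCH (a)-[r]->(b)
WITH a, b, type(r) AS tr, properties(r) AS pr, collect(id(r)) AS ids
WHERE size(ids) > 1
RETURN id(a) AS startId, tr, id(b) AS endId, pr, ids
LIMIT 25
```

Each row identifies one start node, end node, type, and property map that occurs more than once, along with the internal ids of the relationships in that group.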

----- To test the queries, I currently have to create the issues manually in a dev database.
The Cypher I use to create the issues in the dev database may also be of interest. I know these queries would need to be redesigned for a very large database. I'm working with fewer than a million nodes, so they are fast enough in my situation.

// create random duplicate nodes
match (a)
with a, rand() as r
order by r asc
with a, r LIMIT 10
with properties(a) as pa, labels(a) as la
create (b) set b=pa
with b, la
CALL apoc.create.addLabels( [ id(b) ], la) YIELD node
return ID(node)

// create random duplicate relationships
match (a)-[r]->(b)
with a, r, b, rand() as rnd
order by rnd asc
with a, r, b LIMIT 10
with a,b, r, type(r) as tr, properties(r) as pr
call apoc.create.relationship(a, tr, pr, b) YIELD rel
return count(rel)