Lowest impact way of finding duplicate nodes

jonathan.h.leverett · August 15, 2019, 8:12am

I am trying to return details of a low number of a single type of duplicated node in a dataset of 10,000,000+ nodes.

The node has 3 labels which can be used to cut the dataset to about 25% of the overall dataset. The node has 2 properties, one of which should be unique but is duplicated among the problem nodes.

The below Cypher will return details the details needed. (The other two unused labels can be inserted into another where clause to reduce the dataset to about 2.5m nodes).

MATCH (m:Aaa)
WITH m.aaa as aaa, collect (m) as nodes
WHERE size (nodes) > 1
RETURN [n in nodes | n.aaa] as aaas, size(nodes)

The explain plan shows a NodeByLabelScan on the full dataset and seems to be too expensive to run on the dataset.

Currently there is no index on aaa so we can apply this.

Is there a better way to find this data or reduce the cost of the query please?

andrew_bowman · August 15, 2019, 9:32am

An index might actually help here.

First, if you know the problem node has multiple labels, use those labels in your query, that filtering should be fairly efficient, and should happen before you do your collects().

Next, the extraction in your RETURN isn't needed. You already have the aaa value from when you did the collect().

Maybe try this instead (I'm substituting for the two additional labels on the node:

MATCH (m:Aaa:Bbb:Ccc)
WITH m.aaa as aaa, count (m) as nodeCount
WHERE nodeCount > 1
RETURN aaa, nodeCount

At this point an index ought to be helpful, as then you know exactly which aaa value to use as the lookups to the problematic nodes.

Topic		Replies	Views
Not detecting repeated nodes Neo4j Graph Platform migrated	7	155	January 20, 2023
Filtering a node that only has 1 type of label Newbie Questions cypher	6	2131	July 7, 2020
Return collection of unique property values from different nodes Cypher apoc , cypher	5	4492	April 2, 2021
Exact match - Check for duplicate nodes / check for duplicate relationships Cypher apoc , performance , cypher	0	8191	December 6, 2018
Duplicate checking Cypher cypher	2	499	March 21, 2023

Lowest impact way of finding duplicate nodes

Related topics