Lowest impact way of finding duplicate nodes

I am trying to return details for a small number of duplicated nodes of a single type in a dataset of 10,000,000+ nodes.

The node has 3 labels, which can be used to cut the dataset to about 25% of its overall size. It has 2 properties, one of which should be unique but is duplicated among the problem nodes.

The below Cypher will return the details needed. (The other two unused labels can be added in another WHERE clause to reduce the dataset to about 2.5m nodes.)

MATCH (m:Aaa)
WITH m.aaa as aaa, collect(m) as nodes
WHERE size(nodes) > 1
RETURN [n in nodes | n.aaa] as aaas, size(nodes)

The explain plan shows a NodeByLabelScan over the full dataset, which seems too expensive to run against data of this size.

Currently there is no index on aaa, but we could add one if it would help.
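
For reference, I believe adding one would be a single statement (assuming Neo4j 4.x+ index syntax; the index name is just a placeholder):

CREATE INDEX aaa_index IF NOT EXISTS FOR (n:Aaa) ON (n.aaa)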

Is there a better way to find this data or reduce the cost of the query please?

An index might actually help here.

First, if you know the problem node has multiple labels, use those labels in your query; that filtering should be fairly efficient, and it happens before you do your collect().

Next, the extraction in your RETURN isn't needed: you already have the aaa value from when you did the collect().

Maybe try this instead (I'm substituting Bbb and Ccc for the two additional labels on the node):

MATCH (m:Aaa:Bbb:Ccc)
WITH m.aaa as aaa, count(m) as nodeCount
WHERE nodeCount > 1
RETURN aaa, nodeCount

At this point an index ought to be helpful, as you then know exactly which aaa values to use to look up the problematic nodes.
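
As a rough sketch of that follow-up lookup (the $duplicateAaas parameter name is just a placeholder for the duplicate values returned by the query above):

// Assumes an index on :Aaa(aaa) has now been created,
// so the planner can seek by aaa instead of scanning a label.
MATCH (m:Aaa:Bbb:Ccc)
WHERE m.aaa IN $duplicateAaas
RETURN m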
