I am trying to return details of a low number of a single type of duplicated node in a dataset of 10,000,000+ nodes.
The node has 3 labels which can be used to cut the dataset to about 25% of the overall dataset. The node has 2 properties, one of which should be unique but is duplicated among the problem nodes.
The below Cypher will return details the details needed. (The other two unused labels can be inserted into another where clause to reduce the dataset to about 2.5m nodes).
MATCH (m:Aaa)
WITH m.aaa as aaa, collect (m) as nodes
WHERE size (nodes) > 1
RETURN [n in nodes | n.aaa] as aaas, size(nodes)
The explain plan shows a NodeByLabelScan on the full dataset and seems to be too expensive to run on the dataset.
Currently there is no index on aaa so we can apply this.
Is there a better way to find this data or reduce the cost of the query please?