Compare nodes from one group without cartesian product

Hi guys! I'm new to Neo4j and I need some help with a task.

I have a large number of nodes (20,000 or more) with one label, for example People. Each node has a text property called name, so each node can be described by a JSON document like this:
{ "name": "Andrew smth else" }

The task is to remove all nodes with a similar name value; for the comparison I have to use a function from the APOC library (for example apoc.text.levenshteinSimilarity).
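For context, apoc.text.levenshteinSimilarity(s1, s2) returns a score between 0.0 and 1.0, where 1.0 means the strings are identical, so "similar" here means a score above some threshold; for example:

RETURN apoc.text.levenshteinSimilarity("Andrew smth else", "Andrew smth more") AS score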

The problem is that my query contains a cartesian product, and its execution time for 20,000 nodes is extremely long (that's 20,000 × 20,000 = 400 million comparisons).

My current Cypher query:

match (p1:People)
with p1
match (p2:People)
with p1, p2, apoc.text.levenshteinSimilarity(p1.name, p2.name) as lds
where p1 <> p2 and lds > 0.8
delete p2

EXPLAIN plan: (screenshot omitted)

Is there another way to do this kind of comparison? Or a way to speed up this query?

I will be glad for any help! Thanks and have a good day!

You'll end up with a cartesian product either way (that's the only real way to compare every node with every other node in the set), but there is a more efficient way to produce that result set than doing a full People match per person. If you collect the People nodes once and UNWIND the collection twice, you'll end up with the same cartesian product in what should be a more efficient way:

match (p:People)
with collect(p) as people
unwind people as p1
unwind people as p2
with p1, p2, apoc.text.levenshteinSimilarity(p1.name, p2.name) as lds
where p1 <> p2 and lds > 0.8
delete p2

Hi Andrew!
Sorry to hear that, but thanks anyway for your reply and the sample query!

One other note: to avoid mirrored results, where the same two nodes appear with the variables switched (which would cause both nodes of every matching pair to be deleted), you should use WHERE id(p1) < id(p2).
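For example, a minimal sketch of the UNWIND version with that filter applied (it also checks the ids before computing the similarity, so each pair is only scored once):

MATCH (p:People)
WITH collect(p) AS people
UNWIND people AS p1
UNWIND people AS p2
WITH p1, p2
WHERE id(p1) < id(p2)
  AND apoc.text.levenshteinSimilarity(p1.name, p2.name) > 0.8
// use DETACH DELETE instead if the People nodes have relationships
DELETE p2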

Yes, I saw this expression in some posts and I use it too, thanks.
But I have another question: what if there are a lot more nodes in my database, for example 8M? collect(p) killed my machine =(
Is there a way around this? I have 16 GB of RAM and the heap size set to 10 GB.
Or do I need to get a server with more computational capacity?

Ah, in this case you'll either have to increase your RAM and heap size, or revert to your previous version of the query with the two MATCHes. There's no way around that if you have more People nodes than your heap can hold at once.
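For reference, the two-MATCH fallback with the same id filter would look roughly like this; it's still quadratic in comparisons, but it avoids collecting all nodes into one huge list in the heap:

MATCH (p1:People)
MATCH (p2:People)
WHERE id(p1) < id(p2)
  AND apoc.text.levenshteinSimilarity(p1.name, p2.name) > 0.8
DELETE p2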

That is, the two MATCH statements over People can help with memory, but with a dataset this size it will still take extremely long, am I right? Plus increasing RAM and heap size.