Compare nodes from one group without cartesian product

Hi guys! I'm new to Neo4j and I need some help with a task.

I have a large number of nodes (20,000 or more) with one label, for example People. Each node has a text property called name, so each node can be described by a JSON document like this:
{ "name": "Andrew smth else" }

The task is to remove all nodes with a similar name value; for the comparison I have to use a function from the APOC library (for example apoc.text.levenshteinSimilarity).
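For context, apoc.text.levenshteinSimilarity(s1, s2) returns a score between 0.0 and 1.0, where 1.0 means the strings are identical, so "similar" here means a score above some threshold; for example:

RETURN apoc.text.levenshteinSimilarity("Andrew smth else", "Andrew smth more") AS score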

The problem is that my query contains a cartesian product, and its execution time for 20,000 nodes is extremely long (that's 20,000 × 20,000 = 400 million comparisons).

My current Cypher query:

match (p1:People)
with p1
match (p2:People)
with p1, p2, apoc.text.levenshteinSimilarity(p1.name, p2.name) as lds
where p1 <> p2 and lds > 0.8
delete p2

EXPLAIN plan: (screenshot omitted)

Is there another way to do this kind of comparison? Or a way to speed up this query?

I will be glad for any help! Thanks and have a good day!

You'll end up with a cartesian product either way (that's the only real way to compare every node with every other node in the set), but there is a more efficient way to produce that result set than doing a full People match per person. If you collect the People nodes once and UNWIND the collection twice, you'll end up with the same cartesian product in what should be a more efficient way:

match (p:People)
with collect(p) as people
unwind people as p1
unwind people as p2
with p1, p2, apoc.text.levenshteinSimilarity(p1.name, p2.name) as lds
where p1 <> p2 and lds > 0.8
delete p2

Hi Andrew!
Sorry to hear that, but thanks anyway for your reply and the sample query!

One other note: to avoid mirrored results, where the same two nodes appear with the variables switched (which would cause both nodes of every matching pair to be deleted), you should use WHERE id(p1) < id(p2).
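For example, a minimal sketch of the UNWIND version with that filter applied (it also checks the ids before computing the similarity, so each pair is only scored once):

MATCH (p:People)
WITH collect(p) AS people
UNWIND people AS p1
UNWIND people AS p2
WITH p1, p2
WHERE id(p1) < id(p2)
  AND apoc.text.levenshteinSimilarity(p1.name, p2.name) > 0.8
// use DETACH DELETE instead if the People nodes have relationships
DELETE p2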

Yes, I saw this expression in some posts and I use it too, thanks.
But I have another question: what if there are a lot more nodes in my database, for example 8M? collect(p) killed my machine =(
Is there a way around this? I have 16 GB of RAM and the heap size set to 10 GB.
Or do I need to get a server with more computational capacity?

Ah, in this case you'll either have to increase your RAM and heap size, or revert to your previous version of the query with the two MATCHes. There's no way around that if you have more People nodes than your heap can hold at once.
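For reference, the two-MATCH fallback with the same id filter would look roughly like this; it's still quadratic in comparisons, but it avoids collecting all nodes into one huge list in the heap:

MATCH (p1:People)
MATCH (p2:People)
WHERE id(p1) < id(p2)
  AND apoc.text.levenshteinSimilarity(p1.name, p2.name) > 0.8
DELETE p2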

That is, the two MATCH statements over People can help with memory, but with a dataset this size it will still take extremely long, am I right? Plus increasing RAM and heap size.