KNN euclidean similarity calculation incorrect?

wu.phil · September 20, 2023, 10:53pm

Hi community

I am trying to use the KNN algorithm to measure similarity between node properties, but the Euclidean similarity I calculate manually doesn't match what Neo4j calculates.

I have precomputed scaledProperties as an array of values. When I manually calculate the Euclidean similarity I get 0.976123. However, Neo4j, using KNN, gives me 0.9994.

The Cypher query I used to run KNN:

CALL gds.knn.mutate('mesoscale-graph-pca', {
  nodeProperties: {scaledProperties: 'EUCLIDEAN'},
  topK: 15,
  sampleRate: 1,
  mutateProperty: 'similarity',
  mutateRelationshipType: 'SIMILAR_MESOSCALE_PCA',
  similarityCutoff: 0.75
})

Have I misunderstood something or done something wrong?

Many thanks
phil

wu.phil · September 27, 2023, 4:49am

After some digging, I think Neo4j is taking the mean of the similarity calculations for each property, which is not the same as applying the Euclidean formula across all properties at once. I have found the similarity scores to be off by as much as 10% with this approach.

Is there a way to force GDS to use the true Euclidean distance formula for nodeProperties?
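
To illustrate the difference being described here, a minimal Python sketch follows. It assumes the similarity is defined as 1 / (1 + distance); the function names and the sample vectors are hypothetical and are not taken from GDS or from the data in this thread.

```python
import math

def whole_vector_similarity(a, b):
    """Similarity from the Euclidean distance over the full vectors.

    Assumption: similarity = 1 / (1 + distance).
    """
    return 1.0 / (1.0 + math.dist(a, b))

def mean_per_component_similarity(a, b):
    """Mean of the similarities computed separately for each component."""
    sims = [1.0 / (1.0 + abs(x - y)) for x, y in zip(a, b)]
    return sum(sims) / len(sims)

# Hypothetical sample vectors, for illustration only.
a = [0.1, 0.4, 0.9]
b = [0.2, 0.1, 0.7]

print("whole vector:      ", whole_vector_similarity(a, b))
print("mean per component:", mean_per_component_similarity(a, b))
```

Because the full Euclidean distance is at least as large as any single component difference, the averaged score can never be lower than the whole-vector score, which would be consistent with scores that look skewed higher.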

wu.phil · September 29, 2023, 4:35am

After creating similarity scores manually and defining the relationships, I have found the clustering to be nearly the same, albeit at different similarity thresholds. For example, 0.85 (manual) vs 0.98 (Neo4j) produce similar graph diagrams. Essentially, the Neo4j scores are skewed higher but retain the overall shape of the data.

florentin_dorre · October 4, 2023, 1:39pm

Hey @wu.phil ,
sorry for not responding earlier.

So if you use the same metric across multiple properties, you don't want to use the mean over them but treat them as one vector?

What confuses me, though, is that in your example you only have a single array property, so the computation should be the same.

For reference, our computation for Euclidean and the combined

veselin.nikolov · October 4, 2023, 2:43pm

@wu.phil we have reproduced the issue from your example, and it is slightly off. If you look at the code for Euclidean that @florentin_dorre shared, you can see that the similarity scores are normalised, but we had forgotten the square root. This has been addressed and will be included in the upcoming GDS release.

Sorry for any inconvenience this may have caused.
Regards,
Ves
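
As a rough check of the missing square root against the numbers in this thread, here is a small Python sketch. It assumes the normalisation is 1 / (1 + distance), which is an assumption on my part rather than something copied from the GDS source; the function names are hypothetical.

```python
import math

def similarity_with_sqrt(a, b):
    """Assumed intended formula: 1 / (1 + Euclidean distance)."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + math.sqrt(squared))

def similarity_without_sqrt(a, b):
    """Same formula with the square root left out (the suspected bug)."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared)

# Back out the distance implied by the manual score reported above,
# then recompute the score as if the square root had been skipped.
manual_score = 0.976123
distance = 1.0 / manual_score - 1.0
score_without_sqrt = 1.0 / (1.0 + distance ** 2)

print(round(distance, 6))
print(round(score_without_sqrt, 4))
```

Under that assumption, the manual score of 0.976123 maps to roughly 0.9994 once the square root is dropped, which lines up with the value KNN returned. For distances below 1 the squared distance is smaller than the distance, so scores come out higher while their ordering is preserved.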

wu.phil · October 6, 2023, 2:30am

Thanks, looking forward to the next release where this is fixed. At the moment I'm having to write a Python script external to GDS to get the accurate scores, which is not ideal!
