KNN euclidean similarity calculation incorrect?

Hi community

I am trying to use the KNN algorithm to measure similarity between node properties, but the Euclidean similarity I calculate manually doesn't match what Neo4j calculates.

I have precomputed scaledProperties as an array of values. When I calculate the Euclidean similarity manually I get 0.976123. However, Neo4j's KNN gives me 0.9994.

The Cypher query I used to run KNN:

CALL gds.knn.mutate('mesoscale-graph-pca', {
nodeProperties:{scaledProperties:'EUCLIDEAN'},
topK:15,
sampleRate:1,
mutateProperty: 'similarity',
mutateRelationshipType: 'SIMILAR_MESOSCALE_PCA',
similarityCutoff: 0.75
})

Have I misunderstood something or done something wrong?

Many thanks
phil

After some digging, I think Neo4j is taking the mean of the similarity calculations for each property, which is not the same as applying the Euclidean formula across all properties. I have found the similarity scores to be off by as much as 10% with this approach.

Is there a way to force GDS to use the true Euclidean distance formula for nodeProperties?

After creating similarity scores manually and defining the relationships, I have found the clustering to be nearly the same, albeit at different similarity thresholds. For example, 0.85 (manual) and 0.98 (Neo4j) produce similar graph diagrams. Essentially, the Neo4j scores are skewed higher but retain the overall shape of the data.
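To make the distinction in this post concrete, here is a minimal Python sketch comparing a whole-vector Euclidean similarity with the mean of per-component similarities. It assumes the similarity is defined as 1 / (1 + distance); the function names and the example vectors are made up for illustration, and whether GDS actually averages per component for a single array property is the poster's hypothesis, not something established here.

```python
import math

def euclidean_similarity(a, b):
    """Similarity 1 / (1 + d), where d is the Euclidean distance over the whole vector."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

def mean_of_per_component_similarities(a, b):
    """Hypothetical alternative: average 1 / (1 + |x - y|) one component at a time."""
    sims = [1.0 / (1.0 + abs(x - y)) for x, y in zip(a, b)]
    return sum(sims) / len(sims)

# Illustrative vectors only, not data from the thread.
a = [0.0, 0.0]
b = [3.0, 4.0]

whole_vector = euclidean_similarity(a, b)                  # distance 5, so 1/6
per_component = mean_of_per_component_similarities(a, b)   # (1/4 + 1/5) / 2

print(whole_vector, per_component)
```

Because the Euclidean distance is never smaller than any single component difference, the per-component mean can only be equal to or higher than the whole-vector score, which would be consistent with scores that are skewed higher while preserving the ordering of most pairs.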

Hey @wu.phil ,
sorry for not responding earlier.

So if you use the same metric across multiple properties, you don't want to take the mean over them but treat them as one vector?

What confuses me, though, is that in your example you only have a single array property, so the computation should be the same.

For reference, our computation for Euclidean and the combined

@wu.phil we have reproduced the issue from your example; the result is slightly off. If you look at the code for Euclidean that @florentin_dorre shared, you can see that the similarity scores are normalised, but we had forgotten the square root. This has been addressed and the fix will be included in the upcoming GDS release.

Sorry for any inconvenience this may have caused.
Regards,
Ves
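As a rough check of the explanation above, the sketch below assumes the normalised similarity is 1 / (1 + d), with d the Euclidean distance, and that "forgetting the square root" means the sum of squared differences was used in place of d. Both are assumptions, since the referenced code is not reproduced in the thread, and the function names are hypothetical. Under those assumptions, the two numbers from the original post line up.

```python
import math

def euclidean_similarity(a, b):
    """Assumed intended form: 1 / (1 + sqrt(sum of squared differences))."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + math.sqrt(squared))

def similarity_without_sqrt(a, b):
    """Assumed buggy form: the square root is skipped."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared)

# Work backwards from the manually computed score quoted in the thread.
manual_score = 0.976123
distance = 1.0 / manual_score - 1.0               # invert s = 1 / (1 + d)
score_without_sqrt = 1.0 / (1.0 + distance ** 2)  # use d^2 where d belongs

print(round(score_without_sqrt, 4))  # 0.9994
```

For distances below 1, squaring makes the value smaller, so the variant without the square root reports a higher similarity, which fits the observation that the GDS scores were skewed upwards.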


Thanks, looking forward to the next release where this is fixed. At the moment I'm having to write a Python script external to GDS to get accurate scores, which is not ideal!
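For anyone needing a similar stopgap, an external script along these lines might look like the sketch below. It is not the poster's script: it is a brute-force, exact comparison of every pair (not the approximate search GDS KNN performs), it assumes the 1 / (1 + distance) similarity, and the node ids, vectors, and function names are invented for illustration. The `k` and `cutoff` values mirror `topK` and `similarityCutoff` from the query earlier in the thread. Reading the vectors from Neo4j and writing the relationships back are left out.

```python
import math

def euclidean_similarity(a, b):
    """Similarity 1 / (1 + d), with d the Euclidean distance (assumed definition)."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

def top_k_similar(vectors, k, cutoff):
    """For each node id, return up to k (other_id, score) pairs with score >= cutoff,
    sorted from most to least similar. Compares all pairs, so it is O(n^2)."""
    result = {}
    for src, vec in vectors.items():
        scored = [
            (dst, euclidean_similarity(vec, other))
            for dst, other in vectors.items()
            if dst != src
        ]
        scored = [pair for pair in scored if pair[1] >= cutoff]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[src] = scored[:k]
    return result

# Made-up stand-ins for the scaledProperties arrays.
vectors = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [3.0, 4.0]}
neighbours = top_k_similar(vectors, k=15, cutoff=0.75)

for node, pairs in neighbours.items():
    print(node, pairs)
```

With these example vectors, "a" and "b" are each other's only neighbour above the cutoff, and "c" has none.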