I am trying to use the KNN algorithm to measure similarity between properties of nodes, but the Euclidean similarity I calculate manually doesn't match what Neo4j calculates.
I have precomputed scaledProperties as an array of values. When I manually calculate the Euclidean similarity I get 0.976123. However, Neo4j, using KNN, gives me 0.9994.
After some digging, I think what Neo4j is doing is taking the mean of all the similarity calculations for each property, which is not the same as the Euclidean formula (across all properties). I have found the similarity scores to be off by as much as 10% with this approach.
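To make the suspected difference concrete, here is a minimal Python sketch. It assumes similarity is derived from distance as 1 / (1 + distance); the per-property averaging is only my guess at what GDS does, and the two vectors are made-up values, not my real scaledProperties.

```python
import math

def euclidean_similarity(a, b):
    """1 / (1 + Euclidean distance), computed across all properties at once."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)

def mean_per_property_similarity(a, b):
    """Hypothesised behaviour: average of 1 / (1 + |x - y|), one property at a time."""
    sims = [1.0 / (1.0 + abs(x - y)) for x, y in zip(a, b)]
    return sum(sims) / len(sims)

# Hypothetical scaled property vectors for two nodes.
a = [0.10, 0.20, 0.30]
b = [0.12, 0.21, 0.33]

whole = euclidean_similarity(a, b)
per_prop = mean_per_property_similarity(a, b)
print(whole, per_prop)
```

Each per-property difference is at most the full Euclidean distance, so under these assumptions the averaged score can never be lower than the whole-vector score.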
Is there a way to force GDS to use the true Euclidean distance formula for nodeProperties?
After creating similarity scores manually and defining the relationships, I have found the clustering to be nearly the same, albeit at different similarity thresholds. For example, 0.85 (manual) and 0.98 (Neo4j) produce similar graph diagrams. Essentially, the Neo4j scores are skewed higher but retain the overall shape of the data.
@wu.phil we have reproduced the issue from your example, and the scores are indeed slightly off. If you look at the code for Euclidean that @florentin_dorre shared, you can see that the similarity scores are normalised, but we had forgotten the square root. This has been fixed and will be included in the upcoming GDS release.
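As a rough check using the two figures from the question, the sketch below assumes the normalisation is 1 / (1 + distance) and that the manual value 0.976123 was computed that way; neither is quoted from the GDS source. Dropping the square root means the squared distance is fed into the same formula.

```python
manual_similarity = 0.976123  # manual score reported in the question

# Recover the Euclidean distance, assuming similarity = 1 / (1 + distance).
distance = 1.0 / manual_similarity - 1.0

# Same normalisation, but with the squared distance (square root omitted).
without_sqrt = 1.0 / (1.0 + distance ** 2)

print(round(without_sqrt, 4))  # 0.9994
```

Under those assumptions the result lines up with the 0.9994 that KNN reported.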
Sorry for any inconvenience this may have caused.
Regards,
Ves
Thanks, looking forward to the next release where this is fixed. At the moment I'm having to write a Python script external to GDS to get accurate scores, which is not ideal!
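For anyone needing the same workaround, this is a minimal sketch of that kind of script, not my exact code. It does a brute-force top-k search, assumes similarity = 1 / (1 + Euclidean distance), and the node ids and vectors are placeholders. `math.dist` needs Python 3.8 or later.

```python
import heapq
import math

def euclidean_similarity(a, b):
    """1 / (1 + Euclidean distance) across the whole property vector."""
    return 1.0 / (1.0 + math.dist(a, b))

def top_k_similar(vectors, k):
    """Map each node id to its k most similar nodes as (similarity, other_id) pairs."""
    result = {}
    for node_id, vec in vectors.items():
        scored = (
            (euclidean_similarity(vec, other_vec), other_id)
            for other_id, other_vec in vectors.items()
            if other_id != node_id
        )
        result[node_id] = heapq.nlargest(k, scored)
    return result

# Placeholder data: node id -> scaled property vector.
scaled = {"a": [0.0, 0.0], "b": [0.3, 0.4], "c": [3.0, 4.0]}
neighbours = top_k_similar(scaled, k=1)
print(neighbours)
```

This compares every pair of nodes, so it is only practical for small graphs; the resulting pairs can then be written back as relationships.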