Pearson Similarity vs. movie recommendations

I followed the Pearson Similarity docs
and found two major characteristics of the described algorithm that make it NOT a good choice for the suggested use case.

  1. Zero-variance vectors always yield 0:
RETURN algo.similarity.pearson([1,1,1,2], [1,1,1,1]) --> 0.0
  2. Detecting linear relationships is the main goal of this algorithm:
RETURN algo.similarity.pearson([1, 2], [2, 1]) --> -1.0
RETURN algo.similarity.pearson([1, 2], [9, 10]) --> 1.0
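Both behaviours are easy to reproduce outside Neo4j. Here is a toy Python sketch of the textbook Pearson formula (returning 0.0 for a zero-variance vector, as the procedure above apparently does — that fallback is an assumption based on the observed output, not something I checked in the library source):

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length vectors.
    Returns 0.0 when either vector has zero variance, mirroring
    the behaviour observed from algo.similarity.pearson."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in dev_a)) * math.sqrt(sum(y * y for y in dev_b))
    if denom == 0:
        return 0.0  # zero-variance vector: correlation is undefined
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denom

print(pearson([1, 1, 1, 2], [1, 1, 1, 1]))  # 0.0 (zero-variance second vector)
print(pearson([1, 2], [2, 1]))              # -1.0 (perfect negative linear fit)
print(pearson([1, 2], [9, 10]))             # 1.0 (perfect positive linear fit)
```

The third call is the counter-intuitive one for recommendations: ratings [1, 2] and [9, 10] are far apart in absolute terms, yet Pearson calls them perfectly similar because they move in the same direction.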

Both may be in line with the definition, but IMO they run counter to 'intuitive' similarity and make me wonder how one can use Pearson for this kind of recommendation.

The point of this post is to ask about your experiences with similarity algorithms for recommendations (any suggestions?) and to warn you against relying on Pearson based solely on the docs, without deeper consideration of how it works.

Note: I use algo.similarity here (deprecated, I know), but the same applies to gds.alpha.similarity.

Hi,

I am also experimenting with the similarity algorithms. In this quest I've read a number of papers and blog posts. Here is what I understood:

There are two types of algorithms:
Set-based comparison - Jaccard and Overlap
Vector-based comparison - Cosine, Euclidean, and Pearson
(I haven't explored any others apart from these.)

What I found useful after trying all of them:

With the vector-based algorithms we can use a weight factor for comparison, whereas this is not possible with the set-based types.

Set-based - simple to understand, and the results are easy to interpret (read: calculate), but as they do not factor in any weights, I miss some crucial details of the node.

Vector-based -
Cosine similarity measures the angle between vectors and hence can represent the content of the nodes, but it does not factor in the magnitude of the vector.

Euclidean - is based on the distance between vectors, which reflects both angle (content) and magnitude (length).

Pearson - this is essentially cosine similarity computed on mean-centered vectors; I am unable to understand how this normalization impacts a real-world scenario.
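That relationship can be checked numerically. A toy Python snippet (not Neo4j code) showing that Pearson correlation is just cosine similarity applied to mean-centered vectors:

```python
import math

def cosine(a, b):
    """Plain cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def center(v):
    # Subtract the mean: this is the "normalization" Pearson adds on top of cosine.
    m = sum(v) / len(v)
    return [x - m for x in v]

a, b = [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]
print(cosine(a, b))                   # plain cosine similarity
print(cosine(center(a), center(b)))  # equals the Pearson correlation of a and b
```

Practically, the mean-centering removes each user's rating offset (a harsh rater and a generous rater with the same taste become identical), which is why Pearson ignores absolute rating levels entirely.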

A major problem with the vector-based algorithms is that you need a dense vector for them to give good results. In my case I had a total of 2500 categories, but any particular node had only 100-200 applicable categories. This creates a sparse vector with null or "nan" for more than 90% of the vector dimensions.
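To illustrate why those "nan" entries are a problem: in ordinary floating-point arithmetic, a single NaN dimension poisons the whole dot product, so if you roll your own vector similarity you have to filter or replace those entries first (I'm not sure how the GDS procedures handle them internally). A toy Python sketch:

```python
import math

nan = float("nan")
# A node with only a few applicable categories out of many:
a = [4.0, nan, nan, 5.0]
b = [4.0, 3.0, nan, 5.0]

dot = sum(x * y for x, y in zip(a, b))
print(math.isnan(dot))  # True: one NaN dimension makes the whole result NaN

# Restricting to dimensions where both values are present avoids this:
pairs = [(x, y) for x, y in zip(a, b) if not (math.isnan(x) or math.isnan(y))]
dot_clean = sum(x * y for x, y in pairs)
print(dot_clean)  # 41.0 (4*4 + 5*5)
```

Note that dropping the missing dimensions effectively computes similarity only over the overlap, which silently favours pairs with tiny overlaps — the trade-off discussed in the replies below.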

Thanks @mangesh.karangutkar! Nice.
For the problem you described at the end, I successfully applied a strategy of expanding all the vectors to a common size by filling them with a constant value. This value represents 'uncertainty' and is close to the average of the possible range. This way, longer similar vectors get promoted over shorter ones (even when the shorter ones are more similar). This strategy may not apply to all cases, though.
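A minimal sketch of that padding strategy (hypothetical helper, assuming ratings on a 1-5 scale, so 3.0 approximates the mid-range 'uncertainty' value):

```python
def pad_to_common_size(vectors, fill):
    """Expand every vector to the longest length, filling the missing
    dimensions with a constant 'uncertainty' value."""
    size = max(len(v) for v in vectors)
    return [v + [fill] * (size - len(v)) for v in vectors]

short_vec = [5.0, 4.0]
long_vec = [5.0, 4.0, 5.0, 5.0]
padded = pad_to_common_size([short_vec, long_vec], fill=3.0)
print(padded)  # [[5.0, 4.0, 3.0, 3.0], [5.0, 4.0, 5.0, 5.0]]
```

After padding, all pairs are compared over the same dimensionality, and the neutral fill drags the short vector's similarity down relative to a long vector with real matching values.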

Hi @m-zielinski, thanks for that suggestion.

In my use case, the "nan"s or blanks should stay that way (they aren't unknowns). As you rightly pointed out, this strategy may not apply. But I wonder whether a 0 value instead of nan would have a different effect, or maybe a very small value like 0.1, 0.01, or 1. Have you tried this?
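One cheap way to answer this empirically: compare the scores under each candidate fill value on a small toy example (plain Python cosine, assumed 1-5 rating scale; the exact numbers are illustrative only):

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

full = [5.0, 4.0, 5.0]   # all three dimensions rated
partial = [5.0, 4.0]     # one dimension missing
for fill in (0.0, 0.01, 0.1, 1.0, 3.0):
    print(fill, cosine(full, partial + [fill]))
```

On this toy data, a fill of 0 penalizes the missing dimension hardest, and the score rises as the fill approaches the actual value; so the choice of fill constant directly tunes how much a missing dimension counts against a pair.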

Also, since you used a value close to the average of the range and then applied Pearson similarity (which also does some normalization of the vector), what good or bad effect did that have? Did you observe anything?

I didn't mention that I abandoned Pearson due to the points I made above and switched to a custom distance algorithm, so I can't really answer your question :confused:
It looks like it all depends on your case and the goal you want to achieve with the distance metric...