Similarity Algorithms: Setting Parameter Values and Interpreting the Results

Hi Everyone:

I have been using the graph algorithms for the last two months, and I have spent a lot of time reading, watching videos, asking on Slack, and working through virtually every online resource I can find on how to use them. The main obstacle I am struggling to overcome is knowing how to set the config parameters for the algorithms correctly, and the area I am struggling with most is the similarity algorithms. Many of the online resources suggest that running a similarity algorithm is an important prerequisite to running community detection algorithms, and I am fairly sure that once I understand how to analyse the results of the similarity algorithms, the community detection and centrality algorithms will fall into place.

In a nutshell, I am trying to work out how to analyse the results yielded by the similarity algorithms so that I can decide what values to set for the parameters “topK”, “similarityCutoff”, and “degreeCutoff”. I understand exactly what these parameters mean, but what I don’t know, once the results are yielded back, is:

  1. What result should I be looking for to indicate that I have used the correct combination of values for “topK”, “similarityCutoff”, and “degreeCutoff”?

  2. If the results yielded back are incorrect, how do I know which parameters to tweak to get closer to the desired results? I guess tweaking the parameters would be an iterative process.

I have put some of my code below, along with the results I got back. The node and relationship counts are as follows:

Customer Nodes: 35,724
Moment Nodes: 18,863
PERFORMS Relationships: 357,503

For the first iteration I have set degreeCutoff: 1 so that I exclude dissimilar Customer nodes, and topK: 10 because this was advice I was given (but I would like to know why, and how I should tweak it based on the yielded results).

MATCH (c:Customer)-[:PERFORMS]->(m:Moment)
WITH c, collect(id(m)) AS colM
WITH {item:id(c), categories: colM} as customerData
WITH collect(customerData) as data
CALL algo.similarity.jaccard(data, {degreeCutoff: 1, write:false, writeRelationshipType:'JACCARD_SIMILARITY', topK: 10})
YIELD nodes, similarityPairs, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100


So far, everything I have seen online only shows off the algorithms and their uses; there is very little on how to set the parameters and how to interpret the yielded results. I was hoping that by reaching out you might be able to help me find answers to my two questions above.

Thanks,
Johnny

Hi Johnny! Welcome to the Neo4j community - I'm glad you're experimenting with graph algorithms, and I totally understand your confusion with the configuration parameters.

topK and similarityCutoff determine which values are returned to you (either streamed as output or written to the graph): topK is how many similar results you want returned for each node (out of everything computed), and similarityCutoff is a way of saying "don't return anything below this threshold". So in the example in the documentation (under 7.6.3):

from      to          similarity
"Karin"   "Arya"      0.66
"Karin"   "Michael"   0.25
"Karin"   "Praveena"  0.0
"Karin"   "Zhen"      0.0

If you set topK to 2, you would only get Arya and Michael back; if you set similarityCutoff to 0.5, you would only get Arya back. You could combine the two parameters to specify things like "the top three when greater than 0.5".
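As a rough sketch of combining the two, assuming the same Customer/Moment data as in your query: the streaming variant of the procedure (algo.similarity.jaccard.stream) returns one row per pair rather than summary statistics, which makes the effect of each parameter easier to see. The values 3 and 0.5 and the LIMIT are purely illustrative, not recommendations.

```cypher
// Build one {item, categories} map per customer, as in the original query.
MATCH (c:Customer)-[:PERFORMS]->(m:Moment)
WITH c, collect(id(m)) AS colM
WITH {item: id(c), categories: colM} AS customerData
WITH collect(customerData) AS data
// Illustrative values only: keep at most 3 similar customers per customer,
// and drop any pair whose Jaccard similarity is below 0.5.
CALL algo.similarity.jaccard.stream(data, {topK: 3, similarityCutoff: 0.5})
YIELD item1, item2, count1, count2, intersection, similarity
// item1 and item2 are the node ids that were passed in as `item`.
RETURN item1, item2, count1, count2, intersection, similarity
ORDER BY similarity DESC
LIMIT 25
```

Changing one parameter at a time and rerunning shows which pairs appear or disappear, which is usually easier to reason about than the aggregate percentiles alone.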

The degreeCutoff parameter determines whether similarity is calculated for a node in the first place, based on how many items are in its comparison vector. This is useful if you want each node to have a minimum number of items before it is compared with anything else. For example, you could calculate the similarity between two nodes with only a single item in each target set and get a similarity of one, but that may be less informative than calculating the similarity between two nodes with 25 items in their target sets where 20 overlap. In more concrete terms, it's the consideration of whether you want to treat algo.similarity.jaccard([1],[1]) in the same way as you might treat algo.similarity.jaccard([1,2,3,4,5,6,7],[1,2,3,4,8,9,11]). With a degreeCutoff of 5, you would never calculate similarity for the first pair.
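The two cases above can be compared directly with the function form of the algorithm, which takes two lists and returns their Jaccard similarity. This is only a small sketch: the column aliases are made up for illustration, and the function form has no degreeCutoff of its own, since the cutoff is applied by the procedure before any pairs are compared.

```cypher
RETURN
  // One item in each list, and it is shared: intersection 1 / union 1.
  algo.similarity.jaccard([1], [1]) AS singleItemSimilarity,
  // Four shared items (1, 2, 3, 4) out of ten distinct items: intersection 4 / union 10.
  algo.similarity.jaccard([1,2,3,4,5,6,7], [1,2,3,4,8,9,11]) AS sevenItemSimilarity
```

The single-item pair scores higher even though it rests on far less evidence, which is the situation degreeCutoff lets you filter out.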

TL;DR - there's no right or wrong answer here or absolute "correct combination." For your use case, you want to figure out the right combination of parameters based on the question you're trying to answer. You may want to configure the values of the parameters based on domain knowledge about the data or the kind of output you want to have for the next step in your pipeline.

I hope that's helpful :)

Hi Alicia,

Thanks very much for taking the time to reply, your help is massively appreciated :)

That all makes sense to me, and that's how I understood the definition of each parameter too, i.e.

  • degreeCutoff: to specify the minimum number of items in the two arrays being compared
  • topK: to reduce the number of similar nodes returned per node, e.g. the top 5 similar nodes per node
  • similarityCutoff: to reduce the number of similarityPairs returned based on how similar one node is to another; if the similarity calculated between two nodes falls below this threshold, they are not treated as similar and no new relationship is created.

Having had a bit more time to think about my domain and my use case, I have set the following:

degreeCutoff is set to 1 so that I exclude dissimilar Customer nodes, i.e. nodes must have at least one item in the target arrays being compared.

topK I have now removed as a parameter, because I do not want to reduce my results to only a top number of results per node. In my case I feel that if any two nodes are similar, they should be connected.

That leaves similarityCutoff, and this is where I am really getting stuck. I understand exactly what this parameter does, but where I am struggling is this:

How do I use the results yielded back to understand what value I should use for similarityCutoff?
Specifically, what are the ideal values, or the ideal proportions between values, that I should be looking for in the yielded results: similarityPairs, mean, stdDev, p25, p50, p99, p100, etc.?

Thanks for your help, I look forward to hearing more,

Many thanks,
Johnny