How to use the Cosine similarity algorithm optimally, and why does it sometimes return empty results?

So I'm trying to calculate the cosine distance across some embeddings. It seems to run fine for smaller subsets (like if I limit it to, say, 1,000-5,000 nodes), but if I try to run it with, say, 20,000-30,000 nodes it doesn't give a result. If I then delete the edges and try to run it again, there is still no output. If I restart the database, it works fine again for smaller sets.
I run it on a machine with 10 cores and 64 GB of RAM, and it doesn't seem to be running out of RAM at least.

Maybe I misunderstood it, but I thought writeBatchSize was supposed to make it process the writes in chunks? Or is it a better strategy to use the stream variant if I have to process large amounts of data? (A rough sketch of how I think the stream version would look is included after the query below.)
One example of a query is below:

MATCH (c:Entries)
WITH {item: id(c), weights: c.embedding} AS data
WITH collect(data) AS data
CALL algo.similarity.cosine(data, {write:true, writeBatchSize:1000})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
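
For comparison, this is roughly how I understand the stream variant would look (untested at this scale; the topK and similarityCutoff values are just placeholders I picked):

MATCH (c:Entries)
WITH {item: id(c), weights: c.embedding} AS data
WITH collect(data) AS data
// stream returns the pairs instead of writing relationships; topK and similarityCutoff limit the output
CALL algo.similarity.cosine.stream(data, {topK: 10, similarityCutoff: 0.1})
YIELD item1, item2, similarity
RETURN item1, item2, similarity
ORDER BY similarity DESC
LIMIT 100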

Hi Bjoern! What version of Neo4j are you running?

Does the algorithm finish running, but not return any results, or does it just hang?

If you can turn on your debug logs and share what shows up there while you run the algo, that would make it much easier to troubleshoot :slight_smile:

Hi Alicia, thank you for getting back to me.
I'm running version 3.5.7, Enterprise Edition.
Do you mean the debug.log file? I don't see anything in it related to this query when I run it.
My output with large datasets is below; however, if I restart and run it on a subset, it gives me results.

nodes    p50   p75   p90   p99   p999  p100
106737   0.0   0.0   0.0   0.0   0.0   0.0

Hi Bjoern,
If there is no cutoff value, the comparison becomes a Cartesian product, and as I recall the algo does not do any work in that case because there is simply too much work to do. In your case, since you have around 100,000 nodes, the number of relationships that would need to be created is about 10^10. That's way too many relationships. Can you please try setting a somewhat higher similarityCutoff value, say 0.1, and run the algo again to see if it works?
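
For example, something like this (the same query with a cutoff added; 0.1 is just a starting point you can tune):

MATCH (c:Entries)
WITH {item: id(c), weights: c.embedding} AS data
WITH collect(data) AS data
// similarityCutoff drops pairs below the threshold before any relationships get written
CALL algo.similarity.cosine(data, {write: true, writeBatchSize: 1000, similarityCutoff: 0.1})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95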

Yeah, I know it's kind of big. I ended up exporting the embeddings and calculating it with Python (using Numba it didn't take too long).
I want to use it for some predictions and would like to see the behavior across all similarities, so I can see what true/false positives/negatives I should expect; that's why I would like to calculate it overall.

So generally, if there is no cutoff, it doesn't output anything?

I think the problem is not with calculating the similarities. When you want to create a relationship between every pair of nodes, no matter how similar they are, it can take too much memory and storage. In this case it would create about 10 billion relationships, which by itself requires >300 GB of storage. That's why the similarity cutoff is important when you want to write data back to the graph: it limits how many relationships are created.
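
As a rough sanity check before writing anything back, you can estimate the worst case (a minimal sketch; the 34 bytes below is roughly the size of one relationship record in the standard store, before any properties):

MATCH (c:Entries)
WITH count(c) AS n
// worst case: every node compared with and linked to every other node
RETURN n, n * n AS worstCasePairs, n * n * 34 / 1024.0 / 1024 / 1024 AS approxRelationshipStoreGB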

@anthapu is correct - writing back 10B relationships will take a long time, but you're also comparing every node against every other node, so there is a tremendous number of comparisons to sort through (and it's unlikely to finish in a reasonable time frame).

@bjoernoesth You'll probably want to update to the most recent version of the library (3.5.14.0) to get some of our new algos and optimized implementations. I'd recommend trying out approximate nearest neighbors instead - this will limit the number of comparisons made and speed up the calculation time. Of course, you'll still want to set the cutoff to something -- 0.1 should give you a decent baseline (assume that if there's no relationship, the similarity is below that threshold).
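
After upgrading, it's also worth listing the installed procedures, since the exact name of the approximate nearest neighbors procedure can differ between releases (a quick check, not specific to any one version):

CALL dbms.procedures()
YIELD name, signature
WITH name, signature
WHERE name STARTS WITH 'algo.'
RETURN name, signature
ORDER BY name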