How to use the Cosine algorithm optimally, and why does it sometimes return empty results?

bjoernoesth
Node Link

So I'm trying to calculate the cosine distance across some embeddings. It seems to run fine for smaller subsets (say, 1,000-5,000 nodes), but if I try to run it with 20,000-30,000 nodes it doesn't give a result. If I then delete the edges and try to run it again, there is still no output. After I restart the database, it works fine again for smaller sets.
I run it on a machine with 10 cores and 64 GB of RAM, and it doesn't seem to be running out of RAM.

Maybe I misunderstood it, but I thought the writeBatchSize setting was supposed to process it in chunks? Or is it a better strategy to use the stream function if I have to process large amounts of data?
One example of a query is below:

MATCH (c:Entries)
WITH {item:id(c), weights: c.embedding} as data
WITH collect(data) as data
CALL algo.similarity.cosine(data, {write:true, writeBatchSize:1000})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95

6 REPLIES

alicia_frame
Neo4j

Hi Bjoern! What version of Neo4j are you running?

Does the algorithm finish running, but not return any results, or does it just hang?

If you can turn on your debug logs and share what shows up there while you run the algo, that would make it much easier to troubleshoot.

bjoernoesth
Node Link

Hi Alicia, thank you for getting back.
I'm running version 3.5.7, Enterprise Edition.
Do you mean the debug.log file? I don't see anything in it related to this query when I run it.
My output with large datasets is below; however, if I restart and run it on a subset, it gives me results.

nodes p50 p75 p90 p99 p999 p100
106737 0.0 0.0 0.0 0.0 0.0 0.0

Hi Bjoern,
If there is no cutoff value, the computation becomes a Cartesian product, and as far as I remember the algo does not do any work in that case because there is simply too much of it. Since you have around 100,000 nodes, the number of relationships that would need to be created is about 10^10. That's far too many. Can you please try setting similarityCutoff a bit higher, say 0.1, and run the algo again to see if it works?
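For a sense of scale, here is a minimal Python sketch of the pair count behind that estimate. The function name `pair_counts` is made up for illustration. Whether the procedure counts both directions and self-pairs, or only distinct pairs, is an assumption either way, so the sketch shows both figures.

```python
def pair_counts(n):
    """Return (n * n ordered pairs incl. self-pairs, n choose 2 distinct pairs)."""
    ordered_pairs = n * n
    distinct_pairs = n * (n - 1) // 2
    return ordered_pairs, distinct_pairs


# Roughly the node count discussed in this thread.
ordered_pairs, distinct_pairs = pair_counts(100_000)
print(f"ordered pairs:  {ordered_pairs:,}")
print(f"distinct pairs: {distinct_pairs:,}")
```

Either way the count grows quadratically, so doubling the number of nodes roughly quadruples the number of comparisons.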

Yeah, I know it's kind of big. I ended up exporting the embeddings and calculating it with Python (using numba, it didn't take too long).
I want to use it for some predictions and would like to see the behavior across all similarities so I can see what true/false positives/negatives I should expect. That's why I would like to calculate it overall.

So, generally, if there is no cutoff it doesn't output anything?

I think the problem is not with calculating the similarities. When you create a relationship between every pair of nodes, no matter how similar they are, it can use too much memory and storage. In this case it would create 10 billion relationships, which by itself requires more than 300 GB of storage. That's why a similarity cutoff is important when you want to write data back to the graph: it limits how many relationships are created.
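A back-of-the-envelope version of that storage figure, as a hedged Python sketch. The 34 bytes per relationship record is an assumption based on the standard store format's documented record size. Property records for the similarity score, indexes, and transaction logs would come on top and are not counted here.

```python
# Assumption: fixed relationship record size (bytes) in the standard store format.
BYTES_PER_RELATIONSHIP = 34


def relationship_store_gb(relationship_count, bytes_per_record=BYTES_PER_RELATIONSHIP):
    """Estimate relationship store size in GB (10^9 bytes), records only."""
    return relationship_count * bytes_per_record / 1e9


# 10 billion relationships, as discussed above.
print(f"{relationship_store_gb(10**10):.0f} GB")
```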

@anthapu is correct: writing back 10B relationships will take a long time. But you're also comparing every node against every other node, so there is a tremendous number of comparisons to sort through (and it is unlikely to finish in a reasonable time frame).

@bjoernoesth You'll probably want to update to the most recent version of the library (3.5.14.0) to get some of our new algos and optimized implementations. I'd recommend trying out approximate nearest neighbors instead; this will limit the number of comparisons made and speed up the calculation. Of course, you'll still want to set the cutoff to something; 0.1 should give you a decent baseline (assume that if there's no relationship, the similarity is below that threshold).
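To illustrate what a cutoff does conceptually, here is a small pure-Python sketch. It is not the library's implementation. The names `cosine_similarity` and `similar_pairs`, the toy embeddings, and the choice of `>=` for the comparison are all assumptions made for the example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(embeddings, cutoff=0.1):
    """Yield (id_a, id_b, score) for each distinct pair scoring at or above cutoff."""
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= cutoff:
            yield id_a, id_b, score


# Toy data (hypothetical): ids 1 and 3 are orthogonal, so that pair is dropped.
embeddings = {1: [1.0, 0.0], 2: [1.0, 1.0], 3: [0.0, 1.0]}
pairs = list(similar_pairs(embeddings, cutoff=0.1))
for id_a, id_b, score in pairs:
    print(id_a, id_b, round(score, 4))
```

Every pair is still compared, so the cutoff only reduces what gets kept or written, not the number of comparisons. That is why an approximate method is suggested for cutting down the comparisons themselves.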