So I'm trying to calculate the cosine similarity across some embeddings. It runs fine for smaller subsets (if I limit it to, say, 1,000-5,000 nodes), but if I try to run it with, say, 20,000-30,000 nodes it never gives a result. If I then delete the edges and try to run it again, there is still no output. After restarting the database it works fine again for smaller sets.
I run it on a machine with 10 cores and 64 GB of RAM, and it does not seem to be running out of memory.
Maybe I misunderstood it, but I thought writeBatchSize was supposed to make it process the writes in chunks? Or is it a better strategy to use the stream version of the procedure if I have to process large amounts of data?
One example of a query is below
MATCH (c:Entries)
WITH {item: id(c), weights: c.embedding} AS data
WITH collect(data) AS data
CALL algo.similarity.cosine(data, {write: true, writeBatchSize: 1000})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
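For comparison, the stream version I was considering would look roughly like this (just a sketch on my side; as far as I understand, the stream variant yields one row per pair instead of writing relationships, so I could aggregate or limit the output myself):
MATCH (c:Entries)
WITH {item: id(c), weights: c.embedding} AS data
WITH collect(data) AS data
// stream variant: emits pairs instead of writing relationships back to the graph
CALL algo.similarity.cosine.stream(data, {})
YIELD item1, item2, similarity
RETURN item1, item2, similarity
ORDER BY similarity DESC
LIMIT 100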
Hi Alicia, thank you for getting back.
I'm running version 3.5.7, Enterprise Edition.
Do you mean the debug.log file? I don't see anything in it related to this query when I run it.
My output with large datasets is below; however, if I restart and run it on a subset, then it gives me results.
Hi Bjoern,
If there is no cutoff value it becomes a full cartesian product, and as far as I remember the algo does not get anywhere because there is simply too much work to do. In your case, since you have around 100,000 nodes, the number of relationships that would need to be created is on the order of 10^10. That's way too many relationships. Can you please try giving a somewhat higher similarityCutoff value, say 0.1, and run the algo again to see if it works?
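Something like this, which is just your query with the cutoff added (0.1 is only a starting point to see whether it completes):
MATCH (c:Entries)
WITH {item: id(c), weights: c.embedding} AS data
WITH collect(data) AS data
// similarityCutoff drops pairs below 0.1 before anything is written
CALL algo.similarity.cosine(data, {write: true, writeBatchSize: 1000, similarityCutoff: 0.1})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95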
Yeah, I know it's kind of big. I ended up exporting the embeddings and calculating it in Python (using Numba it didn't take too long).
I want to use it for some predictions and would like to see the behavior across all similarities, so I can see what true/false positives/negatives I should expect; that's why I would like to calculate it across the whole set.
So generally, if there is no cutoff, it doesn't output anything?
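What I had in mind was something along these lines: streaming the pairs and bucketing the similarities to get the distribution, instead of writing anything back. This is only a rough sketch (I'm assuming a cutoff of -1.0 keeps all pairs, including negative similarities), and I realise it is still an enormous number of pairs to stream:
MATCH (c:Entries)
WITH {item: id(c), weights: c.embedding} AS data
WITH collect(data) AS data
// assumption: a similarityCutoff of -1.0 keeps every pair
CALL algo.similarity.cosine.stream(data, {similarityCutoff: -1.0})
YIELD item1, item2, similarity
// bucket similarities into steps of 0.1 to get a rough histogram
RETURN round(similarity * 10) / 10 AS bucket, count(*) AS pairs
ORDER BY bucket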
I think the problem is not with calculating the similarities. When you create a relationship between every pair of nodes, no matter how similar they are, the memory and storage usage can become too much. In this case it is going to create 10 billion relationships, which by itself requires >300 GB of storage. That's why the similarity cutoff is important when you want to write data back to the graph: it limits how many relationships are created.
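If you do need to write something back for every node, the similarity procedures also take a topK parameter, which caps how many relationships are written per node rather than relying on a global threshold alone. A rough sketch (topK: 10 is an arbitrary choice here):
MATCH (c:Entries)
WITH {item: id(c), weights: c.embedding} AS data
WITH collect(data) AS data
// topK keeps only each node's 10 most similar neighbours when writing back
CALL algo.similarity.cosine(data, {write: true, writeBatchSize: 1000, topK: 10})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, mean, p95
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, mean, p95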
@anthapu is correct - writing back 10B relationships will take a long time, but you're also comparing every node against every other node, so there is a tremendous number of comparisons to work through (and it's unlikely to finish in a reasonable time frame).
@bjoernoesth You'll probably want to update to the most recent version of the library (3.5.14.0) to get some of our new algos and optimized implementations. I'd recommend trying out approximate nearest neighbors instead - this will limit the number of comparisons made and speed up the calculation time. Of course, you'll still want to set the cutoff to something -- 0.1 should give you a decent baseline (assume that if there's no relationship, then the similarity is below that threshold).
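For the downstream prediction queries, that means treating a missing relationship as "similarity below the cutoff". A minimal sketch of the read side, assuming the default SIMILAR relationship type and score property and using made-up node ids:
// hypothetical node ids, just for illustration
MATCH (a:Entries), (b:Entries)
WHERE id(a) = 123 AND id(b) = 456
OPTIONAL MATCH (a)-[s:SIMILAR]-(b)
// a missing relationship is read as "below the 0.1 threshold"
RETURN coalesce(s.score, 0.0) AS similarity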