Hi, I'm using Neo4j 5.10 with the latest GDS. I have lots of nodes of the same type that have a property "genres", which is a list of strings. I want to select one of these nodes and compare it to all other nodes in the database according to this one property only. Nodes with a large intersection between the lists should come out on top. I did it in plain Cypher (ranked the nodes by Jaccard score, and also transformed the arrays into numerical vectors to do cosine similarity). Both methods work, but the performance is not good. I thought that a GDS projection, where I can isolate the nodes of a specific type, combined with the built-in GDS similarity algorithms might perform better.
However, I'm already struggling with the projection. It does not allow me to declare a node property that is a list of strings or a single string. I read that projecting node properties with a native projection requires numerical properties. Is there a trick to compare nodes in GDS just by their properties (I don't need any relationships here)? If yes, is it even possible to use strings?
Hello @fabian.z ,
If I got your schema correctly, you have nodes with genres.
I don't understand how you end up with relationship properties for GDS, though.
I would suggest transforming the genres into numerical vectors here as well.
For loading just nodes, I would suggest using a Cypher projection (assuming you use GDS 2.4):
MATCH (n)
WITH gds.graph.project(
  'graphWithProperties',
  n,
  null, // no target node: project nodes only, without relationships
  {
    sourceNodeProperties: {genres: n.numerical_genres}
  }
) AS g
RETURN g.graphName AS graph, g.nodeCount AS nodes, g.relationshipCount AS rels
Given your speed constraint, you might be interested in an approximate comparison rather than an exact one.
For this you could use KNN (see K-Nearest Neighbors - Neo4j Graph Data Science).
Thanks for the quick answer. Of course you are right, I mixed it up: it's not about relationship properties but about node properties. I corrected it in the text above. How do I convert a list-of-strings property to a numerical property in the projection? I have only worked with native projections so far.
My attempt would be to use a list comprehension such as WITH {a: 1, b: 2} AS mapping, ["a", "b"] AS genres RETURN [g IN genres | mapping[g]] AS genres_numeric.
Now the remaining part is how you get the mapping. I would guess it's a fixed set, so you could pass it in from the outside.
Otherwise you could try to build it by matching over all nodes, unwinding the genres and finding the distinct values. How to get from the distinct values to a map is something I would have to find out as well.
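One possible way to build such a mapping in plain Cypher is sketched below. It sidesteps the map by using each genre's position in the sorted list of distinct genres as its integer id. The label Movie is an assumption (the thread does not name the node type), and numerical_genres is simply the property name used in the projection above.

```cypher
// Collect all distinct genres once, sorted so that the ids are
// deterministic for a given set of genres.
MATCH (m:Movie)                 // `Movie` is an assumed label
UNWIND m.genres AS genre
WITH DISTINCT genre
ORDER BY genre
WITH collect(genre) AS allGenres
// Replace every genre string by its index in allGenres.
MATCH (m:Movie)
SET m.numerical_genres =
  [g IN m.genres |
    [i IN range(0, size(allGenres) - 1) WHERE allGenres[i] = g][0]]
```

The inner lookup is linear in the number of distinct genres, which should be acceptable for a small, fixed genre set. If genres are added later, the ids shift and the property has to be recomputed.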
Hi Florentin. With some helper nodes, I managed to map every distinct genre to a unique float number, and all nodes that I want to compare now have an additional property (a list of floats) that I can project. However, when I run KNN, it's not working correctly. It returns nodes whose float arrays are completely different or have only a very small intersection, even for nodes where I know that other nodes with a much larger intersection are definitely in the graph.
Hey Fabian,
Great to hear you could get further!
As you mention arrays of floats, I have an idea what could be wrong.
Essentially, it's the similarity metric used: float arrays are compared by cosine similarity by default.
However, as you have categorical data, you want to use Jaccard or Overlap similarity instead.
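To see why cosine misleads on arrays of category ids, here is a small sketch using the GDS similarity functions on two made-up id lists. The lists are purely illustrative. Working from the definitions by hand, the two lists share only the id 2, so Jaccard is 1/5 = 0.2, while cosine is 1.0 because the second list is a multiple of the first.

```cypher
// a and b share only the id 2, but b = 2 * a when read as vectors.
WITH [1, 2, 3] AS a, [2, 4, 6] AS b
RETURN gds.similarity.jaccard(a, b) AS jaccard,
       gds.similarity.cosine(a, b)  AS cosine
```

Cosine treats the ids as coordinates, so the numeric values and their order matter. Jaccard only looks at which ids occur in both lists, which is what a genre comparison needs.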
So you have two options:
(a) Transform your arrays to integers beforehand, for example with the toInteger(x) function in Cypher
(b) Explicitly specify the similarity metric you want to use, such as gds.knn.stream('myGraph', {nodeProperties: {numeric_genres: 'JACCARD'}}) (see K-Nearest Neighbors - Neo4j Graph Data Science for more details)
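Putting the two options together could look roughly like the sketch below. It projects the ids as integers and then streams KNN with Jaccard. The label Movie, the graph name genresAsInts, the title property and the $title parameter are all assumptions made for illustration.

```cypher
// Project only nodes that have the property, converting ids to integers.
MATCH (n:Movie)                          // `Movie` is an assumed label
WHERE n.numerical_genres IS NOT NULL
WITH gds.graph.project(
  'genresAsInts',                        // hypothetical graph name
  n,
  null,
  {sourceNodeProperties: {genres: [x IN n.numerical_genres | toInteger(x)]}}
) AS g
RETURN g.graphName AS graph, g.nodeCount AS nodes;

// Stream the neighbours and keep only the rows for one chosen node.
CALL gds.knn.stream('genresAsInts', {
  nodeProperties: {genres: 'JACCARD'},
  topK: 10
})
YIELD node1, node2, similarity
WITH gds.util.asNode(node1) AS source,
     gds.util.asNode(node2) AS candidate,
     similarity
WHERE source.title = $title              // `title` and $title are assumptions
RETURN candidate.title AS candidate, similarity
ORDER BY similarity DESC
```

Note that the WHERE clause only trims the output. KNN still computes neighbours for every node in the projected graph.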
Hi Florentin,
It says Jaccard and Overlap are not possible with floats, so I transformed all floats to ints. With that they both work, but the performance is still not that good. With Jaccard the results are quite good, but it's not that fast. Do you have any "secrets" for making it faster?
Hey @fabian.z ,
There are several things you could try.
First, I would check how it currently converges. If you run the stats, write or mutate mode, you can inspect ranIterations, nodePairsConsidered and similarityDistribution.
topK: how many suggestions per node do you really need?
initialSampler: "randomWalk" could lead to earlier convergence
randomJoins defaults to 10, which you could try to lower
sampleRate defaults to 0.5, which you could lower as well
maxIterations defaults to 100, which you could also try to lower
Most impact: raise the concurrency parameter (default is 4). However, a higher value needs a GDS license
In general you can go very fast, but the quality of the result will suffer at some point. The similarityDistribution is a good measure for comparing the quality changes.
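As a starting point for such experiments, a stats run with some of the parameters above lowered might look like this sketch. The graph name myGraph and the property numeric_genres are taken from the earlier KNN call, and the concrete values are arbitrary examples rather than recommendations. initialSampler is left at its default here because a nodes-only projection has no relationships for a random walk to follow.

```cypher
CALL gds.knn.stats('myGraph', {
  nodeProperties: {numeric_genres: 'JACCARD'},
  topK: 5,            // fewer neighbours per node
  sampleRate: 0.3,    // default 0.5
  randomJoins: 5,     // default 10
  maxIterations: 20   // default 100
})
YIELD ranIterations, didConverge, nodePairsConsidered,
      computeMillis, similarityDistribution
RETURN ranIterations, didConverge, nodePairsConsidered, computeMillis,
       similarityDistribution.mean AS meanSimilarity,
       similarityDistribution.p50  AS medianSimilarity
```

Running this once with the defaults and once with each changed parameter makes it possible to compare computeMillis against the shift in similarityDistribution.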