Compare lots of nodes of the same type by a List<String> property with GDS Projection

fabian.z · August 2, 2023, 1:42pm

Hi, I'm using Neo4j 5.10 with the latest GDS. I have lots of nodes of the same type that have a property "genres", which is a list of strings. I want to select one of these nodes and compare it to all other nodes in the database based only on this one property. Nodes with a large intersection between the lists should come out on top. I did it in plain Cypher (ranking the nodes by Jaccard score, and also transforming the arrays into numeric vectors to compute cosine similarity). Both methods work, but the performance is not good. I thought that a GDS projection, where I can isolate the nodes of a specific type, combined with the built-in GDS similarity algorithms might perform better.
However, I'm already struggling with the projection. It does not allow me to declare a node property that is a list of strings or a string. I read that projecting node properties with native projection requires numeric properties. Is there a trick to compare nodes in GDS just by their properties (I don't need any relationships)? If yes, is it even possible to use strings here?

Thanks a lot and kind regards
F

florentin_dorre · August 3, 2023, 8:07am

Hello @fabian.z ,
If I understood your schema correctly, you have nodes with genres.
I don't understand how you end up with relationship properties for GDS, though.

I would suggest transforming the genres into numerical vectors as well.

For loading just nodes, I would suggest using a Cypher projection (assuming you use GDS 2.4):

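// Project all nodes without relationships (the target node is null)
MATCH (n)
WITH gds.graph.project(
  'graphWithProperties',
  n,
  null,
  {
    // numerical_genres must already hold the genres as a list of numbers
    sourceNodeProperties: {genres: n.numerical_genres}
  }
) AS g
RETURN g.graphName AS graph, g.nodeCount AS nodes, g.relationshipCount AS rels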

Given your speed constraint, you might be interested in approximate rather than exact matching.
For this you could use KNN (K-Nearest Neighbors - Neo4j Graph Data Science).

fabian.z · August 3, 2023, 8:26am

Hi Florentin,

thanks for the quick answer. Of course you are right, I mixed it up: it's not about relationship properties, but about node properties. I corrected it in the text above. How do I convert a list-of-strings property to a numerical property in the projection? I have only worked with native projections so far.

florentin_dorre · August 3, 2023, 4:01pm

I am not sure about built-in functionality.

My attempt would be to use a list comprehension such as WITH {a: 1, b: 2} AS mapping, ["a", "b"] AS genres RETURN [g IN genres | mapping[g]] AS genres_numeric.

Now the remaining part is how you get the mapping. I would guess it's a fixed set, so you could pass it in from the outside.
Otherwise you could try to build it by matching over all nodes, unwinding the genres, and collecting the distinct values. How to get from the distinct values to a map is something I would have to find out as well.
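
If the mapping is built outside the database, the idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample nodes and the helper names (build_genre_mapping, encode) are hypothetical, and in practice the genre lists would come from a query over your nodes.

```python
def build_genre_mapping(genre_lists):
    """Assign each distinct genre an integer id (sorted so the ids are reproducible)."""
    distinct = sorted({genre for genres in genre_lists for genre in genres})
    return {genre: idx for idx, genre in enumerate(distinct)}


def encode(genres, mapping):
    """Translate a list of genre strings into the corresponding integer ids."""
    return [mapping[genre] for genre in genres]


# Hypothetical sample data standing in for the "genres" property of three nodes.
nodes = {
    "A": ["rock", "jazz"],
    "B": ["jazz", "pop", "rock"],
    "C": ["classical"],
}

mapping = build_genre_mapping(nodes.values())
encoded = {name: encode(genres, mapping) for name, genres in nodes.items()}

print(mapping)
print(encoded)
```

The resulting map could then be passed to the query as a parameter and applied with the list comprehension shown above.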

fabian.z · August 5, 2023, 1:53pm

Hi Florentin. With some helper nodes, I managed to map every distinct genre to a unique float number, and all nodes that I want to compare now have an additional property (a list of floats) that I can project. However, when I run KNN, it does not work correctly. It returns nodes whose float arrays are completely different or have only a very small intersection, even for nodes where I know that other nodes with a much greater intersection are definitely in the graph.

florentin_dorre · August 6, 2023, 7:33pm

Hey Fabian,
great to hear you could get further!
As you mention arrays of floats, I have an idea what could be wrong.
Essentially it's the similarity metric used: float arrays are compared by cosine similarity by default.
However, as you have categorical data, you want to use Jaccard or Overlap similarity instead.

So you have two options:

(a) Transform your arrays to integers beforehand, for example with the toInteger(x) function in Cypher
(b) Explicitly specify the similarity metric you want to use, such as gds.knn.stream('myGraph', {nodeProperties: {numeric_genres: 'JACCARD'}}) (see K-Nearest Neighbors - Neo4j Graph Data Science for more details)
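
To illustrate why cosine similarity misleads on genre ids, here is a small Python sketch. It is not GDS code: the cosine and jaccard helpers and the sample id arrays are made up for illustration, using the textbook definitions of both metrics.

```python
import math


def cosine(a, b):
    """Cosine similarity: treats the ids as magnitudes, compared position by position."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical genre ids for three nodes.
a = [1, 2, 3]
b = [2, 4, 6]  # shares only genre 2 with a
c = [3, 2, 1]  # same genres as a, stored in a different order

print("a vs b:", cosine(a, b), jaccard(a, b))
print("a vs c:", cosine(a, c), jaccard(a, c))
```

Here cosine rates a and b as a perfect match because b is a multiple of a, although they share only one genre, while a and c, which contain exactly the same genres, score lower. Jaccard ranks the pairs the way you would expect for categorical data.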

fabian.z · August 7, 2023, 6:51pm

Hi Florentin,
it says Jaccard and Overlap are not possible with floats, so I transformed all floats to ints. With that, both work, but the performance is still not that good. With Jaccard, the results are quite good, but it's not that fast. Do you have any "secrets" for making it faster?

florentin_dorre · August 15, 2023, 7:39am

Hey @fabian.z ,
There could be several things to try out.
First I would check how it currently converges. If you run stats, write, or mutate mode, you can inspect ranIterations, nodePairsConsidered, and similarityDistribution.

topK: how many suggestions per node do you really need?
initialSampler: "randomWalk" could lead to earlier convergence
randomJoins: the default is 10, which you could try to lower
sampleRate: try lowering it (default 0.5)
maxIterations: the default is 100; you could try lowering it as well
concurrency: most impact; specify the concurrency parameter (default is 4). However, you need a GDS license for that

In general you can go very fast, but the quality of the result will suffer at some point. The similarityDistribution is a good measure for comparing the quality changes.
