Graph Data Science: K-Nearest Neighbors

I'm running Neo4j 4.1 and GDS 1.4.

I'm trying to utilize ML tools to gain insights about a genetic genealogy graph database. It has CB_Match nodes with chromosome segment data on individuals matching one another; that is, they share segments.

I've created an in-memory graph:

CALL gds.graph.create.cypher(
  'myGraph',
  "match (c:CB_Match)  return id(c) as id",
  "match (c1:CB_Match)-[r:match_by_segment{phased:'Y'}]-(c2:CB_Match)  return id(c1) as source,id(c2) as target,r.cm as weight" 
)

From this I can create an embedding variable for CB_Match nodes:

CALL gds.fastRP.stream('myGraph', {embeddingDimension: 4})
YIELD nodeId,embedding
with  gds.util.asNode(nodeId).RN as RN,gds.util.asNode(nodeId).fullname as Name, embedding
return RN,Name,embedding order by RN,Name

I have used the write procedure to add this property to the CB_Match nodes.

Now I am trying to use the embedding property as described in the recent GDS announcement, specifically for neighborhood detection and visualization.

Following the documentation for KNN and its default value of {} for the configuration map, I ran the following:

CALL gds.beta.knn.stream(
  'myGraph',
{ }
) 
YIELD  node1,  node2,  similarity
with  gds.util.asNode(node1).fullname as Match1, gds.util.asNode(node2).fullname as Match2, similarity
return Match1, Match2, similarity limit 50

This produced an error saying I had omitted the required nodeWeightProperty from the configuration, so I added it:

CALL gds.beta.knn.stream(
  'myGraph',
{nodeWeightProperty:'embedding' }
) 
YIELD  node1,  node2,  similarity
with  gds.util.asNode(node1).fullname as Match1, gds.util.asNode(node2).fullname as Match2, similarity
return Match1, Match2, similarity limit 50

and received an error saying that not every node has the embedding property, which is not true: every CB_Match node in the database has it.

Is this a bug or a problem with my logic?

To feed the properties computed by FastRP into KNN, you need to use the mutate mode, which adds them to the in-memory graph (the one you call 'myGraph'). The write mode only writes them to Neo4j. You can reload them from Neo4j as well, but then you have to project a new in-memory graph in which you also declare the properties, and this is less efficient than using mutate.

You can read more about the different execution modes here: Running algorithms - Neo4j Graph Data Science
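
As a rough sketch of the mutate route, assuming the in-memory graph is still called 'myGraph' and that 'embedding' is the property name you want (both are taken from your post, not verified against your database), something like this should work:

// Compute FastRP and store the result on the in-memory graph only.
// 'embedding' as the mutateProperty name is an assumption.
CALL gds.fastRP.mutate('myGraph', {
  embeddingDimension: 4,
  mutateProperty: 'embedding'
})
YIELD nodePropertiesWritten;

// KNN can now read the property directly from the in-memory graph.
CALL gds.beta.knn.stream('myGraph', {nodeWeightProperty: 'embedding'})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).fullname AS Match1,
       gds.util.asNode(node2).fullname AS Match2,
       similarity
LIMIT 50;

Nothing is written back to Neo4j here; the embedding lives only in 'myGraph' until you drop the graph or write it out explicitly.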

Thanks. The in-memory graph I created did have the "embedding" property. It was a two-step process, which is less efficient, as you note, but the property was there in the second iteration of the in-memory graph. Yet I still got the error, so I am still puzzled that it is not working. Is it a bug or my logic?

Hello, the error "that not every node had the embedding property" occurs because the nodeQuery doesn't return the embedding property, so it is absent from the in-memory graph even though it exists in the Neo4j database. You can check the documentation for how to add node properties: https://neo4j.com/docs/graph-data-science/current/management-ops/cypher-projection/#cypher-projection-properties.
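
For illustration, here is a minimal sketch of such a projection. It assumes the embedding was written to Neo4j under the property name 'embedding', and the graph name 'myGraphWithEmbedding' is only a placeholder I made up:

// Every extra column returned by the node query becomes a node property
// of the in-memory graph. The property name 'embedding' is an assumption.
CALL gds.graph.create.cypher(
  'myGraphWithEmbedding',
  "MATCH (c:CB_Match) RETURN id(c) AS id, c.embedding AS embedding",
  "MATCH (c1:CB_Match)-[r:match_by_segment {phased:'Y'}]-(c2:CB_Match) RETURN id(c1) AS source, id(c2) AS target, r.cm AS weight"
)

After that, KNN can be pointed at the new graph with nodeWeightProperty set to 'embedding'.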

I hope this helps.


Your suggestion solved the initial problem. That is, with the embedding property included in the in-memory graph, the KNN algorithm now runs. Now I need to optimize the parameters!
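
As a starting point for that tuning, this is the kind of call I plan to experiment with. The parameter names are the ones I understand the KNN documentation to list (topK, sampleRate, deltaThreshold, maxIterations); the values are arbitrary guesses, not recommendations:

// Illustrative values only; check each name against the KNN docs for your GDS version.
CALL gds.beta.knn.stream('myGraph', {
  nodeWeightProperty: 'embedding',
  topK: 5,                // neighbors to keep per node
  sampleRate: 0.8,        // fraction of candidate neighbors sampled per iteration
  deltaThreshold: 0.001,  // stop early once few neighbor lists change
  maxIterations: 50
})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).fullname AS Match1,
       gds.util.asNode(node2).fullname AS Match2,
       similarity
ORDER BY similarity DESC
LIMIT 50;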