Establishing similarity in a genealogical database

I have a genealogical graph database. If I run a query like this:

MATCH (horse:Horse {Name: "DANEHILL"})
MATCH path=(horse)-[:Child_Of*1..3]->(ancestor)
RETURN horse, ancestor

it (correctly) returns the pedigree of that horse to 3 generations.

Every node/record has a property "Best Race Class", which has 5 distinct values.

For all the records where "Best Race Class" is NOT equal to "NONG" (a non-valid record), I want to return the list of records ordered by their similarity to the subject horse (in the query above, "DANEHILL"). The search could be limited to 5 generations of each record's pedigree if that made it quicker, but the idea is to have a list sorted by a similarity score (or some other such metric).

What is the best way to approach this?

Thanks in advance.

Sorry, but I don’t fully understand yet. Are you asking to find all nodes without that property value and for each calculate the similarity to the subject horse?

If yes, what is your definition of similar? Usually the pair-wise similarity between two nodes is something like the count of items the two nodes have in common divided by the total number of distinct items related to either node. Basically, the more of the total items they have in common, the more similar they are. How would you adapt it to use the ancestors of each node? The number of ancestors they have in common divided by the total number of distinct ancestors of both nodes?
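To make that ratio concrete, here is a minimal sketch in Python on a toy pedigree. Everything in it is an assumption for illustration: the horse names, the `PARENTS` dictionary, and the helper names `ancestors` and `jaccard` are made up and are not taken from your database. It only shows the "shared ancestors divided by all distinct ancestors" idea with a generation limit.

```python
# Toy pedigree: child -> list of parents. All names are hypothetical.
PARENTS = {
    "A": ["S1", "D1"],
    "B": ["S1", "D2"],
    "S1": ["GS", "GD"],
    "D1": ["X", "Y"],
    "D2": ["X", "Z"],
}


def ancestors(horse, max_generations):
    """Collect the distinct ancestors of `horse` within `max_generations`
    steps, similar in spirit to a Child_Of*1..N traversal."""
    found = set()
    frontier = {horse}
    for _ in range(max_generations):
        # Move one generation up from every horse in the current frontier.
        frontier = {p for h in frontier for p in PARENTS.get(h, [])}
        if not frontier:
            break
        found |= frontier
    return found


def jaccard(a, b):
    """Shared items divided by all distinct items; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


anc_a = ancestors("A", 2)  # S1, D1, GS, GD, X, Y
anc_b = ancestors("B", 2)  # S1, D2, GS, GD, X, Z
similarity = jaccard(anc_a, anc_b)
print(similarity)  # 4 shared / 8 distinct = 0.5
```

Note that this treats every ancestor equally regardless of generation; weighting closer generations more heavily would be a different metric.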

Yes.

So for the subset of nodes (horses) that have a Best Race Class value not equal to "NONG", I want to calculate the closeness of the relationship between the subject horse (in this case DANEHILL) and all other horses (a group of about 5,000 that are not equal to "NONG") based on their ancestry (the parent, grandparent, great-grandparent, etc. nodes). I could limit the number of generations searched for similarity.

I had thought there might be a suitable graph algorithm to use, like Node Similarity or K-means? I could calculate the coefficient of inbreeding for a hypothetical mating between the subject horse and every other horse, but with 5,000+ horses that would be computationally expensive and probably not realistic.

An afternoon with ChatGPT and Google Bard got me pretty close to the answer:

// Retrieve ancestors for DANZERO
MATCH (danzero:Horse {Name: "DANZERO"})-[:Child_Of*1..5]->(ancestor)
WITH COLLECT(DISTINCT ancestor) AS danzeroAncestors

// Iterate over G1, G2, G3, and LR horses and calculate Jaccard similarity
MATCH (g1Horse:Horse)
WHERE g1Horse.`Best Race Class` IN ["G1", "G2", "G3", "LR"]
  AND g1Horse.Name <> "DANZERO"  // exclude the subject horse itself
WITH g1Horse, danzeroAncestors
MATCH (g1Horse)-[:Child_Of*1..5]->(g1Ancestor)
WITH g1Horse.Name AS HorseName, danzeroAncestors, COLLECT(DISTINCT g1Ancestor) AS g1Ancestors

// Calculate intersection
WITH HorseName, danzeroAncestors, g1Ancestors,
     [x IN danzeroAncestors WHERE x IN g1Ancestors] AS intersectionList,
     danzeroAncestors + g1Ancestors AS combinedList
WITH HorseName, intersectionList,
     SIZE(intersectionList) AS intersectionSize,
     SIZE(apoc.coll.toSet(combinedList)) AS unionSize  
WITH HorseName, 
     intersectionSize, 
     unionSize,
     CASE WHEN unionSize > 0 THEN TOFLOAT(intersectionSize) / unionSize ELSE 0 END AS JaccardSimilarity
ORDER BY JaccardSimilarity DESC
LIMIT 30
RETURN HorseName, JaccardSimilarity

This handles only a single subject horse, but it compares that horse against every other horse with one of the given Best Race Class values.
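To sanity-check the ranking logic outside the database, and to avoid hard-coding the subject, the same idea can be sketched in Python. This is only a rough illustration under assumptions: the ancestor sets and race classes below are invented, and `rank_by_similarity` is a hypothetical helper, not part of Neo4j or APOC. In practice the ancestor sets would come from the graph.

```python
import heapq


def jaccard(a, b):
    """Shared items divided by all distinct items; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_similarity(subject, ancestor_sets, race_class, top_n=30):
    """Return up to top_n (name, score) pairs, most similar first,
    skipping the subject itself and any horse classed as "NONG"."""
    subject_anc = ancestor_sets[subject]
    scored = (
        (jaccard(subject_anc, anc), name)
        for name, anc in ancestor_sets.items()
        if name != subject and race_class[name] != "NONG"
    )
    best = heapq.nlargest(top_n, scored)
    return [(name, score) for score, name in best]


# Invented example data: horse -> set of ancestor identifiers.
ancestor_sets = {
    "SUBJECT": {"a", "b", "c", "d"},
    "H1": {"a", "b", "c", "e"},  # shares 3 of 5 distinct ancestors
    "H2": {"a", "x", "y", "z"},  # shares 1 of 7 distinct ancestors
    "H3": {"a", "b", "c", "d"},  # identical pedigree, but classed NONG
}
race_class = {"SUBJECT": "G1", "H1": "G1", "H2": "LR", "H3": "NONG"}

ranking = rank_by_similarity("SUBJECT", ancestor_sets, race_class)
for name, score in ranking:
    print(name, round(score, 3))
```

Passing a different `subject` reuses the same ancestor sets, so ranking every horse against every other horse would not need to traverse the pedigree again for each comparison.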