Hi all!
I'm trying to find similar nodes using the Node Similarity procedure in GDS.
My data is structured as follows:
I have a Person node which is attached to four other nodes, namely, a Language node, a Pronoun node, a Track node and an AgeSpan node.
There are many instances of each node type, each with a different language, pronoun, track or age-span value.
For instance, a Language node can have the value "English", "German" etc., a Pronoun node can be "he", "she" and so on, and an AgeSpan node could be "18-26", "27-35" etc. For tracks, a person can have track 1, 2, 3 or 4.
A Person node has outgoing relationships to these nodes called either :SPEAKS, :GOES_BY, :USES_TRACK, :IS_AGE depending on the node.
I first create the projected graph using the following Cypher query:
CALL gds.graph.create(
  'myGraph',
  ['Person', 'AgeSpan', 'Language', 'Pronoun', 'Track'],
  {
    GOES_BY: {
      type: 'GOES_BY',
      properties: {
        strength: {
          property: 'strength',
          defaultValue: 1.0
        }
      }
    },
    USES_TRACK: {
      type: 'USES_TRACK',
      properties: {
        strength: {
          property: 'strength',
          defaultValue: 1.0
        }
      }
    },
    IS_AGE: {
      type: 'IS_AGE',
      properties: {
        strength: {
          property: 'strength',
          defaultValue: 1.0
        }
      }
    },
    SPEAKS: {
      type: 'SPEAKS',
      properties: {
        strength: {
          property: 'strength',
          defaultValue: 1.0
        }
      }
    }
  }
);
There is one catch, however: the data I'm using is only partially complete, so each Person that has no pronoun entry in the data is automatically assigned "missing". In the graph, this means that everyone without a pronoun is connected to a Pronoun node with the value "missing".
I have read up on the Node Similarity algorithm, and I think what I'm after is weights. So every relationship from a Person to the "missing" Pronoun node carries the property "strength: 0.5". The idea is that when two people match only because both have the "missing" pronoun, that match should count for less than a match on a real value.
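For reference, the weight is set with something along these lines. This is only a sketch: the property name `value` on the Pronoun node is a placeholder for whatever the property is actually called, and the statement has to run before the projection is created for the weight to be picked up.

```cypher
// Assumption: Pronoun nodes store their text in a property called `value`.
// Mark every relationship to the "missing" pronoun with a reduced weight.
MATCH (:Person)-[r:GOES_BY]->(:Pronoun {value: 'missing'})
SET r.strength = 0.5;
```

All other relationships have no `strength` property, so they fall back to the `defaultValue` of 1.0 from the projection.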
Now when using the stream algorithm and running the node similarity algorithm on this projected graph I get some weird similarity values.
I'm running the gds.nodeSimilarity.stream procedure with the added argument relationshipWeightProperty: 'strength'.
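The call looks roughly like this. The `name` property on Person is an assumption used only to make the output readable; any identifying property would do.

```cypher
// Stream weighted similarities from the projected graph 'myGraph'.
CALL gds.nodeSimilarity.stream('myGraph', {
  relationshipWeightProperty: 'strength'
})
YIELD node1, node2, similarity
// Assumption: Person nodes have a `name` property.
RETURN gds.util.asNode(node1).name AS person1,
       gds.util.asNode(node2).name AS person2,
       similarity
ORDER BY similarity DESC;
```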
Here is a chart that shows the different similarities depending on which nodes match. The first part denotes how many of the four nodes match, i.e. whether the two people speak the same Language, have the same AgeSpan, go by the same Pronoun etc. (At least, that's how I understand the Jaccard similarity to work.)
Missing chart:
4/4 matches where match in common is "missing" = 1.0
4/4 matches without "missing" = 1.0
3/4 matches where a match in common is "missing" = 0.4
3/4 matches where a match in common is "missing" = 0.5555555555555556
3/4 matches without "missing" = 0.6
3/4 matches where a match not in common is "missing" = 0.666666666666
2/4 matches where a match not in common is "missing" = 0.36363636363636365
2/4 matches without "missing" = 0.33333333333333
1/4 matches where a match not in common is "missing" = 0.15384615384615385
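As a sanity check on the two rows without "missing": as far as I understand it, plain Jaccard is the size of the intersection divided by the size of the union, not the fraction of matches. A quick calculation, assuming each Person has exactly four outgoing relationships:

```cypher
// Unweighted Jaccard = |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B|.
// Assumption: both people have exactly 4 neighbours.
UNWIND [3, 2] AS shared
RETURN shared,
       toFloat(shared) / (4 + 4 - shared) AS jaccard;
```

That gives roughly 0.6 for 3 shared nodes and 0.33 for 2 shared nodes, which lines up with those two rows. It is the weighted rows that I can't make sense of.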
I'm having a hard time interpreting these results. For instance, 4/4 matches with a "missing" in common should not amount to 1.0!
Why are there two different similarity values for 3/4 matches with a "missing" in common? And why is 3/4 matches where the non-matching node is "missing" higher than 3/4 matches without "missing"? Shouldn't it be lower?
I'm pretty sure I'm doing something obviously wrong, but I can't wrap my head around what.
So now I'm asking you guys, what am I doing wrong?
Cheers!