Getting strange results with Node Similarity using weights

Hi all!

I'm trying to find similar nodes using the Node Similarity procedure in GDS.

My data is structured as follows:

I have a Person node which is attached to four other nodes, namely, a Language node, a Pronoun node, a Track node and an AgeSpan node.

There are many instances of each node type, each with a different language, pronoun, track or age-span value.
For instance, a Language node can have the value "English", "German" etc., a Pronoun node can be "he", "she" and so on, and an AgeSpan node can be "18-26", "27-35" etc. For tracks, a person can have track 1, 2, 3 or 4.

A Person node has outgoing relationships to these nodes called either :SPEAKS, :GOES_BY, :USES_TRACK, :IS_AGE depending on the node.

So, I first create the projected graph using the following Cypher query:

CALL gds.graph.create(
    'myGraph',
    ['Person', 'AgeSpan','Language','Pronoun','Track'],
    {
        GOES_BY: {
            type: 'GOES_BY',
            properties: {
                strength: {
                    property: 'strength',
                    defaultValue: 1.0
                }
            }
        },
        USES_TRACK: {
            type: 'USES_TRACK',
            properties: {
                strength: {
                    property: 'strength',
                    defaultValue: 1.0
                }
            }
        },
        IS_AGE: {
            type: 'IS_AGE',
            properties: {
                strength: {
                    property: 'strength',
                    defaultValue: 1.0
                }
            }
        },
        SPEAKS: {
            type: 'SPEAKS',
            properties: {
                strength: {
                    property: 'strength',
                    defaultValue: 1.0
                }
            }
        }
    }
);

There is one catch, however: the data I'm using is only partially complete, so each Person that did not have a pronoun entry in the data is automatically assigned "missing". In the graph, this means that everyone without a pronoun is connected to the same Pronoun node, which has the value "missing".

I have read up on the node similarity algorithm and I think what I'm after is weights. So every relationship from a Person to the "missing" Pronoun node has the property "strength: 0.5". The idea is that when the algorithm looks for similar nodes, two nodes that both match on the "missing" Pronoun shouldn't count as much as two nodes that match on a real value.

Now, when I run the node similarity algorithm in stream mode on this projected graph, I get some weird similarity values.

I'm running the gds.nodeSimilarity.stream procedure with the added argument relationshipWeightProperty: 'strength'.

Here is a chart of the similarities I get depending on which nodes match. The first part of each row says how many of the four nodes match, i.e. whether the two people speak the same Language, have the same AgeSpan, go by the same Pronoun etc. (At least, that's how I understood Jaccard similarity to work.)

Missing chart:
4/4 matches where match in common is "missing" = 1.0
4/4 matches without "missing" = 1.0
3/4 matches where a match in common is "missing" = 0.4
3/4 matches where a match in common is "missing" = 0.5555555555555556
3/4 matches without "missing" = 0.6
3/4 matches where a match not in common is "missing" = 0.666666666666
2/4 matches where a match not in common is "missing" = 0.36363636363636365
2/4 matches without "missing" = 0.33333333333333
1/4 matches where a match not in common is "missing" = 0.15384615384615385

I'm having a really hard time interpreting these results. For instance, 4/4 matches with a "missing" in common should not amount to 1.0!
How come there are two different similarity values for 3/4 matches with a "missing" in common? And how come 3/4 matches where the match not in common is "missing" scores higher than 3/4 matches without "missing"? Shouldn't it be lower?

I'm pretty sure I'm doing something obviously wrong, and I can't wrap my head around what.

So now I'm asking you guys, what am I doing wrong?

Cheers!

Node similarity is calculated, basically, as the ratio of common neighbors to all neighbors of either node (the Jaccard index). So if you add a "missing" node, the algorithm sees a shared "missing" node as a common neighbor like any other.
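To make the arithmetic concrete, here is a rough Python sketch. It assumes that, with relationshipWeightProperty set, the score is the weighted Jaccard ratio (sum of the smaller weight per target node divided by the sum of the larger weight per target node), which is how I read the Node Similarity documentation. The helpers `weighted_jaccard` and `person` are made up for this illustration and are not part of GDS.

```python
def weighted_jaccard(a, b):
    """Sum of per-target minimum weights divided by sum of per-target maximum weights."""
    targets = set(a) | set(b)
    shared = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in targets)
    total = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in targets)
    return shared / total if total else 0.0


def person(language, age_span, track, pronoun):
    """Map each neighbor node to the 'strength' of the relationship pointing at it."""
    weights = {
        f"Language:{language}": 1.0,
        f"AgeSpan:{age_span}": 1.0,
        f"Track:{track}": 1.0,
    }
    # Assumption from the question: relationships to the "missing" Pronoun get 0.5.
    weights[f"Pronoun:{pronoun}"] = 0.5 if pronoun == "missing" else 1.0
    return weights


# 4/4 matches, shared "missing": 3.5 / 3.5
all_match_missing = weighted_jaccard(
    person("English", "18-26", 1, "missing"), person("English", "18-26", 1, "missing"))

# 3/4 matches, "missing" is one of the shared neighbors: 2.5 / 4.5
shared_missing = weighted_jaccard(
    person("English", "18-26", 1, "missing"), person("German", "18-26", 1, "missing"))

# 3/4 matches, no "missing" involved: 3 / 5
no_missing = weighted_jaccard(
    person("English", "18-26", 1, "he"), person("English", "18-26", 1, "she"))

# 3/4 matches, "missing" is the neighbor NOT in common: 3 / 4.5
unshared_missing = weighted_jaccard(
    person("English", "18-26", 1, "missing"), person("English", "18-26", 1, "she"))

print(all_match_missing, shared_missing, no_missing, unshared_missing)
```

If that reading is right, it lines up with several rows of the chart: a shared "missing" adds 0.5 to both the top and the bottom of the ratio, so a full match still comes out as 1.0, while a "missing" that is not shared only adds 0.5 (instead of 1.0) to the bottom, which pushes the score up rather than down.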


From an architectural perspective, there are two possible solutions: either have no node where something is missing (so that it isn't counted as common or not common) or each node with a missing property needs a unique missing property node.
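Here is a small Python sketch of how those two options change the score for two people who agree on everything but both lack a pronoun. It uses plain, unweighted Jaccard on neighbor sets to keep things simple, and the names (`jaccard`, `neighbors`, the `mode` values, the `missing-<person id>` naming) are only for illustration. Note that "unique" here means one placeholder node per person, not a uniqueness constraint in the database.

```python
def jaccard(a, b):
    """Plain Jaccard: shared neighbors divided by all neighbors of either node."""
    return len(a & b) / len(a | b)


def neighbors(person_id, language, age_span, track, pronoun, mode):
    """Build the neighbor set of one Person under a given way of modelling a missing pronoun."""
    targets = {f"Language:{language}", f"AgeSpan:{age_span}", f"Track:{track}"}
    if pronoun is not None:
        targets.add(f"Pronoun:{pronoun}")
    elif mode == "shared":
        # Current setup: everyone without a pronoun points at the same node.
        targets.add("Pronoun:missing")
    elif mode == "unique":
        # Option 2: each person gets their own placeholder node.
        targets.add(f"Pronoun:missing-{person_id}")
    # mode == "omit" (option 1): no Pronoun neighbor at all.
    return targets


scores = {}
for mode in ("shared", "omit", "unique"):
    a = neighbors("p1", "English", "18-26", 1, None, mode)
    b = neighbors("p2", "English", "18-26", 1, None, mode)
    scores[mode] = jaccard(a, b)

print(scores)
```

With a shared "missing" node the pair scores 4/4. Omitting the node gives 3/3, so the pair still looks identical, just on fewer attributes. Per-person placeholders give 3/5, because the two placeholders count as neighbors that are not in common.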


One important point to remember, though, is that the similarity scores are pairwise: each score is a ratio based on the common and non-common neighbors of that particular pair of nodes, not something scaled across the whole database. If you're looking to have equal-sized sets for comparison (so for every pair of nodes, you're considering four possible things they could have in common), you either want to scale your similarity score result or pre-process the data into equal-length vectors (like with the one-hot encoding function).
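As one possible reading of the "equal-length vectors" idea, the Python sketch below one-hot encodes each of the four attributes and divides the number of agreeing positions by a fixed four slots. The vocabularies, field names and the `fixed_slot_similarity` helper are assumptions made for the example; this is not the GDS one-hot encoding function itself.

```python
FIELDS = ("language", "pronoun", "track", "age_span")

# Example vocabularies; a real one would list every value present in the data.
VOCABULARIES = {
    "language": ["English", "German"],
    "pronoun": ["he", "she"],
    "track": [1, 2, 3, 4],
    "age_span": ["18-26", "27-35"],
}


def one_hot(value, vocabulary):
    """1 at the position of value, 0 elsewhere; all zeros if the value is missing."""
    return [1 if value == v else 0 for v in vocabulary]


def encode(person):
    """Concatenate the one-hot vectors of all four attributes into one fixed-length vector."""
    vector = []
    for field in FIELDS:
        vector.extend(one_hot(person.get(field), VOCABULARIES[field]))
    return vector


def fixed_slot_similarity(u, v, slots=len(FIELDS)):
    """Positions where both vectors are 1, divided by a fixed number of slots."""
    return sum(x & y for x, y in zip(u, v)) / slots


no_pronoun = {"language": "English", "track": 1, "age_span": "18-26"}
with_pronoun = {"language": "English", "pronoun": "she", "track": 1, "age_span": "18-26"}

both_missing_score = fixed_slot_similarity(encode(no_pronoun), encode(no_pronoun))
full_match_score = fixed_slot_similarity(encode(with_pronoun), encode(with_pronoun))

print(both_missing_score, full_match_score)
```

Because the denominator is always four, two people who agree on three attributes and both lack a pronoun score 3/4, while a full match on four real values scores 4/4.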

Hi Alicia!

First and foremost, thank you for your quick answer!

Yeah, I like your ideas regarding the architectural perspective of the graph structure. Skipping the "missing" nodes altogether could be a great option.

I've done some googling around on giving each node a unique missing property node, but I'm not quite sure I understand it. The way I have it now in my database, each Person has a unique ID attached to them, so naturally I added a unique constraint to the ID property of Person.

As I understand it, that constraint makes it impossible to create two Person nodes with the same ID. But I'm not sure how exactly this would work with my "missing" node?

Thanks for your help, much appreciated!