Projecting a graph

I have a graph consisting of a bunch of mostly disconnected trees (I call them H-trees) with the following structure:


Nodes with label H have one or more 'children' with label M, and zero or more with label T or F. A node with label M has zero or one child with label E and zero or one child with label O. Distinct H-trees can share nodes with label O or E; no other nodes are shared between H-trees. Each of these nodes has properties (strings or integers), each drawn from a finite set, and the set of possible properties differs per label.

What I ultimately want is to find groups of similar H nodes, where similarity is measured by their connections to M nodes that have similar properties and connections to similar E and O nodes, and also by the number of connections to T or F nodes.

How do I go about it?

I have tried multiple kinds of projections, running the Jaccard and k-means similarity algorithms, and have not come up with anything satisfactory.

You can try calculating the Jaccard metric between each pair of H nodes. You would need to define the conditions that determine which nodes count as 'in common'. It sounds like you have several different notions of similarity between H nodes. If we measured each individually, how would you use them? You could create an overall similarity score as a weighted sum of the individual components, where the weights sum to one. This would let you control how much each component contributes to the similarity, but it assumes you know how you want them weighted.
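A weighted combination like that takes only a few lines of Python. This is a sketch only; the component names and weights below are made-up placeholders, not anything derived from the actual graph:

```python
def combined_similarity(scores, weights):
    """Weighted sum of per-component similarity scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * s for k, s in scores.items())

# Hypothetical component scores for one pair of H nodes.
scores = {"m_props": 0.8, "t_count": 1.0, "f_count": 0.5}
weights = {"m_props": 0.5, "t_count": 0.25, "f_count": 0.25}
print(round(combined_similarity(scores, weights), 3))  # 0.775
```

The weights are the knob: shifting mass toward `m_props` makes M-node property overlap dominate the overall score.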

Do you have any concrete ideas of what defines similarity? If so, maybe we can build out a query. That would not be a GDS solution, though; maybe a GDS user can help in that area.

Yes, I do know what I mean by similarity. I would consider two nodes with label H very similar if:

  1. they were connected to 'about the same number' of nodes with label F
  2. they were connected to 'about the same number' of nodes with label T
  3. the values of the Integer-encoded properties of their children with label M were 'about the same'
  4. the values of the Integer-encoded properties of the corresponding children of these children were 'about the same'
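One way to make 'about the same' precise is a relative tolerance. A minimal sketch; the tolerance value is an assumption and would need tuning per label/property:

```python
def about_the_same(a, b, rel_tol=0.2):
    """True if a and b differ by at most rel_tol of the larger magnitude."""
    if a == b:
        return True
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

print(about_the_same(48, 47))  # True  (difference 1 <= 0.2 * 48)
print(about_the_same(12, 3))   # False (difference 9 >  0.2 * 12)
```

The same predicate works for counts of T/F neighbors and for individual encoded property values.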

One of the 'techniques' I tried was to 'flatten' the graph and incorporate the encoded properties into disconnected nodes with label HH. I have given examples of two such nodes below. I hoped to run the k-means algorithm on these. I don't like this, because it defeats the purpose of using a graph. Also, since k-means requires a single nodeProperty, I have to further pad all the property vectors to the same length and then concatenate them into a single nodeProperty vector. It may just be easier to put them in a CSV file and run k-means on the file!

I expected that Neo4j could look for similarities between simple subgraphs of nodes with integer properties right out of the box. If there is a better solution, I would love to hear it.

Also, I did the 'flattening' using Cypher (and Python). If there is a way to do this via a projection, please let me know.
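For what it's worth, the Python side of such a flattening can be quite small. A sketch assuming rows shaped like a hypothetical Cypher result (one row per M child; all key names and the `None`-means-no-child convention are invented for illustration):

```python
# Each row: (H id, M's a property, M's g property, E id, O id, W id).
rows = [
    ("h1", 48, 1, 4, 2,    2),
    ("h1", 21, 2, 5, None, 2),  # this M has no O child
]

def flatten_rows(rows):
    """Group per-M rows into one dict of parallel property lists per H node."""
    out = {}
    for h, a, g, e, o, w in rows:
        node = out.setdefault(h, {"m_a": [], "m_g": [], "m_e": [],
                                  "m_o": [], "m_w": []})
        node["m_a"].append(a)
        node["m_g"].append(g)
        node["m_e"].append(e)
        node["m_o"].append(o if o is not None else 0)  # 0 = missing O child
        node["m_w"].append(w)
    return out

print(flatten_rows(rows)["h1"]["m_o"])  # [2, 0]
```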

HH_1:

Fs: 0
Ts: 1
m_a: [48,21,45,15]
m_e: [4,5,4,5]
m_g: [1,2,2,2]
m_o: [2,0,0,0]
m_w: [5,2,1,2]

HH_2:

Fs: 1
Ts: 1
m_a: [58,57,42,29,25,23]
m_e: [9,9,5,8,7,7]
m_g: [1,2,2,1,1,1]
m_o: [12,0,10,8,12,9]
m_w: [5,1,5,5,5,5]
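The padding/concatenation step described above, sketched on the HH_1/HH_2 examples. Zero-padding is an assumption here; any sentinel value outside the id range would do:

```python
import math

HH_1 = {"Fs": [0], "Ts": [1],
        "m_a": [48, 21, 45, 15], "m_e": [4, 5, 4, 5],
        "m_g": [1, 2, 2, 2], "m_o": [2, 0, 0, 0], "m_w": [5, 2, 1, 2]}
HH_2 = {"Fs": [1], "Ts": [1],
        "m_a": [58, 57, 42, 29, 25, 23], "m_e": [9, 9, 5, 8, 7, 7],
        "m_g": [1, 2, 2, 1, 1, 1], "m_o": [12, 0, 10, 8, 12, 9],
        "m_w": [5, 1, 5, 5, 5, 5]}

def flatten(node, lengths):
    """Pad each property list to a common length and concatenate in key order."""
    vec = []
    for key in sorted(lengths):
        vals = node.get(key, [])
        vec.extend(vals + [0] * (lengths[key] - len(vals)))  # zero-pad
    return vec

lengths = {k: max(len(HH_1[k]), len(HH_2[k])) for k in HH_1}
v1, v2 = flatten(HH_1, lengths), flatten(HH_2, lengths)
dist = math.dist(v1, v2)  # plain Euclidean distance between the two vectors
print(len(v1), round(dist, 2))
```

This is exactly the single-nodeProperty vector that could also go straight into a CSV, which is why the CSV route feels equivalent.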

It looks like the 'flattening' technique works efficiently. It may be that 'flattening' or 'condensing' graphs is the way to look for similarity; basically it's an 'embedding'. But I'm still not happy with it, because it loses the power of the graph structure. For example, without assigning (arbitrary) weights, one cannot eliminate candidate pairs early based on dissimilarities near the roots of the tree-shaped subgraphs. I'm surprised that I haven't seen a Neo4j GDS algorithm to check/discover similarity of subgraphs within graphs based on node properties.

Using k-means makes it a simpler problem, as it conceals the definition of 'similarity': the algorithm clusters based on the 'distance' between nodes, using the encoded properties as a vector. It is declarative in the sense that you ask it to find 'similar' nodes and it figures out how.
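For reference, the core of what k-means does with those vectors is small enough to sketch in plain Python. This is Lloyd's algorithm on toy 2-D points, not the GDS implementation:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Toy stand-ins for flattened HH vectors: two obvious groups.
pts = [(0, 1), (1, 0), (0, 0), (10, 10), (11, 9)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [2, 3]
```

The 'distance' here is Euclidean, which is why the integer encoding of the properties quietly becomes part of the similarity definition.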

Instead of k-means, if you want to use the ideas you have enumerated in your list, then you need to define what 'about the same' means in each category and figure out how to combine the individual categories of 'similarity', i.e. weight their importance or use some other form of merging.

Hey! Thanks for the ideas!

Yes, I think that definition of similarity by distance does work, since I have encoded all property values by their ids. I want to consider two H nodes similar if they are, for instance, connected to a set of M nodes which have properties (either node properties or common neighbors) with common ids.

For example, the H-nodes h1 and h2 in these two diagrams would be similar:

(h1:H)-->(:T)
(h1)-->(m1:M {a: 48, g: 1})
(h1)-->(m2:M {a: 21, g: 2})
(m1)-->(:O {id: 2})
(m1)-->(:E {id: 4})
(m1)-->(:W {id: 2})
(m2)-->(:E {id: 5})
(m2)-->(:W {id: 2})

(h2:H)-->(:T {id: 1})
(h2)-->(m3:M {a: 47, g: 1})
(h2)-->(m4:M {a: 25, g: 2})
(h2)-->(m5:M {a: 22, g: 3})
(m3)-->(:O {id: 2})
(m3)-->(:E {id: 4})
(m3)-->(:W {id: 2})
(m4)-->(:E {id: 5})
(m4)-->(:W {id: 2})
(m5)-->(:O {id: 2})
(m5)-->(:E {id: 4})
(m5)-->(:W {id: 2})

which would flatten to:

Fs: 0
Ts: 1
m_a: [48,21]
m_e: [4,5]
m_g: [1,2]
m_o: [2,0]
m_w: [5,2]

Fs: 0
Ts: 1
m_a: [47,25,22]
m_e: [4,5,4]
m_g: [1,2,3]
m_o: [2,0,2]
m_w: [5,2,5]

This is the reason I was trying to apply Jaccard similarity. The operation of 'flattening' is, I believe, equivalent to creating edges from H nodes directly to the end nodes of the relevant paths (e.g. to O, W, or E nodes), so that they all become neighbors of the H nodes. So, for example, h1 would become:

(h1:H)-->(:T)
(h1)-->(m1:M {a: 48, g: 1})
(h1)-->(m2:M {a: 21, g: 2})
(h1)-->(:O {id: 2})
(h1)-->(:E {id: 4})
(h1)-->(:W {id: 2})
(h1)-->(:W {id: 2})
(h1)-->(:E {id: 5})
Unfortunately, I think one can't then directly use node similarity after projecting node properties, because the target nodes must all have the same properties. I believe one would have to give each of the (now) neighbors of the H nodes a property set that is the union of the original properties of all of those neighbors. Am I correct?

Regarding assigning weights to columns (properties or relations): that is exactly the kind of arbitrary, trial-and-error process I am trying to avoid. I would like to see whether there are 'natural' clusters that indicate similarity of H-trees.
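One way to look for 'natural' clusters without hand-picked weights is to score candidate clusterings, e.g. with a silhouette-style coefficient, and keep whichever clustering scores best. A toy sketch in plain Python on 1-D values (not a GDS feature; real flattened vectors would use a vector distance):

```python
def silhouette(points, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) per point, where
    a = mean distance within own cluster, b = mean distance to nearest other."""
    score = 0.0
    for i, (p, l) in enumerate(zip(points, labels)):
        same = [abs(p - q) for j, (q, m) in enumerate(zip(points, labels))
                if m == l and j != i]
        a = sum(same) / len(same) if same else 0.0
        b = min(sum(abs(p - q) for q, m in zip(points, labels) if m == o)
                / labels.count(o)
                for o in set(labels) if o != l)
        score += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return score / len(points)

pts = [1.0, 1.2, 1.1, 9.0, 9.3]
good = [0, 0, 0, 1, 1]   # respects the obvious gap
bad = [0, 1, 0, 1, 0]    # mixes the two groups
print(silhouette(pts, good) > silhouette(pts, bad))  # True
```

A clustering that tracks real structure scores near 1; if no choice of k scores well, that itself is evidence there are no 'natural' clusters under the chosen encoding.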

I am not very familiar with all the ways the GDS similarity algorithms can be used or tweaked. I was hoping I could use one to find occurrences of similar subgraphs (e.g., those that have the 'same shape', i.e., a large 'overlap' of properties and edges when overlaid on each other).