Hey! Thanks for the ideas!
Yes, I think that definition of similarity by distance does work, since I have encoded all property values by their ids. I want to consider two H nodes similar if they are, for instance, connected to a set of M nodes which have properties (either node properties or common neighbors) with common ids.
For example, the H-nodes h1 and h2 in these two diagrams would be similar:
(h1:H)-->(:T)
(h1)-->(m1:M {a: 48, g: 1})
(h1)-->(m2:M {a: 21, g: 2})
(m1)-->(:O {id:2})
(m1)-->(:E {id:4})
(m1)-->(:W {id:2})
(m2)-->(:E {id:5})
(m2)-->(:W {id:2})
(h2:H)-->(:T {id:1})
(h2)-->(m3:M {a: 47, g: 1})
(h2)-->(m4:M {a: 25, g: 2})
(h2)-->(m5:M {a: 22, g: 3})
(m3)-->(:O {id:2})
(m3)-->(:E {id:4})
(m3)-->(:W {id:2})
(m4)-->(:E {id:5})
(m4)-->(:W {id:2})
(m5)-->(:O {id:2})
(m5)-->(:E {id:4})
(m5)-->(:W {id:2})
which would flatten to (h1 first, then h2):
Fs: 0
Ts: 1
m_a: [48,21]
m_e: [4,5]
m_g: [1,2]
m_o: [2,0]
m_w: [5,2]
Fs: 0
Ts: 1
m_a: [47,25,22]
m_e: [4,5,4]
m_g: [1,2,3]
m_o: [2,0,2]
m_w: [5,2,5]
This is the reason I was trying to apply Jaccard similarity. The 'flattening' operation is, I believe, equivalent to creating edges from the H nodes directly to the end nodes of the relevant paths (e.g. to O, W or E nodes), so that they all become neighbors of the H nodes. For example, h1 would become:
(h1:H)-->(:T)
(h1)-->(m1:M {a: 48, g: 1})
(h1)-->(m2:M {a: 21, g: 2})
(h1)-->(:O {id:2})
(h1)-->(:E {id:4})
(h1)-->(:W {id:2})
(h1)-->(:W {id:2})
(h1)-->(:E {id:5})
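To make the flattening concrete, here is a minimal sketch in plain Python (not GDS, and not a claim about how GDS computes anything). It copies the O, E and W end nodes from the two diagrams above, flattens them onto h1 and h2, and compares the results once as plain sets and once as multisets. The names `paths`, `flatten`, `jaccard` and `weighted_jaccard` are made up for the illustration.

```python
from collections import Counter

# Two-hop paths H -> M -> end node, copied from the diagrams above.
# Each end node is written as a (label, id) pair.
paths = {
    "h1": {
        "m1": [("O", 2), ("E", 4), ("W", 2)],
        "m2": [("E", 5), ("W", 2)],
    },
    "h2": {
        "m3": [("O", 2), ("E", 4), ("W", 2)],
        "m4": [("E", 5), ("W", 2)],
        "m5": [("O", 2), ("E", 4), ("W", 2)],
    },
}

def flatten(h):
    """Count the end nodes reachable from H node `h` through its M nodes."""
    return Counter(end for ends in paths[h].values() for end in ends)

def jaccard(a, b):
    """Plain Jaccard: duplicates are ignored, only membership counts."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def weighted_jaccard(a, b):
    """Multiset Jaccard: sum of min counts over sum of max counts."""
    keys = set(a) | set(b)
    den = sum(max(a[k], b[k]) for k in keys)
    return sum(min(a[k], b[k]) for k in keys) / den if den else 0.0

f1, f2 = flatten("h1"), flatten("h2")
print(jaccard(f1, f2))           # 1.0
print(weighted_jaccard(f1, f2))  # 0.625
```

As sets, h1 and h2 look identical, because the extra M node on h2 only repeats end nodes that are already present. Only the multiset version registers that h2 has three M nodes rather than two, so whether duplicates are kept during flattening changes the result.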
Unfortunately, I don't think one can then use node similarity directly after projecting node properties, because the target nodes must all have the same properties. I believe each of the (now direct) neighbors of the H nodes would need a set of properties that is the union of the original properties of all of those neighbors. Am I correct?
Regarding assigning weights to columns (properties or relations), that is exactly the kind of arbitrary, trial-and-error process I am trying to avoid. I would like to see if there are 'natural' clusters that indicate similarity of H-trees.
I am not very familiar with all the ways the GDS similarity algorithms can be used or tweaked. I was hoping I could use one to find occurrences of similar subgraphs (e.g., those that have the 'same shape', i.e., a large 'overlap' of properties and edges when overlaid on each other).
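One way to picture the property side of this, purely as a sketch in plain Python and under the assumption that each property value is treated as just another thing an H node "has": every M property becomes a (key, value) token, so no union of property sets is needed. The names `m_props` and `property_tokens` are hypothetical, and the values are the M properties from the diagrams above.

```python
# M-node properties per H node, copied from the diagrams above.
m_props = {
    "h1": [{"a": 48, "g": 1}, {"a": 21, "g": 2}],
    "h2": [{"a": 47, "g": 1}, {"a": 25, "g": 2}, {"a": 22, "g": 3}],
}

def property_tokens(h):
    """Set of (key, value) pairs over all M nodes attached to H node `h`."""
    return {(key, value) for props in m_props[h] for key, value in props.items()}

def jaccard(a, b):
    """Plain Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

t1, t2 = property_tokens("h1"), property_tokens("h2")
print(sorted(t1 & t2))  # [('g', 1), ('g', 2)]
print(jaccard(t1, t2))  # 0.25
```

With exact matching, only the `g` values overlap. The `a` values 48 and 47 count as completely different even though they are close, which is a limitation of treating numeric properties as ids.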