Clustering directed paths with graphSAGE

Hi all,

Brand new here, and pretty new to graph networks in general - excited to be a part of this great community! I was hoping to get a sense check on a project I'm working on - there are a lot of moving parts and I'd just like to know if I've got the right approach.

Background
I have a website widget tool that users click through to get support, and the tool asks them a series of questions, and the user clicks to respond, and this will take them to another card in a tree-like fashion until they reach a leaf node, which will contain a solution to their problem.

Objective
I'm trying to cluster the routes that users can take through the tool to find similar paths - there are many instances of the tool (on different websites), but there are many issues/things to be resolved that are similar across the websites (such as wanting a refund).

Process
I've mapped every node onto a graph, and so each instance of the tool will be a distinct set of nodes and edges in the same graph, but unconnected to other instances of the tool.

  • For the title and description/explanation of each node I've run this through GPT and attached the embeddings as properties to each node.
  • For the users response (the choice / button they click on each node ) I've done the same, but attached the embedding as an edge property between the two nodes that the user response relates to.

For example, a node might be "Have you tried turning it off and on again?" and the user might click "No, don't know how to" and then they will visit a corresponding node.

I'm creating a projection of the graph:

node_spec = {
    "Node": {"properties": {
            "title_embedding": {"property": "title_embedding", "defaultValue": [0] * 768},
            "description_embedding": {"property": "description_embedding", "defaultValue": [0] * 768}
}}}

relationship_spec = {"NEXT_NODE": {
     "orientation": "NATURAL", 
     "properties": {
     "answer_embedding": {"property": "answer_embedding", "defaultValue": 0}
 }}}


G, projection_result = gds.graph.project(projection_name, node_spec,
                                                         relationship_spec)

And then training a graphSAGE model, the parameters for which I have tuned to minimise losses using tuning software:

params = {
            'featureProperties':['title_embedding','description_embedding'],
            'epochs': 150,
            'maxIterations': 20,
            'embeddingDimension': 64,
            'relationshipWeightProperty': 'answer_embedding', # not sure about this
            'penaltyL2' : 0.004,
            'learningRate' : 0.01315
        }

model = gds.beta.graphSage.train(G, modelName=model_name, <params here>)

Then I take the nodes for each unique path, and retrieve the graph sage embedding, and am performing an exponential decayed weighted average of each path - I do this because many of the earlier nodes are very similar and generic "Welcome to our tool" "How can we help" etc, so I'd like to focus on the more important aspects later on in each path.

I then am experimenting with DBSCAN / HDBSCAN to cluster the resulting averaged embeddings, to try and find similar paths - but the results are pretty bad :frowning: - the clusters are a mess, and I still think heavily influenced by earlier nodes so I may need to cut these out.

Does this approach seem reasonable for a first-ever attempt at graph networks? Any advice or corrections would be hugely appreciated.

Many thanks,
David

Hi @DavidM,

This is an interesting question. You're right that node embeddings out of the box is not enough to solve it.

First of all on:

I've mapped every node onto a graph, and so each instance of the tool will be a distinct set of nodes and edges in the same graph, but unconnected to other instances of the tool.

I want to check if all your data is one graph (one connected component), or a collection of disjoint small graphs (i.e multiple connected components)? This will make a difference. GraphSAGE (or FastRP/N2V/HashGNN) embeddings work best when it's one large graph. If it is indeed multiple graphs, perhaps it is worth trying putting them into one first.

A few ideas that perhaps is worth trying out:

  • I suggest to not use answer_embedding as relationshipWeightProperty, which you've commented as well. relationshipWeightProperty is suppose to be used to weigh the corresponding features, which I believe the GPT generated answer_embedding do not represent such weights/importance-scores.

  • Embedding dimension is a difficult parameter to tune. I'm not sure how large your graph is. If the number of nodes/edges is small (below 1k), then perhaps use a smaller dim (e.g 32). Otherwise if it's > 10xthousands than try higher dim.

  • Trying the decayed weighted average to lift node embeddings into one for the path makes sense. The issue is that the embeddings generated by GS might not work best to represent such paths using averaging. There's no guarantee in GS training that this is optimized. A possible approach to try to make improvements are: 1. Somehow verify your GS node embeddings are of good quality, if you have another simple node-level task. 2. Then try variants of your decayed weighted average to form embeddings for the path. A simple one I would try is to simple concatenate the nodes feature together, then given two paths of equal length, you can find their similarity by taking the distance (or cosine distance). This could be a step closer to the goal, which only works for comparing equal length paths.

Hope this helps, let us know how it goes.

Thanks,
Brian

1 Like

Hi Brian,

I can't thank you enough for your time and detailed reply, it is hugely appreciated.

I will dive into your suggestions and report back.

Kind regards,
David

Hi @brian.shi1 ,

I'm making progress!

I combined all of the disjointed graphs into a single graph, and as you suggested, it was still difficult to produce meaningful clusters.

So I tried a hyperparameter sweep by noting down the indices of paths for a specific topic, with the intention of maximising the number of total clusters while simultaneously trying to maximise the number of paths of my chosen topic that were clustered together.

So far, I've found that HDBSCAN + FastRP yields good results when I only cluster the final node in each path - so this is a great step forward. Clustering entire paths is also starting to produce a few clusters that contain similar paths from what would have originally been multiple disjointed graphs, which is also great, but it's still not perfect. Perhaps I need to manually label more paths.

Something that probably is letting it down is the inability to add the users response - which is a relationship property - to the FastRP representation. After being shown a 'node' in the widget, the user selects a response, which then directs them to the relevant next node. By not incorporating this I might be eliminating quite strong signals. However I can't see a relationshipProperties parameter in either of graphSAGE or FastRP (similar to featureProperties) - is there one? Or alternatively do you think it would be effective to add the response embedding to the child node?

As I'm now using FastRP, which doesn't appear to support heterogenous node types, I'm also losing out on this type of signal too (it looks like graphSAGE support heterogenous nodes).

Are there any algorithms that support heterogenous nodes and allow the specification of relationshipProperties ?

Many thanks,
David

Hi @David,

That's good news.

Unfortunately we don't have any embedding that take general array relationshipProperties. It is work in progress. We are also in general improving support for heterogeneous graphs.

For the specific path-clustering problem you have, you'll need to do quite some pre & post processing for your final application to work.

Another idea that I would also try is:

  • If I understand correctly, you currently have one relationhshipType NEXT_NODE with different answer_embedding values to represent different actions. An alternative is to have different relationshipTypes to directly represent relationship-heterogeneous graphs. (i.e have relTypes LIKES, REFUND, etc.) We then have HashGNN that support heterogeneous relationships.

  • Use node embeddings only for generating node embeddings, afterwards use some graph algorithm, or custom written script, to take graph with node embeddings -> final clustered graphs, and leverage the relationshipProperties/heterogeneous relationship types in this 'post-processing' step.

Hope it helps.

Best,
Brian