FastRP Embedding - Tuning Suggestions to preserve Monopartite Hop Distance

I have a monopartite graph projection (one node type, one undirected relationship type) filtered to include only one connected component (the largest) so all nodes are reachable from all others. The largest hop distance is 12.

I have set up a Python pipeline to embed the nodes with a given set of parameters, compute the pairwise distance matrix, compute the Pearson r correlation between hop distance and embedding distance over all node pairs, and plot a heatmap of the 2D histogram of embedding distance vs. hop distance.

My objective is to embed the nodes such that Euclidean distances in the embedding are maximally correlated with hop distances in the graph. I'm looking for suggestions to improve the results. Any tips appreciated. Thanks in advance!

The projection is made as follows:

MATCH (source)
WHERE source:Organization AND source.componentId IN [0] //[8,13,5,6,7,9,10,11,12,4,3,2,1,0]
OPTIONAL MATCH (source)-[r:R]-(target)
WHERE target:Organization AND target.componentId IN [0] //[8,13,5,6,7,9,10,11,12,4,3,2,1,0]
WITH gds.graph.project(
  'testComps',
  source,
  target,
  {
    sourceNodeProperties: source { componentId: source.componentId },
    targetNodeProperties: target { componentId: target.componentId },
    sourceNodeLabels: labels(source),
    targetNodeLabels: labels(target),
    relationshipType: type(r)
  },
  {undirectedRelationshipTypes: ['R']}
) AS g
RETURN g.graphName AS graph, g.nodeCount AS nodes, g.relationshipCount AS rels;

and an example of the embedding parameters is

fastRP_params = {
    'embeddingDimension': 1024,
    'iterationWeights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'propertyRatio': 0.0,
    'nodeSelfInfluence': 1.0
}

cypher = f"""CALL gds.fastRP.stream(
  'testComps',
  {{
    embeddingDimension: {fastRP_params['embeddingDimension']},
    iterationWeights: {fastRP_params['iterationWeights']},
    randomSeed: 42,
    propertyRatio: {fastRP_params['propertyRatio']},
    nodeSelfInfluence: {fastRP_params['nodeSelfInfluence']}
  }}
);
"""

which results in a heatmap of embedding distance vs. hop distance (plot attached).

Hi @stauntonjr ,

This is a very interesting question. I believe the objective you're trying to optimize for is something like:

$$\min_{x_1,\dots,x_N \in \mathbb{R}^d} \; \sum_{u < v} \Big( \lVert x_u - x_v \rVert_2 - \mathrm{hop}(u, v) \Big)^2$$

None of the embedding methods optimize for this out of the box. It may well be possible, by controlling the initialization and hyperparameters, to get some of the existing methods to approximate it, but I think it is nontrivial.

A few ideas I think could be worth trying; even if they fail, the negative results might suggest some good next steps:

  • Reduce the embedding dimension from 1024. From the example I see you have 14 nodes, and an embedding dimension of that order should be sufficient to capture the topology, e.g. try embedding dimension 8 or 16. The issue with a dimension that is too high is that there aren't enough samples (nodes) to provide that much information.

  • I can see why you want to try 12 iteration weights (all 1.0), since the max possible hop is 12 and this lets a random walk cover the longest path. There is, however, no exact correspondence between the number of iterations and the max hop. I would also have used 12, but I'd try to see what happens with fewer (e.g. 6).

  • Try node2vec too. I would start with a high inOutFactor and returnFactor to make the random walks stay local. Maybe also try GraphSAGE with the mean aggregator.

  • On initialization: I see you use random initialization in FastRP. I would also try initializing all nodes with a feature of all 1's and see what happens, for FastRP, node2vec and GraphSAGE. For FastRP you'll need to set propertyRatio != 0; a minimal sketch follows right after this list.
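
To illustrate that last point, here is a minimal sketch of the constant-feature variant, following your f-string pattern. It assumes the projection also carries a constant node property (e.g. add a literal const: 1.0 entry to the sourceNodeProperties / targetNodeProperties maps); the dimension of 16 and propertyRatio of 0.5 are only illustrative:

# Assumes the projection includes e.g.
#   sourceNodeProperties: source { componentId: source.componentId, const: 1.0 }
constant_feature_cypher = """
CALL gds.fastRP.stream(
  'testComps',
  {
    embeddingDimension: 16,
    iterationWeights: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    featureProperties: ['const'],
    propertyRatio: 0.5,
    randomSeed: 42
  }
)
"""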

Also, I think there might be some algorithmic way of solving this? The difficulty I see is in trading off the exact hop counts against the 'best compromise', which is what a loss function does.

But, maybe a simple solution could be:

  1. Run All Pairs Shortest Path to count all hops (All Pairs Shortest Path - Neo4j Graph Data Science).
  2. For each node, collect its list of hops to all nodes; its hop to itself = 0. Keep the list ordered by node.
  3. Turn this list into an embedding. Maybe this already works? It is an 'embedding' of dimension = 14 (node count). If not, I think using some other algorithms (maybe to initialize embeddings) might be helpful too. A rough sketch of these three steps follows below.
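
A rough sketch of those steps in Python (the procedure name may differ by GDS version, e.g. older releases use gds.alpha.allShortestPaths.stream, and driver is your Neo4j Python driver handle):

import numpy as np

apsp_cypher = """
CALL gds.allShortestPaths.stream('testComps')
YIELD sourceNodeId, targetNodeId, distance
RETURN sourceNodeId, targetNodeId, distance
"""

with driver.session() as session:
    triples = [(rec['sourceNodeId'], rec['targetNodeId'], rec['distance'])
               for rec in session.run(apsp_cypher)]

# Fixed node order, so every node's hop list lines up the same way.
node_ids = sorted({s for s, _, _ in triples} | {t for _, t, _ in triples})
index = {nid: i for i, nid in enumerate(node_ids)}

# Row i is node i's hop counts to every node; the diagonal (hop to itself) stays 0.
hop_embeddings = np.zeros((len(node_ids), len(node_ids)))
for s, t, d in triples:
    hop_embeddings[index[s], index[t]] = d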

Hope this helps. Let us know how it goes.

Thanks,
Brian

Thanks very much for your thoughtful response, Brian.

I did try lower dimensions (2, 4, 8, 16, 32, 512, 1024) and several combinations of iteration weights. By the way, there are thousands of nodes; the commented-out list in my query was a set of different connected components.

I have since pivoted to a different approach. Since allPairsShortestPaths is ultimately what I want to embed, I just calculated that and then generated a distance matrix from it. I set the diagonal distances to 0 and set all remaining NaNs (representing paths that do not exist) to 10^6. Then I used sklearn.manifold.MDS (multi-dimensional scaling) to reduce the dimensionality from NxN to NxM, so the result is a 1xM vector for each node in the M-dimensional embedding space. The Pearson correlation this has with the hop distance matrix is > 0.999, which is likely better than I would get from a graph embedding approach. There is one pocket of distances that correlate poorly (hop distance 1, but ~10^5 embedding distance) that persists even when I change the parameters. I'm interested in better understanding this artifact, but this approach looks like it will work perfectly for the vast majority of nodes.
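
In sketch form (hop_dist here is the NxN hop-count matrix assembled from allPairsShortestPaths, with NaN where no path exists; the embedding dimension M = 32 is just an example):

import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

D = hop_dist.copy()
np.fill_diagonal(D, 0.0)                    # distance of each node to itself = 0
D = np.nan_to_num(D, nan=1e6)               # non-existent paths pushed far away

# Metric MDS on the precomputed dissimilarities: NxN -> NxM, one 1xM vector per node.
mds = MDS(n_components=32, dissimilarity='precomputed', random_state=42)
X = mds.fit_transform(D)

# Pearson correlation between hop distances and embedding distances over all pairs.
r, _ = pearsonr(D[np.triu_indices_from(D, k=1)], pdist(X))
print(f'Pearson r: {r:.4f}')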

While I have this working on the CORA dataset, I have yet to scale it up to a graph as large as what I need in production, and I am still unsure how it will perform or whether I will find workarounds for any bottlenecks. Prior experience suggests that even for very large graphs the weakly connected components can at least be computed quickly, which may make this more manageable: the approach (allPairsShortestPaths and then MDS embedding) could be run on a component-by-component basis and the results patched together afterwards. I'm hoping to get to the point where I can experiment with that soon, but I also need to set up other parts of the project (namely, HNSW hybrid search in Vespa for the embedding vectors in combination with text vectors).
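
The per-component idea, in rough Python (component_ids would come from a WCC run; hop_distance_matrix and embed_with_mds are hypothetical helpers wrapping the APSP and MDS steps above):

# Hypothetical per-component loop: embed each weakly connected component separately,
# then keep vectors keyed by node id so the pieces can be patched together afterwards.
embeddings = {}
for comp_id in component_ids:
    ids, D = hop_distance_matrix(comp_id)   # APSP restricted to this component (assumed helper)
    X = embed_with_mds(D)                   # the MDS step sketched above (assumed helper)
    embeddings.update(dict(zip(ids, X)))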

Thanks again,
Jack

Hi Jack,

It makes sense that using the APSP hop counts directly as an 'embedding'/'score' works well; our APSP implementation should scale to very large graphs. Do let us know how it goes.

Good luck,
Brian