FastRP Settings in link prediction pipelines

Hello guys,
I am trying to create a link prediction pipeline using the GDS library. I have seen an example in Neo4j Academy's course "Neo4j Graph Data Science Fundamentals". It uses FastRP for node embeddings.

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP', {
    mutateProperty: 'embedding',
    embeddingDimension: 128,
    randomSeed: 7474
}) YIELD nodePropertySteps;

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'degree', {
    mutateProperty: 'degree'
}) YIELD nodePropertySteps;

Can someone explain what randomSeed is and why it's set to 7474? In most of the examples I've seen so far the random seed is set to 42 for some reason. Also, are there any tips for embeddingDimension? I have read that it should be a power of 2. Neo4j's documentation states that the higher the embedding dimension, the more accurate the results. What I'd like to know is: what if my dataset has 500 nodes, and what if it has millions of nodes? What embedding dimension should I set?
Final question: why do I need degree centrality?
Thanks in advance.

Hi @xar.zax ,

  • The random seed is used to get the same, deterministic result each time the algorithm is run. The actual number is not important (7474 is likely just a nod to Neo4j's default HTTP port, and 42 is simply a common arbitrary choice). If that kind of determinism matters to you, set a seed; otherwise you can leave it out.
  • Embedding dimension is trickier. This is something you may want to tune (test multiple values and choose the best), but a good starting point is 256, even if the graph is very big. The reason is that FastRP is quite local, in the sense that it only looks as many hops out as you specify with iterationWeights. If your graph is very small though, like 500 nodes, you can probably get away with less, and if you have a large graph and memory is scarce you may want to start with something lower than 256. It doesn't really have to be a power of two. Theoretically, a higher dimension gives you more topological information, but at some point it stops helping. Keep in mind that the FastRP output dimension will (most likely) be the input dimension of your downstream ML model, so it will impact time and memory consumption there too. See the sketch after this list for one way to compare a few dimensions.
    Here's a more rigorous treatment of node embedding dimension selection if you're interested: Principled approach to the selection of the embedding dimension of networks | Nature Communications
  • You may not need degree centrality. It's a way of passing each node's degree (a simple measure of node importance) to the downstream ML model, but for your use case it may not help much. You'll need to test it and see what results you get.
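
If you want to compare embedding dimensions empirically, here's a minimal sketch of a full pipeline run. It assumes GDS 2.x, a projected graph called 'friends', and a 'KNOWS' relationship type to predict; the pipeline and model names are made up, and the exact train parameters (e.g. targetRelationshipType) vary between GDS versions, so check the docs for yours. Re-run it with a few embeddingDimension values and compare the test scores.

CALL gds.beta.pipeline.linkPrediction.create('pipe_dim_test');

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe_dim_test', 'fastRP', {
    mutateProperty: 'embedding',
    embeddingDimension: 64,  // try e.g. 64, 128, 256 in separate runs
    randomSeed: 42           // any fixed value just makes the runs reproducible
});

// Turn pairs of node embeddings into link features for the classifier.
// You could also add 'degree' to nodeProperties to test whether it helps.
CALL gds.beta.pipeline.linkPrediction.addFeature('pipe_dim_test', 'hadamard', {
    nodeProperties: ['embedding']
});

CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe_dim_test');

// Train and compare the held-out test score (AUCPR) across the dimensions you tried
CALL gds.beta.pipeline.linkPrediction.train('friends', {
    pipeline: 'pipe_dim_test',
    modelName: 'lp_model_dim_test',
    targetRelationshipType: 'KNOWS',
    randomSeed: 42
}) YIELD modelInfo
RETURN modelInfo.metrics.AUCPR.test AS testScore;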

Hope this is helpful,
Adam