FastRP Settings in link prediction pipelines

Hello guys,
I am trying to create a link prediction pipeline using the GDS library. I have seen an example in Neo4j Academy's course "Neo4j Graph Data Science Fundamentals". It uses FastRP for node embeddings.

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'fastRP', {
    mutateProperty: 'embedding',
    embeddingDimension: 128,
    randomSeed: 7474
}) YIELD nodePropertySteps;

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe', 'degree', {
    mutateProperty: 'degree'
}) YIELD nodePropertySteps;

Can someone explain what randomSeed is and why it's set to 7474? In most of the examples I've seen so far the random seed is set to 42 for some reason. Also, are there any tips for embeddingDimension? I have read that it should be a power of 2. Neo4j's documentation states that the higher the embedding dimension, the more accurate the results. What I'd like to know is: what if my dataset has 500 nodes, and what if it has millions of nodes? What embedding dimension should I set?
Final question: why do I need degree centrality?
Thanks in advance.

Hi @xar.zax ,

  • The random seed is used to get the same, deterministic result each time the algorithm is run. The actual number is not important (7474 is likely just a nod to Neo4j's default HTTP port, and 42 is simply a common arbitrary choice). If that kind of determinism matters to you, set a seed; otherwise you can leave it out.
  • Embedding dimension is trickier. This is something you may want to tune (test multiple values and choose the best), but a good starting point is 256, even if the graph is very big. The reason is that FastRP is quite local, in the sense that it only looks as many hops out as you specify with iterationWeights. If your graph is very small though, like 500 nodes, you can probably get away with less, and if you have a large graph and memory is scarce you may want to start with something lower than 256. It doesn't really have to be a power of two. Theoretically, a higher dimension gives you more topological information, but at some point it stops helping. Keep in mind that the FastRP output dimension will (most likely) be the input dimension of your downstream ML model, so it will impact time and memory consumption there too. See the sketch after this list for one way to compare a few dimensions.
    Here's a more rigorous treatment of node embedding dimension selection if you're interested: Principled approach to the selection of the embedding dimension of networks | Nature Communications
  • You may not need degree centrality. It's a way of passing each node's degree (a simple measure of node importance) to the downstream ML model, but for your use case it may not help much. You'll need to test it and see what results you get.
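
If you want to compare embedding dimensions empirically, here's a minimal sketch of a full pipeline run. It assumes GDS 2.x, a projected graph called 'friends', and a 'KNOWS' relationship type to predict; the pipeline and model names are made up, and the exact train parameters (e.g. targetRelationshipType) vary between GDS versions, so check the docs for yours. Re-run it with a few embeddingDimension values and compare the test scores.

CALL gds.beta.pipeline.linkPrediction.create('pipe_dim_test');

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('pipe_dim_test', 'fastRP', {
    mutateProperty: 'embedding',
    embeddingDimension: 64,  // try e.g. 64, 128, 256 in separate runs
    randomSeed: 42           // any fixed value just makes the runs reproducible
});

// Turn pairs of node embeddings into link features for the classifier.
// You could also add 'degree' to nodeProperties to test whether it helps.
CALL gds.beta.pipeline.linkPrediction.addFeature('pipe_dim_test', 'hadamard', {
    nodeProperties: ['embedding']
});

CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('pipe_dim_test');

// Train and compare the held-out test score (AUCPR) across the dimensions you tried
CALL gds.beta.pipeline.linkPrediction.train('friends', {
    pipeline: 'pipe_dim_test',
    modelName: 'lp_model_dim_test',
    targetRelationshipType: 'KNOWS',
    randomSeed: 42
}) YIELD modelInfo
RETURN modelInfo.metrics.AUCPR.test AS testScore;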

Hope this is helpful,
Adam