Link prediction predicting wrong links

dg_22 · June 26, 2022, 2:40am

Hi,

I am using the link prediction pipeline to predict links between two node labels using their corresponding relationships.
https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/predict/

I have specified these labels/relationships in the graph projection and also set them as nodeLabels and relationshipTypes when calling gds.beta.pipeline.linkPrediction.train and gds.beta.pipeline.linkPrediction.predict.stream. However, when streaming and yielding node1, node2, and probability, it returns links between nodes of the same label. How do I specify that I want to predict relationships only between two different labels, using only specific relationship types? Please let me know if further information is needed. Thank you!

alicia_frame1 · June 28, 2022, 12:15pm

Can you post your code, and the schema of the graph?

bratanic_tomaz · June 28, 2022, 10:42am

Can you please include more code and how you use nodeLabels and relationshipTypes parameters?

dg_22 · June 28, 2022, 11:10pm

Sure. As for the graph schema, I am creating a projection from a subset of a much larger knowledge graph, selecting two node labels (A, B) and the two corresponding relationship types that I am interested in predicting. Please let me know if you need any further clarification or details on this.

As for the code, below is an example of the type of link prediction pipeline I am running. In this initial setup I have two types of labels and two possible sets of relationships that link them. In the training and prediction steps I thought I was specifying that I am only interested in links predicted between these two sets of labels, but what I realized is that links are being predicted between every combination of node labels (so A-A, A-B, B-B). Right now this may be more of an inconvenience, as I could always delete the same-label relationships I don't want.

But in the future I want to include many types of labels that link to A and B, as well as all of the corresponding relationship types, in my projection in order to incorporate them into the node embeddings. I still want to train and predict only on potential A-B links, because otherwise it would be computationally too expensive and inconvenient, so I would like to know how to specify that. Thank you, and please let me know if anything else is needed.

// Create the pipeline
CALL gds.beta.pipeline.linkPrediction.create('lp-pipeline');

// Node property step: FastRP embeddings
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline', 'fastRP', {
  mutateProperty: 'embedding',
  embeddingDimension: 256,
  randomSeed: 42
});

// Link feature: cosine similarity of the two node embeddings
CALL gds.beta.pipeline.linkPrediction.addFeature('lp-pipeline', 'cosine', {
  nodeProperties: ['embedding']
}) YIELD featureSteps;

// Train/test split configuration
CALL gds.beta.pipeline.linkPrediction.configureSplit('lp-pipeline', {
  testFraction: 0.3,
  trainFraction: 0.6,
  validationFolds: 7
})
YIELD splitConfig;

// Project the two labels and the two relationship types (undirected)
CALL gds.graph.project(
  'a_b_graph',
  {
    LabelA: {label: 'LabelA'},
    LabelB: {label: 'LabelB'}
  },
  {
    // keys containing a hyphen must be quoted with backticks
    `rel1_labelA-labelB`: {
      type: 'rel1 label A - label B association',
      orientation: 'UNDIRECTED'
    },
    `rel2_labelA-labelB`: {
      type: 'rel2 label A - label B association',
      orientation: 'UNDIRECTED'
    }
  }
);

// Model candidate: logistic regression with default parameters
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('lp-pipeline')
YIELD parameterSpace;

// Train on the projected graph
CALL gds.beta.pipeline.linkPrediction.train('a_b_graph', {
  pipeline: 'lp-pipeline',
  modelName: 'lp-model',
  nodeLabels: ['LabelA', 'LabelB'],
  relationshipTypes: ['rel1_labelA-labelB', 'rel2_labelA-labelB'],
  randomSeed: 42
})
YIELD modelInfo
RETURN modelInfo.bestParameters AS winningModel, modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore, modelInfo.metrics.AUCPR.test AS testGraphScore;

// Stream predictions
CALL gds.beta.pipeline.linkPrediction.predict.stream('a_b_graph', {
  modelName: 'lp-model',
  topN: 10,
  threshold: 0.45,
  nodeLabels: ['LabelA', 'LabelB'],
  relationshipTypes: ['rel1_labelA-labelB', 'rel2_labelA-labelB']
})
YIELD node1, node2, probability
RETURN gds.util.asNode(node1).name AS Label_A, gds.util.asNode(node2).name AS Label_B, probability
ORDER BY probability DESC, Label_A;

// Approximate prediction, written back to the in-memory graph
CALL gds.beta.pipeline.linkPrediction.predict.mutate('a_b_graph', {
  modelName: 'lp-model',
  relationshipTypes: ['rel1_labelA-labelB', 'rel2_labelA-labelB'],
  mutateRelationshipType: 'LINKS_APPROX_PREDICTED',
  sampleRate: 0.5,
  topK: 1,
  randomJoins: 2,
  maxIterations: 3,
  concurrency: 1,
  randomSeed: 42
})
YIELD relationshipsWritten, samplingStats;

dg_22 · July 4, 2022, 8:39pm

Sure. The sample code is the same link prediction pipeline I posted above, where I am trying to predict links between two labels (A and B). In both the training and prediction steps I specify that I am only interested in labels A and B and the relationships between them ('rel1_labelA-labelB', 'rel2_labelA-labelB'). However, when predicting I get predictions for same-label relationships (A-A, B-B).

When creating a node embedding, for instance, I want to include all the relationships for that node, but when training/predicting I want to specify that the model should only consider relationships between A and B. Please let me know if you need any other information. Thank you!

florentin_dorre · July 14, 2022, 8:38am

Hello @dg_22,
at the moment you cannot prevent the model from predicting A--A links.
So as you already pointed out, you can filter the current predictions for A--B links for now.
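
A minimal sketch of that filtering step, assuming the graph and model names used earlier in the thread ('a_b_graph', 'lp-model') and that the nodes carry a `name` property. Because `topN` is applied before the label filter, it is raised here and the final result is trimmed with `LIMIT`; the value 100 is an arbitrary placeholder.

```cypher
CALL gds.beta.pipeline.linkPrediction.predict.stream('a_b_graph', {
  modelName: 'lp-model',
  topN: 100,          // over-fetch, since same-label pairs are discarded below
  threshold: 0.45
})
YIELD node1, node2, probability
WITH gds.util.asNode(node1) AS n1, gds.util.asNode(node2) AS n2, probability
// keep only pairs where one node is LabelA and the other is LabelB
WHERE (n1:LabelA AND n2:LabelB) OR (n1:LabelB AND n2:LabelA)
RETURN n1.name AS node1Name, n2.name AS node2Name, probability
ORDER BY probability DESC
LIMIT 10;
```

This only filters the output; the model is still trained and scored on all label combinations.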

But, we are working on a feature that will allow you to specify between which kind of relationship type and which source/target label you want to train your model for.
Also there will be a `SAME_CATEGORY` link feature combiner to help with using categorical features.

So stay tuned for the next GDS version.
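
For readers on a newer release: to my understanding, GDS 2.2 and later expose this as `sourceNodeLabel`, `targetNodeLabel`, and `targetRelationshipType` on the train step. The sketch below assumes such a version and reuses the placeholder names from this thread; 'lp-model-ab' is a hypothetical model name, and the parameter names should be checked against the documentation for the installed version.

```cypher
// Assumes GDS 2.2+; not valid on the version discussed in this thread.
CALL gds.beta.pipeline.linkPrediction.train('a_b_graph', {
  pipeline: 'lp-pipeline',
  modelName: 'lp-model-ab',                    // hypothetical name
  sourceNodeLabel: 'LabelA',                   // predicted links start at LabelA nodes
  targetNodeLabel: 'LabelB',                   // and end at LabelB nodes
  targetRelationshipType: 'rel1_labelA-labelB',// a single relationship type to learn
  randomSeed: 42
})
YIELD modelInfo
RETURN modelInfo.bestParameters AS winningModel;

// Prediction is then restricted to the same source/target labels.
CALL gds.beta.pipeline.linkPrediction.predict.stream('a_b_graph', {
  modelName: 'lp-model-ab',
  sourceNodeLabel: 'LabelA',
  targetNodeLabel: 'LabelB',
  topN: 10,
  threshold: 0.45
})
YIELD node1, node2, probability
RETURN gds.util.asNode(node1).name AS Label_A, gds.util.asNode(node2).name AS Label_B, probability
ORDER BY probability DESC;
```

Note that this trains for one target relationship type at a time, so the two relationship types in the original example would need separate models or a merged type.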

dg_22 · July 15, 2022, 5:10pm

Hi, thanks for letting me know. So, just to confirm: the training metrics I receive are based on predicting all types of relationships between the two labels I have provided, right? In my case, since all the provided links are between A and B, those will be the positive samples, and the negative samples could be A-B, B-B, or A-A, as long as the link doesn't already exist. So if we cannot specify the links we are interested in, what is the purpose of the relationshipTypes parameter in both the train and predict steps of the pipeline?

dg_22 · July 18, 2022, 2:14am

Never mind. Based on my understanding, I believe those are just used to filter the positive examples for both training and predicting. Thanks again for the information, and I look forward to the new features!
