link prediction predicting wrong links

dg_22
Node Link

Hi, 

I am using the link prediction pipeline to predict links between nodes with two different labels, based on the relationships that connect them:
https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/predict/

I specified these labels and relationship types in the graph projection, and also passed them as nodeLabels and relationshipTypes to gds.beta.pipeline.linkPrediction.train and gds.beta.pipeline.linkPrediction.predict.stream. However, when I stream the results and yield node1, node2, and probability, the output includes links between nodes that share the same label. How do I specify that I only want to predict relationships between nodes of two different labels, using specific relationship types? Please let me know if further information is needed. Thank you!

1 ACCEPTED SOLUTION

Hello @dg_22 ,
at the moment you cannot prevent the model from predicting A--A links.
So, as you already pointed out, for now you can filter the predictions down to the A--B links.

However, we are working on a feature that will let you specify the relationship type and the source and target node labels you want to train the model for.
There will also be a `SAME_CATEGORY` link feature combiner to help with categorical features.

So stay tuned for the next GDS version 🙂
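
The filtering workaround mentioned above could look roughly like the following. This is only a sketch, not an official recipe: the graph name, model name, labels, and the `name` property are taken from the example posted in this thread, and the `topN` value of 1000 is an arbitrary assumption. Because same-label pairs are dropped after prediction, it requests more candidates than needed and trims the result with `LIMIT`.

```cypher
// Sketch: stream predictions, then keep only pairs whose endpoints
// carry different labels (one LabelA node and one LabelB node).
CALL gds.beta.pipeline.linkPrediction.predict.stream('a_b_graph', {
  modelName: 'lp-model',
  topN: 1000,       // assumed value; over-fetch because A--A and B--B pairs are discarded below
  threshold: 0.45
})
YIELD node1, node2, probability
WITH gds.util.asNode(node1) AS n1, gds.util.asNode(node2) AS n2, probability
WHERE ('LabelA' IN labels(n1) AND 'LabelB' IN labels(n2))
   OR ('LabelB' IN labels(n1) AND 'LabelA' IN labels(n2))
RETURN
  // Put the LabelA endpoint in the first column regardless of pair order.
  CASE WHEN 'LabelA' IN labels(n1) THEN n1.name ELSE n2.name END AS Label_A,
  CASE WHEN 'LabelA' IN labels(n1) THEN n2.name ELSE n1.name END AS Label_B,
  probability
ORDER BY probability DESC, Label_A
LIMIT 10;
```

Note that this filters only the output. The model is still trained and evaluated on candidate pairs of every label combination, so the reported metrics are not specific to A--B links.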

8 REPLIES

Can you please include more code and how you use nodeLabels and relationshipTypes parameters?

 

Sure, below is some sample code in which I create a link prediction pipeline and try to predict links between two labels (A and B). As you can see, in both the training and prediction steps I specify that I am only interested in labels A and B and the relationships between them ('rel1_labelA-labelB', 'rel2_labelA-labelB'). However, the predictions also include same-label pairs (A - A, B - B).

When creating a node embedding, for instance, I want to include all of a node's relationships, but when training and predicting I want the model to consider only relationships between A and B. Please let me know if you need any other information. Thank you!

 

// Create the pipeline
CALL gds.beta.pipeline.linkPrediction.create('lp-pipeline');

// Node property step: FastRP embedding
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline', 'fastRP', {
  mutateProperty: 'embedding',
  embeddingDimension: 256,
  randomSeed: 42
});

// Link feature: cosine similarity of the two embeddings
CALL gds.beta.pipeline.linkPrediction.addFeature('lp-pipeline', 'cosine', {
  nodeProperties: ['embedding']
}) YIELD featureSteps;

// Train/test split
CALL gds.beta.pipeline.linkPrediction.configureSplit('lp-pipeline', {
  testFraction: 0.3,
  trainFraction: 0.6,
  validationFolds: 7
}) YIELD splitConfig;

// Project the two labels and the two relationship types between them.
// Keys that contain a hyphen must be quoted with backticks.
CALL gds.graph.project(
  'a_b_graph',
  {
    LabelA: {label: 'LabelA'},
    LabelB: {label: 'LabelB'}
  },
  {
    `rel1_labelA-labelB`: {
      type: 'rel1 label A - label B association',
      orientation: 'UNDIRECTED'
    },
    `rel2_labelA-labelB`: {
      type: 'rel2 label A - label B association',
      orientation: 'UNDIRECTED'
    }
  }
);

// Model candidate
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('lp-pipeline')
YIELD parameterSpace;

// Train
CALL gds.beta.pipeline.linkPrediction.train('a_b_graph', {
  pipeline: 'lp-pipeline',
  modelName: 'lp-model',
  nodeLabels: ['LabelA', 'LabelB'],
  relationshipTypes: ['rel1_labelA-labelB', 'rel2_labelA-labelB'],
  randomSeed: 42
})
YIELD modelInfo
RETURN modelInfo.bestParameters AS winningModel, modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore, modelInfo.metrics.AUCPR.test AS testGraphScore;

// Stream predictions
CALL gds.beta.pipeline.linkPrediction.predict.stream('a_b_graph', {
  modelName: 'lp-model',
  topN: 10,
  threshold: 0.45,
  nodeLabels: ['LabelA', 'LabelB'],
  relationshipTypes: ['rel1_labelA-labelB', 'rel2_labelA-labelB']
})
YIELD node1, node2, probability
RETURN gds.util.asNode(node1).name AS Label_A, gds.util.asNode(node2).name AS Label_B, probability
ORDER BY probability DESC, Label_A;

// Approximate prediction, written back to the projected graph
CALL gds.beta.pipeline.linkPrediction.predict.mutate('a_b_graph', {
  modelName: 'lp-model',
  relationshipTypes: ['rel1_labelA-labelB', 'rel2_labelA-labelB'],
  mutateRelationshipType: 'LINKS_APPROX_PREDICTED',
  sampleRate: 0.5,
  topK: 1,
  randomJoins: 2,
  maxIterations: 3,
  concurrency: 1,
  randomSeed: 42
})
YIELD relationshipsWritten, samplingStats;

Can you post your code, and the schema of the graph? 

Sure. As for the graph schema, I am creating a projection from a subset of a much larger knowledge graph, selecting two node labels (A, B) and the two relationship types between them that I am interested in predicting. Please let me know if you need any further clarification or details.

As for the code, the link prediction pipeline I am running is the one in my reply above. In this initial setup I have two node labels and two relationship types that link them. In the training and prediction steps I thought I was specifying that I am only interested in links predicted between these two labels, but I realized that links are being predicted between every combination of node labels (A-A, A-B, B-B). Right now this may be more of an inconvenience, as I could always delete the same-label relationships I don't want.

In the future, however, I want to include many other labels that link to A and B, along with all of their relationship types, in my projection so that they are incorporated into the node embeddings. I still want to train and predict only on potential A-B links, because otherwise it would be too computationally expensive and inconvenient, so I would like to know how to specify that. Thank you, and please let me know if anything else is needed.

Hi, thanks for letting me know. Just to confirm: the training metrics I receive are based on predicting relationships between any nodes with the two labels I provided, right? In my case, since all the existing links are A-B, those are the positive samples, while the negative samples could be A-B, B-B, or A-A pairs, as long as the link doesn't already exist. So if we cannot specify the links we are interested in, what is the purpose of the relationshipTypes parameter in both the train and predict steps of the pipeline?

Never mind, based on my understanding I believe those are just used to filter the positive examples for both training and prediction. Thanks again for the information, and I look forward to the new features!