Sure. For the graph schema, I am creating a projection from a subset of a much larger knowledge graph, selecting two node labels (A and B) and the two relationship types between them that I am interested in predicting. Please let me know if you need any further details.
As for the code, below is an example of the link prediction pipeline I am running. In this initial setup I have two node labels and two relationship types that link them. In the training and prediction steps I thought I was specifying that I only want links predicted between these two labels, but I realized that links are being predicted between every combination of node labels (A-A, A-B, and B-B). Right now this may be more of an inconvenience than a problem, since I could always delete the same-label relationships I don't want.
In the future, however, I want to include many other node labels that link to A and B, along with their relationship types, in my projection so that they are incorporated into the node embeddings. I still want to train and predict only on potential A-B links, because anything more would be too computationally expensive and inconvenient, so I would like to know how to specify that. Thank you, and please let me know if anything else is needed.
CALL gds.beta.pipeline.linkPrediction.create('lp-pipeline');
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline', 'fastRP', {
  mutateProperty: 'embedding',
  embeddingDimension: 256,
  randomSeed: 42
});
CALL gds.beta.pipeline.linkPrediction.addFeature('lp-pipeline', 'cosine', {
  nodeProperties: ['embedding']
}) YIELD featureSteps;
CALL gds.beta.pipeline.linkPrediction.configureSplit('lp-pipeline', {
  testFraction: 0.3,
  trainFraction: 0.6,
  validationFolds: 7
})
YIELD splitConfig;
CALL gds.graph.project(
  'a_b_graph',
  {
    LabelA: {
      label: 'LabelA'
    },
    LabelB: {
      label: 'LabelB'
    }
  }, {
    // Keys containing a hyphen must be backtick-quoted; they match the names used in train/predict below
    `rel1_labelA-labelB`: {
      type: 'rel1 label A - label B association',
      orientation: 'UNDIRECTED'
    },
    `rel2_labelA-labelB`: {
      type: 'rel2 label A - label B association',
      orientation: 'UNDIRECTED'
    }
  });
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('lp-pipeline')
YIELD parameterSpace;
CALL gds.beta.pipeline.linkPrediction.train('a_b_graph', {
  pipeline: 'lp-pipeline',
  modelName: 'lp-model2',
  nodeLabels: ['LabelA', 'LabelB'],
  relationshipTypes: ['rel1_labelA-labelB', 'rel2_labelA-labelB'],
  randomSeed: 42
})
YIELD modelInfo
RETURN modelInfo.bestParameters AS winningModel,
       modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore,
       modelInfo.metrics.AUCPR.test AS testGraphScore;
// Stream predictions
CALL gds.beta.pipeline.linkPrediction.predict.stream('a_b_graph', {
  modelName: 'lp-model2',
  topN: 10,
  threshold: 0.45,
  nodeLabels: ['LabelA', 'LabelB'],
  relationshipTypes: ['rel1_labelA-labelB', 'rel2_labelA-labelB']
})
YIELD node1, node2, probability
RETURN gds.util.asNode(node1).name AS Label_A, gds.util.asNode(node2).name AS Label_B, probability
ORDER BY probability DESC, Label_A;
// Approximate predictions written back to the in-memory graph
CALL gds.beta.pipeline.linkPrediction.predict.mutate('a_b_graph', {
  modelName: 'lp-model2',
  relationshipTypes: ['rel1_labelA-labelB', 'rel2_labelA-labelB'],
  mutateRelationshipType: 'LINKS_APPROX_PREDICTED',
  sampleRate: 0.5,
  topK: 1,
  randomJoins: 2,
  maxIterations: 3,
  concurrency: 1,
  randomSeed: 42
})
YIELD relationshipsWritten, samplingStats;
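
For the stopgap I mentioned above (dropping the same-label predictions after the fact), something like the following is what I have in mind. It is only a rough sketch: it assumes the 'a_b_graph' projection and the 'lp-model2' model from the steps above, and the larger topN of 100 is an arbitrary value I picked so that enough A-B pairs survive the filter. It does not avoid the cost of scoring A-A and B-B pairs, which is the part I actually want to solve.

```cypher
// Stream more candidates than needed, then keep only pairs with one LabelA and one LabelB node.
CALL gds.beta.pipeline.linkPrediction.predict.stream('a_b_graph', {
  modelName: 'lp-model2',
  topN: 100,        // assumed value, larger than the 10 rows wanted after filtering
  threshold: 0.45
})
YIELD node1, node2, probability
WITH gds.util.asNode(node1) AS n1, gds.util.asNode(node2) AS n2, probability
// Accept the pair in either order, since the projection is UNDIRECTED
WHERE (n1:LabelA AND n2:LabelB) OR (n1:LabelB AND n2:LabelA)
RETURN n1.name AS node_1, n2.name AS node_2, probability
ORDER BY probability DESC
LIMIT 10;
```

Because the filter runs after topN is applied, fewer than 10 rows may come back if most of the top candidates are same-label pairs.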