Node Classification with two node labels

Hello,

I am trying to run a node classification on a fraud dataset.
The relevant properties are split between two node labels: Customer [age, gender, fastRP] and Transaction [amount, fraud]. When I run the training call shown below, I get this error:

The feature properties ['age_group', 'amount', 'fastrp_embedding', 'gender_group'] are not present for all requested labels. Requested labels: ['Customer', 'Transaction']. Properties available on all requested labels: ['']

CALL gds.alpha.ml.nodeClassification.train('fraud_model_data', {
   nodeLabels: ['Transaction','Customer'],
   modelName: 'fraud-model-properties',
   featureProperties: ['age_group', 'fastrp_embedding', 'gender_group','amount'], 
   targetProperty: 'fraud',
   metrics: ['F1_WEIGHTED','ACCURACY'],
   holdoutFraction: 0.2,
   validationFolds: 5,
   randomSeed: 2,
   params: [
       {penalty: 0.0625, maxIterations: 1000},
       {penalty: 0.125, maxIterations: 1000},
       {penalty: 0.25, maxIterations: 1000},
       {penalty: 0.5, maxIterations: 1000}
       ]
    }) YIELD modelInfo

If I only select a single nodeLabel ('Transaction' or 'Customer'), I am able to see the properties of the selected node but not the properties from the other node.

This is the code to create the in-memory graph:

CALL gds.graph.create(
    'fraud_model_data', {
        Customer: { 
            label: 'Customer',
            properties: {
                fastrp_embedding:{property:'fastRPExtended-embedding', defaultValue:0},
                gender_group:{property:'gender_group', defaultValue:0},
                age_group:{property:'age_group', defaultValue:0}
            }
         },
        Transaction: { 
            label: 'Transaction',
            properties: {
                fraud:{property:'fraud', defaultValue:0},
                amount:{property:'amount', defaultValue:0},
                category_group:{property:'category_group', defaultValue:0}
            }
        },
        Bank: { 
            label: 'Bank',
            properties: {
            }
        }
    },
    '*'
)
YIELD graphName, nodeCount, relationshipCount;

Do you have any solution for this problem? Thank you very much!

You'll need to either:

  • create a mono-partite projection (so you only have customers) using a Cypher Projection or collapse path, or
  • pad the missing properties with default values when you load the graph (so Bank nodes have a fraud property but it's always 0, for example).
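
As a rough sketch of the first option, a Cypher projection could keep only the Transaction nodes (rather than the customers, since fraud lives on the transaction) and copy the owning customer's properties onto each one, so that every feature and the target sit on a single label. Several things here are assumptions, not taken from your data model: the graph name fraud_model_mono is made up, the untyped -- patterns stand in for whatever relationship types you actually have, each transaction is assumed to connect to exactly one customer, and none of the projected properties are assumed to be null. Whether the list-valued embedding can be projected this way depends on your GDS version.

```cypher
CALL gds.graph.create.cypher(
    'fraud_model_mono',  // hypothetical graph name
    // Node query: one row per transaction, carrying its customer's properties.
    'MATCH (c:Customer)--(t:Transaction)
     RETURN id(t) AS id,
            labels(t) AS labels,
            t.amount AS amount,
            t.fraud AS fraud,
            c.age_group AS age_group,
            c.gender_group AS gender_group,
            c.`fastRPExtended-embedding` AS fastrp_embedding',
    // Relationship query: link transactions made by the same customer.
    'MATCH (t1:Transaction)--(:Customer)--(t2:Transaction)
     WHERE id(t1) < id(t2)
     RETURN id(t1) AS source, id(t2) AS target'
)
YIELD graphName, nodeCount, relationshipCount;
```

With a projection like this, the training call would use nodeLabels: ['Transaction'] and could list all four feature properties. Note that the relationship query grows quadratically with the number of transactions per customer, so it may need trimming on a large dataset.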

If you choose the second option, you'll likely need to post-process your predictions, because there's no easy way to tell the node classification model not to predict that banks could be fraudulent. That said, using bank nodes as part of your negative dataset, and making sure they aren't incorrectly predicted to be fraudsters, could be part of your model tuning and evaluation.
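
A minimal sketch of the second option, assuming the nodeProperties setting of the native projection is used so that the same properties and defaults are applied to every projected label. The graph name fraud_model_padded is made up, and the default of 0 simply mirrors the projection in the question. The embedding is left out on purpose: it is a list property, and I'm not certain which default GDS accepts for lists in your version, so treat that part as something to check.

```cypher
CALL gds.graph.create(
    'fraud_model_padded',          // hypothetical graph name
    ['Customer', 'Transaction'],   // the two labels requested in training
    '*',
    {
        // Applied to every label above; nodes lacking a property get the default.
        nodeProperties: {
            gender_group: {property: 'gender_group', defaultValue: 0},
            age_group: {property: 'age_group', defaultValue: 0},
            amount: {property: 'amount', defaultValue: 0},
            fraud: {property: 'fraud', defaultValue: 0}
        }
    }
)
YIELD graphName, nodeCount, relationshipCount;
```

With this projection every Customer node ends up with fraud = 0 and amount = 0, which is why the predictions for those nodes need to be filtered out or treated as negatives afterwards.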

Thanks @alicia.frame1 for the tips.

Unfortunately, I am not sure how to create a mono-partite projection, since, for example, the same customer made 5 normal transactions and 1 fraudulent transaction. In theory, I would need to replicate a customer node once per transaction and copy every attribute of the Transaction node (fraud, amount) onto that Customer node. Do I understand this correctly?

Sadly, I don't know how to implement this in Neo4j. Could you help me with this?

@alicia.frame1 do you have any advice for me?