BankSim Fraud Detection - ML Comparison

Hey everyone,
I'm currently writing my master's thesis, which compares graph ML with traditional ML.

For this purpose I am using the BankSim dataset. I want to run a logistic regression in Python and a node classification with the Graph Data Science Library. Unfortunately, in Neo4j I cannot get the accuracy above 68% (in Python I reach 98%). Since both algorithms use the same dataset, I assume the problem lies in the in-memory graph. I used the following query to create it:

//12. Create in-memory graph for predictions
CALL gds.graph.create(
    'fraud_model_data', {
        Customer: {
            label: 'Model_Data',
            properties: {
                fastrp_embedding: {property: 'fastrp-embedding', defaultValue: 0},
                did_fraud: {property: 'did_fraud', defaultValue: 0}
            }
        },
        Holdout_Customer: {
            label: 'Holdout_Data',
            properties: {
                fastrp_embedding: {property: 'fastrp-embedding', defaultValue: 0},
                did_fraud: {property: 'did_fraud', defaultValue: 0}
            }
        }
    }
)
YIELD graphName, nodeCount, relationshipCount;

I would appreciate it if someone could help me with this problem.

Best regards,

Hi @philip - I think we need a bit more information to be able to help you out.

Are you using the same features in Python and GDS? And the same datasets for your model vs the holdout?

Have you spent time tuning your embeddings, and the parameters for node classification?

The graph create command looks ok, but the difference likely comes down to input features and tuning.

Hi @alicia_frame1,

Thanks for your answer.
I do the preprocessing (including downsampling) in Python and export the result as a .csv file, so that the GDS algorithms run on the same set of data.

I create this in-memory graph and apply fastRP with the following parameters:

//04. Create in-memory graph
CALL gds.graph.create(
YIELD graphName, nodeCount, relationshipCount, createMillis;

//08. Apply fastRP
CALL gds.fastRP.write(
    embeddingDimension: 256, 
    writeProperty: 'fastrp-embedding'
YIELD nodePropertiesWritten

When I inspect the nodes afterwards, the embeddings seem to contain only zeros (0,0,0,0,...,0). I don't think that is how it should be, right?

To select the fraud customers, I am using the following code:

//09. Select fraud customers
MATCH (c:Customer)-[:MAKE]->(t:Transaction{fraud: 1})
SET c.did_fraud=1, c:Model_Data;

and to select non-fraud customers:

//10. Select non-fraud customers
MATCH (c:Customer)-[:MAKE]->(t:Transaction{fraud: 0})
SET c.did_fraud=0, c:Model_Data;

Hope this helps, otherwise I am happy to provide more information.
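One thing that may be worth checking in steps 09 and 10: a customer who has both fraudulent and legitimate transactions matches both queries, so if they run in the listed order, the second SET resets did_fraud to 0. The following is a minimal Python sketch of that effect on made-up toy data (the customer IDs and transactions are hypothetical, and the two functions only imitate the Cypher logic, they do not touch Neo4j):

```python
# Toy (customer, fraud flag) pairs standing in for
# (c:Customer)-[:MAKE]->(t:Transaction {fraud: ...}). Values are hypothetical.
transactions = [
    ("C1", 1), ("C1", 0),  # C1 has one fraudulent and one legitimate transaction
    ("C2", 0), ("C2", 0),  # C2 has only legitimate transactions
    ("C3", 1),             # C3 has only a fraudulent transaction
]


def label_in_two_passes(transactions):
    """Imitate steps 09 and 10 run in that order: the later assignment wins."""
    did_fraud = {}
    for customer, fraud in transactions:
        if fraud == 1:
            did_fraud[customer] = 1
    for customer, fraud in transactions:
        if fraud == 0:
            did_fraud[customer] = 0  # overwrites the 1 previously set for C1
    return did_fraud


def label_any_fraud(transactions):
    """Label a customer 1 if at least one of their transactions is fraudulent."""
    did_fraud = {}
    for customer, fraud in transactions:
        did_fraud[customer] = max(did_fraud.get(customer, 0), fraud)
    return did_fraud


two_pass = label_in_two_passes(transactions)
any_fraud = label_any_fraud(transactions)
print("two passes:", two_pass)
print("any fraud: ", any_fraud)
```

In this toy example C1 ends up with did_fraud = 0 under the two-pass logic but 1 under the "any fraud" logic. If the Python pipeline derives its labels differently from the graph, the two models would be trained on different targets, which could explain part of the accuracy gap.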

As a first step, you'll want to tune your embeddings so they're representative of your graph. Our docs have a great section on the different parameters you can tweak and what they mean. There are multiple ways to do this: generating embeddings with different parameter combinations and trying each one with your model to see which gets the highest score, using NEuler to visualize your results, or evaluating them with Python tooling.
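The first option, trying several parameter combinations and keeping the best-scoring one, could be organised roughly like this. This is only a sketch: the candidate values are made up, and `evaluate` is a hypothetical placeholder that returns a dummy score so the example runs on its own. In practice it would generate the embedding with that configuration, train the classifier, and return the test score.

```python
from itertools import product

# Candidate values are hypothetical. embeddingDimension, iterationWeights and
# normalizationStrength are FastRP configuration keys.
grid = {
    "embeddingDimension": [64, 128, 256],
    "iterationWeights": [[0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]],
    "normalizationStrength": [-0.5, 0.0],
}


def parameter_combinations(grid):
    """Yield one config dict per combination of the candidate values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def evaluate(config):
    """Hypothetical placeholder for: embed, train, and score on test data.

    Returns a dummy number so that the sketch is runnable without a database.
    """
    return -abs(config["embeddingDimension"] - 128) + config["normalizationStrength"]


def best_config(grid, score_fn):
    """Return the combination with the highest score."""
    return max(parameter_combinations(grid), key=score_fn)


configs = list(parameter_combinations(grid))
best = best_config(grid, evaluate)
print(len(configs), "combinations tried; best:", best)
```

Keeping the holdout set out of this loop matters: the score used to pick the configuration should come from the training/test split, with the holdout reserved for the final comparison against the Python model.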

Running embeddings with just the default values is unlikely to give you anything informative, which is what you're seeing with your all-zero embeddings.
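All-zero vectors can also be a sign that the nodes have no relationships in the projected graph, so the relationshipCount returned by the graph create call is worth a look. The sketch below illustrates why, using a heavily simplified propagation scheme in the spirit of FastRP: each node repeatedly takes the average of its neighbours' vectors, and the embedding is a weighted sum of those iterations. It is not the GDS implementation, the toy graph is made up, and Gaussian initial vectors are used purely for simplicity.

```python
import random


def toy_propagation_embedding(adjacency, dimension=4,
                              iteration_weights=(0.0, 1.0, 1.0), seed=42):
    """Simplified FastRP-style embedding (an assumption, not the GDS code)."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    # Random initial vector per node.
    current = {n: [rng.gauss(0.0, 1.0) for _ in range(dimension)] for n in nodes}
    embedding = {n: [0.0] * dimension for n in nodes}
    for weight in iteration_weights:
        averaged = {}
        for n in nodes:
            neighbours = adjacency[n]
            if neighbours:
                averaged[n] = [
                    sum(current[m][d] for m in neighbours) / len(neighbours)
                    for d in range(dimension)
                ]
            else:
                # No neighbours: nothing to average, so the vector is zero.
                averaged[n] = [0.0] * dimension
        current = averaged
        for n in nodes:
            embedding[n] = [e + weight * c for e, c in zip(embedding[n], current[n])]
    return embedding


# Hypothetical toy graph: A - B - C are connected, D is isolated.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
embeddings = toy_propagation_embedding(adjacency)
for node, vector in embeddings.items():
    print(node, vector)
```

In this sketch the isolated node D keeps an all-zero vector no matter which parameters are chosen, while A, B and C get non-zero values from their neighbours.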

You may also want to consider which embedding you use: if you have node properties (and I assume you do, if you're using the same data in a standard classifier), you can use fastRPExtended or graphSage to embed both graph topology and node properties.

You'll also want to look at tuning your ML model: settings such as penalty, batchSize, and tolerance all affect your model accuracy. We have a docs page on that as well: Training methods - Neo4j Graph Data Science

Finally, you'll want to look at the data you're using in your model. If it's too imbalanced, your results might be poor - and we don't automatically correct for imbalanced data (although you can adjust your evaluation metrics).
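To see why accuracy alone can mislead on imbalanced data, here is a small Python sketch with made-up labels (95 non-fraud, 5 fraud) and a trivial "model" that always predicts non-fraud. The numbers are illustrative assumptions, not BankSim results.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1, using 0.0 when undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced labels: 95 non-fraud customers, 5 fraud customers.
y_true = [0] * 95 + [1] * 5
# A "model" that never predicts fraud.
y_pred = [0] * 100

result = metrics(y_true, y_pred)
print(result)
```

The always-negative predictor reaches 95% accuracy while catching no fraud at all, so comparing the Python and GDS models on precision, recall, or F1 for the fraud class is likely to be more informative than comparing raw accuracy.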

If you're looking for a quick walkthrough without much tuning, you can check out my PaySim demo here: GitHub - AliciaFrame/PaySim_GDS_demo: Demo of using the graph data science library (Neo4j GDS 1.6) on a simulated fraud data set with Bloom

Hi Philip, I saw in the Neo4j documentation that the node classification algorithm uses logistic regression as its classifier. May I ask which algorithm you used for classification in Python, and how you passed the embeddings to it as input?

@alicia_frame1 According to the documentation (Fast Random Projection - Neo4j Graph Data Science), FastRP can only be used on homogeneous graphs.

In this project there are two node labels: Customer and Transaction. Can FastRP be used here? I just want to confirm this.