BankSim Fraud Detection - ML Comparison

philip · July 15, 2021, 1:10pm

Hey everyone,
I'm currently writing my master's thesis, which compares graph ML with traditional ML.

For this purpose I am using the BankSim dataset and want to run a logistic regression in Python and a node classification with the Graph Data Science library. Unfortunately, in Neo4j I cannot get above 68% accuracy (in Python I reach 98%). Since I am using the same dataset for both algorithms, I assume the problem lies with the in-memory graph. I used the following query to create it:

//12. Create in-memory graph for predictions
CALL gds.graph.create(
    'fraud_model_data', {
        Customer: { 
            label: 'Model_Data',
            properties: {
                fastrp_embedding:{property:'fastrp-embedding', defaultValue:0},
                did_fraud:{property:'did_fraud', defaultValue:0}
            }
        },
        Holdout_Customer: { 
            label: 'Holdout_Data',
            properties: {
                fastrp_embedding:{property:'fastrp-embedding', defaultValue:0},
                did_fraud:{property:'did_fraud', defaultValue:0}
            }
        }
    },
    '*'
)
YIELD graphName, nodeCount, relationshipCount;

I would appreciate it if someone could help me with this problem.

Best regards,
Philip

alicia_frame1 · July 19, 2021, 12:53pm

Hi @philip - I think we need a bit more information to be able to help you out.

Are you using the same features in Python and GDS? And the same datasets for your model vs the holdout?

Have you spent time tuning your embeddings, and the parameters for node classification?

The graph create command looks ok, but the difference likely comes down to input features and tuning.

philip · July 19, 2021, 2:14pm

Hi @alicia_frame1,

thanks for your answer.
I am doing the preprocessing (incl. downsampling) in Python and exporting the .csv for GDS, so that the algorithms are applied to the same set of data.

I create this in-memory graph and apply fastRP with the following parameters:

//04. Create in-memory graph
CALL gds.graph.create(
    'my-graph', 
    ['Customer','Transaction'],
    ['MAKE','WITH']
)
YIELD graphName, nodeCount, relationshipCount,createMillis;

//08. Apply fastRP
CALL gds.fastRP.write(
  'my-graph',
  {
    embeddingDimension: 256, 
    writeProperty: 'fastrp-embedding'
  }
)
YIELD nodePropertiesWritten

When I check the nodes, the embeddings seem to contain only zeros (0,0,0,0,...,0). I think that is not the way it should be, right?
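A quick way to quantify this from Python, assuming the embeddings have been exported as plain lists of floats (the function name and the sample vectors below are illustrative, not taken from the BankSim data):

```python
from typing import Sequence


def zero_embedding_fraction(
    embeddings: Sequence[Sequence[float]], tol: float = 1e-12
) -> float:
    """Return the share of embedding vectors whose entries are all (numerically) zero."""
    if not embeddings:
        raise ValueError("no embeddings supplied")
    zero_count = sum(
        1 for vec in embeddings if all(abs(x) <= tol for x in vec)
    )
    return zero_count / len(embeddings)


# Hypothetical sample: two all-zero vectors and one informative vector.
sample = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.12, -0.4, 0.03]]
fraction = zero_embedding_fraction(sample)
print(f"{fraction:.2f} of the vectors are all zeros")
```

If the fraction is close to 1, the classifier has effectively no input signal, regardless of how it is tuned.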

To select the fraud customers, I am using the following code:

//09. Select fraud customers
MATCH (c:Customer)-[:MAKE]->(t:Transaction{fraud: 1})
SET c.did_fraud=1, c:Model_Data;

and to select non-fraud customers:

//10. Select non-fraud customers
MATCH (c:Customer)-[:MAKE]->(t:Transaction{fraud: 0})
SET c.did_fraud=0, c:Model_Data;
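One thing to watch with these two queries: they run in sequence, so a customer who has both fraudulent and non-fraudulent transactions is first set to did_fraud=1 and then overwritten with did_fraud=0. The toy Python sketch below (made-up customer IDs and flags, not BankSim data) reproduces that ordering effect and contrasts it with an "any fraudulent transaction" rule, which is only an assumption about the intended labeling:

```python
# Toy data: customer id -> fraud flags of that customer's transactions.
# All values are hypothetical.
transactions = {
    "C1": [1, 0, 0],  # one fraudulent and two legitimate transactions
    "C2": [0, 0],     # only legitimate transactions
    "C3": [1],        # only a fraudulent transaction
}


def label_sequentially(tx_by_customer):
    """Mimic running query 09 and then query 10: the later SET wins."""
    did_fraud = {}
    for customer, flags in tx_by_customer.items():  # query 09
        if 1 in flags:
            did_fraud[customer] = 1
    for customer, flags in tx_by_customer.items():  # query 10
        if 0 in flags:
            did_fraud[customer] = 0
    return did_fraud


def label_any_fraud(tx_by_customer):
    """Assumed alternative: positive if any transaction is fraudulent."""
    return {c: int(1 in flags) for c, flags in tx_by_customer.items()}


sequential = label_sequentially(transactions)
any_fraud = label_any_fraud(transactions)
print("sequential:", sequential)
print("any fraud: ", any_fraud)
```

In this toy example the two rules disagree only on C1, the customer with mixed transactions.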

I hope this helps; otherwise I am happy to provide more information.

alicia_frame1 · July 19, 2021, 10:02pm

As a first step, you'll want to tune your embeddings so they're representative of your graph. Our docs have a great section on the different parameters you can tweak and what they mean. There are multiple ways to do this: generating embeddings with different parameter combinations and trying each one with your model to see which gets the highest score, using NEuler to visualize your results, or evaluating them with Python tooling.
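As a rough sketch of the first option, the Python snippet below enumerates FastRP configurations to try. The candidate values are arbitrary starting points rather than recommendations, and the writeProperty names are hypothetical; each resulting map would be passed as the configuration argument of a gds.fastRP.write call like the one earlier in this thread.

```python
from itertools import product

# Candidate values are arbitrary starting points, not recommendations.
embedding_dimensions = [64, 128, 256]
iteration_weights = [[0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
normalization_strengths = [-0.5, 0.0]


def fastrp_configs():
    """Build one configuration map per parameter combination."""
    configs = []
    combos = product(
        embedding_dimensions, iteration_weights, normalization_strengths
    )
    for index, (dim, weights, norm) in enumerate(combos):
        configs.append(
            {
                "embeddingDimension": dim,
                "iterationWeights": weights,
                "normalizationStrength": norm,
                # Hypothetical property name, unique per combination so
                # that the embeddings can be compared side by side.
                "writeProperty": f"fastrp_candidate_{index}",
            }
        )
    return configs


configs = fastrp_configs()
print(len(configs), "configurations to evaluate")
print(configs[0])
```

Writing each candidate to its own property keeps all embeddings available, so the same classifier can be trained once per property and the scores compared.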

Running embeddings with just default values is unlikely to provide you with anything informative - which you're seeing with your all zero embeddings.

You may also want to consider which embedding you use: if you have node properties (and I assume you do, if you're using the same data in a standard classifier), you can use fastRPExtended or GraphSAGE to embed both graph topology and node properties.

You'll also want to look at tuning your ML model: setting the penalty, batchSize, tolerance, etc. will all impact your model accuracy. We have a docs page on that as well: Training methods - Neo4j Graph Data Science

Finally, you'll want to look at the data you're using in your model. If it's too imbalanced, your results might be poor - and we don't automatically correct for imbalanced data (although you can adjust your evaluation metrics).
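To illustrate why the evaluation metric matters on imbalanced data, here is a small self-contained Python sketch. The labels are made up (98 legitimate customers, 2 fraudulent ones) and the "model" simply predicts the majority class; none of this is BankSim output.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced labels: 98 legitimate, 2 fraudulent.
y_true = [0] * 98 + [1] * 2
# A degenerate model that always predicts "not fraud".
y_pred = [0] * 100

scores = binary_metrics(y_true, y_pred)
print(scores)
```

The always-negative model scores 0.98 accuracy while finding no fraud at all (recall and F1 are 0), which is why metrics such as F1 or recall are usually more informative than accuracy here.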

If you're looking for a quick walkthrough - without much tuning - you can check out my PaySim demo here: GitHub - AliciaFrame/PaySim_GDS_demo: Demo of using the graph data science library (Neo4j GDS 1.6) on a simulated fraud data set with Bloom

mujtaba.mirza · September 20, 2021, 6:36am

Hi Philip, I saw in the Neo4j documentation that the node classification algorithm uses logistic regression as its classifier. May I ask which algorithm you used for classification in Python, and how you passed the embeddings to it as input?

lingvisa · October 8, 2021, 5:28am

@alicia_frame1 According to the documentation (Fast Random Projection - Neo4j Graph Data Science), FastRP can only be used on homogeneous graphs. In this project there are two node labels: Customer and Transaction. Can FastRP be used here? I just want to confirm this.
