Hey everyone,
I'm currently writing my master's thesis, which compares graph ML with traditional ML.
For this purpose I am using the BankSim dataset and want to run a logistic regression in Python and a node classification from the Graph Data Science library. Unfortunately, in Neo4j I cannot get an accuracy above 68% (with Python I reach 98%). Since I am using the same dataset for both algorithms, I assume the problem lies with the in-memory graph. I used the following query to create it:
Thanks for your answer.
I am doing the preprocessing (incl. downsampling) in Python and exporting the .csv for GDS, so that the algorithms are applied to the same set of data.
I am using this in-memory graph and applying fastRP with the following parameters:
If I check the nodes with the embedding, the embeddings seem to contain only zeros (0,0,0,0,...,0). I don't think that is how it should be, right?
To select the fraud customers, I am using the following code:
//09. Select fraud customers
MATCH (c:Customer)-[:MAKE]->(t:Transaction{fraud: 1})
SET c.did_fraud=1, c:Model_Data;
and to select non-fraud customers:
//10. Select non-fraud customers
MATCH (c:Customer)-[:MAKE]->(t:Transaction{fraud: 0})
SET c.did_fraud=0, c:Model_Data;
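One thing that may be worth checking about queries 09 and 10: if they run in this order and a customer has both fraudulent and non-fraudulent transactions, the second SET overwrites did_fraud=1 with did_fraud=0. The Python sketch below mimics that on made-up toy data (the customer ids and both function names are hypothetical, not taken from BankSim) and compares it with an "any fraud" rule.

```python
# Hypothetical toy data: one (customer, fraud flag) pair per transaction.
transactions = [("c1", 1), ("c1", 0), ("c2", 0), ("c3", 1)]

def label_sequentially(rows):
    """Mimic running query 09 and then query 10."""
    labels = {}
    for customer, fraud in rows:      # like query 09: mark fraud customers
        if fraud == 1:
            labels[customer] = 1
    for customer, fraud in rows:      # like query 10: mark non-fraud customers
        if fraud == 0:
            labels[customer] = 0      # overwrites an earlier 1
    return labels

def label_any_fraud(rows):
    """Label a customer 1 if at least one of their transactions is fraud."""
    labels = {}
    for customer, fraud in rows:
        labels[customer] = max(labels.get(customer, 0), fraud)
    return labels

sequential = label_sequentially(transactions)
any_fraud = label_any_fraud(transactions)
print(sequential)  # c1 ends up as 0 here
print(any_fraud)   # c1 stays 1 here
```

Whether this matters depends on how many customers in the exported .csv have both kinds of transactions.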
I hope this helps; otherwise I am happy to provide more information.
Running embeddings with just the default values is unlikely to give you anything informative, which is what you're seeing with your all-zero embeddings.
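Separately from parameter choice, one way to end up with all-zero vectors in a random-projection embedding is a node that has no relationships in the projected graph, since there are no neighbours to average over. The sketch below is a heavily simplified, single-step illustration of that effect in plain Python. It is not the GDS FastRP implementation, and the function name and toy graph are made up for the example.

```python
import random

def neighbor_average_embeddings(neighbors, dim=4, seed=42):
    """One simplified propagation step: a node's embedding is the mean
    of its neighbours' random starting vectors."""
    rng = random.Random(seed)
    # Random starting vector per node, entries drawn from {-1, 0, 1}.
    base = {n: [rng.choice((-1.0, 0.0, 1.0)) for _ in range(dim)]
            for n in neighbors}
    embeddings = {}
    for node, nbrs in neighbors.items():
        if not nbrs:
            # No neighbours to average over, so the vector stays at zero.
            embeddings[node] = [0.0] * dim
        else:
            embeddings[node] = [sum(base[m][i] for m in nbrs) / len(nbrs)
                                for i in range(dim)]
    return embeddings

# Toy graph: c1 has two transactions, c2 has no relationships at all.
graph = {"c1": ["t1", "t2"], "t1": ["c1"], "t2": ["c1"], "c2": []}
emb = neighbor_average_embeddings(graph)
isolated_is_zero = all(v == 0.0 for v in emb["c2"])
print(isolated_is_zero)
```

If every node comes back as zeros, it may be worth checking that the in-memory graph actually contains the relationships you expect.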
You may also want to consider which embedding you use: if you have node properties (and I assume you do, if you're using the same data in a standard classifier), you can use fastRPExtended or graphSage to embed both graph topology and node properties.
You'll also want to look at tuning your ML model: settings such as penalty, batchSize, and tolerance will all affect your model accuracy. We have a docs page on that as well: Training methods - Neo4j Graph Data Science
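To give a feel for what a penalty does, here is a minimal sketch in plain Python. It assumes the penalty acts like an L2 term on the weights, uses a single feature with no intercept, and runs on made-up data; `fit_logistic_1d` is a hypothetical helper, not a GDS or scikit-learn function.

```python
import math

def fit_logistic_1d(xs, ys, penalty=0.0, lr=0.1, steps=500):
    """Fit a one-feature logistic regression (no intercept) by gradient descent.

    `penalty` is treated as an L2 coefficient: it adds penalty * w to the gradient.
    """
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-w * x))  # predicted probability
            grad += (p - y) * x
        w -= lr * (grad / n + penalty * w)
    return w

# Toy, non-separable data: larger x mostly means class 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w_plain = fit_logistic_1d(xs, ys, penalty=0.0)
w_penalised = fit_logistic_1d(xs, ys, penalty=1.0)
print(w_plain, w_penalised)
```

The penalised weight ends up closer to zero than the unpenalised one, so the same data can yield quite different models depending on this one setting.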
Finally, you'll want to look at the data you're using in your model. If it's too imbalanced, your results might be poor, and we don't automatically correct for imbalanced data (although you can adjust your evaluation metrics).
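As a small illustration of why the choice of metric matters on imbalanced data, the sketch below scores a classifier that always predicts "no fraud" on a made-up set with 2% fraud. The data and helper functions are hypothetical and written in plain Python.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 2 fraud cases out of 100, and a model that never predicts fraud.
y_true = [1] * 2 + [0] * 98
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(acc, f1)  # 0.98 0.0
```

A high accuracy on its own therefore says little here, which is why comparing both models on the same metric and the same class balance matters.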
Hi Philip, in the Neo4j documentation, I saw that the node classification algorithm uses logistic regression as a classifier. May I ask which algorithm you used in Python for classification? May I also ask how you passed the embeddings as input to the algorithm?
FastRP can only be used for homogeneous graphs. In this project, there are two label types: customer and transaction. Can FastRP be used here? I just want to confirm this.