I'm working on my thesis project, where I'm developing a system that processes exports from business registries. The goal is to clean and store the data in a Neo4j database, with a strong focus on accurate entity matching and creating relationships between companies and individuals to enable fraud detection.
The Problem
I'm encountering significant performance issues when processing larger datasets. Initially, processing 500 companies took around 700 seconds. After implementing improvements such as caching, better indexing, and query optimization, I managed to cut that time in half — to around 350–400 seconds.
However, when scaling up to around 1,200–1,300 companies (~25k nodes and ~25k relationships, which is not that large), performance degrades dramatically.
What I've Tried So Far
I'm using Cypher queries combined with Neomodel for data handling and relationship creation.
I’ve already adjusted several settings in the neo4j.conf file (like memory limits and transaction settings) to try to improve performance — but the slowdown persists.
I've also improved indexing and query structure, and added caching where possible, but the scalability issue remains.
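To make the question more concrete, here is a minimal sketch of the batched UNWIND ingestion pattern I understand is generally recommended for bulk writes, as a point of comparison with per-object neomodel saves. The property names, the constraint, and the connection details are placeholders, not my real schema (assumes Neo4j 4.4+ and the official Python driver):

```python
# A minimal sketch (not my actual code): batch rows into one UNWIND + MERGE
# query per batch instead of one round trip per node. Property names and the
# constraint are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def ingest_companies(companies, batch_size=1000):
    with driver.session() as session:
        # A unique constraint keeps the MERGE lookup an index seek (Neo4j 4.4+ syntax).
        session.run(
            "CREATE CONSTRAINT company_reg IF NOT EXISTS "
            "FOR (c:Company) REQUIRE c.reg_number IS UNIQUE"
        )
        for i in range(0, len(companies), batch_size):
            batch = companies[i:i + batch_size]  # list of dicts: {"reg_number": ..., "name": ...}
            session.run(
                """
                UNWIND $rows AS row
                MERGE (c:Company {reg_number: row.reg_number})
                SET c.name = row.name
                """,
                rows=batch,
            )
```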
What I Need Help With
I'm looking for someone experienced with Neo4j who could:
Chat with me about my code and optimization approach.
Potentially help identify the bottlenecks in my code.
Guide me toward further improvements.
This is crucial because my dataset will eventually contain over 1 million companies for one country — and I have another country dataset with 600k companies coming next.
If anyone has faced similar issues or has insights into optimizing Neo4j for large-scale data ingestion, I’d be super grateful for your help!
Some performance degradation is expected. If it is linear (A companies = X time, 2A companies = 2X time), it is probably just the nature of the queries you are running over the nodes/edges.
If it grows much faster than that (superlinearly), it is either a lack of infrastructure or a problem with the queries that can be optimised.
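If it helps, a quick way to see which case you are in is to time the same pipeline on increasing slices of the data and watch how the per-company cost changes. A rough sketch, where `process_companies` stands in for whatever your ingestion entry point is:

```python
import time

def measure_scaling(companies, process_companies):
    # Time the same pipeline on growing slices; roughly constant ms/company
    # suggests linear scaling, while a climbing ms/company means the work per
    # row gets more expensive as the graph grows.
    for n in (500, 1000, 2000):
        sample = companies[:n]
        start = time.perf_counter()
        process_companies(sample)
        elapsed = time.perf_counter() - start
        print(f"{n} companies: {elapsed:.0f}s, {elapsed / n * 1000:.0f} ms per company")
```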
Hey, at first I was only using Cypher, but I struggled to build robust, bullet-proof matching logic, since the data aren't formatted very consistently and correctly matching the same entities is very important. I refactored to neomodel and was able to do what I wanted much more quickly.
I could share the main script where all the inserting and matching happens, but it's over 1,000 lines. I don't want to be rude, but would you be available for a private conversation where I could share it more clearly? Maybe I could share the function where I try to find matching persons so you can see how I'm doing it.
Well, what I meant is that sometimes a person has a name like this:
“John Smith PhD.”, which needs to be correctly matched to a Person node with the name “Smith John”.
Or things like different address formatting:
Fashion Street 123/45, London, UK
Fashion Street 45, London, UK
There has to be some logic in the matching process here, right? I'm not using any AI to parse everything into the same form beforehand. However, I am doing some pre-processing of the data: cleaning it and trying to put addresses into a uniform form like so:
Street StreetNumber, PostalCode City, Country
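Concretely, the pre-processing I mean is something along these lines. This is a simplified sketch, not my actual code, and the rule that keeps the number after the slash is just one assumption about which part of "123/45" is canonical:

```python
# Simplified address normalisation sketch, not a general parser: split on
# commas, pull the house number off the street part, and keep only the number
# after the slash so both example variants normalise to the same string.
import re

def normalize_address(raw: str, postal_code: str = "") -> str:
    parts = [p.strip() for p in raw.split(",")]
    street_part = parts[0] if parts else ""
    city = parts[1] if len(parts) > 1 else ""
    country = parts[2] if len(parts) > 2 else ""

    match = re.match(r"^(.*?)\s+(\d+(?:/\d+)?)$", street_part)
    if match:
        street = match.group(1)
        number = match.group(2).split("/")[-1]  # "123/45" -> "45"
    else:
        street, number = street_part, ""

    street_and_number = f"{street} {number}".strip()
    postal_city = f"{postal_code} {city}".strip()
    return f"{street_and_number}, {postal_city}, {country}"

# normalize_address("Fashion Street 123/45, London, UK")
# and normalize_address("Fashion Street 45, London, UK")
# both return "Fashion Street 45, London, UK".
```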
Beyond that, I tried to write robust “algorithms” for comparing slightly different nodes (roughly the kind of thing sketched below), to stay accurate, since the data won't always be in perfect shape. Do you think this is the wrong approach?
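A simplified sketch of the kind of comparison I mean (not my actual matching function; the title list and the 0.85 threshold are placeholders):

```python
import re
from difflib import SequenceMatcher

# Academic titles to strip before comparing; this list is a placeholder.
TITLES = {"phd", "mba", "msc", "bsc", "dr", "ing", "mgr", "judr"}

def normalize_name(name: str) -> str:
    tokens = re.split(r"[\s,]+", name.lower())
    tokens = [t.strip(".") for t in tokens if t and t.strip(".") not in TITLES]
    return " ".join(sorted(tokens))

def names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # Token-sorted comparison so word order does not matter; the similarity
    # ratio tolerates small spelling differences. 0.85 is an arbitrary cutoff.
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio() >= threshold

# names_match("John Smith PhD.", "Smith John") -> True (both normalise to "john smith")
```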
Do you think it would be better to process the data and match entities correctly in Postgres, for example, and then just export to CSV and import that into Neo4j? Something like that?
Is it a bad approach to do complex Cypher/neomodel queries for correct matching?
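For example, I imagine the bulk-load side of that could look roughly like this. The CSV name, properties and batch size are placeholders, the file would have to sit in Neo4j's import directory, and CALL ... IN TRANSACTIONS needs Neo4j 4.4+:

```python
from neo4j import GraphDatabase

# Loads a CSV produced by an upstream cleaning/matching step in batched
# transactions. All names here are illustrative.
LOAD_COMPANIES = """
LOAD CSV WITH HEADERS FROM 'file:///companies_clean.csv' AS row
CALL {
    WITH row
    MERGE (c:Company {reg_number: row.reg_number})
    SET c.name = row.name
} IN TRANSACTIONS OF 10000 ROWS
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # CALL ... IN TRANSACTIONS must run in an implicit (auto-commit)
    # transaction, which is what session.run() uses.
    session.run(LOAD_COMPANIES)
```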