Performance issues as database gets bigger

Hi everyone,

I'm working on my thesis project, where I'm developing a system that processes exports from business registries. The goal is to clean and store the data in a Neo4j database, with a strong focus on accurate entity matching and creating relationships between companies and individuals to enable fraud detection.

The Problem

I'm encountering significant performance issues when processing larger datasets. Initially, processing 500 companies took around 700 seconds. After implementing improvements such as caching, better indexing, and query optimization, I managed to cut that time in half — to around 350–400 seconds.

However, when scaling up to around 1,200–1,300 companies (~25k nodes and ~25k relationships — not that large), the performance drops significantly, and processing slows down dramatically.

What I've Tried So Far

  • I'm using Cypher queries combined with Neomodel for data handling and relationship creation.
  • I’ve already adjusted several settings in the neo4j.conf file (like memory limits and transaction settings) to try to improve performance — but the slowdown persists.
  • I've also improved indexing and query structure, and added caching where possible, but the scalability issue remains.
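
For context, the kind of indexing involved here is uniqueness constraints and property indexes on the keys the matching uses, so MERGE does an index seek instead of a label scan. A minimal sketch with the official Python driver -- labels and property names are simplified placeholders rather than my exact model, and the Cypher syntax is for Neo4j 4.4+:

```python
from neo4j import GraphDatabase

# Connection details are placeholders -- adjust to your instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # A uniqueness constraint also creates a backing index, so MERGE on
    # reg_number becomes an index seek instead of a full label scan.
    session.run(
        "CREATE CONSTRAINT company_reg IF NOT EXISTS "
        "FOR (c:Company) REQUIRE c.reg_number IS UNIQUE"
    )
    # Plain index on the property that person matching looks up by.
    session.run(
        "CREATE INDEX person_name IF NOT EXISTS "
        "FOR (p:Person) ON (p.normalized_name)"
    )

driver.close()
```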

What I Need Help With

I'm looking for someone experienced with Neo4j who could:

  • Chat with me about my code and optimization approach.
  • Potentially help identify the bottlenecks in my code.
  • Guide me toward further improvements.

This is crucial because my dataset will eventually contain over 1 million companies for one country — and I have another country dataset with 600k companies coming next.

If anyone has faced similar issues or has insights into optimizing Neo4j for large-scale data ingestion, I’d be super grateful for your help!

Thanks in advance! :grinning_face_with_smiling_eyes:

What is the size of your infrastructure?

Some performance degradation is expected. If it is linear (A companies = X time, 2A companies = 2X time), it is probably just the nature of the queries you are running over the nodes/edges.

If it is worse than linear (e.g. exponential), it is either a lack of infrastructure or a problem with the queries that can be optimised.

It is unlikely it will get logarithmic (faster).

Thanks for the response,

My setup is:

  • CPU: AMD Ryzen 5 5600X
  • RAM: 32GB DDR4
  • Storage: Samsung 980 PRO 1TB (SSD)
  • OS: Windows 11 Pro

It feels more exponential than linear, so I suspect it's a query or indexing issue rather than a hardware limitation.

Or you are running out of RAM and cycling between the on-disk DB, RAM, and virtual memory (which lives on the same SSD).

Have a look at your Memory and CPU when you run the query.

The red flag for me is the mention of neomodel. It shouldn't take that long to import 25k nodes and relationships.

Can you share your code and/or Cypher queries?
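
For comparison, a batched UNWIND write through the raw Python driver usually handles 25k nodes in seconds rather than minutes. A rough sketch -- the label and property names are made-up placeholders, not your schema:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def import_companies(companies, batch_size=1000):
    """companies: list of dicts like {"reg_number": "...", "name": "..."}.
    One round trip per batch instead of one per node."""
    with driver.session() as session:
        for i in range(0, len(companies), batch_size):
            session.run(
                """
                UNWIND $rows AS row
                MERGE (c:Company {reg_number: row.reg_number})
                SET c.name = row.name
                """,
                rows=companies[i:i + batch_size],
            )
```

If each neomodel `.save()` call ends up running in its own transaction, that alone can explain the gap.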

I did; it's around 75-80% used.

Hey, at first I was only using Cypher, but I struggled to build robust, bullet-proof matching logic, since the data are not reliably formatted and correctly matching the same entities is very important. I refactored to neomodel and was able to do what I wanted much more quickly.

I could share the main script where all the inserting and matching happens, but it's over 1,000 lines. I don't want to be rude, but would you be available for a private conversation where I could share it more clearly? I could maybe share the function where I try to find matching persons so you can see how I'm doing it :smiley:

the data are not reliably formatted and correctly matching the same entities is very important

That's probably another reason for your performance handicap... you should have created a uniform model beforehand.

Well, what I meant is that sometimes a person has a name like this:

“John Smith PhD.”, which needs to be correctly matched to a Person node with name “Smith John”

Or things like different address formatting:
Fashion Street 123/45, London, UK
Fashion Street 45, London, UK

There has to be some logic in the matching process here, right? I'm not using any AI to parse everything into the same form beforehand. However, I am doing some pre-processing of the data: cleaning it and trying to put addresses into a uniform form like so:
Street StreetNumber, PostalCode City, Country
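
For the names, the pre-processing is roughly this kind of thing (a simplified sketch of the idea, not my actual code -- the title list is just illustrative):

```python
import re

# Illustrative list of titles/suffixes to strip before comparing names.
TITLES = {"phd", "phd.", "mba", "dr", "dr.", "ing.", "jr", "jr."}

def normalize_person_name(raw_name: str) -> str:
    """Lowercase, drop titles, and sort the remaining tokens so that
    'John Smith PhD.' and 'Smith John' produce the same matching key."""
    tokens = re.split(r"\s+", raw_name.strip().lower())
    tokens = [t for t in tokens if t not in TITLES]
    return " ".join(sorted(tokens))

print(normalize_person_name("John Smith PhD."))  # -> "john smith"
print(normalize_person_name("Smith John"))       # -> "john smith"
```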

Beyond that, I tried to write robust “algorithms” for comparing slightly different nodes accurately, since the data won't always be in a perfect state. Do you think this is the wrong approach?

I can't account for your design - if you have to do those conversions, you have to do them.

I am just trying to explain the possible sources of performance bottlenecks (and how things can get exponentially slower).

I see.

Do you think it would be better to process the data and match entities correctly in Postgres, for example, then just export to CSV and import that into Neo4j?

Is it a bad approach to do complex Cypher/neomodel queries for the matching?

I use C# and go straight to Neo4j... but I can try to give you a few pointers:

  • Python will be somewhat less efficient than running a raw query inside the DB
  • you can do either:
    -- ETL: extract from the source -> transform to your target model -> load into Neo4j
    -- ELT: extract -> load into Neo4j -> transform there

There is no right/wrong, it mostly depends on:

  • frequency: how often you will execute things
  • readiness: how clean you need the data as it lands in your database

If you have a huge amount of data that needs structuring, but it is loaded only once, a simple approach may work best:

  • create a Python script that reads the input (if you need speed, Perl is probably faster, but you'd have to learn it)
  • the script creates a 'clean' CSV/JSON
  • the script 'chunks' the output into sections (e.g. 100K records each)
  • create a query that uses the APOC procedures to load the files in batches (see the sketch below)
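
That last step could look something like this, assuming APOC is installed and the cleaned chunks sit in Neo4j's import directory (file, label and property names are placeholders):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# apoc.periodic.iterate streams the rows and commits them in server-side
# batches, so one call loads a whole pre-cleaned chunk.
load_chunk = """
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///companies_000.csv' AS row RETURN row",
  "MERGE (c:Company {reg_number: row.reg_number}) SET c.name = row.name",
  {batchSize: 10000, parallel: false}
)
"""

with driver.session() as session:
    session.run(load_chunk)

driver.close()
```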

If you are going to have periodic loads of unstructured data, then you need a more complex pipeline to make sure nothing breaks.

If you had less data, perhaps ELT would work... but only you know :)