Performance issues as database gets bigger

Hi everyone,

I'm working on my thesis project, where I'm developing a system that processes exports from business registries. The goal is to clean and store the data in a Neo4j database, with a strong focus on accurate entity matching and creating relationships between companies and individuals to enable fraud detection.

The Problem

I'm encountering significant performance issues when processing larger datasets. Initially, processing 500 companies took around 700 seconds. After implementing improvements such as caching, better indexing, and query optimization, I managed to cut that time in half — to around 350–400 seconds.

However, when scaling up to around 1,200–1,300 companies (~25k nodes and ~25k relationships — not that large), the performance drops significantly, and processing slows down dramatically.

What I've Tried So Far

  • I'm using Cypher queries combined with Neomodel for data handling and relationship creation.
  • I’ve already adjusted several settings in the neo4j.conf file (like memory limits and transaction settings) to try to improve performance — but the slowdown persists.
  • I've also improved indexing and query structure, and added caching where possible, but the scalability issue remains.
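
For context, the kind of indexing involved here is uniqueness constraints and property indexes on the keys the matching uses, so MERGE does an index seek instead of a label scan. A minimal sketch with the official Python driver -- labels and property names are simplified placeholders rather than my exact model, and the Cypher syntax is for Neo4j 4.4+:

```python
from neo4j import GraphDatabase

# Connection details are placeholders -- adjust to your instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # A uniqueness constraint also creates a backing index, so MERGE on
    # reg_number becomes an index seek instead of a full label scan.
    session.run(
        "CREATE CONSTRAINT company_reg IF NOT EXISTS "
        "FOR (c:Company) REQUIRE c.reg_number IS UNIQUE"
    )
    # Plain index on the property that person matching looks up by.
    session.run(
        "CREATE INDEX person_name IF NOT EXISTS "
        "FOR (p:Person) ON (p.normalized_name)"
    )

driver.close()
```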

What I Need Help With

I'm looking for someone experienced with Neo4j who could:

  • Chat with me about my code and optimization approach.
  • Potentially help identify the bottlenecks in my code.
  • Guide me toward further improvements.

This is crucial because my dataset will eventually contain over 1 million companies for one country — and I have another country dataset with 600k companies coming next.

If anyone has faced similar issues or has insights into optimizing Neo4j for large-scale data ingestion, I’d be super grateful for your help!

Thanks in advance! :grinning_face_with_smiling_eyes:

What is the size of your infrastructure?

Some performance degradation is expected. If it is linear (A companies = X time, 2A companies = 2X time), it is probably just the nature of the queries you are running over the nodes/edges.

If it is worse than linear (e.g. exponential), it is either a lack of infrastructure or a problem with the queries that can be optimised.

It is unlikely it will get logarithmic (faster).

Thanks for the response,

My setup is:

  • CPU: AMD Ryzen 5 5600X
  • RAM: 32GB DDR4
  • Storage: Samsung 980 PRO 1TB (SSD)
  • OS: Windows 11 Pro

It feels more exponential than linear, so I suspect it's a query or indexing issue rather than a hardware limitation.

Or you are running out of RAM and cycling between the on-disk DB, RAM, and virtual memory (which lives on the same SSD).

Have a look at your Memory and CPU when you run the query.

The red flag for me is the mention of neomodel. It shouldn't take that long to import 25k nodes and relationships.

Can you share your code and/or Cypher queries?
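
For comparison, a batched UNWIND write through the raw Python driver usually handles 25k nodes in seconds rather than minutes. A rough sketch -- the label and property names are made-up placeholders, not your schema:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def import_companies(companies, batch_size=1000):
    """companies: list of dicts like {"reg_number": "...", "name": "..."}.
    One round trip per batch instead of one per node."""
    with driver.session() as session:
        for i in range(0, len(companies), batch_size):
            session.run(
                """
                UNWIND $rows AS row
                MERGE (c:Company {reg_number: row.reg_number})
                SET c.name = row.name
                """,
                rows=companies[i:i + batch_size],
            )
```

If each neomodel `.save()` call ends up running in its own transaction, that alone can explain the gap.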

I did; it's around 75-80% used.

Hey, at first I was only using Cypher, but I struggled to build robust, bullet-proof matching logic, since the data are not reliably formatted and correctly matching the same entities is very important. I refactored to neomodel and was able to do what I wanted much more quickly.

I could share the main script where all the inserting and matching happens, but it's over 1,000 lines. I don't want to be rude, but would you be available for a private conversation where I could share it more clearly? I could maybe share the function where I try to find matching persons so you can see how I'm doing it :smiley:

the data are not reliably formatted and correctly matching the same entities is very important

That's probably another reason for your performance handicap... you should have created a uniform model beforehand.

Well, what I meant is that sometimes a person has a name like this:

“John Smith PhD.”, which needs to be correctly matched to a Person node with name “Smith John”

Or things like different address formatting:
Fashion Street 123/45, London, UK
Fashion Street 45, London, UK

There has to be some logic in the matching process here, right? I'm not using any AI to parse everything into the same form beforehand. However, I am doing some pre-processing of the data: cleaning it and trying to put addresses into a uniform form like so:
Street StreetNumber, PostalCode City, Country
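
For the names, the pre-processing is roughly this kind of thing (a simplified sketch of the idea, not my actual code -- the title list is just illustrative):

```python
import re

# Illustrative list of titles/suffixes to strip before comparing names.
TITLES = {"phd", "phd.", "mba", "dr", "dr.", "ing.", "jr", "jr."}

def normalize_person_name(raw_name: str) -> str:
    """Lowercase, drop titles, and sort the remaining tokens so that
    'John Smith PhD.' and 'Smith John' produce the same matching key."""
    tokens = re.split(r"\s+", raw_name.strip().lower())
    tokens = [t for t in tokens if t not in TITLES]
    return " ".join(sorted(tokens))

print(normalize_person_name("John Smith PhD."))  # -> "john smith"
print(normalize_person_name("Smith John"))       # -> "john smith"
```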

Beyond that, I tried to write robust “algorithms” for comparing slightly different nodes accurately, since the data won't always be in a perfect state. Do you think this is the wrong approach?

I can't account for your design - if you have to do those conversions, you have to do them.

I am just trying to explain the possible sources of performance bottlenecks (and how things can get exponentially slower).

I see.

Do you think it would be better to process the data and match entities correctly in Postgres, for example, then just export to CSV and import that into Neo4j?

Is it a bad approach to do complex Cypher/neomodel queries for the matching?

I use C# and go straight to Neo4j... but I can try to give you a few pointers:

  • Python will be somewhat less efficient than running a raw query inside the DB
  • you can do either:
    -- ETL: extract from the source -> transform to your target model -> load into Neo4j
    -- ELT: extract -> load into Neo4j -> transform there

There is no right/wrong, it mostly depends on:

  • frequency: how often you will execute things
  • readiness: how clean you need the data as it lands in your database

If you have a huge amount of data that needs structuring, but it is loaded only once, a simple approach may work best:

  • create a Python script that reads the input (if you need speed, Perl is probably faster, but you'd have to learn it)
  • the script creates a 'clean' CSV/JSON
  • the script 'chunks' the output into sections (e.g. 100K records each)
  • create a query that uses the APOC procedures to load the files in batches (see the sketch below)
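
That last step could look something like this, assuming APOC is installed and the cleaned chunks sit in Neo4j's import directory (file, label and property names are placeholders):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# apoc.periodic.iterate streams the rows and commits them in server-side
# batches, so one call loads a whole pre-cleaned chunk.
load_chunk = """
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///companies_000.csv' AS row RETURN row",
  "MERGE (c:Company {reg_number: row.reg_number}) SET c.name = row.name",
  {batchSize: 10000, parallel: false}
)
"""

with driver.session() as session:
    session.run(load_chunk)

driver.close()
```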

If you are going to have periodic loads of unstructured data, then you need a more complex pipeline to make sure nothing breaks.

If you had less data, perhaps ELT would work... but only you know :)