Hey guys,
We have a Neo4j cluster of 3 nodes running on Kubernetes. We self-host Neo4j Enterprise on GKE with 1.5 vCPU and 7.5GB RAM per node. We have 3 primaries and 0 secondaries, and we are seeing very high memory usage on the write leader of the cluster. I am attaching a screenshot of the memory trend. We see 80%+ memory utilization on average, and it keeps climbing daily; all of the downward trends are from restarts. We are using the following memory configuration, which we obtained from the neo4j-admin server memory-recommendation command.
server.memory.heap.initial_size: 3500m
server.memory.heap.max_size: 3500m
server.memory.pagecache.size: 1800m
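(For reference, we generated these with something like the command below and copied the suggested heap and page cache values into our Helm values; the exact --memory argument here is illustrative:
neo4j-admin server memory-recommendation --memory=7500m
)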
If memory usage goes above 95%, the node starts rejecting requests or becomes very slow. What can we do about this?
Memory usage chart over the last 2 months - pretty much every memory downtick is a restart. We ended up adding a 4th node to the cluster about 4 days ago to see if that would alleviate some pressure from the other nodes, but it doesn’t seem to be helping. We have a pretty good monitoring setup, so I will be able to provide more details if necessary.
Thank you for the help!
@vatsalpatel.me
What version of Neo4j?
In the image you provided, is the red line total RAM and the blue line in-use RAM? Sorry, there is no legend on the image describing what the red and blue lines represent.
If you are using Neo4j Enterprise and thus have an enterprise license, you could open a ticket at support.neo4j.com.
Hi Dana,
We are using Neo4j 5.26.0 Enterprise. We used to be on the startup program that was sunset last month, and we are currently in the process of applying for the new startup program. It would be nice if we could get this answered here in the meantime.
The red line is the RAM limit for the pod, and the blue line is the RAM utilisation as reported by Kubernetes. Additionally, we run a Grafana Alloy container within the pod for query logs.
@vatsalpatel.me
Not that an upgrade will simply make this all go away, but 5.26.0 is somewhat old (see https://neo4j.com/release-notes/database/neo4j-5/); it was released December 9, 2024, some nine months ago.
You could set db.memory.transaction.total.max (see Configuration settings in the Operations Manual), which will:
"Limit the amount of memory that all transactions in one database can consume, in bytes (or kilobytes with the 'k' suffix, megabytes with 'm' and gigabytes with 'g'). Zero means 'unlimited'."
With that in place, if a query is run and it pushes the total transaction memory beyond the threshold, the query is not allowed to run.
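In the same colon style as your config above, that would look something like this (the 2500m value is only an illustration chosen to sit under your 3500m heap; tune it to your workload):
db.memory.transaction.total.max: 2500m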
Has any investigation been done to check whether the queries are properly defined?
Thanks for the quick response @dana_canzano
We will get the Neo4j version upgraded within the next few days. We have also added the setting you recommended and increased the memory temporarily.
Regarding the query questions, we have been observing the query logs for some time now and have optimised just about every query to run under 1s and under 1000-2000 page hits. We see a large number of small queries with <20 page hits and under 100-200ms, and very few large queries. We have had to keep our connection limit at 600 per node and are still tweaking that value.
Do you have any more suggestions or configurations we could try?
@vatsalpatel.me
Increasing memory will certainly help, as your initial 7.5GB of RAM appears small.
With the new parameter, if a query is run and, in total, all running transactions would consume more memory than the configured limit, the query that crosses the threshold will error out. This might help identify queries needing further optimization; it could be a single expensive query, or it could be a matter of too many queries running at once.
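You can also look at what is holding heap at a given moment with something like the following (I am quoting the column names from memory, so double-check them against the Cypher manual for your version):
SHOW TRANSACTIONS YIELD transactionId, currentQuery, elapsedTime, estimatedUsedHeapMemory
ORDER BY estimatedUsedHeapMemory DESC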
@dana_canzano
What would you recommend for the RAM allocation? 7.5GB is per node, not the total. We have bumped it up to 9.5GB per node now because of the memory pressure.
Also, we occasionally hit an issue where one of the nodes gets stuck and stops acknowledging write transactions, causing write transactions to time out. We had 3 primaries but added another to try to alleviate the memory pressure, so it is now 4 primaries and 0 secondaries. Read transactions work just fine, but it looks like a single node failing to acknowledge the write commit is enough to stall writes. The issue resolves once we restart the node in question (usually the one with the highest memory usage). We have observed that this node is not necessarily the write leader, yet it can still stop write transactions from committing.
What could be the issue here?
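(For context, when this happens we identify the current writer with something along these lines, assuming the default 'neo4j' database; the exact query is illustrative:
SHOW DATABASES YIELD name, address, role, writer WHERE name = 'neo4j'
and the stuck node is often a primary that is not the writer.)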
This is the message we get. We are using the Go driver, so the timeout is enforced through context cancellation and happens after 1 minute (our request timeout); a rough sketch of how we set the deadline follows the error message.
"ConnectivityError: Connection lost during commit: Timeout while reading from connection [server-side timeout hint: 1m0s, user-provided context deadline: 2025-09-04 16:08:33.950195659 +0000 UTC m=+106727.943778403]: context deadline exceeded"
@dana_canzano
While investigating memory pressure on our Neo4j instance, I noticed that the RSS is growing beyond what the JVM and Native Memory Tracking report. Specifically, about 18 hours after the last database restart, we see roughly an extra 900 MB that isn’t accounted for by NMT, and it grows slowly over time.
There is a good chance this is Netty buffers, because the number of connections opened on one of our clusters grew by 5000 over 12 hours. On another cluster it only increased by 800 over the same period, but we see memory pressure on that one as well.
We are considering setting -XX:MaxDirectMemorySize to ~1.5 GB, but the warning below was included in the documentation, so I wanted to get your opinion first. What are your thoughts on Netty buffers being the issue?
THESE ARE SENSITIVE SETTINGS THAT WILL AFFECT THE NORMAL FUNCTIONALITY OF NEO4J. Please do not change these settings without consulting with Neo4j’s professionals. You can just log a support ticket if you’re running into issues with Direct Memory and we’ll advise you the best we can.
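(For concreteness, this is what we are considering adding, assuming it is passed through server.jvm.additional; the exact syntax depends on how our Helm values inject JVM options, and 1500m is just the ~1.5 GB mentioned above:
server.jvm.additional=-XX:MaxDirectMemorySize=1500m
NMT is already enabled on our side, and we compare its totals against RSS with:
jcmd <neo4j-pid> VM.native_memory summary
)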