Hi!
I'm a bit puzzled by a database availability issue I'm facing.
For context, I have 5 workers merging data inside 5 separate transactions.
The database is running on the same host (32 GB of RAM), and I configured it as follows.
First, I ran the neo4j-admin server memory-recommendation tool:
root@cea600baf881:/var/lib/neo4j# neo4j-admin server memory-recommendation --memory 20GB --docker
# Memory settings recommendation:
#
# Assuming the system is dedicated to running Neo4j and has 20.00GiB of memory,
# we recommend a heap size of around 6700m, and a page cache of around 8700m,
# and that about 5g is left for the operating system, and the native memory
# needed by Lucene and Netty.
#
# Tip: If the indexing storage use is high, e.g. there are many indexes or most
# data indexed, then it might advantageous to leave more memory for the
# operating system.
#
# Tip: The more concurrent transactions your workload has and the more updates
# they do, the more heap memory you will need. However, don't allocate more
# than 31g of heap, since this will disable pointer compression, also known as
# "compressed oops", in the JVM and make less effective use of the heap.
#
# Tip: Setting the initial and the max heap size to the same value means the
# JVM will never need to change the heap size. Changing the heap size otherwise
# involves a full GC, which is desirable to avoid.
#
# Based on the above, the following memory settings are recommended:
NEO4J_server_memory_heap_initial__size='6700m'
NEO4J_server_memory_heap_max__size='6700m'
NEO4J_server_memory_pagecache_size='8700m'
#
# It is also recommended turning out-of-memory errors into full crashes,
# instead of allowing a partially crashed database to continue running:
NEO4J_server_jvm_additional='-XX:+ExitOnOutOfMemoryError'
#
# The numbers below have been derived based on your current databases located at: '/var/lib/neo4j/data/databases'.
# They can be used as an input into more detailed memory analysis.
# Total size of lucene indexes in all databases: 0k
# Total size of data and native indexes in all databases: 1600m
And I added some tweaks to my Docker Compose configuration:
# do not put an artificial limit on transaction max memory
NEO4J_dbms_memory_transaction_total_max: 0
# increase the default thread pool size to avoid starvation
NEO4J_server_bolt_thread__pool__max__size: 800
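Put together, the relevant part of my compose file looks roughly like this (the service name and image tag are illustrative, not my exact file):

```yaml
services:
  neo4j:
    image: neo4j:5.23.0
    environment:
      # settings from neo4j-admin server memory-recommendation
      NEO4J_server_memory_heap_initial__size: 6700m
      NEO4J_server_memory_heap_max__size: 6700m
      NEO4J_server_memory_pagecache_size: 8700m
      NEO4J_server_jvm_additional: "-XX:+ExitOnOutOfMemoryError"
      # do not put an artificial limit on transaction max memory
      NEO4J_dbms_memory_transaction_total_max: 0
      # increase the default thread pool size to avoid starvation
      NEO4J_server_bolt_thread__pool__max__size: 800
```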
But when I run my GitHub Actions workflow and capture my data, all my workers end up with:
Traceback (most recent call last):
File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 51, in _buffer_one_chunk
receive_into_buffer(self._socket, self._buffer, 2)
File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 328, in receive_into_buffer
raise OSError("No data")
OSError: No data
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/path/to/site-packages/custom_module/__main__.py", line 14, in wrapper
return f(*args, **kwargs)
File "/path/to/site-packages/custom_module/__main__.py", line 68, in runner
module(commit)
File "/path/to/site-packages/custom_module/types.py", line 38, in __call__
self.run(commit)
File "/path/to/site-packages/custom_module/modules/specific_module.py", line 58, in run
cypher_query_with_backoff(query, {"data_id": data.id, "data_type": file_type})
File "/path/to/site-packages/custom_service/service/db_service.py", line 46, in cypher_query_with_backoff
return db.cypher_query(query, params)
File "/path/to/site-packages/neomodel/sync_/core.py", line 83, in wrapper
return func(self, *args, **kwargs)
File "/path/to/site-packages/neomodel/sync_/core.py", line 458, in cypher_query
results, meta = self._run_cypher_query(
File "/path/to/site-packages/neomodel/sync_/core.py", line 494, in _run_cypher_query
response: Result = session.run(query, params)
File "/path/to/site-packages/neo4j/_sync/work/transaction.py", line 168, in run
result._tx_ready_run(query, parameters)
File "/path/to/site-packages/neo4j/_sync/work/result.py", line 131, in _tx_ready_run
self._run(query, parameters, None, None, None, None, None, None)
File "/path/to/site-packages/neo4j/_sync/work/result.py", line 181, in _run
self._attach()
File "/path/to/site-packages/neo4j/_sync/work/result.py", line 301, in _attach
self._connection.fetch_message()
File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 178, in inner
func(*args, **kwargs)
File "/path/to/site-packages/neo4j/_sync/io/_bolt.py", line 847, in fetch_message
tag, fields = self.inbox.pop(
File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 72, in pop
self._buffer_one_chunk()
File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 68, in _buffer_one_chunk
Util.callback(self.on_error, error)
File "/path/to/site-packages/neo4j/_async_compat/util.py", line 118, in callback
return cb(*args, **kwargs)
File "/path/to/site-packages/neo4j/_sync/io/_bolt.py", line 873, in _set_defunct_read
self._set_defunct(message, error=error, silent=silent)
File "/path/to/site-packages/neo4j/_sync/io/_bolt.py", line 920, in _set_defunct
raise ServiceUnavailable(message) from error
neo4j.exceptions.ServiceUnavailable: Failed to read from defunct connection IPv4Address(('localhost', 7687)) (ResolvedIPv4Address(('127.0.0.1', 7687)))
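For context, cypher_query_with_backoff is a thin retry wrapper around db.cypher_query. A minimal sketch of the idea (the stand-in exception class, retry count, and delays here are illustrative assumptions, not my actual code):

```python
import time


class ServiceUnavailable(Exception):
    """Stand-in for neo4j.exceptions.ServiceUnavailable."""


def with_backoff(fn, retries=5, base_delay=1.0):
    """Call fn(); on ServiceUnavailable, retry with exponential backoff.

    Re-raises the last error once the retry budget is exhausted.
    """
    for attempt in range(retries):
        try:
            return fn()
        except ServiceUnavailable:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Even with retries like these, every worker exhausted its budget at the same moment, which is what makes me think the server itself went away.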
I also observed that all my workers died at the same time, with the same error.
So, did Neo4j become unavailable?
And there is nothing meaningful in the logs:
2024-09-24 12:33:28.063+0000 INFO Logging config in use: File '/var/lib/neo4j/conf/user-logs.xml'
2024-09-24 12:33:28.083+0000 INFO Starting...
2024-09-24 12:33:29.005+0000 INFO This instance is ServerId{7a6e3a2e} (7a6e3a2e-f0c6-4f00-b814-ed560a1fb174)
2024-09-24 12:33:30.222+0000 INFO ======== Neo4j 5.23.0 ========
2024-09-24 12:33:33.167+0000 INFO Anonymous Usage Data is being sent to Neo4j, see https://neo4j.com/docs/usage-data/
2024-09-24 12:33:33.233+0000 INFO Bolt enabled on 0.0.0.0:7687.
2024-09-24 12:33:33.987+0000 INFO HTTP enabled on 0.0.0.0:7474.
2024-09-24 12:33:33.987+0000 INFO Remote interface available at http://localhost:7474/
2024-09-24 12:33:33.991+0000 INFO id: 6D67A78550509A78BA795FC0971D3D4AE3FBB32A9691282F99D653212BA945B0
2024-09-24 12:33:33.991+0000 INFO name: system
2024-09-24 12:33:33.991+0000 INFO creationDate: 2024-07-07T18:37:38.308Z
2024-09-24 12:33:33.991+0000 INFO Started.
2024-09-24 14:31:36.128+0000 ERROR Increase in network aborts detected (more than 2 network related connection aborts over a period of 600000 ms) - This may indicate an issue with the network environment or an overload condition
The last ERROR line seems more related to the drivers giving up on the connection after receiving the ServiceUnavailable error?
If my transactions were consuming too much memory, shouldn't I see a Java heap OutOfMemoryError in the log output?
I'm confused that Neo4j seems to have been unavailable for a short time while leaving no trace of it in the logs.
- Neo4j 5.23
- Python stack, neo4j driver, direct queries
Thanks a lot for your help!