Neo4j ServiceUnavailable with no errors in the log output

Hi!

I'm a bit puzzled by a database availability issue I'm facing.

For context, I have 5 workers merging data inside 5 separate transactions.
The database is running on the same host (32 GB of RAM), and I configured it as follows:

I ran the neo4j-admin server memory-recommendation tool:

root@cea600baf881:/var/lib/neo4j# neo4j-admin server memory-recommendation --memory 20GB --docker
# Memory settings recommendation:
#
# Assuming the system is dedicated to running Neo4j and has 20.00GiB of memory,
# we recommend a heap size of around 6700m, and a page cache of around 8700m,
# and that about 5g is left for the operating system, and the native memory
# needed by Lucene and Netty.
#
# Tip: If the indexing storage use is high, e.g. there are many indexes or most
# data indexed, then it might advantageous to leave more memory for the
# operating system.
#
# Tip: The more concurrent transactions your workload has and the more updates
# they do, the more heap memory you will need. However, don't allocate more
# than 31g of heap, since this will disable pointer compression, also known as
# "compressed oops", in the JVM and make less effective use of the heap.
#
# Tip: Setting the initial and the max heap size to the same value means the
# JVM will never need to change the heap size. Changing the heap size otherwise
# involves a full GC, which is desirable to avoid.
#
# Based on the above, the following memory settings are recommended:
NEO4J_server_memory_heap_initial__size='6700m'
NEO4J_server_memory_heap_max__size='6700m'
NEO4J_server_memory_pagecache_size='8700m'
#
# It is also recommended turning out-of-memory errors into full crashes,
# instead of allowing a partially crashed database to continue running:
NEO4J_server_jvm_additional='-XX:+ExitOnOutOfMemoryError'
#
# The numbers below have been derived based on your current databases located at: '/var/lib/neo4j/data/databases'.
# They can be used as an input into more detailed memory analysis.
# Total size of lucene indexes in all databases: 0k
# Total size of data and native indexes in all databases: 1600m

And I added some tweaks to my Docker Compose configuration:

# do not put an artificial limit on transaction max memory
NEO4J_dbms_memory_transaction_total_max: 0
# increase default thread pool size to avoid starvation
NEO4J_server_bolt_thread__pool__max__size: 800
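
To double-check that these values are actually picked up inside the container, I query the effective settings at runtime. This is just a quick sketch (URI and credentials are placeholders), assuming SHOW SETTINGS is available to the connecting user:

from neo4j import GraphDatabase

URI = "bolt://localhost:7687"   # placeholder connection details
AUTH = ("neo4j", "password")

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    # List the effective memory-related settings of the running instance
    records, _, _ = driver.execute_query(
        "SHOW SETTINGS YIELD name, value WHERE name CONTAINS 'memory' RETURN name, value",
        database_="neo4j",
    )
    for record in records:
        print(record["name"], "=", record["value"])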

But when I run my GitHub Actions workflow and capture my data, all my workers end up with:

Traceback (most recent call last):
  File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 51, in _buffer_one_chunk
    receive_into_buffer(self._socket, self._buffer, 2)
  File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 328, in receive_into_buffer
    raise OSError("No data")
OSError: No data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/to/site-packages/custom_module/__main__.py", line 14, in wrapper
    return f(*args, **kwargs)
  File "/path/to/site-packages/custom_module/__main__.py", line 68, in runner
    module(commit)
  File "/path/to/site-packages/custom_module/types.py", line 38, in __call__
    self.run(commit)
  File "/path/to/site-packages/custom_module/modules/specific_module.py", line 58, in run
    cypher_query_with_backoff(query, {"data_id": data.id, "data_type": file_type})
  File "/path/to/site-packages/custom_service/service/db_service.py", line 46, in cypher_query_with_backoff
    return db.cypher_query(query, params)
  File "/path/to/site-packages/neomodel/sync_/core.py", line 83, in wrapper
    return func(self, *args, **kwargs)
  File "/path/to/site-packages/neomodel/sync_/core.py", line 458, in cypher_query
    results, meta = self._run_cypher_query(
  File "/path/to/site-packages/neomodel/sync_/core.py", line 494, in _run_cypher_query
    response: Result = session.run(query, params)
  File "/path/to/site-packages/neo4j/_sync/work/transaction.py", line 168, in run
    result._tx_ready_run(query, parameters)
  File "/path/to/site-packages/neo4j/_sync/work/result.py", line 131, in _tx_ready_run
    self._run(query, parameters, None, None, None, None, None, None)
  File "/path/to/site-packages/neo4j/_sync/work/result.py", line 181, in _run
    self._attach()
  File "/path/to/site-packages/neo4j/_sync/work/result.py", line 301, in _attach
    self._connection.fetch_message()
  File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 178, in inner
    func(*args, **kwargs)
  File "/path/to/site-packages/neo4j/_sync/io/_bolt.py", line 847, in fetch_message
    tag, fields = self.inbox.pop(
  File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 72, in pop
    self._buffer_one_chunk()
  File "/path/to/site-packages/neo4j/_sync/io/_common.py", line 68, in _buffer_one_chunk
    Util.callback(self.on_error, error)
  File "/path/to/site-packages/neo4j/_async_compat/util.py", line 118, in callback
    return cb(*args, **kwargs)
  File "/path/to/site-packages/neo4j/_sync/io/_bolt.py", line 873, in _set_defunct_read
    self._set_defunct(message, error=error, silent=silent)
  File "/path/to/site-packages/neo4j/_sync/io/_bolt.py", line 920, in _set_defunct
    raise ServiceUnavailable(message) from error
neo4j.exceptions.ServiceUnavailable: Failed to read from defunct connection IPv4Address(('localhost', 7687)) (ResolvedIPv4Address(('127.0.0.1', 7687)))

And I also observed that all my workers died at the same time, with the same error.

So, did Neo4j become unavailable?

And there is nothing meaningful in the logs:

2024-09-24 12:33:28.063+0000 INFO  Logging config in use: File '/var/lib/neo4j/conf/user-logs.xml'
2024-09-24 12:33:28.083+0000 INFO  Starting...
2024-09-24 12:33:29.005+0000 INFO  This instance is ServerId{7a6e3a2e} (7a6e3a2e-f0c6-4f00-b814-ed560a1fb174)
2024-09-24 12:33:30.222+0000 INFO  ======== Neo4j 5.23.0 ========
2024-09-24 12:33:33.167+0000 INFO  Anonymous Usage Data is being sent to Neo4j, see https://neo4j.com/docs/usage-data/
2024-09-24 12:33:33.233+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2024-09-24 12:33:33.987+0000 INFO  HTTP enabled on 0.0.0.0:7474.
2024-09-24 12:33:33.987+0000 INFO  Remote interface available at http://localhost:7474/
2024-09-24 12:33:33.991+0000 INFO  id: 6D67A78550509A78BA795FC0971D3D4AE3FBB32A9691282F99D653212BA945B0
2024-09-24 12:33:33.991+0000 INFO  name: system
2024-09-24 12:33:33.991+0000 INFO  creationDate: 2024-07-07T18:37:38.308Z
2024-09-24 12:33:33.991+0000 INFO  Started.
2024-09-24 14:31:36.128+0000 ERROR Increase in network aborts detected (more than 2 network related connection aborts over a period of 600000 ms) - This may indicate an issue with the network environment or an overload condition

The last error seems more like a consequence of the drivers giving up on the connection after receiving the ServiceUnavailable error?

If my transactions were to consume too much memory, shouldn't I see a Java heap OOM error in the log output?
I'm confused by the fact that Neo4j seems to have been unavailable for a short time while leaving no trace in the logs.

  • Neo4j 5.23
  • Python stack, neo4j driver, direct queries

Thanks a lot for your help !

@Wenzel

anything of interest in logs/debug.log?

Hi @dana_canzano

I can see those messages:

2024-09-24 12:33:33.233+0000 INFO  [o.n.b.p.c.c.n.SocketNettyConnector] Bolt enabled on 0.0.0.0:7687.
2024-09-24 12:33:33.236+0000 WARN  [i.n.b.ServerBootstrap] Unknown channel option 'SO_REUSEADDR' for channel '[id: 0xbbe9553d]'
2024-09-24 12:33:33.241+0000 INFO  [o.n.b.BoltServer] Bolt server started
2024-09-24 12:33:33.242+0000 INFO  [o.n.s.A.ServerComponentsLifecycleAdapter] Starting web server
2024-09-24 12:33:33.293+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job finished: descriptor=Index( id=9, name='constraint_a3044a2f', type='RANGE', schema=(:Label[0] {PropertyKey[15]}), indexProvider='range-1.0', owningConstraint=10 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/9/index-9 Number of pages visited: 12677, Number of tree nodes: 12660, Number of cleaned crashed pointers: 0, Time spent: 88ms
2024-09-24 12:33:33.294+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job closed: descriptor=Index( id=9, name='constraint_a3044a2f', type='RANGE', schema=(:Label[0] {PropertyKey[15]}), indexProvider='range-1.0', owningConstraint=10 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/9/index-9
2024-09-24 12:33:33.294+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job started: descriptor=Index( id=11, name='constraint_1ea4ee12', type='RANGE', schema=(:Label[1] {PropertyKey[15]}), indexProvider='range-1.0', owningConstraint=12 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/11/index-11
2024-09-24 12:33:33.328+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job finished: descriptor=Index( id=11, name='constraint_1ea4ee12', type='RANGE', schema=(:Label[1] {PropertyKey[15]}), indexProvider='range-1.0', owningConstraint=12 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/11/index-11 Number of pages visited: 4145, Number of tree nodes: 4138, Number of cleaned crashed pointers: 0, Time spent: 33ms
2024-09-24 12:33:33.328+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job closed: descriptor=Index( id=11, name='constraint_1ea4ee12', type='RANGE', schema=(:Label[1] {PropertyKey[15]}), indexProvider='range-1.0', owningConstraint=12 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/11/index-11
2024-09-24 12:33:33.328+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job started: descriptor=Index( id=13, name='constraint_66a5ffc4', type='RANGE', schema=(:Label[2] {PropertyKey[15]}), indexProvider='range-1.0', owningConstraint=14 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/13/index-13
2024-09-24 12:33:33.329+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job finished: descriptor=Index( id=13, name='constraint_66a5ffc4', type='RANGE', schema=(:Label[2] {PropertyKey[15]}), indexProvider='range-1.0', owningConstraint=14 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/13/index-13 Number of pages visited: 7, Number of tree nodes: 6, Number of cleaned crashed pointers: 0, Time spent: 0ms
2024-09-24 12:33:33.329+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job closed: descriptor=Index( id=13, name='constraint_66a5ffc4', type='RANGE', schema=(:Label[2] {PropertyKey[15]}), indexProvider='range-1.0', owningConstraint=14 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/13/index-13
2024-09-24 12:33:33.330+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job started: descriptor=Index( id=15, name='constraint_88316a80', type='RANGE', schema=(:Label[3] {PropertyKey[16]}), indexProvider='range-1.0', owningConstraint=16 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/15/index-15
2024-09-24 12:33:33.330+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job finished: descriptor=Index( id=15, name='constraint_88316a80', type='RANGE', schema=(:Label[3] {PropertyKey[16]}), indexProvider='range-1.0', owningConstraint=16 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/15/index-15 Number of pages visited: 3, Number of tree nodes: 2, Number of cleaned crashed pointers: 0, Time spent: 0ms
2024-09-24 12:33:33.330+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job closed: descriptor=Index( id=15, name='constraint_88316a80', type='RANGE', schema=(:Label[3] {PropertyKey[16]}), indexProvider='range-1.0', owningConstraint=16 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/15/index-15
2024-09-24 12:33:33.330+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job started: descriptor=Index( id=17, name='constraint_63e86c1e', type='RANGE', schema=(:Label[4] {PropertyKey[19]}), indexProvider='range-1.0', owningConstraint=18 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/17/index-17
2024-09-24 12:33:33.331+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job finished: descriptor=Index( id=17, name='constraint_63e86c1e', type='RANGE', schema=(:Label[4] {PropertyKey[19]}), indexProvider='range-1.0', owningConstraint=18 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/17/index-17 Number of pages visited: 3, Number of tree nodes: 2, Number of cleaned crashed pointers: 0, Time spent: 0ms
2024-09-24 12:33:33.331+0000 INFO  [o.n.k.i.i.s.RangeIndexProvider] [neo4j/a91b10a2] Schema index cleanup job closed: descriptor=Index( id=17, name='constraint_63e86c1e', type='RANGE', schema=(:Label[4] {PropertyKey[19]}), indexProvider='range-1.0', owningConstraint=18 ), indexFile=/data/databases/neo4j/schema/index/range-1.0/17/index-17
2024-09-24 12:33:33.987+0000 INFO  [o.n.s.CommunityNeoWebServer] HTTP enabled on 0.0.0.0:7474.
2024-09-24 12:33:33.987+0000 INFO  [o.n.s.CommunityNeoWebServer] Remote interface available at http://localhost:7474/
2024-09-24 12:33:33.987+0000 INFO  [o.n.s.A.ServerComponentsLifecycleAdapter] Web server started.
2024-09-24 12:33:33.991+0000 INFO  [o.n.g.f.DatabaseManagementServiceFactory] id: 6D67A78550509A78BA795FC0971D3D4AE3FBB32A9691282F99D653212BA945B0
2024-09-24 12:33:33.991+0000 INFO  [o.n.g.f.DatabaseManagementServiceFactory] name: system
2024-09-24 12:33:33.991+0000 INFO  [o.n.g.f.DatabaseManagementServiceFactory] creationDate: 2024-07-07T18:37:38.308Z

But they don't sound like fatal errors to me.

Thanks !

@Wenzel

Agreed, debug.log doesn't look unusual.

You state:

I have 5 workers merging data inside 5 separate transactions.

Are you able to share the Cypher statements? Do you have indexes to support the MERGE statements?

So yes, I have unique constraints applied for faster lookup.

The merge statements look like this:

MERGE (m:MimeType {mime: $mime_type})
WITH m
MATCH (b:Blob {hash: $blob_hash})
MERGE (b)-[:HAS_MIME_TYPE]->(m)

And I now realize that I have other problems, such as deadlock issues between the transactions, since the graph is heavily connected.

Looking at possible solutions for this, I implemented a wrapper on the cypher_query function with a backoff mechanism:

import logging
import time
from typing import Any, Dict, List

from neo4j import exceptions
from neomodel import db


def cypher_query_with_backoff(
    query: str, params: Dict[str, Any], max_retries: int = 10, initial_retry_delay: float = 0.5
) -> List[Dict[str, Any]]:
    retries = 0
    while True:
        try:
            return db.cypher_query(query, params)
        except exceptions.TransientError as e:
            # Only deadlocks are retried; any other transient error is re-raised
            if e.code != "Neo.TransientError.Transaction.DeadlockDetected":
                raise
            logging.info("Deadlock detected, retrying (%s)", retries)
            retries += 1
            if retries > max_retries:
                raise e
            # Exponential backoff before retrying the same query
            time.sleep(initial_retry_delay * (2**retries))

But that doesn't seem to work at all.
There is one detection of the deadlock, and then a TransactionError is raised.

2024-09-24 17:25:22,143 INFO:root:Deadlock detected, retrying (0)
2024-09-24 17:25:23,170 ERROR:root:Post Mortem: An unhandled exception occurred.
Traceback (most recent call last):
  File "[...]/service/database.py", line 46, in cypher_query_with_backoff
    return db.cypher_query(query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/core.py", line 83, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/core.py", line 458, in cypher_query
    results, meta = self._run_cypher_query(
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/core.py", line 494, in _run_cypher_query
    response: Result = session.run(query, params)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/[...]/neo4j/_sync/work/transaction.py", line 168, in run
    result._tx_ready_run(query, parameters)
  File "/[...]/neo4j/_sync/work/result.py", line 131, in _tx_ready_run
    self._run(query, parameters, None, None, None, None, None, None)
  File "/[...]/neo4j/_sync/work/result.py", line 181, in _run
    self._attach()
  File "/[...]/neo4j/_sync/work/result.py", line 301, in _attach
    self._connection.fetch_message()
  File "/[...]/neo4j/_sync/io/_common.py", line 178, in inner
    func(*args, **kwargs)
  File "/[...]/neo4j/_sync/io/_bolt.py", line 850, in fetch_message
    res = self._process_message(tag, fields)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/[...]/neo4j/_sync/io/_bolt5.py", line 369, in _process_message
    response.on_failure(summary_metadata or {})
  File "/[...]/neo4j/_sync/io/_common.py", line 245, in on_failure
    raise Neo4jError.hydrate(**metadata)
neo4j.exceptions.TransientError: {code: Neo.TransientError.Transaction.DeadlockDetected} {message: ForsetiClient[transactionId=4396, clientId=7] can't acquire ExclusiveLock{owner=ForsetiClient[transactionId=4329, clientId=3]} on NODE_RELATIONSHIP_GROUP_DELETE(837093) because holders of that lock are waiting for ForsetiClient[transactionId=4396, clientId=7].
 Wait list:ExclusiveLock[
Client[4329] waits for [ForsetiClient[transactionId=4396, clientId=7]]]}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[...]/main.py", line 14, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "[...]/main.py", line 68, in runner
    plugin(commit)
  File "[...]/types.py", line 38, in __call__
    self.run(commit)
  File "[...]/plugins/filetype.py", line 58, in run
    cypher_query_with_backoff(query, {"blob_hash": blob.hash, "mime_type": file_type})
  File "[...]/service/database.py", line 46, in cypher_query_with_backoff
    return db.cypher_query(query, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/core.py", line 83, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/core.py", line 458, in cypher_query
    results, meta = self._run_cypher_query(
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/core.py", line 494, in _run_cypher_query
    response: Result = session.run(query, params)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/[...]/neo4j/_sync/work/transaction.py", line 151, in run
    raise TransactionError(self,
neo4j.exceptions.TransactionError: Transaction failed
> /[...]/neo4j/_sync/work/transaction.py(151)run()
    150         if self._last_error:
--> 151             raise TransactionError(self,
    152                                    "Transaction failed") from self._last_error

At this point I'm wondering how to insert my data concurrently into the database.
I thought that transactions were handling these issues by isolating the graph modifications until they were committed.

And the backoff mechanism to retry the query later isn't a good option, since it would need to wait for the other transactions to complete (if I understand correctly).
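
One alternative I'm looking at is retrying the whole unit of work instead of a single query, using the driver's managed transactions, which retry transient errors such as deadlocks automatically. This is only a rough sketch with the plain neo4j driver (not neomodel), assuming the work can be split into small self-contained functions:

from neo4j import GraphDatabase

URI = "bolt://localhost:7687"   # placeholder connection details
AUTH = ("neo4j", "password")

# Hypothetical unit of work: if the transaction hits a transient error
# (e.g. a deadlock), execute_write retries the whole function.
def link_blob_to_mime_type(tx, blob_hash, mime_type):
    tx.run(
        "MERGE (m:MimeType {mime: $mime_type}) "
        "WITH m "
        "MATCH (b:Blob {hash: $blob_hash}) "
        "MERGE (b)-[:HAS_MIME_TYPE]->(m)",
        blob_hash=blob_hash,
        mime_type=mime_type,
    )

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    with driver.session(database="neo4j") as session:
        session.execute_write(link_blob_to_mime_type, "some-hash", "text/plain")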

I have transactions and insertions that can take up to an hour and a half, hence the need for concurrent workers processing data and inserting at the same time...

Any tips on these Deadlocks that you can share ?

@Wenzel

Given the above:

Do you have indexes on :MimeType(mime) and :Blob(hash)?

Also, deadlocks are possible if, for example:

07:00:01 txn1: merge into :MimeType with property mime='abc' ... and thus a lock on this node
07:00:02 txn2: merge into :MimeType with property mime='def' ... and thus a lock on this node

07:00:03 txn1: match (b:Blob) with property hash=123
07:00:04 txn2: match (b:Blob) with property hash=456

07:00:05 txn1 now merges a relationship between the nodes from 07:00:01 and 07:00:03, and as such a lock is created on these 2 nodes

07:00:06 txn2 now merges a relationship between the nodes from 07:00:02 and 07:00:04, and as such a lock is created on these 2 nodes

07:00:07 txn1 tries to merge into :MimeType with property mime='def' but can't, since txn2 still has a lock on it, and so this transaction will wait until txn2 releases its lock

07:00:08 txn2 tries to merge into :MimeType with property mime='abc' but can't, since txn1 still has a lock on it, and so this transaction will wait until txn1 releases its lock... but txn1 is waiting (at 07:00:07) for txn2 to release its lock, and thus the deadlock.
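
One common way to break such a cycle is to make every transaction touch the shared nodes in the same order, e.g. by sorting the work items before merging. A rough sketch only, reusing your wrapper, just to illustrate the idea (your real workload also locks the :Blob nodes, so this alone is not a complete fix):

# hypothetical batch of (blob_hash, mime_type) pairs one worker wants to insert
pairs = [("hash-2", "def"), ("hash-1", "abc")]

# sorting by the shared key means every transaction acquires locks on the
# shared :MimeType nodes in the same order, so circular waits on them can't form
for blob_hash, mime_type in sorted(pairs, key=lambda p: p[1]):
    cypher_query_with_backoff(
        "MERGE (m:MimeType {mime: $mime_type}) "
        "WITH m "
        "MATCH (b:Blob {hash: $blob_hash}) "
        "MERGE (b)-[:HAS_MIME_TYPE]->(m)",
        {"blob_hash": blob_hash, "mime_type": mime_type},
    )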

If you are going to merge concurrently, is there any way to make sure a worker X doesn't merge on data that a worker Y might also merge upon?

Or, in the above example, can we commit after the statement at 07:00:05, which would thus release its locks?

Thank you for your detailed response @dana_canzano !

Do you have indexes on :MimeType(mime) and :Blob(hash)?

Yes, I have unique indexes on both nodes.

If you are going to merge concurrently, is there any way to make sure a worker X doesn't merge on data that a worker Y might also merge upon?

So each worker will gather all blob leaves of a given tree. A subset of them will be shared between the trees that each worker has to process, so I can't provide this guarantee in any way.

However, the data inserted should be the same for a given blob, regardless of the workers.

Or, in the above example, can we commit after the statement at 07:00:05, which would thus release its locks?

So the solution would be to move each single query into its own transaction.
However, this breaks somewhat the guarantees I wanted about the data model, where a partial insertion is discarded should an exception arise during the processing/insertion phase.
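
Another option I'm considering is to decouple the long processing from the writes: accumulate the statements while processing, then apply them all in one short transaction at the end, so locks are only held briefly while keeping the all-or-nothing guarantee. A rough sketch (process_and_collect is a hypothetical refactor of my plugin code):

from neomodel import db

def flush(statements):
    # Apply all buffered (query, params) pairs in one short transaction,
    # so locks are only held for the duration of the final write.
    with db.transaction:
        for query, params in statements:
            db.cypher_query(query, params)

statements = []
# process_and_collect(commit, statements)  # long phase: only appends, no writes
flush(statements)                          # short phase: all writes at once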

And today, I'm very confused by the results I get.

I attempted to remove the workers' concurrency, reducing it to a single worker processing the data sequentially:

  run_plugins:
    needs: build_matrix
    runs-on: self-hosted
    strategy:
      matrix: ${{ fromJson(needs.build_matrix.outputs.matrix) }}
      max-parallel: 1
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
       ...

and I still get that deadlock issue again at some point:

 neo4j.exceptions.TransientError: {code: Neo.TransientError.Transaction.DeadlockDetected} {message: ForsetiClient[transactionId=4612, clientId=19] can't acquire ExclusiveLock{owner=ForsetiClient[transactionId=4503, clientId=6]} on NODE_RELATIONSHIP_GROUP_DELETE(837093) because holders of that lock are waiting for ForsetiClient[transactionId=4612, clientId=19].
 Wait list:ExclusiveLock[
Client[4503] waits for [ForsetiClient[transactionId=4612, clientId=19]]]}

There shouldn't be any open or lingering transactions left, as I'm using a Python context manager that ensures the transaction is committed and closed upon exit.

from neomodel import db

with db.transaction:
    self.process_and_insert()

I don't really understand how this is possible.
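
The next thing I'll try is listing the running transactions while a worker is active, to see whether something is left hanging. A quick diagnostic sketch, assuming my user is allowed to run SHOW TRANSACTIONS:

from neomodel import db

# List currently executing transactions, their queries and how long they've been running
results, meta = db.cypher_query(
    "SHOW TRANSACTIONS YIELD transactionId, currentQuery, status, elapsedTime"
)
for row in results:
    print(row)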