"Fatal error occurred when handling a client connection" causes crash

Hi,

I've been running Neo4j Community for an internal app for several months. Recently I've started to get the crash below (from debug.log), which causes Neo4j and the VM to become unusable for ~1 hr unless it's restarted. The "Detected VM stop-the-world pause" warning repeats until Neo4j restarts itself.

Neo4j Community 3.5.6
Linux (Ubuntu 18.04) on an Azure VM
.NET driver 1.7.2

Prior to the final crash there are many "ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: ..." entries which don't result in a crash. From reading around GitHub and this forum, it seems the eventual crash is a result of the server connection pool becoming full?

I've traced the cause to connections not being closed correctly on the app side: after a switch to UWP, the app doesn't close its connections properly when it goes into a suspended state. This I can fix.

However, is there anything I can do on the server side to reset the connection pool at a certain limit or time period and prevent the crash?

Thanks,

Paul

2019-07-24 07:15:45.615+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0xf2451b2e, L:/10.0.0.4:7687 - R:/202.4.165.254:46375] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
at io.netty.channel.unix.Errors.ioResult(Errors.java:146)
at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:178)
at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:343)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:785)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:433)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:330)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
at java.lang.Thread.run(Thread.java:748)
2019-07-24 07:33:06.745+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1343, gcTime=0, gcCount=0}
2019-07-24 07:33:09.195+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2199, gcTime=0, gcCount=0}

Hi,

The Bolt server does not filter incoming connections as long as they are properly authenticated, and it never actively kills an idle connection. If you want to kill connections from a non-responding application, you need to kill them at the operating system or firewall level.
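A sketch of how that could look on Linux, assuming the default Bolt port 7687 (the peer address is just the one from the log above; `ss -K` only works on kernels built with CONFIG_INET_DIAG_DESTROY):

```shell
# List the current Bolt connections (7687 is the default Bolt port):
ss -tn '( sport = :7687 )'

# Kill one dead peer's socket with ss (needs CONFIG_INET_DIAG_DESTROY):
#   ss -K dst 202.4.165.254 dport = 46375
# Or drop the peer wholesale at the firewall instead:
#   iptables -I INPUT -s 202.4.165.254 -p tcp --dport 7687 -j DROP
```

The destructive commands are left commented out; run the listing first to confirm which sockets you actually want to kill.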

The Bolt server connection pool will never be full, because there is no limit on its size. The error you are getting looks like the operating system failing to establish new socket connections, which happens, for example, when no ports are available.

The Bolt server only enforces a limit on worker threads (see the docs: https://neo4j.com/docs/operations-manual/current/performance/bolt-thread-pool-configuration/), but that does not really help in your case, since the thread limit only applies to the number of concurrent jobs. What you need is a function that goes over all connections on the server every 4h and kills the idle ones. I will make a note and hopefully we will add it in a future release.
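For completeness, those thread-pool limits live in neo4j.conf; the setting names below are the ones documented on that page for 3.5, and the values are just illustrative (they happen to match the documented defaults):

```
# neo4j.conf – Bolt thread pool limits (illustrative values)
dbms.connector.bolt.thread_pool_min_size=5
dbms.connector.bolt.thread_pool_max_size=400
dbms.connector.bolt.thread_pool_keep_alive=5m
```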

Zhen

Hi Zhen,

Thanks for the reply - I'll first sort out the app side to see if that helps.

Regarding the closing of TCP sockets on the server: following your point about ports, I spent some time learning more about Linux, and it looks like the Bolt TCP connections use keepalive. As I understand it, this should close dead connections once the keepalive timers (tcp_keepalive_time plus the probe interval) expire, so it's surprising that there are no sockets available: we have very few users, and overnight all the connections would likely be closed. I recently had the same crash while monitoring the connections, and there were only around 10 open at the time.
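For anyone reading later, my understanding of the kernel timers (the numbers below are the usual Linux defaults; check the live values on your own box with sysctl):

```shell
# A keepalive probe is only sent after tcp_keepalive_time seconds of
# idleness, and the socket is only closed after tcp_keepalive_probes
# failed probes spaced tcp_keepalive_intvl seconds apart.
# Live values: sysctl net.ipv4.tcp_keepalive_time \
#   net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl
ka_time=7200 ka_probes=9 ka_intvl=75
echo $(( ka_time + ka_probes * ka_intvl ))   # 7875 s (~2h11m) before a dead peer is reaped
```

So with defaults a vanished client can hold its socket open for over two hours, which is much longer than just tcp_keepalive_intvl.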

Is there any more logging etc. I can turn on or look at in the Community version to troubleshoot further?

Thanks,

Paul

Are there any other errors besides the connection-timed-out error?

It seems something is holding up resources on your server and causing the instability; the Bolt errors might just be a symptom of the problem. 10 connections are a trivial load for the Bolt server to handle. Are there any other services running on the machine? If not, maybe try to gather more resource-usage data via metrics (https://neo4j.com/docs/operations-manual/3.5/monitoring/metrics/). Finally, if all else fails, then maybe a thread dump and/or heap dump?
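If you go the metrics route, the relevant switches live in neo4j.conf; the setting names below are from the 3.5 metrics docs (note that the metrics subsystem may not be available in every edition, so treat this as a sketch to check against the docs):

```
# neo4j.conf – enable CSV metrics output (setting names per the 3.5 docs)
metrics.enabled=true
metrics.csv.enabled=true
metrics.csv.interval=3s
```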

Zhen

Hi Zhen,

Just Neo4j running on the server. The only other errors in the logs are the types below; the LEAK error occurred around 13 hrs before the last crash.

I'll look into a clean install of the server etc. and the additional logging.

2019-07-25 06:58:18.057+0000 WARN [o.n.b.t.p.ProtocolHandshaker] Unsupported connection type: 'HTTP'. Bolt protocol only operates over a TCP connection or WebSocket.

2019-07-25 07:01:27.477+0000 WARN [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection, remote peer unexpectedly closed connection: [id: 0xaee5fff7, L:/10.0.0.4:7687 - R:/202.4.165.254:45197]

2019-07-29 05:01:35.028+0000 ERROR [io.netty.util.ResourceLeakDetector] LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.AbstractByteBufAllocator.compositeDirectBuffer(AbstractByteBufAllocator.java:221)
	io.netty.buffer.AbstractByteBufAllocator.compositeBuffer(AbstractByteBufAllocator.java:199)
	io.netty.handler.codec.MessageAggregator.decode(MessageAggregator.java:255)
	io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
	io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
	io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.handler.codec.ByteToMessageDecoder.handlerRemoved(ByteToMessageDecoder.java:249)
	io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
	io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
	io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799)
	io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:433)
	io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:330)
	io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
	java.lang.Thread.run(Thread.java:748)

Having spent some more time investigating, I think the log was pointing me in the wrong direction: the "Fatal error occurred when handling a client connection" errors just happened to coincide with the crash.

I could not reproduce the crash on a local copy of the database, and I noticed it occurred when running a large, complicated query, which led me to https://neo4j.com/docs/operations-manual/current/performance/memory-configuration/. I've increased the memory on the VM running the database and optimized the memory configuration using neo4j-admin memrec; since doing that I've not had a crash.
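For anyone hitting the same thing: memrec prints recommendations sized for the machine it runs on, which then go into neo4j.conf. The values below are purely illustrative (e.g. for a small VM), not what memrec will print for you:

```
# Run:  neo4j-admin memrec
# …then copy its recommendations into neo4j.conf. Illustrative values:
dbms.memory.heap.initial_size=3g
dbms.memory.heap.max_size=3g
dbms.memory.pagecache.size=2g
```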