"Fatal error occurred when handling a client connection" causes crash

Hi,

I've been running Neo4j Community for an internal app for several months. Recently I've started to get the crash below (from debug.log), which causes Neo4j and the VM to become unusable for ~1 hr unless it's restarted. The "Detected VM stop-the-world pause" warning repeats until Neo4j restarts itself.

Neo4j Community 3.5.6
Linux (Ubuntu 18.04) on an Azure VM
.NET driver 1.7.2

Prior to the final crash there are many "ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: ..." entries which don't result in a crash. From reading around GitHub and this forum, it seems the eventual crash is a result of the server connection pool becoming full?

I've traced the cause to connections not being closed correctly on the app side: after a switch to UWP, the app doesn't close its connections properly when it goes into a suspended state. This I can fix.

However, is there anything I can do on the server side to reset the connection pool at a certain limit or time period and prevent the crash?

Thanks,

Paul

2019-07-24 07:15:45.615+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0xf2451b2e, L:/10.0.0.4:7687 - R:/202.4.165.254:46375] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
at io.netty.channel.unix.Errors.ioResult(Errors.java:146)
at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:178)
at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:343)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:785)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:433)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:330)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
at java.lang.Thread.run(Thread.java:748)
2019-07-24 07:33:06.745+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1343, gcTime=0, gcCount=0}
2019-07-24 07:33:09.195+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2199, gcTime=0, gcCount=0}

Hi,

The Bolt server does not filter incoming connections as long as they are properly authenticated, and it never actively kills an idle connection. If you want to kill connections from a non-responding application, you need to kill them at the operating system or firewall level.
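A sketch of how that could look on Linux, assuming the default Bolt port 7687 (the peer address is just the one from the log above; `ss -K` only works on kernels built with CONFIG_INET_DIAG_DESTROY):

```shell
# List the current Bolt connections (7687 is the default Bolt port):
ss -tn '( sport = :7687 )'

# Kill one dead peer's socket with ss (needs CONFIG_INET_DIAG_DESTROY):
#   ss -K dst 202.4.165.254 dport = 46375
# Or drop the peer wholesale at the firewall instead:
#   iptables -I INPUT -s 202.4.165.254 -p tcp --dport 7687 -j DROP
```

The destructive commands are left commented out; run the listing first to confirm which sockets you actually want to kill.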

The Bolt server connection pool will never be full, because there is no limit on its size. The error you are getting looks like the operating system failing to establish new socket connections, which happens, for example, when no ports are available.

The Bolt server only enforces a limit on worker threads (see the docs: https://neo4j.com/docs/operations-manual/current/performance/bolt-thread-pool-configuration/), but that does not really help in your case, since the thread limit only applies to the number of concurrent jobs. What you need is a function that goes over all connections on the server every 4h and kills the idle ones. I will make a note and hopefully we will add it in a future release.
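For completeness, those thread-pool limits live in neo4j.conf; the setting names below are the ones documented on that page for 3.5, and the values are just illustrative (they happen to match the documented defaults):

```
# neo4j.conf – Bolt thread pool limits (illustrative values)
dbms.connector.bolt.thread_pool_min_size=5
dbms.connector.bolt.thread_pool_max_size=400
dbms.connector.bolt.thread_pool_keep_alive=5m
```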

Zhen

Hi Zhen,

Thanks for the reply - I'll first sort out the app side to see if that helps.

Regarding the closing of TCP sockets on the server: following your point about ports, I spent some time learning more about Linux, and it looks like the Bolt TCP connections use keepalive. As I understand it, this should close dead connections once the keepalive timers (tcp_keepalive_time plus the probe interval) expire, so it's surprising that there are no sockets available: we have very few users, and overnight all the connections would likely be closed. I recently had the same crash while monitoring the connections, and there were only around 10 open at the time.
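For anyone reading later, my understanding of the kernel timers (the numbers below are the usual Linux defaults; check the live values on your own box with sysctl):

```shell
# A keepalive probe is only sent after tcp_keepalive_time seconds of
# idleness, and the socket is only closed after tcp_keepalive_probes
# failed probes spaced tcp_keepalive_intvl seconds apart.
# Live values: sysctl net.ipv4.tcp_keepalive_time \
#   net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl
ka_time=7200 ka_probes=9 ka_intvl=75
echo $(( ka_time + ka_probes * ka_intvl ))   # 7875 s (~2h11m) before a dead peer is reaped
```

So with defaults a vanished client can hold its socket open for over two hours, which is much longer than just tcp_keepalive_intvl.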

Is there any more logging etc. I can turn on or look at in the Community version to troubleshoot further?

Thanks,

Paul

Are there any other errors besides the connection-timed-out error?

It seems something is holding up resources on your server and causing the instability; the Bolt errors might just be a symptom of the problem. 10 connections are a trivial load for the Bolt server to handle. Are there any other services running on the machine? If not, maybe try to gather more resource-usage data via metrics (https://neo4j.com/docs/operations-manual/3.5/monitoring/metrics/). Finally, if all else fails, then maybe a thread dump and/or heap dump?
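If you go the metrics route, the relevant switches live in neo4j.conf; the setting names below are from the 3.5 metrics docs (note that the metrics subsystem may not be available in every edition, so treat this as a sketch to check against the docs):

```
# neo4j.conf – enable CSV metrics output (setting names per the 3.5 docs)
metrics.enabled=true
metrics.csv.enabled=true
metrics.csv.interval=3s
```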

Zhen

Hi Zhen,

Just Neo4j running on the server. The only other errors in the logs are the types below; the LEAK error occurred around 13 hrs before the last crash.

I'll look into a clean install of the server etc. and the additional logging.

2019-07-25 06:58:18.057+0000 WARN [o.n.b.t.p.ProtocolHandshaker] Unsupported connection type: 'HTTP'. Bolt protocol only operates over a TCP connection or WebSocket.

2019-07-25 07:01:27.477+0000 WARN [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection, remote peer unexpectedly closed connection: [id: 0xaee5fff7, L:/10.0.0.4:7687 - R:/202.4.165.254:45197]

2019-07-29 05:01:35.028+0000 ERROR [io.netty.util.ResourceLeakDetector] LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.AbstractByteBufAllocator.compositeDirectBuffer(AbstractByteBufAllocator.java:221)
	io.netty.buffer.AbstractByteBufAllocator.compositeBuffer(AbstractByteBufAllocator.java:199)
	io.netty.handler.codec.MessageAggregator.decode(MessageAggregator.java:255)
	io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
	io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
	io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.handler.codec.ByteToMessageDecoder.handlerRemoved(ByteToMessageDecoder.java:249)
	io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
	io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
	io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799)
	io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:433)
	io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:330)
	io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
	java.lang.Thread.run(Thread.java:748)

Having spent some more time investigating, I think the log was pointing me in the wrong direction: the "Fatal error occurred when handling a client connection" errors just happened to coincide with the crash.

I could not reproduce the crash on a local copy of the database, and I noticed it occurred when running a large, complicated query, which led me to https://neo4j.com/docs/operations-manual/current/performance/memory-configuration/. I've increased the memory on the VM running the database and optimized the memory configuration using neo4j-admin memrec; since doing that I've not had a crash.
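For anyone hitting the same thing: memrec prints recommendations sized for the machine it runs on, which then go into neo4j.conf. The values below are purely illustrative (e.g. for a small VM), not what memrec will print for you:

```
# Run:  neo4j-admin memrec
# …then copy its recommendations into neo4j.conf. Illustrative values:
dbms.memory.heap.initial_size=3g
dbms.memory.heap.max_size=3g
dbms.memory.pagecache.size=2g
```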