Server crashes on Full GC Failure


(Tim Hanssen) #1

Last night both our servers crashed (or stopped working for about 10 minutes) after a GC error.

This is the GC log record from the last incident this morning:

2018-10-15T11:10:46.108+0200: 1367360.645: [Full GC (Allocation Failure) 23G->21G(23G), 85.9388715 secs] [Eden: 0.0B(1208.0M)->0.0B(1208.0M) Survivors: 0.0

  • Ubuntu 14 LTS
  • 3.4.7 Enterprise in HA clustering
  • BOLT

The logs from both servers are in this drive: https://drive.google.com/drive/folders/1PhwMXwImOYjBMzFvGqk3REopX9hxnHki?usp=sharing

We restarted N2 after the incident; this morning N1 froze again with the same error.

I guess our memory settings are not optimized.

dbms.memory.heap.initial_size=24200m
dbms.memory.heap.max_size=24200m
dbms.memory.pagecache.size=28100m

Both servers are running with 6 CPU on 64 GB.
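
As a rough sanity check of these settings, a small Python sketch can add up the configured heap and page cache against the host RAM. It assumes the `m` suffix means MiB and that the 64 GB host exposes 64 * 1024 MiB; the function name `remaining_for_os` is made up for illustration.

```python
# Values copied from the settings above.
# Assumption: "m" means MiB and the host has 64 * 1024 MiB of RAM.
HEAP_MAX_MIB = 24200      # dbms.memory.heap.max_size
PAGECACHE_MIB = 28100     # dbms.memory.pagecache.size
HOST_RAM_MIB = 64 * 1024


def remaining_for_os(host_mib, heap_mib, pagecache_mib):
    """Return the MiB left once heap and page cache are reserved."""
    return host_mib - heap_mib - pagecache_mib


reserved = HEAP_MAX_MIB + PAGECACHE_MIB
left = remaining_for_os(HOST_RAM_MIB, HEAP_MAX_MIB, PAGECACHE_MIB)

print(f"reserved by heap + page cache: {reserved} MiB")
print(f"left for the OS and other native memory: {left} MiB")
```

Under those assumptions the two settings reserve 52300 MiB and leave 13236 MiB for everything else on the machine, so the total budget itself does not look oversubscribed.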


(Michael Hunger) #2

Can you enable query logging and gc logging and share the query logs and gc logs?


(Tim Hanssen) #3

Hi Michael,

The GC logs are already in the Google Drive folder, together with the debug logs and the neo log.

I don't think query logging will be of much use. We run thousands of queries a minute.


(Michael Hunger) #4

Query logs would still be helpful.
You might choose to add a threshold, but then it might filter out some relevant bits.


(Michael Hunger) #5

Looking at the logs, there are a lot of resource exhaustions.

The Bolt thread pool seems to be full (you might want to increase the pool size), leading to a lot of rejected/aborted queries.

Also, the heap utilization is almost always near the maximum (the GC log starts with 21G of 23G) and creeps upwards to 23G.
Already at startup the store uses almost all of the memory, which is really odd.
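
To make that observation concrete, here is a minimal Python sketch that pulls the numbers out of a G1 "Full GC" line such as the one quoted in post #1. The regular expression and the helper name `parse_full_gc` are illustrative assumptions, and the pattern only handles sizes printed in whole `G` units as in that line.

```python
import re

# Matches the summary part of a G1 "Full GC" line.
# Assumption: sizes are whole numbers with a "G" suffix, as in the quoted line.
FULL_GC = re.compile(
    r"\[Full GC \((?P<cause>[^)]+)\)\s+"
    r"(?P<before>\d+)G->(?P<after>\d+)G\((?P<total>\d+)G\),\s+"
    r"(?P<secs>[\d.]+) secs\]"
)


def parse_full_gc(line):
    """Return cause, freed GB, occupancy after GC and pause time, or None."""
    m = FULL_GC.search(line)
    if m is None:
        return None
    before, after, total = (int(m[k]) for k in ("before", "after", "total"))
    return {
        "cause": m["cause"],
        "freed_gb": before - after,
        "occupancy_after": after / total,
        "pause_secs": float(m["secs"]),
    }


line = (
    "2018-10-15T11:10:46.108+0200: 1367360.645: "
    "[Full GC (Allocation Failure) 23G->21G(23G), 85.9388715 secs]"
)
info = parse_full_gc(line)
print(info)
```

For the quoted line this reports only 2G freed after a pause of roughly 86 seconds, with the heap still more than 90% occupied afterwards, which is what makes the startup occupancy look suspicious.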

Just out of curiosity: why are you still using HA with 3.4? It's meant to go away in 4.0, so you might want to consider migrating.

Would it be possible to start a test instance on a copy of the store with the same page-cache setting but less heap e.g. 4G or 6G and share debug.log? And then also take a heapdump to figure out what takes the initial memory. jmap -dump:file=myheap.bin {pid of the JVM}


(Tim Hanssen) #6

Hi Michael,

The bolt thread pool seems to be full (you might want to increase the pool size) leading to a lot of rejected/aborted queries.

  • We will increase the pool size and let you know if that helps.

Just out of curiosity: why are you still using HA with 3.4? It's meant to go away in 4.0, so you might want to consider migrating.

  • We will move to CC sometime in the next month. Until now we used a 2-server setup with an arbiter. Since that is not an option with CC, we first needed to move to a new multi-DC hosting cluster.

Would it be possible to start a test instance on a copy of the store with the same page-cache setting but less heap e.g. 4G or 6G and share debug.log? And then also take a heapdump to figure out what takes the initial memory. jmap -dump:file=myheap.bin {pid of the JVM}

  • We just restarted the server with a 6G heap, and after the server came online I took the heap dump. The dump and all the logs are in the Google Drive:

https://drive.google.com/drive/folders/1InbtELmbuThBCl-PXtpDvtTId64ZT6j2?usp=sharing

Thanks again for looking into it.


(Michael Hunger) #7

Unfortunately, as you can see from the logs, that server has a very different startup behavior: the heap memory is all free at startup and is also freed during GC.

# debug.log
Memory Pool: G1 Old Gen (Heap memory): committed=5.55 GB, used=0.00 B, max=5.86 GB, threshold=0.00 B

# gc.log
[Eden: 2486.0M(2486.0M)->0.0B(3536.0M) Survivors: 64.0M->64.0M Heap: 2617.4M(6000.0M)->130.4M(6000.0M)]

So not really sure how to continue, except trying to take a heap dump of that prod server.
This is definitely something that should be a support issue.
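
For comparison with the production server, a short Python sketch can read the `Heap:` part of the gc.log line quoted above and compute how much a collection reclaimed. The helper name `heap_transition` and the unit table are illustrative assumptions, not part of any Neo4j or JDK tooling.

```python
import re

# Conversion of the size suffixes printed by G1 into MiB.
UNIT_TO_MIB = {"B": 1 / (1024 * 1024), "K": 1 / 1024, "M": 1.0, "G": 1024.0}

# Matches "Heap: <used>(<capacity>)-><used>(<capacity>)" in a G1 log line.
HEAP = re.compile(
    r"Heap: (?P<before>[\d.]+)(?P<before_unit>[BKMG])\([\d.]+[BKMG]\)->"
    r"(?P<after>[\d.]+)(?P<after_unit>[BKMG])"
    r"\((?P<cap>[\d.]+)(?P<cap_unit>[BKMG])\)"
)


def heap_transition(line):
    """Return heap usage before/after a collection in MiB, or None."""
    m = HEAP.search(line)
    if m is None:
        return None
    before = float(m["before"]) * UNIT_TO_MIB[m["before_unit"]]
    after = float(m["after"]) * UNIT_TO_MIB[m["after_unit"]]
    capacity = float(m["cap"]) * UNIT_TO_MIB[m["cap_unit"]]
    return {
        "before_mib": before,
        "after_mib": after,
        "capacity_mib": capacity,
        "freed_fraction": (before - after) / before if before else 0.0,
    }


line = (
    "[Eden: 2486.0M(2486.0M)->0.0B(3536.0M) Survivors: 64.0M->64.0M "
    "Heap: 2617.4M(6000.0M)->130.4M(6000.0M)]"
)
result = heap_transition(line)
print(result)
```

On the test instance this collection frees about 95% of the used heap, in contrast to the production Full GC that only went from 23G to 21G.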


(Tim Hanssen) #8

Hey Michael,

I can take a heap dump from the prod server. Do I need to do that right after a restart, or now while it's running?


(Michael Hunger) #9

Take it once it's started and the debug log shows 23G used of 23G, or the GC log shows the heap almost full at 23G.