Server not responding for x seconds

operations

(Tim Hanssen) #1

We are experiencing some interruptions in requested handled by one of our servers (our primary master). It looks like the server just stops responding for a few seconds. The logs don't show any message, but we can see it happen in the security.log

Restarting the server seems to fix the issue for a few days.

  • 29 GB Total Store Size
  • 6 CPU
  • 64 GB memory
  • Ubuntu 14 LTS

dbms.memory.heap.initial_size=24200m
dbms.memory.heap.max_size=24200m
dbms.memory.pagecache.size=28100m

  • 3.4 Enterprise (in HAPROXY cluster)
  • BOLT
  • neo4j.log / debug.log show no messages

security.log:

2018-09-29 12:27:13.790+0000 INFO [neo4j]: logged in
2018-09-29 12:27:13.809+0000 INFO [neo4j]: logged in
2018-09-29 12:27:13.941+0000 INFO [api5-eu1]: logged in
--- PAUSES ----
2018-09-29 12:27:24.542+0000 INFO [neo4j]: logged in
2018-09-29 12:27:24.543+0000 INFO [api5-eu1]: logged in
2018-09-29 12:27:24.543+0000 INFO [api5-eu2]: logged in
2018-09-29 12:27:24.543+0000 INFO [api5-eu3]: logged in

2018-09-29 12:21:24.999+0000 INFO [neo4j]: logged in
2018-09-29 12:21:25.001+0000 INFO [neo4j]: logged in
2018-09-29 12:21:25.200+0000 INFO [api5-eu2]: logged in
--- PAUSES ----
2018-09-29 12:21:36.293+0000 INFO [neo4j]: logged in
2018-09-29 12:21:36.294+0000 INFO [neo4j]: logged in
2018-09-29 12:21:36.294+0000 INFO [neo4j]: logged in
2018-09-29 12:21:36.294+0000 INFO [neo4j]: logged in

It happens about every 2/3 minutes.
I also checked the gc.log:

2018-09-29T14:21:23.192+0200: 961797.109: Total time for which application threads were stopped: 0.0324231 seconds, Stopping threads took: 0.0002327 seconds
2018-09-29T14:21:23.192+0200: 961797.109: [GC concurrent-root-region-scan-start]
2018-09-29T14:21:23.200+0200: 961797.118: [GC concurrent-root-region-scan-end, 0.0087465 secs]
2018-09-29T14:21:23.200+0200: 961797.118: [GC concurrent-mark-start]
2018-09-29T14:21:36.259+0200: 961810.176: Total time for which application threads were stopped: 0.0035505 seconds, Stopping threads took: 0.0002953 seconds
2018-09-29T14:21:36.261+0200: 961810.179: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 79691776 bytes, new threshold 15 (max 15)

  • age 1: 1465304 bytes, 1465304 total
  • age 2: 635936 bytes, 2101240 total
  • age 3: 775168 bytes, 2876408 total
  • age 4: 646584 bytes, 3522992 total
  • age 5: 837128 bytes, 4360120 total
  • age 6: 214200 bytes, 4574320 total
  • age 7: 646888 bytes, 5221208 total
  • age 8: 152920 bytes, 5374128 total
  • age 9: 105456 bytes, 5479584 total
  • age 10: 130528 bytes, 5610112 total
  • age 11: 176752 bytes, 5786864 total
  • age 12: 324216 bytes, 6111080 total
  • age 13: 135240 bytes, 6246320 total
  • age 14: 236096 bytes, 6482416 total
  • age 15: 1346704 bytes, 7829120 total
    , 0.0304577 secs]
    [Parallel Time: 28.5 ms, GC Workers: 6]
    [GC Worker Start (ms): Min: 961810179.3, Avg: 961810179.9, Max: 961810181.1, Diff: 1.7]
    [Ext Root Scanning (ms): Min: 1.9, Avg: 6.3, Max: 15.2, Diff: 13.3, Sum: 37.6]
    [Update RS (ms): Min: 8.7, Avg: 17.8, Max: 24.6, Diff: 15.8, Sum: 106.5]
    [Processed Buffers: Min: 5, Avg: 21.8, Max: 32, Diff: 27, Sum: 131]
    [Scan RS (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.3, Sum: 1.2]
    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
    [Object Copy (ms): Min: 0.1, Avg: 3.4, Max: 4.2, Diff: 4.1, Sum: 20.4]
    [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum: 0.7]
    [Termination Attempts: Min: 1, Avg: 5.5, Max: 11, Diff: 10, Sum: 33]
    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.2]
    [GC Worker Total (ms): Min: 26.6, Avg: 27.8, Max: 28.4, Diff: 1.7, Sum: 166.8]
    [GC Worker End (ms): Min: 961810207.7, Avg: 961810207.7, Max: 961810207.7, Diff: 0.0][Code Root Fixup: 0.0 ms][Code Root Purge: 0.0 ms]
    [Clear CT: 0.4 ms]
    [Other: 1.5 ms]
    [Choose CSet: 0.0 ms]
    [Ref Proc: 0.5 ms]
    [Ref Enq: 0.0 ms]
    [Redirty Cards: 0.1 ms]
    [Humongous Register: 0.1 ms]
    [Humongous Reclaim: 0.0 ms]
    [Free CSet: 0.2 ms]
    [Eden: 1192.0M(1192.0M)->0.0B(1192.0M) Survivors: 16.0M->16.0M Heap: 23.5G(23.6G)->22.4G(23.6G)]
    [Times: user=0.12 sys=0.01, real=0.03 secs]
    2018-09-29T14:21:36.292+0200: 961810.210: Total time for which application threads were stopped: 0.0337601 seconds, Stopping threads took: 0.0000933 seconds

DB Server Manager pages loading slowly
(Michael Hunger) #2

Can you please enable GC logging and share those logs as well as the debug.log?

Area you using neo4j metrics monitoring? If not please enable it and also share the metrics from those time periods.