Stop-the-world pauses on every query and periodically

Hi,

I am debugging why certain queries cause Neo4j to become unresponsive. I found that the same query, run with `profile cypher runtime=interpreted cypher replan=force` (and after resetting the database each time), takes vastly different amounts of time (on the order of 20 seconds versus 100 milliseconds).
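For reference, the timing comparison was done by prefixing the query exactly as above; the MATCH below is just a stand-in, not my real query:

profile cypher runtime=interpreted cypher replan=force
MATCH (n)
RETURN count(n);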

On investigating further, I can see a huge periodic increase in CPU usage every 5 minutes or so, lasting about 10 seconds. No queries are being submitted here; this is just an idle DB instance. At the end of each spike of activity, I can always see a stop-the-world pause. This is not a result of checkpointing, since that is logged elsewhere.

This is clearly impacting my ability to write and read. What is this? Is there anything I can do?

debug.log:

2022-11-15 11:06:42.940+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12398, gcTime=0, gcCount=0}
2022-11-15 11:12:47.553+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12733, gcTime=12818, gcCount=2}
2022-11-15 11:18:43.723+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13377, gcTime=13546, gcCount=2}
2022-11-15 11:23:46.394+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12506, gcTime=12664, gcCount=2}
2022-11-15 11:30:42.660+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12325, gcTime=12461, gcCount=2}
2022-11-15 11:36:42.464+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12271, gcTime=12290, gcCount=2}
2022-11-15 11:43:39.413+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12805, gcTime=12963, gcCount=2}
2022-11-15 11:48:43.079+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12803, gcTime=12901, gcCount=2}
2022-11-15 11:53:41.940+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=11640, gcTime=11711, gcCount=2}
2022-11-15 11:59:40.950+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13284, gcTime=13379, gcCount=2}
2022-11-15 12:05:40.255+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12184, gcTime=12234, gcCount=2}
2022-11-15 12:11:21.063+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12493, gcTime=12642, gcCount=2}
2022-11-15 12:17:02.112+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12988, gcTime=13104, gcCount=2}
2022-11-15 12:21:42.700+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12379, gcTime=12491, gcCount=2}

2022-11-15 12:21:49.179+0000 ERROR [o.n.b.t.p.MessageDecoder] Failed to read an inbound message:
b00f

2022-11-15 12:21:49.182+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0xc7476cb4, L:/127.0.0.1:7687 - R:/127.0.0.1:58596]
java.lang.NullPointerException: null
at org.neo4j.bolt.transaction.StatementProcessorTxManager.interrupt(StatementProcessorTxManager.java:132) ~[neo4j-bolt-4.4.4.jar:4.4.4]
at org.neo4j.bolt.runtime.statemachine.impl.AbstractBoltStateMachine.interrupt(AbstractBoltStateMachine.java:189) ~[neo4j-bolt-4.4.4.jar:4.4.4]
at org.neo4j.bolt.runtime.DefaultBoltConnection.interrupt(DefaultBoltConnection.java:344) ~[neo4j-bolt-4.4.4.jar:4.4.4]
at org.neo4j.bolt.v3.messaging.decoder.ResetMessageDecoder.decode(ResetMessageDecoder.java:57) ~[neo4j-bolt-4.4.4.jar:4.4.4]
at org.neo4j.bolt.messaging.BoltRequestMessageReader.doRead(BoltRequestMessageReader.java:90) ~[neo4j-bolt-4.4.4.jar:4.4.4]
at org.neo4j.bolt.messaging.BoltRequestMessageReader.read(BoltRequestMessageReader.java:60) ~[neo4j-bolt-4.4.4.jar:4.4.4]
at org.neo4j.bolt.transport.pipeline.MessageDecoder.channelRead0(MessageDecoder.java:60) ~[neo4j-bolt-4.4.4.jar:4.4.4]
at org.neo4j.bolt.transport.pipeline.MessageDecoder.channelRead0(MessageDecoder.java:36) ~[neo4j-bolt-4.4.4.jar:4.4.4]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:327) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:299) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
at org.neo4j.bolt.transport.pipeline.MessageAccumulator.channelRead(MessageAccumulator.java:62) [neo4j-bolt-4.4.4.jar:4.4.4]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:327) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:299) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at org.neo4j.bolt.transport.pipeline.WebSocketFrameTranslator.channelRead(WebSocketFrameTranslator.java:43) [neo4j-bolt-4.4.4.jar:4.4.4]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:93) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.handler.codec.http.websocketx.Utf8FrameValidator.channelRead(Utf8FrameValidator.java:82) [netty-codec-http-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:327) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:299) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.73.Final.jar:4.1.73.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.73.Final.jar:4.1.73.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.73.Final.jar:4.1.73.Final]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-11-15 12:27:43.465+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13254, gcTime=13255, gcCount=2}
2022-11-15 12:33:03.503+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13386, gcTime=13529, gcCount=2}
2022-11-15 12:38:42.647+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12092, gcTime=12229, gcCount=2}
2022-11-15 12:44:42.702+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12509, gcTime=12531, gcCount=2}
2022-11-15 12:49:42.761+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12396, gcTime=12552, gcCount=2}
2022-11-15 12:55:43.363+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13075, gcTime=13157, gcCount=2}
2022-11-15 13:00:43.686+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13285, gcTime=13480, gcCount=2}
2022-11-15 13:06:05.018+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12574, gcTime=12590, gcCount=2}
2022-11-15 13:12:42.878+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12593, gcTime=12706, gcCount=2}
2022-11-15 13:17:43.266+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12880, gcTime=13076, gcCount=2}
2022-11-15 13:22:42.858+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12482, gcTime=12664, gcCount=2}
2022-11-15 13:28:43.300+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12948, gcTime=13107, gcCount=2}
2022-11-15 13:34:27.204+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12701, gcTime=12700, gcCount=2}
2022-11-15 13:39:47.333+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12427, gcTime=12438, gcCount=2}
2022-11-15 13:45:48.433+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=12973, gcTime=13115, gcCount=2}
2022-11-15 13:51:44.088+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13730, gcTime=13874, gcCount=2}


dbms.memory.heap.initial_size=24g
dbms.memory.heap.max_size=24g

Hi @philastrophist - as a hunch... which JRE/JVM are you using on the local and remote instances?

Incidentally, I have recreated the DB on my local machine using the same version of Neo4j, and it works fine and is even faster than the remote server I want to use. This is despite my local machine having fewer cores (12 vs 32) and less heap memory (4 GB vs 24 GB). I really don't know where to begin.

Thank you so much for your quick response!

To address your implicit questions:

  • Even when no queries are being executed, the stop-the-world pauses still happen periodically
  • The stop-the-world pause occurs even if a query is very simple. I've looked through the query plans for these queries with PROFILE and nothing jumps out as intense memory usage or excessive db hits. Sometimes these queries do run as quickly as expected (though it seems unpredictable).
  • My database currently consists of only 9 GB of actual data under /var/lib/neo4j/data, so it should fit comfortably within the 24 GB heap. Indeed, the page cache shows 100% utilisation (see below for how I checked the memory settings).
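For completeness, the memory-related configuration can be dumped from Cypher with the built-in dbms.listConfig procedure; the filter string here is just illustrative:

CALL dbms.listConfig() YIELD name, value
WHERE name STARTS WITH 'dbms.memory'
RETURN name, value
ORDER BY name;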

The root cause is the "stop-the-world" GC pauses. The reason you're seeing CPU spikes is that the database is allocating memory and then getting stuck in periods where it has to free up (garbage collect) that memory; the log entries that say "stop-the-world" are a strong indicator that the performance interruptions are caused by that process.

So -- some Garbage Collection Basics to explain what's going on here --

A JVM (Java Virtual Machine, e.g. the one running the Neo4j server) allocates memory to newly created Java objects. This memory is part of the total heap allocation configured at startup via neo4j.conf. The JVM periodically checks for unused objects in the heap and discards them to reclaim heap memory, via a process called garbage collection (GC for short).

Assuming heap sizing has been done optimally using Neo4j's memory configuration (see https://neo4j.com/developer/kb/understanding-memory-consumption/), spikes in usage or data volume may still lead to increased heap utilisation and consequently longer and/or more frequent garbage collection. Memory leaks are another cause of increased heap utilisation, in many cases leading to a heap out-of-memory situation. A memory leak in Java is a situation where some objects are no longer used by the application but garbage collection fails to recognise this.
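As a rough sketch of the knobs involved (the values below are placeholders, not a recommendation; `neo4j-admin memrec` will suggest sizes for your actual machine), the relevant neo4j.conf settings look like:

# illustrative values only -- run `neo4j-admin memrec` for real numbers
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
# store files (your ~9 GB of data) are cached here, outside the heap
dbms.memory.pagecache.size=16g
# optional cap on total transaction memory (available in 4.x)
dbms.memory.transaction.global_max_size=2g

Note that the data files live in the page cache, not the heap, so a big heap does not by itself make a 9 GB store "fit"; the heap mainly holds query and transaction state.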

In such cases, adding more heap may simply postpone the JVM running out of heap space (throwing the java.lang.OutOfMemoryError: Java heap space error). Additionally, increasing the amount of Java heap space also tends to increase the length of GC pauses. One can take heap dumps and analyse them with e.g. Eclipse MAT (ensure that your machine has sufficient memory for the analysis) to diagnose memory leaks. Other JDK tools such as jconsole, VisualVM and jstat, and indeed Neo4j's own gc.log, may help identify any outlier transactions.
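A minimal sketch of how one might watch GC activity and capture a heap dump from the shell (the PID is a placeholder; jstat and jmap ship with the JDK):

# sample heap occupancy and GC counts every 5 seconds
jstat -gcutil <neo4j-pid> 5000

# dump live objects for offline analysis in Eclipse MAT
jmap -dump:live,format=b,file=/tmp/neo4j-heap.hprof <neo4j-pid>

Neo4j can also write its own gc.log if dbms.logs.gc.enabled=true is set in neo4j.conf.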

So what can you do about it?

Based on the framing of your question, it's probably about the query plan associated with your query. You have a 24 GB heap (which is fairly large), and I'm going to guess that the query you're running traverses a large graph that is probably bigger than 24 GB.

Exactly what to do about it I can't say, because you haven't shared the query. I would start by ensuring that you've got good indexes on your graph, and I would run PROFILE on the query to see what the resulting query plan is.
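As a sketch of that workflow (the Person label and name property are placeholders; substitute whatever your query actually anchors on):

// make sure the property the query anchors on is indexed (4.x syntax)
CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name);

// then inspect the plan: look for AllNodesScan / NodeByLabelScan operators and large db-hit counts
PROFILE
MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(friend)
RETURN friend.name;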

Have a look at this related thread, where one of our support engineers helped step through the process. The approach you see in that thread is the same thought process you need to bring to your query.

https://community.neo4j.com/t5/neo4j-graph-platform/vm-stop-the-world-pause-for-simple-load-csv-query-for-creating-a/td-p/25233

Here's htop on the remote (screenshot):

From the debug.log files of both installations:

On the remote (not bundled - Linux host 3.10.0-1160.36.2.el7.x86_64 #1 SMP Mon Jul 26 08:21:44 CDT 2021 x86_64 x86_64 x86_64 GNU/Linux):

VM Name: OpenJDK 64-Bit Server VM
VM Vendor: Red Hat, Inc.
VM Version: 11.0.14+9-LTS
JIT compiler: HotSpot 64-Bit Tiered Compilers
VM Arguments: [-Xms25165824k, -Xmx25165824k, -XX:+ExitOnOutOfMemoryError, -XX:+UseG1GC, -XX:-OmitStackTraceInFastThrow, -XX:+$

/usr/bin/java --version
openjdk 11.0.17 2022-10-18 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.17.0.8-2.el7_9) (build 11.0.17+8-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.17.0.8-2.el7_9) (build 11.0.17+8-LTS, mixed mode, sharing)

On my local (bundled with the desktop interface - Linux host 5.15.0-41-generic #44~20.04.1-Ubuntu SMP Fri Jun 24 13:27:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux):

VM Name: OpenJDK 64-Bit Server VM
VM Vendor: Azul Systems, Inc.
VM Version: 11.0.17+8-LTS
JIT compiler: HotSpot 64-Bit Tiered Compilers
VM Arguments: [-Xms524288k, -Xmx1048576k, -XX:+UseG1GC, -XX:-OmitStackTraceInFastThrow, -XX:+AlwaysPreTouch, -XX:+UnlockExperimentalVMOptions, -XX:+TrustFinalNonStaticFields, -XX:+DisableExplicitGC, -XX:MaxInlineLevel=15, -XX:-UseBiasedLocking, -Djdk.nio.maxCachedBufferSize=262144, -Dio.netty.tryReflectionSetAccessible=true, -Djdk.tls.ephemeralDHKeySize=2048, -Djdk.tls.rejectClientInitiatedRenegotiation=true, -XX:FlightRecorderOptions=stackdepth=256, -XX:+UnlockDiagnosticVMOptions, -XX:+DebugNonSafepoints, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -Dfile.encoding=UTF-8]

"/home/user/.config/Neo4j Desktop/Application/distributions/java/zulu11.60.19-ca-jdk11.0.17/bin/java" --version
openjdk 11.0.17 2022-10-18 LTS
OpenJDK Runtime Environment Zulu11.60+19-CA (build 11.0.17+8-LTS)
OpenJDK 64-Bit Server VM Zulu11.60+19-CA (build 11.0.17+8-LTS, mixed mode)

Is this what you're looking for?

For the remote, I can see that the Java that is in use (visible from top) is different from the one described in debug.log.
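In case it helps, this is roughly how I'd confirm which JVM binary the running Neo4j process was actually started from (the pgrep pattern is a guess at the process name):

# find the Neo4j JVM process and resolve its java binary
pid=$(pgrep -f neo4j | head -n 1)
readlink -f /proc/$pid/exe
ps -o pid,args -p $pid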

I'm not sure I understand how optimising my query can stop memory leaks.