Neo4j Community Server Thread stuck at 100% for over a day

performance

(D Barnhill) #1

I installed Neo4j Community Edition 3.4.7 on a Linux server, then loaded it with several gigabytes of data. There are no queries running, but there is one Java thread stuck at 100% CPU. There is no disk I/O. If I try to stop neo4j using sudo systemctl stop neo4j it does not shut down in time and is killed by the system. If I start neo4j again, the 100% thread comes back.

The last part of my debug log is:
2018-10-03 01:34:24.247+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 548
2018-10-03 01:35:01.874+0000 INFO [o.n.k.i.a.i.IndexPopulationJob] Completed node store scan. Flushing all pending updates.
BatchingMultipleIndexPopulator{activeTasks=3, executor=java.util.concurrent.ThreadPoolExecutor@12cd0ed2[Running, pool size = 7, active threads = 3, queued tasks = 0, completed tasks = 1791], batchedUpdates = [0 updates], queuedUpdates = 0}
2018-10-03 01:35:01.874+0000 INFO [o.n.k.i.a.i.IndexPopulationJob] Shutting down executor.
BatchingMultipleIndexPopulator{activeTasks=3, executor=java.util.concurrent.ThreadPoolExecutor@12cd0ed2[Running, pool size = 7, active threads = 3, queued tasks = 0, completed tasks = 1792], batchedUpdates = [0 updates], queuedUpdates = 0}
2018-10-03 01:35:10.957+0000 INFO [o.n.k.i.a.i.IndexPopulationJob] Index creation finished. Index [:Diagnosis(IsCancer) [provider: {key=lucene+native, version=2.0}]] is ONLINE.
2018-10-03 01:49:24.276+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 3981434 checkpoint started...
2018-10-03 01:49:26.639+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 3981434 to [/data/neo4j/data/databases/graph.db/neostore.counts.db.b], from [/data/neo4j/data/databases/graph.db/neostore.counts.db.a].
2018-10-03 01:49:52.672+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 3981434 checkpoint completed in 28s 395ms
2018-10-03 01:49:52.672+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 548

The last part of the neo4j log:

Oct 02 23:26:49 nomad-neo4j-qa systemd[1]: Stopping Neo4j Graph Database...
Oct 02 23:26:49 nomad-neo4j-qa neo4j[7082]: 2018-10-02 23:26:49.358+0000 INFO Neo4j Server shutdown initiated by request
Oct 02 23:26:49 nomad-neo4j-qa neo4j[7082]: 2018-10-02 23:26:49.401+0000 INFO Stopping...
Oct 02 23:28:49 nomad-neo4j-qa systemd[1]: neo4j.service: State 'stop-sigterm' timed out. Killing.
Oct 02 23:28:52 nomad-neo4j-qa systemd[1]: neo4j.service: Main process exited, code=killed, status=9/KILL
Oct 02 23:28:52 nomad-neo4j-qa systemd[1]: Stopped Neo4j Graph Database.
Oct 02 23:28:52 nomad-neo4j-qa systemd[1]: neo4j.service: Unit entered failed state.
Oct 02 23:28:52 nomad-neo4j-qa systemd[1]: neo4j.service: Failed with result 'signal'.
Oct 02 23:30:52 nomad-neo4j-qa systemd[1]: Started Neo4j Graph Database.
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: Active database: graph.db
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: Directories in use:
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: home: /var/lib/neo4j
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: config: /etc/neo4j
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: logs: /var/log/neo4j
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: plugins: /var/lib/neo4j/plugins
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: import: /data/neo4j/import
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: data: /data/neo4j/data
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: certificates: /var/lib/neo4j/certificates
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: run: /var/run/neo4j
Oct 02 23:30:52 nomad-neo4j-qa neo4j[21313]: Starting Neo4j.
Oct 02 23:30:54 nomad-neo4j-qa neo4j[21313]: 2018-10-02 23:30:54.387+0000 INFO ======== Neo4j 3.4.7 ========
Oct 02 23:30:54 nomad-neo4j-qa neo4j[21313]: 2018-10-02 23:30:54.433+0000 INFO Starting...
Oct 02 23:31:03 nomad-neo4j-qa neo4j[21313]: 2018-10-02 23:31:03.997+0000 INFO Bolt enabled on 0.0.0.0:7687.
Oct 02 23:31:31 nomad-neo4j-qa neo4j[21313]: 2018-10-02 23:31:31.887+0000 INFO Started.
Oct 02 23:31:36 nomad-neo4j-qa neo4j[21313]: 2018-10-02 23:31:36.549+0000 INFO Remote interface available at http://localhost:7474/

I have a java thread dump as well, but it's too big to put in this message.


(Mark Needham) #2

Can you put the thread dump on S3 or similar and link to it from here?


(D Barnhill) #3

I figured out last night that what caused it was a date index on a large set of nodes (>120,000,000). If I removed the date index then restarted the server, the problem went away. After looking through the threads the only one that was not waiting on something was this one thread that looks like it's doing sampling:

Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: "neo4j.IndexSampling-1" #30 daemon prio=6 os_prio=0 tid=0x00007f015c012800 nid=0x1c2e runnable [0x00007f01738f7000]
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: java.lang.Thread.State: RUNNABLE
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.index.internal.gbptree.SeekCursor.traverseDownToFirstLeaf(SeekCursor.java:461)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.index.internal.gbptree.SeekCursor.readAndValidateNextKeyValueBatch(SeekCursor.java:668)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.index.internal.gbptree.SeekCursor.next(SeekCursor.java:545)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.kernel.impl.index.schema.FullScanNonUniqueIndexSampler.result(FullScanNonUniqueIndexSampler.java:65)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.kernel.impl.index.schema.NativeSchemaIndexReader$$Lambda$338/1736150547.sampleIndex(Unknown Source)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.kernel.impl.index.schema.fusion.FusionIndexSampler.sampleIndex(FusionIndexSampler.java:41)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.kernel.impl.api.index.sampling.OnlineIndexSamplingJob.run(OnlineIndexSamplingJob.java:69)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.kernel.impl.api.index.sampling.IndexSamplingJobTracker.lambda$scheduleSamplingJob$0(IndexSamplingJobTracker.java:83)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.kernel.impl.api.index.sampling.IndexSamplingJobTracker$$Lambda$350/1022784020.run(Unknown Source)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.kernel.impl.scheduler.ThreadPool.lambda$submit$0(ThreadPool.java:58)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at org.neo4j.kernel.impl.scheduler.ThreadPool$$Lambda$319/838627755.run(Unknown Source)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Oct 02 23:14:22 nomad-neo4j-qa neo4j[7082]: at java.lang.Thread.run(Thread.java:748)

Again, this happened for several days in a row with no discernible I/O