Slow writes after upgrade

I recently upgrade our Neo4j Enterprise to v3.5.9. The problem is that since the DB started, any write query hangs for 60seconds and the query is killed by the runtime (We have dbms.transaction.timeout=60s in the config)

Even this simple query will never complete:

create (:Foo)

Or one with a SET statament:

match(n:Node)
where n.uuid = $uuid
set n.updateDate = "2019-09-04T22:27:31.516Z"

On the other hand, read queries are fine. Only writes are hanging.

I see a lot of lines like this on my debug.log:

Please attach a snippet (as text) of your debug.log showing a startup sequence

How do you execute those write queries?

can you call dbms.listQueries()
when they hang and share the output?

Sure. I'll share my neo4j.log as well.

This is my neo4j,log after starting and long time after the query was executed

2019-09-05 02:58:37.995+0000 INFO  ======== Neo4j 3.5.9 ========
2019-09-05 02:58:38.006+0000 INFO  Starting...
2019-09-05 02:58:39.487+0000 INFO  Initiating metrics...
2019-09-05 02:58:43.321+0000 INFO  GraphAware Runtime enabled, bootstrapping...
2019-09-05 02:58:43.357+0000 INFO  Bootstrapping module with order 1, ID UIDM, using com.graphaware.module.uuid.UuidBootstrapper
2019-09-05 02:58:43.544+0000 INFO  Relationship Inclusion Policy set to com.graphaware.common.policy.inclusion.composite.CompositeRelationshipInclusionPolicy@e12e95a1
2019-09-05 02:58:43.554+0000 INFO  Registering module UIDM with GraphAware Runtime.
2019-09-05 02:58:43.556+0000 INFO  GraphAware Runtime bootstrapped, starting the Runtime...
2019-09-05 02:58:43.724+0000 INFO  Deprecated index providers in use: lucene-1.0 (34 indexes). Use procedure 'db.indexes()' to see what indexes use which index provider.
2019-09-05 02:58:44.008+0000 INFO  Starting GraphAware...
2019-09-05 02:58:44.015+0000 INFO  Loading module metadata...
2019-09-05 02:58:44.015+0000 INFO  Loading metadata for module UIDM
2019-09-05 02:58:44.131+0000 INFO  Module UIDM seems to have been registered for the first time.
2019-09-05 02:58:44.131+0000 INFO  Module UIDM seems to have been registered for the first time, will try to initialize...
2019-09-05 02:58:44.135+0000 INFO  InitializeUntil set to 9223372036854775807 and it is 1567652324135. Will initialize.
2019-09-05 02:59:05.875+0000 INFO  Sending metrics to CSV file at /var/lib/neo4j/metrics
2019-09-05 02:59:07.031+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2019-09-05 02:59:08.909+0000 WARN  Server thread metrics not available (missing neo4j.server.threads.jetty.all)
2019-09-05 02:59:08.911+0000 WARN  Server thread metrics not available (missing neo4j.server.threads.jetty.idle)
2019-09-05 02:59:12.693+0000 INFO  Started.
2019-09-05 02:59:13.611+0000 INFO  Mounted REST API at: /db/manage
2019-09-05 02:59:13.665+0000 INFO  started
2019-09-05 02:59:13.678+0000 INFO  Mounted unmanaged extension [com.graphaware.server] at [/graphaware]
2019-09-05 02:59:13.890+0000 INFO  Server thread metrics have been registered successfully
2019-09-05 02:59:14.300+0000 INFO  Google Analytics enabled
2019-09-05 03:29:21.328+0000 WARN  Exception while producing input! Transaction was marked as successful, but unable to commit transaction so rolled back.
org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:99)
	at com.graphaware.tx.executor.input.TransactionalInput.closeTx(TransactionalInput.java:118)
	at com.graphaware.tx.executor.input.TransactionalInput.fetchNextOrNull(TransactionalInput.java:79)
	at org.neo4j.helpers.collection.PrefetchingIterator.peek(PrefetchingIterator.java:60)
	at org.neo4j.helpers.collection.PrefetchingIterator.hasNext(PrefetchingIterator.java:46)
	at com.graphaware.tx.executor.batch.IterableInputBatchTransactionExecutor.lambda$populateQueue$0(IterableInputBatchTransactionExecutor.java:82)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.neo4j.graphdb.TransactionTerminatedException: The transaction has been terminated. Retry your operation in a new transaction, and you should see a successful result. The transaction has not completed within the specified timeout (dbms.transaction.timeout). You may want to retry with a longer timeout.
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.failOnNonExplicitRollbackIfNeeded(KernelTransactionImplementation.java:639)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.closeTransaction(KernelTransactionImplementation.java:589)
	at org.neo4j.internal.kernel.api.Transaction.close(Transaction.java:178)
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:77)
	... 6 more
2019-09-05 03:37:56.266+0000 WARN  The client is unauthorized due to authentication failure.
2019-09-05 05:43:22.664+0000 WARN  Exception while producing input! Transaction was marked as successful, but unable to commit transaction so rolled back.
org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:99)
	at com.graphaware.tx.executor.input.TransactionalInput.closeTx(TransactionalInput.java:118)
	at com.graphaware.tx.executor.input.TransactionalInput.fetchNextOrNull(TransactionalInput.java:79)
	at org.neo4j.helpers.collection.PrefetchingIterator.peek(PrefetchingIterator.java:60)
	at org.neo4j.helpers.collection.PrefetchingIterator.hasNext(PrefetchingIterator.java:46)
	at com.graphaware.tx.executor.batch.IterableInputBatchTransactionExecutor.lambda$populateQueue$0(IterableInputBatchTransactionExecutor.java:82)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.neo4j.graphdb.TransactionTerminatedException: The transaction has been terminated. Retry your operation in a new transaction, and you should see a successful result. The transaction has not completed within the specified timeout (dbms.transaction.timeout). You may want to retry with a longer timeout.
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.failOnNonExplicitRollbackIfNeeded(KernelTransactionImplementation.java:639)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.closeTransaction(KernelTransactionImplementation.java:589)
	at org.neo4j.internal.kernel.api.Transaction.close(Transaction.java:178)
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:77)
	... 6 more
2019-09-05 05:43:31.787+0000 INFO  Creating fresh metadata for module UIDM.
2019-09-05 05:43:31.787+0000 INFO  Module UIDM has not changed configuration since last run, already initialized.
Exception in thread "GraphAware Starter" org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:103)
	at com.graphaware.runtime.metadata.GraphPropertiesMetadataRepository.persistModuleMetadata(GraphPropertiesMetadataRepository.java:111)
	at com.graphaware.runtime.metadata.GraphPropertiesMetadataRepository.persistModuleMetadata(GraphPropertiesMetadataRepository.java:98)
	at com.graphaware.runtime.manager.BaseModuleManager.persistMetadata(BaseModuleManager.java:217)
	at com.graphaware.runtime.manager.BaseModuleManager.loadMetadata(BaseModuleManager.java:158)
	at com.graphaware.runtime.manager.BaseModuleManager.loadMetadata(BaseModuleManager.java:125)
	at com.graphaware.runtime.TxDrivenRuntime.loadMetadata(TxDrivenRuntime.java:130)
	at com.graphaware.runtime.ProductionRuntime.loadMetadata(ProductionRuntime.java:80)
	at com.graphaware.runtime.BaseGraphAwareRuntime.startModules(BaseGraphAwareRuntime.java:154)
	at com.graphaware.runtime.TxDrivenRuntime.startModules(TxDrivenRuntime.java:146)
	at com.graphaware.runtime.ProductionRuntime.startModules(ProductionRuntime.java:70)
	at com.graphaware.runtime.BaseGraphAwareRuntime.start(BaseGraphAwareRuntime.java:134)
	at com.graphaware.runtime.bootstrap.RuntimeKernelExtension.lambda$start$0(RuntimeKernelExtension.java:117)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.neo4j.kernel.impl.store.InvalidRecordException: Property[0,used=false,prev=-1,next=-1, (blocks not loaded)] not in use
	at org.neo4j.kernel.impl.store.record.RecordLoad.verify(RecordLoad.java:63)
	at org.neo4j.kernel.impl.store.CommonAbstractStore.verifyAfterReading(CommonAbstractStore.java:1238)
	at org.neo4j.kernel.impl.store.CommonAbstractStore.readRecordFromPage(CommonAbstractStore.java:1107)
	at org.neo4j.kernel.impl.store.CommonAbstractStore.readIntoRecord(CommonAbstractStore.java:1057)
	at org.neo4j.kernel.impl.store.CommonAbstractStore.getRecord(CommonAbstractStore.java:1024)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.Loaders$2.load(Loaders.java:182)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.Loaders$2.load(Loaders.java:162)
	at org.neo4j.kernel.impl.transaction.state.RecordChanges.getOrLoad(RecordChanges.java:69)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.PropertyCreator.primitiveSetProperty(PropertyCreator.java:84)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.TransactionRecordState.graphAddProperty(TransactionRecordState.java:531)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.TransactionToRecordStateVisitor.visitGraphPropertyChanges(TransactionToRecordStateVisitor.java:149)
	at org.neo4j.storageengine.api.txstate.TxStateVisitor$Delegator.visitGraphPropertyChanges(TxStateVisitor.java:218)
	at org.neo4j.kernel.impl.api.state.TxState.accept(TxState.java:165)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.createCommands(RecordStorageEngine.java:272)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.commit(KernelTransactionImplementation.java:687)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.closeTransaction(KernelTransactionImplementation.java:594)
	at org.neo4j.internal.kernel.api.Transaction.close(Transaction.java:178)
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:77)
	... 13 more

This is my most recent debug.log. Its quite long so I put it on my Google Drive.

I ran dbms.listQueries right after create (:Foo) and I had this output.

queryId,username,metaData,query,parameters,planner,runtime,indexes,startTime,protocol,clientAddress,requestUri,status,resourceInformation,activeLockCount,elapsedTimeMillis,cpuTimeMillis,waitTimeMillis,idleTimeMillis,allocatedBytes,pageHits,pageFaults,connectionId
query-15000,neo4j,"{type:user-direct,app:neo4j-browser_v3.2.20}",call dbms.listQueries(),{},procedure,procedure,[],2019-09-05T20:20:11.987Z,bolt,190.92.63.238:58578,172.31.24.237:7687,running,{},0,0,null,0,null,null,0,0,bolt-275

I think that Graphaware uuid plugin it's involve in this. I removed the plugin and its configs, then restarted the database and it works as usual. The problem is that we really depend on this plugin.

I am not sure if this is related but when I run sudo service neo4j stop it timeout. I have to sudo kill -9 <pid> to stop the database :frowning:

using a kill -9 <neo4j pid> is not recommended/encouraged. When you shut down Neo4j properly using a bin/neo4j stop or similar this will cause Neo4j to exit gracefully and exit such that immediately before exiting it will checkpoint, i.e. write any updated data from the pagecache (i.e. the representation of graph data in RAM) to the actual files in disk at graph.db/neostore.*store.db.

If you kill -9 <neo4j pid> this force kill has no opportunity to checkpoint and thus the updated data in the pagecache can not be written to graph.db/neostore.*store.db. However the data is not lost. Rather upon the next start, it detects that there are transactions in the transaction log which have not been commtted to graph.db/neostore.*store.db and this needs to commit these transactions. As such either you checkpoint via a graceful shutdown or you effectively have to checkpoint on the next startup.

1 Like

Thanks @dana_canzano! Yeah, I've notice that the next time I start there are lines in the logs saying the database needs recovery. I use kill as my last option, only when I have a database that does not stop with the usual neo4j stop command.

APOC now also has some UUID support, perhaps that works better?

You can also use call dbms.listActiveLocks() and call dbms.listTransactions() for some more details on what happens when it hangs.

I think there might be some lock being held that keeps your query from progressing.

call dbms.listTransactions() and call dbms.listActiveLocks() did not provide useful information :frowning:.

call dbms.listTransactions(), shows three rows, one for the procedure and two almost empty rows. There is no queryId which I need in order to call dbms.listActiveLocks().

I studied the APOC support, I wish we could have a way to install handlers for all nodes and relationships at once. I understood we'd need to call apoc.uuid.install() for each label.

So today I I tried with a fresh Neo4j. I configured the plugins and all good. Then I mounted a backup of my data (from the server that was failing) and the problem continued. That makes me think that there's something going on our data/store/indexes.

I know that Graphaware uuid uses explicit indexes, not sure if that could be involve.

I see a lot of this in my debug.log.

Memory?

2019-09-06 23:33:21.793+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=125, gcTime=129, gcCount=1}
2019-09-06 23:33:34.294+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=112, gcTime=116, gcCount=1}
2019-09-06 23:33:37.146+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=146, gcTime=149, gcCount=1}
2019-09-06 23:33:44.649+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=143, gcTime=146, gcCount=1}
2019-09-06 23:33:57.366+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=105, gcTime=137, gcCount=1}
2019-09-06 23:34:02.391+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=129, gcTime=146, gcCount=1}
2019-09-06 23:34:07.410+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=123, gcTime=124, gcCount=1}
2019-09-06 23:34:10.024+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=109, gcTime=129, gcCount=1}

I am not sure but I'll keep digging.

So, after startup I see two very long running transactions. They seem to be internal transactions because they do not have query-id. They have high values for page hits and page faults, the last I checked had 5810 page hits and 209 page faults.

I debugged the code for graphaware server and uuid and found that when the uuid module bootstraps it's calling all the nodes in the graph through a series of wrappers that finally invoke getAllNodes method in GraphDatabaseService class. This is where the process hangs. The result getAllNodes is expected to execute a unit of work nothing is generated.

Now I increases the timeout for transactions, it used to be 1minute and now it's 1hour. I am crossing fingers to see it load successfully. :crossed_fingers:
Here is where the uuid module initializes

Here it waits for the result of getAllNodes. My logs point to the line 86 on this file.

@dmiranda2791 which version of Neo4j did you upgrade from? How many nodes do you have? And did they all have a uuid assigned before upgrade?
And finally, do you use Community or Enterprise?

Upgraded from 3.1.1, we have about 9 million nodes, they all had uuid assigned before and we are using enterprise edition.

I think it was an issue with the database store. We mounted a backup of our production store on the dev instance then we did the upgrade and all went good. Also the previous store use to take long times to startup.

Fortunately, we solved our problem :smiley: .

1 Like

Good to know, thanks

do you by chance still have the old store around for forensics?

Yeah, I have an AWS image with the old version of the database before starting the upgrade. How would that forensics work be?

@Luanne_Misquitt, I am experiencing the same problem again on our dev DB :frowning:, I am pretty sure it has something to do with the initialization of Graphaware uuid module. If I disable this module startup is fast and I can do write queries almost immediately.

When Uuid module is enable, Neo4j starts but I can't do any write queries. The last time I let it run for about 10 hours and I still couldn't.

I have a server with 30GB of ram, I have tested different heap and page cache configurations such as: 18GB for heap, 8 for page cache and the other way around 8GB for heap and 18 for page cache. My store in the version is about 13GB.

Each time I see the module initializing:

2019-09-11 16:19:51.064+0000 INFO  Module UIDM seems to have been registered for the first time.
2019-09-11 16:19:51.064+0000 INFO  Module UIDM seems to have been registered for the first time, will try to initialize...

And a lot of GC pauses on the debug.log

2019-09-11 16:33:11.966+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=109, gcTime=115, gcCount=1}
2019-09-11 16:34:57.481+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=104, gcTime=121, gcCount=1}
2019-09-11 16:35:31.614+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=108, gcTime=117, gcCount=1}
2019-09-11 16:36:10.861+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=115, gcTime=116, gcCount=1}

It'd be great if you can help me figure out what's going on.

we would like to send it to engineering for analytics, if you can share it via DM with me.

I'm very disappointed with all these upgrades.

The new native relationship indexes are much slower than APOC, which has been deprecated.

Other (normal) requests that don't query relationship properties are also much slower: