Slow writes after upgrade

dmiranda2791 · September 4, 2019, 10:30pm

I recently upgrade our Neo4j Enterprise to v3.5.9. The problem is that since the DB started, any write query hangs for 60seconds and the query is killed by the runtime (We have dbms.transaction.timeout=60s in the config)

Even this simple query will never complete:

create (:Foo)

Or one with a SET statament:

match(n:Node)
where n.uuid = $uuid
set n.updateDate = "2019-09-04T22:27:31.516Z"

On the other hand, read queries are fine. Only writes are hanging.

I see a lot of lines like this on my debug.log:

stefan.armbruster · September 5, 2019, 6:48pm

Please attach a snippet (as text) of your debug.log showing a startup sequence

michael.hunger · September 5, 2019, 7:32pm

How do you execute those write queries?

can you call dbms.listQueries()
when they hang and share the output?

dmiranda2791 · September 5, 2019, 8:15pm

Sure. I'll share my neo4j.log as well.

This is my neo4j,log after starting and long time after the query was executed

2019-09-05 02:58:37.995+0000 INFO  ======== Neo4j 3.5.9 ========
2019-09-05 02:58:38.006+0000 INFO  Starting...
2019-09-05 02:58:39.487+0000 INFO  Initiating metrics...
2019-09-05 02:58:43.321+0000 INFO  GraphAware Runtime enabled, bootstrapping...
2019-09-05 02:58:43.357+0000 INFO  Bootstrapping module with order 1, ID UIDM, using com.graphaware.module.uuid.UuidBootstrapper
2019-09-05 02:58:43.544+0000 INFO  Relationship Inclusion Policy set to com.graphaware.common.policy.inclusion.composite.CompositeRelationshipInclusionPolicy@e12e95a1
2019-09-05 02:58:43.554+0000 INFO  Registering module UIDM with GraphAware Runtime.
2019-09-05 02:58:43.556+0000 INFO  GraphAware Runtime bootstrapped, starting the Runtime...
2019-09-05 02:58:43.724+0000 INFO  Deprecated index providers in use: lucene-1.0 (34 indexes). Use procedure 'db.indexes()' to see what indexes use which index provider.
2019-09-05 02:58:44.008+0000 INFO  Starting GraphAware...
2019-09-05 02:58:44.015+0000 INFO  Loading module metadata...
2019-09-05 02:58:44.015+0000 INFO  Loading metadata for module UIDM
2019-09-05 02:58:44.131+0000 INFO  Module UIDM seems to have been registered for the first time.
2019-09-05 02:58:44.131+0000 INFO  Module UIDM seems to have been registered for the first time, will try to initialize...
2019-09-05 02:58:44.135+0000 INFO  InitializeUntil set to 9223372036854775807 and it is 1567652324135. Will initialize.
2019-09-05 02:59:05.875+0000 INFO  Sending metrics to CSV file at /var/lib/neo4j/metrics
2019-09-05 02:59:07.031+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2019-09-05 02:59:08.909+0000 WARN  Server thread metrics not available (missing neo4j.server.threads.jetty.all)
2019-09-05 02:59:08.911+0000 WARN  Server thread metrics not available (missing neo4j.server.threads.jetty.idle)
2019-09-05 02:59:12.693+0000 INFO  Started.
2019-09-05 02:59:13.611+0000 INFO  Mounted REST API at: /db/manage
2019-09-05 02:59:13.665+0000 INFO  started
2019-09-05 02:59:13.678+0000 INFO  Mounted unmanaged extension [com.graphaware.server] at [/graphaware]
2019-09-05 02:59:13.890+0000 INFO  Server thread metrics have been registered successfully
2019-09-05 02:59:14.300+0000 INFO  Google Analytics enabled
2019-09-05 03:29:21.328+0000 WARN  Exception while producing input! Transaction was marked as successful, but unable to commit transaction so rolled back.
org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:99)
	at com.graphaware.tx.executor.input.TransactionalInput.closeTx(TransactionalInput.java:118)
	at com.graphaware.tx.executor.input.TransactionalInput.fetchNextOrNull(TransactionalInput.java:79)
	at org.neo4j.helpers.collection.PrefetchingIterator.peek(PrefetchingIterator.java:60)
	at org.neo4j.helpers.collection.PrefetchingIterator.hasNext(PrefetchingIterator.java:46)
	at com.graphaware.tx.executor.batch.IterableInputBatchTransactionExecutor.lambda$populateQueue$0(IterableInputBatchTransactionExecutor.java:82)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.neo4j.graphdb.TransactionTerminatedException: The transaction has been terminated. Retry your operation in a new transaction, and you should see a successful result. The transaction has not completed within the specified timeout (dbms.transaction.timeout). You may want to retry with a longer timeout.
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.failOnNonExplicitRollbackIfNeeded(KernelTransactionImplementation.java:639)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.closeTransaction(KernelTransactionImplementation.java:589)
	at org.neo4j.internal.kernel.api.Transaction.close(Transaction.java:178)
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:77)
	... 6 more
2019-09-05 03:37:56.266+0000 WARN  The client is unauthorized due to authentication failure.
2019-09-05 05:43:22.664+0000 WARN  Exception while producing input! Transaction was marked as successful, but unable to commit transaction so rolled back.
org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:99)
	at com.graphaware.tx.executor.input.TransactionalInput.closeTx(TransactionalInput.java:118)
	at com.graphaware.tx.executor.input.TransactionalInput.fetchNextOrNull(TransactionalInput.java:79)
	at org.neo4j.helpers.collection.PrefetchingIterator.peek(PrefetchingIterator.java:60)
	at org.neo4j.helpers.collection.PrefetchingIterator.hasNext(PrefetchingIterator.java:46)
	at com.graphaware.tx.executor.batch.IterableInputBatchTransactionExecutor.lambda$populateQueue$0(IterableInputBatchTransactionExecutor.java:82)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.neo4j.graphdb.TransactionTerminatedException: The transaction has been terminated. Retry your operation in a new transaction, and you should see a successful result. The transaction has not completed within the specified timeout (dbms.transaction.timeout). You may want to retry with a longer timeout.
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.failOnNonExplicitRollbackIfNeeded(KernelTransactionImplementation.java:639)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.closeTransaction(KernelTransactionImplementation.java:589)
	at org.neo4j.internal.kernel.api.Transaction.close(Transaction.java:178)
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:77)
	... 6 more
2019-09-05 05:43:31.787+0000 INFO  Creating fresh metadata for module UIDM.
2019-09-05 05:43:31.787+0000 INFO  Module UIDM has not changed configuration since last run, already initialized.
Exception in thread "GraphAware Starter" org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:103)
	at com.graphaware.runtime.metadata.GraphPropertiesMetadataRepository.persistModuleMetadata(GraphPropertiesMetadataRepository.java:111)
	at com.graphaware.runtime.metadata.GraphPropertiesMetadataRepository.persistModuleMetadata(GraphPropertiesMetadataRepository.java:98)
	at com.graphaware.runtime.manager.BaseModuleManager.persistMetadata(BaseModuleManager.java:217)
	at com.graphaware.runtime.manager.BaseModuleManager.loadMetadata(BaseModuleManager.java:158)
	at com.graphaware.runtime.manager.BaseModuleManager.loadMetadata(BaseModuleManager.java:125)
	at com.graphaware.runtime.TxDrivenRuntime.loadMetadata(TxDrivenRuntime.java:130)
	at com.graphaware.runtime.ProductionRuntime.loadMetadata(ProductionRuntime.java:80)
	at com.graphaware.runtime.BaseGraphAwareRuntime.startModules(BaseGraphAwareRuntime.java:154)
	at com.graphaware.runtime.TxDrivenRuntime.startModules(TxDrivenRuntime.java:146)
	at com.graphaware.runtime.ProductionRuntime.startModules(ProductionRuntime.java:70)
	at com.graphaware.runtime.BaseGraphAwareRuntime.start(BaseGraphAwareRuntime.java:134)
	at com.graphaware.runtime.bootstrap.RuntimeKernelExtension.lambda$start$0(RuntimeKernelExtension.java:117)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.neo4j.kernel.impl.store.InvalidRecordException: Property[0,used=false,prev=-1,next=-1, (blocks not loaded)] not in use
	at org.neo4j.kernel.impl.store.record.RecordLoad.verify(RecordLoad.java:63)
	at org.neo4j.kernel.impl.store.CommonAbstractStore.verifyAfterReading(CommonAbstractStore.java:1238)
	at org.neo4j.kernel.impl.store.CommonAbstractStore.readRecordFromPage(CommonAbstractStore.java:1107)
	at org.neo4j.kernel.impl.store.CommonAbstractStore.readIntoRecord(CommonAbstractStore.java:1057)
	at org.neo4j.kernel.impl.store.CommonAbstractStore.getRecord(CommonAbstractStore.java:1024)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.Loaders$2.load(Loaders.java:182)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.Loaders$2.load(Loaders.java:162)
	at org.neo4j.kernel.impl.transaction.state.RecordChanges.getOrLoad(RecordChanges.java:69)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.PropertyCreator.primitiveSetProperty(PropertyCreator.java:84)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.TransactionRecordState.graphAddProperty(TransactionRecordState.java:531)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.TransactionToRecordStateVisitor.visitGraphPropertyChanges(TransactionToRecordStateVisitor.java:149)
	at org.neo4j.storageengine.api.txstate.TxStateVisitor$Delegator.visitGraphPropertyChanges(TxStateVisitor.java:218)
	at org.neo4j.kernel.impl.api.state.TxState.accept(TxState.java:165)
	at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.createCommands(RecordStorageEngine.java:272)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.commit(KernelTransactionImplementation.java:687)
	at org.neo4j.kernel.impl.api.KernelTransactionImplementation.closeTransaction(KernelTransactionImplementation.java:594)
	at org.neo4j.internal.kernel.api.Transaction.close(Transaction.java:178)
	at org.neo4j.kernel.impl.coreapi.TopLevelTransaction.close(TopLevelTransaction.java:77)
	... 13 more

This is my most recent debug.log. Its quite long so I put it on my Google Drive.

dmiranda2791 · September 5, 2019, 8:21pm

I ran dbms.listQueries right after create (:Foo) and I had this output.

queryId,username,metaData,query,parameters,planner,runtime,indexes,startTime,protocol,clientAddress,requestUri,status,resourceInformation,activeLockCount,elapsedTimeMillis,cpuTimeMillis,waitTimeMillis,idleTimeMillis,allocatedBytes,pageHits,pageFaults,connectionId
query-15000,neo4j,"{type:user-direct,app:neo4j-browser_v3.2.20}",call dbms.listQueries(),{},procedure,procedure,[],2019-09-05T20:20:11.987Z,bolt,190.92.63.238:58578,172.31.24.237:7687,running,{},0,0,null,0,null,null,0,0,bolt-275

dmiranda2791 · September 5, 2019, 8:23pm

I think that Graphaware uuid plugin it's involve in this. I removed the plugin and its configs, then restarted the database and it works as usual. The problem is that we really depend on this plugin.

I am not sure if this is related but when I run sudo service neo4j stop it timeout. I have to sudo kill -9 <pid> to stop the database

dana_canzano · September 6, 2019, 12:52am

using a kill -9 <neo4j pid> is not recommended/encouraged. When you shut down Neo4j properly using a bin/neo4j stop or similar this will cause Neo4j to exit gracefully and exit such that immediately before exiting it will checkpoint, i.e. write any updated data from the pagecache (i.e. the representation of graph data in RAM) to the actual files in disk at graph.db/neostore.*store.db.

If you kill -9 <neo4j pid> this force kill has no opportunity to checkpoint and thus the updated data in the pagecache can not be written to graph.db/neostore.*store.db. However the data is not lost. Rather upon the next start, it detects that there are transactions in the transaction log which have not been commtted to graph.db/neostore.*store.db and this needs to commit these transactions. As such either you checkpoint via a graceful shutdown or you effectively have to checkpoint on the next startup.

dmiranda2791 · September 6, 2019, 4:09am

Thanks @dana_canzano! Yeah, I've notice that the next time I start there are lines in the logs saying the database needs recovery. I use kill as my last option, only when I have a database that does not stop with the usual neo4j stop command.

michael.hunger · September 6, 2019, 11:09am

APOC now also has some UUID support, perhaps that works better?

You can also use call dbms.listActiveLocks() and call dbms.listTransactions() for some more details on what happens when it hangs.

I think there might be some lock being held that keeps your query from progressing.

dmiranda2791 · September 6, 2019, 4:48pm

call dbms.listTransactions() and call dbms.listActiveLocks() did not provide useful information .

call dbms.listTransactions(), shows three rows, one for the procedure and two almost empty rows. There is no queryId which I need in order to call dbms.listActiveLocks().

I studied the APOC support, I wish we could have a way to install handlers for all nodes and relationships at once. I understood we'd need to call apoc.uuid.install() for each label.

dmiranda2791 · September 6, 2019, 11:38pm

So today I I tried with a fresh Neo4j. I configured the plugins and all good. Then I mounted a backup of my data (from the server that was failing) and the problem continued. That makes me think that there's something going on our data/store/indexes.

I know that Graphaware uuid uses explicit indexes, not sure if that could be involve.

I see a lot of this in my debug.log.

Memory?

2019-09-06 23:33:21.793+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=125, gcTime=129, gcCount=1}
2019-09-06 23:33:34.294+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=112, gcTime=116, gcCount=1}
2019-09-06 23:33:37.146+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=146, gcTime=149, gcCount=1}
2019-09-06 23:33:44.649+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=143, gcTime=146, gcCount=1}
2019-09-06 23:33:57.366+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=105, gcTime=137, gcCount=1}
2019-09-06 23:34:02.391+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=129, gcTime=146, gcCount=1}
2019-09-06 23:34:07.410+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=123, gcTime=124, gcCount=1}
2019-09-06 23:34:10.024+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=109, gcTime=129, gcCount=1}

I am not sure but I'll keep digging.

dmiranda2791 · September 7, 2019, 4:48am

So, after startup I see two very long running transactions. They seem to be internal transactions because they do not have query-id. They have high values for page hits and page faults, the last I checked had 5810 page hits and 209 page faults.

I debugged the code for graphaware server and uuid and found that when the uuid module bootstraps it's calling all the nodes in the graph through a series of wrappers that finally invoke getAllNodes method in GraphDatabaseService class. This is where the process hangs. The result getAllNodes is expected to execute a unit of work nothing is generated.

Now I increases the timeout for transactions, it used to be 1minute and now it's 1hour. I am crossing fingers to see it load successfully.
Here is where the uuid module initializes

github.com

graphaware/neo4j-uuid/blob/master/src/main/java/com/graphaware/module/uuid/UuidModule.java#L104


      
                      } else {
                          continue;
                      }
                  }
          
          
        if (uuidHasChanged(change)) {
                      if (isImmutable()) {
                          throw new DeliberateTransactionRollbackException("You are not allowed to modify the " + uuidConfiguration.getUuidProperty() + " property");
                      } else {
                          continue;
                      }
                  }
          
          
        if (!hadUuid(change) && hasNoUuid(change)) {
                      assignUuid(change.getCurrent());
                  }
              }
          }
          
          
private void assignUuid(Entity entity) {
              String uuidProperty = uuidConfiguration.getUuidProperty();

Here it waits for the result of getAllNodes. My logs point to the line 86 on this file.

github.com

graphaware/neo4j-framework/blob/master/tx-executor/src/main/java/com/graphaware/tx/executor/batch/IterableInputBatchTransactionExecutor.java#L82


      
          @Override
          protected void doExecute() {
              populateQueue();
              processQueue();
          }
          
          
protected final void populateQueue() {
              new Thread(() -> {
                  try {
                      for (T input : IterableInputBatchTransactionExecutor.this.input) {
                          queue.offer(input);
                      }
                  } catch (Exception e) {
                      LOG.warn("Exception while producing input!", e);
                  } finally {
                      finished.set(true);
                  }
              }).start();
          }
          
          
protected final void processQueue() {

Luanne_Misquitt · September 10, 2019, 12:49am

@dmiranda2791 which version of Neo4j did you upgrade from? How many nodes do you have? And did they all have a uuid assigned before upgrade?
And finally, do you use Community or Enterprise?

dmiranda2791 · September 10, 2019, 2:10am

Upgraded from 3.1.1, we have about 9 million nodes, they all had uuid assigned before and we are using enterprise edition.

I think it was an issue with the database store. We mounted a backup of our production store on the dev instance then we did the upgrade and all went good. Also the previous store use to take long times to startup.

Fortunately, we solved our problem .

Luanne_Misquitt · September 10, 2019, 2:26am

Good to know, thanks

michael.hunger · September 10, 2019, 10:29am

do you by chance still have the old store around for forensics?

dmiranda2791 · September 10, 2019, 5:55pm

Yeah, I have an AWS image with the old version of the database before starting the upgrade. How would that forensics work be?

dmiranda2791 · September 11, 2019, 4:41pm

@Luanne_Misquitt, I am experiencing the same problem again on our dev DB , I am pretty sure it has something to do with the initialization of Graphaware uuid module. If I disable this module startup is fast and I can do write queries almost immediately.

When Uuid module is enable, Neo4j starts but I can't do any write queries. The last time I let it run for about 10 hours and I still couldn't.

I have a server with 30GB of ram, I have tested different heap and page cache configurations such as: 18GB for heap, 8 for page cache and the other way around 8GB for heap and 18 for page cache. My store in the version is about 13GB.

Each time I see the module initializing:

2019-09-11 16:19:51.064+0000 INFO  Module UIDM seems to have been registered for the first time.
2019-09-11 16:19:51.064+0000 INFO  Module UIDM seems to have been registered for the first time, will try to initialize...

And a lot of GC pauses on the debug.log

2019-09-11 16:33:11.966+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=109, gcTime=115, gcCount=1}
2019-09-11 16:34:57.481+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=104, gcTime=121, gcCount=1}
2019-09-11 16:35:31.614+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=108, gcTime=117, gcCount=1}
2019-09-11 16:36:10.861+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=115, gcTime=116, gcCount=1}

It'd be great if you can help me figure out what's going on.

michael.hunger · September 11, 2019, 10:37pm

we would like to send it to engineering for analytics, if you can share it via DM with me.

deemeetree · April 24, 2020, 12:08pm

I'm very disappointed with all these upgrades.

The new native relationship indexes are much slower than APOC, which has been deprecated.

Other (normal) requests that don't query relationship properties are also much slower:

Topic		Replies	Views
Sporadic Long-Running Queries During Batched Writes – What Could Be the Cause? Cypher performance , cypher	4	80	June 25, 2025
Driver hanging and eventually failing with empty read buffer .NET neo4j-driver	3	687	April 19, 2021
Neo4j instance hang by a heavy write operation Neo4j Graph Platform performance , lock	0	401	May 7, 2020
Read / write performance dramatically degrades with concurrent queries Neo4j Graph Platform migrated	1	173	July 17, 2022
Intermittent issue in query execution time Neo4j Graph Platform migrated	6	236	July 4, 2022

July Summer Fun!

Slow writes after upgrade

Related topics