Queries stuck (for over 12 hours) after restoring backup

We recently shifted Neo4j from AWS to Azure by backing up our aws instances and restoring it on azure.

While the database has managed to spin up correctly, we noticed that our queries are taking much longer to complete. Now, we have 1000+ queries running that have been stuck in the running state for over 12 hours. These are queries that used to take a few hundred ms before.

Do we need to do something else in order to restore the database back to its original speed (the machine types are equivalent, and the disk is slightly slower)

Neo4j version: 3.4
Driver: Javascript v1.6

Here is what :queries returns:

Upon digging further, the queries seem to be stuck in the waiting state:

queryId	username	status	activeLockCount	cpuTimeMillis	waitTimeMillis	idleTimeMillis	pageHits	pageFaults
"query-22273824"	"neo4j"	"waiting"	49	null	46068202	null	186	0
"query-22028352"	"neo4j"	"waiting"	35	null	47060864	null	74	0
"query-24696389"	"neo4j"	"waiting"	58	null	27333190	null	1192	22
"query-21900462"	"neo4j"	"waiting"	44	null	47157005	null	108	0
"query-21802972"	"neo4j"	"waiting"	37	null	47227062	null	90	0
"query-21979827"	"neo4j"	"waiting"	41	null	47095846	null	122	0
"query-22168283"	"neo4j"	"waiting"	36	null	46829604	null	60	0
"query-24998134"	"neo4j"	"waiting"	9	null	27029505	null	165	5
"query-22031403"	"neo4j"	"waiting"	41	null	47057162	null	122	0
"query-24782911"	"neo4j"	"waiting"	68	null	27238637	null	817	38
"query-22055104"	"neo4j"	"waiting"	41	null	47031869	null	123	0
"query-21822317"	"neo4j"	"waiting"	53	null	47214662	null	218	0
"query-22122749"	"neo4j"	"waiting"	36	null	46949313	null	64	0
"query-21831185"	"neo4j"	"waiting"	39	null	47209538	null	106	0
"query-22122210"	"neo4j"	"waiting"	41	null	46950780	null	122	0
"query-22047665"	"neo4j"	"waiting"	35	null	47040682	null	73	1
"query-22293153"	"neo4j"	"waiting"	48	null	46003533	null	160	0
"query-25112820"	"neo4j"	"waiting"	33	null	26919621	null	563	27

Neo4j CPU usage: 50% (32 cores)
IOPS usage: 0.01% - 3.15%
Debug.log contains a bunch of messages like this:

2019-05-25 06:43:17.059+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0xf793ca34, L:/10.0.2.6:7687 - R:/54.251.151.37:53032] readAddress(..) failed: Connection timed out
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
	at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
	at io.netty.channel.unix.Errors.ioResult(Errors.java:146)
	at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:178)
	at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:348)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:794)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:404)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
	at java.lang.Thread.run(Thread.java:748)

Any suggestions will be very helpful.

Not sure how, but you've got some lock contention issue going on...all of your queries are waiting on locks, and either this isn't a deadlock state or the deadlock detection isn't working, so you're stuck.

You'll either need to kill the queries or restart the db.

You said you're on Neo4j 3.4. Is that 3.4.0? If so, get off that .0 release, these contain brand new features and thus the highest potential for bugs with their first introduction. When you can please upgrade to the highest patch release for your minor version, in this case 3.4.14. That can at least remove the possibility that you're being affected by some bug that has since been patched along the 3.4.x line.

Thanks Andrew,

We are on version 3.4.1, but we will update to 3.4.14.

I learnt from one of my colleagues that when we moved from AWS to Azure, we directly started the database on the backup that was created and did not use neo4j-admin restore tool (it was taking too long according to them)

What are the implications of not using the restor tool? Did that cause the error?

We typically recommend neo4j-admin restore as it does several other housekeeping things for you that among other things may prevent some forms of corruption later (if used in a cluster).

You may want to double-check :schema from the browser to ensure all your expected indexes are present and online. You can use dbms.listQueries() and dbms.listTransactions() to see what's executing, and you should also be able to see query state as well.

You can also use the following query to get locking information via JMX beans:

call dbms.queryJmx("org.neo4j:instance=kernel#0,name=Locking") yield attributes
return attributes

Also, double-check permissions within your graph.db structure, I believe the read and write permissions should be set to the user that is used to run Neo4j.

Hi Andrew,

Thanks for your reply.

Here is what we tried:

  • Verified that all constraints and indexes are online
  • Upgrade to 3.4.14 was successful
  • Set a lockAcquisitionTimeout on the driver to see which queries are unable to acquire locks

None of the above works, and now we have thousands of queries failing because they couldn't acquire a lock (in 30s).

We are going to rebuild the indexes and see if that might help. What else can we do?

Update: We can't rebuild indexes either as setting a new label for 3k nodes is taking over 3 minutes, and we have over 1B+ nodes that need migration. Going to drop that idea. Our last resort is moving to a new machine type and os type.

Update 2: Weirdly, the queries that are running are not always shown in dbms.listQueries(). For example, one of the queries that we executed in the browser reported that it took 10s to complete, however this statement only showed up after 100s. This query was available in the list of queries generated by dbms.listQueries for 10s, but not afterwards. Neo4j thinks the query is done, but in reality it is still executing. Future queries seem to be waiting for the locks to be released by these phantom queries.

Here is a screenshot of such a query:

The query above eventually completed, but reported as taking 14 seconds (irl took about 90s)

@andrew.bowman do you have any tips for us?

Hard to say without more info. There are some other tools I provided for checking lock health, try those out.

You can EXPLAIN some of the queries in question that are running into contention issues, that is a good sanity check to ensure they're using index lookups where expected.

You may also want to set dbms.transaction.timeout to the same value as your lock acquisition timeout.

As far as adding new labels to many nodes, or otherwise processing > 50k-100k nodes/rels at once, you should batch those, usually with apoc.periodic.iterate() from APOC Procedures.