We recently shifted Neo4j from AWS to Azure by backing up our aws instances and restoring it on azure.
While the database has managed to spin up correctly, we noticed that our queries are taking much longer to complete. Now, we have 1000+ queries running that have been stuck in the running state for over 12 hours. These are queries that used to take a few hundred ms before.
Do we need to do something else in order to restore the database back to its original speed (the machine types are equivalent, and the disk is slightly slower)
Neo4j version: 3.4
Here is what
Upon digging further, the queries seem to be stuck in the waiting state:
queryId username status activeLockCount cpuTimeMillis waitTimeMillis idleTimeMillis pageHits pageFaults "query-22273824" "neo4j" "waiting" 49 null 46068202 null 186 0 "query-22028352" "neo4j" "waiting" 35 null 47060864 null 74 0 "query-24696389" "neo4j" "waiting" 58 null 27333190 null 1192 22 "query-21900462" "neo4j" "waiting" 44 null 47157005 null 108 0 "query-21802972" "neo4j" "waiting" 37 null 47227062 null 90 0 "query-21979827" "neo4j" "waiting" 41 null 47095846 null 122 0 "query-22168283" "neo4j" "waiting" 36 null 46829604 null 60 0 "query-24998134" "neo4j" "waiting" 9 null 27029505 null 165 5 "query-22031403" "neo4j" "waiting" 41 null 47057162 null 122 0 "query-24782911" "neo4j" "waiting" 68 null 27238637 null 817 38 "query-22055104" "neo4j" "waiting" 41 null 47031869 null 123 0 "query-21822317" "neo4j" "waiting" 53 null 47214662 null 218 0 "query-22122749" "neo4j" "waiting" 36 null 46949313 null 64 0 "query-21831185" "neo4j" "waiting" 39 null 47209538 null 106 0 "query-22122210" "neo4j" "waiting" 41 null 46950780 null 122 0 "query-22047665" "neo4j" "waiting" 35 null 47040682 null 73 1 "query-22293153" "neo4j" "waiting" 48 null 46003533 null 160 0 "query-25112820" "neo4j" "waiting" 33 null 26919621 null 563 27
Neo4j CPU usage: 50% (32 cores)
IOPS usage: 0.01% - 3.15%
Debug.log contains a bunch of messages like this:
2019-05-25 06:43:17.059+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0xf793ca34, L:/10.0.2.6:7687 - R:/126.96.36.199:53032] readAddress(..) failed: Connection timed out io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out at io.netty.channel.unix.Errors.newIOException(Errors.java:122) at io.netty.channel.unix.Errors.ioResult(Errors.java:146) at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:178) at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:348) at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:794) at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:404) at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) at java.lang.Thread.run(Thread.java:748)
Any suggestions will be very helpful.
Not sure how, but you've got some lock contention issue going on...all of your queries are waiting on locks, and either this isn't a deadlock state or the deadlock detection isn't working, so you're stuck.
You'll either need to kill the queries or restart the db.
You said you're on Neo4j 3.4. Is that 3.4.0? If so, get off that .0 release, these contain brand new features and thus the highest potential for bugs with their first introduction. When you can please upgrade to the highest patch release for your minor version, in this case 3.4.14. That can at least remove the possibility that you're being affected by some bug that has since been patched along the 3.4.x line.
We are on version 3.4.1, but we will update to 3.4.14.
I learnt from one of my colleagues that when we moved from AWS to Azure, we directly started the database on the backup that was created and did not use
neo4j-admin restore tool (it was taking too long according to them)
What are the implications of not using the restor tool? Did that cause the error?
We typically recommend
neo4j-admin restore as it does several other housekeeping things for you that among other things may prevent some forms of corruption later (if used in a cluster).
You may want to double-check
:schema from the browser to ensure all your expected indexes are present and online. You can use
dbms.listTransactions() to see what's executing, and you should also be able to see query state as well.
You can also use the following query to get locking information via JMX beans:
call dbms.queryJmx("org.neo4j:instance=kernel#0,name=Locking") yield attributes return attributes
Also, double-check permissions within your graph.db structure, I believe the read and write permissions should be set to the user that is used to run Neo4j.
Thanks for your reply.
Here is what we tried:
lockAcquisitionTimeouton the driver to see which queries are unable to acquire locks
None of the above works, and now we have thousands of queries failing because they couldn't acquire a lock (in 30s).
We are going to rebuild the indexes and see if that might help. What else can we do?
Update: We can't rebuild indexes either as setting a new label for 3k nodes is taking over 3 minutes, and we have over 1B+ nodes that need migration. Going to drop that idea. Our last resort is moving to a new machine type and os type.
Update 2: Weirdly, the queries that are running are not always shown in dbms.listQueries(). For example, one of the queries that we executed in the browser reported that it took 10s to complete, however this statement only showed up after 100s. This query was available in the list of queries generated by
dbms.listQueries for 10s, but not afterwards. Neo4j thinks the query is done, but in reality it is still executing. Future queries seem to be waiting for the locks to be released by these phantom queries.
Here is a screenshot of such a query:
The query above eventually completed, but reported as taking 14 seconds (irl took about 90s)
Hard to say without more info. There are some other tools I provided for checking lock health, try those out.
You can EXPLAIN some of the queries in question that are running into contention issues, that is a good sanity check to ensure they're using index lookups where expected.
You may also want to set
dbms.transaction.timeout to the same value as your lock acquisition timeout.
As far as adding new labels to many nodes, or otherwise processing > 50k-100k nodes/rels at once, you should batch those, usually with
apoc.periodic.iterate() from APOC Procedures.