Hi All,
I have a Neo4j 4.1.3 Enterprise database running in an Azure Container Instance, and I'm trying to perform a backup. The operation hangs while retrieving the neostore.counts.db file and eventually fails with a timeout error.
Here are the logs and stack trace:
root@SandboxHost-637403743601256193:/var/lib/neo4j# neo4j-admin backup --backup-dir=/var/lib/neo4j/backups --database=neo4j --verbose
neo4j 4.1.3
VM Name: OpenJDK 64-Bit Server VM
VM Vendor: Oracle Corporation
VM Version: 11.0.9+11
JIT compiler: HotSpot 64-Bit Tiered Compilers
VM Arguments: [-XX:+UseParallelGC, -Dfile.encoding=UTF-8]
2020-11-08 03:22:04.781+0000 INFO [c.n.b.i.BackupStrategyWrapper] Previous backup not found, a new full backup will be performed.
2020-11-08 03:22:04.790+0000 INFO [c.n.b.i.BackupStrategyWrapper] Full backup will be first performed to a temporary directory '/var/lib/neo4j/backups/neo4j.temp.0' because the specified directory '/var/lib/neo4j/backups/neo4j' already exists and is not empty
2020-11-08 03:22:04.836+0000 INFO [c.n.b.i.DefaultBackupStrategy] Remote backup address is localhost:6362
2020-11-08 03:22:05.060+0000 INFO [c.n.c.p.i.ClientChannelInitializer] Initializing client channel [id: 0x5c52cb01]
2020-11-08 03:22:05.203+0000 DEBUG [c.n.c.p.i.InitClientHandler] Channel [id: 0x5c52cb01, L:/127.0.0.1:55944 - R:localhost/127.0.0.1:6362] received a correct magic message
2020-11-08 03:22:05.231+0000 INFO [c.n.c.p.h.HandshakeClientInitializer] Initiating handshake on channel [id: 0x5c52cb01, L:/127.0.0.1:55944 - R:localhost/127.0.0.1:6362]
2020-11-08 03:22:05.280+0000 INFO [c.n.c.p.h.HandshakeClientInitializer] Connected to localhost/127.0.0.1:6362 [catchup version:3.0]
2020-11-08 03:22:05.282+0000 INFO [c.n.c.p.h.HandshakeClientInitializer] Handshake completed on channel [id: 0x5c52cb01, L:/127.0.0.1:55944 - R:localhost/127.0.0.1:6362]. Installing: catchup version:3.0
2020-11-08 03:22:05.424+0000 INFO [c.n.b.i.DefaultBackupStrategy] Database id is DatabaseId{394666b4[neo4j]}
2020-11-08 03:22:05.461+0000 INFO [c.n.b.i.DefaultBackupStrategy] Remote store id is StoreId{creationTime=1604135315159, randomId=1624527527294936977, storeVersion=3471765337752883975, upgradeTime=1604135315159, upgradeTxId=1}
2020-11-08 03:22:05.465+0000 INFO [c.n.b.i.DefaultBackupStrategy] Local store id is StoreId{creationTime=1604135315159, randomId=1624527527294936977, storeVersion=3471765337752883975, upgradeTime=1604135315159, upgradeTxId=1}
2020-11-08 03:22:05.475+0000 INFO [c.n.c.c.s.StoreCopyClient] Requesting store listing from: localhost:6362
2020-11-08 03:22:07.994+0000 INFO [c.n.c.c.s.StoreCopyClient] Receiving file: neostore.counts.db
2020-11-08 03:32:07.995+0000 WARN [c.n.c.c.s.StoreCopyClient] Request timed out after period of inactivity. Time since last response: OptionalLong[600000] ms.
2020-11-08 03:32:08.011+0000 INFO [c.n.c.p.h.HandshakeClientInitializer] Lost connection to localhost/127.0.0.1:6362 [catchup version:3.0]
org.neo4j.cli.CommandFailedException: Execution of backup failed. Request timed out after period of inactivity. Time since last response: OptionalLong[600000] ms.
at com.neo4j.backup.impl.OnlineBackupCommand.execute(OnlineBackupCommand.java:134)
at org.neo4j.cli.AbstractCommand.call(AbstractCommand.java:67)
at org.neo4j.cli.AbstractCommand.call(AbstractCommand.java:33)
at picocli.CommandLine.executeUserObject(CommandLine.java:1783)
at picocli.CommandLine.access$900(CommandLine.java:145)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2150)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2144)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2108)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:1975)
at picocli.CommandLine.execute(CommandLine.java:1904)
at org.neo4j.cli.AdminTool.execute(AdminTool.java:77)
at org.neo4j.cli.AdminTool.main(AdminTool.java:58)
Caused by: com.neo4j.backup.impl.BackupExecutionException: com.neo4j.causalclustering.catchup.storecopy.StoreCopyFailedException: java.util.concurrent.TimeoutException: Request timed out after period of inactivity. Time since last response: OptionalLong[600000] ms.
at com.neo4j.backup.impl.DefaultBackupStrategy.performFullBackup(DefaultBackupStrategy.java:70)
at com.neo4j.backup.impl.BackupStrategyWrapper.fullBackupWithTemporaryFolderResolutions(BackupStrategyWrapper.java:148)
at com.neo4j.backup.impl.BackupStrategyWrapper.performBackupWithoutLifecycle(BackupStrategyWrapper.java:90)
at com.neo4j.backup.impl.BackupStrategyWrapper.doBackup(BackupStrategyWrapper.java:58)
at com.neo4j.backup.impl.BackupStrategyCoordinator.performBackup(BackupStrategyCoordinator.java:56)
at com.neo4j.backup.impl.OnlineBackupExecutor.executeBackup(OnlineBackupExecutor.java:77)
at com.neo4j.backup.impl.OnlineBackupCommand.execute(OnlineBackupCommand.java:125)
... 11 more
Caused by: com.neo4j.causalclustering.catchup.storecopy.StoreCopyFailedException: java.util.concurrent.TimeoutException: Request timed out after period of inactivity. Time since last response: OptionalLong[600000] ms.
at com.neo4j.causalclustering.catchup.storecopy.StoreCopyClient.prepareStoreCopy(StoreCopyClient.java:198)
at com.neo4j.causalclustering.catchup.storecopy.StoreCopyClient.copyStoreFiles(StoreCopyClient.java:65)
at com.neo4j.causalclustering.catchup.storecopy.RemoteStore.copy(RemoteStore.java:103)
at com.neo4j.backup.impl.BackupDelegator.copy(BackupDelegator.java:53)
at com.neo4j.backup.impl.DefaultBackupStrategy.performFullBackup(DefaultBackupStrategy.java:65)
... 17 more
Caused by: java.util.concurrent.TimeoutException: Request timed out after period of inactivity. Time since last response: OptionalLong[600000] ms.
at com.neo4j.causalclustering.helper.OperationProgressMonitor.throwOnTimeout(OperationProgressMonitor.java:79)
at com.neo4j.causalclustering.helper.OperationProgressMonitor.get(OperationProgressMonitor.java:68)
at com.neo4j.causalclustering.catchup.CatchupClient$Builder.request(CatchupClient.java:112)
at com.neo4j.causalclustering.catchup.storecopy.StoreCopyClient.prepareStoreCopy(StoreCopyClient.java:194)
... 21 more
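For what it's worth, the 600000 ms of inactivity in the error matches Neo4j's default 10-minute catch-up inactivity timeout. One thing I'm planning to try, as a sketch only, is raising that timeout via an extra config file; I'm assuming here that --additional-config is accepted by the 4.1 backup command and that causal_clustering.catch_up_client_inactivity_timeout is the setting the backup client honours (the file path below is just an example):

echo "causal_clustering.catch_up_client_inactivity_timeout=30m" > /var/lib/neo4j/backup-extra.conf
neo4j-admin backup --backup-dir=/var/lib/neo4j/backups --database=neo4j --additional-config=/var/lib/neo4j/backup-extra.conf --verbose

Of course that would only paper over the problem if the store copy is merely slow rather than genuinely stuck, but it might help narrow things down.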
The same operation performed against a Docker container running locally succeeds with no issues.
Has anyone come across a similar problem? It seems as though the Azure instance is deadlocking somewhere.
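In case it's relevant, I'm also going to check the container group's CPU and memory allocation on the Azure side, since resource starvation could make the store copy stall in a way that never shows up on my local Docker host. Something along these lines, with the resource group and container group names as placeholders:

az container show --resource-group <resource-group> --name <container-group> --query "containers[].resources.requests"

If the ACI turns out to have far less CPU or memory than the local container, that would at least be one difference worth ruling out.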
Thanks
Max