Backup/Restore


(Tharshy) #1

@david.allen, Would like to know,

  1. if we backup a graph.db file on external hard drive, what would be the best practice restore them back on my Causal cluster?

  2. If we have one separate VM to store backup, should we connect that to CORE cluster system?

last time, you advice me to do the following to add the graph.db on each instances, however after I did that, I could not get the clusters to form again, did we miss anything here... Thank you

  1. create a zip archive of a graph.db folder
  2. copy it to all three machines

Do the following for each machine, so in other words, each step 3 times:
3. ensure the system service is not running on those machines
4. unzip the same graph.db.zip into the appropriate database folder
5. run neo4j-admin unbind to remove existing cluster state
6. restart the neo4jj system service

Thank you


(M. David Allen) #2

Before we get to best practices, have you read the restore documentation? Can you confirm that you're running restore according to these directions? In the steps you're providing above, those are what needs to happen before the restore, but don't actually include the restore step.

https://neo4j.com/docs/operations-manual/current/backup/restore-backup/


(Tharshy) #3

sorry for the late response, my understanding was when I copy the graph.db on each instances, then unbind and restart all neo4j instances. all of them should have the graph.db file, that I have manually copied in there, and clusters should form again.
I do that all the time on single instance of neo4j. but for cluster it would not form?
how we can store a graph.db from remote instances, without restore procedure (restore commnd?
should the above work on manually adding graph.db on each instances and restart cluster?

my cluster vm's are not on the same network as backup vm? I would like to know, if above manual procedure would work or not?
Thank you


(M. David Allen) #4

We're missing a lot of information to help here. Copying the graph.db folder instead of using the backup/restore tool should work OK, however you haven't indicated how the cluster doesn't form, what error message you're seeing, whether or not you have anything else in your debug.log file.

Try looking through your debug.log file, particularly for warnings or errors. Also please paste the output of the system logs for neo4j (not just debug logs) so we can see what is happening. Cluster not forming may be unrelated to restoring the data and may have to do with network configuration, but it's tough to tell without a lot more detail.


(Tharshy) #5

Good Morning! @david.allen, I will get back to you with debug logs, meanwhile, in order for the backup and restore to perform, do we need to set memory allocation in neo4j.template, I see this warning is always there on my debug.log file

2018-09-17 11:55:40.287+0000 WARN [o.n.i.p.PageCache] The dbms.memory.pagecache.size setting has not been configured. It is recommended that this setting is always explicitly configured, to ensure the system has a balanced configuration. Until then, a computed heuristic value of 1451847680 bytes will be used instead. Run neo4j-admin memrec for memory configuration suggestions.


(M. David Allen) #6

Please read the warning and do as it says. :slight_smile:


(Tharshy) #7

hi @david.allen, i have followed documentation on how to do backup on causal clustering and restore. here is what I am trying
Backup the causal clusters while they are running:
neo4j-admin backup --protocol=catchup --from:127.0.0.1:6362 --backup-dir=/mnt/backup --name=graph.db.1 --pagecache=2G

above process successfully backup the files inside /mnt/backup

Restore process:
shutdown all three core instances: sudo systemctl stop neo4j
unbind all three instances: sudo neo4j-admin unbind
restore graph.db: sudo neo4j-admin restore --from=/mnt/backup/graph.db.1 --database=graph.db --force

and the above process also successfully restore the graph.db file

start instances: sudo systemctl restart neo4j
after this I am unable to use the database, my db browser https://35.xx.xx.xx won't work

or neither cypher shell. I have to do reset each cluster vm through google console to have them to start.. then I loose all data.

logs:

2018-09-18 14:09:37.023+0000 INFO [o.n.c.i.ClusterBinder] Published: ClusterId{uuid=56ec27aa-f21a-4e93-9924-0fc3cda6558d}

2018-09-18 14:09:37.032+0000 INFO [o.n.c.c.c.l.s.SegmentedRaftLog] Skipping from {index: -1, term: -1} to {index: 0, term: 0}

2018-09-18 14:09:37.050+0000 INFO [o.n.c.c.s.LocalDatabase] Starting with storeId: Store{creationTime:1537279776817, randomId:-1578084232665967531, upgradeTime:1537279776817, upgradeId:1}

2018-09-18 14:09:37.222+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Selected RecordFormat:StandardV3_4[v0.A.9] record format from store /var/lib/neo4j/data/databases/graph.db

2018-09-18 14:09:37.396+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Selected RecordFormat:StandardV3_4[v0.A.9] record format from store /var/lib/neo4j/data/databases/graph.db

2018-09-18 14:09:37.397+0000 INFO [o.n.k.i.s.f.RecordFormatSelector] Format not configured. Selected format from the store: RecordFormat:StandardV3_4[v0.A.9]

2018-09-18 14:09:37.948+0000 INFO [o.n.k.i.i.l.NativeLabelScanStore] No scan store found, this might just be first use. Preparing to rebuild.

2018-09-18 14:09:38.086+0000 INFO [o.n.k.NeoStoreDataSource] No check point found in transaction log

2018-09-18 14:09:38.087+0000 INFO [o.n.k.NeoStoreDataSource] Recovery required from position LogPosition{logVersion=0, byteOffset=16}

2018-09-18 14:09:38.176+0000 INFO [o.n.k.r.Recovery] 10% completed

2018-09-18 14:09:38.176+0000 INFO [o.n.k.r.Recovery] 20% completed

2018-09-18 14:09:38.178+0000 INFO [o.n.k.r.Recovery] 30% completed

2018-09-18 14:09:38.180+0000 INFO [o.n.k.r.Recovery] 40% completed

2018-09-18 14:09:38.181+0000 INFO [o.n.k.r.Recovery] 50% completed

2018-09-18 14:09:38.227+0000 INFO [o.n.k.r.Recovery] 60% completed

2018-09-18 14:09:38.227+0000 INFO [o.n.k.r.Recovery] 70% completed

2018-09-18 14:09:38.227+0000 INFO [o.n.k.r.Recovery] 80% completed

2018-09-18 14:09:38.236+0000 INFO [o.n.k.r.Recovery] 90% completed

2018-09-18 14:09:38.236+0000 INFO [o.n.k.r.Recovery] 100% completed

2018-09-18 14:09:38.241+0000 INFO [o.n.k.NeoStoreDataSource] Recovery completed. 1 transactions, first:2, last:2 recovered

2018-09-18 14:09:38.257+0000 INFO [o.n.k.i.s.DynamicArrayStore] Rebuilding id generator for[/var/lib/neo4j/data/databases/graph.db/neostore.nodestore.db.labels] ...

2018-09-18 14:09:38.266+0000 INFO [o.n.k.i.s.DynamicArrayStore] [/var/lib/neo4j/data/databases/graph.db/neostore.nodestore.db.labels] high id=1 (defragged=0)

2018-09-18 14:09:38.266+0000 INFO [o.n.k.i.s.DynamicArrayStore] [/var/lib/neo4j/data/databases/graph.db/neostore.nodestore.db.labels] high id=1 (defragged=0)

2018-09-18 14:09:38.266+0000 INFO [o.n.k.i.s.DynamicArrayStore] /var/lib/neo4j/data/databases/graph.db/neostore.nodestore.db.labels rebuild id generator, highId=1 defragged count=0

2018-09-18 14:09:38.266+0000 INFO [o.n.k.i.s.NodeStore] Rebuilding id generator for[/var/lib/neo4j/data/databases/graph.db/neostore.nodestore.db] ...

2018-09-18 14:09:38.272+0000 INFO [o.n.k.i.s.NodeStore] [/var/lib/neo4j/data/databases/graph.db/neostore.nodestore.db] high id=0 (defragged=0)

2018-09-18 14:09:38.272+0000 INFO [o.n.k.i.s.NodeStore] /var/lib/neo4j/data/databases/graph.db/neostore.nodestore.db rebuild id generator, highId=0 defragged count=0

2018-09-18 14:09:38.272+0000 INFO [o.n.k.i.s.DynamicStringStore] Rebuilding id generator for[/var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.index.keys] ...

2018-09-18 14:09:38.278+0000 INFO [o.n.k.i.s.DynamicStringStore] [/var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.index.keys] high id=1 (defragged=0)

2018-09-18 14:09:38.278+0000 INFO [o.n.k.i.s.DynamicStringStore] /var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.index.keys rebuild id generator, highId=1 defragged count=0

2018-09-18 14:09:38.278+0000 INFO [o.n.k.i.s.PropertyKeyTokenStore] Rebuilding id generator for[/var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.index] ...

2018-09-18 14:09:38.282+0000 INFO [o.n.k.i.s.PropertyKeyTokenStore] [/var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.index] high id=0 (defragged=0)

2018-09-18 14:09:38.283+0000 INFO [o.n.k.i.s.PropertyKeyTokenStore] /var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.index rebuild id generator, highId=0 defragged count=0

2018-09-18 14:09:38.286+0000 INFO [o.n.k.i.s.DynamicStringStore] Rebuilding id generator for[/var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.strings] ...

2018-09-18 14:09:38.291+0000 INFO [o.n.k.i.s.DynamicStringStore] [/var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.strings] high id=1 (defragged=0)

2018-09-18 14:09:38.292+0000 INFO [o.n.k.i.s.DynamicStringStore] /var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.strings rebuild id generator, highId=1 defragged count=0

2018-09-18 14:09:38.292+0000 INFO [o.n.k.i.s.DynamicArrayStore] Rebuilding id generator for[/var/lib/neo4j/data/databases/graph.db/neostore.propertystore.db.arrays] ...

2018-09-18 14:09:38.346+0000 INFO [o.n.k.i.s.MetaDataStore] /var/lib/neo4j/data/databases/graph.db/neostore rebuild id generator, highId=15 defragged count=0

2018-09-18 14:09:38.377+0000 WARN [o.n.k.i.s.MetaDataStore] Missing counts store, rebuilding it.

2018-09-18 14:09:38.579+0000 WARN [o.n.k.i.s.MetaDataStore] Counts store rebuild completed.

2018-09-18 14:09:38.622+0000 INFO [o.n.k.i.i.l.NativeLabelScanStore] Rebuilding scan store, this may take a while

2018-09-18 14:09:39.370+0000 INFO [o.n.k.i.i.l.NativeLabelScanStore] Scan store rebuilt (roughly 0 nodes)

2018-09-18 14:09:39.447+0000 INFO [o.n.k.i.DatabaseHealth] Database health set to OK

2018-09-18 14:09:40.144+0000 INFO [o.n.c.c.s.m.t.LastCommittedIndexFinder] Last transaction id in metadata store 2

2018-09-18 14:09:40.147+0000 INFO [o.n.c.c.s.m.t.LastCommittedIndexFinder] Start id of last committed transaction in transaction log 1

2018-09-18 14:09:40.148+0000 INFO [o.n.c.c.s.m.t.LastCommittedIndexFinder] Last committed transaction id in transaction log 2

2018-09-18 14:09:40.148+0000 INFO [o.n.c.c.s.m.t.LastCommittedIndexFinder] Last committed consensus log index committed into tx log -1

2018-09-18 14:09:40.149+0000 INFO [o.n.c.c.s.m.t.ReplicatedTransactionStateMachine] Updated lastCommittedIndex to -1

2018-09-18 14:09:40.151+0000 INFO [o.n.c.c.s.m.t.ReplicatedTokenStateMachine] (Label) Updated lastCommittedIndex to -1

2018-09-18 14:09:40.151+0000 INFO [o.n.c.c.s.m.t.ReplicatedTokenStateMachine] (RelationshipType) Updated lastCommittedIndex to -1

2018-09-18 14:09:40.151+0000 INFO [o.n.c.c.s.m.t.ReplicatedTokenStateMachine] (PropertyKey) Updated lastCommittedIndex to -1

2018-09-18 14:09:40.151+0000 INFO [o.n.c.c.s.CommandApplicationProcess] Restoring last applied index to 0

2018-09-18 14:09:40.155+0000 INFO [o.n.c.c.s.CommandApplicationProcess] Resuming after startup (count = 0)

2018-09-18 14:09:40.173+0000 INFO [o.n.c.n.Server] catchup-server: bound to 0.0.0.0:6000

2018-09-18 14:09:40.175+0000 INFO [o.n.c.n.Server] backup-server: bound to localhost:6362

2018-09-18 14:09:52.590+0000 INFO [o.n.c.c.c.RaftMachine] Election timeout triggered

2018-09-18 14:09:52.594+0000 INFO [o.n.c.c.c.RaftMachine] Election started with vote request: Vote.Request from MemberId{1061f2ce} {term=1, candidate=MemberId{1061f2ce}, lastAppended=0, lastLogTerm=0} and members: [MemberId{9b3258b0}, MemberId{ce96c70b}, MemberId{1061f2ce}]

2018-09-18 14:09:52.594+0000 INFO [o.n.c.c.c.RaftMachine] Moving to CANDIDATE state after successfully starting election

2018-09-18 14:09:52.594+0000 INFO [o.n.c.c.c.RaftMachine] Moving to CANDIDATE state after successfully starting election

2018-09-18 14:09:52.605+0000 INFO [o.n.c.m.SenderService] Creating channel to: [10.x.0.x:7000]

2018-09-18 14:09:52.608+0000 INFO [o.n.c.m.SenderService] Creating channel to: [10.x.0.x:7000]

2018-09-18 14:09:52.629+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:09:52.645+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:09:52.667+0000 INFO [o.n.c.m.SenderService] Connected: [id: 0x8a0009, L:/10.x.0.x:596 - R:/10.x.0.x:7000]

2018-09-18 14:09:52.674+0000 INFO [o.n.c.m.SenderService] Connected: [id: 0xb5834872, L:/10.x.0.x:58202 - R:/10.x.0.x:7000]

2018-09-18 14:09:52.674+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Initiating handshake local /10.x.0.x:59086 remote /10.x.0.x:7000

2018-09-18 14:09:52.781+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Initiating handshake local /10.x.0.x:58202 remote /10.x.0.x:7000

2018-09-18 14:09:52.840+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Installing: ProtocolStack{applicationProtocol=RAFT_1, modifierProtocols=[]}

2018-09-18 14:09:52.903+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Installing: ProtocolStack{applicationProtocol=RAFT_1, modifierProtocols=[]}

2018-09-18 14:09:52.928+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /10.x.0.x:7000 remote /10.x.0.x:44334

2018-09-18 14:09:52.953+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /10.x.0.x:7000 remote /10.x.0.x:51646

2018-09-18 14:09:52.995+0000 INFO [o.n.c.c.c.RaftMachine] Moving to LEADER state at term 1 (I am MemberId{1061f2ce}), voted for by [MemberId{9b3258b0}]

2018-09-18 14:09:52.995+0000 INFO [o.n.c.c.c.s.RaftState] First leader elected: MemberId{1061f2ce}

2018-09-18 14:09:53.013+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Starting log shipper: MemberId{9b3258b0}[matchIndex: -1, lastSentIndex: 0, localAppendIndex: 1, mode: MISMATCH]

2018-09-18 14:09:53.018+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Starting log shipper: MemberId{ce96c70b}[matchIndex: -1, lastSentIndex: 0, localAppendIndex: 1, mode: MISMATCH]

2018-09-18 14:09:53.020+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Idle{}

2018-09-18 14:09:53.244+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by Store copy @ txId: 2 checkpoint completed in 51ms

2018-09-18 14:09:53.247+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 0

2018-09-18 14:10:00.020+0000 INFO [o.n.c.c.c.s.RaftLogShipper] MemberId{9b3258b0}[matchIndex: 0, lastSentIndex: 1, localAppendIndex: 1, mode: MISMATCH]: caught up after mismatch, moving to PIPELINE mode

2018-09-18 14:10:03.086+0000 INFO [o.n.c.c.c.s.RaftLogShipper] MemberId{ce96c70b}[matchIndex: 0, lastSentIndex: 1, localAppendIndex: 1, mode: MISMATCH]: caught up after mismatch, moving to PIPELINE mode

2018-09-18 14:10:08.181+0000 INFO [o.n.c.c.c.m.MembershipWaiter] MemberId{1061f2ce} Catchup: 1 => 1 (0 behind)

2018-09-18 14:10:08.188+0000 INFO [o.n.m.MetricsExtension] Sending metrics to CSV file at /var/lib/neo4j/metrics

2018-09-18 14:10:08.333+0000 INFO [o.n.b.BoltKernelExtension] Bolt enabled on 0.0.0.0:7687.

2018-09-18 14:10:08.377+0000 INFO [o.n.k.AvailabilityGuard] Fulfilling of requirement makes database available: Database available

2018-09-18 14:10:08.377+0000 INFO [o.n.k.i.f.GraphDatabaseFacadeFactory] Database is now ready

2018-09-18 14:10:08.379+0000 INFO [o.n.k.i.DiagnosticsManager] --- STARTED diagnostics for KernelDiagnostics:Versions START ---

2018-09-18 14:10:08.379+0000 INFO [o.n.k.i.DiagnosticsManager] Graph Database: enterprise core StoreId{creationTime=1537279776817, randomId=-1578084232665967531, storeVersion=16094931155187206, upgradeTime=1537279776817, upgradeId=1}

2018-09-18 14:10:08.379+0000 INFO [o.n.k.i.DiagnosticsManager] Kernel version: 3.4.0,efaa8492f0a3439d61055219c070d83f001f3

2018-09-18 14:10:08.379+0000 INFO [o.n.k.i.DiagnosticsManager] --- STARTED diagnostics for KernelDiagnostics:Versions END ---

2018-09-18 14:10:08.383+0000 INFO [o.n.k.i.DiagnosticsManager] --- STARTED diagnostics for NEO_STORE_VERSIONS START ---

2018-09-18 14:10:08.383+0000 INFO [o.n.k.i.DiagnosticsManager] Store versions:

2018-09-18 14:10:08.383+0000 INFO [o.n.k.i.DiagnosticsManager] Store versions:

2018-09-18 14:10:08.384+0000 INFO [o.n.k.i.DiagnosticsManager] ArrayPropertyStore v0.A.9

2018-09-18 14:10:08.384+0000 INFO [o.n.k.i.DiagnosticsManager] NodeStore v0.A.9

idle)

2018-09-18 14:10:13.279+0000 INFO [o.n.k.i.DiagnosticsManager] --- SERVER STARTED START ---

2018-09-18 14:10:13.718+0000 INFO [o.n.m.s.s.ServerMetrics] Server thread metrics has been registered successfully

2018-09-18 14:10:15.626+0000 INFO [o.n.k.i.DiagnosticsManager] --- SERVER STARTED END ---

2018-09-18 14:10:29.668+0000 ERROR [o.n.b.r.MetricsReportingBoltConnection] Unexpected error detected in bolt session '42010afffe800005-000008e3-0000000b-d3c93cc4f504a54d-ae72d487'. The client is unauthorized due to authentication failure.

org.neo4j.bolt.v1.runtime.BoltConnectionFatality: The client is unauthorized due to authentication failure.

    at org.neo4j.bolt.v1.runtime.BoltStateMachine.handleFailure(BoltStateMachine.java:740)

    at org.neo4j.bolt.v1.runtime.BoltStateMachine.handleFailure(BoltStateMachine.java:726)

    at org.neo4j.bolt.v1.runtime.BoltStateMachine.access$300(BoltStateMachine.java:62)

    at org.neo4j.bolt.v1.runtime.BoltStateMachine$State$1.init(BoltStateMachine.java:434)

    at org.neo4j.bolt.v1.runtime.BoltStateMachine.init(BoltStateMachine.java:140)

    at org.neo4j.bolt.v1.messaging.BoltMessageRouter.lambda$onInit$0(BoltMessageRouter.java:70)

    at org.neo4j.bolt.runtime.MetricsReportingBoltConnection.lambda$enqueue$0(MetricsReportingBoltConnection.java:69)

    at org.neo4j.bolt.runtime.DefaultBoltConnection.processNextBatch(DefaultBoltConnection.java:193)

    at org.neo4j.bolt.runtime.MetricsReportingBoltConnection.processNextBatch(MetricsReportingBoltConnection.java:87)

    at org.neo4j.bolt.runtime.DefaultBoltConnection.processNextBatch(DefaultBoltConnection.java:143)

    at org.neo4j.bolt.runtime.ExecutorBoltScheduler.executeBatch(ExecutorBoltScheduler.java:163)

    at org.neo4j.bolt.runtime.ExecutorBoltScheduler.lambda$null$0(ExecutorBoltScheduler.java:145)

    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)

2018-09-18 14:10:35.841+0000 ERROR [o.n.b.r.MetricsReportingBoltConnection] Unexpected error detected in bolt session '42010afffe800005-000008e3-0000001b-fb4c69e5b504be2a-f859068f'. The client is unauthorized due to authentication failure.

2018-09-18 14:14:11.926+0000 WARN [o.n.c.m.SenderService] Lost connection to: 10.x.0.x:7000 (/10.x.0.x:7000)

2018-09-18 14:14:11.929+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:11.931+0000 WARN [o.n.c.m.SenderService] Failed to connect to: /10.x.x.x:7000. Retrying in 100 ms

2018-09-18 14:14:12.032+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:12.235+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:12.639+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:13.206+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:13.442+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:14.574+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core member removed MembershipEvent {member=Member [bisring-neo4j-1-core-vm-3]:5000 - cc842958-72cd-47c4-b074-28e04ee8b4dc,type=removed}

2018-09-18 14:14:14.810+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:15.045+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:15.087+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core topology changed {added=[], removed=[{memberId=MemberId{ce96c70b}, info=CoreServerInfo{raftServer=10.x.0.x:7000, catchupServer=10.x8.x.x:6000, clientConnectorAddresses=bolt://35.x.x.x:7687,http://35.x.x.33:7474,https://35.x.8.x:7473, groups=[]}}]}

2018-09-18 14:14:15.087+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: [MemberId{9b3258b0}, MemberId{1061f2ce}]

2018-09-18 14:14:15.088+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Not safe to remove member [MemberId{ce96c70b}] because it would reduce the number of voting members below the expected cluster size of 3. Voting members: [MemberId{9b3258b0}, MemberId{ce96c70b}, MemberId{1061f2ce}]

2018-09-18 14:14:15.539+0000 INFO [o.n.c.m.RaftOutbound] No address found for MemberId{c

2018-09-18 14:14:42.292+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:42.357+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:42.469+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:43.557+0000 WARN [o.n.c.n.Server] Worker group not shutdown within 10 seconds.

2018-09-18 14:14:43.557+0000 INFO [o.n.c.c.c.ContinuousJob] ContinuousJob neo4j.raft-batch-handler-1 stopping

2018-09-18 14:14:43.638+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:43.660+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:43.878+0000 INFO [o.n.c.c.s.CommandApplicationProcess] Pausing due to shutdown (count = 1)

2018-09-18 14:14:43.889+0000 INFO [o.n.c.c.s.LocalDatabase] Stopping, reason: Database is stopped

2018-09-18 14:14:43.892+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown @ txId: 2 checkpoint started...

2018-09-18 14:14:43.893+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:43.902+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown @ txId: 2 checkpoint completed in 10ms

2018-09-18 14:14:43.903+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 0

2018-09-18 14:14:43.933+0000 INFO [o.n.c.n.Server] raft-server: stopping and unbinding from:0.0.0.0:7000

2018-09-18 14:14:43.958+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:44.070+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:45.240+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:45.262+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:45.495+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:45.560+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:45.672+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null

2018-09-18 14:14:46.138+0000 WARN [o.n.c.n.Server] Worker group not shutdown within 10 seconds.

2018-09-18 14:14:46.138+0000 INFO [o.n.c.c.c.l.p.PruningScheduler] PruningScheduler stopping

2018-09-18 14:14:46.138+0000 INFO [o.n.c.c.CatchUpClient] CatchUpClient stopping

2018-09-18 14:14:46.148+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] HazelcastCoreTopologyService stopping and unbinding from 0.0.0.0:5000

2018-09-18 14:14:46.179+0000 INFO [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics START ---

2018-09-18 14:14:46.179+0000 INFO [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics END ---


(M. David Allen) #8

OK, so what I see in your logs is a lot of different things, but kind of sketching out the story they tell:

  • You restored the database and that worked
  • You started the database and that worked
  • The database started looking for its buddies to form a cluster.
  • It wasn't able to make a stable connection to the buddies. You'll see these lost connection errors showing that.

If you go through the debug.log files of all three machines, it is very possible that one of them is crashing in a loop. This would explain why the machine starts up, connects to the cluster, but the cluster never gets to the happy state.

One detail I cannot remember is when you do the restore, if you are force overwriting the auth files where your users are stored or not. That needs more research. I notice the authorized errors you are getting and it looks like someone tries to log in with a bad pasword, but I'm not sure if you did that (gave the wrong password) or if you failed to log in after bolt was available because you overwrote your credentials in the force restore, I'm not certain.


(Tharshy) #9

I am not clear on failed to login when the bolt was available, I never have to login with user name and password. I am using shh

should I try without --force, I actually deleted the graph.db from each instances and then restored using restore command.. in this case i am not sure if the --force is necessary?


(Tharshy) #10

also clusters join fine if I do a reset on each vm, but after restoring db it won't rejoin ? also if I manually replace a db from a zip file it will work as well.


(M. David Allen) #11

As I suggested earlier, you should try looking through the debug.log files of all three machines and see if you can find the symptom for why the restore is preventing the nodes of the cluster from rejoining one another. What your log seems to be saying is that they do connect, but then they lose the connection. My bet is that one of the other machines (whose logs you aren't showing) is crashing, and you need to know why before you will come further.

In particular, look for stop/start behavior in the debug.log, and look for lines of errors or warnings just before the server restarts, usually these are the hints you'll need.


(Tharshy) #12

ok thanks, we don't have any orders to stop and start right, we can start with any of the clsuters and continue..


(M. David Allen) #13

You can stop any of the cluster members. If you normally have 3 and you stop 1, the cluster can continue to run while missing 1. If you stop 2, then the only one remaining will be read-only. But yes you can stop and start them independently. Naturally this has availability consequences but the data won't be lost.


(Tharshy) #14

but in order to find out ,my issue, which one of the cluster is in a loop and not able to join, i should be able to try starting the cluster just with 2 instances and figure out which one is causing the issue right? I think this may not able to start, since protocol will look for all three instances to start?


(M. David Allen) #15

That's a bit chicken and egg. Assuming that the problem is only one node starting (that is my speculation, we don't have the data yet) -- even then you'd need to know which one was the problematic one in order to start the other two. I wouldn't go about it that way because you probably can't know that, plus there are other interaction factors. Like for example there are cluster settings that can indicate "don't form at all unless you have this many" -- and if that setting is set, then it won't form with 2 no matter how long you wait because it's been configured not to.

I would go through the debug.log files.


(Tharshy) #16

I will reproduce the scanario and have to send you the debug logs, I couldn't figure it out, is that OK. one thing I can confirm, the clusters forms fine, it only don't form after I replace the graph.db file through restore command.

On the other hand, I am able to unzip a copied graph.db and it works fine, and clusters can form without any issues. ( I skip unbind process while copying manually)


(M. David Allen) #17

I think we're moving out of a type of situation that I can support on a community forum. Some of the details you've got require deeper inspection by a support engineer. I suggest you contact your neo4j representative and obtain a service contract, we'd be happy to help you with this, but it's best done in the bounds of that type of engagement.


(Tharshy) #18

yes, I have been touch with local rep, but you they don't provide any support without purchasing the license for 18K. thats bit too much for to just get support. I was able to configure everything in my own, using cloud deployment neo4j.. even I was told to rely on community support, as we are very small business.

I really appreciate you have been very helpful. but unfortunately if it is few hundred we can afford, asking us to register for US18000 license is not possible for the company at this stage.


(Tharshy) #19

i am exactly following the documentation on backup and restore, still having issue clsuter to join..my cluster setup is from google neo4j deployment, so I did not configure anything in my own, all prebuit..so not sure what I am missing here.


(M. David Allen) #20

Understood. I know this can be a frustrating problem and I apologize for that. Best I can recommend to you is to look through the files as suggested and see what you can find. Make sure to carefully review the architecture docs to get an understanding for what is happening, and the importance of ports 5000, 6000, and 7000 to cluster formation. This is in essence what a support engineer would do.