Neo4j Causal Cluster: Backup Strategies

backup
cluster
(Ashutosh) #1

Documentation on cloud cluster backup is still a bit sparse, so I am starting this thread to document the experience and knowledge some of us have picked up.

I recently set up a cluster on Google Cloud using the Marketplace installation. Our production database is a single-instance Enterprise Edition deployment with around 20 GB of data.

I used the following steps to seed the cluster, first as a test to confirm that it works.

  1. On the standalone prod DB I took a full online backup, which did not require any downtime.

    neo4j-admin backup --backup-dir=dir-name --name=backup-name

  2. Stop all cluster members (mine was a core cluster with one leader and two followers) and unbind them.
    sudo systemctl stop neo4j
    sudo neo4j-admin unbind

  3. Use scp to copy the backup to all 3 cluster machines. (Maybe someone can suggest an alternative, if there is one.)

  4. Seed from the backup on all 3 machines.
    sudo rm -rf /var/lib/neo4j/data/databases/graph.db
    sudo neo4j-admin restore --from=dir-name/backup-name --database=graph.db

  5. I hit an interesting issue: after restoring the backup, the cluster DB would not start. After a lot of effort I figured out that the issue was due to changed permissions on the copied graph.db files. To solve this, use the following commands:
    sudo chown -R neo4j /var/lib/neo4j/data/databases/
    sudo chown -R neo4j /var/lib/neo4j/data/cluster-state/

  6. One major doubt: do we need to stop all cluster machines before starting the restore/seed, or can it be done one machine at a time?
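
For anyone who wants to script this, steps 2, 4 and 5 on a single cluster machine could look roughly like the sketch below. It is only a sketch: `BACKUP_PATH`, `DATA_DIR` and `DRY_RUN` are names I made up for illustration, the paths are the ones from my install, and by default it only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch of steps 2, 4 and 5 for ONE cluster member.
# BACKUP_PATH, DATA_DIR and DRY_RUN are illustrative names, not Neo4j settings.
set -euo pipefail

BACKUP_PATH="${BACKUP_PATH:-dir-name/backup-name}"  # backup copied over in step 3
DATA_DIR="${DATA_DIR:-/var/lib/neo4j/data}"         # assumed data directory
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

seed_member() {
  # Step 2: stop the member and remove its cluster state.
  run sudo systemctl stop neo4j
  run sudo neo4j-admin unbind
  # Step 4: replace the existing store with the backup.
  run sudo rm -rf "$DATA_DIR/databases/graph.db"
  run sudo neo4j-admin restore "--from=$BACKUP_PATH" --database=graph.db
  # Step 5: give the restored files back to the neo4j user.
  run sudo chown -R neo4j "$DATA_DIR/databases/"
  run sudo chown -R neo4j "$DATA_DIR/cluster-state/"
}

seed_member
```

Run it once as-is to review the printed commands, then set `DRY_RUN=0` on each machine when the output looks right.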


(M. David Allen) #2

This is great guidance, thanks for posting it!

On your question 6 -- all machines should be stopped. The thing is, the nodes in the cluster all participate as part of the same cluster. If you are restoring a backup, they need to be in agreement about what is in the data set. If you ever have a situation where the nodes of the cluster have a very different perspective on what is in the graph, you could run into problems.

Neo4j uses the Raft consensus protocol, which relies heavily on majority votes. So suppose you unbound the nodes of the cluster, and then only 2 of the 3 have the backup dataset. Depending on the situation, you might see the majority push their updates to the third machine, or you could get some errors. If only one of the three had the backup dataset and the cluster had formed, the one with the good data would be in the minority.
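
To make the majority arithmetic concrete, here is a tiny sketch. It is generic majority math for a voting group, not Neo4j's internal code, and the `majority` helper is a name made up for this example.

```shell
# Smallest number of members that forms a majority among n voters.
majority() {
  echo $(( $1 / 2 + 1 ))
}

n=3
echo "cores=$n majority=$(majority "$n")"
```

For 3 core members the majority is 2, so a single member holding the restored data can always be outvoted by the other two.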

To prevent these problems, the best thing to do is shut down all 3 and unbind them, restore to each, and then bring them all up after the restore is completed. In this way, when they form a cluster, they will already agree on the dataset.
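
As a rough sketch of that ordering, the plan below finishes each phase on every machine before the next phase begins. The host names `core1` to `core3` and the `on_all` and `plan` helpers are hypothetical, the backup path is a placeholder, and the script only prints the plan; running the commands (for example over ssh) is left to you.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical host names; replace with your three core members.
HOSTS=(core1 core2 core3)
BACKUP_PATH="dir-name/backup-name"   # placeholder for the copied backup
DB_DIR="/var/lib/neo4j/data"         # assumed data directory

# Print the command for every host; swap the echo for ssh to execute it.
on_all() {
  local host
  for host in "${HOSTS[@]}"; do
    echo "$host: $*"
  done
}

plan() {
  # Phase 1: stop and unbind every member before touching any data.
  on_all "sudo systemctl stop neo4j && sudo neo4j-admin unbind"
  # Phase 2: restore the same backup on every member.
  on_all "sudo rm -rf $DB_DIR/databases/graph.db && sudo neo4j-admin restore --from=$BACKUP_PATH --database=graph.db"
  # Phase 3: fix ownership of the restored files.
  on_all "sudo chown -R neo4j $DB_DIR/databases/ $DB_DIR/cluster-state/"
  # Phase 4: only now start the members, so they form a cluster on identical data.
  on_all "sudo systemctl start neo4j"
}

plan
```

The point of the phases is that no member is started until all three hold the same restored dataset.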
