Seeding Neo4j using a Backup on Docker

20/02/2020 02:20

In this post, I’ll walk through how you can take a backup of an existing database and use it to seed an instance of Neo4j inside a Docker container. This could be useful if you are looking to fire up a development server using real data. I’ll show you how to launch an instance of Neo4j using docker-compose and then extend the official Docker image by creating a custom Dockerfile.

Neo4j on Docker

You can find Neo4j images going back to 3.4 on Docker Hub, tagged x.y.z for Community Edition and x.y.z-enterprise for Enterprise Edition. You can get up and running quickly with the docker run command.

In order to run Enterprise Edition, you need to accept the Neo4j Licensing Agreement. You do this with Docker by setting the NEO4J_ACCEPT_LICENSE_AGREEMENT variable to yes.

docker run --env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes \
    neo4j:4.0.0-enterprise

Taking a Backup with neo4j-admin backup

The neo4j-admin tool located in the $NEO4J_HOME/bin folder allows you to run a number of administration commands including backup, restore and an import tool for imports of over 1M rows.

There are two ways to export a database: dump and backup. The dump command creates an archive that can be easily shared and is great for smaller databases. For larger databases, the backup command allows you to take incremental backups: if you run a backup into a directory that already contains a backup, it will fetch only the difference and append it to the store files rather than starting from transaction id 0.

Backups are enabled by default, but the backup service only listens for requests from localhost. Because we’ll be pulling the backup from another machine (our local one), we’ll need to enable remote backups on the server by setting dbms.backup.listen_address to 0.0.0.0:6362.

neo4j.conf

# Enable online backups to be taken from this database.
dbms.backup.enabled=true
# By default the backup service will only listen on localhost.
# To enable remote backups you will have to bind to an external
# network interface (e.g. 0.0.0.0 for all interfaces).
# The protocol running varies depending on deployment. In a Causal Clustering environment this is the
# same protocol that runs on causal_clustering.transaction_listen_address.
dbms.backup.listen_address=0.0.0.0:6362

To run the backup you can run the following command:

# --from:       hostname or IP and backup port of the Neo4j server
# --backup-dir: directory to store the backup in
# --database:   the name of the database to back up
bin/neo4j-admin backup \
    --from=neo4jurl:6362 \
    --backup-dir=/path/to/backups \
    --database=neo4j

A consistency check runs as part of the backup to make sure that the backup files are OK, but this can take a while on a large database. You can disable this by adding --check-consistency=false and check the consistency at a later time.
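
To make the optional flag concrete, here is a minimal sketch of a helper that prints the backup command with or without the consistency check. The function name backup_command and its argument order are hypothetical, made up for this illustration; only the neo4j-admin flags come from the text above.

```shell
# Hypothetical helper: prints (does not run) a neo4j-admin backup command.
#   $1 = host:port of the backup listener
#   $2 = directory to store the backup in
#   $3 = database name (defaults to neo4j)
#   $4 = pass "false" to skip the consistency check
backup_command() {
    cmd="neo4j-admin backup --from=$1 --backup-dir=$2 --database=${3:-neo4j}"
    if [ "${4:-true}" = "false" ]; then
        cmd="$cmd --check-consistency=false"
    fi
    printf '%s\n' "$cmd"
}

# Skip the check for a large database.
backup_command neo4jurl:6362 /path/to/backups neo4j false
```

If you skip the check this way, my understanding is that it can be run later against the backup directory with the neo4j-admin check-consistency command and its --backup option, but verify that against the documentation for your version.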

Automating the Backup using Docker

One of the nice things about Docker is that you can extend an existing image with your own Dockerfile. Images can also be published to Docker Hub, but I won’t cover that here. The FROM keyword allows you to choose an image to build on top of; in this case we want the latest version of Neo4j Enterprise.

Dockerfile

The neo4j images all automatically start up the neo4j instance. In this case, we want to take a backup of the production server and restore it before the neo4j server starts. We can do this by replacing the ENTRYPOINT with one of our own. There’s a lot of complicated stuff going on in the docker-entrypoint.sh file that I don’t really want to be replicating and maintaining, so instead we can just create a new shell file which performs the backup and restore before calling the original docker-entrypoint.sh file.

my-entrypoint.sh

#!/bin/bash
echo "Running Backup & Restore"
neo4j-admin backup --from="$PRODUCTION" --backup-dir=/backup
neo4j-admin restore --from=/backup/neo4j --database=neo4j --force

The script runs the neo4j-admin backup command and places the backup in the /backup directory before restoring it into the default neo4j database. Because the server address comes from the $PRODUCTION environment variable, it can be set with an --env flag when the container is created. The --force flag will overwrite any files that already exist, which is perfect if we’re mounting a volume for the data.

File Ownership

File ownership caused me a few issues when developing this script: the neo4j process is run by a user called neo4j, whereas this entrypoint script is run by root. Originally, this caused a Neo.TransientError.Database.DatabaseUnavailable error complaining that Database 'neo4j' is unavailable. This was because the neo4j user couldn’t write to the directory. Running chown on the /data directory to hand it to neo4j:neo4j fixes this issue.

my-entrypoint.sh

chown -R neo4j:neo4j /data

After that, the original docker-entrypoint.sh script can be run to work its magic and bring the database up.

my-entrypoint.sh

/docker-entrypoint.sh neo4j
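
Putting the three fragments together, the whole entrypoint might look something like the sketch below. It is written as a heredoc so the file can be created in one step. The set -e line, the check that PRODUCTION is set, and the use of exec are my own assumptions rather than part of the original script, so drop them if they don't suit your setup.

```shell
# Write the combined entrypoint script to the current directory.
cat > my-entrypoint.sh <<'EOF'
#!/bin/bash
# Assumption: stop straight away if the backup or restore fails.
set -e

# Assumption: refuse to start without the address of the server to copy from.
if [ -z "$PRODUCTION" ]; then
    echo "PRODUCTION must be set to the host:port of the backup listener" >&2
    exit 1
fi

echo "Running Backup & Restore"
neo4j-admin backup --from="$PRODUCTION" --backup-dir=/backup
neo4j-admin restore --from=/backup/neo4j --database=neo4j --force

# The restore ran as root, so hand the files back to the neo4j user.
chown -R neo4j:neo4j /data

# Hand over to the image's own entrypoint to start the server.
exec /docker-entrypoint.sh neo4j
EOF

chmod +x my-entrypoint.sh
sh -n my-entrypoint.sh   # syntax check only; nothing is executed here
```

Using exec means the original entrypoint replaces this script's process instead of running as its child, which should keep signal handling the same as in the stock image.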

Modifying the Dockerfile

Back in Dockerfile, a few commands are needed to clean things up. Firstly, we’ll need to accept the license agreement.

Dockerfile

ENV NEO4J_ACCEPT_LICENSE_AGREEMENT yes

Next, setting the dbms.directories.data directory to a folder in the root will make it easier to mount a volume.

Dockerfile

ENV NEO4J_dbms_directories_data /data
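
The variable name above follows the naming convention the official image uses for configuration: prefix the setting with NEO4J_, replace dots with underscores, and double any underscores already in the name. A small sketch of that mapping, using a hypothetical helper called to_env_name:

```shell
# Hypothetical helper: turn a neo4j.conf key into the matching variable name.
to_env_name() {
    # Double existing underscores first, then turn dots into single underscores.
    printf 'NEO4J_%s\n' "$(printf '%s' "$1" | sed -e 's/_/__/g' -e 's/\./_/g')"
}

to_env_name dbms.directories.data        # NEO4J_dbms_directories_data
to_env_name dbms.backup.listen_address   # NEO4J_dbms_backup_listen__address
```

So a setting with an underscore in its name, such as the backup listen address from earlier, needs the double underscore when it is passed as an environment variable.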

Then, my-entrypoint.sh needs to be copied into the image. By default the file will not have execute permissions, so a RUN command calls chmod to make it executable.

Dockerfile

WORKDIR /
COPY my-entrypoint.sh /my-entrypoint.sh
RUN chmod +x /my-entrypoint.sh

Finally, we can overwrite the ENTRYPOINT to run my-entrypoint.sh (and subsequently the original docker-entrypoint.sh) before running the neo4j command to start neo4j.

Dockerfile

ENTRYPOINT ["/sbin/tini", "-g", "--", "/my-entrypoint.sh"]
CMD ["neo4j"]
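
For reference, here are all of the Dockerfile fragments in one place, written out with a heredoc. The FROM line is an assumption on my part: it reuses the 4.0.0-enterprise tag from the docker run example earlier, so swap in whichever Enterprise tag you actually want to build on.

```shell
# Write the assembled Dockerfile to the current directory.
cat > Dockerfile <<'EOF'
# Assumption: base tag taken from the docker run example earlier in the post.
FROM neo4j:4.0.0-enterprise

# Accept the license and keep the data directory in the root for easy mounting.
ENV NEO4J_ACCEPT_LICENSE_AGREEMENT yes
ENV NEO4J_dbms_directories_data /data

# Copy in the custom entrypoint and make it executable.
WORKDIR /
COPY my-entrypoint.sh /my-entrypoint.sh
RUN chmod +x /my-entrypoint.sh

# Run the backup and restore first, then start neo4j as usual.
ENTRYPOINT ["/sbin/tini", "-g", "--", "/my-entrypoint.sh"]
CMD ["neo4j"]
EOF
```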

Building the image

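The docker build command creates an image that can be used when creating containers. Run from the directory containing the Dockerfile, that is docker build -t dev . (the trailing dot is the build context).

To make life easier, I have tagged the new image as dev using the -t dev flag. Without a tag the image can only be referenced by its generated ID, which makes the whole thing much harder to automate.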

Creating a Container

Containers with the newly created dev image can be created using the docker run command. I have mapped the HTTP and Bolt ports using -p so I can access the Neo4j Browser and query the data via Bolt. As mentioned before, running a backup on a directory with an existing backup will trigger an incremental backup, so I will mount the backup directory as a volume on the container. The same goes for the data directory. The local paths to the volumes need to be absolute, so I have created a $HERE environment variable to make things a bit easier.
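
One way to set that up, assuming you run it from the project directory, is sketched below. Creating the two directories up front is my own addition.

```shell
# Absolute path of the current directory, used for the volume mounts.
HERE="$(pwd)"
export HERE

# Create the local directories that will be mounted into the container.
mkdir -p "$HERE/backup" "$HERE/data"
```

Because pwd always prints an absolute path, the volume paths built from $HERE will be absolute too.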

# -p:               map the container's HTTP (7474) and Bolt (7687) ports to 17474 and 17687
# --env PRODUCTION: address of the server to take the backup from
# --volume:         mount the local backup and data directories at /backup and /data
# dev:              use the newly built dev image
docker run --name=dev \
    -p 17474:7474 \
    -p 17687:7687 \
    --env="PRODUCTION=prod.databases.adamcowley.co.uk:6463" \
    --volume="$HERE/backup:/backup" \
    --volume="$HERE/data:/data" \
    dev

Conclusion

Being fairly inexperienced with Docker, this took me a while to figure out. But once I realised that I can just extend an existing image, my life became a lot easier. This process works well for a single instance, but could also be used to automate the seeding and deployment of Read Replicas. Downloading a copy of a previous backup and mounting it as a volume will speed up the startup process on larger databases.

I’ve put the code up on GitHub - feel free to pull, clone or submit a PR.


This is a companion discussion topic for the original entry at https://adamcowley.co.uk/neo4j/neo4j-docker-seed-backup/