Back up your ERISOne Cluster data with Duply

Overview

This document is a guide to backing up a data folder on the ERISOne Linux Computing Cluster using Duply.

Duply is a command-line backup tool (and front-end for Duplicity) that makes full or incremental copies to a remote storage system using ftp, ssh, s3, rsync, cifs, webdav or http protocols.
In this article, we’ll use a CloudBucket account to store the backups. CloudBucket is an Object Storage service that is compatible with the Amazon S3 protocol.
At this time, this service has limited availability for proof-of-concept and pilot use cases.
You can get more information about CloudBucket.


Requirements

These are the basic requirements you’ll need in order to follow these instructions and make your first backup.

Basic Knowledge

Duply is a Linux command-line tool, so you will need to be familiar with:

  • Linux shell
  • ERISOne Computing Cluster

CloudBucket Account

This guide describes how to back up to the CloudBucket storage system. If you don’t have a CloudBucket account, visit the CloudBucket registration page to request a Pilot Service.

In order to replace the generic values in this guide with your own account information, you will need:

  • the name of your assigned bucket
  • the bucket user name
  • the bucket secret key

ERISOne Cluster account and data

Open an SSH terminal session on ERISOne. To learn more about using the cluster, visit the ERISOne Quick Start guide and take a look at the different Scientific Computing training programs; you can register for an account at the ERISOne sign-up page.

  • We recommend that you perform backups from read-only snapshots of your folders rather than the folders themselves, since files may change during the course of the backup.
  • See this page to locate the most recent snapshot of the folder you want to backup.

Configure the Backup Tool

Duply is not installed on all ERISOne compute nodes; only the filemove queue nodes have it installed. Please follow this guide and run all backup jobs on the filemove nodes.

Create Backup Profile

First, you'll need to request an interactive session on a filemove node. In the terminal, type the following command and wait until you get a session:

bsub -Is -q filemove /bin/bash

A separate backup profile must be created for each distinct backup job. To create a profile with the name 'dataset01':

duply dataset01 create
The profile folder will be stored under '~/.duply/dataset01/' (where ~ is the current user's home directory).
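Depending on the Duply version, the new profile folder should contain at least a conf file (and usually an exclude file); you can confirm this with:

ls ~/.duply/dataset01/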

Configure your Backup Profile

The last command should have created a configuration template within the profile folder. There are just a few options that you need to change in the template to set up your backup task. For demonstration purposes, in this guide we are using:

  • bucket name: bucket01
  • bucket username: access_key
  • bucket secret key: secret_key
Edit the file ~/.duply/dataset01/conf, and change or add the following parameters as necessary:
GPG_KEY="disabled"
TARGET="s3://erisecs.partners.org/bucket01/dataset01/"
TARGET_USER="access_key"
TARGET_PASS="secret_key"
SOURCE="/PHShome/abc123/dataset01/"
TEMP_DIR=/PHShome/abc123/backup/tmp
ARCH_DIR=/some/space/safe/.duply-cache
DUPL_PARAMS="--allow-source-mismatch --asynchronous-upload"
Let’s walk through each of these parameters.
GPG_KEY:
This can be set to ‘disabled’, because the CloudBucket service encrypts your data at rest. Remember, though, that you need to request the Data Encryption at Rest functionality as part of your CloudBucket account request.
You can also use GPG if you want an extra layer of encryption. You just need to create your GPG keys and set the key ID and passphrase in the config file.
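If you go that route, a minimal sketch (the key ID and passphrase below are placeholders, not values used elsewhere in this guide) would be:

gpg --gen-key      # create a key pair interactively
gpg --list-keys    # note the ID of the new key

and then, in ~/.duply/dataset01/conf:

GPG_KEY="ABCD1234"
GPG_PW="your-gpg-passphrase"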
TARGET:
The target is the final destination for your backups. The value for this config parameter should match this pattern:

s3://erisecs.partners.org/bucket_name[/prefix]

  • bucket_name: this is the name of the bucket where you want to store your backups. In this example, it’s 'bucket01'

  • prefix: this is optional, but it’s helpful to keep your bucket organized. For example, if you want to store backups from different sources in the same bucket, you can create a prefix for each one. We’re using 'dataset01' as a prefix for the backup set location.
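For instance, two profiles could share the same bucket under different prefixes (the second profile name here is purely illustrative):

# in ~/.duply/dataset01/conf
TARGET="s3://erisecs.partners.org/bucket01/dataset01/"
# in ~/.duply/dataset02/conf
TARGET="s3://erisecs.partners.org/bucket01/dataset02/"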

TARGET_USER:
This is your CloudBucket account ACCESS_KEY. You should have received your access key in the confirmation email when your account was created.
TARGET_PASS:
This is your CloudBucket account’s SECRET_KEY. You should have received your secret key when your account was created.
SOURCE:
The source is the dataset directory that you want to back up. We suggest you start with a small sub-directory or a test dataset. Avoid backing up large, full directories; they will take a long time to back up and to recover.
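If you only need part of a directory tree, you can also trim the backup using the profile’s exclude file (~/.duply/dataset01/exclude), which uses Duplicity’s include/exclude list syntax. The paths below are only examples:

+ /PHShome/abc123/dataset01/results
- /PHShome/abc123/dataset01/scratch
- **/*.tmp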
TEMP_DIR:
If the backup is very large, you may want to use a temporary folder other than the standard /tmp. You can create your own (for example ~/backup/tmp) to avoid running out of space while the backup is being built.
ARCH_DIR:
By default, the cache files are kept in your home directory, but the cache can grow very quickly for a large backup. You can change its location with this parameter.
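If the directories you choose for TEMP_DIR and ARCH_DIR do not already exist, create them before the first run (these paths match the example conf above):

mkdir -p /PHShome/abc123/backup/tmp
mkdir -p /some/space/safe/.duply-cache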
DUPL_PARAMS:
These extra Duplicity options warrant some explanation:
  • --allow-source-mismatch is mandatory, because the job may run on a different filemove node each time and Duplicity would otherwise refuse to continue when the source hostname changes
  • --asynchronous-upload is optional, but can speed up the data transfer

Test your Backup Profile

You can test your backup profile by running:

duply dataset01 backup
If everything goes well, the last few lines in the command output should look like this:
--- Finished state OK at 15:01:46.502 - Runtime 00:00:19.672 ---
--- Start running command POST at 15:01:46.514
--- Skipping n/a script '/PHShome/<username>/.duply/dataset01/post'.
--- Finished state OK at 15:01:46.526 - Runtime 00:00:00.012 ---
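As a further sanity check, you can ask Duply for the collection status to confirm that a backup set now exists in the bucket (the exact output varies by version):

duply dataset01 status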

Submit your Backup Job (for real)

You will ultimately need to submit your backup job to the ERISOne cluster using the 'filemove' queue. For more information on how to do this, see how to create LSF files and submit jobs to the cluster.

Following through on our example, we create an LSF file named backup-dataset01.lsf:

#!/bin/bash
#BSUB -J Duplicity
#BSUB -o backup_logs/backup-dataset01-%J.out
#BSUB -e backup_logs/backup-dataset01-%J.err
echo '---HOSTNAME:---'
hostname
echo '---CURRENT WORKING DIRECTORY:---'
pwd
echo '---CURRENT TIME:---'
date
echo '---RUNNING BACKUP: ---'
duply dataset01 backup
Before submitting the job, create the 'backup_logs' directory so that you have a record of each backup and can review the output afterwards to verify success:
mkdir backup_logs

And now you can submit your backup job to the filemove queue (from the same directory where the backup_logs folder was created):

bsub -q filemove < backup-dataset01.lsf 
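You can monitor the job with the usual LSF commands and review the log once it completes (the job ID in the file name below is illustrative):

bjobs -q filemove
cat backup_logs/backup-dataset01-123456.out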

If you want the job to run on a recurring schedule, you can set it up in a crontab. First, create a script that submits the LSF job so that cron can run it (see the sketch after the crontab example below). Then log in to the eris1cron node and edit your crontab:

ssh eris1cron
crontab -e

For example, to run the script every Saturday at midnight, add this entry:

    0 0 * * 6 /PHShome/abc123/backup/backup_cron.sh
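A minimal sketch of the backup_cron.sh script mentioned above, assuming it sits in the same directory as the LSF file from the earlier example, could be:

#!/bin/bash
# backup_cron.sh - submit the Duply backup job to the filemove queue.
# Depending on the cron environment, you may need to source your shell
# profile first so that 'bsub' is on the PATH.
cd /PHShome/abc123/backup || exit 1
bsub -q filemove < backup-dataset01.lsf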


Recovery Tests 

This document covers only a few simple use cases so that you can check that your backup job is working. We’ll soon publish another guide with more advanced concepts, such as backup policies and point-in-time recovery, along with best practices.

You can list the files contained in the latest backup with:

duply dataset01 list

Restoring a specific file or directory

Use the fetch command to restore a specific file. The example below restores the dir/file01.txt file from the backup and saves it to restore/file01.txt. Notice that there is no leading slash in the dir/file01.txt argument: paths are given relative to the root of the backed-up SOURCE directory. Also note that the directory 'restore' must already exist.

duply dataset01 fetch dir/file01.txt restore/file01.txt
The fetch command also works on directories. Note that in this case the destination directory should NOT exist.
duply dataset01 fetch dir restored_files/
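The fetch command also accepts an optional age argument if you need an older version rather than the latest one; for example, to retrieve the file as it was seven days ago (assuming the profile has backups going back that far):

duply dataset01 fetch dir/file01.txt restore/file01_7daysago.txt 7D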

Full restore

You can restore the latest backup with the Duply restore command.

The following will restore the latest 'dataset01' backup to the target directory 'dataset01_restored'. Note that the target directory does not need to exist; Duply will automatically create it.

duply dataset01 restore dataset01_restored
Go to KB0028042 in the IS Service Desk