Back up your ERISTwo Cluster data with Duply

Overview

This document is a guide to backing up a data folder on the Scientific Computing Linux Cluster using Duply.

Duply is a command-line backup tool (and front end for Duplicity) that makes full or incremental copies to a remote storage system using the FTP, SSH, S3, rsync, CIFS, WebDAV, or HTTP protocols.

Requirements

These are the basic requirements you’ll need in order to follow the instructions and make your first backup.

Basic Knowledge

Duply is a Linux command-line tool, so you will need to be familiar with:

  • Linux shell
  • Scientific Computing Linux Cluster

MAD3 Storage

MAD3 is a storage solution built on Dell EMC Isilon that is intended to provide secure access to large amounts of affordable archival storage for infrequently accessed data. Some key points are:

  • Data can be accessed via SMB.
  • Integration with PHS AD and LDAP security (authentication mechanisms may depend on the protocol).
  • The storage systems are fault tolerant but do not include a disaster recovery solution.
  • Files on Isilon are protected by RAID redundancy.

More information and application details can be found here.

Scientific Computing Linux Cluster account and data

Open an SSH terminal session on ERISTwo. To learn more about using the cluster, you can visit the Scientific Computing Linux Clusters Quick Start Guide and take a look at the different Scientific Computing training programs, and you can register for an account at the ERISTwo sign-up page.

Configure the Backup Tool

Duply is not installed on all Scientific Computing nodes; only the filemove queue nodes have Duply installed. Please follow this guide and run all backup jobs on the filemove nodes.
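
If you want to confirm that the queue exists and is open before submitting work, you can query it with a standard LSF command (shown only as an illustration; the output details will vary with the cluster configuration):

bqueues filemove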

Create Backup Profile

First, you'll need to request an interactive session on a filemove node. In the terminal, type the following and wait until you get a session.

bsub -Is -q filemove /bin/bash

A separate backup profile must be created for each distinct backup job. To create a profile with the name 'dataset01':

duply dataset01 create

The profile folder will be stored under '~/.duply/dataset01/' (where ~ is the current user's home directory).

Configure your Backup Profile

The last command should have created a configuration template within the profile folder. There are just a few options that you need to change in the template to set up your backup task.

Edit the file ~/.duply/dataset01/conf, and change or add the following parameters as necessary:

GPG_KEY="disabled"
TARGET="rsync://<userID>@mad3.partners.org/<Share Name>/<userID>/dataset01/"
SOURCE="/PHShome/<userID>/dataset01/"
TEMP_DIR=/PHShome/<userID>/backup/tmp
ARCH_DIR=/some/space/safe/.duply-cache
DUPL_PARAMS="--allow-source-mismatch --asynchronous-upload"

Let’s walk through each of these parameters.
GPG_KEY:
If you wish to add an additional layer of protection, you can create a GPG key and set the key ID and passphrase in the config file; otherwise leave GPG_KEY set to "disabled".
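As a minimal sketch (the key ID and passphrase below are placeholders, and GPG_PW is the variable Duply's conf template typically uses for the passphrase), you could generate a key with GnuPG, look up its ID, and then reference it in ~/.duply/dataset01/conf:

gpg --gen-key
gpg --list-keys

GPG_KEY="ABCD1234"
GPG_PW="<your passphrase>"
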
TARGET:
The target is the final destination for your backups. The value for this config parameter should match this pattern:

rsync://user[:password]@host.com[:port]::[/]module/some_dir

and in our case, a typical file path to the dataset01 folder might be:

rsync://<userID>@mad3.partners.org/<Share Name>/<userID>/dataset01/

where

  • <Share Name> - the name of the storage provided after registering for the MAD3 service, of the form MGB-XXXXXXXXXX
  • <userID> - the user account name on the Scientific Computing Cluster.
SOURCE:
The source is the dataset directory that you want to back up. We suggest you start with a small sub-directory or a test dataset. Avoid backing up very large directories in a single job, as they will take a long time to back up and recover.
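For example, to check how much data a candidate source directory contains before setting it as SOURCE (the path below is illustrative):

du -sh /PHShome/<userID>/dataset01/
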
TEMP_DIR:
If the backup is very large, you may need to use a different temporary folder than the standard /tmp. You can create your own, for example ~/backup/tmp, to avoid space issues while creating the backup.
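For instance, to create the temporary directory matching the example TEMP_DIR value above:

mkdir -p /PHShome/<userID>/backup/tmp
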
ARCH_DIR:
By default the cache files are located in the home directory, but the cache can grow very quickly if the backup is large; you can change its location with this parameter.
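For example, you could create a cache directory on a filesystem with enough free space and point ARCH_DIR at it (the path below is only an illustration):

mkdir -p /PHShome/<userID>/backup/.duply-cache

and then set ARCH_DIR=/PHShome/<userID>/backup/.duply-cache in the conf file.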
DUPL_PARAMS:
These Duplicity options warrant some explanation:
  • --allow-source-mismatch is mandatory 
  • --asynchronous-upload is optional, but will speed up the data transfer

Test your Backup Profile

You can test your backup profile by running:

duply dataset01 backup

If everything goes well, the last few lines in the command output should look like this:
--- Finished state OK at 15:01:46.502 - Runtime 00:00:19.672 ---
--- Start running command POST at 15:01:46.514
--- Skipping n/a script '/PHShome/<userID>/.duply/dataset01/post'.
--- Finished state OK at 15:01:46.526 - Runtime 00:00:00.012 ---
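
As an additional check, you can display the collection status of the profile, which should now show at least one backup chain on the target:

duply dataset01 status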

Submit your Backup Job (for real)

You will ultimately need to submit your backup job to the Scientific Computing cluster using the 'filemove' queue. For more information on how to do this, see how to create LSF files and submit jobs to the cluster.

Following through on our example, we create an LSF file named backup-dataset01.lsf:

#!/bin/bash
#BSUB -J Duplicity
#BSUB -o backup_logs/backup-dataset01-%J.out
#BSUB -e backup_logs/backup-dataset01-%J.err
echo '---HOSTNAME:---'
hostname
echo '---CURRENT WORKING DIRECTORY:---'
pwd
echo '---CURRENT TIME:---'
date
echo '---RUNNING BACKUP: ---'
duply dataset01 backup

Create the 'backup_logs' directory so that we will have a record of the backups that take place and can review the output after the backup is complete to verify success:

mkdir backup_logs

And now you can submit your backup job to the filemove queue (in the same directory where the backup_logs folder was created):

bsub -q filemove < backup-dataset01.lsf 
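
You can monitor the submitted job with the standard LSF command:

bjobs

Once the job has finished, review the corresponding file under backup_logs/ and look for the 'Finished state OK' lines shown earlier.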

If you want to take regular backups, you can set up a cron job to do this. As an illustration, first create a small script that submits the LSF job and can be run by cron (a sketch is shown after the crontab example below). Then log in to the eris1cron node and use the crontab editor to specify when the backup cron job should run:

ssh eris1cron
crontab -e

For example, to run the backup every Saturday at midnight use:

    0 0 * * 6 /PHShome/<userID>/backup/backup_cron.sh
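
The backup_cron.sh script referenced in the crontab entry is not shown elsewhere in this guide; a minimal sketch, assuming the LSF file lives in /PHShome/<userID>/backup/, might look like this:

#!/bin/bash
# Submit the Duply backup job to the filemove queue.
# The path below is illustrative; adjust it to where backup-dataset01.lsf lives.
cd /PHShome/<userID>/backup/ || exit 1
bsub -q filemove < backup-dataset01.lsf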


Recovery Tests 

This document covers only a few simple use cases so that you can be sure your backup job is working. We will soon publish another guide covering advanced concepts such as backup policies, point-in-time recovery, and best practices.

You can list the files contained in the latest backup with:

duply dataset01 list
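
You can also compare the latest backup against the current contents of the source directory with Duply's verify command, which lists any files that differ:

duply dataset01 verify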

Restoring a specific file or directory

Use the fetch command to restore a specific file. The following restores the dir/file01.txt file from the backup and saves it to restore/file01.txt. Notice the lack of a leading slash in the dir/file01.txt argument. Also, please note that the directory 'restore' must already exist.

duply dataset01 fetch dir/file01.txt restore/file01.txt

The fetch command also works on directories. Note that in this case the destination directory should NOT exist.

duply dataset01 fetch dir restored_files/

Full restore

You can restore the latest backup with the Duply restore command.

The following will restore the latest 'dataset01' backup to the target directory 'dataset01_restored'. Note that the target directory does not need to exist; Duply will automatically create it.

duply dataset01 restore dataset01_restored
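
If you need to go back to a version older than the latest backup, the restore command also accepts an age argument (for example, 7D for the state seven days ago); point-in-time recovery will be covered in more detail in the forthcoming guide mentioned above:

duply dataset01 restore dataset01_restored 7D
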
Go to KB0028042 in the IS Service Desk