November 5, 2024
Overview
This document is a guideline on how to back up a data folder on the Scientific Computing Linux Cluster using Duply.
Requirements
Basic Knowledge
Duply is a Linux command-line tool, so you will need to be familiar with:
- Linux shell
- Scientific Computing Linux Cluster
MAD3 Storage
MAD3 is a storage solution built on Dell EMC Isilon and is intended to provide secure access to large amounts of affordable archival storage for infrequently accessed data. Some key points are:
- Data can be accessed via SMB
- Integration with PHS AD and LDAP security (authentication mechanisms may depend on the protocol)
- The storage systems are fault tolerant but do not include a disaster recovery solution.
- Files on Isilon are protected by RAID redundancy.
More information and application details can be found here
Scientific Computing Linux Cluster account and data
Open an SSH terminal session on ERISTwo. To learn more about using the cluster, you can visit the Scientific Computing Linux Clusters Quick Start Guide and take a look at the different Scientific Computing training programs; you can register for an account at the ERISTwo sign-up page.
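For example, from a terminal on your workstation (the hostname below is an assumption; use the address given in the Quick Start Guide if it differs):
# Open an SSH session on ERISTwo, replacing <userID> with your cluster account name
ssh <userID>@eristwo.partners.org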
Configure the Backup Tool
Duply is not installed on all Scientific Computing nodes; only the filemove queue nodes have it installed. Please follow this guide and run all backup jobs on the filemove nodes.
Create Backup Profile
First, you'll need to request an interactive session on a filemove node. In the terminal, type the following and wait until you get a session.
bsub -Is -q filemove /bin/bash
A separate backup profile must be created for each distinct backup job. To create a profile with the name 'dataset01':
duply dataset01 create
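Assuming Duply's default behaviour, this creates a profile folder under ~/.duply containing the configuration template; you can confirm it with:
# Inspect the newly created profile folder (default Duply location)
ls ~/.duply/dataset01/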
Configure your Backup Profile
The last command should have created a configuration template within the profile folder. There are just a few options that you need to change in the template to set up your backup task.
GPG_KEY="disabled" TARGET="rsync://<userId>@mad3.partners.org/<Share Name>/<userID>/dataset01/" SOURCE="/PHShome/<userID>/dataset01/"
TEMP_DIR=/PHShome/<userID>/backup/tmp
ARCH_DIR=/some/space/safe/.duply-cache
DUPL_PARAMS="--allow-source-mismatch --asynchronous-upload"
The general form of an rsync target URL is:
rsync://user[:password]@host.com[:port]::[/]module/some_dir
and in our case, a typical target URL for the dataset01 folder might be:
rsync://<userID>@mad3.partners.org/<Share Name>/<userID>/dataset01/
where
- <Share Name> - the name of the storage provided after registering for the MAD3 service, and of the form:
MGB-XXXXXXXXXX
- <userID> - the user account name on the Scientific Computing Cluster.
- --allow-source-mismatch is mandatory: backup jobs may run on different filemove nodes, and Duplicity refuses to continue an existing backup when the source hostname changes unless this option is set
- --asynchronous-upload is optional, but will speed up the data transfer
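Putting this together, a minimal sketch of a filled-in conf file might look like the following; the user ID jdoe1 and share name MGB-0123456789 are hypothetical placeholders, and the cache path is just one possible choice:
# ~/.duply/dataset01/conf (illustrative values only)
GPG_KEY="disabled"                       # no encryption of the backup volumes
TARGET="rsync://jdoe1@mad3.partners.org/MGB-0123456789/jdoe1/dataset01/"
SOURCE="/PHShome/jdoe1/dataset01/"
TEMP_DIR=/PHShome/jdoe1/backup/tmp
ARCH_DIR=/PHShome/jdoe1/backup/.duply-cache
DUPL_PARAMS="--allow-source-mismatch --asynchronous-upload"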
Test your Backup Profile
You can test your backup profile by running:
duply dataset01 backup
If everything is configured correctly, the output should end with lines similar to:
--- Finished state OK at 15:01:46.502 - Runtime 00:00:19.672 ---
--- Start running command POST at 15:01:46.514
--- Skipping n/a script '/PHShome/<userID>/.duply/dataset01/post'.
--- Finished state OK at 15:01:46.526 - Runtime 00:00:00.012 ---
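Once the test run has finished, you can also inspect the backup chain with Duply's status command:
# Show the collection status (full and incremental backups) for the profile
duply dataset01 status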
Submit your Backup Job (for real)
You will ultimately need to submit your backup job to the Scientific Computing cluster using the 'filemove' queue. For more information on how to do this, see how to create LSF files and submit jobs to the cluster.
Following through on our example, we create an LSF file named backup-dataset01.lsf:
#!/bin/bash
#BSUB -J Duplicity
#BSUB -o backup_logs/backup-dataset01-%J.out
#BSUB -e backup_logs/backup-dataset01-%J.err
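# The #BSUB directives above name the job and send per-job stdout/stderr
# to the backup_logs/ directory (%J expands to the LSF job ID).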
echo '---HOSTNAME:---'
hostname
echo '---CURRENT WORKING DIRECTORY:---'
pwd
echo '---CURRENT TIME:---'
date
echo '---RUNNING BACKUP: ---'
duply dataset01 backup
Create the backup_logs directory before submitting the job, so that LSF can write the output and error logs there:
mkdir backup_logs
And now, from the same directory, you can submit your backup job to the filemove queue:
bsub -q filemove < backup-dataset01.lsf
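You can then monitor the job with the standard LSF commands, e.g.:
# List your pending and running jobs
bjobs
# After the job completes, review its log (replace <jobID> with the ID reported by bsub)
cat backup_logs/backup-dataset01-<jobID>.out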
If you want to take regular backups, you can set up a cron job to do this. As an illustration, we first create a small shell script, backup_cron.sh, that submits the LSF job (a sketch is shown after the crontab example below). Then log in to the eris1cron node and use the crontab editor to specify when the backup cron job should run, i.e.
ssh eris1cron
crontab -e
- For example to run it every Saturday at midnight use:
0 0 * * 6 /PHShome/<userID>/backup/backup_cron.sh
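A minimal sketch of backup_cron.sh, assuming the LSF file lives in /PHShome/<userID>/backup (adjust the path to your own layout, and make the script executable with chmod +x):
#!/bin/bash
# backup_cron.sh - submit the Duply backup job to the filemove queue
# The directory below is an assumption; it must contain backup-dataset01.lsf
# and the backup_logs folder.
# (If bsub is not found when run from cron, you may need to load the LSF environment first.)
cd /PHShome/<userID>/backup || exit 1
bsub -q filemove < backup-dataset01.lsf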
Recovery Tests
This document covers only a few simple use cases so that you can be sure that your backup job is working. We will soon publish another guide covering more advanced concepts, such as backup policies and point-in-time recovery, along with best practices.
You can list the files contained in the latest backup with:
duply dataset01 list
Restoring a specific file or directory
Use the fetch command to restore a specific file or directory. The command below restores the file dir/file01.txt from the backup and saves it to restore/file01.txt. Notice the lack of a leading slash in the dir/file01.txt argument: paths are given relative to the root of the backup. Also, please note that the directory 'restore' must already exist.
duply dataset01 fetch dir/file01.txt restore/file01.txt
Similarly, to restore the entire dir directory from the backup into restored_files/:
duply dataset01 fetch dir restored_files/
Full restore
You can restore the latest backup with the Duply restore command.
The following will restore the latest 'dataset01' backup to the target directory 'dataset01_restored'. Note that the target directory does not need to exist; Duply will automatically create it.
duply dataset01 restore dataset01_restored
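As a further check, Duply's verify command compares the files in the backup against the current contents of the source directory and reports any differences:
# Compare the latest backup against the profile's SOURCE directory
duply dataset01 verify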