Data Backup and Storage

Alex Albaugh & Sukanya Sasmal

Backing up data should be a routine and vital part of a computational lab’s procedures. Hardware failure is more common than we’d like to admit, and we should take steps to ensure that our work is protected from it. Additionally, long-term storage of data is crucial so that future lab members and collaborators can access our hard-won knowledge. We have access to three levels of backup and storage: HPSS on NERSC machines, Boxbackup on Armada, and Box on our desktop machines. Each has a place in the hierarchy of backup and storage methods.

I. NERSC High Performance Storage System (HPSS)

HPSS is a NERSC resource that should be used as long-term storage for data that is not currently in use. Your home directory at NERSC is limited to 40 GB. NERSC machines also have local scratch directories and project directories where you can put data that won’t fit in your home directory. In general, you should avoid storing large amounts of data in your home directory. Local scratch directories are machine-specific (so remember which machine you put your data on) and can be accessed via “cd $SCRATCH”. Scratch directories have 20 TB of storage per user, but unused files are cleared every 12 weeks, so they are intended for files in current use rather than long-term storage. Project directories are at /project/projectdirs/thglab or /project/projectdirs/[repo], where [repo] is your repo name (e.g. m1876). Project directories are permanent but have only 4 TB of storage, which is shared among all users on the project or repo, so you need to be aware of your personal usage.
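
One way to keep an eye on personal usage is a small shell helper built on du. This is only a sketch: the function names usage_kb and check_quota are hypothetical, and the limit you pass in is whatever quota figure you believe applies (for example the 40 GB home limit quoted above).

```shell
#!/bin/sh
# Print the disk usage of a directory in kilobytes (du -sk is POSIX).
usage_kb() {
    du -sk "$1" | awk '{print $1}'
}

# Compare a directory's usage with a limit given in gigabytes.
# The limit is supplied by the caller; it is not read from the system.
check_quota() {
    dir=$1
    limit_gb=$2
    used_kb=$(usage_kb "$dir")
    limit_kb=$((limit_gb * 1024 * 1024))
    if [ "$used_kb" -gt "$limit_kb" ]; then
        echo "OVER: $dir uses ${used_kb} KB (limit ${limit_gb} GB)"
    else
        echo "OK: $dir uses ${used_kb} KB (limit ${limit_gb} GB)"
    fi
}
```

For example, check_quota "$HOME" 40 compares your home directory against the 40 GB limit. Running du over a very large directory tree can take a while.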

In short, your home directory is small but permanent, scratch directories are large but temporary, and project directories are moderately sized but shared. The best option we have for large, permanent data storage is therefore HPSS. You can archive your files from NERSC or Armada to HPSS using the hsi and htar commands (see the NERSC site, http://www.nersc.gov/users/storage-and-file-systems/hpss/getting-started/). This is highly recommended so that you have a backup copy of your data and files. You should not archive large numbers of small files individually; if you have a directory of many small files, tar it into one file before archiving to HPSS.
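
The advice about small files can be sketched as follows: bundle the directory into a single tar file first, then send that one file to HPSS. The function names bundle_dir and archive_to_hpss are hypothetical, and the upload step assumes hsi is on your PATH; it is skipped otherwise.

```shell
#!/bin/sh
# Bundle a directory of many small files into one tar file.
bundle_dir() {
    src=$1       # directory to bundle, e.g. ./example
    archive=$2   # tar file to create, e.g. ./example.tar
    # -C switches to the parent directory so the archive stores
    # the relative path "example/..." rather than an absolute one.
    tar -cf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}

# Copy the single tar file to HPSS, if hsi is available on this machine.
archive_to_hpss() {
    archive=$1
    if command -v hsi >/dev/null 2>&1; then
        # "put local : remote" stores the file under its base name in HPSS.
        hsi "put $archive : $(basename "$archive")"
    else
        echo "hsi not found; skipping upload of $archive" >&2
    fi
}
```

A typical call would be bundle_dir ./example ./example.tar followed by archive_to_hpss ./example.tar.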

We also have an HPSS project directory if you want to archive your data into a group-accessible location. This directory is located at /home/projects/thglab when accessed through hsi.

To back up a folder on HPSS:

htar -cf example.tar example

To retrieve a folder:

1) cd to the directory where you want the folder to be

2) log on to HPSS by running hsi

3) get example.tar

4) exit

5) unpack the archive with tar -xf example.tar
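
The retrieval steps above can be sketched as one shell function. The name retrieve_folder is hypothetical, and the sketch assumes the archive sits in your default HPSS directory under the same name. If the tar file is already present in the destination, the download is skipped.

```shell
#!/bin/sh
# Fetch an archive from HPSS into a destination directory and unpack it.
retrieve_folder() {
    archive=$1   # archive name in HPSS, e.g. example.tar
    dest=$2      # local directory where the folder should end up
    mkdir -p "$dest" || return 1
    # Steps 1-4: download the archive unless it is already there.
    if [ ! -f "$dest/$archive" ]; then
        if command -v hsi >/dev/null 2>&1; then
            # Subshell keeps the caller's working directory unchanged.
            ( cd "$dest" && hsi get "$archive" ) || return 1
        else
            echo "hsi not found and $dest/$archive is missing" >&2
            return 1
        fi
    fi
    # Step 5: unpack the archive inside the destination directory.
    tar -xf "$dest/$archive" -C "$dest"
}
```

For example, retrieve_folder example.tar "$SCRATCH/restore" would leave the unpacked folder under $SCRATCH/restore/example.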

The hsi command can be used to log on to HPSS and check your folders. The HPSS interface is very similar to the bash interface and has similar commands. See below for a chart of common commands.

[Chart: HPSS_Commands (common HPSS commands; image not available)]

For any queries about NERSC, you can check out the document at

https://docs.google.com/document/d/12mzk4aLQKUEdnHcRbEoncfltpnMWIiExlIs6LbWB0vg/edit

II. Boxbackup (Armada)

Armada has a cloud-based storage system that we can use to store and back up large amounts of data. You can access this storage by logging on to Armada and going to /boxbackup/[username]. The Boxbackup system has 100 TB of storage in total, shared among users, so you should be aware of your personal usage, although space is far less tight than in the NERSC project directories. You can use the usual cp and scp commands to move files into and out of your Boxbackup directory. Data put into the directory is automatically backed up to a cloud system. Because of this, moving data into and out of the directory can be fairly slow, so it should be used only for data storage and not as a working directory. Boxbackup provides a more accessible and easier-to-use form of data storage than HPSS, but is somewhat smaller in size.
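
A copy into Boxbackup with a quick check that everything arrived might look like the sketch below. The function name backup_copy is hypothetical, and using $USER for [username] is an assumption about how your Armada account is named.

```shell
#!/bin/sh
# Copy a directory into a backup location and verify the copy.
backup_copy() {
    src=$1         # directory to back up, e.g. ~/results
    dest_root=$2   # backup root, e.g. /boxbackup/$USER (assumed path)
    mkdir -p "$dest_root" || return 1
    # -R copies recursively; -p preserves timestamps and permissions.
    cp -Rp "$src" "$dest_root"/ || return 1
    # diff -r exits non-zero if any file differs or is missing.
    diff -r "$src" "$dest_root/$(basename "$src")" >/dev/null
}
```

For example, backup_copy "$HOME/results" "/boxbackup/$USER". Because the directory is slow, the diff check rereads every file and can take a while on large data sets; drop it if that cost is too high.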

III. Box (Desktop Computers)

As members of the UC Berkeley community, we have free access to Box software, which is useful for backing up data on desktop computers. This software provides a directory on your desktop computer and automatically backs up any data in that directory to cloud storage. These files can then be accessed on the Box website or on other computers with the software installed. The system also protects against a hard drive failure on your desktop by allowing you to re-download the data from the website.

Setting up Box:

1) Log in to berkeley.app.box.com with your Berkeley ID and password.

2) Click your name in the upper right corner and select “Get Box Sync”.

3) Follow the download and installation instructions. You may need to re-enter your ID and password.

4) The software will create a directory called “Box Sync”.

5) Move or copy any files you’d like to back up into the Box Sync folder.

Unfortunately, you cannot designate any other folders as Box folders; only the Box Sync folder has the Box backup functionality. This means you either need to use it as a parent directory for all of your desktop work or periodically copy important files and folders over to it manually. If you choose the latter option, you should update your Box folder weekly. While this may seem inconvenient, it is a small price to pay for protecting your data. Box is also free through UC Berkeley and has unlimited storage space, so you should take advantage of it.
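
If you go the manual-copy route on a Linux or Mac desktop, the weekly update can be reduced to one command. The sketch below uses a hypothetical function sync_to_box. The location of the Box Sync folder (assumed here to be "$HOME/Box Sync") and the example source paths are assumptions, so adjust them to your machine.

```shell
#!/bin/sh
# Copy a list of files and folders into the Box Sync directory.
sync_to_box() {
    box_dir=$1   # e.g. "$HOME/Box Sync" (assumed location; note the space)
    shift
    mkdir -p "$box_dir" || return 1
    for src in "$@"; do
        if [ -e "$src" ]; then
            # -R copies folders recursively; -p preserves timestamps.
            cp -Rp "$src" "$box_dir"/ || return 1
        else
            echo "skipping missing path: $src" >&2
        fi
    done
}
```

For example, sync_to_box "$HOME/Box Sync" "$HOME/papers" "$HOME/notes.txt". This overwrites older copies in Box Sync but never deletes anything there, so files you remove from the source remain in the backup.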