Backup

Backups are an important instrument to ensure that data and related files can be restored in case of loss or damage. Among the most common causes of data loss are:

  • Hardware failure;
  • Software malfunction;
  • Malware or hacking;
  • Human error (research data accidentally gets deleted or overwritten or is lost in transport);
  • Theft, natural disaster or fire;
  • Degradation of storage media.

Creating a backup strategy in 10 steps

A backup strategy in one sentence would be: make at least three backup copies of the data on at least two different types of storage media, keep the storage devices in separate locations with at least one off-site, regularly check that the backups work, and make sure you know the process and follow it. The steps to create a backup strategy are outlined in more detail below.


Find out whether your institution has a backup strategy. If so, backups may automatically be taken care of for any files stored on institutional servers. However, it is necessary that you check if the backup strategy in place sufficiently meets your requirements.

The three common options for backups are:

  1. Full backup of the entire system and files;
  2. Differential backups, which record everything that has changed since the last full backup. To restore your data and/or system, you will need the last full backup and the last differential backup;
  3. Incremental backups, which record only the changes since the last backup of any kind. To restore your data and/or system, you will need the last full backup and the entire series of incremental backups made since then.

Differential and incremental backups are also called “intelligent” backups. If only a small percentage of your data changes on a daily basis, it's a waste of time and disk space to run a full backup every day.
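The difference between the three options can be sketched in a few lines of Python. This is a minimal illustration rather than a backup tool: it assumes that "changed" can be read from a file's modification time, and the function names and the `kind` labels are hypothetical.

```python
from pathlib import Path


def files_changed_since(root: Path, since: float) -> list[Path]:
    """Return files under root modified after `since` (seconds since the epoch)."""
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime > since
    )


def select_files(root: Path, kind: str, last_full: float, last_backup: float) -> list[Path]:
    """Choose which files a backup run would copy.

    kind        -- "full", "differential" or "incremental" (hypothetical labels)
    last_full   -- time of the last full backup
    last_backup -- time of the most recent backup of any kind
    """
    if kind == "full":
        # Everything, regardless of when it last changed.
        return sorted(p for p in root.rglob("*") if p.is_file())
    if kind == "differential":
        # Everything changed since the last FULL backup.
        return files_changed_since(root, last_full)
    if kind == "incremental":
        # Only what changed since the most recent backup, full or not.
        return files_changed_since(root, last_backup)
    raise ValueError(f"unknown backup kind: {kind}")
```

A differential run grows until the next full backup, whereas an incremental run stays small but makes a restore depend on every backup in the series.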

It is recommended that you make three backup copies. This greatly reduces the risk of data loss, even if one of the backups is damaged or lost. However, if storage capacity is an issue and/or if sensitive data is involved, it may be necessary to work with fewer copies.

You should clearly state in your backup strategy how often backups will be made. The frequency of backups will depend on the frequency and amount of changes to your data and documents.

We recommend that you store at least some of the backups in (physically) different places. For example, backing up to two servers standing in the same room or building may cause you to lose both backups in case of a fire. Having an offsite copy of your backup mitigates this risk.

Backups can be made to networked drives, cloud storage, and to local or portable devices (see 'Storage'). What works best for your project depends on the amount of data that needs to be backed up, the required frequency of backups, the level of automation, and the sensitivity of the data.

Estimate how much data and documentation you will collect and create in your project. Then determine the approximate amount of storage capacity needed for the corresponding backups. If your institution has an IT department, they will be able to help you with this.
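A rough estimate can be worked out with simple arithmetic. The sketch below assumes a scheme of periodic full backups with incremental backups in between. The function name, the parameters and the example figures are illustrative assumptions, not recommendations.

```python
def estimate_backup_storage_gb(
    data_gb: float,
    daily_change_fraction: float,
    full_backups_kept: int,
    incrementals_per_full: int,
    copies: int = 3,
) -> float:
    """Approximate storage needed for all backup copies, in GB.

    data_gb               -- size of the data and documentation to back up
    daily_change_fraction -- share of the data that changes between backups (0.02 = 2%)
    full_backups_kept     -- number of full backups retained at any time
    incrementals_per_full -- incremental backups made after each full backup
    copies                -- number of backup copies (three is the rule of thumb)
    """
    full_total = data_gb * full_backups_kept
    incremental_total = (
        data_gb * daily_change_fraction * incrementals_per_full * full_backups_kept
    )
    return copies * (full_total + incremental_total)
```

For example, 200 GB of data with 2% daily change, four retained weekly full backups, six daily incrementals per week and three copies comes to about 3 × (800 + 96) = 2,688 GB. Compression and deduplication, which many backup tools offer, would lower this figure.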

Automating backups helps to ensure that backups are created at the correct time and saved to the correct location, which reduces the risk of human error. Both Microsoft and Apple operating systems include software that supports automatic backups, and cloud storage solutions often have backup functionality too. However, make sure to check frequently that functional backups were indeed created.

  • Windows 10
    Windows 10 includes two different backup programs:
    • File History
      The File History tool automatically saves multiple versions of a given file, so you can “go back in time” and restore a file before it was changed or deleted. That’s useful for files that change frequently.
    • Windows Backup and Restore
      The Backup and Restore tool creates a single backup of the latest version of your files on a schedule.

Of course, you would still need an off-site backup as well.

It is generally recommended that you do not overwrite one backup with another. However, if you have to back up large amounts of data frequently it may not be feasible to retain all backups for the entire duration of the project.
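One way to avoid overwriting earlier backups is to write every backup to a new, timestamped folder and to remove only the oldest ones when space runs out. The following Python sketch illustrates the idea under simple assumptions: the function names and the folder naming pattern are hypothetical, and the script would still need to be scheduled by the operating system.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def make_snapshot(source: Path, backup_root: Path, now: Optional[datetime] = None) -> Path:
    """Copy `source` into a new timestamped folder below `backup_root`."""
    now = now or datetime.now()
    target = backup_root / f"{source.name}_{now:%Y%m%d-%H%M%S}"
    # copytree raises FileExistsError if the target already exists,
    # so an earlier backup is never silently overwritten.
    shutil.copytree(source, target)
    return target


def prune_snapshots(backup_root: Path, prefix: str, keep: int) -> list[Path]:
    """Delete the oldest snapshots of `prefix`, keeping the `keep` most recent."""
    if keep < 1:
        raise ValueError("keep at least one snapshot")
    # The timestamp format sorts chronologically when sorted by name.
    snapshots = sorted(
        p for p in backup_root.iterdir()
        if p.is_dir() and p.name.startswith(prefix + "_")
    )
    to_delete = snapshots[:-keep]
    for snapshot in to_delete:
        shutil.rmtree(snapshot)
    return to_delete
```

Note that `shutil.rmtree` only removes the folder in the ordinary way. It is not a secure deletion method for sensitive data.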

If sensitive data is involved, make sure that any deleted data is truly gone and cannot be recovered in any way. For suitable procedures, see 'Security'.

Make sure that backups of data containing sensitive information are protected against unauthorised access in the same manner as the original files. For suitable measures, see the chapter on Security.

A disaster recovery plan defines the steps to take if data loss occurs and thus helps you to restore data as quickly as possible. The plan should also assign responsibilities for data recovery tasks and list the persons (or functions) to contact when data loss occurs.

To ensure that data recovery will run as smoothly as possible in the event of an actual data loss, make sure to regularly test whether restoring lost files from your backups is actually possible.
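A restore test can be partly automated: restore a backup to a scratch folder and compare the result with the files it is supposed to contain. The sketch below is one possible way to do the comparison in Python. The function names are hypothetical, and it assumes the reference files are still available at the time of the test.

```python
import filecmp
from pathlib import Path


def relative_files(root: Path) -> set[Path]:
    """All files below root, as paths relative to root."""
    return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}


def restore_problems(original: Path, restored: Path) -> list[str]:
    """List the differences between a reference folder and a restored copy."""
    # filecmp caches earlier comparisons; clear the cache so every test is fresh.
    filecmp.clear_cache()
    in_original = relative_files(original)
    in_restored = relative_files(restored)
    problems = [f"missing: {p}" for p in sorted(in_original - in_restored)]
    problems += [f"unexpected: {p}" for p in sorted(in_restored - in_original)]
    for p in sorted(in_original & in_restored):
        # shallow=False compares file contents, not just size and timestamp.
        if not filecmp.cmp(original / p, restored / p, shallow=False):
            problems.append(f"differs: {p}")
    return problems
```

An empty list means that the restored folder contains the same files with the same contents as the reference folder.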

Never assume that someone else will take care of backups and data recovery. Assign responsibilities for making manual backups, for checking that automatic backups actually happened, for testing data recovery, and for restoring any lost data.

Errors can happen when backups are written or copied. We recommend that you frequently check the integrity of your backed-up files. This can be done with so-called checksum tools such as MD5summer or Checksum Checker.

The UK Data Service (UKDS) compares checksums to digital fingerprints. Checksum tools use an algorithm to compute such a fingerprint - a string of numbers - from the bit values (the ones and the zeros) of a file. Monitoring whether the fingerprint of a given file changes allows you to detect whether the file was changed in any way, intentionally or unintentionally.

Video tutorial on using MD5summer: https://www.youtube.com/watch?v=VcBfkB6N7-k
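If you prefer a script to a dedicated tool, checksums can also be computed with Python's standard `hashlib` module. The sketch below records a fingerprint for every file in a folder and later reports files whose fingerprint has changed. The function names are hypothetical, and SHA-256 is used here as an assumption. The tools named above work with MD5, which `hashlib` also supports.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1024 * 1024) -> str:
    """Compute the checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file below root (as a relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return files from the manifest that are missing or no longer match."""
    changed = []
    for relative_path, expected in sorted(manifest.items()):
        path = root / relative_path
        if not path.is_file() or file_checksum(path) != expected:
            changed.append(relative_path)
    return changed
```

In practice the manifest would be saved alongside the backup, for example as a text file, and checked again each time the backup is verified.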

Case studies

In the following, two scenarios illustrate the importance of backups and highlight some of the things to consider when planning a backup strategy. After reading each scenario, take a few minutes to think about what could have been done to prevent data loss before reading our diagnosis.

[Image: tweet about a lost backpack (Penson, 2017)]

On a night out after work*, a friend lost their backpack, which contained all of the data and documents for their Master's thesis. A fairly recent copy of the thesis text was backed up in Dropbox, but the only two copies of the data - video recordings and transcripts of interviews with primary school teachers in rural areas of Ireland - were on the laptop (transcripts and sequences from the videos) and the external hard drive (original, unanonymised videos and backed-up files from the laptop). Both were lost with the backpack.

  • The thesis text was backed up to the external hard drive and to the cloud;
  • Transcripts and video sequences from the laptop were backed up to the external hard drive but not to Dropbox because they contained sensitive information;
  • No backup of the video footage existed. The entire footage was on the external hard drive in unencrypted form.

1. Keep backups in different locations

One thing the scenario illustrates is that, when it comes to backups, you should never put all your eggs in one basket. No matter how many backups you have - if all of them are in the same place, the risk of losing everything is considerable. For storage, consider the advantages and disadvantages of different storage solutions and storage media (see 'Storage').
A rule of thumb is to keep three backups, at least one of them in a different location from the others, on different types of storage media. However, sometimes considerations of privacy or storage capacity will require you to deviate from this recommendation.

2. Use encryption to protect research participants’ privacy

In the scenario, the lost hard drive contained personal data of participants in the research. The loss therefore compromises the privacy of the individuals involved. Whenever personal data is stored and processed for research, backup measures have to be linked with data protection measures. Personal data should be encrypted and anonymised as quickly and comprehensively as the research objective permits. You should also create only as many copies of this data as absolutely required. Note that this may involve diverging from the “three copies” rule mentioned above.

* The tweet (Penson, 2017) in the image dates from 7 July 2017. Although the tweet is real, the scenario about the contents of the backpack is fictional and based on a scenario in a blog post by Peter Murray-Rust (2011).


A group of researchers collaboratively works on quantitative survey data. They use a shared working space on a networked drive where a master copy of the data and a working copy are stored. Two backups exist, stored separately from the working and master files: one copy on an external hard drive and one in the university’s own Cloud system.

A new researcher joins the project. They are not aware of the way files are named and organised and accidentally work on the master copy of the data. In the process, a number of variables get overwritten when the new team member recodes variable values and forgets to save them into new variables. Fortunately, two backups exist.

The researchers know that sometimes copies can get changed due to write or transmission errors, so they decide to check with a checksum tool if the two copies are identical. They discover that the checksums for the two files are not identical. This means that either one or both of the files were altered in some way.

  • The master copy is kept as a separate file from working files;
  • Two backup copies on different media and in different locations exist;
  • No frequent integrity checks of the backups were made and no additional protection for the master copy of the data was in place to prevent it from being overwritten.

1. Versioning and file naming rules

Errors such as accidentally overwriting a file can always happen, but they are less likely to occur if clear rules for versioning and file naming are in place and if folders are clearly labeled. Such policies and guidelines help to avoid confusion about what files contain and where they should be saved. See 'Data authenticity, versions and editions'.

2. Restricting access to important files

As mentioned above, human error is one of the most common causes of data loss. Therefore, consider restricting access to important files, for example with the help of passwords or by using systems that manage read and write access rights. The fewer people who can modify important files, the lower the risk of data loss caused by human error.
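On a shared drive, a simple additional safeguard is to mark the master copy as read-only so that it cannot be overwritten by accident. The Python sketch below removes the write permission from a file. The function name is hypothetical, and the approach is an assumption about your environment: on Windows only the read-only flag is affected, and on networked storage the access rights set by the server administrator take precedence.

```python
import os
import stat
from pathlib import Path


def make_read_only(path: Path) -> None:
    """Remove write permission for owner, group and others from a file."""
    current_mode = os.stat(path).st_mode
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, current_mode & ~write_bits)
```

This guards against accidental edits only. Anyone with sufficient rights can make the file writable again, so it does not replace proper access management.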

3. Creating three backup copies rather than only two

If three copies rather than only two had been created, the chances of identifying the unaltered copy would have been higher: if two out of three copies are identical, this suggests that these two are unharmed. This would have saved the project the laborious work of trying to identify the correct copy.
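The "two out of three" reasoning can be expressed as a small majority vote over checksums. The Python sketch below is illustrative only: the function names are hypothetical, SHA-256 is an assumed choice of algorithm, and agreement between copies suggests, but does not prove, that they are unaltered.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def majority_copies(copies: list[Path]) -> list[Path]:
    """Return the copies that share the most common checksum.

    An empty list is returned when no checksum is shared by more than
    half of the copies, i.e. when the vote is inconclusive.
    """
    checksums = {path: sha256_of(path) for path in copies}
    most_common_checksum, count = Counter(checksums.values()).most_common(1)[0]
    if count <= len(copies) / 2:
        return []
    return [path for path, checksum in checksums.items() if checksum == most_common_checksum]
```

With only two copies that differ, as in the scenario, the function returns an empty list: there is no way to tell from the checksums alone which copy is the correct one.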

4. Checking the integrity of files

Errors can happen when backups are written or copied. These can sometimes make a copy entirely unusable, but sometimes they are small enough to go unnoticed at first and only cause problems further down the line. You could lose access to the data entirely - for example because software can suddenly no longer render the files - or the data could contain errors that negatively affect the results of your research. Learn more about integrity checks in the video about performing a checksum check for your files (UK Data Service, 2016a).