Processing and analysis of data inevitably result in a number of edits in the data file. However, it is necessary to preserve the authenticity of the original research information contained in the data throughout the whole data lifecycle.
There are many possible types of changes in the data:
- Data cleaning procedures may be implemented;
- Errors are often found and corrected;
- New variables may be constructed;
- New information may be added from external sources;
- File formats may be changed;
- New data may be included;
- The data file structure may be changed for the purpose of increasing operability, etc.
As a result of the above-mentioned data management processes, several different versions of the data file are usually created. They are important, as they allow you to step back to versions before particular changes were made. Versions may be used simultaneously for different purposes or replace one another. When data files are published to make them widely available, the treatment of errors, the inclusion of new data and/or changes in the data file structure may also result in the publication of new editions of the same data file, which may differ substantially in their content (e.g. when new country data are included in an international data file).
Best practices for quality assurance, version control and authenticity
Version and edition management will help to:
- Clearly distinguish between individual versions and editions and keep track of their differences;
- Prevent unauthorised modification of files and loss of information, thereby preserving data authenticity.
The best practice rules (UK Data Service, 2017a; Krejčí, 2014) may be summarised as follows:
- Establish the terms and conditions of data use and make them known to team members and other users;
- Create a ‘master file’ and take measures to preserve its authenticity, i.e. place it in an adequate location and define access rights and responsibilities – who is authorised to make what kind of changes;
- Distinguish between versions shared by researchers and working versions of individuals;
- Decide how many versions of a file to keep, which versions to keep (e.g. major versions rather than minor versions (keep version 02-00 but not 02-01)), for how long and how to organise versions;
- Introduce clear and systematic naming of data file versions and editions;
- Record relationships between items where needed, for example between code and the data file it is run against, between data file and related documentation or metadata or between multiple files;
- Document which changes were made in any version;
- Keep original versions of data files, or keep documentation that allows the reconstruction of original files;
- Track the location of files if they are stored in a variety of locations;
- Regularly synchronise files stored in different locations, for example with MS SyncToy (2016).
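One way to support the authenticity of a master file in practice is to record a checksum when the file is created and compare against it later. The following is a minimal sketch, not part of the cited guidance: the function names `sha256_of_file` and `is_unchanged` are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Return True if the file still matches the digest recorded earlier."""
    return sha256_of_file(path) == recorded_digest
```

The recorded digest would typically be stored alongside the documentation of the master file. A mismatch only shows that the file has changed, not who changed it or why, so it complements access rights rather than replacing them.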
Version control can be done through:
- Uniquely identifying different versions of files using a systematic naming convention, such as version numbers or dates (date format should be YYYY-MM-DD, see 'File naming'):
  - Record the date within the file name, for example, 20010911_Video_Twintowers;
  - Include the version number in the file name, for example, HealthTest-00-02 or HealthTest_v2;
  - Do not use ambiguous descriptions for the version you are working on. Who will know whether MyThesisFinal.doc, MyThesisLastOne.doc or another file is really the final version?
- Using version control facilities within the software you use;
- Using versioning software like Subversion (2017);
- Designing and using a version control table. In all cases, a file history table should be included within the file, in which you keep track of versions and the details of the changes that were made. An example is provided by the UK Data Service (2017c).
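The naming convention and the version control table described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: it assumes the `Name-MM-mm` pattern of the HealthTest-00-02 example (major number first, then minor), all function names are hypothetical, and for simplicity it writes the history table to a separate CSV file rather than inside the data file itself.

```python
import csv
import re
from datetime import date

# Assumed pattern: <stem>-<major>-<minor>, two digits each (e.g. HealthTest-00-02).
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)-(?P<major>\d{2})-(?P<minor>\d{2})$")


def versioned_name(stem, major, minor):
    """Build a file name stem such as 'HealthTest-00-02'."""
    return f"{stem}-{major:02d}-{minor:02d}"


def parse_version(name):
    """Split a versioned name into (stem, major, minor)."""
    match = VERSION_PATTERN.match(name)
    if match is None:
        raise ValueError(f"no version number found in {name!r}")
    return match["stem"], int(match["major"]), int(match["minor"])


def next_version(name, major_change=False):
    """Return the name of the following minor or major version."""
    stem, major, minor = parse_version(name)
    if major_change:
        return versioned_name(stem, major + 1, 0)
    return versioned_name(stem, major, minor + 1)


def append_history(table_path, name, author, changes, when=None):
    """Append one row (version, date, author, changes) to a history table."""
    when = when or date.today()
    with open(table_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow([name, when.isoformat(), author, changes])
```

For example, `next_version("HealthTest-00-02")` gives `HealthTest-00-03`, and with `major_change=True` it gives `HealthTest-01-00`. The four columns of the history table are an assumed minimum and can be extended to suit the project.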
Versioning new data types
Generally, the goal of version management is to enable reproducibility and support trustworthiness by allowing all transformations of the data to be traced. However, difficulties arise when versioning “new data”, as these data are updated more frequently than “traditional data”, or even continuously. A good example is a collection of Tweets (e.g. for a certain hashtag), as individual posts may be modified or deleted. Because the contents of such data change continuously, an archive that is expected to reflect those changes (e.g. by deleting posts from the data set once they have been deleted from the platform) ends up with an ever-increasing number of versions. Consequently, it is necessary to develop a systematic plan for creating and naming new versions of constantly changing datasets, or to find new solutions for streaming data.
Both researchers and repositories can learn from fields where versioning of dynamic data is already established, such as software development. The most common version control software in software development is Git. Some established repositories, such as Zenodo, Figshare and the Open Science Framework, now offer integration with GitHub, so that every version of a data set in those repositories can be recorded through it. A new project called Dolt is developing version control specifically for data, which is particularly interesting for dynamic data sets such as social media data.
To identify the exact version of a dataset as it was used in a specific project or publication, the Research Data Alliance (RDA) suggests that every dataset is versioned, timestamped, and assigned a persistent identifier (PID). In the case of Big Data, however, the RDA warns against excessive versioning: “In large data scenarios, storing all revisions of each record might not be a valid approach. Therefore in our framework, we define a record to be relevant in terms of reproducibility, if and only if it has been accessed and used in a data set. Thus, high-frequency updates that were not ever read might go - from a data citation perspective - unversioned.”
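The idea of versioned and timestamped records can be illustrated with a small sketch. It is not the RDA framework itself: the `RecordVersion` class, the `snapshot` and `fingerprint` functions and the validity-interval design are assumptions made for illustration. Assigning a real PID is left out, as that would be done by a repository or registration service.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RecordVersion:
    """One version of a record, valid during [valid_from, valid_to)."""
    record_id: str
    content: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still current"


def snapshot(versions, as_of):
    """Return the record versions that were valid at the given time."""
    rows = [
        v for v in versions
        if v.valid_from <= as_of and (v.valid_to is None or as_of < v.valid_to)
    ]
    # Sort so that the same query always yields the same order.
    return sorted(rows, key=lambda v: v.record_id)


def fingerprint(rows):
    """Hash a snapshot so a later re-run of the same query can be checked."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(f"{row.record_id}\x1f{row.content}\n".encode("utf-8"))
    return digest.hexdigest()
```

In this design an edited post closes the previous version and opens a new one, and a deleted post simply has its last version closed. A publication could then cite the query time together with the fingerprint, so that the data set as used can be reconstructed and verified later.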