Data authenticity

Processing and analysis of data inevitably result in a number of edits in the data file. However, it is necessary to preserve the authenticity of the original research information contained in the data throughout the whole data lifecycle.

There are many possible types of changes in the data:

Data cleaning procedures may be implemented;
Errors are often found and corrected;
New variables may be constructed;
New information may be added from external sources;
File formats may be changed;
New data may be included;
The data file structure may be changed for the purpose of increasing operability, etc.

As a result of the above-mentioned data management processes, several different versions of the data file are usually created. They are important, as they allow you to step back to versions before particular changes were made. Versions may be used simultaneously for different purposes or may replace one another. When data files are published to make them widely available, the treatment of errors, the inclusion of new data and/or changes in the data file structure may also result in the publication of new editions of the same data file, which may differ substantially in their content (e.g. when new country data are included in an international data file).

Best practices for quality assurance, version control and authenticity

Version and edition management will help to:

Clearly distinguish between individual versions and editions and keep track of their differences;
Prevent unauthorised modification of files and loss of information, thereby preserving data authenticity.

Best practices
The best practice rules (UK Data Service, 2017a; Krejčí, 2014) may be summarised as follows:

Establish the terms and conditions of data use and make them known to team members and other users;
Create a ‘master file’ and take measures to preserve its authenticity, i.e. place it in an adequate location and define access rights and responsibilities – who is authorised to make what kind of changes;
Distinguish between versions shared by researchers and working versions of individuals;
Decide how many versions of a file to keep, which versions to keep (e.g. major versions rather than minor versions, so keep version 02-00 but not 02-01), for how long and how to organise versions;
Introduce clear and systematic naming of data file versions and editions;
Record relationships between items where needed, for example between code and the data file it is run against, between data file and related documentation or metadata or between multiple files;
Document which changes were made in any version;
Keep original versions of data files, or keep documentation that allows the reconstruction of original files;
Track the location of files if they are stored in a variety of locations;
Regularly synchronise files in different locations, for example with MS SyncToy (2016).
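
One possible measure for preserving the authenticity of a master file is to record a checksum when the file is frozen and to compare it later. The sketch below is an illustration only: the function names `sha256_of_file` and `is_unchanged` are hypothetical, and the choice of SHA-256 is an assumption, not something prescribed by the sources cited above.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of_file(path) == recorded_checksum
```

The checksum returned by `sha256_of_file` would be stored alongside the documentation of the master file. A later call to `is_unchanged` that returns False signals that the file was modified after the checksum was recorded.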

Version control
Version control can be done through:

Uniquely identifying different versions of files using a systematic naming convention, such as using version numbers or dates (date format should be YYYY-MM-DD, see 'File naming');
- Record the date within the file name, for example, 20010911_Video_Twintowers;
- Process the version numbering into the file name, for example, HealthTest-00-02 or HealthTest_v2;
- Do not use ambiguous descriptions for the version you are working on. Who will know whether MyThesisFinal.doc, MyThesisLastOne.doc or another file is really the final version?
Using version control facilities within the software you use;
Using versioning software like Subversion (2017);
Using file-sharing services with incorporated version control (but remember that commercial cloud services such as the Google cloud platform, Dropbox or iCloud come with specific rules set by the provider. Private companies have their own terms of use, which apply, for example, to copyright);
Designing and using a version control table. In all cases, a file history table should be included within the file itself, so that versions and the details of the changes made can be tracked. The example below is taken from the UK Data Service (n.d.).
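
The naming convention from the first item above can be sketched in a few lines of code. This is a minimal illustration that assumes two-digit major and minor numbers joined by hyphens, as in HealthTest-00-02. All function names are hypothetical.

```python
import re
from datetime import date

# Matches names such as "HealthTest-00-02": stem, major version, minor version.
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)-(?P<major>\d{2})-(?P<minor>\d{2})$")


def versioned_name(stem, major, minor):
    """Build a file name (without extension) with zero-padded version numbers."""
    return f"{stem}-{major:02d}-{minor:02d}"


def parse_versioned_name(name):
    """Split a versioned name into its stem, major number and minor number."""
    match = VERSION_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a versioned file name: {name!r}")
    return match["stem"], int(match["major"]), int(match["minor"])


def next_minor(name):
    """Return the name of the next minor version."""
    stem, major, minor = parse_versioned_name(name)
    return versioned_name(stem, major, minor + 1)


def next_major(name):
    """Return the name of the next major version, resetting the minor number."""
    stem, major, _ = parse_versioned_name(name)
    return versioned_name(stem, major + 1, 0)


def dated_name(stem, day):
    """Prefix a name with a date in YYYY-MM-DD format."""
    return f"{day.isoformat()}_{stem}"
```

For example, `next_minor("HealthTest-00-02")` gives `HealthTest-00-03`, and `dated_name("Video_Twintowers", date(2001, 9, 11))` gives `2001-09-11_Video_Twintowers`. Ambiguous names such as MyThesisFinal are rejected by `parse_versioned_name` with a `ValueError`.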

Example of a version control table

Title:          Vision screening tests in Essex nurseries
File Name:      VisionScreenResults_00_05
Description:    Results data of 120 Vision Screen Tests carried out in 5 nurseries in Essex during June 2007
Created By:     Chris Wilkinson
Maintained By:  Sally Watsley
Created:        04/07/2007
Last Modified:  25/11/2007
Based on:       VisionScreenDatabaseDesign_02_00

Version  Responsible    Notes                                              Last amended
00_05    Sally Watsley  Version 00_03 and 00_04 compared and merged by SW  25/11/2007
00_04    Vani Yussu     Entries checked by VY, independent from SK         17/10/2007
00_03    Steve Knight   Entries checked by SK                              29/07/2007
00_02    Karin Mills    Test results 81-120 entered                        05/07/2007
00_01    Karin Mills    Test results 1-80 entered                          04/07/2007
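
A version control table like the one above can also be kept in a machine-readable form. The following is a minimal sketch and not a UK Data Service tool: the class `VersionEntry` and the functions `add_entry` and `render` are hypothetical names, and dates are written as YYYY-MM-DD here rather than in the DD/MM/YYYY format of the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VersionEntry:
    """One row of a version control table."""
    version: str
    responsible: str
    notes: str
    last_amended: date


def add_entry(history, entry):
    """Return a new history with the entry added, newest version first."""
    if any(existing.version == entry.version for existing in history):
        raise ValueError(f"version {entry.version} is already recorded")
    # Zero-padded version strings such as "00_05" sort correctly as text.
    return sorted([*history, entry], key=lambda e: e.version, reverse=True)


def render(history):
    """Render the history as tab-separated text with a header row."""
    lines = ["Version\tResponsible\tNotes\tLast amended"]
    for entry in history:
        lines.append("\t".join([
            entry.version,
            entry.responsible,
            entry.notes,
            entry.last_amended.isoformat(),
        ]))
    return "\n".join(lines)
```

Refusing duplicate version numbers in `add_entry` is a design choice made for this sketch: it means an existing row can never be silently overwritten, so the table remains a complete record of changes.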

Versioning new data types

Generally, the goal of version management is to enable reproducibility and support trustworthiness by allowing all transformations of the data to be traced. Difficulties emerge, however, in the versioning of “new data”, because these data are updated more frequently than “traditional data”, or even continuously. A good example is a collection of Tweets (e.g. for a certain hashtag), in which individual posts may be modified or deleted. If the archived data are expected to reflect such changes (e.g. by removing posts from the data set once they have been deleted from the platform), the number of versions keeps growing. Consequently, it is necessary to develop a systematic plan for creating and naming new versions of constantly changing datasets, or to find new solutions for streaming data.

Both researchers and repositories can learn from fields where versioning of dynamic data is already established, such as software development. The most common version control software in software development is Git. Some of the established repositories, such as Zenodo, Figshare and the Open Science Framework, now offer integration with GitHub, so that every version of the data sets in those repositories can be recorded through it. A new project called Dolt is developing version control specifically for data, which is particularly interesting for dynamic data sets such as social media data.

To identify the exact version of a dataset as it was used in a specific project or publication, the Research Data Alliance (RDA) suggests that every dataset be versioned, timestamped, and assigned a persistent identifier (PID). In the case of Big Data, however, the RDA warns against excessive versioning: “In large data scenarios, storing all revisions of each record might not be a valid approach. Therefore in our framework, we define a record to be relevant in terms of reproducibility, if and only if it has been accessed and used in a data set. Thus, high-frequency updates that were not ever read might go - from a data citation perspective - unversioned.”
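
The idea in the quoted passage can be illustrated with a small in-memory store that keeps a timestamped revision only if the previous value was actually read. This is one possible reading of the passage and not the RDA's own implementation. The class `ReadAwareStore` and its methods are hypothetical.

```python
class ReadAwareStore:
    """Keep timestamped revisions, but only those that were read at least once."""

    def __init__(self):
        self._history = {}            # key -> list of (timestamp, value)
        self._read_since_write = {}   # key -> whether the latest value was read

    def write(self, key, value, timestamp):
        history = self._history.setdefault(key, [])
        if history and not self._read_since_write.get(key, False):
            # The latest revision was never read, so it is overwritten
            # instead of being kept as a separate version.
            history[-1] = (timestamp, value)
        else:
            history.append((timestamp, value))
        self._read_since_write[key] = False

    def read(self, key):
        """Return the latest value and mark it as used."""
        self._read_since_write[key] = True
        return self._history[key][-1][1]

    def revisions(self, key):
        """Return the retained (timestamp, value) pairs, oldest first."""
        return list(self._history[key])
```

With this store, two writes in a row without a read in between leave a single revision, while a write that follows a read adds a new one. In a real repository the retained revisions would additionally need persistent identifiers, which this sketch leaves out.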
