File formats and data conversion

We use software for creating text documents, websites, databases, photos, 3D models, and movies. Software developers regularly release new versions of their products, and there is no guarantee that a new version can open files created with earlier versions (backward compatibility). Some software packages even disappear from the scene completely. Converting between file formats may be costly or may result in loss of information or a reduction of data quality. This is exactly why the choice of file formats should be planned carefully.

Short-term data processing: file formats for operability

File format choice depends on your research phase. Choices for short-term data processing may differ from the choices you make for long-term data preservation.

For reasons of short-term operability, it is advisable to choose a file format that is associated with the specific software you intend to use for data analysis. Following discipline-specific standards and customs is generally the way to go. However, you should consider how widespread these standards are and to what extent they allow data processing by people outside your own discipline.

Proprietary file formats are owned and copyrighted by a specific company. Their specifications are usually not publicly available, and their future development depends on the decisions and circumstances of their owner. Thus, the risk of obsolescence is high. However, some proprietary formats, such as Rich Text Format (*.rtf), MP3, MPEG, JPG, MS Excel (*.xls), SPSS (*.sav, *.por) and STATA (*.dta), are widely used and you may assume that they will be useful for a reasonable time.

In the table, we give an overview of the data analysis packages/file formats which are used most and which are suitable for short-term data processing.

Quantitative (statistical) data analysis packages

MS Excel (*.xls), SPSS (*.sav, *.por), R and STATA (*.dta) are widely used and you may assume that they will be useful for a reasonable time.

Some software also provides so-called portable formats, which allow easy transfer of data between different versions of the same software, often including versions for different platforms (MS Windows, Mac, Linux...). For example, SPSS system files with the *.sav extension and SAS files with the *.sd7 extension (SAS Version 7 or 8 data file) are tied to a specific version of the SPSS or SAS software. Instead, you may use “portable” SPSS files with the *.por extension or “transport” SAS files, which are compatible with different versions of the software running on different platforms.

Qualitative data analysis packages

Qualitative research data such as transcribed interviews or focus group sessions, audio recordings, still images, photographs, ethnographic diaries and various types of written texts are usually stored in one of the following formats: *.docx, *.rtf, *.pdf, *.mp3, *.wav, *.jpeg and many others.

For the purposes of qualitative data analysis (QDA), textual data may be analysed in special QDA software packages such as NVivo, ATLAS.ti, and MAXQDA. In such packages researchers can code their textual data, i.e. mark the parts of the text that relate to the same concept, build a structure of concepts, etc. In the process of coding, a “coding tree” emerges alongside other pieces of information, for example, notes and memos. Common QDA packages have export facilities that enable a whole 'project' consisting of the raw data, coding tree, coded data (also see 'Coding qualitative data'), and associated memos and notes to be saved.

Long-term data preservation: file formats for the future

Standard, open and widespread formats are advisable for long-term storage as they typically undergo fewer changes. Unlike proprietary formats (see above), open formats have publicly available specifications. Some of them are standardised and maintained by a standards organisation, so we may assume that they will remain readable in the future. Examples of open formats are PDF/A, CSV, TIFF, ASCII, Open Document Format (ODF), XML, Office Open XML, JPEG 2000, PNG, SVG, HTML, XHTML, RSS, CSS, etc.

Quantitative data preservation

Long-term preservation of quantitative data is typically best served by simple text (ASCII) formats accompanied by a structured documentation file with information about the variables included, their position in the file, formats, variable labels, value labels, etc.
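As an illustration only, the pairing of a plain-text data file with a structured documentation file might be sketched in Python as follows. The variable names, the missing-value code -9 and the use of JSON for the documentation are assumptions made for this example; an archive may require a different structure (for instance DDI).

```python
import csv
import io
import json

# Hypothetical survey records; names and codes are invented for illustration.
rows = [
    {"id": 1, "age": 34, "employed": 1},
    {"id": 2, "age": -9, "employed": 2},  # -9 stands for "no answer"
]

# Documentation kept next to the data: variable labels, types,
# missing-value codes and value labels.
codebook = {
    "id": {"label": "Respondent number", "type": "integer"},
    "age": {"label": "Age in years", "type": "integer", "missing": [-9]},
    "employed": {
        "label": "Currently employed",
        "type": "integer",
        "values": {"1": "yes", "2": "no"},
    },
}


def to_csv_text(records, fieldnames):
    """Serialise records as comma-separated text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


data_text = to_csv_text(rows, list(codebook))
codebook_text = json.dumps(codebook, indent=2, ensure_ascii=False)

# In practice both texts would be written to two separate files.
print(data_text)
print(codebook_text)
```

Without the documentation file, a later reader of the data file could not tell that -9 is a missing-value code rather than an age.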

In terms of location of variables in the file, we distinguish between fixed and free formats.

  • Fixed format: variables are arranged in columns and their exact positions, i.e. the start and end of each variable, are known.
  • Free format: the data for each variable are separated by blanks or by specific characters, e.g. a tab or a dash. If the separating character also occurs within an item, that item needs to be marked off from the surrounding text (as a rule, by quotation marks).

Several file extensions are in use for simple text formats: *.txt, *.dat and *.asc are used for both fixed and free formats, while *.csv (comma-separated values) is used for free format.
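The difference between the two layouts can be sketched in a few lines of Python. The column positions in LAYOUT and the sample record are hypothetical; in a real dataset the positions come from the documentation file.

```python
import csv
import io

# Fixed format: the positions of each variable are taken from the
# documentation (hypothetical layout: start and end index per variable).
LAYOUT = {"name": (0, 10), "age": (10, 13), "city": (13, 23)}


def parse_fixed(line, layout):
    """Cut each variable out of the line by its start and end position."""
    return {var: line[start:end].strip() for var, (start, end) in layout.items()}


def parse_free(text, delimiter=","):
    """Split delimited text; quoted items may contain the delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Build a fixed-width record: 10 characters, then 3, then 10.
fixed_line = f"{'Novak':<10}{42:>3}{'Brno':<10}"

# The comma inside the last item is protected by quotation marks.
free_text = 'Novak,42,"Brno, Czech Republic"\n'

print(parse_fixed(fixed_line, LAYOUT))
print(parse_free(free_text))
```

The fixed-format reader cannot work without the layout, whereas the free-format reader only needs to know the separating character and the quoting rule.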

Qualitative data preservation

Qualitative data analysis software packages such as NVivo, ATLAS.ti, and MAXQDA have export facilities that enable a whole 'project' consisting of the raw data, coding tree, coded data, and associated memos and notes to be saved. For archiving such data, the raw data, the final coding tree, and any useful memos should be exported (UK Data Service, 2017).

Digital versions of documents are usually kept in the PDF/A format. This is an official archiving version of the PDF format as defined by the ISO 19005-1:2005 standard. It guarantees independence from the platform and includes all display information (including fonts, colours, etc.). The XML format is a widespread standard for metadata. Structured textual documentation should, again, be saved in a simple text format, with tags and in line with a standard structure (e.g., DDI).

For audio files, the recommended long-term format is WAV. For video files, the recommended formats are MXF (Material eXchange Format) and JPEG 2000 (Fleischhauer, 2010).

A very useful tool for finding an appropriate format for different types of data is the table of Recommended file formats provided by the UK Data Service (2017b).

Data conversion and possible data loss

Data files, depending on the nature of the data, are based on text encoding, binary encoding, or both. Binary-encoded information can be read only by specialised software, whereas text information is universal and can be read by a wide range of software, including text editors.
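A small Python sketch may make the distinction concrete. It stores the same number once as text and once as binary; the value and the chosen binary layout (an 8-byte little-endian floating-point number) are arbitrary choices for the example.

```python
import struct

value = 1234.5

# Text encoding: the digits themselves are stored, readable in any editor.
as_text = str(value).encode("ascii")

# Binary encoding: 8 bytes that only make sense if the layout is known.
as_binary = struct.pack("<d", value)

# Reading the text needs nothing beyond the character encoding.
from_text = float(as_text.decode("ascii"))

# Reading the binary needs the exact layout ("<d").
from_binary = struct.unpack("<d", as_binary)[0]

# The same bytes read with a wrong layout (64-bit integer) give another number.
wrong_guess = struct.unpack("<q", as_binary)[0]

print(as_text, from_text)
print(as_binary.hex(), from_binary, wrong_guess)
```

This is why binary formats depend on software that knows their layout, while text formats remain readable even when the original program is gone.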

It is advisable to store your data for use in the future, which means converting them from the current data format to a long-term preservation format. Most software applications offer export or exchange formats that allow a text-formatted file to be created for importing into another program. A typical example is Microsoft Excel, which, through the 'Save As' command, can save spreadsheet data in comma-delimited format (comma-separated values, *.csv). The structure of the rows and columns is preserved through commas and line breaks. However, multiple worksheets must be saved as separate *.csv files, and any text formatting or macros in the native format will be lost on conversion.

During the process of data conversion, important pieces of information may be lost:

  • In the conversion of a statistical dataset (e.g. survey data), parts of the dataset may be lost, as may missing data definitions and decimal places; data types may change (e.g., numerical into string) and values may be truncated;
  • In the case of texts, e.g. transcriptions of speech, formatting such as highlighting, bold text, headers and footers may be lost;
  • In the case of images, resolution may be reduced and layers or colours may be lost;
  • In the case of audiovisual data, conversion may reduce sound quality;
  • Some file formats are constructed specifically to save space. However, this is done by a reduction of information and data quality. For example, .jpg removes details from images, while .tiff retains the full information. Similarly, .mp3 is a lossy format for audio data, while .wav keeps detailed information.

For this reason, the conversion itself should be done by a researcher who is familiar with the data and can therefore check for undesirable changes introduced by the conversion.
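Part of such a check can be automated. The following Python sketch, under the assumption that the data fit in memory as a list of records, exports hypothetical records to comma-separated text, reads them back and reports every value whose content or data type changed. The function names and sample values are invented for this example.

```python
import csv
import io


def export_csv(records, fieldnames):
    """Write records as comma-separated text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def import_csv(text):
    """Read comma-separated text back; every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def conversion_differences(original, converted):
    """Return (row, variable, before, after) for each changed value or type."""
    diffs = []
    for i, (before, after) in enumerate(zip(original, converted)):
        for var, old in before.items():
            new = after.get(var)
            if type(old) is not type(new) or old != new:
                diffs.append((i, var, old, new))
    if len(original) != len(converted):
        diffs.append(("row count", None, len(original), len(converted)))
    return diffs


original = [
    {"id": "001", "income": 2500.75, "answer": None},  # None = missing value
    {"id": "002", "income": 1800.0, "answer": "yes"},
]
converted = import_csv(export_csv(original, ["id", "income", "answer"]))
diffs = conversion_differences(original, converted)
for diff in diffs:
    print(diff)
```

In this sketch the numeric incomes come back as strings and the missing value comes back as an empty string, which is the kind of change a researcher who knows the data would have to decide how to handle.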

Because national character sets differ, you should also pay attention to character encoding. Some encodings (e.g., Windows-1250) do not cover all character sets at the same time. As a result, the matching language environment (here, Central European languages) has to be set to ensure correct display, which is not always possible. Other encodings (e.g., UTF-8) allow symbols from several character sets to be displayed correctly at the same time.
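The effect can be reproduced with Python's built-in codecs. This is a minimal sketch: the sample strings are arbitrary, and "cp1250" and "cp1252" are Python's names for the Windows-1250 (Central European) and Windows-1252 (Western European) encodings.

```python
polish = "Łódź"
raw = polish.encode("cp1250")  # bytes as stored in a Windows-1250 file

# Decoding with the matching encoding restores the text.
right = raw.decode("cp1250")

# Decoding the same bytes in a Western European environment garbles it.
wrong = raw.decode("cp1252", errors="replace")


def can_encode(text, encoding):
    """Report whether every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Latin letters with diacritics and Cyrillic letters in one string.
mixed = "Łódź and Київ"

print(right, wrong)
print(can_encode(mixed, "cp1250"), can_encode(mixed, "utf-8"))
```

Windows-1250 cannot represent the Cyrillic part of the mixed string, whereas UTF-8 can hold both parts in one file.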

TIP: Plan ahead to simplify data publication

Different data archives have different preferred formats. Knowing about these preferred formats in advance can save you time later when you want to archive and publish your data. Preferred formats are usually widely used, independent of specific software, and based on open specifications (see for instance the information by DANS (n.d.) on preferred formats).