Documentation and metadata


I have never documented my data before. I have both qualitative and quantitative data and I work on a collaborative project. Where do I start?

    1. Do not panic. Much documentation is simply good research practice, so you are probably already doing much of it.

    2. Start early! Careful planning of your documentation at the beginning of your project saves time and effort. Do not leave the documentation until the very end of the project. Remember to include procedures for documentation in your data management planning.

    3. Think about the information that is needed in order to understand the data. What will other researchers and reusers need in order to understand your data?

    4. Create a separate documentation file for the data that includes the basic information about the data. You can also create similar files for each data set. Remember to organise your files so that there is a connection between the documentation file and the data sets.

    5. Plan where to deposit the data after the completion of the project. The repository probably follows a specific metadata standard that you can adopt.

    6. Document consistently throughout the project. Data documentation gives contextual information about your dataset(s). It specifies the aims and objectives of the original project and contains explanatory material including the data source, data collection methodology and process, dataset structure and technical information. Rich and structured information helps you to identify a dataset and make choices about its content and usability.

    TIP Use English for documentation. It increases the chance your data are understood and reused.

Systematically documented research data is the key to making the data publishable, discoverable, citable and reusable, and clear, detailed documentation improves overall data quality. It is vital to document both the study for which the data was collected and the data itself. These two levels of documentation are called project-level and data-level documentation.

Project-level documentation


Project-level documentation explains the aims of the study, what the research questions or hypotheses are, which methodologies were used, which instruments and measures were used, etc. The questions that your project-level documentation should answer are listed in more detail below:

  • Describe the project history, its aims, objectives, concepts and hypotheses, including:

    • The title of the project;
    • Subtitle;
    • Author(s)/creator(s) of the dataset;
    • Other co-workers and their roles (person, research group or organization that participated in the study and their roles);
    • The institution of the author(s)/creator(s);
    • Funders;
    • Grant numbers;
    • References to related projects;
    • Publications from the data.
  • Describe what is in a dataset:

    • Kind of data (interviews, images, questionnaires, etc.);
    • File size (in bytes), file format of the data files and relationships between files;
    • Description of data file(s): version and edition, structure of the database, associations, links between files, external links, formats, compatibility.
  • Describe how the data was acquired:

    • The methodology and technique used in collecting and creating the data;
    • Description of all the sources the data originate from (What is the subject of study? E.g. periodicals, datasets created by others?) together with an explanation of how and why it got to the present place (provenance);
    • The methods/modes of data collection (for example):
      • The instruments, hardware and software used to collect the data;
      • Digitisation or transcription methods;
      • Data collection protocols;
      • Sampling design and procedure;
      • Target population, units of observation.
  • Describe who collected the data, when and where:

    • Data collector(s);
    • Date of data collection;
    • Geographical coverage of the data (e.g. Nation).
  • Describe your workflow and specific tools, instruments, procedures, hardware/software or protocols you might have used to process the data, like:

    • Data editing, data cleaning;
    • Coding and classification of data.
  • Describe if and how the data was manipulated or modified:

    • Modifications made to data over time since their original creation and identification of different versions of datasets;
    • Other possible changes made to the data;
    • Anonymisation;
    • For time series or longitudinal surveys: changes made to methodology, variable content, question text, variable labelling, measurements or sampling.
  • Describe how the quality of the data has been assured:

    • Checking for equipment and transcription errors;
    • Quality control of materials;
    • Data integrity checks;
    • Calibration procedures;
    • Data capture resolution and repetitions;
    • Other procedures related to data quality such as weighting, calibration, reasons for missing values, checks and corrections of transcripts, transformations.
  • Describe the use and access conditions of the data:

    • Where the data can be found (which data repository);
    • Permanent identifiers;
    • Access conditions such as embargo;
    • Parts of the data that are restricted or protected;
    • Licences;
    • Data confidentiality;
    • Copyright and ownership issues;
    • Citation information.
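
As a minimal sketch of how the checklist above could be kept in a structured file, the following Python snippet stores a few project-level fields in a dictionary, reports which fields are still empty, and serialises the record to JSON. The field names, the `REQUIRED` list and all values are illustrative assumptions, not part of any standard; adapt them to the requirements of your repository.

```python
import json

# Hypothetical project-level record; every value is a placeholder.
project_doc = {
    "title": "Example study title",
    "creators": ["Surname, Forename"],
    "institution": "Example University",
    "funders": [{"name": "Example Funder", "grant_number": "0000"}],
    "kind_of_data": "interviews",
    "collection_method": "semi-structured interviews",
    "date_of_collection": "2013-02-08",
    "geographic_coverage": "Nation",
    "processing": ["transcription", "anonymisation"],
    "quality_assurance": ["transcripts checked against recordings"],
    "access_conditions": "embargo until publication",
    "licence": "",  # deliberately left empty to show the check below
}

# Assumption: these are the fields this project treats as mandatory.
REQUIRED = [
    "title",
    "creators",
    "kind_of_data",
    "collection_method",
    "date_of_collection",
    "access_conditions",
    "licence",
]


def missing_fields(doc, required=REQUIRED):
    """Return the required keys that are absent or empty in the record."""
    return [key for key in required if not doc.get(key)]


def to_json(doc):
    """Serialise the record as human-readable JSON text."""
    return json.dumps(doc, indent=2, ensure_ascii=False)


print("Still to document:", missing_fields(project_doc))
print(to_json(project_doc))
```

Running the check at regular intervals is one way to document consistently throughout the project rather than at the end.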

Data-level documentation


Data-level or object-level documentation provides information at the level of individual objects such as pictures or interview transcripts or variables in a database. You can embed data-level information in data files. For example, in interviews, it is best to write down the contextual and descriptive information about each interview at the beginning of each file. And for quantitative data variable and value names can be embedded within the data file itself.

    Variable-level annotation should be embedded within the data file itself. If you need to compile extensive variable-level documentation, you can create it using a structured metadata format.

    Data-level documentation for quantitative data

    For quantitative data document the following:

    • Information about the data file
      Data type, file type and format, size, data processing scripts.
    • Information about the variables in the file
      The names, labels and descriptions of variables, their values, a description of derived variables or, if applicable, frequencies, basic contingencies etc. The exact original wording of the question should also be available. Variable labels should:
      • Be brief with a maximum of 80 characters;
      • Indicate the unit of measurement, where applicable;
      • Reference the question number of a survey or questionnaire, where applicable.
    • Variable: 'Q11eximp'

      Variable label: 'Q11: How important is exercise for you?'
      Value labels: 1: Very unimportant. 2: Unimportant. 3: Neutral. 4: Important. 5: Very important.

      The label gives a reference to the question number (Q11). A unit of measurement is not applicable to this rating scale, so none is stated.

    • Information about the cases in the file
      A specification of each case (unit of research, e.g. a respondent), if applicable.
    • Names, labels and descriptions for variables, records and their values
    • Description of the missing values for each variable
    • Description of the weighting variable
    • Explanation or definition of codes and classification schemes used
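
The variable-level items above can be sketched as a small codebook structure. The example below reuses the variable `Q11eximp` from the text and checks the 80-character limit for labels. The dictionary layout, the helper names and the missing-value code `-9` are assumptions made for illustration; statistical packages and metadata standards each have their own way of storing this information.

```python
# Variable-level documentation for the example variable from the text.
codebook = {
    "Q11eximp": {
        "label": "Q11: How important is exercise for you?",
        "values": {
            1: "Very unimportant",
            2: "Unimportant",
            3: "Neutral",
            4: "Important",
            5: "Very important",
        },
        # Assumption: -9 is a hypothetical missing-value code.
        "missing": {-9: "No answer"},
    }
}

MAX_LABEL_LENGTH = 80  # limit recommended in the list above


def check_labels(book, max_length=MAX_LABEL_LENGTH):
    """Return the names of variables whose label is empty or too long."""
    return [
        name
        for name, entry in book.items()
        if not entry["label"] or len(entry["label"]) > max_length
    ]


def describe(book, name, code):
    """Translate a stored code into its documented meaning."""
    entry = book[name]
    if code in entry["missing"]:
        return f"missing ({entry['missing'][code]})"
    return entry["values"][code]


print(check_labels(codebook))
print(describe(codebook, "Q11eximp", 4))
```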

    Storing documentation

    Whenever possible, embed data documentation within a file.

    Background and contextual information and participant details of interviews, observations or diaries can be described at the beginning of a file as a header or summary page.

    Data-level documentation for qualitative data

    For qualitative data document the following:

    • Textual data file (for example, interview)
      • Key information about participants, such as age, gender, occupation, location and relevant contextual information;
      • For qualitative data collections (for example image or interview collections) you may wish to provide a data list that makes it possible to identify and locate relevant items within a data collection:
        • The list contains key biographical characteristics and thematic features of participants such as age, gender, occupation or location, and identifying details of the data items;
        • For image collections, the list holds key features for each item;
        • The list is created from an initial list of interviews, field notes or other materials provided by the data depositor.
    • For textual data, background data are systematically entered at the beginning of each data unit (e.g. interview transcript) in a standardised manner.

      The following example from the Finnish Social Science Data Archive presents a typical transcript of an interview with only one interviewee. The transcript of each interview in the data has been saved in a separate file, often in .rtf or .doc(x). Background data fields are entered in the following manner at the beginning of each transcription file.

      Beginning of the transcript file

      Interview date: 08.02.2013 [=8 February 2013]
      Interviewer: Matt Miller
      Pseudonym of interviewee: Ian (not the real first name of the interviewee)
      Occupation of interviewee: Journalist
      Age of interviewee: 32
      Gender of interviewee: Male
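
One practical benefit of entering background fields in a standardised manner is that they can be read back automatically. The sketch below parses a header like the one above into a dictionary. It assumes one `Field: value` pair per line and a blank line between the header and the transcript; the explanatory remarks in brackets from the example are left out of the sample text, and the function name is hypothetical.

```python
from datetime import datetime

# Header fields as in the example above, without the bracketed remarks.
header_text = """Interview date: 08.02.2013
Interviewer: Matt Miller
Pseudonym of interviewee: Ian
Occupation of interviewee: Journalist
Age of interviewee: 32
Gender of interviewee: Male

(the transcript itself would start here)
"""


def parse_header(text):
    """Read 'Field: value' lines until the first blank line."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            break  # the blank line marks the end of the header
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


header = parse_header(header_text)

# Convert the day.month.year date into an unambiguous ISO 8601 date.
interview_date = datetime.strptime(header["Interview date"], "%d.%m.%Y").date()

print(header)
print(interview_date.isoformat())
```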

    • Audiovisual data files
      For some types of data (image, audio or video files) the file format does not always allow background information to be recorded at the beginning of the data file. In such cases, the best practice is to store background information in a separate text file or in a manually created data list that accompanies the data collection.
      • Provide the following information on each image: creator, date, location, subject, content, copyright, keywords, equipment used;
      • Some image files have embedded technical metadata, which you can extract with dedicated tools.
    • In this example, shown on the site of the Finnish Social Science Data Archive (2016), the background data fields are manually entered in table form using Excel (or OpenOffice Calc). The data collected were video-recorded interviews. The data list contains background information related to the interviewee and the interview event as well as information on the model and brand of the camera used and the length of the video (in minutes).


      See also another data list example from the UK Data Service (2017c).

    • Periodicals, magazines, journal articles
      Among materials you use for qualitative data analysis, there may be online periodicals, magazines or journal articles. The information about all such resources must be kept in separate files:
      • Material collected from online periodicals: save references to web resources, like URLs, and do not forget they may change over time. To be sure information isn't lost, articles should be copied into a word processing program;
      • Materials from periodicals: When articles, photographs and other material are collected from periodicals for research purposes, bibliographic information should be carefully detailed (author(s), title, date of publication etc.);
      • When you analyse articles, make a list of them and sort it alphabetically, chronologically or in the order in which the articles were analysed in the course of research.
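
The data lists mentioned above (for audiovisual files or for analysed articles) do not have to be spreadsheets; a plain CSV file works as well and stays readable in any software. The following is a minimal sketch for video-recorded interviews. The column names and both rows are hypothetical placeholders, chosen to mirror the kinds of background information described for the audiovisual example.

```python
import csv
import io

# Assumption: illustrative column names, not a prescribed layout.
FIELDS = [
    "file_name",
    "interview_date",
    "interviewee_pseudonym",
    "camera_brand_model",
    "length_minutes",
]

# Hypothetical rows; replace them with your own items.
rows = [
    {
        "file_name": "20130311_interviews_video_01",
        "interview_date": "2013-03-11",
        "interviewee_pseudonym": "Ian",
        "camera_brand_model": "Example Cam X1",
        "length_minutes": 42,
    },
    {
        "file_name": "20130311_interviews_video_02",
        "interview_date": "2013-03-11",
        "interviewee_pseudonym": "Anna",
        "camera_brand_model": "Example Cam X1",
        "length_minutes": 37,
    },
]


def write_data_list(items, fields=FIELDS):
    """Return the data list as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


data_list_csv = write_data_list(rows)
print(data_list_csv)
```

To store the list next to the data, write `data_list_csv` to a file that shares its name stem with the data files it describes.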

    Storing documentation

    • Write the documentation into a separate, well-structured file, and associate that with the data file. You may use the same filename stem in order to strengthen the file-metadata association. For example: 20130311_interviews_audio, 20130311_interviews_trans, 20130311_interviews_image, 20130311_interviews_metadata. The latter part of the name can be used to convey the specifics of the file. In this case "audio" means audio tape and "trans" a transcription of the audio tape;
    • Data-level documentation can be embedded within a data file. For example, in interviews, it is best to write down the contextual and descriptive information about each interview at the beginning of each file;
    • If you have a large amount of metadata, or large amounts of data that will need metadata, you can use a structured metadata standard for the purpose (such as the DDI Codebook (DDI Alliance, 2017a)).
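
The shared filename stem described above also makes it easy to check that every group of data files has its documentation. This sketch uses the file names from the example and assumes that the stem is everything before the last underscore and that the documentation file ends in `metadata`; the function names are hypothetical.

```python
from collections import defaultdict

file_names = [
    "20130311_interviews_audio",
    "20130311_interviews_trans",
    "20130311_interviews_image",
    "20130311_interviews_metadata",
]


def split_name(name):
    """Split a file name into (stem, suffix) at the last underscore."""
    stem, _, suffix = name.rpartition("_")
    return stem, suffix


def group_by_stem(names):
    """Map each stem to the list of suffixes found for it."""
    groups = defaultdict(list)
    for name in names:
        stem, suffix = split_name(name)
        groups[stem].append(suffix)
    return dict(groups)


def stems_without_metadata(names):
    """Return stems that have data files but no 'metadata' file."""
    return [
        stem
        for stem, suffixes in group_by_stem(names).items()
        if "metadata" not in suffixes
    ]


print(group_by_stem(file_names))
print(stems_without_metadata(file_names))
```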

Metadata: machine readable data documentation

Metadata or "data about data" are descriptors that facilitate cataloguing data and data discovery. Metadata is intended for reading by machines. When data is submitted to a trusted data repository, the archive generates machine-readable metadata. Machine-readable metadata help to explain the purpose, origin, time, location, creator, terms of use, and access conditions of research data.

In the tabs below we provide you with examples of:

  • Metadata templates (for easy starting)
    If you do not quite know yet what metadata you should generate (what fields are needed), have a look at the metadata templates provided. Some of them are very simple and can, therefore, help to create basic documentation.
  • Metadata standards (for when you need your metadata to be very structured).
    Metadata standards may seem daunting at first glance. They are used by data archives for enhancing discoverability, interoperability and reusability. When you submit your dataset to a trusted data repository, these standards are automatically applied.
  • Metadata can, at its simplest, be stored in a single text file. However, you can use a metadata template to help you structure your metadata or to see how your metadata appears in .html.
    Below we provide examples of metadata templates that you can use when compiling documentation, or simply consult for inspiration to see which fields are typically required. It is always possible to include additional documentation beyond what is suggested.

  • You may want your metadata to be very structured. For that purpose, you can choose a metadata standard or a tool (software that has been developed to capture or store metadata) to help you add and organise your documentation. Many standards are discipline-specific. These will help you to add metadata to the workflow as they have been created to suit the needs of research data.

    Remember that you don't generally need to generate machine-readable metadata by yourself. The repository where you may want to deposit your data will do that for you. When you are depositing your data the repository will require a data documentation document from you and will convert the documentation into machine-readable metadata.

    The recommended standard for research in the social sciences is the DDI metadata standard.

    • DDI (Data Documentation Initiative) (DDI Alliance, 2017b) is an international standard for describing the data produced by surveys and other observational methods in the social, behavioural, economic, and health sciences. Expressed in XML, the DDI metadata specification supports the entire research data lifecycle.

      Common fields in the DDI include:

      • Title
      • Alternate Title
      • Principal Investigator
      • Funding
      • Bibliographic Citation
      • Series Information
      • Summary
      • Subject Terms
      • Geographic Coverage
      • Time Period
      • Date of Collection
      • Unit of Observation
      • Universe
      • Data Type
      • Sampling
      • Weights
      • Mode of Collection
      • Response Rates
      • Extent of Processing
      • Restrictions
      • Version History
    • MIDAS Heritage (Historic England, 2012) is a British cultural heritage standard for recording information on buildings, archaeological sites, shipwrecks, parks and gardens, battlefields, areas of interest and artefacts.

    • VRA Core (2015) is a standard for the description of images and works of art and culture.

    • ISO 19115 (DCC, 2017) is a schema for describing geographic information and services. It provides information about the identification, the extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data.

    • In its simplest form, Dublin Core consists of 15 fields that can describe almost any online resource:

      1. Title
      2. Creator
      3. Subject
      4. Description
      5. Publisher
      6. Contributor
      7. Date
      8. Type
      9. Format
      10. Identifier
      11. Source
      12. Language
      13. Relation
      14. Coverage
      15. Rights
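
To give an impression of what machine-readable metadata looks like, the sketch below turns a few of the 15 Dublin Core fields into XML using Python's standard library. The element namespace is the one defined for the Dublin Core element set; the wrapping `metadata` element, the function name and all field values are assumptions for illustration only. In practice the repository normally generates such files for you.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Hypothetical record; every value is a placeholder.
record = {
    "title": "Example study title",
    "creator": "Surname, Forename",
    "date": "2014-02-13",
    "type": "Dataset",
    "language": "en",
}


def to_dublin_core(fields):
    """Build an XML string with one dc: element per filled-in field."""
    unknown = [name for name in fields if name not in DC_ELEMENTS]
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {unknown}")
    # Assumption: 'metadata' is just a convenient wrapper element.
    root = ET.Element("metadata")
    for name in DC_ELEMENTS:
        value = fields.get(name)
        if value:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


dc_xml = to_dublin_core(record)
print(dc_xml)
```
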
  • For an example of how the metadata standard DDI is applied, we look at a dataset in the Finnish Social Science Data Archive (Galanakis, Michail (University of Helsinki): Intercultural Urban Public Space in Toronto 2011-2013 [dataset]. Version 1.0 (2014-02-13). Finnish Social Science Data Archive [distributor]).
    The machine-readable XML file looks like this.