Organisation of variables

The organisation of variables supports the structure of the data file. Variable names and labels contribute to structuring the data file: they allow part of the documentation to be integrated into the file and help researchers orient themselves in the data sets. At the same time, variable names should be short and respect the usual requirements of standard software, because they are used as calling codes in software operations.


The position of variables in the data file, their names and labelling should reflect the following:

[Figure: Organising your data]

Data files also include supplementary variables which facilitate orientation and management, ensure integrity, or are necessary for some analyses. As a rule, you should include a unique identifier (or set of identifiers) for cases (individual respondents) in the file. A unique identifier is an identification code for the case, usually a number, for example 0001, 0002, 0003, etc. To facilitate orientation, it is usually placed at the very beginning of the file.
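As an illustration only, the following Python sketch generates zero-padded case identifiers of the kind shown above and checks an identifier column for duplicates. The function names and the default width of four digits are assumptions made for this example, not part of any standard.

```python
def make_case_ids(n_cases, width=4):
    """Return zero-padded sequential identifiers such as '0001', '0002', ..."""
    if n_cases >= 10 ** width:
        raise ValueError("width too small for the number of cases")
    return [str(i).zfill(width) for i in range(1, n_cases + 1)]


def find_duplicate_ids(ids):
    """Return the sorted list of identifiers that occur more than once."""
    seen, duplicates = set(), set()
    for case_id in ids:
        if case_id in seen:
            duplicates.add(case_id)
        seen.add(case_id)
    return sorted(duplicates)


case_ids = make_case_ids(3)
print(case_ids)
print(find_duplicate_ids(["0001", "0002", "0001"]))
```

Storing identifiers as zero-padded text keeps them the same width for every case and preserves the leading zeros when the file is opened in other software.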

Other variables may help to distinguish between different sources of information, methods of observation, temporal or other links. Yet others may provide information about the organisation of data collection such as interviewer ID or interviewing date, or distinguish cases which belong to various groups.

For analysis, it is essential to be able to distinguish data that result from oversampling strategies, different waves of research, etc., especially if the groups of cases they define are to be analysed in different ways.

For each variable in the data file, you should set the variable width, i.e. the number of characters or the length of the integer and fractional parts of a number. The set number of characters or digits for each variable is reserved for every case, even if the field is left blank.
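To make the idea of reserved widths concrete, here is a minimal Python sketch of a fixed-width record. The layout (variable names and widths) is hypothetical and chosen only for the example. Each case occupies the same number of characters, and a missing value is written as blanks.

```python
# Hypothetical layout: (variable name, width in characters)
LAYOUT = [("ID", 4), ("AGE", 3), ("REGION", 2)]


def format_case(case, layout=LAYOUT):
    """Write one case as a fixed-width line; missing values stay blank."""
    fields = []
    for name, width in layout:
        value = "" if case.get(name) is None else str(case[name])
        if len(value) > width:
            raise ValueError(f"{name}={value!r} exceeds width {width}")
        fields.append(value.rjust(width))  # pad on the left to the full width
    return "".join(fields)


def parse_case(line, layout=LAYOUT):
    """Read a fixed-width line back into a dict; blank fields become None."""
    case, start = {}, 0
    for name, width in layout:
        raw = line[start:start + width].strip()
        case[name] = raw or None
        start += width
    return case


line = format_case({"ID": "0001", "AGE": 34, "REGION": None})
print(repr(line))
print(parse_case(line))
```

Note that values are read back as text in this sketch; converting them to numbers would be a separate step.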

Naming variables

The basic rules for variable naming are given below, followed by an example.

  • The basic rules for variable naming are as follows:

    • Start with a letter. Do not start with a number, a question mark, an exclamation mark or a special character such as #, &, $, @ (these are often reserved for specific purposes in software applications);
    • Variable names cannot contain spaces;
    • Variable names are also used as calling codes in software operations. For this reason, variable names should be short and respect the usual requirements of standard software. The standard is not to make variable names longer than eight characters;
    • Do not use diacritics (marks above or below a letter) or national specific characters;
    • Make them meaningful (so they can be used for better orientation in the data files).
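The mechanical rules above can be checked automatically. The following Python sketch is one possible check, not a definitive implementation: the function name is hypothetical, and allowing underscores inside names is an assumption of this example. Whether a name is meaningful cannot be tested this way and still needs human judgement.

```python
import string

MAX_LENGTH = 8  # the eight-character convention mentioned above

# ASCII letters, digits and underscore; permitting the underscore
# is an assumption of this sketch.
ALLOWED = set(string.ascii_letters + string.digits + "_")


def check_variable_name(name):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if not name or name[0] not in string.ascii_letters:
        problems.append("must start with a letter")
    if " " in name:
        problems.append("must not contain spaces")
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    # Spaces are reported separately above, so they are skipped here.
    if any(ch not in ALLOWED and ch != " " for ch in name):
        problems.append("contains diacritics or special characters")
    return problems


for candidate in ["AGE", "1AGE", "birth year", "VĚK"]:
    print(candidate, check_variable_name(candidate))
```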

    There are three basic approaches to naming variables:

    • Using numeric codes that reflect the variable’s position in a system (e.g. V001, V002, V003...);
    • Using codes that refer to the research instrument (e.g. question number in a questionnaire: Q1a, Q1b, Q2, Q3a...);
    • Using mnemonic names that refer to the content of variables (e.g. BIRTH for the year of birth, AGE for respondent’s age etc.). The word mnemonic means “memory aid”.

    Variable labels

    Variable labels provide a short description of the variable. They can be longer than the eight characters recommended for variable names. Although size limits are less strict here, it is advisable to keep variable labels brief and to find an adequate compromise between clarity and the size of the label. Keep in mind that many analytical outputs are provided in tables, so excessively lengthy labels can result in large and impractical tabulations. The size of labels may also complicate format conversions. In some analytical outputs, or after conversion, only a part of a lengthy label is kept, and the loss of the remainder may make the label incomprehensible.

    Examples of variable labels include a short or full version of the question, or a question code if variable names are not constructed around them. E.g.:

    • The variable label is adapted from the number and question-wording from the questionnaire: “B10 - How old are you?”;
    • The descriptive label is “Age of a respondent”;
    • Schematically, this becomes: “Respondent: AGE”.
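The effect of label length can be sketched in Python using the three example labels above. The limit of 16 characters is a hypothetical value chosen for illustration; actual limits depend on the software and the target format.

```python
# The three example labels for the same variable
LABELS = {
    "question": "B10 - How old are you?",
    "descriptive": "Age of a respondent",
    "schematic": "Respondent: AGE",
}


def truncate_label(label, max_chars):
    """Mimic an output or conversion that keeps only the first max_chars characters."""
    return label[:max_chars]


def labels_over_limit(labels, max_chars):
    """Return the labels that would be cut at the given limit."""
    return {key: text for key, text in labels.items() if len(text) > max_chars}


LIMIT = 16  # hypothetical limit, for illustration only
for key, text in labels_over_limit(LABELS, LIMIT).items():
    print(key, "->", repr(truncate_label(text, LIMIT)))
```

Running a check like this before converting a file shows which labels would lose information and should be shortened by hand.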

    To reach the widest audience possible, the preferred language for variable naming is English.

    Labels for variable values

    Variables have two or more values (a variable with only one value is called a constant and is in fact not a variable). Sometimes you must assign labels to the values of variables. You do not need to assign labels to values of continuous variables like age (in years), height (in metres) or weight (in kilograms), because their units are generally known. It is different for nominal and ordinal variables. A nominal variable like gender has two values, usually represented in your data by 0 and 1. You should assign the labels "male" and "female" to these two values, so that you and other researchers who use your data know which value represents which gender. The same applies to ordinal scales, for example an agree-disagree scale with values 1, 2, 3, 4 and 5, where 1 represents "completely disagree" and 5 "completely agree". You must label these values so that you and others know what degree of agreement or disagreement the numbers represent.
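A minimal way to picture value labels is a mapping from codes to labels, as in the Python sketch below. The variable names, the assignment of 0 and 1 to "male" and "female", and the labels for the intermediate scale points 2 to 4 are assumptions made for this example; in practice they come from your questionnaire and codebook.

```python
# Hypothetical value labels; the 0/1 assignment and the labels for
# scale points 2-4 are assumptions of this sketch.
VALUE_LABELS = {
    "GENDER": {0: "male", 1: "female"},
    "AGREE": {
        1: "completely disagree",
        2: "disagree",
        3: "neither agree nor disagree",
        4: "agree",
        5: "completely agree",
    },
}


def label_value(variable, code, value_labels=VALUE_LABELS):
    """Return the label for a coded value; unlabelled variables return the code as text."""
    labels = value_labels.get(variable)
    if labels is None:  # e.g. a continuous variable such as AGE
        return str(code)
    return labels[code]  # a KeyError signals an undocumented code


def unlabelled_codes(variable, observed, value_labels=VALUE_LABELS):
    """Return observed codes that have no label, so they can be documented."""
    return sorted(set(observed) - set(value_labels.get(variable, {})))


print(label_value("GENDER", 1))
print(label_value("AGE", 34))
print(unlabelled_codes("AGREE", [1, 5, 9, 3]))
```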

  • Two different concepts of variable naming and labelling in the data file from the International Social Survey Programme

    The International Social Survey Programme (ISSP) is a continuing, long-term international programme of survey research on important sociological topics. It brings together pre-existing social science projects and coordinates research goals, thereby adding a cross-national perspective to the individual national studies. Established in 1984, it now has almost 50 member countries. The ISSP surveys are organised annually.

    Each ISSP survey contains two international modules:

    • ISSP thematic module
      A specific topic of the survey is selected for each year. There are about ten topics, which are repeated at regular intervals. However, sometimes a topic is skipped or replaced by a new one.
    • ISSP background variables module
      These include a set of harmonised sociodemographic variables. This module is repeated every year. However, there are also frequent changes in this set of variables.

    Two different concepts of variable naming and labelling are used for these two modules.

    Table: Excerpt from the variable list of the international dataset from ISSP 2009 on ‘Social Inequalities’ (ISSP Research Group, 2017).

    In the table we see two approaches to variable labelling:

    • Simple variable names
      The first, thematic part of the file contains simple variable names (numeric codes). The numbers of the questions in the common international questionnaire are included in the variable labels, which helps users orient themselves in the data file. Each question number is followed by the literal question wording, sometimes shortened so that the label stays brief but remains comprehensible. Some ISSP surveys allow alternative wording of questions; the possible alternatives are enclosed in inequality signs (< >). Similarly, where the wording contains country-specific elements (e.g., country name, the currency used), generic terms are given in inequality signs.
    • Mnemonic names of variables
      The second part contains background variables and uses mnemonic variable names referring to their contents. These background variables are not directly linked to the wording of questions in the international questionnaire but are instead constructed from national versions of the data. Their names refer to their contents and, at the same time, to the links between them (e.g., DEGREE = the education variable transformed into an internationally comparable form, XX_DEGR = education variables using original country-specific coding). Moreover, the set of mnemonic names of background variables is standardised across different ISSP surveys, which allows easier merging of ISSP data files across time and construction of time-series databases.

    TIP! Mnemonic variable names may help to establish links between sets of variables within the data file. In addition, in repeated surveys, using the same naming convention for mnemonic names makes it easier to merge data over time.
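The benefit of stable mnemonic names for merging can be sketched in Python as follows. The data are invented for illustration and are not real ISSP records; the function name, the wave labels and the added WAVE variable are assumptions of this example. Variables that keep the same name in every wave (here DEGREE) line up automatically, while position-based names such as V5 or V7 do not.

```python
def merge_waves(waves):
    """Stack cases from several waves, keeping only variables shared by all waves.

    `waves` maps a wave label to a list of case dicts (variable name -> value).
    Returns the sorted list of shared variable names and the merged cases.
    """
    shared = None
    for cases in waves.values():
        names = set().union(*(case.keys() for case in cases))
        shared = names if shared is None else shared & names
    shared = sorted(shared or [])

    merged = []
    for wave, cases in waves.items():
        for case in cases:
            row = {"WAVE": wave}  # record which wave each case comes from
            row.update({name: case.get(name) for name in shared})
            merged.append(row)
    return shared, merged


# Invented example data: only DEGREE has the same name in both waves.
waves = {
    "wave1": [{"DEGREE": 3, "V5": 1}],
    "wave2": [{"DEGREE": 4, "V7": 2}],
}
shared, merged = merge_waves(waves)
print(shared)
print(merged)
```

Keeping a wave indicator in the merged file also follows the earlier advice to distinguish cases that come from different waves of research.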