Data entry and integrity

Data integrity means assurance of the accuracy, consistency, and completeness of original information contained in the data. At the same time, the authenticity of the original research information has to be preserved (see 'Data authenticity').

The integrity of a data file is based on its structure and on links between data and integrated elements of documentation. Data integrity is at stake from the moment data are entered.

Data entry procedures have changed in recent years. Manual data entry by operators is being replaced by automated technologies, and the once universal distinction between the three phases of data collection, data entry, and data editing/checking is often becoming obsolete. Greater automation generally prevents some types of errors, but it also produces others. For example, errors in scripts used in computer-assisted interviewing may cause systematic shifts in the data, and detecting such deviations in automated data entry requires different kinds of checks than those used for manually entered data.

Minimising errors in survey data entry

The recommendations below summarise how to minimise errors in survey data entry (UK Data Service, 2017a; ICPSR, 2012; Groves et al., 2004).

Check the completeness of records

Check if your data files contain the correct number of records, number of variables or length of the records, etc.
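A completeness check of this kind can be scripted. The following is a minimal sketch in Python, assuming the data are stored as comma-separated text with one header row; the function name, the sample data and the expected counts are hypothetical and only serve as illustration.

```python
import csv
import io

def check_completeness(csv_text, expected_records, expected_variables):
    """Return a list of human-readable problems found in the file layout."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, records = rows[0], rows[1:]
    problems = []
    if len(header) != expected_variables:
        problems.append(f"expected {expected_variables} variables, found {len(header)}")
    if len(records) != expected_records:
        problems.append(f"expected {expected_records} records, found {len(records)}")
    # Every record should have exactly as many fields as the header.
    for line_no, record in enumerate(records, start=2):
        if len(record) != len(header):
            problems.append(f"line {line_no}: {len(record)} fields instead of {len(header)}")
    return problems

# Hypothetical three-variable file in which the last record lost a field.
sample = "id,age,income\n1,34,2100\n2,51,1800\n3,47\n"
print(check_completeness(sample, expected_records=3, expected_variables=3))
```

An empty list means that the number of records, the number of variables and the length of each record all match what was expected.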

Reduce burden of manual data entry

Manual data entry requires routine and concentration. Operators should not be burdened by multiple tasks. Tasks such as coding and data entry should be implemented separately.

Minimise the number of steps

The data entry process should include a smaller rather than a larger number of steps. This reduces the likelihood of errors.

Conduct data entry twice

When you have paper questionnaires, data can be entered electronically by scanning the questionnaires or manually by a person responsible for data entry. If data are entered by scanning, run the data entry process twice and compare the values. If data are entered manually, a portion of the questionnaires should be entered twice by two different persons. For example, the Czech Association of Public Opinion and Market Research Agencies (SIMAR) recommends that 20 percent of questionnaires be re-entered.
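Comparing two entry passes is easy to automate. The sketch below, in Python, assumes that each pass has already been loaded into a dictionary keyed by questionnaire number; the structure, the function name and the sample values are assumptions made for illustration.

```python
def compare_entries(first_pass, second_pass):
    """Compare two independent entries of the same questionnaires.

    Both arguments map a questionnaire id to a dict of variable -> value.
    Returns (questionnaire id, variable, first value, second value) tuples
    for every value that differs between the two passes.
    """
    discrepancies = []
    # Only questionnaires present in both passes can be compared.
    for qid in sorted(first_pass.keys() & second_pass.keys()):
        a, b = first_pass[qid], second_pass[qid]
        for variable in sorted(a.keys() | b.keys()):
            if a.get(variable) != b.get(variable):
                discrepancies.append((qid, variable, a.get(variable), b.get(variable)))
    return discrepancies

# Hypothetical data: the second operator transposed the digits of one age.
first = {101: {"age": 34, "income": 2100}, 102: {"age": 51, "income": 1800}}
second = {101: {"age": 34, "income": 2100}, 102: {"age": 15, "income": 1800}}
print(compare_entries(first, second))
```

Each reported discrepancy should then be resolved against the paper questionnaire rather than by trusting either pass.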

Perform in-depth checks for selected records

At least some randomly selected records, e.g. 5–10% of all records, should be subjected to a more detailed, in-depth check to verify the procedures and identify possible systematic errors. Be sure to document the changes you make and keep the original data so you can restore them at any time.
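One way to draw such a random selection reproducibly is sketched below in Python. The function name, the default share of 5 percent and the seed value are assumptions; recording the seed is simply one way to document how the cases were chosen.

```python
import random

def select_for_in_depth_check(record_ids, percent=5, seed=2024):
    """Draw a reproducible random sample of record ids for detailed checking."""
    ids = sorted(record_ids)
    if not ids:
        return []
    # Round the sample size up so that small files still yield at least one case.
    sample_size = (len(ids) * percent + 99) // 100
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(ids, sample_size))

# Hypothetical file with 1,000 records numbered 1 to 1000.
selected = select_for_in_depth_check(range(1, 1001), percent=5)
print(len(selected))
```

Running the function again with the same seed returns the same cases, which makes the check itself verifiable.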

Perform logical and consistency checks

There are multiple methods for logical and consistency checks, including the following:

Check the value range (e.g. a respondent over the age of 100 is unlikely);
Check the lowest and highest values and extremes;
Check the relations between associated variables (e.g. educational attainment should correspond with a minimum age, the total number of hours spent doing various activities should not exceed 100% of the available time);
Compare your data with historical data (e.g. check the number of household members with the previous wave of a panel survey).
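The checks listed above can be combined into a single routine. The following Python sketch is illustrative only: the variable names, the age bounds, the assumed minimum age for a university degree and the tolerated change in household size are all assumptions that would have to be replaced by rules suited to your own survey.

```python
def consistency_problems(record, previous_household_size=None):
    """Return a list of logical and consistency problems for one respondent."""
    problems = []
    age = record["age"]
    # Value range: the bounds are assumptions for an adult survey population.
    if not 15 <= age <= 100:
        problems.append("age outside expected range 15-100")
    # Related variables: educational attainment implies a minimum age.
    if record["education"] == "university" and age < 20:
        problems.append("university degree below assumed minimum age of 20")
    # Related variables: activities cannot exceed the available time.
    if sum(record["daily_hours"].values()) > 24:
        problems.append("daily activities add up to more than 24 hours")
    # Historical data: compare with the previous wave when it is available.
    if previous_household_size is not None:
        if abs(record["household_size"] - previous_household_size) > 3:
            problems.append("household size changed by more than 3 since previous wave")
    return problems

# Hypothetical respondent who triggers three of the four checks.
record = {
    "age": 17,
    "education": "university",
    "daily_hours": {"work": 9, "sleep": 8, "leisure": 8},
    "household_size": 2,
}
for problem in consistency_problems(record, previous_household_size=6):
    print(problem)
```

A flagged record is not necessarily wrong; the output is a list of cases to inspect, not a list of values to delete.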

Automate checks whenever possible

Specialised software for computer-assisted interviewing (CAPI, CATI, etc.) or data entry software allows you to set the range of valid values for each category and to apply filters to manage the data entry or the entire data collection process. These automatic checks:

Prevent meaningless values from being entered;
Help to discover inconsistencies that arise when some values are skipped or omitted;
Make the interviewer's work substantially clearer and easier;
Reduce the number of errors that interviewers make.

The software can distinguish between permanent rules that cannot be bent and warnings that only notify the operator when entering an unlikely value.
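The difference between a permanent rule and a warning can be sketched in a few lines of Python. This is not how any particular CAPI package works; the function names and the thresholds (130 as an impossible age, 100 as an unlikely one) are assumptions chosen for illustration.

```python
def check_age(value):
    """Classify an entered age as 'error', 'warning' or 'ok'."""
    # Permanent rule: non-integer, negative or impossible ages are never accepted.
    if isinstance(value, bool) or not isinstance(value, int) or value < 0 or value > 130:
        return "error"
    # Warning: the value is possible but unlikely, so the operator should confirm it.
    if value > 100:
        return "warning"
    return "ok"

def enter_age(value, confirmed=False):
    """Accept a value, reject it, or return None to request confirmation."""
    status = check_age(value)
    if status == "error":
        raise ValueError(f"age {value!r} is not a valid value")
    if status == "warning" and not confirmed:
        return None  # the entry form would now ask the operator to confirm
    return value

print(check_age(34), check_age(104), check_age(-1))
```

Keeping the two levels separate means that genuine but unusual answers can still be recorded once the operator has confirmed them.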

CAPI software is used by data collectors; it is usually expensive, so individual researchers often cannot afford to buy it. If you collected your survey data yourself, you will need to write your own program/syntax to check your data for discrepancies.

An example of SPSS syntax to check your data

Logical check of income - the household income cannot be SMALLER than individual income

The syntax searches for respondents who reported both their household income and their individual income, and whose household income was smaller than their individual income.

Variable names:

ide.10 - household income
interval variable, income in Euros, with special values 8 - refused to answer; 9 - don't know
ide.10a - individual income
interval variable, income in Euros, with special values 8 - refused to answer; 7 - doesn't have income

Syntax (SPSS):

* Select cases with two valid incomes where household income is below individual income.
USE ALL.
COMPUTE filter_$=(ide.10a ne 0) and (ide.10 ne 0) and (ide.10a ne 7) and (ide.10a ne 8) and (ide.10 ne 8) and (ide.10 ne 9) and (ide.10 < ide.10a).
VARIABLE LABELS filter_$ '(ide.10a ne 0) and (ide.10 ne 0) and (ide.10a ne 7) and (ide.10a ne 8) and (ide.10 ne 8) and (ide.10 ne 9) and (ide.10 < ide.10a) (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
* Report the selected cases.
FREQUENCIES VARIABLES=CD
  /ORDER=ANALYSIS.
* Switch the filter off so that later commands use all cases again.
FILTER OFF.
USE ALL.
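For readers who do not use SPSS, the same logical check can be written in a general-purpose language. The Python sketch below mirrors the conditions of the syntax above (zero and the special codes are excluded before the comparison); the record structure and the "id" key are assumptions made for illustration.

```python
# Codes that must not be treated as incomes, as in the variable descriptions.
HOUSEHOLD_SPECIAL = {0, 8, 9}   # ide.10: 8 refused to answer, 9 don't know
INDIVIDUAL_SPECIAL = {0, 7, 8}  # ide.10a: 7 doesn't have income, 8 refused to answer

def income_inconsistent(household, individual):
    """True when both incomes are valid and household income is the smaller one."""
    if household is None or individual is None:
        return False
    if household in HOUSEHOLD_SPECIAL or individual in INDIVIDUAL_SPECIAL:
        return False
    return household < individual

def flag_income_inconsistencies(records):
    """Return the ids of respondents whose answers fail the logical check."""
    return [r["id"] for r in records
            if income_inconsistent(r.get("ide.10"), r.get("ide.10a"))]

# Hypothetical respondents; only the second one reports an impossible combination.
records = [
    {"id": 1, "ide.10": 3000, "ide.10a": 1500},
    {"id": 2, "ide.10": 1200, "ide.10a": 2000},
    {"id": 3, "ide.10": 8, "ide.10a": 2000},
    {"id": 4, "ide.10": 2500, "ide.10a": 7},
]
print(flag_income_inconsistencies(records))
```

As in the SPSS version, the special codes are checked first so that a refusal or a "don't know" is never mistaken for a very low income.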

In cases of errors ..

What to do with error values?

You can either delete or try to correct error values. Simple data entry errors can easily be corrected by comparing them with the respondents' original answers. However, bear in mind that inconsistencies can also be generated by the respondents themselves, and any correction should change or reduce their original answers as little as possible, or not at all. Any replacement of originally measured values must be planned for and done in conformity with your research concepts.
Entering data directly into an MS Excel sheet or the data sheet of a statistical software package is a frequent source of errors. It is easy to skip a column or row, and it is then difficult to identify and correct all the errors. However, even in MS Excel it is easy to set up a form for entering records one by one (see the video by United computers, 2013) and, if you have at least basic programming skills, to set up some simple checks. Using MS Access for this purpose would be easier. It is also possible to use suitable data entry freeware, which is widely available on the web.

Considerations in making high-quality transcriptions of qualitative data

The most common formats of qualitative data are written texts, interview data and focus group discussion data. In most cases, interview and discussion data are first digitally recorded and then transcribed. Transcription is a translation between forms of qualitative data, most commonly a conversion of audio or video recordings into text. If you intend to share your data with other researchers, you should prepare a full transcription of your recordings (Bucholtz, 2000).

There are several basic rules and steps in the process of making and checking a high-quality transcript from audio/video (Kuckartz, 2014):

Prevent mistranscription by recording high-quality data

The quality of interview data gathered by means of recorded interviews depends on both the skills of the interviewer and the quality of the audio-visual equipment. Taking steps to create audio recordings of good quality increases their usefulness. Good quality sound recordings should prevent mistranscription and reduce the chance of sections of an interview remaining untranscribed due to poor sound quality. When recording an interview, consider the following (Bucholtz, 2000):

The level of sound or picture quality needed;
The budget available for equipment and related consumables;
How quickly the technology being used will become redundant;
Whether consent is in place to allow the fullest use of recordings;
How the data created will be used;
Whether data or information not allowed by consent can be excluded from recording;
Whether the equipment will be simple to operate in the field.

Determine the transcription method

Transcription methods depend upon your theoretical and methodological approach and can vary between disciplines. Three basic approaches to transcription are (Bucholtz, 2000):

Focus on the content
This is also called the denaturalised approach, which is closest to written language. The focus is on the content of what was said and the themes that emerge from it. This approach is used in sociological research projects.
Focus on what is said and how it is said
This is called the naturalised approach, which is closest to speech. The transcriber seeks to capture all the sounds they hear and uses a range of symbols to represent particular features of speech, such as the length of pauses, laughter, overlapping speech, turn-taking or intonation. This approach is usually employed in projects using conversation analysis.
Focus on emotional and physical language
In this approach, detailed notes are taken on emotional reactions, physical orientation, body language and use of space, as well as on the psycho-dynamics of the relationship between the interviewer and the interviewee. This approach is usually used in psycho-social research.

Choose between transcribing manually and using speech recognition software (SRS)

SRS must “get used” to a speaker and can only be used if a high-quality recording is available. Gibbs (2007) recommends checking the utility and functionality of SRS before using it.
When transcribing manually, you may sometimes hear something other than what an interviewee actually said. Listen carefully.

Determine the rules

Determine a set of transcription rules or choose an established transcription system that is suited for the planned analysis;
In setting up the rules, consider compatibility with the import features of QDA (Qualitative Data Analysis) software. For example, document headers and textual formatting, such as italics or bold, may be lost when transcripts are imported into software packages, and text formatted in two columns indicating speakers and utterances may also be problematic;
All team members doing the transcription should first agree on these rules;
Write transcriber instructions or guidelines with required transcription style, layout and editing.

Transcribe

Transcribe the texts (or part of the texts) on the computer.

Check the transcription

Proofread, edit and modify the transcription, if necessary.

Protect your participants

Anonymise data during transcription, or mark sensitive information for later anonymisation (see 'Anonymisation');
When you assign the task of transcription to somebody else, make sure to take care of personal data protection before sending audio recordings and transcripts that contain personal or sensitive information. Draw up a non-disclosure agreement with the transcriber and encrypt files before transfer.
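Marking sensitive information for later anonymisation can be partly supported by a script. The Python sketch below replaces a known list of personal names with numbered placeholders and returns a key for the replacements. It is a simplified illustration: the function name, the placeholder format and the sample text are assumptions, and a name list will never catch every identifying detail, so manual review remains necessary.

```python
import re

def pseudonymise(transcript, names):
    """Replace each listed name with a numbered placeholder.

    Returns the marked transcript and a key mapping placeholders to names.
    """
    replacements = {}
    for index, name in enumerate(names, start=1):
        placeholder = f"[PERSON_{index}]"
        # Word boundaries prevent replacing a name inside a longer word.
        pattern = re.compile(rf"\b{re.escape(name)}\b")
        transcript, count = pattern.subn(placeholder, transcript)
        if count:
            replacements[placeholder] = name
    return transcript, replacements

# Hypothetical interview fragment with two names to be masked.
text = "I: Did Anna move to Prague?\nR: Yes, Anna moved with Petr."
marked, key = pseudonymise(text, ["Anna", "Petr"])
print(marked)
```

The key itself is personal data: it should be stored separately from the transcripts and protected in the same way as the original recordings.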

Choose a QDA-compatible file format

Format the transcription so that it can be imported into your QDA (Qualitative Data Analysis) software and the software can be used optimally.

Choose a file format for long-term preservation

Save and archive the transcription in a file format suitable for long-term preservation, such as *.rtf or *.pdf (see 'File Formats').
