Access, use and cite data

Once you have found suitable data for your purpose and checked their quality (see the paragraph 'The process of data discovery'), how can you get them? The steps below highlight a number of aspects that you may encounter along the way:

Different access arrangements may be in place for any data collection, especially those containing more detailed, sensitive or confidential data. Generally, the following access options exist:

  • Open Data
    Anyone can freely access, use, modify, and share for any purpose.
  • Open after registration
    The user must register and provide the required information. The information often includes personal data, institutional affiliation, and purpose of use. No specific conditions of access are required.
  • Open under specific terms and conditions
    For example:
    • Access to a data collection requires permission from the data depositors or data owners;
    • 'Scientific use files' are available only for academic research and education;
    • Data use is limited to 'non-commercial use only';
    • Sensitive and confidential data are available only under strict conditions of use and security measures.
  • Access to metadata only
    Many data files are inaccessible for different reasons. Even then, the relevant metadata may still be available in repositories, and the information obtained from them may also be helpful for your research.
  • Embargo
    Some repositories contain data under embargo, i.e. the data are released for public use only after a specified time period (e.g., 6 or 12 months).
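
To plan around an embargo, it can help to work out when the data will be released. The sketch below is only an illustration: the function names are hypothetical, and it assumes that the embargo is counted in whole calendar months from the deposit date. Individual repositories may count the period differently, so always check their own terms.

```python
import calendar
from datetime import date


def embargo_release_date(deposit: date, months: int) -> date:
    """Return the date on which an embargo of `months` months ends.

    Assumption: the embargo runs in whole calendar months from the deposit date.
    """
    # Work in "months since year 0" so that year rollover is handled by divmod.
    total = deposit.year * 12 + (deposit.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day for shorter months (e.g. 31 August + 6 months ends in February).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(deposit.day, last_day))


def is_released(deposit: date, months: int, today: date) -> bool:
    """True once the embargo period has passed."""
    return today >= embargo_release_date(deposit, months)


if __name__ == "__main__":
    deposited = date(2023, 1, 15)
    print(embargo_release_date(deposited, 12))
    print(is_released(deposited, 12, date(2023, 6, 1)))
```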

Also see Chapter 6 for information on access categories from the viewpoint of the data depositor.

Examples

The access and use conditions for different types of data may vary, even within the same repository. For example, GESIS (GESIS, n.d.b) in Germany assigns its data to the following access categories:

  • Category 0
    Data and documents are released for everybody.
  • Category A
    Data and documents are released for academic research and teaching.
  • Category B
    Data and documents are released for academic research and teaching, if the results are not published. If any publication or further processing of the results is planned, permission must be obtained by the Data Archive.
  • Category C
    Data and documents are only released for academic research and teaching after the data depositor’s written authorization. For this purpose the Data Archive obtains written permission with specification of the user and the analysis intention.

UK Data Service uses the following access categories (UK Data Service, n.d.g) to support access to its large collection of data from various sources, including the UK Office for National Statistics:

  • Standard access
    Applies to the majority of UKDS data and only requires user and project registration. These data are fully anonymised.
  • Special Conditions
    Usually specified by the data owners; users must agree to them during the download/ordering process.
  • Special Licence
    Used for data collections containing more detailed (and therefore potentially disclosive) data such as smaller scale geographical information. If you apply for Special Licence data, you need to provide more detailed information about the intended use of the data using a set of Special Licence forms.
  • Secure Lab (Controlled data)
    Provides secure access to data that are too detailed, sensitive or confidential to be made available under other arrangements such as a Special Licence. To use the Secure Lab, you need to complete a special application and attend a training course. Data accessed in this way cannot be downloaded. Once researchers and their projects are approved, they can analyse the data remotely or by using the UKDS Safe Room.

In order to protect the anonymity of individual persons or businesses, access to confidential microdata at Eurostat is restricted. Most of Eurostat's microdata is accessible only in the form of so-called scientific use files (SUFs), for scientific purposes only.

Access is based on a complicated system of accreditation (European Commission, n.d.d). In addition, there are also public use files (PUFs), or public microdata, which are made available to the public. These files are prepared in such a way that individual entities cannot be identified. However, this de-identification is accompanied by a loss of informative value in the data.

Data files may be copyrighted works and therefore subject to the copyright provisions specified in the terms and conditions of use. Nowadays, the agreement with conditions of use is usually available on-line, but a written agreement may be required at some repositories or under some circumstances. Sometimes, especially for datasets classified as Open Data, CC licences (Creative Commons, 2017) are used to facilitate access to data. For more information on licence agreements, read the 'Licencing your data' paragraph of Chapter 6.

Nowadays, data organisations and projects are increasingly offering tools for on-line data analysis in addition to direct downloads. Sometimes different ways of access are offered to the same data file by different repositories.

Examples of ways of access are:

  • Direct download
    Direct download is the easiest way to get the data. However, you should consider the availability of appropriate analytical software and the structure and format of the dataset. Experienced analysts usually prefer direct downloads, as the capabilities of on-line tools are often limited to very basic analytical methods.
  • Online analysis
    The advantage of online analysis is that you do not need your own specialised software. In addition, especially if the dataset has a complicated structure, the online tool may be easier to work with. It may allow easier orientation in a complex database, selecting, linking or merging of its different sections, selecting correct weighting factors, etc.
    Tools for on-line analysis are available at, e.g.:
  • On site access in the safe room
    Secure data centres with safe rooms provide access to highly sensitive and confidential data under strict security measures. Researchers are required to apply for accreditation, travel to the location of the centre, and work with the data in the safe room. For example, see the description of the Safe Room at the UK Data Service Secure Lab (UK Data Service, n.d.f).
  • Secure remote-execution system
    A secure remote-execution system is an alternative way to make confidential data accessible. The data user has access to rich metadata, but not directly to the dataset. Instead, the user submits the statistical programs for the intended analysis and receives aggregated results in return. For example, LISSY (LIS, n.d.) allows researchers to access microdata from the Luxembourg Income Study (LIS, n.d.b) and the Luxembourg Wealth Study (LIS, n.d.c). Users submit their statistical programs written in R, SAS, SPSS or Stata via the Job Submission Interface or via email. LISSY automatically processes the jobs and returns aggregated results within a few minutes.

Case: Using NESSTAR for data discovery

As we have noted, there are differences in ways of data presentation and functionalities of search among individual repositories. Some CESSDA archives use NESSTAR (NSD, n.d.) software. NESSTAR is a software system for publishing and presenting data on the Web. Some data services use NESSTAR as their main tool for searching and accessing data while others have a main catalogue and provide NESSTAR as an additional tool. NESSTAR enables online data browsing and analysis. You can also download tables, graphs, data files and study descriptions. NESSTAR help pages, accessible by clicking the question mark at the top of the screen, include helpful guidance. In NESSTAR, you can use advanced search.

Not all available data can be accessed free of charge. Even if the principles of open access to research data are applied, data users may be required to cover the marginal costs of access. For specific types of access the expenses may be considerable, e.g. when you have to cover travel expenses to gain access at secure data centres.

In addition to the costs associated with access, it can also take time to gain access. For example, administration of requests and authorisation procedures for access to confidential and sensitive data is often time-consuming.

Being able to download data does not mean that they are available in the format you need. Some tips:

  • Keep in mind that raw research data may have a specific structure and their efficient processing and analysis may require specialised software and skills.
  • Data services often offer downloads in several different formats. Sometimes, however, only one format is available. If it is a current, standard analytical software format, there is usually no problem. In contrast, old proprietary formats can cause significant difficulties. An overview of data formats and more information about format conversion is available in Chapter 3.

This Expert Guide does not focus on data processing for purposes of data analysis. However, the following chapters can help you in understanding your data and their preparation for analysis:

Challenges in using data

After downloading the data, you will have to make the data suitable for reuse. The case study below shows that the challenges you may encounter before you can actually start using and analysing the data may be complex.

Case study: Data for a replication study

Kristyna Bašná works at the Institute of Sociology of the Czech Academy of Sciences (n.d.). She needed data for a replication study. How did she discover, access and use such data?

My research focuses on the relations between structural properties of states, civic culture attitudes and change in the level of democracy. My research is a replication of a well-known paper written by Muller and Seligson (1994) who did a cross-national analysis on 27 countries and concluded that civic culture does have an important influence on the level of democracy.

To be able to replicate this analysis I needed data that would allow for cross-national and longitudinal comparison. At the same time the data should be comparable with the data used in the data analysis of Muller and Seligson. Data such as GDP per capita, level of democracy or Gini coefficient, are relatively easily accessible. However, it was much harder to find data with variables identical to the variables which were used by Muller and Seligson to measure civic culture. Yet this was exactly what I needed in order to be comparable.

I decided to search all the different cross-national public opinion survey databases and look for the exact same question that was used by Muller and Seligson (1994). In the end I was able to find data on 85 countries ranging from 1981 to 2015, in total covering 337 country-years. I downloaded the data on civic culture from openly accessible resources such as:

I also downloaded:

Downloading data from multiple resources is not a straightforward task because most databases use different coding. It is therefore essential to combine the data from multiple sources correctly and with the utmost care, because variable names and country names may differ, data may be missing and different types of weighting may have been used. In my case, I did not need data about individuals, but data collapsed by country and year. That is why for each database I first collapsed the data (using weights), keeping only the variables that I needed for my analysis.

In the second step, I made sure that the country names were identical in each of my data resources. I had to recode a number of countries because some surveys used very different coding. I also had to ensure that the variable on civic culture was identically coded in all of the different data resources, which was fortunately the case. Finally, I merged all the different datasets into one big data file, which I then used for my quantitative analysis and for the replication of the Muller and Seligson (1994) article.
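
The workflow described in this case study (collapse each survey to weighted country-year values, harmonise the country names, then merge) could be sketched roughly as follows. This is not the code used in the study: the variable names (`trust`, `w`), the alias table and the sample rows are invented for illustration, and a weighted mean is assumed as the collapsing rule.

```python
from collections import defaultdict

# Hypothetical mapping from survey-specific country labels to one common name.
COUNTRY_ALIASES = {
    "Czech Rep.": "Czechia",
    "Czech Republic": "Czechia",
    "CZ": "Czechia",
}


def harmonise_country(name):
    """Map a survey-specific country label to the common name."""
    name = name.strip()
    return COUNTRY_ALIASES.get(name, name)


def collapse(rows, value_key, weight_key):
    """Collapse respondent-level rows to weighted means per (country, year)."""
    sums = defaultdict(float)
    weights = defaultdict(float)
    for row in rows:
        value, weight = row.get(value_key), row.get(weight_key)
        if value is None or weight is None:
            continue  # skip respondents with a missing answer or weight
        key = (harmonise_country(row["country"]), row["year"])
        sums[key] += value * weight
        weights[key] += weight
    return {key: sums[key] / weights[key] for key in sums if weights[key] > 0}


def merge(*collapsed_sources):
    """Merge collapsed sources; refuse to silently overwrite a country-year."""
    merged = {}
    for source in collapsed_sources:
        for key, value in source.items():
            if key in merged:
                raise ValueError(f"country-year {key} appears in more than one source")
            merged[key] = value
    return merged


# Invented sample rows from two hypothetical surveys with different country coding.
survey_a = [
    {"country": "Czech Rep.", "year": 1999, "trust": 1, "w": 2.0},
    {"country": "Czech Rep.", "year": 1999, "trust": 0, "w": 1.0},
    {"country": "Czech Rep.", "year": 1999, "trust": None, "w": 1.0},
]
survey_b = [
    {"country": "CZ", "year": 2008, "trust": 1, "w": 1.0},
    {"country": "CZ", "year": 2008, "trust": 0, "w": 3.0},
]

if __name__ == "__main__":
    combined = merge(collapse(survey_a, "trust", "w"), collapse(survey_b, "trust", "w"))
    for (country, year), value in sorted(combined.items()):
        print(country, year, round(value, 3))
```

Raising an error on a duplicate country-year is a deliberate choice in this sketch: when two surveys cover the same country and year, the analyst has to decide which source to keep rather than having one value overwritten unnoticed.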

Citing data

After you have used research data, you may want to publish the work you have done. In this case, you should always cite the research data. Research data may be subject to intellectual property rights, and an obligation to cite the data is usually included in the terms and conditions for their use. The obligation to properly acknowledge any research work, including the work invested in the development of databases, also logically follows from research ethics.

  • Expert tip: Use a persistent identifier
    When citing data, always use a persistent identifier (such as a DOI, Digital Object Identifier) if one is available. It promotes the findability and accessibility of the data.

The minimal data citation recommended by DataCite (DataCite, n.d.b) is:

Creator (PublicationYear). Title. Publisher. Identifier

DataCite recommends including information about two optional properties, Version and ResourceType (if applicable):

Creator (PublicationYear). Title. Version. Publisher. ResourceType. Identifier

Poguntke, T., Scarrow, S., Webb, P. (2017). Political Party Database, 2011-2014. [data collection]. UK Data Service. SN: 8265, http://doi.org/10.5255/UKDA-SN-8265-1

Scarrow, S., Webb, P., Poguntke, T., 2017, Political Party Database, 2011-2014, [data collection], UK Data Service, Accessed 17 October 2018. SN: 8265, http://doi.org/10.5255/UKDA-SN-8265-1

European Foundation for the Improvement of Living and Working Conditions. (2017). European Working Conditions Survey, 2015. [data collection]. 4th Edition. UK Data Service. SN: 8098, http://doi.org/10.5255/UKDA-SN-8098-4

European Foundation for the Improvement of Living and Working Conditions, (2017). European Working Conditions Survey, 2015. 4th Edition. UK Data Service. [data collection]. http://doi.org/10.5255/UKDA-SN-8098-4
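
If you cite many datasets, the DataCite pattern quoted above (Creator (PublicationYear). Title. Version. Publisher. ResourceType. Identifier) can be assembled programmatically. The following is a minimal sketch, not an official DataCite tool: the function name and its parameters are made up for this example, and the punctuation rules are an assumption. Always check the result against the citation style required by your publisher or repository.

```python
def format_citation(creator, year, title, publisher, identifier,
                    version=None, resource_type=None):
    """Build a citation following the DataCite pattern.

    Version and ResourceType are optional and are left out when not given.
    """
    optional_order = [title, version, publisher, resource_type]
    # Drop missing fields and trailing full stops so they are not doubled.
    parts = [str(field).strip().rstrip(".") for field in optional_order if field]
    return ". ".join([f"{creator} ({year})"] + parts) + ". " + identifier


if __name__ == "__main__":
    print(format_citation(
        creator="Poguntke, T., Scarrow, S., Webb, P.",
        year=2017,
        title="Political Party Database, 2011-2014",
        publisher="UK Data Service",
        identifier="http://doi.org/10.5255/UKDA-SN-8265-1",
    ))
```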

Have a look at the video (ICPSR, 2018) to learn about the benefits of data citation.

More about data citation can be found in Chapter 6 or, e.g., in the IASSIST Quick Guide to Data Citation (IASSIST, 2012).