Resources for social media data

Social media data come from various platforms, such as Facebook, Twitter, Reddit, Instagram or YouTube. The elements of social media data may be:

  • individual posts, such as tweets or comments on Facebook, Twitter or Reddit,
  • visual content, such as photos or videos,
  • connections between users (friend connections, groups),
  • data on ratings and/or interests (preferences or likes).

Social media data are available to researchers, but their availability is restricted by the companies that own the respective social media platforms (Facebook, Twitter, etc.). This restricted availability is a serious obstacle to more intensive use of social media data in social research.

There are several reasons for the limited availability of social media data. One of them is legal and concerns the copyright in social media content. Users hold the copyright in their own content (e.g. tweets or Facebook posts), and by accepting the terms of use they give the social media platform a license to use that content for various purposes. The terms of use restrict the use of social media data by third parties (private companies, academic researchers, etc.). This constrains researchers (and data archives) in using, storing and sharing the data. A good source of guidance on social media data preservation, both for researchers and for repositories, is Thomson, S.D. (2016) "Preserving Social Media".

Another reason for the limited availability of social media data is ethical. Researchers and data archivists must protect the personal information of social media users.

Social media data can be obtained through the application programming interfaces (APIs) of the social media platforms. However, these APIs usually restrict the type and amount of data that can be collected. Researchers who request large amounts of data through an API may receive only a sample rather than the complete data, and it is often not fully transparent how that sample was drawn.
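The effect of such limits can be illustrated with a minimal sketch. It does not call any real platform: make_fake_api is a hypothetical stand-in for a cursor-based API that serves only a capped number of items, and collect, Page and CollectionLog are illustrative names, not part of any library. The point is simply to record how much was requested, how much was actually returned, and why collection stopped, so that an incomplete data set is documented as such.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    items: list
    next_cursor: Optional[int]  # None when the API offers no further pages


@dataclass
class CollectionLog:
    requested: int
    received: int = 0
    pages: int = 0
    stopped_because: str = ""

    @property
    def complete(self) -> bool:
        return self.received >= self.requested


def collect(fetch_page: Callable[[Optional[int]], Page], wanted: int, max_pages: int):
    """Page through an API until `wanted` items are held, the API runs dry,
    or the page budget (a stand-in for a rate limit) is used up."""
    log = CollectionLog(requested=wanted)
    items = []
    cursor = None
    while True:
        if log.pages >= max_pages:
            log.stopped_because = "page budget exhausted"
            break
        page = fetch_page(cursor)
        log.pages += 1
        items.extend(page.items)
        if len(items) >= wanted:
            items = items[:wanted]
            log.stopped_because = "target reached"
            break
        if page.next_cursor is None:
            log.stopped_because = "API returned no further pages"
            break
        cursor = page.next_cursor
    log.received = len(items)
    return items, log


def make_fake_api(total_available: int, page_size: int):
    """Hypothetical API: exposes at most `total_available` items, however many are asked for."""
    def fetch_page(cursor: Optional[int]) -> Page:
        start = cursor or 0
        end = min(start + page_size, total_available)
        items = [{"id": i} for i in range(start, end)]
        next_cursor = end if end < total_available else None
        return Page(items, next_cursor)
    return fetch_page


if __name__ == "__main__":
    api = make_fake_api(total_available=250, page_size=100)
    posts, log = collect(api, wanted=1000, max_pages=10)
    print(log)  # shows requested vs. received and the reason collection stopped
```

Keeping such a log alongside the collected data makes it possible to state later that the data set is a partial sample, even when the platform does not disclose how the sample was drawn.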

For those who are not able to use APIs to download the data, there are commercial providers that sell social media data, such as Gnip (acquired by Twitter Inc. in 2014) or DataSift, but their services are usually expensive.

According to the results of a survey carried out among European social science data archives for the SERISS project in June 2019, only two CESSDA archives store and disseminate social media data so far: GESIS and the UK Data Service (UKDS) offer their users a limited collection of social media data, including Facebook data, geo-coded Twitter data, and specific subsets of Wikipedia. In particular, UKDS holds several Twitter data sets (20 collections of Twitter communication: tweet IDs, timestamps, hashtags).

Currently, several CESSDA archives are planning strategies to overcome the legal and technical issues related to archiving and sharing social media data, which they see as an important area.

Zenodo, Harvard Dataverse and Figshare hold a limited but increasing number of social media datasets. These repositories obtain data through self-archiving, i.e. without the archive checking the quality of the data and metadata.

Several projects and institutions ingest and store social media data on various topics. Some of them are: