Data Protection in Data Collections

Data protection is of utmost importance. This section provides information and best practices on how to avoid violations of personal data protection rules.

Introduction

Anonymity and privacy protection always have to be considered in research and they are of special importance for data providers in politically sensitive regions such as parts of the former Soviet Union. Various actors, including among others authoritarian governments or militant activists, increasingly use online sources to identify and possibly harass people with different views or identities. As Roberts (2013: 344[i]) points out, ‘there can be no limit on the provision of anonymity and care in handling data; even in cases when the respondent does not ask for that provision’.

Where human beings are the subjects of research in the social sciences and humanities, quantitative as well as qualitative data are mainly gathered in two forms. They can either be collected from participants who have been recruited (or encountered) by the researcher or be taken from sources where they have been produced authentically (i.e., independently of the research project). An example of the latter is the large amount of data created directly by users, e.g., in online and social media, which is called ‘big data’.[1]

Humans as conscious participants in research

If the people included in the study (i.e., respondents, participants or data providers) are recruited by the researcher, ethics guidelines regularly demand ‘informed consent’. Participants should be informed about the purpose of the study and the specific use of the generated data. They should also be told that their participation is voluntary and that they can withdraw from the study at any time. In a final step, data providers should also be informed about the research results (Lüders, 2015: 79[ii]; Fossheim & Ingierd, 2015: 11[iii]). Consequently, Elgesem (2015[iv]) argues that research should not entail a risk of harm or discomfort for the data provider. Discuss Data offers best practices for informed consent. In addition, the DARIAH ELDAH Consent Form Wizard is a convenient tool for designing a GDPR-compliant informed consent template.

In the case of quantitative research, such as representative opinion polls, the data uploaded to the Discuss Data platform can easily be fully anonymized (for concerns related to anonymizing big data, see the section below).

Whereas quantitative research usually has no need to identify individual data providers, qualitative research is often based on rather detailed profiles of individual data providers. It is already a challenge to publish the research results anonymously ‘if one wishes to publish direct quotes, as these will be searchable on the Internet. It is also important to note that pseudonyms or nicknames may be identifiable because they may be used in various contexts online and hence function as a digital identity’ (Utaaker Segadal, 2015: 43[v]).

Accordingly, the Discuss Data platform will assess the anonymization or pseudonymization of personal data in Data Collections uploaded to the platform on a case-by-case basis to ensure the highest level of privacy and personal data protection. Every published Data Collection has to go through a thorough evaluation process by a Curator, who ensures that technical as well as legal requirements are fulfilled.

Humans as unconscious providers of data

The situation is different if the researcher is not in direct contact with the ‘data provider’, as in the case of online data. Social media might, for instance, make compromises regarding the privacy of data and the anonymity of data providers or users that do not meet the standards of ‘informed consent’. The ubiquity of publicly available social media data creates enormous possibilities for privacy violations. It often seems impossible (or at least impractical) to obtain the informed consent of data providers. Because of their large number and the lack of contact details, they also cannot be informed about the research results.

McKee and Porter (2009: 88[vi]) identify four factors that affect the need to obtain consent when research using online data is conducted:

  • the degree of accessibility in the public sphere (public versus private),
  • the sensitivity of the information,
  • the degree of interaction between the researcher and the research subjects, and
  • the vulnerability of the research subjects.

Based on the first criterion, many scholars have proposed that data should not be used without consent if ‘the people being studied do not have an expectation that the information will be used in research’ (Elgesem, 2015: 23[vii], emphasis in the original). However, ‘assessing the acknowledged publicity of an online venue is not always straightforward, at least not as seen from the point of view of the participants. A personal blog might be publicly available for all to read, though very often it can be regarded as a personal and private space by the author’ (Lüders, 2015: 80[viii]).

An additional aspect is the legal regulation in the country where the researcher is based. In the European Union (and Norway), research activities are not considered incompatible with the original purpose of data generation, as science is afforded a special position in the respective legal framework. ‘This provision might be seen as a fundamental principle guaranteeing further use of data for research purposes regardless of the original reason for their production. This leaves open the possibility to conduct research on information obtained online without consent’ (Utaaker Segadal, 2015: 42[ix]).

Privacy protection is the overarching goal and is also required by law in the European Union. A widely used method in quantitative research for protecting the privacy of individual research subjects is the anonymization of data. Anonymization can entail one or several of these techniques (Albright, 2011: 779[x]):

  • (micro)-aggregation (e.g., unspecified gender or age),
  • alteration of the data,
  • suppression of certain variables (that might identify the data provider),
  • data swap (data from one research subject are ascribed to another research subject and vice versa),
  • random noise (to distort the original data to some degree).
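
The techniques listed above can be illustrated with a minimal Python sketch. It is not the procedure used by Discuss Data or described by Albright; the toy records, the field names (name, age, city, income) and all function names are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy records; every field name here is an assumption.
records = [
    {"name": "P1", "age": 23, "city": "North", "income": 1200},
    {"name": "P2", "age": 37, "city": "South", "income": 2500},
    {"name": "P3", "age": 41, "city": "East", "income": 1800},
    {"name": "P4", "age": 58, "city": "West", "income": 3100},
]


def suppress(rows, fields):
    """Suppression: drop variables that might identify the data provider."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


def aggregate_age(rows, width=10):
    """Aggregation: replace the exact age with an age band such as '20-29'."""
    out = []
    for row in rows:
        low = (row["age"] // width) * width
        out.append({**row, "age": f"{low}-{low + width - 1}"})
    return out


def swap(rows, field, rng):
    """Data swap: permute one variable's values across research subjects."""
    values = [row[field] for row in rows]
    rng.shuffle(values)
    return [{**row, field: value} for row, value in zip(rows, values)]


def add_noise(rows, field, scale, rng):
    """Random noise: shift a numeric variable by a bounded random offset."""
    return [{**row, field: row[field] + rng.uniform(-scale, scale)} for row in rows]


def anonymize(rows, seed=0):
    """Apply the four techniques in sequence; the input rows are not modified."""
    rng = random.Random(seed)
    rows = suppress(rows, {"name"})
    rows = aggregate_age(rows)
    rows = swap(rows, "city", rng)
    rows = add_noise(rows, "income", 100, rng)
    return rows


anonymized = anonymize(records)
```

Which techniques are appropriate, and how strongly the data may be distorted before they lose analytical value, depends on the individual dataset and has to be decided case by case.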

However, Aiden and Michel (2014[xi]) claim that big data necessarily cast ‘big shadows’. A shadow is a projection of the real object, a ‘visual transformation that preserves some aspects of the original object while filtering out others’ (Aiden & Michel, 2014: 60). It has been shown that the anonymization of quantitative datasets can be broken, revealing personal and sensitive information about individual data providers.
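
One common way to gauge this risk is to check how many records share the same combination of indirectly identifying variables. This idea is usually discussed under the name k-anonymity. The following Python sketch is offered only as an illustration under stated assumptions: the check is not taken from the sources cited here, and the sample rows, field names and function names are hypothetical.

```python
from collections import Counter

# Hypothetical rows that have already been stripped of direct identifiers.
sample = [
    {"age_band": "20-29", "region": "North", "answer": "yes"},
    {"age_band": "20-29", "region": "North", "answer": "no"},
    {"age_band": "30-39", "region": "South", "answer": "yes"},
]


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group; a value of 1 means some row is unique."""
    return min(group_sizes(rows, quasi_identifiers).values())


def unique_rows(rows, quasi_identifiers):
    """Rows whose quasi-identifier combination occurs exactly once."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] == 1
    ]


at_risk = unique_rows(sample, ["age_band", "region"])
```

A row that is unique on such a combination could be matched against another source containing the same variables, so it may need further aggregation or suppression before publication.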


[1] Big data refer to extremely large sets of semi- or unstructured digital data on social transactions that are, deliberately or passively and in various shapes and forms, generated in our daily interactions with technology. These digital traces constitute enormous datasets available to others that may be analysed to reveal patterns, trends, and associations, especially relating to human behaviour and interactions (Steen-Johnsen & Enjolras, 2015: 122; Prabhu, 2015: 158).


[i] Roberts, S. P. (2013). Research in challenging environments: The case of Russia’s ‘managed democracy’. Qualitative Research, 13(3), 337–351.

[ii] Lüders, M. (2015). Researching social media: Confidentiality, anonymity and reconstructing online practices. In H. Fossheim, & H. Ingierd (Eds.), Internet research ethics (pp. 77–97). Oslo: Cappelen Damm Akademisk.

[iii] Fossheim, H., & Ingierd, H. (2015). Introductory remarks. In H. Fossheim, & H. Ingierd (Eds.), Internet research ethics (pp. 9–13). Oslo: Cappelen Damm Akademisk.

[iv] Elgesem, D. (2015). Consent and information: Ethical considerations when conducting research on social media. In H. Fossheim, & H. Ingierd (Eds.), Internet research ethics (pp. 14–34). Oslo: Cappelen Damm Akademisk.

[v] Utaaker Segadal, K. (2015). Possibilities and limitations of Internet research: A legal framework. In H. Fossheim, & H. Ingierd (Eds.), Internet research ethics (pp. 35–47). Oslo: Cappelen Damm Akademisk.

[vi] McKee, H. A., & Porter, J. E. (2009). The ethics of Internet research: A rhetorical, case-based process. New York: Peter Lang.

[vii] Elgesem, D. (2015). Consent and information: Ethical considerations when conducting research on social media. In H. Fossheim, & H. Ingierd (Eds.), Internet research ethics (pp. 14–34). Oslo: Cappelen Damm Akademisk.

[viii] Lüders, M. (2015). Researching social media: Confidentiality, anonymity and reconstructing online practices. In H. Fossheim, & H. Ingierd (Eds.), Internet research ethics (pp. 77–97). Oslo: Cappelen Damm Akademisk.

[ix] Utaaker Segadal, K. (2015). Possibilities and limitations of Internet research: A legal framework. In H. Fossheim, & H. Ingierd (Eds.), Internet research ethics (pp. 35–47). Oslo: Cappelen Damm Akademisk.

[x] Albright, J. J. (2011). Privacy protection in social science research: Possibilities and impossibilities. PS: Political Science & Politics, 44(4), 777–782.

[xi] Aiden, E., & Michel, J.-B. (2014). Uncharted: Big data as a lens on human culture. New York: Riverhead Books.