LibGuides: Data management in thesis: Collection of personal data in a thesis

Are you collecting personal data?

Thesis and data protection

The EU General Data Protection Regulation (GDPR) governs the processing of personal data in theses as well.

The research data itself may contain personal data, but personal data can also be present in the documents necessary for data collection, such as consent forms from participants. It is important to note that even an anonymous survey may generate personal data if the survey is conducted via an online form that records, for instance, the respondent's IP address or if the survey includes open-ended response options.

The student is responsible for ensuring data protection while carrying out the thesis work. It is the thesis advisor's duty to advise the student on data protection matters.

Data anonymization

Data anonymization means processing data so that it no longer contains any identifying information. For personal data, this means that individuals can no longer be identified from the data by reasonable means. Information about organizations and other confidential data can also be anonymized.

Even if you do not directly collect the personal information of participants, it may still be possible to identify them from the data. For instance, an anonymous survey may not be truly anonymous if respondents can disclose information about themselves in open-ended responses, or if the survey form records the respondent's IP address (note that this does not occur if Webropol is used according to the guidelines). Such data is not anonymous and is subject to data protection laws.

Techniques for anonymization include:

removal of individual data: specific information can be marked in the data as [data removed].
reclassification of data: for example, if you have collected exact ages or professions, you can replace them with age groups or professional categories.
fictitious names: if names appear in the data, you can replace them with fictitious names instead of removing them.
generalization: you can modify precise information to make it more general; for example, "AIDS" can be replaced with the term "disease," and "Haaga-Helia" can be replaced with "university of applied sciences."
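
The techniques listed above can be sketched in a few lines of Python. This is only a minimal illustration, not a complete anonymization tool: the age bands, the example record, the fictitious name and the function names are all assumptions made up for this sketch. Simple text replacement will miss misspellings, nicknames and contextual clues, so the result should still be read through manually.

```python
import re

# Hypothetical age bands and generalization rules, chosen only for illustration.
AGE_BANDS = [(0, 19, "under 20"), (20, 29, "20-29"), (30, 39, "30-39"), (40, 200, "40 or over")]
GENERALIZATIONS = {"Haaga-Helia": "university of applied sciences"}


def age_group(age):
    """Reclassification: replace an exact age with an age group."""
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "[data removed]"


def anonymize_text(text, pseudonyms, remove):
    """Apply fictitious names, generalization and removal to free text."""
    # Fictitious names: swap each real name for an invented one.
    for real, fake in pseudonyms.items():
        text = re.sub(rf"\b{re.escape(real)}\b", fake, text)
    # Generalization: replace precise terms with broader ones.
    for specific, general in GENERALIZATIONS.items():
        text = re.sub(rf"\b{re.escape(specific)}\b", general, text)
    # Removal: mark deleted details explicitly.
    for item in remove:
        text = re.sub(rf"\b{re.escape(item)}\b", "[data removed]", text)
    return text


# Hypothetical interview record.
record = {"name": "Maija", "age": 34, "answer": "Maija studies at Haaga-Helia and lives in Porvoo."}

anonymized = {
    "age_group": age_group(record["age"]),
    "answer": anonymize_text(record["answer"], {"Maija": "Anna"}, ["Porvoo"]),
}
print(anonymized)
```

The anonymized record keeps only the age group and the edited answer; the original name and exact age are not carried over.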

Anonymization and Personal Data (Finnish Social Science Data Archive)

Guidelines for Data Anonymization from the Data Archive.

The Finnish Social Science Data Archive's guideline includes, for example, instructions for anonymizing both quantitative and qualitative research data.

What is personal data?

Personal data encompasses any information that can be used to identify an individual, either directly or indirectly. Research data may also include identifying information about people close to a study participant or about other third parties. Information that can identify them is also considered personal data.

Direct identifiers include items such as a person's full name, personal identification number, and various biometric identifiers such as fingerprints, facial images, voice samples, and handwritten signatures.

Strong indirect identifiers are individual pieces of information that can be used to identify a person with reasonable ease. Examples include an address, a phone number, an uncommon job title, a rare medical condition, and unique identifiers such as an IP address, student ID, or bank account number.

Indirect identifiers are any data points that, when combined, can lead to the identification of an individual. These may include gender, age, place of residence, job title, household composition, income, marital status, language, nationality, ethnic background, workplace, or educational institution. When the target population of a study is already relatively small and well-defined, combining indirect background information can make it reasonably easy to identify an individual.

Source: Data Management Guideline: Anonymization and Personal Data (Finnish Social Science Data Archive)
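
One rough way to see how combined indirect identifiers can single someone out is to count how many records share each combination of values. The sketch below assumes hypothetical survey rows, field names and a threshold of two records; none of these come from the guideline, and a low count is only a warning sign, not a formal test of anonymity.

```python
from collections import Counter

# Hypothetical survey rows; the field names are assumptions for illustration.
rows = [
    {"gender": "female", "age_group": "20-29", "job_title": "nurse"},
    {"gender": "female", "age_group": "20-29", "job_title": "nurse"},
    {"gender": "male", "age_group": "30-39", "job_title": "nurse"},
    {"gender": "male", "age_group": "50-59", "job_title": "head of unit"},
]
INDIRECT_IDENTIFIERS = ("gender", "age_group", "job_title")


def rare_combinations(rows, fields, min_count=2):
    """Return combinations of indirect identifiers shared by fewer than min_count rows."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return {combo: n for combo, n in counts.items() if n < min_count}


risky = rare_combinations(rows, INDIRECT_IDENTIFIERS)
for combo, n in risky.items():
    print(combo, "appears", n, "time(s)")
```

Combinations that appear only once are candidates for further generalization or removal, especially when the target population is small.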

Sensitive personal data

Sensitive personal data refers to the special categories of personal data defined in data protection legislation such as the GDPR. These categories reveal critical aspects of an individual's identity, including:

Ethnic origin
Political opinions
Religious or philosophical beliefs
Trade union membership
Health information
Sexual orientation or behavior
Genetic and biometric data.

Sensitive personal data must be protected with heightened security measures due to the potential risks to an individual's fundamental rights that may arise from their processing. Consequently, the processing of such data is generally prohibited. However, there are exceptions to this prohibition, one of which includes the explicit consent of the individual regarding the processing of their sensitive personal data.

An ethical review must always be carried out in advance for research that involves:

Deviating from the principle of informed consent in participant involvement,
Interfering with the physical integrity of participants,
Targeting individuals under the age of 15 without separate consent from a guardian, or without informing the guardian in a way that would allow them to refuse the child's participation in the study,
Presenting participants with exceptionally strong stimuli,
Posing a risk of causing participants or their relatives psychological harm that exceeds the limits of normal daily life, or
Presenting a potential safety threat to participants, researchers, or their relatives during the conduct of the study.

Further information on the processing of sensitive personal data: Processing of special categories of personal data (Office of the Data Protection Ombudsman)

When a preliminary ethical review is required: Ethical review (Finnish National Board on Research Integrity TENK)

Minimization of personal data

The data minimization principle means not collecting unnecessary personal data. Follow this principle from the planning stage of your thesis research onward. Key considerations for applying it:

Collect only the personal data that is essential for answering the research questions.
Do not collect personal data "just in case."
Avoid collecting sensitive data.
In surveys, avoid open-ended response options, as you cannot control what respondents write in them.
In personal interviews, the interviewee can be asked to refrain from providing specific details, such as names or workplaces.
Consider how detailed the information needs to be: is a category or generalization sufficient instead of precise information? For example, instead of collecting an exact age, use an age range such as "20-29 years old," or record "university of applied sciences" instead of "Haaga-Helia University of Applied Sciences."
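
Ideally, unnecessary fields are never collected at all. Where a survey tool exports more than was asked for, one option is to keep only an explicit list of needed fields before storing the data. The sketch below assumes a hypothetical export with made-up column names and made-up research needs; it is an illustration of the idea, not a feature of any particular survey tool.

```python
# Hypothetical raw export from an online survey tool; column names are assumptions.
raw_responses = [
    {"ip_address": "192.0.2.10", "email": "a@example.com", "age": 24, "satisfaction": 4},
    {"ip_address": "192.0.2.11", "email": "b@example.com", "age": 31, "satisfaction": 5},
]

# Only the fields needed to answer the (hypothetical) research questions.
NEEDED_FIELDS = ("age", "satisfaction")


def minimize(row, needed=NEEDED_FIELDS):
    """Keep only the needed fields and replace the exact age with a ten-year range."""
    kept = {field: row[field] for field in needed}
    decade = (kept.pop("age") // 10) * 10
    kept["age_range"] = f"{decade}-{decade + 9}"
    return kept


minimized = [minimize(row) for row in raw_responses]
print(minimized)
```

Listing the fields to keep, rather than the fields to drop, means that any unexpected column in the export is discarded by default.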