An Exploratory Mixed-Methods Study on General Data Protection Regulation (GDPR) Compliance in Open-Source Software

Lucas Franke lfranke@vt.edu Virginia Tech Blacksburg, Virginia, USA

Huayu Liang huayu98@vt.edu Virginia Tech Blacksburg, Virginia, USA

Sahar Farzanehpour saharfarza@vt.edu Virginia Tech Blacksburg, Virginia, USA

Aaron Brantly abrantly@vt.edu Virginia Tech Blacksburg, Virginia, USA

James C. Davis davisjam@purdue.edu Purdue University West Lafayette, Indiana, USA

Chris Brown dcbrown@vt.edu Virginia Tech Blacksburg, Virginia, USA

ABSTRACT

Background: Governments worldwide are considering data privacy regulations. These laws, such as the European Union's General Data Protection Regulation (GDPR), require software developers to meet privacy-related requirements when interacting with users' data. Prior research describes the impact of such laws on software development, but only for commercial software. Although open-source software is commonly integrated into regulated software, and thus must be engineered or adapted for compliance, we do not know how such laws impact open-source software development.

Aims: To understand how data privacy laws affect open-source software development. We focused on the European Union's GDPR, as it is the most prominent such law. We specifically investigated how GDPR compliance activities influence OSS developer activity (RQ1), how OSS developers perceive fulfilling GDPR requirements (RQ2), the most challenging GDPR requirements to implement (RQ3), and how OSS developers assess GDPR compliance (RQ4).

Method: We distributed an online survey to explore perceptions of GDPR implementations from open-source developers (N=56). To augment this analysis, we further conducted a repository mining study to analyze development metrics on pull requests (N=31,462) submitted to open-source GitHub repositories.

Results: Our results suggest GDPR policies complicate open-source development processes and introduce challenges for developers, primarily regarding the management of users' data, implementation costs and time, and assessments of compliance. Moreover, we observed negative perceptions of GDPR from open-source developers and significant increases in development activity, in particular metrics related to coding and reviewing activity, on GitHub pull requests (PRs) related to GDPR compliance.

Conclusions: Our findings provide future research directions and implications for improving data privacy policies, motivating the need for policy-related resources and automated tools to support data privacy regulation implementation and compliance efforts in open-source software.

1 INTRODUCTION

Software products collect an increasing amount of data from users to enhance user experiences through personalized, machine learning-enabled [53] application behaviors [33] and marketing [79]. Such practices may benefit users, but also threaten their well-being. For example, in 2013, Facebook allowed the political research firm Cambridge Analytica to access data on ~87 million Facebook users [62]. Cambridge Analytica used this data to influence US elections [114, 115].

To protect their citizens, over 100 governments worldwide are developing data privacy regulations [105]. Their goal is to constrain how their citizens' personal data is collected, processed, stored, and saved. Some target specific industries, e.g., the United States' Health Insurance Portability and Accountability Act (HIPAA), which places requirements on healthcare organizations handling medical data [7]. Others cover personal data regardless of context, e.g., the European Union's General Data Protection Regulation (GDPR), which grants rights to EU citizens and affects entities that handle their data [12]. The penalties for non-compliance with data privacy laws and regulations may be severe [18, 46]. For example, under GDPR, corporations have been fined millions or billions of euros [80]. Most organizations store and manipulate this data electronically through software, and so ensuring the software is in legal compliance is an important software engineering task.

Data privacy regulations create challenging software requirements because they entail both technical and legal expertise. Software developers must implement required features, such as obtaining consent from users for data collection, to ensure their organization's products are compliant. However, developers may have limited legal knowledge [81, 109] and receive minimal training [21, 55]. This can lead to coarse solutions, such as exiting the affected market [88]: hundreds of websites simply banned all European users when GDPR went into effect [97, 103]. Researchers have explored the impact of data privacy regulations on businesses [72, 73, 88], users [22, 32, 68], and observable software product properties such as website cookies [67] and database performance [92]. However, there has been limited study of how such laws affect the software development process. The few existing studies have been of commercial software development [20, 29]; we lack knowledge of the effects of GDPR on open-source software (OSS) development.

The goal of this work is to describe the impact of data privacy regulation compliance on open-source software. Our study is the first on this topic (this paper is an extension of our preliminary work, presented as a poster [44]). We therefore adopt an exploratory methodology to provide an initial characterization and identify phenomena of interest for further study. Our study draws on two data sources collected in two phases. The first phase examined qualitative data on developers' experiences with GDPR implementations in OSS, collected via a survey (N=56). To further investigate the impact of GDPR in OSS, the second phase collected and analyzed developers' activities in open-source projects on GitHub, examining metrics and sentiments on 31,462 pull requests (PRs), divided evenly into 15,731 GDPR-related and 15,731 non-GDPR-related PRs.

Our results show GDPR compliance negatively impacts open-source development—incurring complaints from developers and significantly increasing coding and reviewing activities on PRs. In addition, despite the benefits of data privacy regulations for users, we find developers have mostly negative perceptions of the GDPR, reporting challenges with implementing and verifying policy compliance. We also find that interactions with legal experts hinder development processes, yet developers rarely consult with legal teams—often relying on ad hoc methods to verify GDPR compliance.

In sum, our contributions are:

  • We survey OSS developers to understand developers' experiences with GDPR compliance and challenges with implementing and assessing data privacy regulations.
  • We empirically analyze the impact of GDPR-related implementations on development activity metrics.
  • We use natural language processing (NLP) techniques to evaluate the perceptions of GDPR compliance through discussions on OSS repositories.

Significance: This work contributes an exploratory analysis on the impact of GDPR compliance on open-source software. It identifies interesting phenomena for further research—in particular opportunities to support policy implementation and verification. We also provide recommendations for policymakers and software developers to improve data privacy regulations and their implementation.

2 BACKGROUND

2.1 Software Regulatory Compliance

2.1.1 In General. Software requirements are divided into two categories: functional and non-functional [96]. Functional requirements pertain to input/output characteristics, i.e., the functions the software computes. Non-functional requirements cover everything else, such as resource constraints, deployment conditions, and development process. One major class of non-functional requirement is compliance with applicable standards and regulations. These requirements are typically developed and enforced on a per-industry basis in acknowledgment of that industry's risks and best practices [54].

Complying with standards and regulations has been part of software engineering work for many years. Some standards apply to any manufacturing process, e.g., the ISO 9001 quality standard [11]. Others are generic to software development (e.g., ISO/IEC/IEEE 90003 [10]). Still others are contextualized to the risk profile of the usage context, e.g., ISO 26262 [13] or IEC 61508 [9], which describe standards for safety-critical systems [54]; the US HIPAA law (Health Insurance Portability and Accountability Act), which describes privacy standards for handling medical data [7]; and the US FERPA law (Family Educational Rights and Privacy Act), which describes privacy standards for handling educational data [5]. Although these regulations are not new (e.g., FERPA dates to 1974, HIPAA to 1996, and IEC 61508 to 1998), software engineering teams still struggle to comply with them [34, 40, 43, 75].

2.1.2 In Open-Source Software. This study focuses on GDPR compliance in open-source software. The reader may be surprised that regulatory compliance is a factor in open-source software development, as open-source software licenses such as MIT [3], Apache [8], and GNU GPL [6] disclaim legal responsibility. For example, the MIT license, the most common license on GitHub [27], states “the software is provided as is, without warranty...[authors are not] liable for any claim, damages, or other liability”. However, users and developers of open-source software may desire regulatory compliance. We note three examples. (1) A majority of open-source software is developed for commercial use [47] and may require standards or regulatory compliance [108]. (2) Users with open-source software components in software supply chains [52, 83] may request compliance requirements such as web cookies. The developers may service these requests. (3) Users may extend open-source software themselves and undertake their own compliance analysis [99]. Standards such as IEC 61508 Part 3 include provisions for doing so [60].

Open-source software is no longer a minor player in commercial software engineering. Multiple estimates suggest that open-source components comprise the majority of many software applications [47, 82]. In a 2023 survey of 1700 codebases across 17 industries, Synopsys found open-source software in 96% of the codebases and reported an average contribution of 75% of the code in the codebase [101]. It is therefore important to understand how open-source software development considers non-functional requirements such as regulatory compliance.

2.2 Privacy Regulations, Especially GDPR

2.2.1 Consumer Privacy Laws. In §2.1 we discussed standards and regulatory requirements that affect software products based on industry. Recently a new kind of regulation has begun to affect software: consumer privacy laws. The most prominent example of such a law is the European Union's General Data Protection Regulation (EU GDPR), enacted in 2016 and enforceable beginning in 2018. Examples in the United States include the California Consumer Privacy Act (CCPA, enacted 2018) and the Virginia Consumer Data Protection Act (CDPA, enacted 2021). Similar legislation has been considered by >100 governments [59, 105].

2.2.2 The General Data Protection Regulation (GDPR). The General Data Protection Regulation (GDPR) [12] protects the personal data of European Union (EU) citizens, regardless of whether data collection and processing is based in the EU. The law has implications for entities that interact with the personal data of EU citizens, divided into data subjects, data controllers, and data processors [45]. Data subjects are individuals whose personal data is collected. Data controllers are any entities (organization, company, individual, or otherwise) that own, control, or are responsible for personal data. Data processors are entities that process data for data controllers. The GDPR grants data subjects rights to their personal data, providing guidelines and requirements to data controllers and processors to understand how to properly handle this data.

GDPR compliance is complex for software engineers and consequential for their organizations. Data controllers and processors commonly use software, e.g., a controller's mobile app transmits data to its backend service and processors subsequently access and update the database. Software teams must determine appropriate data policies, update their systems to comply, and validate them, e.g., incorporating cookie consent notices into websites to provide users with informed consent [106]. Anticipating a lengthy compliance process, the EU enacted the GDPR in 2016 but made it enforceable in 2018, allowing two years for corporations to prepare [1]. Companies in the US and UK alone invested $9 billion in GDPR compliance [110]. As of December 2022, many use manual compliance methods or are not compliant [14]. Non-compliance is costly: thousands of distinct fines have been imposed on non-compliant data controllers and processors, exceeding €2.5 billion [15].

Although GDPR compliance affects any software that processes the data of EU citizens, and open-source software components comprise the majority of many software applications that process such data [47, 82, 101], to the best of our knowledge there is no prior research on the impacts of GDPR compliance in open-source software.

3 METHODOLOGY

3.1 Data Availability and Research Questions

In §2 we described a range of privacy-related standards and regulations. We noted that there has been little study of the effect of these requirements on open-source software engineering practice. To address this gap, we need data. Table 1 estimates the availability of software engineering data associated with these requirements through two common metrics: the number of posts on Stack Overflow and the number of pull requests on GitHub.

Table 1: Number of Stack Overflow posts and GitHub pull requests per privacy law.

| Privacy Law (Year) | Stack Overflow | GitHub PRs |
|---|---|---|
| GDPR (2016) | 2058 | 64 K |
| HIPAA (1996) | 725 | 5 K |
| CCPA (2018) | 96 | 1 K |
| FERPA (1974) | 35 | 254 |
| CDPA (2021) | 7 | 19 |
| PIPEDA (2000) | 5 | 31 |

Based on this data, we scoped our study to the EU's GDPR, and to open-source software hosted on GitHub, currently the most popular hosting platform for OSS. We answer four research questions:

RQ1: How does GDPR compliance influence development activity on OSS projects?

RQ2: How do OSS developers perceive fulfilling GDPR requirements?

RQ3: What GDPR concepts do OSS developers find most challenging to implement?

RQ4: How do OSS developers assess GDPR compliance?

We analyzed data from quantitative and qualitative sources: surveying open-source developers and mining OSS repositories on GitHub. We present how we obtained and analyzed each data source next. We integrate this data in answering RQ1 and RQ2, and use the survey data alone to answer RQ3 and RQ4.

3.2 Data Source 1: Developer Survey

To explore the impact of implementing GDPR policies on OSS development, we distributed an online survey for open-source developers. This data informed our answers to all RQs. We used a four-step approach motivated by the framework analysis methodology [90] for policy research to collect and analyze data in the first phase of our study. An overview of this process is presented in Table 2. Our Institutional Review Board (IRB) provided oversight.

3.2.1 Step 1: Pilot Study and Data Familiarization. To formulate an initial thematic framework for our qualitative analysis, we conducted semi-structured pilot interviews with OSS developers (n = 3). As no prior work has explored the perceptions of GDPR compliance in OSS, pilot interviews gave us insight into developers' perceptions and experiences with implementing GDPR concepts in the context of open-source software development. Two subjects had contributed to PRs in our dataset, and the third was a personal contact. They had a wide range of open-source development experience, from < 1 year to > 20 years. Interviews were transcribed using Otter.ai and coded by two researchers to inform our survey.

Thematic analysis of our pilot interviews provided insight that informed our survey questions. The participants highlighted the challenges with implementing GDPR requirements in open-source software. One participant worked at a large corporation and outlined differences between GDPR compliance at their company and in OSS, namely with (1) approaches used to assess whether compliance is implemented correctly, and (2) access to legal teams. The other two participants discussed the impact of the GDPR, noting its privacy benefits as well as challenges OSS developers face implementing GDPR requirements and assessing compliance. These findings informed our survey.

3.2.2 Step 2: Survey Design. The survey consisted of open-ended and short answer questions seeking details about GDPR implementation and experiences in the context of open-source software development. We used the pilot study interview results to identify topics to focus on in the survey. Based on the interviews, we asked about the perceived impact of the GDPR on data privacy, the most difficult concepts to implement, and how they assess GDPR compliance. The survey instrument is in the supplemental material.

3.2.3 Step 3: Participant Recruitment. We distributed our survey in three rounds. In the first round, we emailed a sample of 98 developers who authored or commented on GDPR-related pull requests and had a publicly available email address. We received 5 responses, i.e., a 5% response rate. In the second round, we made broader calls for participation on Twitter and Reddit. We received 44 responses, 2 of which indicated no experience implementing GDPR compliance. All survey respondents in these rounds were entered in a drawing for two $100 Amazon gift cards. After a few months, we undertook the third round, redistributing our survey to an additional 235 GitHub users with GDPR implementation experience (authored GDPR-related pull requests in our dataset) and offered individual compensation ($10 gift card) to encourage participation. We received 9 responses (4% response rate). In total we have data from 56 survey participants (14 from direct GitHub contacts and 42 from Twitter and Reddit).

Table 2: Overview of sample questions from pilot interview study and survey design/analysis for framework analysis approach used for Data Source 1. The final column notes the inter-rater agreement score for these themes using the κ score, prior to reaching agreement.

| Interview Question | Codes | Survey Question | Codes | κ |
|---|---|---|---|---|
| What meaningful impact, if any, do you believe the GDPR has had on data security and privacy? | data privacy, rights to users, data collection | What impact, if any, do you believe the GDPR and similar data privacy regulations have had on data security and privacy? | data privacy, data processing, data collection, insufficient information, data breach, fines | 0.736 |
| What GDPR concepts do you find the most difficult or frustrating to implement? | None, data minimization, embedded content | What GDPR concepts do you find the most difficult or frustrating to implement? | privacy by design, data minimization, cost, data processing, user experience, data management, security risks, None, lawfulness and dispute resolution, time, right to erasure | 0.929 |
| Have you had to specifically seek out legal consultation on GDPR-related issues, and if so, how did that affect your development process? | Yes/No; no effect, negative effect (time) | Have you had to specifically seek out legal consultation on GDPR-related issues, and if so, how did that affect your development process? | Yes/No; N/A, no effect, positive effect, negative effect (cost, time, data storage, data processing,...) | 0.514 |
| During your software development projects, do you frequently consult with a legal team, and if so, how does this impact the development processes? If not, how did you assess GDPR compliance for your software projects? | Yes: legal consultation; No: privacy by design, data minimization | During your software development projects, have you consulted with a legal team? If not, how do you assess GDPR compliance for your software projects? | Yes: legal consultation; No: accountability system, online resources, self-assessment, data management, none; N/A | 0.668 |
| | | Has implementing GDPR concepts for compliance impacted your development process in any way? (yes/no/maybe) Please explain: | positive impact (logging, privacy by design), negative impact (cost, data management, security,...), no impact | 0.860 |

Our participants have a median of approximately 5 years of OSS development experience (avg = 5.9) and 6 years of general industry experience (avg = 7.7). Participants reported contributing to a variety of OSS projects such as Mozilla, WordPress, Fedora, Moodle, Ansible, Flask, Django, Kubernetes, PostgreSQL, OpenCV, GitLab, and Microsoft Cognitive Toolkit.

3.2.4 Step 4: Data Analysis. To analyze our survey results, we used an open coding approach. Two researchers independently performed a manual inspection of responses, highlighting keywords and categorizing responses based on the pre-defined themes derived from our pilot study. If new themes arose, the coders discussed and agreed upon adding the new theme. Then, both coders came together to merge their individual results. Finally, we used Cohen's kappa (κ) to calculate inter-rater agreement (see Table 2).
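
As an illustration of the agreement measure, the following is a minimal Python sketch of Cohen's kappa for two coders. It assumes each response receives exactly one code per coder, which is a simplification (survey responses may carry several codes). The function name `cohens_kappa` and the example labels are hypothetical and are not taken from the study's data or scripts.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who each assign one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for six survey responses (not the study's data).
coder_1 = ["cost", "time", "cost", "none", "time", "cost"]
coder_2 = ["cost", "time", "time", "none", "time", "cost"]
kappa = cohens_kappa(coder_1, coder_2)
```

Here the coders agree on five of six responses, so the observed agreement is 5/6, the chance agreement is 13/36, and kappa is 17/23 (about 0.74).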

3.3 Data Source 2: GDPR PRs on GitHub

We collected data concerning GDPR compliance by analyzing pull requests on GitHub repositories. Pull requests are a mechanism on GitHub that allow developers to collaborate on open-source repositories, involving code contributions from developers to be reviewed and merged into the source code [48].

3.3.1 GDPR and non-GDPR PRs. We used the GitHub REST API to search for GDPR-related pull requests, i.e., pull requests returned by the GitHub API's default search with the query string “GDPR”. Manual inspection suggested the results are typically English-language PRs related to (GDPR) data privacy regulatory compliance.

Using this method, we collected GDPR-related PRs created from April 2016 (when the GDPR was adopted by the European Parliament) to January 2024. We removed content submitted by users with “bot” in their username [16] and designated as a bot type according to the GitHub API (see footnote 3) to avoid PRs generated by automated systems. This resulted in 15,731 GDPR-related pull requests across 6,513 unique GitHub repositories. For comparison, we also collected a random sample of 15,731 pull requests created in these same repositories after April 2016 that did not mention “GDPR”, which we call non-GDPR-related pull requests. The studied repositories had a median of 14 stars (avg = 1,635), 11 forks (avg = 416), 727 commits (avg = 8,997), 172 PRs (avg = 1,425), and 15 contributors (avg = 59), suggesting popular, active repositories. The distribution of PRs across all repositories in our GDPR-related and non-GDPR-related datasets is summarized in Table 3.
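
The filtering described above could be sketched as follows. This is an illustrative sketch, not the study's script. It assumes PR records shaped like GitHub REST API responses (a `created_at` ISO 8601 timestamp and a `user` object with `login` and `type` fields), and it assumes an author counts as a bot when either signal is present; the text does not state whether both were required. The helper names `is_bot` and `keep_pr` and the sample records are hypothetical.

```python
from datetime import datetime, timezone

# Collection window from the text: April 2016 through January 2024.
WINDOW_START = datetime(2016, 4, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 2, 1, tzinfo=timezone.utc)  # exclusive upper bound

def is_bot(author):
    """Assumption: bot if the login contains 'bot' or the API type is 'Bot'."""
    return "bot" in author["login"].lower() or author.get("type") == "Bot"

def keep_pr(pr):
    """Keep PRs created inside the window whose author is not a bot."""
    created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    return WINDOW_START <= created < WINDOW_END and not is_bot(pr["user"])

# Hypothetical records for illustration only.
sample = [
    {"created_at": "2018-05-25T09:00:00Z",
     "user": {"login": "alice", "type": "User"}},
    {"created_at": "2018-05-25T09:00:00Z",
     "user": {"login": "dependabot[bot]", "type": "Bot"}},
    {"created_at": "2015-12-31T23:59:59Z",
     "user": {"login": "carol", "type": "User"}},
]
kept = [pr for pr in sample if keep_pr(pr)]
```

Note that a plain substring check also flags human logins that happen to contain "bot", so a stricter rule may be preferable in practice.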

3.3.2 Measuring Development Activity. To analyze GDPR's impacts, we collected development activity metrics [49] per pull request:

  • Comments: the total number of comments
  • Active time: the amount of time the PR remained active (until merged or closed)
  • Commits: the total number of commits
  • Additions: the number of lines of code added
  • Deletions: the number of lines of code removed
  • Changed files: the total number of modified files
  • Status: outcome of PR (merged, closed, or open)

Table 3: Distribution of PRs in Datasets.

| Dataset | min | 50%ile | 75%ile | 90%ile | max |
|---|---|---|---|---|---|
| GDPR | 1 | 1 | 2 | 3 | 956 |
| non-GDPR | 1 | 2 | 10 | 34 | 203 |

Footnote 3: https://docs.github.com/en/graphql/reference/objects#bot

We selected these metrics to analyze development activity, specifically to derive coding and code review tasks from pull requests. We compared the distributions of these metrics between GDPR-related and non-GDPR-related PRs using a Mann-Whitney U test, to compare nonparametric ordinal data between the datasets [76]. To control for multiple comparisons on the same dataset, we calculate adjusted p-values using Benjamini-Hochberg correction [30]. We measure effect size (r) for significant results using Cohen's d [39].
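
The statistical comparison can be illustrated with a small, dependency-free Python sketch. It is a sketch under stated assumptions rather than the study's analysis code: the Mann-Whitney U test uses the tie-corrected normal approximation without continuity correction, and the effect size is computed as r = |Z| / sqrt(N), one common convention that the text does not confirm. The function names and example values are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U for sample x, p-value, effect size r = |Z| / sqrt(N)).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # mid-rank shared by tied values
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        t = end - start + 1
        tie_term += t ** 3 - t
        start = end + 1
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0, 0.0  # every value is identical
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u1, p, abs(z) / math.sqrt(n)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-PR values for two metrics (not the study's data).
gdpr = {"additions": [57, 120, 33, 400, 80], "commits": [2, 3, 1, 5, 2]}
other = {"additions": [19, 5, 40, 12, 7], "commits": [1, 1, 2, 1, 3]}
results = {m: mann_whitney_u(gdpr[m], other[m]) for m in gdpr}
adjusted = benjamini_hochberg([results[m][1] for m in gdpr])
```

For samples as small as these, an exact test would be more appropriate; the normal approximation is reasonable at the scale of the datasets described above.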

3.3.3 Measuring Developer Perception

To augment our survey results, we applied sentiment analysis, a technique to automatically infer sentiment from natural language, to the title, body, commit messages, review comments, and discussion comments from pull requests in our datasets to examine developer perceptions of GDPR compliance. Prior studies have similarly inferred developer sentiment and emotion from GitHub activity, including PR discussion comments [87], review comments [57], commit messages [50], and bodies [84]. While this technique sometimes has negative results in software engineering contexts [64], we use it in our exploratory work as a proxy to obtain preliminary insights into developers' sentiments regarding GDPR compliance in OSS.

We followed standard NLP preprocessing steps [69]: (1) We removed bot-generated content using the process described in Section 3.3.1. (2) We removed non-sentiment material: hyperlinks and mentions (“@username”). (3) We tokenized text using the Natural Language Toolkit (NLTK) tokenize library. (4) We converted tokens to lowercase and removed punctuation. (5) We removed stopwords such as “but” and “or” (nltk.corpus library). (6) We lemmatized the text, i.e., reduced words to their base form (e.g., “mice” becomes “mouse” [23]) using WordNetLemmatizer from the nltk.stem library. (7) We normalized the data by removing meaningless tokens, such as SHA or hash values for commits, and non-standard English words, such as words that contain numerical values (e.g., “3d”) [98].
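
A simplified, standard-library-only sketch of steps (2) to (5) and (7) is shown below. It is not the study's pipeline: whitespace splitting stands in for the NLTK tokenizer, a tiny hand-written stop-word set stands in for the NLTK stop-word corpus, and lemmatization (step 6) is omitted. The function name `preprocess`, the regular expressions, and the example text are assumptions made for illustration.

```python
import re
import string

# Tiny illustrative stop-word set; the study used the NLTK stop-word corpus.
STOPWORDS = {"a", "an", "and", "but", "or", "the", "to", "is", "for", "of", "in"}
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[A-Za-z0-9-]+")
HEX_RE = re.compile(r"[0-9a-f]{7,40}")  # short or full commit hashes

def preprocess(text):
    """Return cleaned tokens for one PR text field."""
    text = URL_RE.sub(" ", text)      # step 2: drop hyperlinks
    text = MENTION_RE.sub(" ", text)  # step 2: drop @mentions
    tokens = text.lower().split()     # steps 3-4: whitespace tokens, lowercase
    cleaned = []
    for token in tokens:
        token = token.strip(string.punctuation)  # step 4: trim punctuation
        if not token or token in STOPWORDS:      # step 5: remove stop words
            continue
        if HEX_RE.fullmatch(token):              # step 7: hash-like tokens
            continue                             # (may drop rare hex-only words)
        if any(ch.isdigit() for ch in token):    # step 7: tokens with digits
            continue
        cleaned.append(token)
    return cleaned

example = ("Fixes GDPR consent banner, see https://example.com/pr/12 "
           "@alice commit 3f2a9bc and the 3d view!")
tokens = preprocess(example)
# tokens == ["fixes", "gdpr", "consent", "banner", "see", "commit", "view"]
```

Stripping punctuation only at token edges keeps hyphenated words such as "opt-in" intact, which is a design choice of this sketch rather than something the text specifies.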

After preprocessing the data, we were left with 15,731 titles, 14,515 bodies, 15,217 commit messages, 4,922 review comments, and 4,862 discussion comments across the GDPR-related pull requests. We compared these against non-GDPR-related PRs, for which we had 15,731 titles, 13,718 bodies, 15,652 commit messages, 3,427 review comments, and 3,165 discussion comments.

To perform sentiment analysis, we use three state-of-the-art models: Liu-Hu [56], VADER [58], and SentiArt [63]. We fed the preprocessed textual data to each model, which provided compound sentiment scores. We use a t-test (t) to statistically analyze sentiment across our datasets. Moreover, we aim to assess the impact of the GDPR on developer sentiment over time. To accomplish this, we divided the GDPR and non-GDPR PRs into 3-month segments based on the creation date of the PR. Then, we performed sentiment analysis on the binned data to observe whether and how developer sentiments manifest in OSS interactions over the lifecycle of the GDPR regulation — from its initial adoption in 2016, enforcement in 2018, and to the present. We combined all preprocessed textual elements (title, body, commit messages, review comments, and discussion comments) to observe the overall trends in PR communications and compare with non-GDPR data as a baseline sentiment in developer communications for the projects studied.
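
The 3-month segmentation can be sketched as follows, assuming the compound sentiment scores have already been computed by one of the models. The sketch assumes the segments align with calendar quarters, which the text does not state, and the function names and example records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def quarter_key(created_at):
    """Map an ISO 8601 timestamp to a (year, quarter) bin, e.g. (2018, 2)."""
    ts = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return ts.year, (ts.month - 1) // 3 + 1

def mean_sentiment_by_quarter(records):
    """Average the compound score of all records in each 3-month bin."""
    bins = defaultdict(list)
    for created_at, score in records:
        bins[quarter_key(created_at)].append(score)
    return {key: mean(scores) for key, scores in sorted(bins.items())}

# Hypothetical (creation date, compound score) pairs, not the study's data.
records = [
    ("2018-05-25T10:00:00Z", -0.4),
    ("2018-06-01T08:30:00Z", 0.2),
    ("2018-07-15T12:00:00Z", 0.5),
]
trend = mean_sentiment_by_quarter(records)
```

Running the same function over the GDPR and non-GDPR records separately yields two series that can be compared bin by bin, with the non-GDPR series serving as the baseline.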

4 RESULTS

We are interested in understanding the impact of GDPR implementations on open-source software by analyzing development activity and developer perceptions, including challenges with implementation and assessment of compliance. In this work, we answer our research questions using multiple sources—analyzing GitHub repositories and surveying open-source developers. For RQ1 and RQ2, we report views from the survey and the GitHub measurements. For RQ3 and RQ4, we use data only from the survey.

4.1 RQ1: Development Activity

This question was: RQ1: How does GDPR compliance influence development activity on OSS projects?

4.1.1 Survey

We surveyed 56 OSS developers to understand the impact of GDPR implementations on development activity. Most participants (n = 41, 73%) responded “Yes” to a question regarding the impact of implementing GDPR concepts on development processes, indicating data privacy compliance affects open-source development. When asked to elaborate, 23 developers provided examples of development impacts related to the GDPR.

Data Management: 11 participants mentioned GDPR requirements related to data management impact development activity, notably increasing development efforts. For instance, responses indicated handling personal data (P17) and anonymization (P19), managing data controllers (P21) and data recipients (P23), implementing functionality to limit the collection of personal data (P26), and the monitoring of data subjects from the EU (P28) all impacted development processes. P53 also added “we had to separate in a clear way sensitive data from the other data”, exemplifying the effort needed to implement compliant data processing in OSS.

Time and Costs: Five participants mentioned GDPR compliance increases development time and costs in OSS. For example, regarding time, respondents said “it does slow down our development cycle” (P54) and “we lost a complete year to be ready” (P56). For costs, participants said “budgets have soared” (P5) and “costs of production should not go over the cost of consequence of data breach” (P46).

Design: Three participants also noted the effects of GDPR compliance on the design and structure of software products. For example, P54 responded “we have to check whether we comply with GDPR every time we draft a new design” and P55 added “the design of systems now incorporates the concept of needing to remove PII after the fact”. P21 explained how GDPR compliance reduced the quality of their application's design, replying “the principle of minimum scope was not observed”, which points to variables in the code whose scopes may be unnecessarily extended [36].

Organization: Three participant responses embodied the negative effects of data privacy regulations on their organization, stating the GDPR has a “major impact” requiring “an overhaul of project management and program priorities” (P1). P45 highlighted that “making sure to follow privacy by design” is challenging for GDPR compliance in OSS development. One participant also mentioned additional steps to verify implementations affected their development, stating "we need to make an additional review with the GDPR consultants that functionality that is related to the users data" (P53).

Benefits: One participant mentioned benefits to their development team and processes regarding the implementation of GDPR concepts, stating it helped highlight "things we had not considered before", such as ensuring that "logging functionality" and "access restrictions" were in place (P1). However, the majority of responses indicate that GDPR compliance often increases development efforts and incurs negative impacts for open-source developers.

4.1.2 Pull Request Metrics

To further observe the impact of GDPR compliance on OSS, we compared metrics for GDPR and non-GDPR related PRs. Table 4 presents these results. Using a Mann-Whitney U test, we found statistically significant differences between GDPR and non-GDPR PRs in the number of comments, active time, number of commits, lines of code added, lines of code deleted, and number of modified files. We also calculate the effect size for these results.

This indicates that incorporating changes related to the GDPR has a major impact on development work, leading to increased discussions between developers, longer review times, more code commits, and higher code churn. While we observed significant differences exist in pull request metrics between GDPR and non-GDPR PRs, the calculated effect sizes are "small" [71], indicating low practical differences between the groups. Yet, these findings support our survey results from open-source developers purporting that GDPR compliance efforts affect OSS development.

Finding 1: Developers report implementing GDPR compliance negatively affects development processes—citing cost, time, and data management as concerns.

Finding 2: PRs related to GDPR compliance have significantly more development activity for coding (commits, additions, deletions, files changed) and review (comments, active time) tasks.

Table 4: GDPR (G) vs. Non-GDPR (non-G) GitHub Activity Metrics.

| Characteristic | Type | Median | p-value |
|---|---|---|---|
| Comments* | G | 1 | < 0.0001 |
| | non-G | 1 | (U = 1.4E8, r = 0.09) |
| Active time (days)* | G | 418.05 | < 0.0001 |
| | non-G | 1.78 | (U = 1.4E8, r = 0.14) |
| Commits* | G | 2 | < 0.0001 |
| | non-G | 1 | (U = 1.4E8, r = 0.04) |
| Additions* | G | 57 | < 0.0001 |
| | non-G | 19 | (U = 1.5E8, r = 0.05) |
| Deletions* | G | 7 | < 0.0001 |
| | non-G | 4 | (U = 1.3E8, r = 0.05) |
| Changed files* | G | 4 | < 0.0001 |
| | non-G | 2 | (U = 1.4E8, r = 0.03) |

\* denotes statistically significant results (p-value < 0.05)

4.2 RQ2: GDPR Perceptions

This question was: RQ2: How do OSS developers perceive fulfilling GDPR requirements?

4.2.1 Survey

We asked participants about their perceptions of the impact of GDPR regulations on privacy. Of participants who responded to this question (n = 25), most had negative opinions of the GDPR. Three participants were neutral (e.g., "N/A" (P4)). We summarize positive and negative perceptions next.

Negative Perceptions: Despite the utility of data privacy regulations, 22 participants reported negative perceptions of the GDPR. These responses primarily focused on three issues: cost, organizations, and enforcement. For costs, respondents noted that implementing GDPR requirements is expensive and burdensome. Participants said that compliance is "costly for many companies" (P16), is "too expensive" (P24), and that "the cost of protection should not go over the cost of consequence of data breach...GDPR [isn't] worth the time" (P46). P55 also highlights that "in general there have been major costs to companies of all sizes" regarding GDPR implementations. For organizations, participants reported a negative impact of the GDPR on companies and organizations. They mentioned that GDPR compliance "weakens small and medium-sized enterprises" (P15), "threatens innovation" (P18), "fails to meaningfully integrate the role of privacy-enhancing innovation and consumer education in data protection" (P23), and that "in order to be safer than risky useful functionality is removed" (P52). P46 added that the GDPR is "a lot of headache...jobs for lawyers at the expense of people who are trying to solve real problems". For enforcement, one subject said "there is a large gap in GDPR enforcement among member states" (P17) and another observed "the trend...is an increase in the number of times and the amount of fines" (P18). Similarly, P49 described GDPR as "a big hammer", but was unsure "if it has necessarily increased security and privacy at this point".

Positive Perceptions: Eight participants had positive perceptions of the GDPR, generally stating that GDPR enhances data privacy for users. For example, participants said that "the risk of incurring and paying out hefty fines has made companies take privacy and security more proactively" (P30), that GDPR brings "awareness to the importance about privacy" (P45), that "data integrity is ensured" (P47), and "customers can now delete their data quite easily" (P54). Participants also appreciated the increased accountability for corporations in safeguarding users' data. For example, one participant stated "Before GDPR data protection was usually considered only as an afterthought if not an outright joke. Nowadays companies will at least consider what they are doing wrong before violating data protection laws, rather than doing it by accident because no-one even thought about it" (P50). These responses reflect the intentions of the GDPR: to safeguard the rights of users and their data online.

4.2.2 Sentiment Analysis

We investigated the sentiment of developers implementing GDPR concepts by analyzing PR titles, commit messages, review comments, discussion comments, and bodies. Our overall results are in Table 5. We anticipated a higher percentage of negative comments for GDPR-related pull requests. However, we did not find evidence that GDPR-related PRs have less favorable sentiments from developers. In fact, we found they often had more positive sentiments than non-GDPR-related PRs, with two of the three models (Liu-Hu and VADER) indicating a statistically significant difference between the GDPR and non-GDPR sentiment. We speculate two explanations. First, non-GDPR-related PRs represent a broad range of code contributions, which could address a number of issues. Second, we are limited by the capabilities of the sentiment analyzer. For example, the two most negative commit messages for non-GDPR pull requests said “obsolete” and “fatal”, which are common terms of art in software maintenance tasks [89, 113] (e.g., “fix fatal error”). We also observed some variation at the beginning and end of our dataset collection period, but no significant variation in sentiment over time (see Figure 1).

Table 5: GDPR (G) vs. Non-GDPR (non-G) Sentiment Analysis.

| Test | Type | Mean | Variance | p-value |
|---|---|---|---|---|
| Liu-Hu* | G | 0.43 | 0.27 | p < 0.0001 (t = -4.05, r = 0.22) |
| | non-G | -0.04 | 0.28 | |
| VADER* | G | 0.44 | 0.04 | p < 0.0001 (t = -6.47, r = 0.02) |
| | non-G | 0.21 | 0.01 | |
| SentiArt | G | 0.39 | 0.01 | p = 0.1399 (t = -1.10, r = 0.01) |
| | non-G | 0.36 | 0.002 | |

\* denotes statistically significant results (p-value < 0.05)

Figure 1: Longitudinal GDPR (G) and Non-GDPR (non-G) Sentiment Analysis Data. We grouped GDPR and non-GDPR data into 3-month segments and used 3 sentiment models. For each model, GDPR data is plotted in a color with a filled marker, and non-GDPR data in the same color but with a hollow marker. The general trend is that sentiment for GDPR data is moderately positive, and more positive than for non-GDPR data.
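To make the analyzer limitation concrete, the following is a minimal sketch of a lexicon-based scorer in the spirit of dictionary methods such as Liu-Hu. It is not the implementation used in this study: the word lists, the tokenizer, and the normalization (positive minus negative hits per 100 tokens) are illustrative assumptions.

```python
import re

# Toy lexicon for illustration only; real lexicons contain thousands of entries.
POSITIVE = {"good", "great", "thanks", "improve", "clean", "nice"}
NEGATIVE = {"fatal", "obsolete", "annoying", "error", "bad", "broken"}


def lexicon_sentiment(text):
    """Score text as 100 * (positive hits - negative hits) / token count.

    The scorer only counts dictionary words, so it cannot tell a routine
    maintenance message from a genuinely negative opinion.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return 100.0 * (pos - neg) / len(tokens)


# A routine commit message is scored as strongly negative because
# "fatal" and "error" appear in the negative word list.
print(lexicon_sentiment("fix fatal error"))
print(lexicon_sentiment("thanks, nice cleanup"))
```

Under this kind of scoring, a message such as “fix fatal error” is rated negative even though it describes a fix, which is one way ordinary maintenance vocabulary can pull non-GDPR sentiment downward.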

Nonetheless, manual inspection of negatively scored content showed OSS developers expressing frustration with GDPR compliance. For instance, one title and commit message described GDPR-related changes to “avoid lawsuits by mentioning cookies thing” [91]. Another title states adding “just enough EULA [end user license agreement] not to get banned” [31]. Similar frustrations were shared in a PR body for “GDPR stuff” adding changes to “display the annoying cookies banner” [104]. Discussion comments, such as “will this conflict with GDPR?” [24], also highlight OSS developers' confusion with GDPR requirements.

Finding 3: Despite its nominal advantages, most developers had negative perceptions of the GDPR and its implementation.

Finding 4: We found developers did not express more negative sentiments about GDPR compliance in PR discussions.

Finding 5: Sentiment related to GDPR compliance appears to be stable over time.

4.3 RQ3: Implementation Challenges

This question was: RQ3: What GDPR concepts do OSS developers find most challenging to implement? In the survey data, we observed three common challenges: data management, data protection, and vague requirements.

Data Management: 11 developers responded that processing and storing users' data according to GDPR requirements is the most challenging concept to implement. For example, participants mentioned challenges implementing “data protection” (P24), handling “personal data” (P34), the “exchange of documents containing personal data” (P32), the “improper storage” (P30) of user data, and “knowing what info can or cannot be accessed or saved” (P49). In particular, four participants mentioned users' right to erasure (the obligation for data controllers to delete users' data upon request “without undue delay” [4]) as the most complicated requirement to implement. For example, P53 responded, “it's not always easy enough to implement data processing in a way, that it's anonymized, and if the user would like their data to be erased, be able to continue processing of the results based on user data in an anonymous way”, describing the complexity of this requirement for their project.
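The tension P53 describes, erasing a user's data while still being able to process results derived from it, can be sketched as follows. This is a hypothetical in-memory design, not drawn from any participant's project: the `UserStore` class and its fields are invented for illustration, and whether severing the identifier is sufficient for anonymization under the GDPR is a legal question this sketch does not settle.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserStore:
    """Hypothetical store that separates personal data from usage events."""
    profiles: dict = field(default_factory=dict)  # user_id -> personal data
    events: list = field(default_factory=list)    # (user_id or None, event type)

    def add_user(self, user_id, email):
        self.profiles[user_id] = {"email": email}

    def record(self, user_id, event_type):
        self.events.append((user_id, event_type))

    def erase(self, user_id):
        """Delete the profile and sever the link between events and the user."""
        self.profiles.pop(user_id, None)
        self.events = [
            (None, etype) if uid == user_id else (uid, etype)
            for uid, etype in self.events
        ]

    def event_counts(self):
        """Aggregate statistics that remain computable after erasure."""
        return Counter(etype for _, etype in self.events)


store = UserStore()
store.add_user("alice", "alice@example.org")
store.add_user("bob", "bob@example.org")
store.record("alice", "login")
store.record("bob", "login")
store.record("alice", "export")
store.erase("alice")
print(store.profiles)        # only bob's profile remains
print(store.event_counts())  # aggregate counts are unchanged
```

Even in this toy form, erasure touches every structure that references the user, which hints at why participants found the requirement costly in systems where data is spread across many components.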

Data Protection: Five participants mentioned security factors as a challenge for GDPR compliance. For instance, participants were concerned with “data protection” and “other security concerns” (P24), “leaks” (P27), and the fact that other entities have “the ability to steal data” (P28). P55 noted challenges with handling and securing data in “central databases, where that data may be relied on by many loosely connected applications and systems”. These responses highlight the difficulties of implementing mechanisms to safeguard users' data.

Vague Requirements: 10 survey respondents highlighted a lack of clear requirements as the biggest challenge with GDPR compliance in OSS. For example, one participant mentioned that GDPR “is pretty vague” with a lack of “standard format” (P54). Another described confusion in knowing “how long can data be retained” and “what is Personal[sic] Identifiable Information”, adding that the “lack of clarity in the regulations[sic] leads to confusion” (P52). Moreover, P48 highlighted that a lack of company understanding of GDPR requirements makes compliance difficult.

Beyond these clear categories, we also received a wide range of other responses, including “lawfulness and dispute resolution” (P47), the conflict between “individual privacy and the public's right to know” (P21), and being in a “rush to regulate” (P28). P27 mentioned challenges with user experiences, stating “users endure invasive pop-ups”. Further, P1 noted the challenges evolve during the lifetime of a project, stating “At the beginning of a project, privacy by design and default. In the middle or the end, data minimization and transparency” are the main challenges. Based on the challenges of implementation, participants described difficulties limiting functionality, e.g., “knowing when interacting with EU citizens” (P49) and “more than 1,000 news websites in the European Union have gone dark” (P15). Meanwhile, P17 mentioned difficulties implementing GDPR requirements for data-intensive programming domains: “many of the GDPR's requirements are essentially incompatible with big data, artificial intelligence, blockchain, and machine learning”. These challenges motivate new resources to help developers overcome problems related to GDPR implementation and compliance.

Finding 6: The management and protection of user data and vague requirements are key challenges open-source developers face when implementing GDPR requirements.

4.4 RQ4: Compliance Assessment

This question was: RQ4: How do OSS developers assess GDPR compliance? We found three kinds of responses related to compliance assessment: consulting with legal counsel, referencing other compliance resources, and self-assessment.

Compliance Through Legal Counsel: In our survey results, 15 OSS developers reported consulting with legal teams for GDPR compliance. We were also interested in exploring the impact of seeking legal counsel for GDPR compliance on OSS development processes. Seven participants with experience seeking legal consultations noted that it did have a positive impact on development activity (P6, P13, P14, P45, P53, P55, P56). Participants noted the benefits of consulting legal experts, stating the importance of “consulting with lawyers on the team who have a seat at the table” (P45), that it “clarifies requirements and prevents misinterpretations” (P55), and that it allowed GDPR compliance to be “implemented rather easily” (P56).

However, most participants (n = 9) with experience seeking legal counsel lamented the impact, stating it decreased development productivity: “it slows things down as code has to be reviewed and objectives revised” and “it impacted our approach to the SDLC” (P1), “it's a bit of a headache” (P24), “it slowed us down...was mostly a box ticking exercise” (P51), and “it interrupted the development but it is required” (P49). Respondents also bemoaned the costs of working with legal teams, stating “for a global project open source project any legal advice would be extremely expensive” (P52) and “open-source projects can't afford even to sustain maintainers, not even speaking about legal team...Legal teams are consulted with some corps want to kill the project” (P47). P54 also noted legal experts found difficulties with the vagueness of GDPR compliance, replying the “legal team struggles to interpret how to comply with GDPR, there are a lot of back-and-forth. We have to change our design many times”.

In sum, legal experts can provide valuable insight into data privacy regulations and compliance, but developers often find these interactions negatively impact development processes.

Compliance Resources: To assess GDPR compliance, three participants mentioned a variety of other resources. One participant described formal training on regulatory compliance, with a “special training on GDPR within the company” (P16). Another participant responded that their team uses an “accountability system” (P24) to assess compliance. Finally, P15 noted using online resources to help, but highlighted their ineffectiveness, stating, “many of the articles on the Internet about GDPR are incomplete or even wrong”.

Self-assessment: Other developers mentioned they were largely responsible for evaluating the “legality” (P18) and “integrity and confidentiality” (P23) of the processing and storage of user data in their system on their own. P24 responded developers have to “consider whether you really need all the data you collect” while P38 advised to “get your consent in order”. P53 noted the impact on development teams, stating GDPR implementations “took us significant amount of time due to several rounds of architecture review”. P18 added there is “really no good way” to evaluate compliance.

Finding 7: Developers often do not consult legal experts to validate GDPR compliance, relying on other resources such as compliance training, accountability systems, online resources, and self-assessed data management.

Finding 8: Participants with experience interacting with legal teams provided mixed perceptions, feeling they provided valuable insight but hindered development processes.

5 DISCUSSION AND FUTURE WORK

Our results demonstrate that GDPR-related code changes have a major impact in OSS development, significantly increasing development activity with regard to the number of lines of code added and the number of commits included in PRs, indicating increased effort in code contributions and code review activities for developers (§4.1.2). Further, we found that GDPR compliance provides a wide range of challenges for OSS development (§4.3) and that developers often assess compliance without the help of legal and policy experts (§4.4). These findings suggest that implementing GDPR compliance is a challenging activity for OSS developers.

We recognize many stakeholders are involved in adhering to data privacy legislation. For instance, policymakers also play a role in data privacy compliance [112]. Data privacy regulations, such as the GDPR, are beneficial for protecting the rights and data of users online. However, we noticed developers complaining about providing privacy to people, holding negative perceptions of the GDPR policy in general and its implementation. To that end, we provide guidelines to enhance data privacy regulations and software development processes to reduce the negative effects of policy compliance in OSS software.

5.1 Improving Data Privacy Regulations

5.1.1 Provide Clear Requirements. We found developers struggled to implement GDPR concepts (§4.3). Moreover, few respondents reported consulting with legal experts to provide insight into policies and assess the compliance of projects (§4.4). Thus, most development teams are forced to evaluate the system themselves. Yet, participants complained that understanding compliance is difficult due to the ambiguity of GDPR concepts: for instance, “the procedure for obtaining user consent and the information provided are unclear” (P25). Prior work suggests ambiguity is a main challenge in requirements engineering [28]. Further, incomplete requirements can increase development costs and the probability of project failure [38].

To improve program specifications, researchers have explored a variety of techniques. For instance, Wang et al. explored using natural language processing to automatically detect ambiguous terminology in software requirements [111]. Similar techniques could be applied to regulations such as the GDPR to notify policymakers of unclear language and clarify requirements for software engineers. Another way to improve the clarity of requirements is to involve software developers in the policy-making process. Verdon argues a good policy must be “understandable to [its] audience” [109, p. 48], yet our results show developers are confused by GDPR requirements. Prior work shows collaboration between policy makers and practitioners improves policies in domains such as public health [37] and education [61]. Thus, developers should be incorporated into the policy-making process to provide input on the impact of implementing and complying with policies concerning software development, such as data privacy regulations.

5.1.2 Policy Resources. Our survey results show OSS developers face challenges implementing GDPR-related changes (§4.3). Participants also found legal consultations negatively affect development processes (§4.4), and report existing resources are largely ineffective, primarily relying on self-assessment within the development team. Only one participant mentioned receiving formal training on GDPR compliance (P16). As a result, OSS developers largely resort to implementing and evaluating compliance on their own with “insufficient information” (P26). Prior work also outlines issues with software developers and security policies, noting a lack of understanding from programmers [109].

Based on our findings, we posit OSS development can benefit from novel resources to educate developers on policies and their implementation. To further support compliance, policymakers can provide resources, such as guides or online forums, to provide information on data privacy-related concepts in an accessible manner. These guidelines can also reduce the effects of GDPR compliance on code review tasks by providing specialized expertise and correct understanding for reviewers [85]. Yet, there are limited online developer communities focused on seeking help in data privacy policy implementation. Popular programming-related Q&A websites, e.g., Stack Overflow, are frequently used by developers to ask questions and seek information online [86], and are used for discussions on data privacy policy implementation (see Table 1). However, developers have no way to verify the correctness of responses, which can also become obsolete over time. Zhang et al. recommend automated tools to identify outdated information in responses for development concepts, such as API libraries and programming languages [116]. A similar approach can be used to keep responses regarding GDPR compliance up-to-date and accurate.

5.2 Improving Development Processes

5.2.1 Privacy by Design. Participants reported challenges implementing GDPR compliance (§4.3) and negative effects on development practices (§4.1.1). Moreover, our GitHub analysis found GDPR-related changes necessitated significantly more time and effort (i.e., comments, commits, etc.) for developers to implement and review in PRs (see Table 4). However, compliance is required for organizations to avoid “paying out hefty fines” (P30). Researchers have investigated techniques to streamline the incorporation of privacy in development processes. For instance, Privacy By Design (PBD) is a software development approach to make privacy the “default mode of operation” [35]. P50 mentioned cultivating “a privacy-respecting mindset long before GDPR came about” avoided negative impacts on development processes and made the effort required “quite minimal”. However, numerous participants noted the burden of implementing GDPR requirements, with one survey participant in particular (P1) highlighting that prioritizing privacy in software development processes “requires an overhaul”. Additionally, while PBD can benefit GDPR compliance efforts, Kurtz et al. note a scarcity of research in this area and note particular challenges with PBD for GDPR implementations, such as ensuring third party libraries also adhere to privacy principles [70].

PBD can be effective for new projects starting from scratch [102], yet may be ill-equipped for existing projects complying with new and changing data privacy regulations. Anthonysamy et al. outline limitations with current privacy requirements that solve present issues, which may differ from regulations and policies in the future [25]. More work is needed to explore tools and processes to support data privacy in mature software projects. One solution could be a partial or gradual approach to compliance. For instance, some programming languages (e.g., TypeScript) support gradual typing to selectively check for type errors in code [93]. Similarly, research in formal methods has explored supporting gradual verification of programs [26]. Thus, gradually introducing privacy into OSS can help reduce efforts related to GDPR compliance as opposed to overhauling development processes to prioritize privacy.

5.2.2 Automated Tools. We found GDPR compliance has a major impact on OSS development, significantly increasing coding and reviewing tasks for PRs in GitHub repositories (see Table 4). Developers who responded to our survey also indicated the impact of GDPR compliance on their project source code, noting data privacy regulations always need more software (P4) and violate the principle of minimum scope (P21). This indicates further difficulty for developers to validate their projects for the GDPR, with one participant responding there is “no good way” to assess compliance (P18). These findings point to an increased burden and effort on OSS developers to implement and review GDPR requirements to comply with data privacy regulations and avoid penalties for non-compliance (e.g., losing market share).

To that end, we posit automated tools can reduce the burden of GDPR implementation efforts. One participant mentioned using a tool, an “accountability system” (P24), to help assess compliance, but did not provide any details about this system. Our findings for RQ1 (§4.1) show GDPR-related pull requests have significantly more coding involved, consisting of more commits and lines of code added in code contributions, as well as requiring significantly more comments and time in reviewing processes. Thus, systems to support data privacy implementation and tools to review policy-relevant code are needed to streamline compliance. Ferrara and colleagues present static analysis techniques to support GDPR compliance [42]. Further tools can support review processes for assessing implementation changes. Prior work suggests static analysis tools can reduce time and effort in code reviews [94]. Future systems could also provide automated feedback to developers and reviewers on data privacy regulation compliance. For instance, NLP techniques [17] or rule-based machine learning approaches [51] could be used to automatically summarize requirements and verify compliance.
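As a rough illustration of what lightweight review support could look like, the sketch below flags added lines in a unified diff that mention identifiers suggestive of personal data. It is a naive keyword scan, not the static analysis of [42] or any existing tool, and the hint list and function name are hypothetical.

```python
import re

# Hypothetical hints that a line may touch personal data. Substring matching
# is deliberately crude and will produce false positives and false negatives.
PERSONAL_DATA_HINTS = ("email", "phone", "address", "birth", "ip_addr", "ssn")
PATTERN = re.compile("|".join(PERSONAL_DATA_HINTS), re.IGNORECASE)


def flag_added_lines(diff_text):
    """Return (line number within the diff, stripped content) for each added
    line that mentions one of the hints. File headers ('+++') are skipped."""
    flagged = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        is_added = line.startswith("+") and not line.startswith("+++")
        if is_added and PATTERN.search(line):
            flagged.append((number, line[1:].strip()))
    return flagged


example_diff = """\
+++ b/models.py
+    email = Column(String)
-    phone = Column(String)
+    count = 0
"""
for number, content in flag_added_lines(example_diff):
    print(f"diff line {number}: review for personal data handling: {content}")
```

A scan like this could only prompt a reviewer to look more closely; deciding whether a change actually complies with the GDPR would still require the kinds of expertise and tooling discussed above.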

5.3 Other Directions

Based on our results, we observe several other avenues of future work. First, we plan to investigate other data sources to further explore GDPR compliance in open-source projects. For example, we plan to mine relevant queries from Stack Overflow to gain insight into challenges and information needs developers have for implementing GDPR policies. We will also examine answers to observe how developers respond. For instance, online discussions between developers regarding policies often use disclaimers, such as the acronyms “IANAL” or “NAL” to indicate “I am not a lawyer”, before offering advice or answering questions related to legal frameworks. Without legal expertise, we anticipate it is difficult for OSS developers to offer guidance and seek help complying with data privacy regulations—motivating the need for novel approaches to support regulation adherence and compliance assessment.

Moreover, we aim to engage with policymakers to understand their perspectives on data privacy policies and the challenges developers face implementing them. We will collect qualitative insights from politicians and individuals with authority to develop policies to further explore methods to support the implementation of privacy laws. Finally, we aim to extend this work to investigate the impact of broader technology-related policies on open-source software development practices, for instance by investigating the impact of alternative data privacy regulations (e.g., the CCPA or CDPA) as well as other legal frameworks that will impact software development and maintenance, such as current and imminent legislation regarding artificial intelligence governance.

6 RELATED WORK

We note two lines of related work: characterizations of stakeholder perspectives on data privacy regulations, and technical and methodological approaches for regulatory compliance.

Stakeholder perspectives: Research has investigated perspectives on the GDPR for stakeholders in data privacy regulation compliance. Sirur and colleagues examined organizational perceptions on the feasibility of implementing GDPR concepts, finding that larger organizations were confident in their ability to comply while smaller companies struggled with the breadth and ambiguity in GDPR requirements [95]. Earp et al. surveyed software users to show the Internet privacy protection goals and policies for online websites do not meet users' expectations for privacy [41]. Similarly, Strycharz et al. surveyed consumers to uncover frustrations and negative attitudes related to the GDPR [100]. Our work focuses on the perceptions of developers, who are responsible for implementing code changes to comply with data privacy regulations.

On the perspective of software engineers as regulatory stakeholders, van Dijk and colleagues provide an overview of the transition of privacy policies from self-imposed guidelines from developers to legal frameworks and legislation [107]. Alhazmi interviewed software developers to uncover barriers for adopting GDPR principles, identifying a lack of familiarity, precedented techniques, useful help resources, and prioritization from employers. The paper also found that developers generally do not prioritize privacy features in their projects, focusing instead on functional requirements, which hinders compliance [20]. Similarly, researchers interviewed senior engineers to understand the challenges implementing general privacy guidelines, indicating a frustration with legal interactions and the non-technical aspects of requirements [29]. Finally, Klymenko et al. interviewed technical and legal professionals to investigate measures for data privacy compliance in GDPR implementation, noting a lack of understanding and need for interdisciplinary solutions [66]. While these papers take similar approaches to our research, ultimately our goals and questions are distinct, since we are specifically interested in the perspective of open-source developers.

Implementing and verifying GDPR compliance: Prior work has explored approaches to implement and verify GDPR compliance. For instance, Martín et al. recommend Privacy by Design methods and tools for GDPR compliance [78]. Shastri and colleagues introduce GDPRBench, a tool to assess the GDPR compliance of databases [92]. Li et al. investigated automated GDPR compliance as part of continuous integration workflows [74]. Al-Slais conducted a literature review to develop a taxonomy of privacy implementation approaches to guide GDPR compliance [19]. Finally, Mahindrakar et al. proposed the use of blockchain technologies to validate personal data compliance [77]. Rather than proposing new software engineering methods, measures, and tools related to GDPR, our work takes an empirical perspective to understand current practices.

7 THREATS TO VALIDITY

We discuss three types of threats to validity.

Construct: In mining OSS repositories, we defined the construct of “GDPR-related pull requests” based on the presence of the string “GDPR”. Some PRs may incorrectly refer to GDPR (false positives), while others may perform GDPR-relevant changes without using the acronym (false negatives). This is also biased towards English-speakers, as this acronym differs in other languages. To mitigate non-English GDPR-related PRs polluting the non-GDPR-related dataset, we manually inspected PR titles for various iterations of the GDPR in other languages, including “RGPD” (French, Spanish, and Italian), “DSGVO” (German), and “AVG” (Dutch). However, these were not included in our GDPR-related dataset since we only focus on PRs in English for our analysis. We used off-the-shelf NLP techniques to assess sentiment, inheriting biases from these methods (e.g., misinterpreted connotations of homonyms such as “mock”). In addition, parametric models for sentiment analysis are based on defined dictionary values and cannot detect certain aspects of human communication, such as sarcasm. Prior work also suggests sentiment analysis tools can be inaccurate in software engineering contexts [64]. However, we use this to gain preliminary insights into developers' perceptions of GDPR compliance in OSS.
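The keyword-based construct and its failure modes can be sketched as follows. This is an illustration rather than the data collection script used in this study: the whole-word, case-insensitive matching and the label names are assumptions.

```python
import re

# Assumed matching rules: whole-word and case-insensitive. The study defines
# the construct only as the presence of the string "GDPR".
GDPR_EN = re.compile(r"\bGDPR\b", re.IGNORECASE)
GDPR_OTHER = re.compile(r"\b(RGPD|DSGVO|AVG)\b", re.IGNORECASE)


def classify_pr(text):
    """Label PR text as 'gdpr', 'non_english_candidate', or 'non_gdpr'.

    Candidates are meant to be inspected manually rather than added to
    either dataset automatically.
    """
    if GDPR_EN.search(text):
        return "gdpr"
    if GDPR_OTHER.search(text):
        return "non_english_candidate"
    return "non_gdpr"


print(classify_pr("Add GDPR consent banner"))          # gdpr
print(classify_pr("Mise en conformite RGPD"))          # non_english_candidate
print(classify_pr("Compute avg latency per request"))  # non_english_candidate
print(classify_pr("Delete user data on request"))      # non_gdpr
```

The last two calls show both threats at once: “avg” as an abbreviation of “average” is flagged even though it is unrelated to the Dutch acronym (a false positive that manual inspection must catch), while a PR that implements erasure without naming the regulation is missed (a false negative).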

Internal: We perceive no internal threats. This study provides characterizations rather than cause-effect measurements.

External: There are several threats to the generalizability of our findings. We inherit the standard perils of mining open-source software [65]. We focus on open-source software available on GitHub, which omits other code hosting platforms, such as GitLab, which may be used by different populations of developers. We doubt our results generalize to commercial software, since those development organizations directly face the consequences of GDPR non-compliance. We only consider the effect of GDPR because it is the most prominent privacy law, and hence has the most available data. Other regulations may have different effects. Specifically, we conjecture differences in the software engineering impact between general data privacy regulations, such as the GDPR and CCPA, and industry-specific data privacy regulations, such as HIPAA and FERPA: general regulations may necessarily be more ambiguous.

8 CONCLUSIONS

Data privacy regulations are being introduced to prevent data controllers from misusing users' information and to protect individuals. To adhere to these regulations, developers are charged with the complex task of understanding policies and making modifications to the source code of applications to implement privacy-related requirements. This work examines the impact of data privacy regulations on software development processes by investigating code contributions and developer perceptions of GDPR compliance in open-source software. Our results show that complying with data privacy regulations significantly impacts development activities on GitHub, evoking negative perceptions and frustrations from developers. Our findings provide implications for developers and policymakers to support the implementation of data privacy regulations that protect the rights of human users in digital environments.

9 DATA AVAILABILITY

We have uploaded the survey, datasets, and data collection and analysis scripts as supplementary materials [2]. Our IRB protocol does not allow us to share individual survey responses.

10 ACKNOWLEDGMENTS

Brown and Brantly acknowledge support from the Virginia Commonwealth Cyber Initiative (CCI).

REFERENCES

[1] [n. d.]. https://edps.europa.eu/data-protection/data-protection/legislation/history-general-data-protection-regulation_en

[2] [n. d.]. https://anonymous.4open.science/r/GDPR-OSS-Impact-D77B

[3] [n. d.]. MIT License. https://opensource.org/licenses/MIT. Accessed July 2023.

[4] [n. d.]. Right to erasure (right to be forgotten). https://gdpr-info.eu/art-17-gdpr/

[5] 1974. Family Educational Rights and Privacy Act of 1974. 20 U.S.C. § 1232g; 34 CFR Part 99. https://www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html

[6] 1991. GNU General Public License, version 2. Free Software Foundation. https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

[7] 1996. Health Insurance Portability and Accountability Act of 1996. Pub. L. No. 104-191, 110 Stat. 1936. https://www.govinfo.gov/content/pkg/PLAW-104publ191/pdf/PLAW-104publ191.pdf

[8] 2004. Apache License, Version 2.0. Apache Software Foundation. https://www.apache.org/licenses/LICENSE-2.0

[9] 2010. IEC 61508-1:2010 - Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 1: General requirements. International Electrotechnical Commission. https://webstore.iec.ch/publication/5512

[10] 2014. ISO 90003:2014 - Software engineering - Guidelines for the application of ISO 9001:2015 to computer software. International Organization for Standardization. https://www.iso.org/standard/59149.html

[11] 2015. ISO 9001:2015 - Quality management systems - Requirements. International Organization for Standardization. https://www.iso.org/standard/62085.h

[12] 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX: 95/46/EC

[13] 2018. ISO 26262-1:2018 - Road vehicles - Functional safety - Part 1: Vocabulary. International Organization for Standardization. https://www.iso.org/standard/68383.html

[14] 2023. 5th State of CCPA & GDPR Privacy Rights Compliance Research Report Q4 2022. Cytrio. https://cytrio.com/wp-content/uploads/2023/02/5th-State-of-CCPA-GDPR-Compliance-Report_FNL2.pdf

[15] 2023. GDPR Enforcement Tracker - list of GDPR fines. Enforcement Tracker. https://www.enforcementtracker.com

[16] Ahmad Abdellatif, Mairieli Wessel, Igor Steinmacher, et al. 2022. BotHunter: an approach to detect software bots in GitHub. In Proceedings of the 19th International Conference on Mining Software Repositories. 6–17.

[17] Abdel-Jaouad Aberkane, Geert Poels, and Seppe Vanden Broucke. 2021. Exploring automated gdpr-compliance in requirements engineering: A systematic mapping study. IEEE Access 9 (2021), 66542–66559.

[18] Saeed Akhlaghpour, Farkhondeh Hassandoust, et al. 2021. Learning from enforcement cases to manage gdpr risks. MIS Quarterly Executive 20, 3 (2021).

[19] Yaqoob Al-Slais. 2020. Privacy Engineering Methodologies: A survey. In 2020 International Conference on Innovation and Intelligence for Informatics, Computing and Technologies (3ICT). 1–6. https://doi.org/10.1109/3ICT51146.2020.9311949

[20] Abdulrahman Alhazmi and Nalin Asanka Arachchilage. 2021. I'm all ears! listening to software developers on putting GDPR principles into software development practice. Personal and Ubiquitous Computing 25, 5 (2021), 879–892.

[21] Reni Allan. 2007. Reskilling for compliance. Inf. Professional 4, 1 (2007), 20–23.

[22] Fernando Almeida and José Augusto Monteiro. 2021. Exploring the effects of GDPR on the user experience. Journal of information systems engineering and management 6, 3 (2021).

[23] Murugan Anandarajan, Chelsey Hill, and Thomas Nolan. 2019. Text preprocessing. Practical text analytics: Maximizing the value of text data (2019), 45–59.

[24] Maythee Anegboonlap. 2018. Will this conflict with GDPR? https://github.com/ReferaCandy/woocommerce-refera-candy/pull/24#discussion_r238153546. Github repository: ReferaCandy/woocommerce-refera-candy.

[25] Pauline Anthonysamy, Awais Rashid, and Ruzanna Chitchyan. 2017. Privacy requirements: present & future. In IEEE/ACM International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS). IEEE, 13–22.

[26] Johannes Bader, Jonathan Aldrich, and Éric Tanter. 2018. Gradual program verification. In Verification, Model Checking, and Abstract Interpretation (VMCAI). Springer, 25–46.

[27] Ben Balter. 2015. Open source license usage on Github.com. Github Blog. https://github.blog/2015-03-09-open-source-license-usage-on-github-com/

[28] Muneera Bano. 2015. Addressing the challenges of requirements ambiguity: A review of empirical literature. In 2015 IEEE Fifth International Workshop on Empirical Requirements Engineering (EmpiRE). IEEE, 21–24.

[29] Kathrin Bednar, Sarah Spiekermann, and Marc Langheinrich. 2019. Engineering Privacy by Design: Are engineers ready to live up to the challenge? The Information Society 35, 3 (2019), 122–142.

[30] Yoav Benjamini and Yosef Hochberg. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 1 (1995), 289–300. http://www.jstor.org/stable/2346181

[31] Ani Betts. 2021. Just enough EULA to not get banned. https://github.com/anaisbets/sirene/pull/37. Github repository: anaisbets/sirene.

[32] Alex Bowyer, Jack Holt, Josephine Go Jefferies, Rob Wilson, David Kirk, and Jan David Smeddinck. 2022. Human-GDPR interaction: Practical experiences of accessing personal data. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19.

[33] Randolph E. Bucklin and Catarina Sismeiro. 2009. Click here for Internet insight: Advances in clickstream data analysis in marketing. Journal of Interactive marketing 23, 1 (2009), 35–48.

[34] Noel Carroll and Ita Richardson. 2016. Software-as-a-medical device: demystifying connected health regulations. Journal of Systems and Information Technology 18, 2 (2016), 186–215.

[35] Ann Cavoukian. 2009. Privacy by design. (2009).

[36] David Chisnall. 2012. The Go programming language phrasebook. Addison-Wesley.

[37] Bernard CK Choi, Tikki Pang, Vivian Lin, et al. 2005. Can scientists and policy makers work together? Journal of Epidemiology & Community Health 59, 8 (2005), 632–637.

[38] Tom Clancy. 1995. The chaos report. The Standish Group (1995).

[39] Jacob Cohen. 2013. Statistical power analysis for the behavioral sciences. Routledge.

[40] Jose Luis de La Vara, Markus Borg, Krzysztof Wnuk, and Leon Moonen. 2016. An industrial survey of safety evidence change impact analysis practice. IEEE Transactions on Software Engineering 42, 12 (2016), 1095–1117.

[41] J.B. Earp, A.I. Anton, L. Aiman-Smith, and W.H. Stufflebeam. 2005. Examining Internet privacy policies within the context of user privacy values. IEEE Transactions on Engineering Management 52, 2 (2005), 227–237.

[42] Pietro Ferrara, Nicola Fausto Spoto, et al. 2018. Static analysis for GDPR compliance. In CEUR Workshop Proceedings. CEUR Workshop Proceedings, 1–10.

[43] Aaron J Fischer, Brandon K Schultz, Melissa A Collier-Meek, et al. 2018. A critical review of videoconferencing software to support school consultation. International Journal of School & Educational Psychology 6, 1 (2018), 12–22.

[44] Lucas Franke, Huayu Liang, Aaron Brantly, James C. Davis, and Chris Brown. 2024. A First Look at the General Data Protection Regulation (GDPR) in Open-Source Software. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (Lisbon, Portugal) (ICSE-Companion '24). Association for Computing Machinery, New York, NY, USA, 268–269. https://doi.org/10.1145/3639478.3643077

[45] GDPR. 2018. Art. 4 GDPR: Definitions. https://gdpr.eu/article-4-definitions/

[46] GDPR. 2018. Art. 83 GDPR: General conditions for imposing administrative fines. https://gdpr.eu/article-83-conditions-for-imposing-administrative-fines/

[47] Github. 2022. Octoverse 2022: The state of open source software. https://octoverse.github.com

[48] Github. 2023. Creating a pull request. https://help.github.com/en/articles/creating-a-pull-request. Github Help.

[49] Georgios Gousios and Andy Zaidman. 2014. A dataset for pull-based development research. In Conference on Mining Software Repositories. 368–371.

[50] Emitza Guzman, David Azócar, and Yang Li. 2014. Sentiment analysis of commit comments in Github: an empirical study. In Mining Software Repositories (MSR).

[51] Rajaa El Hamdani et al. 2021. A combined rule-based and machine learning approach for automated GDPR compliance checking. In Eighteenth International Conference on Artificial Intelligence and Law. 40–49.

[52] Nikolay Harutyunyan. 2020. Managing your open source supply chain-why and how? Computer 53, 6 (2020), 77–81.

[53] Paul Hitlin, Rainie Lee, and Kenneth Olmstead. 2019. Facebook Algorithms and Personal Data. Pew Research Center. https://www.pewresearch.org/internet/2019/01/16/facebook-algorithms-and-personal-data/

[54] Chris Hobbs. 2019. Embedded software development for safety-critical systems. CRC Press.

[55] Sebastian Holst. 2017. GDPR liability: software development and the new law. LinkedIn (2017). https://www.linkedin.com/pulse/gdpr-liability-software-development-new-law-sebastian-holst/

[56] Minqing Hu and Bing Liu. 2004. Mining opinion features in customer reviews. In AAAI, Vol. 4. 755–760.

[57] Syed Fatiul Huq, Ali Zafar Sadiq, and Kazi Sakib. 2019. Understanding the effect of developer sentiment on fix-inducing changes: An exploratory study on github pull requests. In 2019 26th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 514–521.

[58] Clayton Hutto and Eric Gilbert. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media, Vol. 8. 216–225.

[59] International Association of Privacy Professionals. Accessed 2023. Global Comprehensive Privacy Law Mapping Chart. https://iapp.org/resources/article/global-comprehensive-privacy-law-mapping-chart/

[60] International Electrotechnical Commission. 2010. Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 3: Software requirements. https://webstore.iec.ch/publication/9277

[61] Chongtao Jia, Mihai Stănescu, and Elham Marin. 2019. How can researchers facilitate the utilisation of research by policy-makers and practitioners in education? Research Papers in Education 34, 4 (2019), 483–498.

[62] Jim Isaak and Mina J. Hanna. 2018. User Data Privacy: Facebook, Cambridge Analytica, and Privacy Protection. Computer 51, 8 (2018), 56–59.

[63] Arthur M. Jacobs. 2019. Sentiment analysis for words and fiction characters from the perspective of computational (neuro-) poetics. Frontiers in Robotics and AI 6 (2019), 53.

[64] Robbert Jongeling, Proshanta Sarkar, Subhajit Datta, and Alexander Serebrenik. 2017. On negative results when using sentiment analysis tools for software engineering research. Empirical Software Engineering 22 (2017), 2543–2584.

[65] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2014. The promises and perils of mining github. In 11th Working Conference on Mining Software Repositories (MSR). 92–101.

[66] Oleksandra Klymenko, Oleksandr Kosenkov, Stephen Meisenbacher, Parisa Elahidoost, Daniel Mendez, and Florian Matthes. 2022. Understanding the implementation of technical measures in the process of data privacy compliance: A qualitative study. In Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 261–271.

[67] Michael Kretschmer, Jan Pennekamp, and Klaus Wehrle. 2021. Cookie banners and privacy policies: measuring the impact of the gdpr on the web. ACM Transactions on the Web (TWEB) 15, 4 (2021), 1–42.

[68] Oksana Kulyk, Nina Gerber, Annika Hilt, et al. 2020. Has the gdpr hype affected users' reaction to cookie disclaimers? Journal of Cybersecurity - 1. 8895.

[69] Aman Kumar, Manish Khare, and Saurabh Tiwari. 2022. Sensitivity Analysis of Developers' Comments on GitHub Repository: A Study. In International Conference on Advanced Computational Intelligence (ICACI). IEEE, 91–98.

[70] Christian Kurtz, Martin Semmann, and Tilo Böhmann. 2018. Privacy by design to comply with GDPR: a review on third-party data processors. (2018).

[71] Daniël Lakens. 2013. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in psychology 4 (2013), 6267.

[72] Roslyn Layton and Silvia Elaluf-Calderwood. 2019. A social economic analysis of the impact of GDPR on security and privacy practices. In 2019 12th CMI Conference on Cybersecurity and Privacy (CMI). IEEE, 1–6.

[73] Thomas W MacFarland and Jan M Yates. 2016. Mann-Whitney U test. Introduction to nonparametric statistics for the biological sciences using R (2016), 103–132.

[74] Abhishek Mahindrakar and Karuna Pande Joshi. 2020. Automating GDPR Compliance using Policy Integrated Blockchain. In IEEE Intl Conf on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conf on High Performance and Smart Computing (HPSC) and IEEE Intl Conf on Intelligent Data and Security (IDS). 86–93. https://doi.org/10.1109/BigDataSecurity-HPSC-IDS49724.2020.00026

[75] MH Lloyd and PJ Reeve. 2009. IEC 61508 and IEC 61511 assessments-some lessons learned. (2009).

[76] He Li, Lu Yu, and Wu He. 2019. The impact of GDPR on global technology development. Journal of Global Information Technology Management 22, 1 (2019).

[77] Ze Shi Li, Colin Werner, and Neil Ernst. 2019. Continuous Requirements: An Example Using GDPR. In 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW). 144–149. https://doi.org/10.1109/REW.2019.00031

[78] MH Lloyd and PJ Reeve. 2009. IEC 61508 and IEC 61511 assessments-some lessons learned. (2009).

[79] Thomas W MacFarland and Jan M Yates. 2016. Mann-Whitney U test. Introduction to nonparametric statistics for the biological sciences using R (2016), 103–132.

[80] Abhishek Mahindrakar and Karuna Pande Joshi. 2020. Automating GDPR Compliance using Policy Integrated Blockchain. In IEEE Intl Conf on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conf on High Performance and Smart Computing (HPSC) and IEEE Intl Conf on Intelligent Data and Security (IDS). 86–93. https://doi.org/10.1109/BigDataSecurity-HPSC-IDS49724.2020.00026

[81] Yod-Samuel Martin and Antonio Kung. 2018. Methods and tools for GDPR compliance through privacy and data protection engineering. In IEEE European Symposium on Security and Privacy Workshops. IEEE, 108–111.

[82] J M Valdez Mendia and J J A. Flores-Cuatle. 2022. Toward customer hyper-personalization experience — A Data-driven approach. Cogent Business & Management 9, 1 (2022), 2041384. https://doi.org/10.1080/23311975.2022.2041384

[83] Dan Milmo and Lisa O'Carroll. 2023. Facebook owner Meta fined €1.2bn for mishandling user information. The Guardian. https://www.theguardian.com/technology/2023/may/22/facebook-fined-mishandling-user-information-ireland-eu-meta

[84] Rene Moquin and Robin L Wakefield. 2016. The roles of awareness, sanctions, and ethics in software compliance. Journal of Computer Info. Sys. 56, 3 (2016).

[85] Frank Nagle, James Dana, Jennifer Hoffman, Steven Randazzo, and Yanuo Zhou. 2022. Census II of Free and Open Source Software - Application Libraries. Linux Foundation, Harvard Laboratory for Innovation Science (LISH) and Open Source Security Foundation (OpenSSF) 80 (2022).

[86] Chinenye Okafor et al. 2022. Sok: Analysis of software supply chain security by establishing secure design properties. In ACM SCORED Workshop. 15–24.

[87] Kang-il Park and Bonita Sharif. 2021. Assessing perceived sentiment in pull requests with emojis: evidence from tool and developer eye movements. In 2021 IEEE/ACM Sixth International Workshop on Emotion Awareness in Software Engineering (SEmotion). IEEE, 1–6.

[88] Luca Pascarella, Davide Spadini, et al. 2018. Information needs in contemporary code review. Proc. of the ACM on Human-Computer Interaction: CSCW (2018).

[89] Cole S Peterson, Jonathan A Saddler, Natalie M Halavick, and Bonita Sharif. 2019. A gaze-based exploratory study on the information seeking behavior of developers on stack overflow. In CI 16.

[90] Daniel Pletea, Bogdan Vasilescu, and Alexander Serebrenik. 2014. Security and emotion: sentiment analysis of security discussions on github. In Proceedings of the 11th working conference on mining software repositories. 348–351.

[91] Supreeth Shastri et al. 2020. Understanding and benchmarking the impact of GDPR on database systems. VLDB 13, 7 (2020), 1064–1077.

[92] Jeremy Siek and Walid Taha. 2007. Gradual typing for objects. In European Conference on Object-oriented Programming. Springer, 2–27.

[93] Devarshi Singh et al. 2017. Evaluating how static analysis tools can reduce code review effort. In 2017 IEEE symposium on visual languages and human-centric computing (VL/HCC). IEEE, 191105.

[94] Sean Sirur, Jason R.C. Nurse, and Helena Webb. 2018. Are We There Yet? Understanding the Challenges Faced in Complying with the General Data Protection Regulation (GDPR). In 2nd International Workshop on Multimedia and Security (MMSec). Springer, 116.

[95] Ian Sommerville. 2011. Software Engineering, 9/E. Pearson Education India.

[96] Jeff South. 2018. More than 1,000 U.S. news sites are still unavailable in Europe, two months after GDPR took effect. Nieman Lab. https://www.niemanlab.org/2018/08/more-than-1000-us-news-sites-are-still-unavailable-in-europe-two-months-after-gdpr-took-effect/

[97] Richard Sproat, Alan W Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christina D Richards. 2001. Normalization of non-standard words. Computer speech & language 15, 3 (2001), 287–333.

[98] David Stokes. 2012. 21 - Validation and regulatory compliance of free/open source software. In Open Source Software in Life Science Research, Lee Harland and Mark Forster (Eds.). Woodhead Publishing, 481–504.

[99] Joanna Strycharz, Jef Ausloos, and Natali Helberger. 2020. Data protection or data frustration? Individual perceptions and attitudes towards the GDPR. Eur. Data Prot. L. Rev. 6 (2020), 407.

[100] Synopsys. 2023. Open Source Security and Risk Analysis Report. https://www.pwc.com/us/en/services/consulting/library/gdpr-readiness.html

[101] Aurelia Tamò-Larrieux. 2018. Privacy by Design for the Internet of Things: A Startup Scenario. Designing for Privacy and its Legal Framework: Data Protection by Design and Default for the Internet of Things (2018), 203–226.

[102] Neil Thurman. 2020. Many EU visitors shut out of US sites in response to GDPR never came back. Reuters Institute for the Study of Journalism. https://reutersinstitute.politics.ox.ac.uk/news/many-eu-visitors-shut-out-us-sites-

[103] Serj Tubin. 2023. GDPR stuff. https://github.com/2beens/serj-tubin-vue/pull/71. GitHub repository: 2beens/serj-tubin-vue.

[104] UNCTAD. 2021. Data Protection and Privacy Legislation Worldwide. United Nations Conference on Trade and Development (2021). https://unctad.org/page/data-protection-and-privacy-legislation-worldwide

[105] Christine Utz, Martin Degeling, Sascha Fahl, et al. 2019. (Un) informed consent: Studying GDPR consent notices in the field. In ACM SIGSAC Conference on Computer and Communications Security (CCS). 973–990.

[106] N. van Dijk, A. Tanas, K. Rommetveit, and C. Raab. 2018. Right engineering? the redesign of privacy and Personal Data Protection. International Review of Law, Computers & Technology 32, 2–3 (Apr 2018), 230–256. https://doi.org/10.1080/13600069.2014.1575022

[107] Ana Vazão, Leonel Santos, Maria Beatriz Piedade, and Carlos Rabadao. 2019. SIEM open source solutions: a comparative study. In 2019 14th Iberian Conference on Information Systems and Technologies (CISTE). IEEE, 1–5.

[108] Denis Verdon. 2006. Security policies and the software developer. IEEE Security & Privacy 4, 4 (2006), 42–49.

[109] Branka Vuleta. 2023. 10 unbelievable GDPR statistics in 2023. https://legaljobs.io/blog/gdpr-statistics/

[110] Yue Wang, Irene L Manotas Gutiérrez, Kristina Winbladh, and Hui Fang. 2013. Automatic detection of ambiguous terminology for software requirements. In 18th International Conference on Applications of Natural Language to Information Retrieval (NAACL-HLT). Association for Computational Linguistics, 75–85.

[111] R Kent Weaver. 2015. Getting people to behave: Research lessons for policy makers. Public Administration Review 75, 6 (2015), 806–816.

[112] Krzysztof Wnuk, Tony Gorschek, and Showayb Zahda. 2013. Obsolete software requirements. Information and Software Technology 55, 6 (2013), 921–940.

[114] Christopher Wylie. 2019. How I Helped Hack Democracy. New York Magazine. https://nymag.com/intelligencer/2019/10/book-excerpt-mindf-ck-by-christopher-wylie.html

[115] Christopher Wylie. 2019. I Made Steve Bannon's Psychological Warfare Tool: Meet the Cambridge Analytica Whistle-blower. New York Magazine. https://nymag.com/intelligencer/2019/10/book-excerpt-mindf-ck-by-christopher-wylie.html

[116] Haoxiang Zhang, Shaowei Wang, Tse-Hsun Chen, Ying Zou, and Ahmed E Hassan. 2019. An empirical study of obsolete answers on stack overflow. IEEE Transactions on Software Engineering 47, 4 (2019), 850–862.