{"id": "4e11c90bd2c57ea0dc4412b757d19b678999bec5", "text": "Code Reuse in Stack Overflow and Popular Open Source Java Projects\n\nAdriaan Lotter \nDepartment of Information Science \nUniversity of Otago \nDunedin, New Zealand \nadriaan.lotter@otago.ac.nz\n\nSherlock A. Licorish \nDepartment of Information Science \nUniversity of Otago \nDunedin, New Zealand \nsherlock.licorish@otago.ac.nz\n\nSarah Meldrum \nDepartment of Information Science \nUniversity of Otago \nDunedin, New Zealand \nsarah-meldrum@outlook.com\n\nBastin Tony Roy Savarimuthu \nDepartment of Information Science \nUniversity of Otago \nDunedin, New Zealand \ntony.savarimuthu@otago.ac.nz\n\nAbstract\u2014Solutions provided in Question and Answer (Q&A) websites such as Stack Overflow are regularly used in Open Source Software (OSS). However, many developers are unaware that both Stack Overflow and OSS are governed by licenses. Hence, developers reusing code from Stack Overflow for their OSS projects may violate licensing agreements if their attributions are not correct. Additionally, if code migrates from one OSS through Stack Overflow to another OSS, then complex licensing issues are likely to exist. Such forms of software reuse also have implications for future software maintenance, particularly where developers have poor understanding of copied code. This paper investigates code reuse between these two platforms (i.e., Stack Overflow and OSS), with the aim of providing insights into this issue. This study mined 151,946 Java code snippets from Stack Overflow, 16,617 Java files from 12 of the top weekly listed projects on SourceForge and GitHub, and 39,616 Java files from the top 20 most popular Java projects on SourceForge. Our analyses were aimed at finding the number of clones (indicating reuse) (a) within Stack Overflow posts, (b) between Stack Overflow and popular Java OSS projects, and (c) between the projects. Outcomes reveal that there was up to 3.3% code reuse within Stack Overflow, while 1.8% of Stack Overflow code was reused in recent popular Java projects and 2.3% in those projects that were more established. Reuse across projects was much higher, accounting for as much as 77.2%. Our outcomes have implication for strategies aimed at introducing strict quality assurance measures to ensure the appropriateness of code reuse, and licensing requirements awareness.\n\nKeywords\u2014Code reuse, Stack Overflow, Java projects, OSS, Q&A, Quality\n\nI. INTRODUCTION\n\nQuality plays a fundamental role in software success [30]. Thus, quality standards have been developed to provide guidance for software developers, covering the requirements for producing high quality, defect-free software [30, 31]. Under the ISO-9126 quality model, for example, it is stated that the quality requirements for software should cover efficiency, functionality, reliability, usability, reusability, and maintainability [9]. Such standards have also been the subject of previous academic studies (e.g., Singh et al. [22]).\n\nWith quality as an underlying motivator for instilling good software development practices while creating software, developers should be particularly conscious when employing code reuse from external sources (e.g., from open source (OS) portals) [29], which may impact software efficiency, functionality, reliability, usability, and maintainability. While code reuse allows for previously tested and quality-assured code to be implemented in a system, reusing code from untrusted sources may lead to system harm [16]. 
The implications of code reuse could be particularly significant for software maintainability, as poor knowledge of reused code at the time of software development will likely create challenges for future corrective and perfective actions. As discussed in Roy et al. [40], understanding the levels of reuse and cloning could be valuable for developers in terms of assisting with issues related to plagiarism, software evolution, debugging, code compaction, and security. Furthermore, Kashima et al. [36] noted that there are several OSS licenses that require software outcomes derived from original solutions to be published under the same license. This demands that developers are aware of the legal implications of the licenses under which OSS and code posted on other portals (such as Stack Overflow) are published. Additionally, businesses also need to be aware of the reuse occurring within outsourced development [20], as under these conditions they may face future legal challenges.\n\nCode reuse is formally defined as \u201cthe use of existing software or software knowledge to construct new software\u201d [15]. It is prevalent in many software, including those produced by top-tier software development companies such as Google [34]. Beyond such industry leaders, code reuse has been found to be exceptionally common in Mobile Apps, with some of these products consisting entirely of reused elements [13]. This high level of reuse seen in the practice of developers stems from the benefits it provides in terms of easily adding and enhancing system features [25]. The accessibility of readily available solutions to coding problems is highly attractive to both novice and experienced programmers [25]. In fact, in a study by Sojer et al. [21], the responses from 869 developers confirmed that they consider ad hoc reuse of code from the internet to be important for their work. Similarly, Heinemann et al. [18] also found that 90% of the OS projects they analyzed contained reused code, reiterating the point that code reuse is found extensively in many software systems.\n\nThe ease and attractiveness of code reuse has been particularly aided by readily accessible code fragments on Q&A websites, such as Stack Overflow. Stack Overflow is a very popular Q&A website which allows members of the public to post development related questions and/or answers, with the answers often containing code fragments.\n\n1 http://www.stackoverflow.com\nRecent evidence shows that the majority of the questions that are asked on Stack Overflow usually receive one or more answers [6], and this forum is often a substitute for official programming languages\u2019 tutorials and guides [24].\n\nWith both implications for software maintainability and licensing when reusing Stack Overflow code fragments, of interest to us is the potential effects reusing code from this portal could have on effort for future changes and correct use of license to avoid future legal issues. The aim of this paper is thus to investigate the levels of code reuse within Stack Overflow, and between Stack Overflow and OSS projects. We focus on the Java programming language, given its popularity 2, and the need to understand reuse beyond Python (Yang et al. [8]). With a strong body of knowledge around the scale of developers\u2019 reuse practices, team leaders may begin to introduce stricter quality assurance measures to ensure the appropriateness of reused code fragments. We thus answer five research questions in our portfolio of work. 
Firstly, we explore, what is the extent of Java code reuse within Stack Overflow? to understand how the community operates as an ecosystem in the provision of self-support (RQ1). Related to this question, we next explore, what is the extent of code reuse between answers published under the same question in Stack Overflow? to understand the degree of innovation (or lack thereof) that is prevalent on this platform (RQ2). Answers to these two questions are particularly useful for the software engineering community as within-source code migration is likely to increase the risk of incorrect author attribution, due to (a) having more copies in existence, and (b) increasing the number of \u2018steps\u2019 a piece of code could have taken from its origin to where it was found. This could in turn lead to unsuspecting license violations for those implementing these code snippets in OSS.\n\nOur third research question, what is the extent of code reuse between Stack Overflow and the current most popular Open Source Java Projects? helps us to understand recent code reuse trends (RQ3). Related to this research question we examine, what is the extent of code reuse between Stack Overflow and the all-time most popular Open Source Java projects? to understand software practitioners\u2019 behavior to code reuse over time (RQ4). Additionally, we answer, are there differences in the nature of reuse found between the different contexts in terms of scale and size? to provide deeper evidence for the nature and ranges of code reuse between Stack Overflow and OSS projects (RQ5). Beyond understanding the extent of code reuse (or clones) existing between OSS and Stack Overflow, it is important to understand how practitioners\u2019 attitude towards this practice has changed over time. Our investigation led by the latter three questions will provide initial evidence on the extent of code reuse between projects developed more recently and those having existed for longer.\n\nThe remaining sections of this paper are organized as follows. We provide our study background in Section 2. We next provide our research setting in Section 3, before providing our results in Section 4. We then discuss our findings and their implications in Section 5, prior to considering threats to the study in Section 6. Finally, we provide concluding remarks and point to future research in Section 7.\n\nII. BACKGROUND\n\nSoftware practitioners would benefit from developing maintainable software systems that are free of code license violations, and thus, code reuse should be given serious consideration during development. Both of these topics (i.e., software maintenance and license) have been investigated to various extents, and their importance has been widely noted in the literature. Firstly, the maintainability of a software system is highly significant to all its stakeholders, especially when considering project lead-times and costs [9]. Maintainability refers to the likelihood of performing software improvements in a given period, and is said to become more difficult with the prevalence of code reuse [32]. Kamiya et al. [32] established that code reuse could introduce multiple points of failure if code fragments are \u2018buggy\u2019, and in fact, it has been noted that approximately half of the changes made to code clone groups are inconsistent [15].\n\nThe issue of code reuse and maintainability becomes more complex when the reused code is sourced from external sources (e.g., Stack Overflow). 
This is due to potential code incompatibility issues and sub-optimal solutions, which are often tied to a lack of developer understanding. Also, code fragments provided on Stack Overflow are largely written for accompanying some textual explanation, and not for immediate use as such. In fact, for many developers, online sources such as Stack Overflow are of utility, when they are faced with issues that require knowledge they do not possess. This brings into question their likely understanding of such code, which in turn brings into question the software\u2019s quality. Furthermore, security complications may arise, as evidence has shown that Stack Overflow portal includes insecure code [10].\n\nAn example of how catastrophic code reuse could be is illustrated by Bi [11]. This author shows that a piece of Stack Overflow code was used in the NissanConnect EV mobile app, which accidentally displayed a piece of text reading \u201cApp explanation: the spirit of stack overflow is coders helping coders\u201d. This example illustrates that code reused from Stack Overflow and other similar portals are not always examined thoroughly. Although this example illustrates a non-threatening issue, many similar cases could introduce security and functionality-related problems if not inspected properly. Thus, it is important to investigate and understand the extent of code reuse occurring between software systems and online code resources such as Stack Overflow.\n\nRecently, several research studies have been conducted on the topic of code reuse and Stack Overflow. For instance, Abdalkareem et al. [25] investigated code reused from Stack Overflow in Mobile Apps and found that 1.3% of the Apps they sampled were constructed from Stack Overflow posts. They also discovered that mid-aged and older Apps contained Stack Overflow code introduced later in their lifetime. An et al. [19] also investigated Android Apps and found that 62 out of 399 (15.5%) Apps contained exact code clones; and of the 62 Apps, 60 had potential license violations. In terms of Stack Overflow, they discovered that 1,226 posts contained code found in 68 Apps. Furthermore,\n126 snippets were involved in code migration, where 12 cases of migration involved Apps published under different licenses. Yang et al. [8] noted that, in terms of Python projects, over 1% of code blocks in their token form exist in both GitHub and Stack Overflow. At an 80% similarity threshold, over 1.1% of code blocks in GitHub were similar to those in Stack Overflow, and 2% of Stack Overflow code blocks were similar to those in GitHub.\n\nIn terms of attribution, in ensuring conformance to license requirements, Baltes et al. [27] found that 7.3% of popular repositories on GitHub contained a reference to Stack Overflow. In the context of Java projects, a minimum of two thirds containing copied code did not contain a reference to Stack Overflow. Additionally, only 32% of surveyed developers were aware of the attribution requirements of Stack Overflow. This could result in complicated legal issues for developers. In fact, the study of licensing violations is also the subject of previous research [4, 23, 21]. It has been noted that license violations occur frequently in OS projects [4], as well as in Q&A websites such as Stack Overflow, where the community itself has inquired about the issue [3]. As stated by German et al. [7], it is illegal for code fragments from one system to be implemented in another if their licenses are incompatible. 
As such, developers are required to be cautious with their work and should be aware of the legal consequences involved with code reuse from internet sources. Although license violations do not have direct implications for quality, they do pose potential legal problems, which could result in the removal of software and court costs. Additionally, from a software development perspective, licensing issues could result in further costs to resolve complications, implement system changes, and repair reputational damage.

Stack Overflow is covered under the CC BY-SA 3.0: Creative Commons Attribution-ShareAlike 3.0 license [2], and as such, developers have the right to transform and build upon the content on Stack Overflow. However, new software using Stack Overflow code must be distributed under the same license as the original. Furthermore, credit must be given to the specific answer on Stack Overflow, a link to the license must be provided, and the developer should specify if they introduced changes. Noticeably, code reuse from Stack Overflow has been shown to exist in various OSS projects, with varying levels of reuse. The reused code, however, is not often acknowledged, and the lack of attribution results in license violations in many projects [3, 25]. As such, additional research is required to both validate and extend the current literature. We pursue this line of work in this study, in answering our five research questions (RQ1-RQ5 stated earlier).

III. RESEARCH SETTING

A. Data Collection and Processing

To address the research questions posed, three sets of data were extracted, comprising Stack Overflow code snippets and two sets of OSS projects' source code. For the purpose of this study each dataset was required to contain only Java files. To collect the necessary data, we utilized the Stack Overflow data dump, SourceForge, and GitHub. A key motivator for selecting these sources was their popularity in the programming community and their open access to data.

The projects selected from SourceForge and GitHub were all chosen based on popularity (both weekly and all time), resulting in the selection of projects that were widely used and contributed to. As such, we believe that the effects of code reuse would be more significant for these projects than for less popular ones.

Stack Overflow Java Snippets: The Java 'snippets' from Stack Overflow were extracted using the data explorer function to create the first dataset. Answer posts were selected based on having at least one "<code>" tag and were filtered on the language Java. Of these answers, only those marked as accepted answers were kept, on the premise that such snippets will be trusted and thus reused. As a final filter, only answers from 2014 to 2017 were selected to ensure relevancy. This resulted in 117,526 answers. These answers were then separated into individual code snippets, based on each being enclosed within "<code>…</code>" tags. This resulted in 404,799 individual code snippets. Of these snippets, only those with more than one line of code were selected. Ultimately, 151,954 code snippets were extracted and saved as Java files, and 151,946 were analyzed since eight returned errors when they were processed. The sketch below illustrates this splitting and filtering step.
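The following is a minimal sketch of this post-processing step, given only for illustration: the class name, the file-naming scheme, and the regular expression over "<code>" tags are our own assumptions rather than the exact script used in the study, and the answer bodies are assumed to have been exported beforehand (e.g., via the Stack Exchange Data Explorer).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative sketch of splitting an accepted-answer body into individual
 * code snippets, keeping only snippets with more than one line, and saving
 * each one as a .java file. HTML entity decoding is omitted for brevity.
 */
public class SnippetExtractor {

    // Captures the content of every <code>...</code> block in an answer body.
    private static final Pattern CODE_BLOCK =
            Pattern.compile("<code>(.*?)</code>", Pattern.DOTALL);

    /** Returns the content of every <code> block found in the given answer body. */
    public static List<String> extractSnippets(String answerBody) {
        List<String> snippets = new ArrayList<>();
        Matcher matcher = CODE_BLOCK.matcher(answerBody);
        while (matcher.find()) {
            snippets.add(matcher.group(1));
        }
        return snippets;
    }

    /** Keeps only snippets that span more than one line of code. */
    public static List<String> keepMultiLine(List<String> snippets) {
        List<String> kept = new ArrayList<>();
        for (String snippet : snippets) {
            if (snippet.strip().lines().count() > 1) {
                kept.add(snippet);
            }
        }
        return kept;
    }

    /** Writes each retained snippet to its own .java file, e.g. answer42_0.java. */
    public static void saveAsJavaFiles(long answerId, List<String> snippets, Path outDir)
            throws IOException {
        Files.createDirectories(outDir);
        for (int i = 0; i < snippets.size(); i++) {
            Path file = outDir.resolve("answer" + answerId + "_" + i + ".java");
            Files.writeString(file, snippets.get(i), StandardCharsets.UTF_8);
        }
    }
}
```

For an answer whose body contains three "<code>" blocks, one of which spans a single line, a sketch of this kind would write two .java files.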
Top Weekly OSS Projects: The second dataset comprised the projects with the greatest weekly popularity, with the specific week of sourcing starting on December 18, 2017. We extracted the top 10 weekly Java projects on SourceForge and GitHub. This resulted in a preliminary sample of 20 projects, in line with previous research by Heinemann et al. [18] on Open Source Java projects. Each of these projects was investigated, and those containing at least one Java file were selected. Ultimately, 12 suitable projects were selected for the analysis, containing a total of 16,617 Java files. Five files returned errors during processing, as reported in Table III.

All Time Most Popular OSS Projects: The final dataset covered the projects with the highest all-time popularity on SourceForge. As above, the top 20 projects were selected, and 16 were appropriate for the analysis (i.e., contained at least one Java source file). We did not extract projects from GitHub in this round given the richness of the projects extracted from SourceForge. The projects were filtered on popularity as well as on containing Java code; the four projects in the subset that did not contain Java files were excluded, leaving 39,616 files. The final list of projects and their summaries can be found in Table IV, with 39,558 Java files being used in our analyses after processing.

B. Tools and Techniques

To answer our research questions an appropriate clone (reuse) detection tool was required. We conducted a review of several tools, including NiCad [14], SourcererCC [12] and CCFinderX [32]. We selected CCFinderX given its performance and popularity among researchers [5, 25, 32]. Its token-based technique for clone detection is computationally more efficient than alternative methods, it has a high recall rate, and it is able to detect all hidden clones [5]. As discussed by Kamiya et al. [32], the software works by employing a lexical analyzer to create token sequences, after which it applies rule-based transformations to these sequences (based on the specific programming language). The lexical analyzer transforms sequences of characters into sequences of tokens, which are word-like entities [33]. These entities can be identifiers, keywords, numbers, literals, operators, separators, or comments [33, 1]. The matching of clones is then computed using a suffix-tree algorithm, "in which the clone information is represented as a tree with sharing nodes for leading identical subsequences and the clone detection is performed by searching the leading nodes on the tree" [33].

When utilizing CCFinderX for the analyses, several parameters were configured. We followed previous recommendations and used CCFinderX's default settings [25]. The minimum clone length, representing the absolute count of tokens, was set at its default value of 50. As such, code blocks were only considered if they contained at least 50 tokens. Additionally, the minimum unique token set value was kept at its default of 12; hence, code blocks were only considered if they contained at least 12 unique tokens in addition to meeting the absolute minimum count of 50 tokens. The shaper level was also set at its default of 2. The shaper prevents a code block from being considered a candidate clone if an outer block '}' splits the token sequence. The final two parameters were the 'P-match application' and the 'Pre-screening application'. The P-match application parameter is ticked by default, and denotes that variables and function names are not replaced with special characters. The Pre-screening application, which when ticked filters outcomes containing visually too many code clones, was left at its default (not ticked), as we wanted to retain all clone instances. A simplified illustration of the two token thresholds is given below.
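The sketch below only illustrates the two default thresholds described above; it is not CCFinderX's implementation, the class and method names are our own, and the token list is assumed to be the output of a prior lexical analysis step.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Simplified illustration of the two default token thresholds used in our
 * CCFinderX runs: a candidate code block must contain at least 50 tokens in
 * total and at least 12 distinct tokens.
 */
public class CloneCandidateFilter {

    private static final int MIN_CLONE_LENGTH = 50;  // default minimum clone length (tokens)
    private static final int MIN_UNIQUE_TOKENS = 12; // default minimum unique token set

    /** Returns true if the token sequence of a code block satisfies both thresholds. */
    public static boolean isCandidate(List<String> tokens) {
        if (tokens.size() < MIN_CLONE_LENGTH) {
            return false; // too short to be considered a candidate clone
        }
        Set<String> distinctTokens = new HashSet<>(tokens);
        return distinctTokens.size() >= MIN_UNIQUE_TOKENS; // requires enough lexical variety
    }
}
```

Under these rules, a 60-token block that merely repeats a handful of symbols (for example, a long run of array initializer values) would fail the unique-token check, which is the kind of trivially repetitive block this threshold is intended to exclude.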
The output from CCFinderX includes both file metrics and clone metrics. The file metrics provide file-level insights into the data, whereas the clone metrics provide information regarding clone sets. One set exists for each unique group of clones; as such, a clone set will contain a minimum of two code blocks. Additionally, we were able to identify the number of files containing clones and the clone-sets present in different files in the data (refer to Figure 1 for an example).

In order to determine the extent of code reuse occurring within files, between files, and between projects/datasets, the Radius metric (RAD) of CCFinderX was utilized.

After performing the analysis, clone-sets were selected based on their specific RAD values, and in turn these were used to select the individual files involved. The RAD metric, as defined by Kamiya et al. [32], gives an indication of the maximum distance to a common directory between the files involved in a clone-set. As such, clones found within the same file will have a Radius of 0, clones found between two files in the same directory will have a Radius of 1, and so on.

C. Measures for Answering RQs

To answer the first four research questions posed (RQ1-RQ4), five analyses were performed. These analyses involve calculating the following metrics: Firstly, the number of files containing at least one clone is computed. Secondly, using the previous measure, we computed the percentage of files containing clones. This allows us to compare our results with those from similar studies, such as that of Yang et al. [8]. Thirdly, by summing the population variable (pop) of each clone-set we identified the total number of clones present in the files. Fourthly, the total number of clone-sets reveals all unique clones. Fifthly, among these clone-sets we identified which clones involved more than one file.

To answer **RQ1. What is the extent of Java code reuse within Stack Overflow?**, all Stack Overflow files were stored in the same directory when CCFinderX was executed, and as such a Radius of 1 was used to identify between-file clone-sets. Answering the second research question (**RQ2. What is the extent of code reuse between answers published under the same question in Stack Overflow?**) required Stack Overflow files to be stored in separate directories based on the questions under which they were posted. As such, a Radius of 1 would indicate that clones exist between answers to the same question, and a Radius of 2 would indicate that clones exist between answers to separate questions. Having a Radius of 2, however, does not imply that intra-question clones (i.e., clones under the same question) do not exist; it simply implies that a clone is also found between questions. This can hide intra-question clones, and as such a manual inspection was performed on the clone-sets with a Radius of 2 to identify intra-question clones hidden by the maximum Radius value. Figures 2 and 3 demonstrate the situation, where both cases have a Radius of 2, but only one (Figure 2) has an intra-question clone. The code piece, denoted by 'A', is found under the same question (Question 1).

To answer the third (**RQ3. What is the extent of code reuse between Stack Overflow and the current most popular Open Source Java Projects?**) and fourth (**RQ4. 
What is the extent of code reuse between Stack Overflow and the all-time most popular Open Source Java projects?**) questions each project\u2019s files were extracted and saved under the same directory. Furthermore, the Stack Overflow files were saved two directories away, which allowed us to identify clone-sets with clones found between Stack Overflow and a project(s) using a Radius value of 2. The primary measurements required to answer the research questions includes the total number of files containing at least one clone, the total number of clones present in these files, and\nthe number of unique clones. **RQ5. Are there differences in the nature of reuse found between the different contexts in terms of scale and size?** was answered through follow up statistical analyses involving the outcomes above.\n\n\n\n### D. Reliability Checks\n\nTo ensure that the results obtained from our analyses were reliable, we conducted a manual investigation of 60 clone pairs detected by CCFinderX. Initially author AL (first author) performed the checks, which were then discussed with author SAL (second author) who triangulated the outcomes and provided confirmation. Within the sample of the 60 clone pairs, 20 were randomly obtained from the Stack Overflow analysis in Section IV (A), 20 from Section IV (C \u2013 a), and 20 from Section IV (D \u2013 a). For each selected clone-pair, it was determined to what extent the two pieces of code were similar, and the nature of the code was also recorded (i.e., is it a class, method, or piece of code within a method that was detected as being a clone). The extent to which clones were similar was rated either \u2018Exact\u2019, \u2018High\u2019, or \u2018Medium\u2019. For those rated as \u2018Exact\u2019, the code in question would be identical copies, including all identifiers, the structure, and the functionality. For those rated as \u2018High\u2019, the primary difference between the two pieces of code would be the identifiers. Finally, those ranked as \u2018Medium\u2019 were considered to still be similar in structure, although identifiers, minor pieces of data structures, and minor pieces of functionality may be different. The results from the analyses are given in Tables I and II, where Table I reflects the number of clone pairs considered similar to a given extent, and Table II displays the nature of code elements detected in the sample.\n\n#### TABLE I. MANUAL CHECK OF DETECTED CLONE SIMILARITY\n\n| Similarity | SO | SO & All Time Most Popular | SO & Current Most Popular | Total |\n|------------|----|-----------------------------|---------------------------|-------|\n| Exact | 10 | 6 | 3 | 19 |\n| High | 10 | 12 | 13 | 35 |\n| Medium | 0 | 2 | 4 | 6 |\n\n#### TABLE II. CODE CLONES ELEMENTS\n\n| Nature of Code Element | SO | SO & All Time Most Popular | SO & Current Most Popular | Total |\n|------------------------|----|-----------------------------|---------------------------|-------|\n| Class | 5 | 0 | 1 | 6 |\n| Method | 5 | 6 | 8 | 19 |\n| Part of Method | 10 | 14 | 11 | 35 |\n\nOur results show that it is highly plausible that these pieces of code could have been copied directly, or at least have been adapted to fit the software in question (refer to Table I for details). Furthermore, Table II shows that the majority of clones were code found within methods. Thus, it appears that if a developer is to copy a piece of code from Stack Overflow, then it is likely that this code would provide some additional functionality to a method.\n\n### IV. 
RESULTS\n\n#### A. Java Code Reuse within Stack Overflow (RQ1)\n\nOur analysis of the Stack Overflow files revealed that, overall, 5,041 files (out of 151,946) contained at least one clone (or were reused). Thus, 3.3% of Stack Overflow Java code snippets have a duplicate found elsewhere in Stack Overflow. Furthermore, it was observed that within the 5,041 files, a total of 8,786 clones were present, indicating that some contained multiple clones. In terms of clone sets, 3,530 unique code snippets were observed to have clones. However, when focusing on clones found in at least two files, this number reduced to 2,338. As a result, we are able to determine that there were potentially 2,338 unique license violations existing within the Stack Overflow files extracted (refer to Section II for Stack Overflow licensing requirements), and that these cumulatively appear in 5,863 places. The additional 1,192 (i.e., 3530 minus 2338) unique clones were found within the same files, and as such, do not present potential license violations as they are contained within the same answers by the same author.\n\n#### B. Java Code Reuse between Answers on Stack Overflow (RQ2)\n\nTo further investigate code reuse within Stack Overflow we also looked at the amount of reuse occurring within answers given to the same questions. Our analyses reveal that of the 151,946 Stack Overflow files 2,666 contained clones found under the same question. This equates to 1.8% of the total files, and implies that this amount of snippets had at least one clone (code duplication) published under the same question. Within these 2,666 files, a total of 3,559 clones were found, again indicating that some answers contained multiple clones. Out of the 3,559 clones discovered, the number of unique clones were found to be 1,763. Additionally, from the 2,666 Stack Overflow files containing clones, we were able to identify that they were present in the answers in responses to 1,207 unique questions (out of 46,082 in total). Hence, 2.6% of Java related questions on Stack Overflow can be expected to contain two or more answers with the same code.\n\n#### C. Code Reuse between Stack Overflow and Current Popular Projects (RQ3)\n\n**a) Stack Overflow and Project Reuse Analysis:** The analysis of the Stack Overflow and our top weekly OSS projects revealed that 12,763 files (out of 168,558; five project files were removed by CCFinderX\u2019s due to errors) contained at least one clone. Based on this result we observed that 7.6% of the files under consideration contain at least one clone. Of the 12,763 files, a total of 5,447 of these were Stack Overflow files (out of 151,946 files), and 7,316 were top weekly OSS project files (out of 16,612 files). This indicates that when introducing the project files, 406 additional Stack Overflow files contain clones (refer to Section IV (A)). This implies that these 406 Stack Overflow files contain code that is not found anywhere else on Stack Overflow, with the clones being solely between Stack Overflow and at least one project. Additionally, the project files with clones account for 44% of the total project files, which, as a proportion is much greater than that of the Stack Overflow files (just 3.3%). This is primarily believed to be a\nresult of the size of the project files, with their average token size being 617, compared to a much smaller 48 for the Stack Overflow files. We performed further probing of the data, observing that in the 12,763 files containing at least one clone, 21,893 clone sets existed. 
In other words, there were 21,893 unique code snippets which have at least one clone. Of these, a smaller number of clone sets contained clones found in both Stack Overflow and top weekly OSS project files. This figure is 223, indicating that 223 unique code snippets are found between the Stack Overflow and project files. These clones cumulatively appear in 1,627 files (1.0% of 168,558), with each appearing in an average of 7.3 files. In total, these 223 unique code snippets appear 1,995 times.\n\nb) Inter-Project Reuse Analysis: Of the 12,763 files containing clones, a total of 75,959 clones were discovered within these. When the project files were analyzed independently, it was found that 7,287 (i.e., 57%) of the project files contained clones among themselves, giving an average of 2,979.1 clones per project (as depicted in Table III). Additionally, when investigating clone-sets, we observe 212 clones in at least two projects, with these appearing 1,995 times. Further probing also revealed that 29 project files (out of 7,316) contained clones that are only found in Stack Overflow files, and not in any other project. In other words, these 29 clones are found in a one to one fashion between one project and Stack Overflow, and as such they are most likely to have migrated directly between Stack Overflow and a project, since there is no evidence of these originating internal to the project. The direction of this migration, however, is not known, although independent of these situations, our reliability checks above show that there were no attributions, and thus, licensing issues could arise.\n\nD. Code Reuse between Stack Overflow and All-Time Most Popular Projects (RQ4)\n\na) Stack Overflow and Project Reuse Analysis: The analysis of the Stack Overflow and all time most popular Java projects revealed that overall 24,537 files (out of 191,504; 58 project files were removed by CCFinderX\u2019s due to errors) contained at least one clone. Based on this result we observe that approximately 12.8% of the files under question contain at least one clone. However, only 5,554 Stack Overflow files contained a clone, which is 513 more than when Stack Overflow files were considered on their own. On the other hand, 18,983 project files (out of 39,558 files) contained at least one clone, which is approximately 48% of the total project files. Again, it should be noted, that the average length of a project file was 652 tokens. Furthermore, of the 24,537 files containing at least one clone, 51,282 clone sets existed. In other words, there were 51,282 unique code snippets which had at least one clone. Of these, a smaller number of clone sets contain clones found in both Stack Overflow and the projects. This figure is 450, indicating that 450 unique code snippets were found between the Stack Overflow and project files. These clones cumulatively appear 4,334 times (2.3% of 191,504), or in 6.4 files on average.\n\nb) Inter-Project Reuse Analysis: Within the 24,537 files a total of 245,750 clones were discovered. Additionally, when analyzed independently, it was found that 18,935 of the project files contained clones among themselves (i.e., 77.2%), giving an average of 9,186.9 clones per project (as depicted in Table IV). Additionally, when investigating clone-sets, it was found that 726 clones were found it at least two projects, with these appearing 6,377 times. We noticed that 48 project files (out of 18,983) contained clones that are only found in Stack Overflow files, and not in any other project. 
As above, these 48 files are found directly between one project and Stack Overflow, and as such is highly likely to have migrated directly between Stack Overflow and a project.\n\n| Project | Number of Java files | Average number of tokens/file | Number of clones | Number of files with clones/s |\n|------------------|----------------------|-------------------------------|-----------------|-------------------------------|\n| Awesome Java | 57 | 220 | 18 | 15 |\n| Leetcode | 1327 | 498.4 | 1996 | 486 |\n| Dubbo | 5576 | 936.7 | 13823 | 2869 |\n| Elastic-Search | 1018 | 120.5 | 320 | 189 |\n| Java Design | 3966 | 673 | 13674 | 2327 |\n| Patterns | 239 | 713.4 | 2906 | 140 |\n| Apache OpenOffice| 17 | 252.3 | 11 | 9 |\n| Proxeye | 3799 | 277.7 | 2356 | 1037 |\n| Qmui Android | 164 | 673.2 | 230 | 71 |\n| Sap NetWeaver | 252 | 286.6 | 187 | 89 |\n| Server Adapter | 1612 | 3836.8 | 35749 | 7216 |\n| for Eclipse | | | | |\n| Sefin | | | | |\n| Total | 13813 | 406.4 | 2979.1 | 609.7 |\n\nE. Contextual Differences in Scale and Size of Reuse (RQ5)\n\nIn addition to the findings above, the results displayed in Table V and Figure 4 show that the sizes of clones found within the various contexts are different. Of primary interest is the larger mean sizes of the clones within Stack Overflow (refer to boxplots in Figure 4-A, B). These larger sizes suggest that there is a likelihood of the clones detected being true positives, i.e., they are indeed evidence of reuse where entire snippets are copied. Additionally, the median and upper quartile of the top weekly Java projects clone sizes are greater than that of the other four contexts where project files were included. This is displayed in Figure 4, graphs C, D, E, and F; where D can be seen to have a greater median and upper quartile value. This indicates that newer projects are constructed to a greater extent from reused elements.\n\nIn Table V the average and maximum sizes of clones found within the various contexts are presented. Interestingly, the clones in terms of their maximum sizes are smaller for the two analyses looking at Stack Overflow and OSS projects together (277 and 324 respectively). As such, we can see that the code clones found between Stack Overflow and OSS projects are at most 324 tokens in length. However, when looking at inter-project clones, we notice that the maximum values are much higher, with the biggest clone consisting of 1,369 tokens. This suggests that code reuse between projects involves copying of larger pieces of code, including entire components. In contrast to this, Stack Overflow code usually provides smaller code snippets as answers to specific coding questions, and so, evidence here may be linked to this reality.\nTo test for statistically significant differences between the six groups of measures (refer to Table V), in terms of clone sizes, a Kruskal-Wallis test was performed. This test was selected as it is non-parametric in nature (i.e., does not assume that the data follows a Normal distribution), and it does not require sample sizes to be equivalent [28].\n\n### TABLE IV. 
SUMMARY OF THE ALL-TIME MOST POPULAR JAVA PROJECTS (INTER-PROJECT)\n\n| Project | Number of Java files | Average number of tokens/file | Number of clones | Number of files with clone(s) |\n|--------------------------|----------------------|-------------------------------|------------------|-------------------------------|\n| Angry IP Scanner | 219 | 397 | 102 | 48 |\n| Catacombae | 91 | 758.6 | 223 | 33 |\n| Cyclops Group | 2609 | 151.9 | 2545 | 1291 |\n| Eclipse Checkstyle Plug-in | 1708 | 319 | 3115 | 782 |\n| Freemind | 529 | 772 | 495 | 192 |\n| Hibernate | 2392 | 285.6 | 2148 | 627 |\n| Hitachi Vantara - Pentaho | 24494 | 673.2 | 112415 | 12008 |\n| Libjpeg-turbo | 12 | 2061.3 | 44 | 7 |\n| OpenCV | 148 | 1003.9 | 508 | 94 |\n| Sap NetWeaver Server Adapter for Eclipse | 239 | 713.4 | 2921 | 144 |\n| Sweet Home 3D | 233 | 2408.3 | 1476 | 142 |\n| TurboVNC | 245 | 886.5 | 495 | 114 |\n| Vuze \u2013 Azureus | 3639 | 750 | 5784 | 1461 |\n| Weka | 42 | 1505.1 | 66 | 21 |\n| Xtreme Download Manager | 155 | 806.4 | 468 | 71 |\n| Total | 39558 | 14592.2 | 146990 | 18983 |\n| Average/mean | 2472.4 | 912 | 9186.9 | 1186.4 |\n\n### TABLE V: CLONE SIZE STATISTICS\n\n| Data Group | Median | Mean | Max | Mean Rank |\n|--------------------------|--------|-------|-------|-----------|\n| A. Stack Overflow | 66 | 85.7 | 938 | 14869.7 |\n| B. Stack Overflow Intra-Answers | 69 | 87.2 | 938 | 15480.4 |\n| C. Stack Overflow and Top Weekly | 57 | 67.9 | 277 | 11014.3 |\n| D. Top Weekly | 60 | 84.3 | 774 | 13478.7 |\n| E. Stack Overflow and Top All Time | 58 | 71.2 | 324 | 11646.4 |\n| F. Top All Time | 58 | 69.2 | 1369 | 11392.1 |\n\nOur result reveals a statistically significant outcome (significance level = 0.05), providing evidence that our outcomes are different (H(5) = 1409, p <0.01). Given this finding we further examined the distributions for A, B, D in Table V against others (C, E, F) with post hoc Kruskal-Wallis tests. Outcomes confirm that there were significantly bigger clones (p <0.05) for Stack Overflow, Stack Overflow Intra-Answers and Top Weekly projects when compared to the other distributions. This, alongside the results in Table V and the boxplots in Figure 4, provide preliminary evidence that the nature of clones, in terms of their sizes, are different for different data sets. We thus plan further analyses to investigate why these differences exist.\n\n### V. DISCUSSION AND IMPLICATIONS\n\n**Discussion**: Quality is an important element in all software development projects. In particular, the quality of freely available software should be a key consideration for its users. However, the migration of code between OSS projects and online Q&A platforms complicates such assessments. Stack Overflow as a platform, for instance, often acts as a medium through which code migrates between many projects, and as such, the quality of the code in many projects is influenced by factors that are beyond the control of their programmers. Furthermore, OSS projects are often published under specific licenses, which adds an additional level of complexity in terms of understanding their availability for reuse. In fact, users of the code published on Q&A platforms often lack the required understanding of the code, which can have direct implications for quality management if such code is reused in software projects. In order to investigate the extent of code reuse in these situations we focused on Java code from Stack Overflow and popular OSS projects. 
Here we revisit our outcomes to answer our five research questions (RQ1-RQ5).\n\n**RQ1. What is the extent of Java code reuse within Stack Overflow?** Our results indicate that within Stack Overflow, approximately 3.3% of the Java code sampled have at least one clone elsewhere on the website. Additionally, we found that up to 2,338 unique license violations could be present within these answers. This evidence duplicates that of Python code, which also revealed a 3.3% duplication [8]. It should be noted, however, that the Python code examined in Yang et al.\u2019s [8] study was processed to remove the effects of white space and comments, which increase the performance of clone detection tools and lead to better comparisons. To this end, our outcomes is at best conservative, and so Java code reuse could be actually higher than 3.3% in Stack Overflow.\n\nThe results from our study, along with that of Yang et al. [8], indicate that code reuse is prevalent in Stack Overflow in both Java and Python contexts. The near identical results obtained by these two studies suggest that users and developers of the Stack Overflow platform should expect just over 3% of code on Stack Overflow to be duplicated. When considering the parameter settings for these code blocks to be considered candidate clones, it should be emphasized that these clones are of significant size (at least 50 tokens). Unlike many small snippets found on Stack Overflow, these clones meet the specified requirements set before the analysis, and as such, it is more likely that these code blocks are not clones by coincidence, rather they are reused. Hence, developers need to be cautious with reusing larger code blocks from Stack Overflow, and be prepared to rigorously evaluate such code before its usage. In addition,\ninstances of reuse demand proper attribution so that the community is aware of how Stack Overflow knowledge is recycled. We believe that a software tool could be of utility in terms of aiding developers wanting to evaluate the appropriateness of code for reuse, and also detecting exactly where such code originated from to help with correct attribution.\n\n**RQ2. What is the extent of code reuse between answers published under the same question in Stack Overflow?** We observed that 1.8% of all Java snippets (i.e., code in answers) have at least one clone within other answers provided for the same question. Our evidence also revealed that 2.6% of questions sampled contain at least one clone pair between its answers. Furthermore, there were 1,763 potential unique license violations in our sample data. As with insights provided in response to RQ1, this outcome has implication for developers using Stack Overflow code in terms of the need to be aware of the rate of code duplication within Stack Overflow. With an overall duplication rate of 3.3%, we notice that a significant proportion of this duplication refers to clones between answers in different questions. As a result, developers may not give attribution to the original authors. Furthermore, in cases where these code blocks have migrated from external sources, having duplicates within Stack Overflow may make it more difficult to find these original sources. Without complete knowledge of the origin of reused code, developers may publish their OSS under different licenses, which will result in license violations. In fact, given the conservative settings used for our analyses, we anticipate that the reuse rate for smaller code snippets may be much higher. 
As such, if duplicated code can be identified by Stack Overflow, then the process of identifying the most appropriate solution (code) may be expedited, since users will be able to avoid duplicated answers. Having repeated duplicate answers may also result in convoluted pages, which could lead to slower problem solving for developers.\n\n**RQ3. What is the extent of code reuse between Stack Overflow and the current most popular Open Source Java Projects?** Our evidence showed that between Stack Overflow and the top weekly Java projects, approximately 223 unique code snippets appeared in both sets of files. Between the Stack Overflow and project files, these snippets appeared in a total of 1,627 files. This evidence shows that, overall, 1.0% of the project files contain one of these Stack Overflow clones. However, it should be noted that the percentage of project files containing clones is higher when compared to the percentage of Stack Overflow files that contained code. This outcome suggests that the current most popular open source Java projects tend to use code copied from Stack Overflow. In fact, within the projects, we discovered that approximately 57% of these files contained a clone. These clones were found either within a single project, or between projects. In a study by Koschke et al. [26], they discovered that approximately 7.2% of all lines of code in Open Source Java projects were exact clones. These findings indicate that there is a high levels of code reuse and duplication within Open Source Java projects. Our findings suggest that an opportunity exists for developers to reduce their intra-project reuse, which could result in less maintainability issues. Furthermore, developers should also consider that code reuse is occurring between these projects, and as such, they should become acquainted with licensing requirements (refer to Section II).\n\n**RQ4. What is the extent of code reuse between Stack Overflow and the all-time most popular Open Source Java projects?** When we compared Stack Overflow Java code against the all-time most popular OSS projects on SourceForge we observed that 450 unique code fragments were evident in both datasets, and that these appear in 4,334 files in total. This evidence shows that approximately 2.3% of the files sampled contained at least one clone, and that there is one unique clone for every 54.5 project files. In fact, the proportion of project files containing clones was quite high, with approximately 77.2% containing clones when excluding the Stack Overflow files.\n\nConsidering our outcomes against those of previous work [26], where 7.2% of code reuse was found, we believe that code reuse is high in popular Open Source Java projects. Interestingly, the percentage of files containing clones is higher for the all-time most popular projects, when compared to the newer, top weekly projects. It is thus more likely that code copied from these projects could have originally came from a different source, hence, creating a nested code reuse situation. Furthermore, the developers of these systems may potentially benefit from reducing the amount of reused code, thus improving the maintainability of their projects.\n\n**RQ5. Are there differences in the nature of reuse found between the different contexts in terms of scale and size?** Our results show that there are differences in the sizes of clones found across our datasets. Our evidence shows that when reuse was done in Stack Overflow most of the snippets were copied. 
We also observed that current popular Java projects had a greater extent of reused elements from other projects. We believe that newer projects may be constructed more commonly from whole elements of other projects, i.e., the mean clone length is greater than that of the \u2018Top All-Time\u2019 group in Table V, possibly due to the availability of these elements, or perhaps developers are more willing to reuse in recent times. Similar outcomes were reported for Android Mobile Apps [25], which tends to dominate recent application development environments. Evidence here indicates that developers\u2019 behaviors are potentially changing, as we are seeing them incorporate larger pieces of copied code into their work. As such, the effects, both negative and positive, resulting from copying code will be amplified for these projects. In situations where the copied code is well explained in the respective sections on websites such as Stack Overflow, it could lead to better quality software, since the functionality is well understood, tested, and documented by developers. However, if larger pieces of code are copied and pasted without having sufficient accompanying documentation (e.g., comments), then it is likely that the software in question will contain code which is not understood by developers, thus bringing into questions its functionality, reliability, debuggability, and overall quality.\n\nOur results also show a great degree of code duplication between all-time popular OSS projects, and, in fact, the scale and size of reuse was generally higher between OSS projects. This evidence is understandable given that Stack Overflow is generally known for shorter code snippets aimed at answering specific questions. Code duplication\nbetween projects was possibly driven by the use of common third party libraries, but could also be through intentional duplication of similar functionalities. The fact that Stack Overflow snippets were also copied suggests that reuse may be a part of practitioners\u2019 culture. Thus, there are implications for making sure the correct license is used, and developers are aware of the strengths and weaknesses of the code that are copied. Furthermore, on the backdrop of the need for the community to develop high quality, maintainable and secured code, developers should carefully evaluate code that is reused.\n\n**Implications:** Our investigation has shown that code clones do exist across Java-based projects and Stack Overflow. Having clones or duplicates within a system is unavoidable, since many software elements often rely on the same functionalities. However, in cases where many code clones exist, it is possible that developers may experience negative side-effects. Firstly, it is important to understand that high levels of code cloning can have negative effects on software quality, in terms of inconsistencies in code. Studies have found that around half of the software projects investigated had clones which contained inconsistencies, i.e., clones are changed inconsistently, with many of these being unintentional [15, 37]. Furthermore, these works also found that between 3-23% of code clones represented a fault. Thus, it is important for developers to be aware of the levels of code clones that exist within their software. To this end, we believe that tracking clones could improve the overall quality of software. This notion of tracking clones, and thus, being more aware of them, have been shown to improve software debugging [38, 39]. 
Another implication of our findings relates to probable licensing violations. Copying code from other projects or websites such as Stack Overflow without adhering to licensing requirements may result in complicated legal issues, and thus developers should take caution when doing so.\n\n**VI. THREATS TO VALIDITY**\n\nOur analyses were conducted with CCFinderX, which uses a token-based approach to identify clones. This technique itself has some limitations, including a lower precision rate compared to some alternative techniques, primarily Abstract Syntax Tree (AST) techniques [5]. Additionally, CCFinderX had preset parameter settings for its analyses. These parameters were given specific values, which were used to filter all texts in order to identify candidate clones. As such, the detection of clones was based on code meeting the set requirements given by CCFinderX, possibly leading to some clones being missed by the software. This is particularly important when considering that we worked with Stack Overflow data, with which we had an average file token size of 48. Thus, we can assume that some smaller snippets from Stack Overflow reused in our Open Source projects were not detected, and thus, our results could be conservative.\n\nIn fact, our reliability checks show that many clones were of smaller sizes (refer to Table II). However, as code chunks get smaller, the ability to trace these back to their original source becomes challenging. Smaller code fragments may also be labelled as clones accidentally. That said, our contextual analyses performed for reliability evaluation ascertained that code was duplicated, and that there were no attributions. This evidence thus confirms the potential for future maintenance and quality issues, and possible licensing complications.\n\nAdditionally, we did not introduce a time element to determine the direction of reuse. As such, we cannot make conclusive statements regarding the temporal copying of code from Stack Overflow into OSS projects, and in terms of the direction of the copying (i.e., if code was copied from Stack Overflow to OSS projects, or OSS to Stack Overflow). Lastly, our sample of projects may not be representative of all software projects, and as such a large-scale study may produce more generalizable insights. The total number of projects on SourceForge containing Java code alone is over 40,000, and GitHub has over 3.5 million available Java-based repositories. Thus, a larger study may help validate the results obtained from this study. However, the initial study completed here reflects the findings from highly-used projects, making code reuse an important element to consider.\n\n**VII. CONCLUSION AND FUTURE RESEARCH**\n\nThere is an imperative that the software engineering community develop and deliver high quality software. Improper code reuse as a practice may create barriers to the delivery of high-quality software however, and particularly in terms of software maintainability and confirming to legal requirements. With code reuse being a popular practice in the software engineering community, and Q&A forums such as Stack Overflow fueling this practice, it is pertinent to understand how this practice could affect future software maintenance and correct use of license to avoid legal issues. 
Towards this goal, we investigated the levels of code reuse within Stack Overflow, and between Stack Overflow and popular OSS projects.\n\nOur findings have indicated that clones (reuse) do exist in all of the examined contexts (within Stack Overflow, between Stack Overflow and OSS, and between OSS), with numerous cases of code duplication detected in each setting. Outcomes in the work show that projects are all highly likely to contain code that has been copied from sources external to their own code. Additionally, our findings are similar to the research conducted on mobile apps and Python projects. As such, the levels of code reuse in these studies indicate that Java developers need to be made aware of licensing issues and the problems that could arise from ad-hoc copying. In particular, the quality assurance activities in software projects can be more comprehensive and could place greater emphasis on code reused from platforms such as Stack Overflow. This stands in agreement with [40], which discussed the benefits that code clone analysis can provide for software analysis. We also further believe that due to the increased amount of external code being integrated into projects, an even greater need exists for utilizing clone analysis software. If licensing knowledge and correct attribution is improved, then code fragments implemented from external sources will be less likely to cause licensing violations.\n\nOur inter-project analyses showed that the top weekly Java projects had a greater average token size when compared to the all-time most popular Java projects. To further analyze this phenomenon, a time-based comparison of code reuse in OSS projects could be beneficial in\nidentifying the changes in reuse behavior over time. From our preliminary results it appears that newer projects have larger pieces of reused code, which could indicate that inter-project reuse of whole components is occurring. The work completed here can be replicated for a larger sample of projects in order to validate our results and assess the scale of reuse more generally. Additionally, research may look beyond the scope of OSS projects to contrast our findings with closed source projects. Our research may also be expanded to provide insights into the direction of migration of clones. An et al. [19] published results on code migration for Android mobile apps, and Inoue et al. [17] have developed a tool for tracking code in open source repositories, however dedicated work is required to investigate the direction of code migration from Stack Overflow (and other such portals) to OSS projects.\n\nREFERENCES\n\n[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: principles, techniques, and tools. Harlow, Essex: Pearson, 2014.\n\n[2] Anon. Creative Commons License Deed. Available: https://creativecommons.org/licenses/by-sa/3.0/, Feb. 2018.\n\n[3] Anon. Do I have to worry about copyright issues for code posted on Stack Overflow? Available: http://meta.stackexchange.com/questions/12527/do-i-haveto-worry-about-copyright-issues-for-code-posted-on-stack-overflow, Feb. 2018.\n\n[4] A. Mathur, H. Choudhary, P. Vashist, W. Thies, and S. Thilagam, \u201cAn Empirical Study of License Violations in Open Source Projects,\u201d presented at the 35th Annual IEEE Software Engineering Workshop. DOI:http://dx.doi.org/10.1109/sew.2012.24, 2012.\n\n[5] C. K. Roy, J. R. Cordy, and R. 
Koschke, "Comparison and evaluation of code clone detection techniques and tools: A qualitative approach," Science of Computer Programming, vol. 74, pp. 470–495, 2009.

[6] J. Cordeiro, B. Antunes, and P. Gomes, "Context-based recommendation to support problem solving in software development," In Proc. of 3rd International Workshop on Recommendation Systems for Software Engineering (RSSE), 2012.

[7] D. M. German, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, "Code siblings: Technical and legal implications of copying code between applications," In Proc. of 6th Working Conference on Mining Software Repositories (MSR), 2009.

[8] D. Yang, P. Martins, V. Saini, and C. Lopes, "Stack Overflow in Github: Any Snippets There?" In Proc. of 14th International Conference on Mining Software Repositories (MSR), 2017. DOI: http://dx.doi.org/10.1109/msr.2017.13

[9] E. Johansson, A. Wesslén, L. Bratthall, and M. Höst, "The importance of quality requirements in software platform development – a survey," In Proc. of 34th Annual Hawaii International Conference on System Sciences, 2001.

[10] F. Fischer et al., "Stack Overflow Considered Harmful? The Impact of Copy&Paste on Android Application Security," IEEE Symposium on Security and Privacy (SP), 2017.

[11] F. Bi, "Nissan app developer busted for copying code from Stack Overflow," May 2016. Available: https://www.theverge.com/2016/5/11195308/dont-get-busted-copying-code-from-stack-overflow

[12] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes, "SourcererCC: Scaling Code Clone Detection to Big-Code," In Proc. of 38th International Conference on Software Engineering (ICSE '16), 2016.

[13] I. J. Mojica, B. Adams, M. Nagappan, S. Dienst, T. Berger, and A. Hassan, "A Large-Scale Empirical Study on Software Reuse in Mobile Apps," IEEE Software, vol. 31, no. 2, pp. 78–86, 2014. DOI: http://dx.doi.org/10.1109/ms.2013.142

[14] J. R. Cordy and C. K. Roy, "The NiCad Clone Detector," Presented at the IEEE 19th International Conference on Program Comprehension, 2011.

[15] J. Krinke, "A Study of Consistent and Inconsistent Changes to Code Clones," Presented at the 14th Working Conference on Reverse Engineering (WCRE), 2007.

[16] J. C. Knight and M. F. Dunn, "Software quality through domain-driven certification," Ann. Softw. Eng., vol. 5, pp. 293–315, 1998.

[17] K. Inoue, Y. Sasaki, P. Xia, and Y. Manabe, "Where does this code come from and where does it go? — Integrated code history tracker for open source systems," In Proc. of 34th International Conference on Software Engineering, 2012.

[18] L. Heinemann, F. Deissenboeck, M. Gleirscher, B. Hummel, and M. Irlbeck, "On the Extent and Nature of Software Reuse in Open Source Java Projects," Top Productivity through Software Reuse (Lecture Notes in Computer Science), pp. 207–222, 2011.

[19] L. An, O. Mlouki, F. Khomh, and G. Antoniol, "Stack Overflow: A code laundering platform?" In Proc. of IEEE 24th SANER, 2017.

[20] M. Sojer and J. Henkel, "Code Reuse in Open Source Software Development: Quantitative Evidence, Drivers, and Impediments," Journal of the Association for Information Systems, vol. 11, pp. 868–901, 2010.

[21] M. Sojer and J. Henkel, "License risks from ad hoc reuse of code from the internet," Communications of the ACM, vol. 54, p. 74, 2011.
[22] M. Singh, A. Mittal, and S. Kumar, "Survey on Impact of Software Metrics on Software Quality," International Journal of Advanced Computer Science and Applications, vol. 3, 2012.

[23] O. Mlouki, F. Khomh, and G. Antoniol, "On the Detection of Licenses Violations in the Android Ecosystem," In Proc. of IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), 2016.

[24] L. Ponzanelli, A. Bacchelli, and M. Lanza, "Leveraging crowd knowledge for software comprehension and development," CSMR, IEEE Computer Society, 2013, pp. 57–66.

[25] R. Abdalkareem, E. Shihab, and J. Rilling, "On code reuse from StackOverflow: An exploratory study on Android apps," Information and Software Technology, vol. 88, pp. 148–158, 2017.

[26] R. Koschke and S. Bazrafshan, "Software-Clone Rates in Open-Source Programs Written in C or C++," In Proc. of IEEE 23rd SANER, 2016.

[27] S. Baltes, R. Kiefer, and S. Diehl, "Attribution Required: Stack Overflow Code Snippets in GitHub Projects," In Proc. of IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), 2017.

[28] S. Sawilowsky and G. Fahoome, "Kruskal-Wallis Test: Basic," Wiley StatsRef: Statistics Reference Online, 2014.

[29] S. Haefliger, G. von Krogh, and S. Spaeth, "Code Reuse in Open Source Software," Management Science, vol. 54, pp. 180–193, 2008.

[30] S. H. Kan, Metrics and Models in Software Quality Engineering (2nd ed.), Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.

[31] V. Suma and T. R. Gopalakrishnan Nair, "Effective Defect Prevention Approach in Software Process: Achieving Better Quality Levels," World Academy of Science, Engineering and Technology, vol. 42, pp. 258–262, 2008.

[32] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: a multilingual token-based code clone detection system for large scale source code," IEEE TSE, vol. 28, pp. 654–670, 2002.

[33] T. Ægidius Mogensen, "Lexical Analysis," Introduction to Compiler Design (Undergraduate Topics in Computer Science), pp. 1–37, 2011.

[34] V. Bauer, J. Eckhardt, B. Hauptmann, and M. Klimek, "An exploratory study on reuse at Google," In Proc. of 1st International Workshop on Software Engineering Research and Industrial Practices (SER&IPs), 2014.

[35] W. B. Frakes and K. Kang, "Software reuse research: status and future," IEEE Transactions on Software Engineering, vol. 31, pp. 529–536, 2005. DOI: http://dx.doi.org/10.1109/tse.2005.85

[36] Y. Kashima, Y. Hayase, N. Yoshida, Y. Manabe, and K. Inoue, "An Investigation into the Impact of Software Licenses on Copy-and-paste Reuse among OSS Projects," In Proc. of 18th Working Conference on Reverse Engineering, 2011.

[37] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner, "Do code clones matter?" In Proc. of IEEE 31st International Conference on Software Engineering, 2009.

[38] Z. Li, S. Lu, S. Myagmar, and Y. Zhou, "CP-Miner: Finding copy-paste and related bugs in large-scale software code," IEEE Trans. Softw. Eng., vol. 32, pp. 176–192, 2006.

[39] L. Jiang, Z. Su, and E. Chiu, "Context-based detection of clone-related bugs," In Proc. of ESEC/FSE, ACM, 2007.

[40] C. K. Roy, J. Cordy, and R. Koschke, "Comparison and evaluation of code clone detection techniques and tools: A qualitative approach," Science of Computer Programming, vol. 74, no. 7, pp. 470–495, 2009.
{"id": "ebb74a38228c2dea765e7bf0bef0674d05011c0b", "text": "Why Do Developers Use Trivial Packages? \nAn Empirical Case Study on npm\n\nRabe Abdalkareem, Olivier Nourry, Sultan Wehaibi, Suhaib Mujahid, and Emad Shihab \nData-driven Analysis of Software (DAS) Lab \nDepartment of Computer Science and Software Engineering \nConcordia University \nMontreal, Canada \n{rab_abdu,o_nourry,s_alweha,s_mujahi,eshihab}@encs.concordia.ca\n\nABSTRACT\n\nCode reuse is traditionally seen as good practice. Recent trends have pushed the concept of code reuse to an extreme, by using packages that implement simple and trivial tasks, which we call \u2018trivial packages\u2019. A recent incident where a trivial package led to the breakdown of some of the most popular web applications such as Facebook and Netflix made it imperative to question the growing use of trivial packages.\n\nTherefore, in this paper, we mine more than 230,000 npm packages and 38,000 JavaScript applications in order to study the prevalence of trivial packages. We found that trivial packages are common and are increasing in popularity, making up 16.8% of the studied npm packages. We performed a survey with 88 Node.js developers who use trivial packages to understand the reasons and drawbacks of their use. Our survey revealed that trivial packages are used because they are perceived to be well implemented and tested pieces of code. However, developers are concerned about maintaining and the risks of breakages due to the extra dependencies trivial packages introduce. To objectively verify the survey results, we empirically validate the most cited reason and drawback and find that, contrary to developers\u2019 beliefs, only 45.2% of trivial packages even have tests. However, trivial packages appear to be \u2018deployment tested\u2019 and to have similar test, usage and community interest as non-trivial packages. On the other hand, we found that 11.5% of the studied trivial packages have more than 20 dependencies. Hence, developers should be careful about which trivial packages they decide to use.\n\nCCS CONCEPTS\n\n\u2022 Software and its engineering \u2192 Software libraries and repositories; Software maintenance tools;\n\nKEYWORDS\n\nJavaScript; Node.js; Code Reuse; Empirical Studies\n\nACM Reference Format:\n\nRabe Abdalkareem, Olivier Nourry, Sultan Wehaibi, Suhaib Mujahid, and Emad Shihab. 2017. Why Do Developers Use Trivial Packages? An Empirical Case Study on npm. In Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Paderborn, Germany, September 4\u20138, 2017 (ESEC/FSE\u201917). 11 pages. \nhttps://doi.org/10.1145/3106237.3106267\n\n1 INTRODUCTION\n\nCode reuse is often encouraged due to its multiple benefits. In fact, prior work showed that code reuse can reduce the time-to-market, improve software quality and boost overall productivity [3, 32, 37]. Therefore, it is no surprise that emerging platforms such as Node.js encourage reuse and do everything possible to facilitate code sharing, often delivered as packages or modules that are available on package management platforms, such as the Node Package Manager (npm) [7, 39].\n\nHowever, it is not all good news. There are many cases where code reuse has had negative effects, leading to an increase in maintenance costs and even legal action [2, 29, 35, 41]. 
For example, in a recent incident, a Node.js package called left-pad, which was used by Babel, caused interruptions to some of the largest Internet sites, e.g., Facebook, Netflix, and Airbnb, when it was removed from npm. Many referred to the incident as the case that 'almost broke the Internet' [33, 45]. That incident led to many heated discussions about code reuse, sparked by David Haney's blog post: "Have We Forgotten How to Program?" [26].

While the real reason for the left-pad incident was that npm allowed authors to unpublish packages (a problem which has since been resolved [40]), it raised awareness of the broader issue of taking on dependencies for trivial tasks that can be easily implemented [26]. Since then, there have been many discussions about the use of trivial packages. Loosely defined, a trivial package is a package that contains code that a developer could easily write him/herself and hence, is not worth taking on an extra dependency for. Many developers agreed with Haney's position, which stated that 'small modules are only nice in theory' [8], suggesting that developers should implement such functions themselves rather than taking on dependencies for trivial tasks. Other work showed that npm packages tend to have a large number of dependencies [13, 14] and highlighted that developers need to use caution since some dependencies can grow exponentially [4]. In fact, in our dataset, we found that more than 11% of the trivial packages have more than 20 dependencies.

So, the million dollar question is: "why do developers resort to using a package for trivial tasks, such as checking if a variable is an array?" At the same time, other questions regarding how prevalent trivial packages are and what the potential drawbacks of using these trivial packages are remain unanswered. Therefore, we performed an empirical study involving more than 230,000 npm packages and 38,000 JavaScript applications to better understand why developers resort to using trivial packages. Our empirical study is qualitative in nature and is based on survey results from 88 Node.js developers. We also quantitatively validate the most commonly developer-cited reason and drawback related to the use of trivial packages.

Since, to the best of our knowledge, this is the first study to examine why developers use trivial packages, we first propose a definition of what constitutes a trivial package, based on feedback from JavaScript developers. We also examine how prevalent trivial packages are in npm and how widely they are used in Node.js applications. Our findings indicate that:

**Trivial packages are common and popular.** Of the 231,092 npm packages in our dataset, 16.8% of them are trivial packages. Moreover, of the 38,807 Node.js applications on GitHub, 10.9% of them directly depend on one or more trivial packages.

**Most developers do not consider the use of trivial packages as bad practice.** In our survey of the 88 JavaScript developers, 57.9% of them said they do not consider the use of trivial packages as bad practice, whereas only 23.9% consider it to be a bad practice. This finding shows that there is not a clear consensus on the issue of trivial package use.

**Trivial packages provide well implemented and tested code and increase productivity.** Developers believe that trivial packages provide them with well implemented/tested code and increase productivity.
At the same time, the increase in dependency overhead and the risk of breakage of their applications are the two most cited drawbacks.\n\n**Developers need to be careful which trivial packages they use.** Our empirical findings show that many trivial packages have their own dependencies. In fact, we found that 43.7% of trivial packages have at least one dependency and 11.5% of trivial packages have more than 20 dependencies.\n\nIn addition to the aforementioned findings, our study provides the following key contributions:\n\n- We provide a way to quantitatively determine trivial packages.\n- To the best of our knowledge, this is the first study to examine the prevalence, reasons for and drawbacks of using trivial packages in Node.js applications. Our study is also one of the largest studies on JavaScript applications, involving a survey of more than 80 JavaScript developers, 231,092 npm packages and 38,807 Node.js applications.\n- We perform an empirical study to validate the most commonly cited reasons for and drawbacks of using trivial packages in our developer survey.\n- We make our dataset of the responses provided by the npm developers publicly available. \n\nThe paper is organized as follows: Section 2 provides the background and introduces our datasets. Section 3 presents how we determine what a trivial package is. Section 4 examines the prevalence of trivial packages and their use in Node.js applications. Section 5 presents the results of our developer survey, presenting the reasons and perceived drawbacks for developers who use trivial packages. Section 6 presents our empirical validation of the most commonly cited reason for and drawback of using trivial packages. The implications of our findings are noted in section 7. We discuss the related works in section 8, the limitations of our study in section 9, and present our conclusions in section 10.\n\n## 2 BACKGROUND AND DATASETS\n\nJavaScript is used to write client and server side applications. Its popularity has steadily grown, thanks to popular frameworks such as Node.js and an active developer community [7, 46]. JavaScript projects can be classified into two main categories: packages that are used in other projects or applications that are used as standalone software. The Node Package Manager (npm) provides tools to manage Node.js packages. npm is the official package manager for Node.js and its registry contains more than 250,000 packages [25].\n\nTo perform our study, we gather two datasets from two sources. We obtain Node.js packages from the npm registry and applications that use npm packages from GitHub.\n\n**Packages:** Since we are interested in examining the impact of \u2018trivial\u2019 packages, we mined the latest version of all the Node.js packages from npm as of May 5, 2016. For each package we obtained its source code from GitHub. In some cases, the package publisher did not provide a GitHub link, in which case we obtained the source code directly from npm. In total, we mined 252,996 packages.\n\n**Applications:** We also want to examine the use of the packages in JavaScript applications. Therefore, we mined all of the Node.js applications on GitHub. To ensure that we are indeed only obtaining the applications from GitHub, and not npm packages, we compare the URL of the GitHub repositories to all of the URLs we obtained from npm for the packages. If a URL from GitHub was also in npm, we flagged it as being an npm package and removed it from the application list. 
To determine whether an application uses npm packages, we looked for the 'package.json' file, which specifies (amongst other things) the npm package dependencies used by the application.

To eliminate dummy applications that may exist on GitHub, we chose non-forked applications with more than 100 commits and more than 2 developers. Similar filtering criteria were used in prior work by Kalliamvakou et al. [31]. In total, we obtained 115,621 JavaScript applications and, after removing applications that did not use the npm platform, we were left with 38,807 applications.

## 3 WHAT ARE TRIVIAL PACKAGES ANYWAY?

Although what a trivial package is has been loosely defined in the past (e.g., in blogs [27, 28]), we want a more precise and objective way to determine trivial packages. To determine what constitutes a trivial package, we conducted a survey, where we asked participants what they considered to be a trivial package and what indicators they used to determine if a package is trivial or not. We devised an online survey that presented the source code of 16 randomly selected Node.js packages that range in size between 4 and 250 JavaScript lines of code (LOC). Participants were asked to 1) indicate if they thought the package was trivial or not and 2) specify what indicators they use to determine a trivial package. We opted to limit the size of the Node.js packages in the survey to a maximum of 250 JavaScript LOC since we did not want to overwhelm the participants with the review of excessive amounts of code.

We asked the survey participants to indicate trivial packages from the list of Node.js packages provided. We provided the survey participants with a loose definition of what a trivial package is, i.e., a package that contains code that they can easily code themselves and hence, is not worth taking on an extra dependency for. Figure 1 shows an example of a trivial package, called is-Positive, which simply checks if a number is positive. The survey questions were divided into three parts: 1) questions about the participant's development background, 2) questions about the classification of the provided Node.js packages and 3) questions about what indicators the participant would use to determine a trivial package. We sent the survey to 22 developers and colleagues who were familiar with JavaScript development and received a total of 12 responses.

```javascript
module.exports = function (n) {
  return toString.call(n) === '[object Number]' && n > 0;
};
```

Figure 1: Package is-Positive on npm

**Participants' Background and Experience.** Of the 12 respondents, 2 are undergraduate students, 8 are graduate students and 2 are professional developers. Ten of the 12 respondents have at least 2 years of JavaScript experience and half of the participants have been developing with JavaScript for more than five years.

**Survey Responses.** We asked participants to list what indicators they use to determine if a package is trivial or not and to indicate all the packages that they considered to be trivial. Of the 12 participants, 11 (92%) state that the complexity of the code and 9 (75%) state that the size of the code are indicators they use to determine a trivial package. Another 3 (20%) mentioned that they used code comments and other indicators (e.g., functionality) to indicate if a package is trivial or not.
Since it is clear that size and complexity are the most common indicators of trivial packages, we use these two measures to determine trivial packages. It should be mentioned that participants could provide more than one indicator, hence the percentages above sum to more than 100%.

Next, we analyze all of the packages that were marked as trivial. In total, we received 69 votes for the 16 packages. We ranked the packages in ascending order, based on their size, and tallied the votes for the most voted packages. We find that 79% of the votes consider packages that are less than 35 lines of code to be trivial. We also examine the complexity of the packages using McCabe's cyclomatic complexity, and find that 84% of the votes marked packages that have a total complexity value of 10 or lower as trivial. It is important to note that although we provide the source code of the packages to the participants, we do not explicitly provide the size or the complexity of the packages, so the participants are not biased by any metrics, i.e., size or complexity, in their classification.

Based on the aforementioned findings, we used the two indicators JavaScript LOC \( \leq 35 \) and complexity \( \leq 10 \) to determine trivial packages in our dataset. Hence, we define trivial packages as \( \{ X_{\text{LOC}} \leq 35 \cap X_{\text{Complexity}} \leq 10 \} \), where \( X_{\text{LOC}} \) represents the JavaScript LOC and \( X_{\text{Complexity}} \) represents McCabe's cyclomatic complexity of package \( X \). Although we use the aforementioned measures to determine trivial packages, we do not consider this to be the only possible way to determine trivial packages.

**Our survey indicates that size and complexity are commonly used measures to determine if a package is trivial. Based on our analysis, packages that have \( \leq 35 \) JavaScript LOC and a McCabe's cyclomatic complexity \( \leq 10 \) are considered to be trivial.**

## 4 HOW PREVALENT ARE TRIVIAL PACKAGES?

In this section, we want to know how prevalent trivial packages are. We examine prevalence from two aspects: the first aspect is from npm's perspective, where we are interested in knowing how many of the packages on npm are trivial. The second aspect considers the use of trivial packages in JavaScript applications.

### 4.1 How Many of npm's Packages are Trivial?

Having settled on the two measures, LOC and complexity, for determining trivial packages, we now use them to quantify the number of trivial packages in our dataset. Our dataset contained a total of 252,996 npm packages. For each package, we calculated the number of JavaScript code lines and removed packages that had zero LOC, which removed 21,904 packages. This left us with a final number of 231,092 packages. Then, for each package, we removed test code since we are mostly interested in the actual source code of the packages. To identify and remove the test code, similar to prior work [22, 44, 48], we look for the term "test" (and its variants) in the file names and file paths.

Out of the 231,092 npm packages we mined, 38,845 (16.8%) packages are trivial packages. In addition, we examined the growth of trivial packages in npm. Figure 2 shows the percentage of trivial to all packages published on npm per month. We see an increasing trend in the number of trivial packages over time, and approximately 15% of the packages added every month are trivial packages.
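To make the classification used for these counts concrete, the rule from Section 3 can be sketched as a simple predicate. This is a minimal illustration only: the metric values are assumed to have been computed beforehand (with test code already removed), and the example inputs are made up rather than taken from our dataset.

```javascript
// Sketch of the trivial-package rule from Section 3:
// a package is trivial when its JavaScript LOC <= 35 and its total
// McCabe cyclomatic complexity <= 10 (test code excluded beforehand).
const LOC_THRESHOLD = 35;
const COMPLEXITY_THRESHOLD = 10;

function isTrivialPackage({ loc, complexity }) {
  return loc <= LOC_THRESHOLD && complexity <= COMPLEXITY_THRESHOLD;
}

// Illustrative inputs (not taken from the dataset):
console.log(isTrivialPackage({ loc: 3, complexity: 2 }));    // true  -> counted as trivial
console.log(isTrivialPackage({ loc: 120, complexity: 18 })); // false -> non-trivial
```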
We investigated the spike around March 2016 and found that this spike corresponds to the time when npm disallowed the un-publishing of packages [40].\n\nnpm posts the most depended-upon packages on its website [38]. We measured the number of trivial packages that exist in the top 1,000 most depended-upon packages; we find that 113 of them are trivial packages. This finding shows that trivial packages are not\nonly prevalent and increasing in number, but they are also very popular among developers, making up 11.3% of the 1,000 most depended on npm packages.\n\n**Trivial packages make up 16.8% of the studied npm packages. Moreover, the proportion of trivial packages is increasing and trivial packages make up 11.3% of the top 1,000 most depended on npm packages.**\n\n### 4.2 How Many Applications Depend on Trivial Packages?\n\nJust because trivial packages exist on npm, it does not mean that they are actually being used. Therefore, we also examine the number of applications that use trivial packages. To do so, we examine the package.json file, which contains all the dependencies that an application installs from npm. However, in some cases, an application may install a package but not use it. To avoid counting such instances, we parse the JavaScript code of all the examined applications and use regular expressions to detect the require dependency statements, which indicates that the application actually uses the package in its code. Finally, we measured the number of packages that are trivial in the set of packages used by the applications. Note that we only consider npm packages since it is the most popular package manager for Node.js packages and other package managers only manage a subset of packages (e.g., Bower [9] only manages front-end/client-side frameworks, libraries and modules). We find that of the 38,807 applications in our dataset, 4,256 (10.9%) directly depend on at least one trivial package.\n\n**Of the 38,807 Node.js applications in our dataset, 10.9% of them depend on at least one trivial package.**\n\n### 5 SURVEY RESULTS\n\nWe surveyed Node.js developers to understand the reasons for and the drawbacks of using trivial packages. We use a survey because it allows us to obtain first-hand information from the developers who use these trivial packages. In order to select the most relevant participants, we sent out the survey to developers who use trivial packages. We used Git\u2019s `pickaxe` command on the lines that contain the required dependency statements in the applications; a procedure that provided us with the email and name of the developer who introduced the trivial package dependency.\n\n**Survey Participants.** To mitigate the possibility of introducing misunderstood or misleading questions, we initially sent the survey to two JavaScript developers and incorporated their minor suggestions to improve the survey. Next, we sent the survey to 1,055 developers from 1,696 applications. To select the developers, we ranked them based on the number of trivial packages they use. We then took a sample of 600 developers that use trivial packages the most, and another 600 of those that indicated the least use of trivial packages. 
The survey was emailed to the 1,200 selected developers, however, since some of the emails were returned for various reasons (e.g., the email account does not exist anymore, etc.), we could only reach 1,055 developers.\n\nNote that if a package is required in the application, but does not exist, it will break the application.\n\nThe survey listed the trivial package and the application that we detected the trivial package in. We received 88 responses to our survey, which translates to a response rate of 8.3%. Our survey response rate is in line with, and even higher, than the typical 5% response rate reported in questionnaire-based software engineering surveys [42]. Of the 88 respondents, 83 of them identified as developers working either in industry (68) or as a full time independent developers (15). The remaining 5 identified as being a casual developers (2) or other (3), including one student and two developers working in executive positions at npm. As for the development experience of the survey respondents, the majority (67) of the respondents have more than 5 years of experience, 14 have between 3-5 years and 7 have 1-3 years of experience. The fact that most of the respondents are experienced JavaScript developers gives us confidence in our survey responses.\n\n### 5.1 Do Developers Consider Trivial Packages Harmful?\n\nThe first question of our survey to the participants is: \u201cDo you consider the use of trivial packages as bad practice?\u201d The reason to ask this question so bluntly is that it allows us to gauge, in a very deterministic way, how the Node.js developers felt about the issue of using trivial packages. We provided three possible replies, Yes, No or Other in which case they were provided with a text box to elaborate. Of the 88 participants, 51 (57.9%) stated that they do NOT consider the use of trivial packages as bad practice. Another 21 (23.9%) stated that they indeed think that using trivial package is a bad practice. The remaining 16 (18.2%) stated that it really depends on the circumstances, such as the time available, how critical a piece of code is, and if the package used has been thoroughly tested.\n\n**Most of the surveyed developers (57.9%) do NOT believe that using trivial packages is a bad practice.**\n\n### 5.2 Why Do Developers Use Trivial Packages?\n\nWhile we have answered the question as to whether developers think using trivial packages is a bad practice, what we are most interested in is why do developers resort to using trivial packages and what do they view as the drawbacks of using trivial packages. Therefore, the second part of the survey asks participants to list the reasons why they resort to using trivial packages. To ensure that we do not bias the responses of the developers, the answer fields for these questions were in free-form text, i.e., no predetermined suggestions were provided. After gathering all of the responses, we grouped and categorized the responses in a two-phase iterative process. In the first phase, the first two authors carefully read the participant\u2019s answers and came up with a number of categories that the responses fell under. Next, they discussed their groupings and agreed on the extracted categories. Whenever they failed to agree on a category, a third author was asked to help break the tie. Once all of the categories were decided, the same two authors went through all the answers again and classified them into their respective categories. 
For the majority of the cases, the two authors agreed on most categories and the classifications of the responses. To measure the agreement between the two authors, we used Cohen\u2019s Kappa coefficient [10]. The Cohen\u2019s Kappa coefficient has\nbeen used to evaluate inter-rater agreement levels for categorical scales, and provides the proportion of agreement corrected for chance. The resulting coefficient is scaled to range between -1 and +1, where a negative value means less than chance agreement, zero indicates exactly chance agreement, and a positive value indicates better than chance agreement [18]. In our categorization, the level of agreement measured between the authors was of +0.90, which is considered to be an excellent inter-rater agreement.\n\nTable 1 shows the five reasons for using trivial packages, as reported by our survey respondents; another category was used to group the \u2018no reason\u2019 responses. Table 1 presents the different reasons, a description of each category and its frequency. These reasons are listed below, in order of their popularity:\n\n**R1. Well implemented & tested (54.6%)**: The most cited reason for using trivial packages is that they provide well implemented and tested code. More than half of the responses mentioned this reason. In particular, although it may be easy for developers to code these trivial packages themselves, it is more difficult to make sure that all the details are addressed, e.g., one needs to carefully consider all edge cases. Some example responses that mention these issues are stated by participants P68 and P4, who cite their reasons for using trivial packages as follows: P68: \u201cTests already written, a lot of edge cases captured [...].\u201d & P4: \u201cThere may be a more elegant/efficient/correct/cross-environment-compatilble solution to a trivial problem than yours\u201d.\n\n**R2. Increased productivity (47.7%)**: The second most cited reason is the improved productivity that using trivial packages enables. Trivial tasks or not, writing code on your own requires time and effort, hence, many developers view the use of trivial packages as a way to boost their productivity. In particular, early on in a project, a developer does not want to worry about small details, they would rather focus their efforts on implementing the more difficult tasks. For example, participants P13 and P27 state: P13: \u201c[...] and it does save time to not have to think about how best to implement even the simple things.\u201d & P27: \u201cDon\u2019t reinvent the wheel! if the task has been done before.\u201d. The aforementioned are clear examples of how developers would rather not code something, even if it is trivial. Of course, this comes at a cost, which we discuss later.\n\n**R3. Well maintained code (9.1%)**: A less common, but cited reason for using trivial packages is the fact that the maintenance of the code need not to be performed by the developers themselves; in essence, it is outsourced to the community or the contributors of the trivial packages. For example, participant P45 states: \u201cAlso, a highly used trivial package is probable to be well maintained.\u201d. Even tasks such as bug fixes are dealt with by the contributors of the trivial packages, which is very attractive to the users of the trivial packages, as reported by participant P80: \u201c[...], leveraging feedback from a larger community to fix bugs, etc.\u201d\n\n**R4. 
Improved readability & reduced complexity (9.1%)**: Participants also reported that using trivial packages improves the readability and reduces the complexity of their code. For example, P34 states: \u201cimmediate clarity of use and readability for other developers for commonly used packages[...]\u201d & P47 states: \u201cSimple abstract brings less complexity.\u201d\n\n**R5. Better performance (3.4%)**: A few of the participants stated that using trivial packages improves performance since it alleviates the need for their application to depend on large frameworks. For example, P35 states: \u201c[...] you do not depend on some huge utility library of which you do not need the most part.\u201d\n\nOnly a small percentage (8.0%) of the respondents stated that they do not see a reason to use trivial packages.\n\nThe two most cited reasons for using trivial packages are 1) they provide well implemented and tested code and 2) they increase productivity.\n\n### 5.3 Drawbacks of Using Trivial Packages\n\nIn addition to knowing the reasons why developers resort to trivial packages, we wanted to understand the other side of the coin - what they perceive to be the drawbacks of their decision to use these packages. The drawbacks question was part of our survey and we followed the same aforementioned process to analyze the survey responses. In the case of the drawbacks the Cohen\u2019s Kappa agreement measure was +0.86, which is considered to be an excellent agreement. Table 2 lists the drawback mentioned by the survey respondents along with a brief description and the frequency of each drawback.\n\n**I1. Dependency overhead (55.7%)**: The most cited drawback of using trivial packages is the increased dependency overhead, e.g., keeping all dependencies up to date and dealing with complex dependency chains, that developers need to bear [7]. This situation is often referred to as \u2018dependency hell\u2019, especially when the trivial packages themselves have additional dependencies. This drawback came through clearly in many comments, for example, P41 states:\nTable 2: Drawbacks of using trivial packages.\n\n| Drawback | Description | # Resp. | % |\n|------------------------|-----------------------------------------------------------------------------|--------|------|\n| Dependency overhead | Using trivial packages results in a dependency mess that is hard to update and maintain. | 49 | 55.7%|\n| Breakage of applications | Depending on a trivial package could cause the application to break if the package becomes unavailable or has a breaking update. | 16 | 18.2%|\n| Decreased performance | Trivial packages decrease the performance of applications, which includes the time to install and build the application. | 14 | 15.9%|\n| Slows development | Finding a relevant and high quality trivial package is a challenging and time consuming task. | 11 | 12.5%|\n| Missed learning opportu- | The practice of using trivial packages leads to developers not learning and experiencing writing code for trivial tasks. | 8 | 9.1% |\n| nities | | | |\n| Security | Using trivial packages can open a door for security vulnerability. | 7 | 8.0% |\n| Licensing issues | Using trivial packages could cause licensing conflicts. | 3 | 3.4% |\n| No drawbacks | - | 7 | 8.0% |\n\n\"[...] people who don\u2019t actively manage their dependency versions could [be] exposed to serious problems [...]\") & P40: \"Hard to maintain a lot of tiny packages\". 
Hence, while trivial packages may provide well implemented/tested code and improve productivity, developers are clearly aware that the management of the additional dependencies is something they need to deal with.

**I2. Breakage of applications (18.2%)**: Developers also worry about the potential breakage of their application due to a specific package or version becoming unavailable. For example, in the left-pad issue, the main reason for the breakage was the removal of left-pad; P4 states: "Obviously the whole 'left-pad crash' exposed an issue". However, since that incident, npm no longer allows packages to be removed [40]. Although disallowing removal solves part of the problem, packages can still be updated, which may break an application. For a non-trivial package, it may be worth taking the risk; for trivial packages, however, it may not be.

**I3. Decreased performance (15.9%)**: This issue is related to the dependency overhead drawback. Developers mentioned that incurring the additional dependencies slowed down the build time and increased application installation times. For example, P64 states: "Too many metadata to download and store than a real code." & P34 states: "[...], slow installs; can make project noisy and unintuitive by attempting to cobble together too many disparate pieces instead of more targeted code." As mentioned earlier, it is not just the fact that the trivial package adds a dependency; in some cases the trivial package itself depends on additional packages, which negatively impacts performance even further.

**I4. Slows development (12.5%)**: In some cases, the use of trivial packages may actually have a reverse effect and slow down development. For example, as P23 and P15 state: P23: "Can actually slow the team down as, no matter how trivial a package, if a developer hasn't required it themselves they will have to read the docs in order to double check what it does, rather than just reading a few lines of your own source." & P15: "[...], we have the problem of locating packages that are both useful and "trustworthy" [...]"; it can be difficult to find a relevant and trustworthy package. Even if others try to build on your code, it is much more difficult to go fetch a package and learn it, rather than read a few lines of your code.

**I5. Missed learning opportunities (9.1%)**: In certain cases, the use of these trivial packages is seen as a missed learning opportunity for developers. For example, P24 states: "Sometimes people forget how to do things and that could lead to a lack of control and knowledge of the language/technology you are using". This is a clear example of where just using a package, rather than coding the solution yourself, will lead to less knowledge about the code base.

**I6. Security (8.0%)**: In some cases, trivial packages may have security flaws that make the application more vulnerable. This is an issue pointed out by a few developers; for example, as P15 mentioned earlier, it is difficult to find packages that are trustworthy. P57 also mentions: "If you depend on public trivial packages then you should be very careful when selecting packages for security reasons". As in the case of any dependency one takes on, there is always a chance that a security vulnerability could be exposed in one of these packages.

**I7. Licensing issues (3.4%)**: In some cases, developers are concerned about potential licensing conflicts that trivial packages may cause.
For example, P73 states: \"[...], possibly license-issues\", P62: \"[...], there is a risk that the 'trivial' package might be licensed under the GPL must be replaced anyway prior to shipping.\"\n\nThere were also 8% of the responses that stated they do not see any drawbacks with using trivial packages.\n\nThe two most cited drawbacks of using trivial packages are 1) they increase dependency overhead and 2) they may break their applications due to a package or a specific version becoming unavailable or incompatible.\n\n6 PUTTING DEVELOPER PERCEPTION UNDER THE MICROSCOPE\n\nThe developer survey provided us with great insights on why developers use trivial packages and what they perceive to be their drawbacks. However, whether there is empirical evidence to support their perceptions remains unexplored. Thus, we examine the most commonly cited reason for using trivial packages, i.e., the fact...\nthat trivial packages are well tested, and drawback, i.e., the impact of additional dependencies, based on our findings in Section 5.\n\n6.1 Examining the \u2018Well Tested\u2019 Perception\n\nAs shown in Table 1, 54.6% of the responses indicate that they use trivial packages since they are well implemented and tested. And, the developers have good reasons to believe so. npm requires that developers provide a test script name with the submission of their packages (listed in the package.json file). In fact, 81.2% (31,521 out of 38,845) of the trivial packages in our dataset have some test script name listed. However, since developers can provide any script name under this field, it is difficult to know if a package is actually tested.\n\nWe examine whether a package is really well tested and implemented from two aspects; first, we check if a package has tests written for it. Second, since in many cases, developers consider packages to be \u2018deployment tested\u2019, we also consider the usage of a package as an indicator of it being well tested and implemented [47]. To carefully examine whether a package is really well tested and implemented, we use the npm online search tool (known as npms [11]) to measure various metrics related to how well the packages are tested, used and valued. To provide its ranking of the packages, npms mines and calculates a number of metrics based on development (e.g., tests) and usage (e.g., no. of downloads) data. We use three metrics measured by npms to validate the \u2018well tested and implemented\u2019 perception of developers, which are:\n\n1) Tests: considers the tests\u2019 size, coverage percentage and build status for a project. We looked into the npms source code and find that the Tests metric is calculated as: \\( \\text{testsSize} \\times 0.6 + \\text{buildStatus} \\times 0.25 + \\text{coveragePercentage} \\times 0.15 \\). We use the Tests metric to determine if a package is tested and how trivial packages compare to non-trivial packages in terms of how well tested they are. One example that motivates us to investigate how well tested a trivial package is the response by P68, who says: \u201cTests already written, a lot edge cases captured [...].\u201d\n\n2) Community interest: evaluates the community interest in the packages, using the number of stars on GitHub & npm, forks, subscribers and contributors. Once again, we find through the source code of npm that Community interest is simply the sum of the aforementioned metrics, measured as: \\( \\text{starsCount} + \\text{forksCount} + \\text{subscribersCount} + \\text{contributorsCount} \\). 
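As a compact restatement, the two npms formulas quoted above can be written out directly in code. This is only an illustration of the formulas as reported here, with the metric names taken from the text; it is not the actual npms implementation.

```javascript
// npms-style scores as quoted in the text: Tests is a weighted sum (0.6 / 0.25 / 0.15),
// and Community interest is a plain sum of the four popularity counts.
function testsScore({ testsSize, buildStatus, coveragePercentage }) {
  return testsSize * 0.6 + buildStatus * 0.25 + coveragePercentage * 0.15;
}

function communityInterest({ starsCount, forksCount, subscribersCount, contributorsCount }) {
  return starsCount + forksCount + subscribersCount + contributorsCount;
}
```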
We use this metric to compare how interested the community is in trivial and non-trivial packages. We measure the community interest since developers view the importance of the trivial packages as evidence of its quality as stated by P56, who says: \u201c[...] Using an isolated module that is well-tested and vetted by a large community helps to mitigate the chance of small bugs creeping in.\u201d\n\n3) Download count: measures the mean downloads for the last three months. Again, the number of downloads of a package is often viewed as an indicator of the package\u2019s quality; as P61 mentions: \u201cthis code is tested and used by many, which makes it more trustful and reliable.\u201d\n\nAs an initial step, we calculate the number of trivial packages that have a Tests value greater than zero, which means trivial packages that have some of tests. We find that only 45.2% of the trivial packages have tests, i.e., a Tests value > 0. In addition, we compare the values of the Tests, Community interest and Download count for Trivial and non-Trivial packages. Our focus is on the values of the aforementioned metric values for trivial packages, however, we also present the results for non-trivial packages to put our results in context.\n\nFigure 3 shows the bean-plots for the Tests, Community interest and Download count. The figures show that in all cases trivial packages have, on median, a smaller Tests value, Community interest value and Download count compared to non-trivial packages. That said, we observe from Figure 3 a) that the distribution of the Tests metric is similar for both, trivial and non-trivial packages. Most packages have a Tests value of zero, then there are small pockets of packages that have values of aprox. 0.25, 0.6, 0.8 and 1.0. In the case of the Community interest and Download count metrics, once again, we see similar distributions, although clearly the median values are lower for trivial packages.\n\nTo examine whether the difference in metric values between trivial and non-trivial packages is statistically significant, we performed a Mann-Whitney test to compare the two distributions and determine if the difference is statistically significant, with a \\( p \\)-value < 0.05. We also use Cliff\u2019s Delta (\\( d \\)), which is a non-parametric effect size measure to interpret the effect size between trivial and non-trivial packages. As suggested in [23], we interpret the effect size value to be small for \\( d < 0.33 \\) (positive as well as negative values), medium for \\( 0.33 \\leq d < 0.474 \\) and large for \\( d \\geq 0.474 \\).\n\nTable 3 shows the \\( p \\)-values and effect size values. We observe that in all cases the differences are statistically significant, however, the effect size is small. The results show that although the majority of trivial packages do not have tests written for them, and have\n\n| Metrics | \\( p \\)-value | \\( d \\) |\n|--------------------------|--------------|--------------|\n| Tests | 2.2e-16 | -0.119 (small) |\n| Community interest | 2.2e-16 | -0.269 (small) |\n| Downloads count | 2.2e-16 | -0.245 (small) |\nContrary to developers\u2019 perception, only 45.2% of trivial packages actually have tests. 
Albeit, trivial packages have lower Tests, Community interest and Download count values, the values of the metrics do not seem to have a large difference compared to non-trivial packages, i.e., trivial packages are similar to non-trivial packages in terms of how well they are tested.\n\n6.2 Examining the \u2018Dependency Overhead\u2019 Perception\n\nAs discussed in Section 5, the top cited drawback of using trivial packages is the fact that developers need to take on and maintain extra dependencies, i.e., dependency overhead. Examining the impact of dependencies is a complex and well-studied issue (e.g., [1, 12, 15]) that can be examined in a multitude of ways. We choose to examine the issue from both, the application and the package perspectives.\n\nApplications: When compared to coding trivial tasks themselves, using a trivial package imposes extra dependencies. One of the most problematic aspects of managing dependencies for applications is when these dependencies update, causing a potential to break their application. Therefore, as a first step, we examined the number of releases for trivial and non-trivial packages. The intuition here is that developers need to put in extra effort to assure the proper integration of new releases. Figure 4 shows that trivial packages have less releases than non-trivial packages (median is 2 for trivial and 3 for non-trivial packages), hence trivial packages do not require more effort than non-trivial packages. The fact that the trivial packages are updated less frequently may be attributed to the fact that trivial packages \u2018perform less functionality\u2019, hence they need to be updated less frequently.\n\nNext, we examined how developers choose to deal with the updates of trivial packages. One way that application developers reduce the risk of a package impacting their application is to \u2018version lock\u2019 the package. Version locking a dependency/package means that it is not updated automatically, and that only the specific version mentioned in the packages.json file is used. As stated in a few responses from our survey, e.g., P8: \u201c...Also, people who don\u2019t lock down their versions are in for some pain.\u201d There are different types of version locks, i.e., only updating major releases, updating patches only, updating minor releases or no lock at all, which means the package automatically updates. The version locks are specified in the packages.json file next to every package name. We examined the frequency at which trivial and non-trivial packages are locked. We find that on average, trivial packages are locked 14.9% of the time, whereas non-trivial packages are locked 11.7% of the time. However, the Wilcox test shows that the difference is not statistically significant, p-value > 0.05. Hence, we cannot say that developers version lock trivial packages more.\n\nPackages: At the package level, we investigate the direct and indirect dependencies of trivial packages. In particular, we would like to determine if the trivial packages have their own dependencies, which makes the dependency chain even more complex. For each trivial and non-trivial package, we install it and then count the actual number of (direct and indirect) dependencies that the package requires. Doing so, allows us to know the true (direct and indirect) dependencies that each package requires. 
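For reference, the version-lock types discussed above correspond to standard semver range prefixes in a package.json dependencies section. The snippet below is only an illustrative sketch (the package names and version numbers are made up), not the script used in our analysis.

```javascript
// Classify a package.json dependency specifier by how strongly it is locked.
// Standard npm semver ranges: "1.2.3" = exact lock, "~1.2.3" = patch updates only,
// "^1.2.3" = minor and patch updates, "*" (or "latest") = no lock at all.
function lockLevel(spec) {
  if (/^\d+\.\d+\.\d+$/.test(spec)) return 'exact (fully locked)';
  if (spec.startsWith('~')) return 'patch updates only';
  if (spec.startsWith('^')) return 'minor and patch updates';
  if (spec === '*' || spec === 'latest') return 'no lock';
  return 'other range';
}

// Illustrative dependencies section of a package.json:
const dependencies = { 'left-pad': '1.1.3', 'is-positive': '^3.1.0', 'lodash': '~4.17.4' };
for (const [name, spec] of Object.entries(dependencies)) {
  console.log(`${name}: ${lockLevel(spec)}`);
}
```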
Note that simply looking into the package.json file and the require statements will provide the direct dependencies, but not the indirect dependencies.

Figure 5 shows the distribution of dependencies for trivial and non-trivial packages. Since most trivial packages have no dependencies, the median is 0. Therefore, we bin the trivial packages based on the number of their dependencies and calculate the percentage of packages in each bin. Table 4 shows the percentage of packages and their respective number of dependencies. We observe that the majority of trivial packages (56.3%) have zero dependencies, 27.9% have between 1-10 dependencies, 4.3% have between 11-20 dependencies and 11.5% have more than 20 dependencies. The table shows that some of the trivial packages have many dependencies, which indicates that indeed, trivial packages can introduce significant dependency overhead.

Table 4: Percentage of packages per number of dependencies (direct & indirect).

| Packages    | zero  | 1-10  | 11-20 | > 20  |
|-------------|-------|-------|-------|-------|
| Trivial     | 56.3% | 27.9% | 4.3%  | 11.5% |
| Non-trivial | 34.8% | 30.6% | 7.3%  | 27.3% |

**Trivial packages have fewer releases than non-trivial packages, and developers do not version lock them significantly more often. That said, developers should be careful when using trivial packages, since in some cases, trivial packages can have numerous dependencies. In fact, we find that 43.7% of trivial packages have at least one dependency and 11.5% of trivial packages have more than 20 dependencies.**

## 7 RELEVANCE AND IMPLICATIONS

A common question that is asked of empirical studies is: so what? What are the implications of your findings? Why would practitioners care about your findings? We discuss the relevance of our study to the developer community, based on the responses to our survey, and highlight some of the implications of our study.

### 7.1 Relevance: Do Practitioners Care?

At the start of the study, we were not sure how practically relevant our study of trivial packages would be. However, we were surprised by the interest of developers in our study. In fact, one of the developers (P39) explicitly mentioned the lack of research on this topic, stating: "There has not been enough research on this, but I've been taking note of people's proposed "quick and simple" code to handle the functionality of trivial packages, and it's surprised me to see the high percentage of times the proposed code is buggy or incomplete."

Moreover, when we conducted our study, we asked respondents if they would like to know the outcome of our study and, if so, to provide us with an email address. Of the 88 respondents, 66 (approx. 74%) provided their email for us to send them the outcomes of our study. Some of these respondents hold very high-level leadership roles at npm. To us, this is an indicator that our study and its outcomes are of high relevance to the npm and Node.js development community.

### 7.2 Implications of Our Study

Our study has a number of implications for both software engineering research and practice.

Implications for Future Research: Our study mostly focused on determining the prevalence, reasons for and drawbacks of using trivial packages. Based on our findings, we see a number of implications/motivations for future work. First, our survey respondents indicated that the choice to use trivial packages is not black or white. In many cases, it depends on the team and the project.
For example, one survey respondent stated that on his team, less experienced developers are more likely to use trivial packages, whereas the more experienced developers would rather write their own code for trivial tasks. The issue here is that the experienced developers are more likely to trust their own code, while the less experienced are more likely to trust an external package. Another aspect is the maturity of the project. As some of the survey respondents pointed out, they are much more likely use trivial packages early on in the project, so they do not waste time on trivial tasks and focus on the more fundamental tasks of their project. However, once their project matures, they start to look for ways to reduce dependencies since they pose potential points of failure for their project. Hence, our study motivates future work to examine the relationship between team experience and project maturity and the use of trivial packages.\n\nSecond, survey respondents also pointed out that using trivial packages is seen favourably compared to using code from Q&A sites such as StackOverflow or Reddit. When compared to using code on StackOverflow, where the developer does not know who posted the code, who else uses it or whether the code may have tests or not, using a trivial package that is on npm is a much better option. In this case, using trivial packages is not seen as the best choice, but it is certainly a better choice. Although there have been many studies that examined how developers use Q&A sites such as StackOverflow, we are not aware of any studies that compare code reuse from Q&A sites and trivial packages. Our findings motivate the need for such a study.\n\nPractical Implications: A direct implication of our findings is that trivial packages are commonly used by others, perhaps indicating that developers do not view their use as bad practice. Moreover, developers should not assume that all trivial packages are well implemented and tested, since our findings show otherwise. npm developers need to expect more trivial packages to be submitted, making the task of finding the most relevant package even harder. Hence, the issue of how to manage and help developers find the best packages needs to be addressed. To some extent, npms has been recently adopted by npm to specifically address the aforementioned issue. Developers highlighted that the lack of a decent core or standard JavaScript library causes them to resort to trivial packages. Often, they do not want to install large frameworks just to leverage small parts of the framework, hence they resort to using trivial packages. Therefore, there is a need by the Node.js community to create a standard JavaScript API or library in order to reduce the dependence on trivial packages. However, the issue of creating such a standard JavaScript library is under much debate.\n\n8 RELATED WORK\n\nStudies of Code Reuse. Prior research on code reuse has been shown its many benefits, which include improving quality, development speed, and reducing development and maintenance costs [3, 32, 36, 37]. For example, Sojer and Henkel [43] surveyed 686 open source developers to investigate how they reuse code. Their findings show that more experienced developers reuse source code and 30% of the functionality of open source software (OSS) projects reuse existing components. Developers also reveal that they see code reuse as a quick way to start new projects. Similarly, Haefliger et al. 
[24] conducted a study to empirically investigate reuse in open source software and the development practices of developers in OSS. They triangulated three sources of data (developer interviews, code inspections, and mailing list data) from six OSS projects. Their results showed that developers used tools and relied on standards when reusing components. Mockus [36] conducted an empirical study to identify large-scale reuse of open source libraries. The study shows that more than 50% of source files include code from other OSS libraries. On the other hand, the practice of reusing source code has some challenging drawbacks, including the effort and resources required to integrate reused code [16]. Furthermore, a bug in the reused component could propagate to the target system [17]. While our study corroborates some of these findings, the main goal is to define and empirically investigate the phenomenon of reusing trivial packages, in particular in Node.js applications.

Studies of Other Ecosystems. In recent years, analyzing the characteristics of ecosystems in software engineering has gained momentum [4, 5, 15, 34]. For example, in a recent study, Bogart et al. [6, 7] empirically studied three ecosystems, including npm, and found that developers struggle with changing versions as they might break dependent code. Wittern et al. [46] investigated the evolution of the npm ecosystem in an extensive study that covers the dependencies between npm packages, download metrics, and the usage of npm packages in real applications. One of their main findings is that the number of npm packages and of updates to these packages is steadily growing. Also, more than 80% of packages have at least one direct dependency.

Other studies examined the size characteristics of packages in an ecosystem. German et al. [21] studied the evolution of the statistical computing project GNU R, with the aim of analyzing the differences between code characteristics of core and user-contributed packages. They found that user-contributed packages are growing faster than core packages. Additionally, they reported that user-contributed packages are typically smaller than core packages in the R ecosystem. Kabbedijk and Jansen [30] analyzed the Ruby ecosystem and found that many small and large projects are interconnected.

In many ways, our study complements the previous work since, instead of focusing on all packages in an ecosystem, we specifically focus on trivial packages. Moreover, we examine the reasons developers use trivial packages and what they view as their drawbacks.

We study the reuse of trivial packages, which is a subset of general code reuse. Hence, we do expect there to be some overlap with prior work. Like many empirical studies, we confirm some of the prior findings, which is a contribution on its own. Moreover, our paper adds to the prior findings through, for example, our validation of the developers' assumptions. Lastly, we do believe our study fills a real gap, since 74% of the participants said they wanted to know our study outcomes.

9 THREATS TO VALIDITY

Construct validity considers the relationship between theory and observation, in case the measured variables do not measure the actual factors. To define trivial packages, we surveyed 12 JavaScript developers, who are mostly graduate students with some professional experience. However, we find that there was a clear vote for what is considered a trivial package.
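For illustration, the following sketch shows one way this working definition could be operationalized when scanning packages; it is not the authors' tooling. The line-counting heuristic is naive, test files are skipped using the 'test'-in-path heuristic mentioned later in this section, and the cyclomatic complexity value is assumed to come from an external analyzer.

```python
from pathlib import Path

MAX_LOC = 35          # LOC threshold from the study's working definition
MAX_COMPLEXITY = 10   # cyclomatic complexity threshold from the definition

def count_loc(pkg_dir: Path) -> int:
    """Naive JavaScript LOC count: non-blank, non-comment-only lines,
    skipping node_modules and files whose path mentions 'test'."""
    loc = 0
    for js_file in pkg_dir.rglob("*.js"):
        parts = [p.lower() for p in js_file.parts]
        if "node_modules" in parts or any("test" in p for p in parts):
            continue
        for line in js_file.read_text(errors="ignore").splitlines():
            stripped = line.strip()
            if stripped and not stripped.startswith(("//", "/*", "*")):
                loc += 1
    return loc

def is_trivial(pkg_dir: Path, cyclomatic_complexity: int) -> bool:
    """Flag a package as trivial under the <= 35 LOC / <= 10 complexity rule;
    the complexity value is expected from an external analyzer."""
    return count_loc(pkg_dir) <= MAX_LOC and cyclomatic_complexity <= MAX_COMPLEXITY
```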
Also, although our data suggested that packages with $\leq 35$ LOC and a complexity $\leq 10$ are trivial packages, we believe that other definitions are possible for trivial packages. That said, of the 88 survey participants that we emailed about using trivial packages, only 1 mentioned that the flagged package is not a trivial package (even though it fit our criteria). To us, this is a confirmation that our definition applies in the vast majority of cases, although clearly it is not perfect.

We use the LOC and complexity of the code to determine trivial packages. In some cases, these may not be the only measures that need to be considered to determine a trivial package. For example, some of the trivial packages have their own dependencies, which may need to be taken into consideration. However, our experience tells us that most developers only look at the package itself and not its dependencies when determining if it is trivial or not. That said, it would be interesting to replicate this questionnaire with another set of participants to confirm or enhance our definition of a trivial Node.js package.

Our list of reasons for and drawbacks of using trivial packages is based on a survey of 88 Node.js developers. Although this is a large number of developers, our results may not hold for all Node.js developers. A different sample of developers may result in a different list or ranking of advantages and disadvantages. To mitigate the risk due to this sampling, we contacted developers from different applications and, as our responses show, most are experienced developers. Also, there is potential that our survey questions may have influenced the replies from the respondents. However, to minimize such influence, we made sure to ask for free-form responses (to minimize any bias) and we publicly share our survey and all of our anonymized survey responses.

We used npms to measure various quantitative metrics related to testing, community interest, and download counts. Our measurements are only as accurate as npms; however, given that it is the main search tool for npm, we are confident in the npms metrics.

We do not distinguish between the domains of the npm packages, which may impact the findings. However, to help mitigate any bias, we analyzed more than 230,000 npm packages that cover a wide range of domains.

We removed test code from our dataset to ensure that our analysis only considers JavaScript source code. We identified test code by searching for the term 'test' (and its variants) in the file names and file paths. Even though this technique is widely accepted in the literature [22, 44, 48], to confirm whether our technique is correct, i.e., that files that have the term 'test' in their names and paths actually contain test code, we took a statistically significant sample of the packages to achieve a 95% confidence level and a 5% confidence interval and examined them manually.
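The size of such a sample follows from the standard formula for estimating a proportion with a finite-population correction. As an illustration (not the authors' script), with a 95% confidence level (z ≈ 1.96), a 5% margin of error, and the population of more than 230,000 packages mentioned above, the required sample is roughly 384 packages:

```python
import math

def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Sample size for estimating a proportion at the given confidence
    level (z) and margin of error, with finite-population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # ~384.16 for 95% / 5%
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(230_000))  # -> 384
```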
External validity considers the generalization of our findings. All of our findings were derived from open source Node.js applications and npm packages; hence, our findings may not generalize to other platforms or ecosystems. That said, historical evidence shows that examples of individual cases contributed significantly in areas such as physics, economics, social sciences, and even software engineering [19]. We believe that strong empirical evidence is built from both studies of individual cases and studies of large samples.

10 CONCLUSION

The use of trivial packages is an increasingly popular trend in software development. Like any development practice, it has its proponents and opponents. The goal of our study is to examine the prevalence of, reasons for, and drawbacks of using trivial packages. Our findings indicate that trivial packages are commonly and widely used in Node.js applications. We also find that the majority of developers do not oppose the use of trivial packages, and that the main reason developers use trivial packages is that they consider them to be well implemented and tested. However, developers do cite the additional dependency overhead as a drawback of using these trivial packages. That said, our empirical study showed that considering trivial packages to be well tested is a misconception, since more than half of the trivial packages we studied do not even have tests written; however, these trivial packages seem to be 'deployment tested' and have Tests, Community interest, and Download count values similar to those of non-trivial packages. In addition, we find that some of the trivial packages have their own dependencies and, in our studied dataset, 11.5% of the trivial packages have more than 20 dependencies. Hence, developers should be careful about which trivial packages they use.

ACKNOWLEDGMENTS

The authors are grateful to the many survey respondents who dedicated their valuable time to respond to our surveys.

REFERENCES

[1] Pietro Abate, Roberto Di Cosmo, Jaap Boender, and Stefano Zacchiroli. 2009. Structural Dependence Between Software Components. In Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement (ESEM '09). IEEE Computer Society, 89–99.

[2] Rabe Abdalkareem, Emad Shihab, and Juergen Rilling. 2017. On Code Reuse from StackOverflow: An exploratory study on Android apps. Information and Software Technology 88, C (2017), 148–158.

[3] Victor R. Basili, Lionel C. Briand, and Walcélio L. Melo. 1996. How Reuse Influences Productivity in Object-oriented Systems. Commun. ACM 39, 10 (October 1996), 104–116.

[4] Gabriele Bavota, Gerardo Canfora, Massimiliano Di Penta, Rocco Oliveto, and Sebastiano Panichella. 2013. The Evolution of Project Inter-dependencies in a Software Ecosystem: The Case of Apache. In Proceedings of the 2013 IEEE International Conference on Software Maintenance (ICSM '13). IEEE Computer Society, 280–289.

[5] Remco Bloemen, Chintan Amrit, Stefan Kuhlmann, and Gonzalo Ordóñez-Matamoros. 2014. Gentoo Package Dependencies over Time. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR '14). ACM, 404–407.

[6] Christopher Bogart, Christian Kästner, and James Herbsleb. 2015. When It Breaks, It Breaks: How Ecosystem Developers Reason About the Stability of Dependencies. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW '15). IEEE Computer Society, 86–89.

[7] Christopher Bogart, Christian Kästner, James Herbsleb, and Ferdian Thung. 2016. How to Break an API: Cost Negotiation and Community Values in Three Software Ecosystems. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '16). ACM, 109–120.

[8] Stephan Bonnemann. 2015.
Dependency Hell Just Froze Over. https://speakerdeck.com/bonnemann/dependency-hell-just-froze-over. (September 2015). (accessed on 08/10/2016).\n\n[9] Bower. 2012. Bower a package manager for the web. https://bower.io/. (2012). (accessed on 08/23/2016).\n\n[10] J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological measurement 20 (1960), 37\u201346.\n\n[11] Andre Cruz and Andre Duarte. 2017. npmjs. https://npmjs.org/ (01/2017). (accessed on 02/20/2017).\n\n[12] Cleidson R. B. de Souza and David F. Redmiles. 2008. An Empirical Study of Software Developers\u2019 Management of Dependencies and Changes. In Proceedings of the 30th International Conference on Software Engineering (ICSE \u201908). ACM, 241\u2013250.\n\n[13] Alexandre Decan, Tom Mens, and Ma\u00eblick Claes. 2016. On the Topology of Package Dependency Networks: A Comparison of Three Programming Language Ecosystems. In Proceedings of the 10th European Conference on Software Architecture Workshops (ECSAW \u201916). ACM, Article 21, 4 pages.\n\n[14] Alexandre Decan, Tom Mens, and Ma\u00eblick Claes. 2017. An Empirical Comparison of Dependency Issues in OSS Packaging Ecosystems. In Proceedings of the 24th International Conference on Software Analysis, Evolution, and Reengineering (SANER \u201917). IEEE.\n\n[15] Alexandre Decan, Tom Mens, Philippe Grosjean, and others. 2016. When GitHub Meets CRAN: An Analysis of Inter-Repository Package Dependency Problems. In Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER \u201916). Vol. 1. IEEE, 493\u2013504.\n\n[16] Roberto Di Cosmo, Davide Di Ruscio, Patrizio Pelliccione, Alfonso Pierantonio, and Stefano Zacchiroli. 2011. Supporting software evolution in component-based FOSS systems. Science of Computer Programming 76, 12 (2011), 1144\u20131160.\n\n[17] Mehdi Dogguy, Stephane Glondu, Sylvain Le Gall, and Stefano Zacchiroli. 2011. Enforcing Type-Safe Linking using Inter-Package Relationships. Studia Informatica Universalis. 9, 11 (2012), 129\u2013157.\n\n[18] J. L. Fleiss and J. Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement 33 (1973), 613\u2013617.\n\n[19] Bent Flyvbjerg. 2006. Five misunderstandings about case-study research. Qualitative Inquiry 12, 2 (2006), 219\u2013245.\n\n[20] Thomas Fuchs. 2016. What if we had a great standard library in JavaScript? https://medium.com/@thomafuchs/what-if-we-had-a-great-standard-library-in-javascript-52692342ee3f. (Mar 2016). (accessed on 02/24/2017).\n\n[21] D German, B Adams, and AE Hassan. 2013. Programming language ecosystems: the evolution of r. In Proceedings of the 37th European Conference on Software Maintenance and Reengineering (CSMR \u201913). IEEE, 243\u2013252.\n\n[22] Georgios Gousios and Andy Zaidman. 2014. A Dataset for Pull-based Development Research. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR \u201914). ACM, 368\u2013371.\n\n[23] Robert J Grissom and John J Kim. 2005. Effect sizes for research: A broad practical approach. Lawrence Erlbaum Associates Publishers.\n\n[24] Stefan Haefliger, Georg Von Krogh, and Sebastien Spahett. 2008. Code reuse in open source software. Management Science 54, 1 (2008), 180\u2013193.\n\n[25] Quin Hanam, Fernando N. N. M. S. M. Brito, and Ali Mesbah. 2016. Discovering Bug Patterns in JavaScript. 
In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '16). ACM, 144–156.

[26] Dan Haney. 2016. NPM & left-pad: Have We Forgotten How To Program? http://www.haneycodes.net/npm-left-pad-have-we-forgotten-how-to-program/. (March 2016). (accessed on 08/10/2016).

[27] Rich Harris. 2015. Small modules: it's not quite that simple. https://medium.com/@Rich_Harris/small-modules-it-s-not-quite-that-simple-3ca5352d5d4e. (Jul 2015). (accessed on 08/24/2016).

[28] Hemanth HM. 2015. One-line node modules - Issue #10 - sindresorhus/ama. https://github.com/sindresorhus/ama/issues/10. (2015). (accessed on 08/10/2016).

[29] Katsuro Inoue, Yusuke Sakai, Pei Xia, and Yuki Manabe. 2012. Where Does This Code Come from and Where Does It Go? - Integrated Code History Tracker for Open Source Systems -. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, 331–341.

[30] Jaap Kabbedijk and Slinger Jansen. 2011. Steering insight: An exploration of the Ruby software ecosystem. In Proceedings of the Second International Conference on Software Business (ICSOB '11). Springer, 44–55.

[31] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2014. The Promises and Perils of Mining GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR '14). ACM, 92–101.

[32] Wayne C. Lim. 1994. Effects of Reuse on Quality, Productivity, and Economics. IEEE Software 11, 5 (1994), 23–30.

[33] Fiona Macdonald. 2016. A programmer almost broke the Internet last week by deleting 11 lines of code. http://www.sciencealert.com/how-a-programmer-almost-broke-the-internet-by-deleting-11-lines-of-code. (March 2016). (accessed on 08/24/2016).

[34] Konstantinos Manikas. 2016. Revisiting software ecosystems research: a longitudinal literature study. Journal of Systems and Software 117 (2016), 84–103.

[35] Stephen McCamant and Michael D. Ernst. 2003. Predicting Problems Caused by Component Upgrades. In Proceedings of the 9th European Software Engineering Conference Held Jointly with 11th ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE '03). ACM, 287–296.

[36] Audris Mockus. 2007. Large-Scale Code Reuse in Open Source Software. In Proceedings of the First International Workshop on Emerging Trends in FLOSS Research and Development (FLOSS '07). IEEE Computer Society, 7–.

[37] Parastoo Mohagheghi, Reidar Conradi, Ole M. Killi, and Henrik Schwarz. 2004. An Empirical Study of Software Reuse vs. Defect-Density and Stability. In Proceedings of the 26th International Conference on Software Engineering (ICSE '04). IEEE Computer Society, 282–292.

[38] npm. 2016. Most depended-upon packages. http://www.npmjs.com/browse/depended. (August 2016). (accessed on 08/10/2016).

[39] npm. 2016. What is npm? Node Package Management Documentation. https://docs.npmjs.com/getting-started/what-is-npm. (July 2016). (accessed on 08/14/2016).

[40] The npm Blog. 2016. Changes to npm's unpublish policy. http://blog.npmjs.org/post/141953680000/changes-to-unpublish-policy. (March 2016). (accessed on 08/11/2016).

[41] Heikki Orsila, Jaco Geldenhuys, Anna Ruokonen, and Imed Hammouda. 2008. Update propagation practices in highly reusable open source components.
In Proceedings of the 4th IFIP WG 2.13 International Conference on Open Source Systems (OSS \u201908). 159\u2013170.\n\n[42] Janice Singer, Susan E Sim, and Timothy C Lethbridge. 2008. Software engineering: data collection for field studies. In Guide to Advanced Empirical Software Engineering. Springer London, 9\u201334.\n\n[43] Manuel Sojer and Joachim Henkel. 2010. Code Reuse in Open Source Software Development: Quantitative Evidence, Drivers, and Impediments. Journal of the Association for Information Systems 11, 12 (2010), 868\u2013901.\n\n[44] Jason Tsay, Laura Dabbish, and James Herbsleb. 2014. Influence of Social and Technical Factors for Evaluating Contribution in GitHub. In Proceedings of the 36th International Conference on Software Engineering (ICSE \u201914). ACM, 356\u2013366.\n\n[45] Chris Williams. 2016. How one developer just broke Node, Babel and thousands of projects in 11 lines of JavaScript. http://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos. (March 2016). (accessed on 08/24/2016).\n\n[46] Erik Wittern, Philippe Suter, and Shriram Rajagopalan. 2016. A Look at the Dynamics of the JavaScript Package Ecosystem. In Proceedings of the 13th International Conference on Mining Software Repositories (MSR \u201916). ACM, 351\u2013361.\n\n[47] Dan Zambonini. 2011. Testing and deployment. In A Practical Guide to Web App Success, Owen Gregory (Ed.). Five Simple Steps, Chapter 20. (accessed on 02/02/2017).\n\n[48] Jiaxin Zhu, Minghui Zhou, and Audris Mockus. 2014. Patterns of Folder Use and Project Popularity: A Case Study of GitHub Repositories. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM \u201914). ACM, Article 30, 4 pages.", "source": "olmocr", "added": "2025-06-23", "created": "2025-06-23", "metadata": {"Source-File": "/home/nws8519/git/adaptation-slr/studies_pdfs/023_abdalkareem.pdf", "olmocr-version": "0.1.76", "pdf-total-pages": 11, "total-input-tokens": 36685, "total-output-tokens": 16644, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 4817, 1], [4817, 11524, 2], [11524, 17427, 3], [17427, 23887, 4], [23887, 29040, 5], [29040, 35172, 6], [35172, 40865, 7], [40865, 45790, 8], [45790, 52694, 9], [52694, 59551, 10], [59551, 69826, 11]]}}
{"id": "4fc20f84bd90809d954030810b3d5522acfb5d14", "text": "How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub\n\nShurui Zhou \nCarnegie Mellon University, USA\n\nBogdan Vasilescu \nCarnegie Mellon University, USA\n\nChristian K\u00e4stner \nCarnegie Mellon University, USA\n\nABSTRACT\n\nThe notion of forking has changed with the rise of distributed version control systems and social coding environments, like GitHub. Traditionally forking refers to splitting off an independent development branch (which we call hard forks); research on hard forks, conducted mostly in pre-GitHub days showed that hard forks were often seen critical as they may fragment a community. Today, in social coding environments, open-source developers are encouraged to fork a project in order to contribute to the community (which we call social forks), which may have also influenced perceptions and practices around hard forks. To revisit hard forks, we identify, study, and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find that, among others, hard forks often evolve out of social forks rather than being planned deliberately and that perception about hard forks have indeed changed dramatically, seeing them often as a positive non-competitive alternative to the original project.\n\nACM Reference Format:\nShurui Zhou, Bogdan Vasilescu, and Christian K\u00e4stner. 2020. How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub. In 42nd International Conference on Software Engineering (ICSE \u201920), May 23\u201329, 2020, Seoul, Republic of Korea. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3377811.3380412\n\n1 INTRODUCTION\n\nThe notion of forking in open-source has evolved: Traditionally, forking was the practice of copying a repository and splitting off new independent development, often under a new name; forking was rare and was typically intended to compete with or supersede the original project [15, 30, 32]. Nowadays, forks in distributed version control systems are public copies of repositories in which developers can make changes, potentially, but not necessarily, with the intention of integrating those changes back into the original repository.\n\nWith the rise of social coding and explicit support in (distributed) version control systems, forking of repositories has been explicitly promoted by sites like GitHub, Bitbucket, and GitLab, and has indeed become very popular [19, 34]. For example, we identified over 114,120 GitHub projects with more than 50 forks, and over 9,164 projects with more than 500 forks as of June 2019, with numbers rising quickly. However, most of these modern forks are not forks in the traditional sense. As in our prior work [53], we distinguish between social forks, referring to creating a public copy of a repository on a social coding site like GitHub, often with the goal of contributing to the original project, and hard forks, referring to the traditional notion of splitting off a new development branch.\n\nHard forks have been discussed controversially throughout the history of free and open-source software: On the one hand, free and open-source licenses codified the right to create hard forks, which was seen as essential for guaranteeing flexibility and fostering disruptive innovations [15, 30, 32] and useful for encouraging a survival-of-the-fittest model [48]. 
On the other hand, hard forks were frequently considered as risky to projects, since they could fragment a community and lead to confusion for both developers and users [15, 26, 30, 36], and there was a strong norm against forking; many well known hard forks exist (e.g., LibreOffice, Jenkins, io.js; see Fig. 1), but there are not many well known cases where both communities survived and are both healthy after a hard fork, with a prominent exception being the BSD variants.\n\nPrior research into forking of free and open-source projects focused on the motivations behind hard forks [8, 12, 13, 26, 31, 39, 47], the controversial perceptions around hard forks [6, 15, 26, 30, 36, 49], and the outcomes of hard forks (including studying factors that influence such outcomes) [39, 49]. However, essentially all that research has been conducted before the rise of social coding, much of it on SourceForge (GitHub was launched in 2008 and became the dominant open-source hosting site around 2012; cf. Fig 1).\n\nIn this paper, we argue that perceptions and practices around forking could have changed significantly since SourceForge\u2019s heydays. In contrast to the strong norm against forking back then, we conjecture that the promotion of social forks on sites like GitHub, and the often blurry line between social and hard forks, may have encouraged forking and lowered the bar also for hard forks. At the same time, advances in tooling, especially distributed version control systems like Git [7] and transparency mechanisms on social coding sites [10], may have enabled new opportunities and\n\nFigure 1: Timeline of some popular open-source forking events; popularity approximated with Google Trends.\nchanged common practices and perceptions. The professionalization of open-source development and the increasing involvement of corporations or even corporate ownership of open-source projects may have further tilted perceptions.\n\nTherefore, we argue that it is time to revisit, replicate, and extend research on hard forks, asking the central question of this work: **How have perceptions and practices around hard forks changed?** Updating and deepening our understanding regarding practices and perceptions around hard forks can inform the design of better tools and management strategies to facilitate efficient collaboration. Furthermore, we attempt to automate the process of identifying hard forks among social forks and quantifying how frequent hard forks are across GitHub, which previous research did not cover.\n\nUsing a mixed-methods empirical design, combining repository mining with 18 developer interviews, we investigate:\n\n- **Frequency of hard forks:** We attempt to quantify the frequency of hard forks among all the (mostly social) forks on GitHub. Specifically, we design and refine a classifier to automatically detect hard forks. We find 15,306 instances, showing that hard forks are a significant concern, even though their relative numbers are low.\n\n- **Common evolution patterns of hard forks:** We classify the evolution of hard forks and their corresponding upstream repository to observe outcomes, including whether the fork and upstream repositories both sustain their activities and whether they synchronize their development. We develop our classification by visualizing and qualitatively analyzing evolution patterns (using card sorting) and subsequently automate the classification process to analyze all detected hard forks. 
We find that many hard forks are sustained for extended periods and that a substantial number of hard forks still at least occasionally exchange commits with the upstream repository.

- **Perceptions of hard forks:** In interviews with 18 open-source maintainers of forks and corresponding upstream repositories, we solicit practices and perceptions regarding hard forks and analyze whether those align with ones reported in pre-social-coding research. We find that the 'stigma' often reported around hard forks is largely gone; indeed, forks, including hard forks, are generally seen as positive, with many hard forks complementing rather than competing with the upstream repository. Furthermore, with social forking encouraging forks as a contribution mechanism, we find that many hard forks are not deliberately planned but evolve slowly from social forks.

Overall, we contribute (1) a method to identify hard forks, (2) a dataset of 15,306 hard forks on GitHub, (3) a classification and analysis of evolution patterns of hard forks, and (4) results from interviews with 18 open source developers about the reasons for hard forks, interactions across forks, and perceptions of hard forks.

Our research focuses on development practices on GitHub, which is by far the dominant open-source hosting platform (cf. Fig. 1) and has been key in establishing the social forking phenomenon. Even large projects primarily hosted on other sites often have a public mirror on GitHub, allowing us to gather a fairly representative picture of the entire open-source community. Our main research instruments are semi-structured interviews with open-ended questions and repository mining with the GitHub API. While our research is not planned as an exact replication of prior work and exceeds the scope of prior studies by comparing social and hard forks, many facets seek to replicate prior findings (e.g., regarding motivations and outcomes of hard forks) and can be considered a conceptual replication [24, 43].

## 2 PAST RESEARCH ON FORKING

### 2.1 Types of Forking

What is popularly understood by 'forking a project' has changed over the last decades; in line with our prior work [53], we distinguish **hard forks** and **social forks**:

- **Hard forks:** Traditionally, forking refers to copying a project in order to continue a separate, often competing line of development; the name and the direction of the project also typically change. Developers might fork a project, e.g., when they are unhappy with the direction or governance, deciding to create a divergent version more in line with their own vision [15]. In pre-GitHub days, ways to contribute to an open-source project varied widely, but rather than using public forks one would typically create local copies to make changes and then send those as patch files.

- **Social forks:** Popularized through GitHub, 'forking' now also refers to public copies of open-source repositories that are often created for short-term feature implementation, often with the intention of contributing back to the upstream repository. A fork on GitHub is thus typically not intended to start an independent development line, but serves as a uniform mechanism for distributed development and third-party contribution (i.e., pull requests) [10, 19].
In fact, the forking function on GitHub is frequently used even just as a bookmarking mechanism to keep a copy of a project without intention of performing any changes [25].\n\nOn GitHub, nowadays, both forms of forking exist, and we conjecture that the vast majority of forks are social forks, however it is not obvious how to distinguish the two kinds without a closer analysis.\n\nAt a technical level, forks can be created by cloning of repositories in distributed version control systems, in which case the fork maintains the history of the upstream project, or simply by copying files over and starting a new history (the latter was more common in pre-GitHub days). If forks are created directly on GitHub, a clone is automatically created, and GitHub tracks and visually shows the relationship between fork and upstream projects.\n\nThere is significant research on both hard forks and social forks. The hard-forking research is typically older, conducted almost exclusively before GitHub and social forking. Research on social forking is more recent, but focuses much more on the contribution process and issues around managing contributions in a single project.\n\n### 2.2 Motivations for Forking\n\nReasons why developers might create a hard fork of an existing open-source project vary widely. Motivations for such forks have been studied primarily on SourceForge before the advent of social coding environment [8, 12, 13, 26, 31, 39, 47]. As per Robles and\nGonz\u00e1lez-Barahona [39], the most common motivations for hard forks were:\n\n- **Technical.** Variants targeting specific needs or user segments that are not accommodated by the upstream project are the most common motivation [31]. As a project grows and matures, the contributors\u2019 goals or perspectives may diverge, and some may want to take the project in a different direction. If taken to the extreme, hard forks can be used for variant management, in which multiple related but different projects originating from the same source are maintained separately [3, 13, 14, 45].\n\n- **Governance disputes.** Some contributors created hard forks when they feel their feedback is not heard or maintainers are accepting patches too slowly in the original project. A hard fork, or even just the threat of creating one, can help developers negotiate in governance disputes [17]; recent examples of hard forks caused by governance disputes include Node.js [42, 50] and Docker [51]. Other common forms of disputes occur when companies are involved and try to influence the direction of the project or try to close-source or monetize future versions of the project, as with Hudson and OpenOffice.\n\n- **Discontinuation of the original project.** A hard fork can revive a project when the original developers have ceased to work on it. For example, back in the 1990s, the Apache web server project took over for the abandoned NCSA HTTPd project.\n\n- **Commercial forks.** Companies sometimes fork open-source projects to create their own branded version of the project, sometimes enhanced with closed-source features. An example is Apple\u2019s fork of KDE\u2019s KHTML rendering engine as Webkit.\n\n- **Legal reasons.** A project might consider different licenses, a trademark dispute may arise, or changes in laws (e.g., regarding encryption) require technical changes. 
Hard forks can be used to split development for different jurisdictions.\n\n- **Personal reasons.** Interpersonal disputes and irreconcilable differences of a non-technical nature lead to a rift between various parties, so the project forks. OpenBSD is a classic example.\n\nIn contrast to the older work on hard forks, more recent work has also investigated the motivation and practices behind social forks. For example, Fung et al. [16] report that only 14 percent of all active forks of nine popular JavaScript projects integrated back any changes. Subsequently, researchers studied social forks at larger scale and reported that around 50 percent of forks on GitHub never integrate code changes back [23, 53]. In addition, Jiang et al. [23] reported that 10 percent of their study participants used forks for backup purposes.\n\nIn our study, we revisit the question about the motivation for hard forks and explore whether they have changed with the rise of social coding.\n\n### 2.3 Outcomes of Hard Forks\n\nWheeler [49] and Robles and Gonz\u00e1lez-Barahona [39] distinguish five possible outcomes of hard forks:\n\n- **Successful branching, typically with differentiation.** Both the original project and the fork succeed and remain active for a prolonged period of time, fragmenting the community into smaller subcommunities. The BSD variants are notable examples.\n\n- **Fork merges back into the upstream project.** The fork does not sustain independence but merges changes back into the upstream project, e.g., after resolving a dispute that triggered the hard fork in the first place, as in the io.js fork of Node.js [50].\n\n- **Discontinuation of the fork.** The fork is initially active, but does not sustain its activity. For example, when libc split off from glibc, the glibc maintainers invested in improvements to win back users and the fork failed.\n\n- **Discontinuation of the upstream project.** The fork outperforms the upstream project such that the upstream project is discontinued (or the fork revives an already dead upstream project). For example, XFree86 moved away from a GPL-compatible license, so the project forked and created X.org, which was quickly adopted by most developers and users; soon after, the XFree86 core team disbanded and development ceased on the project.\n\n- **Both fail.** Both projects fail (or the fork fails to revive a dead project).\n\nWheeler [49] conjectured that it is rare for both the fork and the upstream project to sustain activities. Robles and Gonz\u00e1lez-Barahona [39] quantified the frequency of each outcome in a sample of 220 forked open-source projects referenced from Wikipedia in 2011 (i.e., selection biased toward well-known projects that have achieved a certain level of success) and found that successful branching was most common (43.6%), followed by discontinuation of the fork (29.8%) and discontinuation of the upstream project (13.8%); failure of both and merges were relatively rare (8.7% and 3.2%).\n\n### 2.4 Pros and Cons of Hard Forks\n\nHard forks have long been discussed controversially. In the 90s and 2000s, forking was seen as an important right but also as something to avoid if at all possible, unless it is a last resort. There was a strong norm against forking, as it fragments communities and can cause hard feelings for the people involved. The free software movement has traditionally seen forking as something to avoid: forks split the community, introduce duplicate effort, reduce communication, and may produce incompatibilities [39]. 
Specifically, it can tear a community apart, meaning people in the community have to pick sides [6, 15, 26, 30, 36, 49]. Such fragmentation can also threaten the sustainability of open-source projects, as scarce resources are additionally scattered and changes need to be performed redundantly across multiple projects; e.g., the 3D printer firmware Marlin fixed an issue (PR #10119) two years after the same problem was fixed in its hard fork Ultimaker (PR #118). At the same time, the right to forking is also seen as an important political tool of the community: The threat of a fork alone can cause project leaders to pay attention to issues they may ignore otherwise, should those issues actually be important and potentially improve their current practices [49].\n\nIn contrast, social forks are seen as something almost exclusively positive and are actively encouraged [4]. They are a mechanism to contribute to a project, and most open-source projects actively embrace external contributors [19, 46]. Although some maintainers complain about the burden of dealing with so many third-party contributions [21, 46] and some researchers warn about inefficiencies regarding lost contributions or duplicate work [38, 52, 53], we are not aware of any calls to constrain social forking.\nImportantly though, as our study will show, the distinction between social and hard forks is fluent. Social coding platforms contain both kinds of forks and they are not always easy to distinguish. Diffusion of efforts and fragmentation of communities, as always feared in discussions of hard forks, can be observed also on GitHub. Many secondary forks (i.e., forks of forks) contribute to other forks, but not to the original repository, and forks slowly drift apart [16, 45]. A key question is, thus, whether the popularity of social forking encourages also hard forks and causes similar fragmentation and sustainability challenges feared in the past.\n\nWe believe it is necessary to revisit hard forking after the rise of social coding and GitHub. Specifically, we aim to understand the hard-fork phenomenon in a current social-forking environment, and understand how perceptions and practices may have changed.\n\n3 RESEARCH QUESTIONS AND METHODS\n\nAs described in Sec. 2, the conventional use of the term forking as well as corresponding tooling have changed with the rise of distributed version control and social coding platforms, and we conjecture that this also influenced hard forks. Hence, our overall research question is How have perceptions and practices around hard forks changed?\n\nWe explore different facets of hard forks, including motivations, outcomes, and perceived stigma (cf. Sec. 2). We also attempt to identify how frequent hard forks are across GitHub, and discuss how developers navigate the tension and often blurry line between social and hard forks. We adopt a concurrent mixed-method exploratory research strategy [9], in which we combine repository mining \u2013 to identify hard forks and their outcomes \u2013 with interviews of maintainers of both forks and upstream projects \u2013 to explore motivations and perceptions. Mixing multiple methods allows us to explore the research question simultaneously from multiple facets and to triangulate some results. In addition, we use some results of repository mining to guide the selection of interviewees.\n\nWe explicitly decided against an exact replication [24, 43] of prior work, because contexts have changed significantly. 
Instead, we guide our research by previously explored facets of hard forks, revisit those as part of our repository mining and interviews, and contrast our findings with those reported in pre-GitHub studies. In addition, we do not limit our research to previously explored facets, but explicitly explore new facets, such as the tension between social and hard forks, that have emerged from technology changes or that we discovered in our interviews.

### 3.1 Instrument for Visualizing Fork Activities

We created commit history graphs, a custom visualization of commit activities in forks, as illustrated in Figure 2, to help develop and debug our classifiers (Sec. 3.2 and 3.3), but also to prepare for interviews. Given a pair of a fork and corresponding upstream repositories, we clone both and analyze the joint commit graph between the two, assigning every commit to one of five states: (1) created before the forking point, (2) only upstream (not synchronized), (3) only in fork (unmerged), (4) created upstream but synchronized to the fork, and (5) created in the fork but merged into upstream. Technically, in a nutshell, we build on our prior commit graph analysis [53], where merge edges are assigned weight 1 and all other edges weight 0, and the shortest path from the commit to any branch in either the fork or the upstream repository identifies where the commit originates and whether it has been merged (and in which direction).¹

We subsequently plot activities in the two repositories over time, aggregated in three-month intervals; larger dots indicate more commits. In these plots, we include additional arrows for synchronization (from upstream into the fork) and merge (from fork to upstream) activities. With these plots, we can quickly visually inspect development activities before and after the forking point as well as whether the fork and the upstream repository interact.

¹ There are a few nuances in the process due to technicalities of Git and GitHub. For example, if the upstream repository deletes a branch after forking, the joint commit graph would identify the code as exclusive to the fork; to that end, we discard commits that are older than the forking timestamp on GitHub. Such details are available in our open-source implementation (https://github.com/shuiblue/VisualHardFork).
### 3.2 Identifying Hard Forks

Identifying hard forks reliably is challenging. Pre-GitHub work often used keyword searches in project descriptions, e.g., 'software fork', or relied on external curated sources (e.g., Wikipedia) [39]. Today, on sites like GitHub, hard forks use the same mechanisms as social forks without any explicit distinction.

**Classifier development.** For this work, we want to gather a large set of hard forks and even approximate the frequency of hard forks among all 47 million forks on GitHub. To that end, we need a scalable, automated classifier. We are not aware of any existing classifier except our own prior work [53], in which we classified forks as hard forks if they have at least two own pull requests or at least 100 own, unmerged commits and the project's name has been changed. Unfortunately, we found that this classifier missed many actual hard forks (false negatives), thus we went back to the drawing board to develop a new one.

We proceeded iteratively, repeatedly trying, validating, and combining various heuristics. That is, we would try a heuristic to detect hard forks and manually sample a significant number of classified forks to identify false positives and false negatives, revising the heuristic or combining it with other steps. Commit history graphs (cf. Sec. 3.1) and our qualitative analysis of forks (Sec. 3.3 below) were useful debugging devices in the process. We iterated until we reached confidence in the results and a low rate of false positives.

Our final classifier proceeds in two steps: first, we use multiple simple heuristics to identify candidate hard forks; second, we use a more detailed and more expensive analysis to decide which of those candidates are actual hard forks.

In the first step, we identify as candidate hard forks, among all repositories labeled as forks on GitHub, those that:

- **Contain the phrase "fork of" in their description** (H1). We use GitHub's search API to find all repositories that contain the phrase "fork of" in their project description and are a fork of another project. The idea, inspired by prior work [31], is to look for projects that explicitly label themselves as forks (defined as "self-proclaimed forks"), i.e., developers explicitly change their description after cloning the upstream repository. To work around GitHub's API search limit of 1000 results per query, we partitioned the query based on different time ranges in which the repository was created. Next, we compare the description of the fork and its upstream project to make sure the description is not copied from the upstream, i.e., that the upstream project is not already a self-proclaimed fork.

- **Received external pull requests** (H2). Using the June 2019 GHTorrent dataset [18], we identified all GitHub repositories that are labeled as forks and have received at least three pull requests (excluding pull requests issued by the fork's owner to avoid counting developers who use a process with feature branches). We consider external contributions to a fork as a signal that the fork may have attracted its own community.

- **Have substantial unmerged changes** (H3). Using the same GHTorrent dataset, we identify all forks that have at least 100 own commits, indicating significant development activities beyond what is typical for social forks.

- **Have at least one year of development activity** (H4). Similar to the previous heuristic, we look for prolonged development activities beyond what is common for social forks. Specifically, we identify those forks as candidates in which the time between the first and the last commit spans more than one year.

- **Have changed their name** (H5). We check if the fork's name on GitHub has been changed from the upstream repository's name (with Levenshtein distance $\geq 3$). This heuristic comes from the observation that most social forks do not change names, but that forks intending to go in a different direction and create a separate community tend to change names more commonly (e.g., Jenkins forked Hudson).

Each repository that meets at least one of these criteria is considered a candidate. We show how many candidates each heuristic identified in the second column of Fig. 3b. Note that for all heuristics that use GHTorrent, we additionally validated the results by checking whether the fork and upstream pair still exist on GitHub and whether the measures align with those reported by the GitHub API.
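For illustration, the following is a hedged sketch of how these candidate heuristics might be combined once the per-repository measures have been collected from GHTorrent and the GitHub API. The record fields and helper functions are hypothetical simplifications and do not reflect the authors' code.

```python
from dataclasses import dataclass

@dataclass
class ForkInfo:
    """Hypothetical per-fork record assembled from GHTorrent / GitHub API data."""
    description: str = ""
    upstream_description: str = ""
    external_pull_requests: int = 0   # PRs not opened by the fork owner
    own_commits: int = 0              # commits made after the forking point
    activity_span_days: int = 0       # days between first and last own commit
    name: str = ""
    upstream_name: str = ""

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_candidate_hard_fork(f: ForkInfo) -> bool:
    """A fork becomes a candidate if it matches at least one of H1-H5."""
    h1 = "fork of" in f.description.lower() and f.description != f.upstream_description
    h2 = f.external_pull_requests >= 3
    h3 = f.own_commits >= 100
    h4 = f.activity_span_days > 365
    h5 = levenshtein(f.name.lower(), f.upstream_name.lower()) >= 3
    return any((h1, h2, h3, h4, h5))
```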
In line with prior work [25, 53], we remove repositories using GitHub for document storage or course project submission – some of which are among the most forked projects on GitHub. Specifically, after manual review, we discard repositories containing the keywords 'homework', 'assignments', 'course', 'codecamp', or 'documents' in their description; we discard repositories whose name starts with 'awesome-' (usually document collections); and we discard repositories with no programming-language-specific files (as per GitHub's language classification queried through the API).

- We discard candidates with fewer than three stars on GitHub. Stars are a lightweight mechanism for developers to indicate their interest in a project and a common measure of popularity. A threshold of three stars is very low, but still requires a minimum amount of public interest. According to GHTorrent data, of the 125 million GitHub repositories, 2 million repositories (1.6%) have three or more stars.

- We discard candidates without any own commits after the fork, typically projects that only performed a name change as the single post-fork action.

- We discard candidates in which 30% or more of all commits in the fork have been merged upstream, which indicates social forks with active contributions to the upstream project.

- For candidates identified with 100 commits or 1 year of activity, we discard those where the thresholds are not met when considering only unmerged commits exclusive to the fork.

- We discard candidates owned by developers who contributed more than 30% of the commits or pull requests of the upstream repository, which typically indicates core team members of the upstream project using social forks for feature development.

- We discard candidates in which the fork was created right after the upstream stopped updating and the fork is owned by an organization account while the upstream is owned by a user account. This is a common pattern we observed, indicating an ownership transfer.

Our classifier identifies a total of 15,306 hard forks across GitHub. Fig. 3b shows how many candidate and actual hard forks each heuristic identified; Fig. 3a shows the overlap between the different heuristics.

Fig. 3b: Candidate and actual hard forks identified per heuristic.

| Rule | Candidates | Actual |
|-------|------------|--------|
| H1 | 10,609 | 551 |
| H2 | 23,109 | 7,043 |
| H3 | 14,956 | 810 |
| H4 | 33,073 | 11,268 |
| H5 | 20,358 | 5,568 |
| Total | 63,314 | 15,306 |

**Classifier validation.** To validate the precision of our classifier, we manually inspected a random sample of 300 detected hard forks. By manually analyzing the fork's and the upstream repository's history and commit messages, we classified 14 detected hard forks as likely false positives, suggesting an acceptable accuracy of 95%. Note that manual labeling is a best-effort approach as well, as the distinction between social and hard fork is not always clear (see also our discussion of interview results in Sec. 4.4).

Analyzing false negatives (recall) is challenging, because hard forks are rare, projects listed in previous papers are too old to detect in our GitHub dataset, and we are not aware of any other labeled dataset. We have manually curated a list of known hard forks from mentions in web resources and from mentions during our interviews.
Of the 3 hard forks of which both the fork and the upstream repository are on GitHub, we detect all with our classifier, but the size of our labeled dataset is too small to make meaningful inferences about recall.\n\n### 3.3 Classifying Evolution Patterns\n\nWe identified different evolution patterns among the analyzed forks using an iterative approach inspired by card sorting [44]. Evolution patterns describe how a hard fork and the corresponding upstream project coevolve and can help to characterize forking outcomes. In addition, we used evolution patterns to diversify interviewees.\n\nSpecifically, we printed cards with commit history graphs of 100 randomly selected hard forks (see Sec. 3.2), then all three authors jointly grouped the cards and identified a few common patterns. Our card-sorting was open, meaning we had no predefined groups; the groups emerged and evolved during the analysis process. Afterward, we manually built a classifier that detects the forks for each identified pattern. We then applied this classifier to the entire dataset, inspected that the automatically classified forks actually fit the patterns as intended (refining the classifier and its thresholds if needed). We then picked another 100 hard forks that fit none of the previously defined patterns and sorted those again, looking for additional patterns. We similarly proceeded within each pattern, looking at 100 hard forks to see whether we can further split the pattern. We repeated this process until we could not identify any further patterns.\n\nAfter several iterations, we arrived at a stable list of 15 patterns with which we could classify 97.7% of all hard forks. We list all patterns with a corresponding example commit history graph in Tab. 2. The patterns use characteristics that relate to previously found outcomes, such as fork or upstream being discontinued, but also consider additional characteristics corresponding to features that were not available or easily observable before distributed version control, e.g., whether the fork and upstream merge or synchronize. We present the patterns in a hierarchical form, because our process revealed a classification with a fairly obvious tree structure, not because we were specifically looking for a hierarchical structure.\n\n### 3.4 Interviews\n\nTo solicit views and perceptions, we conducted 18 semi-structured interviews with developers, typically 20 to 40 minutes. Despite reaching fewer developers, we opted for interviews rather than surveys due to the exploratory nature of our research: Interviews allow more in-depth exploration of emerging themes.\n\n**Interview protocol.** We designed a protocol [2] that covers the relevant dimensions from earlier research and touches on expected changes, including reasons for forking, perceived stigma of forking, and the distinction and possible tensions between social and hard forks. We asked fork owners about their decision process that lead to the hard fork, their practices afterward (e.g., why they renamed the projects), their current relationship to the upstream project (e.g., whether they still monitor or even synchronize), and their future plans. In contrast, we asked owners of upstream projects to what extent they are aware of, interact with, or monitor hard forks; and to what degree they are concerned about such forks or even take steps to avoid them. 
In addition, we asked all participants with a long history of open-source activity if they observed any changes in their practices or perceptions and that of others over time.\n\nAll interviews were semi-structured, allowing for exploration of topics that were brought up by the participants. Our interview protocol evolved with each interview, as we reacted to confusion about questions and insights found in earlier interviews. That is, we refined and added questions to explore new insights in more detail in subsequent interviews \u2013 for example, after the first few interviews we added questions about the tradeoff between being inclusive to changes versus risking hard forks and questions regarding practices and tooling to coordinate across repositories. To ground each interview in concrete experience rather than vague generalizations, we focused each interview on a single repository in which the interviewee was involved, bringing questions back to that specific repository if the discussion became too generic.\n\n**Participant recruitment.** We selected potential interviewees among the maintainers of the 15,306 identified hard forks and corresponding upstream repositories. We did consider maintainers with public email address on their GitHub profile that were active in the analyzed repositories within the last 2 years (to reduce the risk of misremembering). We sampled candidates from all evolution patterns (Sec. 3.3) and sent out 242 invitation emails.\n\nOverall, 18 maintainers volunteered to participate in our study (7% response rate). Ten opted to be interviewed over email, one\n\n---\n\n### Table 1: Background information of participants.\n\n| Par. | Domain | #Stars(U) | #Stars(F) | LOC | Role | Exp.(yr) |\n|------|-----------------|-----------|-----------|-----|------|----------|\n| P1 | Blockchain | <20 | <10 | 10K | F | 19 |\n| P2 | Reinforcement learning | 10K | 1K | 30K | F | 3 |\n| P3 | Mobile processing | - | 70 | 20K | F | 6 |\n| P4 | Video recording | - | 100 | 300K| F | 18 |\n| P5 | Helpdesk system | 2K | <10 | 800K| F | 5 |\n| P6 | CRM system | 30 | 200 | 800K| F | 10 |\n| P7 | Physics engine | - | 300 | 100K| F | 15 |\n| P8 | Social platform | 500 | 230 | 500K| F | 20 |\n| P9 | Reinforcement learning | <20 | <20 | 30K | 2nd-F| 3 |\n| P10 | Game Engine | 500 | <10 | 200K| 2nd-F| 21 |\n| P11 | Networking | 300 | 100 | 500K| F | 10 |\n| P12 | Email library | - | 10K | 20K | F/U | 32 |\n| P13 | Game engine | 3K | 70 | 20K | F | 11 |\n| P14 | Machine learning| 30K | 50 | 60K | F | 8 |\n| P15 | Image editing | 70 | <10 | 20K | F | 20 |\n| P16 | Image editing | 70 | <10 | 20K | U | 10 |\n| P17 | Microcontrollers| 9K | 1K | 300K| U | 6 |\n| P18 | Maps | 400 | <10 | 100K| U | 9 |\n\nF: Hard Fork Owner; U: Upstream Maintainer; 2nd-F: Fork of the Hard Fork\n\n*Some of the upstream projects are not in GitHub, so the number of stars is unknown. Numbers rounded to one significant digit.\nthrough a chat app, and all others over phone or teleconferencing. In Table 2, we map our interviewees to the evolution pattern for the primary fork discussed (though interviewees may have multiple roles in different projects). Naturally, our interviewees are biased toward hard forks that are still active. Our response rate was also lower among maintainers of upstream repositories, who were maybe less invested in talking about forking. In Table 1, we list information about our interviewees and the primary hard fork we discussed. 
All interviewees are experienced open-source developers; many have more than 10 years of experience participating in open-source communities, meaning they have also interacted with earlier open-source platforms such as SourceForge. Our interviews reached saturation, in that the last interviews provided only marginal additional insights.

**Analysis.** We analyzed the interviews using standard qualitative research methods [41]. After transcribing all interviews, two authors coded the interviews independently; all authors subsequently discussed emerging topics and trends. Questions and disagreements were discussed and resolved together, asking follow-up questions to some interviewees if needed.

### 3.5 Threats to Validity and Credibility

Our study exhibits the threats to validity and credibility that are typical and expected of this kind of exploratory interview study and of the accompanying analysis of archival GitHub data.

Distinguishing between social and hard forks is difficult, even for human raters, as the distinction is primarily one of intention. In our experience, we can make a judgment call with high inter-rater reliability for most forks, but there are always some repositories that cannot be accurately classified without additional information. We build and evaluate our classifiers based on a best-effort strategy, as discussed.

While we check later steps with data from the GitHub API, early steps to identify candidate hard forks may be affected by missing or incorrect data in the GHTorrent dataset. In addition, the history of Git repositories is not reliable, as timestamps may be incorrect and users can rewrite histories after the fact. Furthermore, merges are difficult to track if code changes are merged as a new commit or through ‘squashing’ rather than through a traditional merge commit. As a consequence, despite best efforts, there will be inaccuracies in our classification of hard forks and individual commits, which we expect will lead to some underreporting of hard forks and of merged code.

We analyze right-censored time-series data, in which we can detect that projects have ceased activity in the past but cannot predict the future; older forks therefore have a larger chance of being observed as discontinued.

Our study is limited to hard forks of which both fork and upstream repository are hosted on GitHub and of which the forking relationship is tracked by GitHub. While GitHub is by far the most dominant hosting service for open source, our study does not cover forks created of (typically older) projects hosted elsewhere and forks created by manually cloning or copying source code to a new repository. In addition, our interviews, as is typical for interview studies in our field, are biased toward answers from developers who chose to make their email public and who chose to respond to our interview request, which underrepresents maintainers of upstream repositories in our sample.

### 4 RESULTS

We explore practices and perceptions around hard forks along four facets that emerged from our interviews and data.

#### 4.1 Frequency of Hard Forks

Our classifier identified 15,306 hard forks, confirming that hard forks are generally a rare phenomenon. As of June 2019, GitHub tracks 47 million repositories that are marked as forks of over 5 million distinct upstream repositories, among GitHub’s over 125 million repositories.

Among those, the vast majority of forks have no activity after the forking point and no stars.
Most active forks have only very limited activity, indicative of social forks. Only 0.2% of GitHub’s 47 million forks have 3 or more stars.

As our analysis of evolution patterns (Tab. 2) reveals, cases where both the upstream repository and the hard fork remain active for extended periods of time are not common (patterns 1, 2, and 4–7; 1157 hard forks, 8.8%). Most hard forks actually survive the upstream project, if the upstream project was active when the fork was created (patterns 8–11; 7280 hard forks, 47.6%), but many also run out of steam eventually (patterns 3 and 12–15; 6671 hard forks, 43.6%).

While most hard forks are created as forks of active projects (patterns 4–15; 14254 hard forks, 93%), there are a substantial number of cases where hard forks are created to revive a dead project (patterns 1–3; 1052 hard forks, 6.8%), in some cases even triggering or coinciding with a revival of the upstream project (pattern 2; 56 hard forks, 0.36%); but here, too, not all hard forks sustain activity (pattern 3; 420 hard forks, 2.7%).

**Discussion and implications.** Even though the percentage of hard forks is low, the total number of attempted and sustained hard forks is not. Considering the significant cost a hard fork can put on a community through fragmentation, but also the potential power a community has through hard forks, we argue that hard forks are an important phenomenon to study even when they are comparably rare.

Whereas previous work typically looked at only a small number of hard forks, and research on tooling around hard-fork issues typically focuses on a few well-known projects, such as the variants of BSD [35] or Marlin [28] or artificial or academic variants [14, 22], we have detected a significant number of hard forks, many of them recent and written in many different languages, which form a rich pool for future research. We release the dataset of all hard forks with corresponding visualizations as a dataset with this paper [2].

Table 2: Evolution patterns of hard forks

| Id | Category | Total | Sub-category | Count | Interviewees |
|----|----------|-------|--------------|-------|--------------|
| 1 | Revive dead project: success (fork active > 2 quarters) | 632 | Upstream remains inactive | 576 | P12 |
| 2 | | | Upstream active again | 56 | |
| 3 | Revive dead project: not successful (fork active <= 2 quarters) | 420 | | 420 | |
| 4 | Forking active project: both alive | 723 | only merge | 26 | P10 |
| 5 | | | only sync | 107 | P2, P13, P15 |
| 6 | | | merge & sync | 28 | P9 |
| 7 | | | no interaction | 562 | P1, P3, P4, P5, P7, P14 |
| 8 | Forking active project: fork lived longer | 7280 | only merge | 174 | |
| 9 | | | only sync | 686 | |
| 10 | | | merge & sync | 107 | |
| 11 | | | no interaction | 6313 | P6, P8, P11 |
| 12 | Forking active project: fork does not outlive upstream | 6251 | only merge | 388 | |
| 13 | | | only sync | 762 | |
| 14 | | | merge & sync | 199 | |
| 15 | | | no interaction | 4902 | |

*Example commit-history graphs omitted.*
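To make the pattern definitions concrete, the following minimal JavaScript sketch illustrates the kind of feature-based rules behind Tab. 2 (cf. Sec. 3.3). It is an illustration only, not the classifier we ran; all field names (e.g., `upstreamActiveAtForkTime`, `forkActiveQuarters`) are hypothetical stand-ins for features extracted from the commit histories of fork and upstream.

```javascript
// Illustrative only: rule-based grouping of a hard fork into the coarse
// categories of Tab. 2, given hypothetical activity features.
function classifyEvolutionPattern(fork) {
  const interaction =
    fork.mergedIntoUpstream && fork.syncedFromUpstream ? 'merge & sync'
      : fork.mergedIntoUpstream ? 'only merge'
      : fork.syncedFromUpstream ? 'only sync'
      : 'no interaction';

  if (!fork.upstreamActiveAtForkTime) {           // patterns 1-3: revive a dead project
    if (fork.forkActiveQuarters <= 2) return { group: 'revive: not successful' };
    return fork.upstreamActiveAgain
      ? { group: 'revive: upstream active again' }
      : { group: 'revive: upstream remains inactive' };
  }
  if (fork.forkStillActive && fork.upstreamStillActive)   // patterns 4-7
    return { group: 'both alive', interaction };
  return fork.forkOutlivedUpstream                        // patterns 8-11 vs. 12-15
    ? { group: 'fork lived longer', interaction }
    : { group: 'fork does not outlive upstream', interaction };
}

// Example: a fork of an active project that only synchronizes from upstream.
classifyEvolutionPattern({
  upstreamActiveAtForkTime: true, forkStillActive: true, upstreamStillActive: true,
  mergedIntoUpstream: false, syncedFromUpstream: true,
}); // -> { group: 'both alive', interaction: 'only sync' }
```

The two-quarter activity threshold mirrors the success criterion in Tab. 2; the actual classifier and its thresholds were refined iteratively as described in Sec. 3.3.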
#### 4.2 Why Hard Forks Are Created (And How to Avoid Them)

At first glance, the interviewees give reasons for creating hard forks that align well with prior findings, including especially continuing discontinued projects or projects with unresponsive maintainers (P1, P2, P8), disagreements around project governance (P2, P12), and diverging technical goals or target populations (P3, P5, P6, P11, P13, P14, P17). As discussed, we identified 1052 hard forks (Tab. 2, patterns 1–3, 6.8%) that forked an inactive project.

An interesting common theme that emerged in our interviews, though, was that many hard forks were not deliberately created as hard forks initially. More than half of our interviewees described that they initially created a fork with the intention of contributing to the upstream repository (social fork), but when they faced obstacles they decided to continue on their own. Common obstacles were unresponsive maintainers (P1, P2, P8) and rejected pull requests (P11, P13, P14), typically because the change was considered beyond the scope of the project. For example, P2 described that “before forking, we started by opening issues and pull requests, but there was a lack of response from their part. [We] got some news only 2 months after, when our fork was getting some interest from others.” Similarly, some maintainers reported that a fork initially created for minor personal changes evolved into a hard fork as changes became more elaborate and others found them useful (P2, P14, P17); for example, P14 described that the upstream project had been constantly evolving and the code base quickly became incompatible with some libraries, so he decided to fix this issue while also adding functionality, after which more and more people found his fork and started to migrate.

Several maintainers also had explicit thoughts about how to avoid hard forks (both maintainers of projects that have been forked and fork owners who may themselves be forked), and these largely mirror common reasons for forking, i.e., transparent governance, being responsive, and being inclusive to feature requests. For example, P2 suggests that their project is reactive to the community and thus considers it unlikely to be forked; similarly, P16 decided to generally “respond to issues in a timely manner and make a good faith effort to incorporate PRs and possibly fix issues and add features as the needs arrives” to reduce the need for hard forks. Beyond these, P2 also mentioned that they created a contributing guide and issue templates to coordinate with contributors more efficiently; P14 suggested to “credit the contributors” explicitly in release notes in order to keep contributors in the community.

**Discussion and Implications.** Whereas forking was typically seen as a deliberate decision in pre-GitHub days that required explicit steps to set up a repository for the fork and find a new name, nowadays many hard forks seem to happen without much initial deliberation. Social coding environments actively encourage forking as a contribution mechanism, which significantly lowers the bar to create a fork in the first place without having to think about a new name or potential consequences like fragmenting communities. Once the fork exists (initially created as a social fork), there often seems to be a gradual development until developers explicitly consider their fork a separate development line. In fact, many hard forks seem to be triggered by rather small initial changes. These interview results align with the observation that only about 36% of the detected hard forks on GitHub have changed the project’s name (cf. Fig. 3a).

More importantly, a theme emerged throughout our interviews that hard forks are not likely to be avoidable in general, because of a project’s tension between being specific and being general.
On the one hand, projects that are more inclusive to all community contributions risk becoming so large and broad that they become expensive to maintain (e.g., as P17 suggests, the project maintainers need to take over maintenance of third-party contributions for niche use cases) and difficult to use (e.g., lots of configuration options and too much complexity). On the other hand, projects staying close to their original vision and keeping a narrow scope may remain more focused with a smaller and easier to maintain code base, but they risk alienating users who do not fit that original vision, who then may create hard forks. One could argue that hard forks are a good test bed for contributions that diverge from the original project despite their costs on the community: If fork dies it might suggest a lack of support and that it may have been a good decision not to integrate those contributions in the main project.\n\nIn this context, a family of related projects that serve slightly different needs or target populations but still coordinate may be a way to overcome this specificity-generality dilemma in supporting multiple projects that each are specific to a mission, but together target a significant number of use cases. However, current technology does not support coordination across multiple hard forks well, as we discuss next.\n\n4.3 Interactions between Fork and Upstream Repository\n\nMany interviewees indicate that they are interested in coordinating across repositories, either for merging some or all changes back upstream eventually or to monitor activity in the upstream repository to incorporate select or all changes. Some hard fork owners did not see themselves competing with the upstream project, but rather being part of a larger project. For instance, although fork owner P13 has over 1500 commits ahead of the upstream project, he still said that \u201cI would not consider it independent because I am relying on what they (upstream) are doing. I could make it independent and stop getting their improvements, but it\u2019s to their credit they make it very easy for their many hundreds of developers to contribute patches and accept patches from each other. They regulate what goes into their project very well, and that makes [merging changes] into my fork much easier.\u201d Some (P4 and P11) indicate that they would like to merge, once the reason for the hard fork disappears (typically governance practices or personal disputes). Also upstream maintainers tend to be usually interested in what happens in their forks; for example, P17, a maintainer of a project with thousands of (mostly social) forks, said \u201cI try to be aware of the important forks and try to get to know the person who did the fork. I will follow their activities to some extent.\u201d\n\nHowever, even though many interviewees expressed intentions, we see little evidence of actual synchronization or merging across forks in the repositories: For example, P1, P4, P8, and P11 mention that they are interested in eventually merging back with the upstream repository, but they have not done so yet and do not have any concrete plans at this point. Similarly, P2, P6, and P10 indicate that they are interested in changes in upstream projects, but do not actually monitor them and have not synchronized in a long time. Our evolution patterns similarly show that synchronization (from upstream to fork) and merging (from fork to upstream) are rare. 
Only 16.18% of all hard forks with active upstream repositories ever synchronize or merge (Tab. 2, patterns 4–6, 8–10, and 12–14).

What might explain this difference between intentions and observed actions is that synchronization and merging become difficult once two repositories diverge substantially, and that monitoring repositories can become overwhelming with current tools. For example, P2 reports only occasionally synchronizing minor improvements, because the fork has diverged too much to synchronize larger changes; P10 has experienced problems when synchronizing too frequently, such as being faced with incomplete implementations, and now only selectively synchronizes features of interest. In line with prior observations on monitoring change feeds [5, 10, 33, 52], interviewees report that systematically monitoring changes from other repositories is onerous and that current tools like GitHub’s network graph are difficult to use and do not scale (P11, P16).

**Discussion and Implications.** Tooling has changed significantly since the pre-GitHub days of prior studies on hard forks, which may allow new forms of collaboration across forks: Git specifically supports merges across distributed version histories, as well as selectively integrating changes through a ‘cherry picking’ feature. GitHub and similar social coding sites track forks, allowing developers to subscribe to changes in select repositories, and generally make changes in forks transparent [10, 11, 52]. Essentially all interviewees were familiar with GitHub’s network view [1] that visually shows contributions over time across forks and branches.

Even though advances in tooling provide new opportunities for coordination across multiple forks, and project maintainers are interested in coordinating and in considering multiple forked projects as part of a larger community, current tools do not support this use case well. Current tools work well for short-term social forks but tend to work less well for coordinating changes across repositories that have diverged more significantly.

This provides opportunities for researchers to explore tooling concepts that can monitor, manage, and integrate changes across a family of hard forks. Recent academic tools for improved monitoring [33, 52] or cross-fork change migration [35, 37] are potentially promising but are not yet easily accessible to practitioners. Also, more experimental ideas about virtual product-line platforms that unify development of multiple variants of a project [3, 14, 29, 40, 45] may provide inspiration for maintaining and coordinating hard forks, though they typically do not currently support the distributed nature of development with competing hard forks. A technical solution could solve the specificity-generality dilemma (cf. Sec. 4.2), allowing subcommunities to handle more specific features without overloading the upstream project and without fragmenting the overall community. We believe that our dataset of 15,306 hard forks can be useful to develop and evaluate such tools in a realistic setting.

#### 4.4 Perceptions of Hard Forking

Our discussion with maintainers confirmed that the line between hard forks and social forks is somewhat subjective, but, when prompted, they could draw distinctions that largely mirror our definition (long-term focus, extensive changes, fork with own community).
For example, P2 agrees that his fork is independent from the upstream project because they have different goals, and suggests that the fork has better code quality and better community management practices; the only remaining connection is that he incorporates upstream bug fixes from time to time. P6 likewise considers his fork independent, given a quicker release cycle and significant refactoring of the code base.

For most interviewees, the dominant meaning of a fork is that of a social fork. When asked about perceptions of forks, most interviewees initially thought of social forks and had strong positive associations, e.g., others contributing to a project, onboarding newcomers and finding collaborators, and generally fostering innovation. For instance, P6 described the advantages of social forking as “it encourages developers to go in a direction that the original project may not have gone,” and similarly P9 thought that “it could boost the creative ideas of the communities.” One interviewee also mentioned that for young projects that primarily focus on growth, having been forked is a positive signal, meaning the project is useful to other people. Social forks were so dominant as a default in the interviewees’ minds that we frequently had to refocus the interview on hard forks. When asked specifically about hard forks, several interviewees raised concerns about potential community fragmentation (P4, P6, P17), worried about incompatibilities and especially confusing end users (P3, P9, P14, P17), and would have preferred to see hard-fork owners contribute to the upstream project instead (P3, P8, P12). However, concerns were mostly phrased as hypotheticals and contrasted with positive aspects.

Many interviewed owners of hard forks do not see themselves competing with the upstream repository, as they consider that they address a different problem or target a different user population. For example, P10 described his fork as a “light version” of the upstream project targeting a different group of users.

While it is understandable that hard-fork owners see their forks as justified, some interviewed owners of upstream projects also had positive opinions about such forks. For example, P17 expressed that forks are good if there is a reason (such as a focus on a different target population, in this case beginners), and that those forks may benefit the larger community by bringing in more users to the project; P18 even suggested that he would support forks of his own project by occasionally contributing to them, as long as it benefits the larger community.

**Discussion and Implications.** Overall, we see that the perception of forking has significantly changed compared to perceptions reported in earlier work. Forking used to have a rather negative connotation in pre-GitHub days and was largely regarded as a last resort, to be avoided so as not to fragment the community and confuse users. With GitHub’s rebranding of the word forking, the stigma around hard forking seems to have mostly disappeared; the word has mostly positive connotations for developers, associated positively with external contributors and community. While there is still some concern about community fragmentation, it is rarely a concrete concern if there are actual reasons behind a hard fork.
Transparent tooling seems to help with acceptance and with considering multiple hard forks as part of a larger community that can mutually benefit from each other.\n\nWe expect that a more favorable view, combined with lower technical barriers (Sec. 4.2) and higher expectations of coordination (Sec. 4.3) makes hard forks a phenomenon we should expect to see more of. However, positive expectations can turn into frustration (and disengagement of valuable contributors to sustain open source) if fragmentation leads to competition, confusion, and coordination breakdowns due to insufficient tooling.\n\nWith the right tooling for coordination and merging, we think hard forks can be a powerful tool for exploring new and larger ideas or testing whether there is sufficient support for features and ports for niche requirements or new target audiences (e.g., solving the specificity-generality dilemma discussed in Sec. 4.2 with a deliberate process). To that end though, it is necessary to explicitly understand (some) hard forks as part of a larger community around a project and possibly even explicitly encourage hard forks for specific explorations beyond the usual scope of social forks. We believe that there are many ways to support development with hard forks and to coordinate distributed developers beyond what social coding site offer at small scale today. Examples include (1) an early warning system that alerts upstream maintainers of emerging hard forks (e.g., external bots), which maintainers could use to encourage collaboration over competition and fragmentation if desired, (2) a way to declare the intention behind a fork (e.g., explicit GitHub support) and dashboard to show how multiple projects and important hard forks interrelate (e.g., pointing to hard forks that provide ports for specific operating systems), and (3) means to identify the essence of the novel contributions in forks (e.g., history slicing [27] or code summarization [52]).\n\n5 CONCLUSION\n\nWith the rise of social coding and explicit support in distributed version control systems, forking of repositories has been explicitly promoted by sites like GitHub and has become very popular.\nHowever, most of these modern forks are not hard forks in the traditional sense. In this paper, we automatically detected hard forks and their evolution patterns and interviewed open-source developers of forks and upstream repositories to study perceptions and practices. We found that perceptions and practices have indeed changed significantly: Among others, hard forks often evolve out of social forks rather than being planned deliberately and developers are less concerned about community fragmentation but frequently perceive hard forks a positive noncompetitive alternatives to the original projects. We also outlined challenges and suggested directions for future work.\n\nAcknowledgements. Zhou and K\u00e4stner have been supported in part by the NSF (awards 1552944, 1717022, and 1813598) and AFRL and DARPA (FA8750-16-2-0042). Vasilescu has been supported in part by the NSF (awards 1717415 and 1901311) and the Alfred P. Sloan Foundation.\n\nREFERENCES\n\n[1] 2008. GitHub Network View. https://help.github.com/en/articles/viewing-a-repositorys-network.\n\n[2] 2020. Appendix. https://github.com/shuiblue/ICSE20-hardfork-appendix.\n\n[3] Michal Antkiewicz, Wenbin Ji, Thorsten Berger, Krzysztof Czarnecki, Thomas Schmorleiz, Ralf L\u00e4mmel, \u015etefan St\u0103nciulescu, Andrzej W\u0105sowski, and Ina Schaefer. 2014. 
Flexible Product Line Engineering with a Virtual Platform. In Proc. Int'l Conf. Software Engineering (ICSE). ACM, 532\u2013535.\n\n[4] Matt Asay. 2014. Why you should fork your next open-source project. Blog Post. https://www.techrepublic.com/article/why-you-should-fork-your-next-open-source-project/\n\n[5] Christopher Bogatin, Christian K\u00e4stner, James Herbsleb, and Ferdian Thung. 2016. How to Break an API: Cost Negotiation and Community Values in Three Software Ecosystems. In Proc. Int\u2019l Symposium Foundations of Software Engineering (FSE). ACM, 109\u2013120.\n\n[6] Pete Bratach. 2017. Why Do Open Source Projects Fork? Blog Post. https://thenewstack.io/open-source-projects-fork/\n\n[7] Caius Brindescu, Mihai Codoban, Sergiu Shmaruktiachi, and Danny Dig. 2014. How Do Centralized and Distributed Version Control Systems Impact Software Changes? In Proc. Int\u2019l Conf. Software Engineering (ICSE). ACM, 323\u2013333.\n\n[8] Bee Bee Chua. 2017. A Survey Paper on Open Source Forking Motivation Reasons and Challenges. In 21st Pacific Asia Conference on Information Systems (PACIS). 75.\n\n[9] John W Creswell and J David Creswell. 2017. Research design: Qualitative, quantitative, and mixed methods approaches. Sage publications.\n\n[10] Laura Dabbish, Colleen Stuart, Jason Tsay, and Jim Herbsleb. 2012. Social coding in GitHub: transparency and collaboration in an open software repository. In Proc. Conf. Computer Supported Cooperative Work (CSCW). ACM, 1277\u20131286.\n\n[11] Laura Dabbish, Colleen Stuart, Jason Tsay, and James Herbsleb. 2013. Leveraging transparency. IEEE Software 30, 1 (2013), 37\u201343.\n\n[12] James Dixon. 2009. Forking Protocol: Why, When, and How to Fork an Open Source Project. Blog Post. https://jamesdixon.wordpress.com/2009/05/13/different-kinds-of-open-source-forks-salad-dinner-and-fish/\n\n[13] Neil A Ernst, Steve Easterbrook, and John Mylopoulos. 2010. Code forking in open-source software: a requirements perspective. arXiv preprint arXiv:1004.2889 (2010).\n\n[14] Stefan Fischer, Lukas Linsbauer, Roberto Erick Lopez-Herrejon, and Alexander Egyed. 2014. Enhancing clone-and-own with systematic reuse for developing software variants. In Proc. Int\u2019l Conf. Software Maintenance (ICSM). IEEE, 391\u2013400.\n\n[15] Karl Fogel. 2005. Producing open source software: How to run a successful free software project. O\u2019Reilly Media, Inc.\n\n[16] Kam Hay Fung, Ayb\u00fcke Aurum, and David Tang. 2012. Social Forking in Open Source Software: An Empirical Study. In Proc. Int\u2019l Conf. Advanced Information Systems Engineering (CAiSE) Forum. Citeseer, 50\u201357.\n\n[17] Jonas Gamalielsson and Bj\u00f6rn Lundell. 2014. Sustainability of Open Source Software Communities beyond a Fork: How and Why has the LibreOffice Project Evolved? Journal of Systems and Software 89 (2014), 128\u2013145.\n\n[18] Georgios Gousios. 2013. The GHTorrent dataset and tool suite. In Proc. Working Conf. Mining Software Repositories (MSR). IEEE Press, 233\u2013236.\n\n[19] Georgios Gousios, Martin Pinger, and Arie van Deursen. 2014. An exploratory study of the pull-based software development model. In Proc. Int\u2019l Conf. Software Engineering (ICSE). ACM, 345\u2013355.\n\n[20] Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, and Andy Zaidman. 2014. Lean GHTorrent: GitHub data on demand. In Proc. Working Conf. Mining Software Repositories (MSR). ACM, 384\u2013387.\n\n[21] Georgios Gousios, Andy Zaidman, Margaret-Anne Storey, and Arie Van Deursen. 2015. 
Work Practices and Challenges in Pull-Based Development: The Integrator\u2019s Perspective. In Proc. Int\u2019l Conf. Software Engineering (ICSE). Vol. 1. 358\u2013368.\n\n[22] Wenbin Ji, Thorsten Berger, Michal Antkiewicz, and Krzysztof Czarnecki. 2015. Maintaining Feature Traceability with Embedded Annotations. In Proc. Int\u2019l Software Product Line Conf. (SPLC). ACM, 61\u201370.\n\n[23] Jing Jiang, David Lo, Jiuhuan He, Xin Xia, Paveent Singh Kochhar, and Li Zhang. 2017. Why and how developers fork what from whom in GitHub. Empirical Software Engineering 22, 1 (2017), 547\u2013578.\n\n[24] Natalia Juristo and Omar S G\u00f3mez. 2010. Replication of software engineering experiments. In Empirical software engineering and verification. Springer, 60\u201388.\n\n[25] Eirini Kallianvakou, Georgios Gousios, Kelly Blinco, Leif Singer, Daniel M German, and Daniela Damian. 2016. An in-depth study of the promises and perils of mining GitHub. Empirical Software Engineering 21, 5 (2016), 2035\u20132071.\n\n[26] Andrew M St Laurent. 2004. Understanding Open Source and Free Software Licensing: Guide to Navigating Licensing Issues in Existing & New Software. O\u2019Reilly Media, Inc.\n\n[27] Yi Li, Chenguang Zhu, Julia Rubin, and Marsha Chechik. 2017. Semantic slicing of software version histories. IEEE Trans. Softw. Eng. (TSE) 44, 2 (2017), 182\u2013201.\n\n[28] Max Lillack, \u015etefan St\u0103nciulescu, Wilhelm Hedman, Thorsten Berger, and Andrzej W\u0105sowski. 2019. Intention-based Integration of Software Variants. In Proceedings of the 41st International Conference on Software Engineering (ICSE \u201919). IEEE Press. Piscataway, NJ, USA, 831\u2013842.\n\n[29] Leticia Montalvillo and Oscar D\u00edaz. 2015. Tuning GitHub for SPL development: branching models & repository operations for product engineers. In Proceedings of the 19th International Conference on Software Product Line. ACM, 111\u2013120.\n\n[30] Linus Nyman. 2014. Hackers on forking. In Proc. Int\u2019l Symposium on Open Collaboration (OpenSym). ACM, 6.\n\n[31] Linus Nyman and Tommi Mikkonen. 2011. To Fork or not to Fork: Fork Motivations in SourceForge Projects. In Proc. IEP\u201911 Int\u2019l Conf. on Open Source Systems. Springer, 259\u2013268.\n\n[32] Linus Nyman, Tommi Mikkonen, Juho Lindman, and Martin Foug\u00e8re. 2012. Perspectives on Code Forking and Sustainability in Open Source Software. Open Source Systems: Long-Term Sustainability (2012), 274\u2013279.\n\n[33] Rohan Padhye, Senthil Mani, and Vibha Singhal Sinha. 2014. NeedFeed: Taming Change Notifications by Modeling Code Relevance. In Proc. Int\u2019l Conf. Automated Software Engineering (ASE). ACM, 665\u2013676.\n\n[34] Ayushi Rastogi and Nachiappan Nagappan. 2016. Forking and the Sustainability of the Developer Community Participation\u2014An Empirical Investigation on Outcomes and Reasons. In Proc. Int\u2019l Conf. Software Analysis, Evolution, and Reengineering (SANER). Vol. 1. IEEE, 102\u2013111.\n\n[35] Baishakhi Ray, Miryung Kim, Suzette Person, and Neha Rungta. 2013. Detecting and characterizing semantic inconsistencies in ported code. In Proc. Int\u2019l Conf. Automated Software Engineering (ASE). IEEE, 367\u2013377.\n\n[36] Eric S Raymond. 2001. The Cathedral & the Bazaar: Musings on linux and open source by an accidental revolutionary. O\u2019Reilly Media, Inc.\n\n[37] Loyao Ren. 2019. Automated Patch Forging Across Forked Projects. In Proc. Int\u2019l Symposium Foundations of Software Engineering (FSE). 
ACM, New York, NY, USA, 1199\u20131201.\n\n[38] Loyao Ren, Shurui Zhou, Christian K\u00e4stner, and Andrzej W\u0105sowski. 2019. Identifying Redundancies in Fork-based Development. In Proc. Int\u2019l Conf. Software Analysis, Evolution, and Reengineering (SANER). IEEE, 230\u2013241.\n\n[39] Gregorio Robles and Jes\u00fas M. Gonz\u00e1lez-Barahona. 2012. A Comprehensive Study of Software Forks: Dates, Reasons and Outcomes. In Proc. IEP\u201912 Int\u2019l Conf. on Open Source Systems. 1\u201314.\n\n[40] Julia Rubin and Marsha Chechik. 2013. A framework for managing cloned product variants. In Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 1233\u20131236.\n\n[41] Johnny Saldana. 2015. The coding manual for qualitative researchers. Sage.\n\n[42] Anand Mani Sankar. 2015. Node.js vs io.js: Why the fork?!? Blog Post. http://anandmanisankar.com/posts/nodejs-iojs-why-the-fork/\n\n[43] Stefan Schmidt. 2009. Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology 13, 2 (2009), 90\u2013100.\n\n[44] Donna Spencer. 2009. Card sorting: Designing usable categories. Rosenfeld Media.\n\n[45] Stefan St\u0103nciulescu, Thorsten Berger, Eric Walkingshaw, and Andrzej W\u0105sowski. 2016. Concepts, operations, and feasibility of a projection-based variation control system. In Proc. Int\u2019l Conf. Software Maintenance and Evolution (ICSME). IEEE, 323\u2013333.\n\n[46] Igor Steinmacher, Gustavo Pinto, Igor Scaliente Wiese, and Marco Aur\u00e9lio Gerosa. 2018. Almost there: A study on quasi-contributors in open-source software projects. In Proc. Int\u2019l Conf. Software Engineering (ICSE). IEEE, 256\u2013266.\n\n[47] Robert Viseur. 2012. Forks impacts and motivations in free and open source projects. International Journal of Advanced Computer Science and Applications 3, 2 (2012), 117\u2013122.\n\n[48] Steve Weber. 2004. The success of open source. Harvard University Press.\n[49] David A. Wheeler. 2015. Why Open Source Software/Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers! Blog Post. https://dwheeler.com/ossfswhy.html\n\n[50] Owen Williams. 2015. Node.js and io.js are settling their differences, merging back together. Blog Post. https://thenextweb.com/dd/2015/06/16/node-js-and-io-js-are-settling-their-differences-merging-back-together/\n\n[51] Alex Williams and Joab Jackson. 2016. A Docker Fork: Talk of a Split Is Now on the Table. Blog Post. https://thenewstack.io/docker-fork-talk-split-now-table/\n\n[52] Shurui Zhou, \u015etefan St\u00e3nciulescu, Olaf Le\u00dfenich, Yingfei Xiong, Andrzej W\u0105sowski, and Christian K\u00e4stner. 2018. Identifying Features in Forks. In Proc. Int\u2019l Conf. Software Engineering (ICSE). ACM Press, 105\u2013116.\n\n[53] Shurui Zhou, Bogdan Vasilescu, and Christian K\u00e4stner. 2019. What the Fork: A Study of Inefficient and Efficient Forking Practices in Social Coding. In Proc. Europ. Software Engineering Conf./Foundations of Software Engineering (ESEC/FSE). 
ACM Press, New York, NY, 350–361.
---
I Depended on You and You Broke Me: An Empirical Study of Manifesting Breaking Changes in Client Packages

DANIEL VENTURINI, Federal University of Technology (UTFPR), Brazil
FILIPE ROSEIRO COGO, Huawei Technologies, Canada
IVANILTON POLATO, Federal University of Technology (UTFPR), Brazil
MARCO A. GEROSA, Northern Arizona University (NAU), United States
IGOR SCALIANTE WIESE, Federal University of Technology (UTFPR), Brazil

Complex software systems have a network of dependencies. Developers often configure package managers (e.g., npm) to automatically update dependencies with each publication of new releases containing bug fixes and new features. When a dependency release introduces backward-incompatible changes, commonly known as breaking changes, dependent packages may not build anymore. This may indirectly impact downstream packages, but the impact of breaking changes and how dependent packages recover from these breaking changes remain unclear. To close this gap, we investigated the manifestation of breaking changes in the npm ecosystem, focusing on cases where packages’ builds are impacted by breaking changes from their dependencies. We measured the extent to which breaking changes affect dependent packages. Our analyses show that around 12% of the dependent packages and 14% of their releases were impacted by a breaking change during updates of non-major releases of their dependencies. We observed that, from all of the manifesting breaking changes, 44% were introduced in both minor and patch releases, which in principle should be backward compatible. Clients recovered themselves from these breaking changes in half of the cases, most frequently by upgrading or downgrading the provider’s version without changing the versioning configuration in the package manager. We expect that these results help developers understand the potential impact of such changes and recover from them.

CCS Concepts: • Software and its engineering → Software evolution;

Additional Key Words and Phrases: Breaking changes, Semantic Version, npm, dependency management, change impact

ACM Reference format:
Daniel Venturini, Filipe Roseiro Cogo, Ivanilton Polato, Marco A. Gerosa, and Igor Scaliante Wiese. 2023. I Depended on You and You Broke Me: An Empirical Study of Manifesting Breaking Changes in Client Packages. ACM Trans. Softw. Eng. Methodol. 32, 4, Article 94 (May 2023), 26 pages. https://doi.org/10.1145/3576037

This work is partially supported by the National Science Foundation under Grant Number IIS-1815503, CNPq/MCTI/FNDCT (grant #408812/2021-4), and MCTIC/CGI/FAPESP (grant #2021/06662-1).

Authors’ addresses: D. Venturini, I. Polato, and I. S. Wiese, Federal University of Technology (UTFPR), Campo Mourão, Paraná, Brazil; emails: danielventurini@alunos.utfpr.edu.br, {ipolato,igor}@utfpr.edu.br; F. R. Cogo, Huawei Technologies, Kingston, Canada; email: filipe.cogo@gmail.com; M. A. Gerosa, Northern Arizona University (NAU), Flagstaff, AZ; email: Marco.Gerosa@nau.edu.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted.
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.\n\n\u00a9 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM. 1049-331X/2023/05-ART94 $15.00 https://doi.org/10.1145/3576037\n1 INTRODUCTION\n\nComplex software systems are commonly built upon dependency relationships in which a client package reuses the functionalities of provider packages, which in turn depend on other packages. To automate the process of installing, upgrading, configuring, and removing dependencies, package managers such as npm, Maven, pip, and Cargo are widely adopted. Despite the many benefits brought by the reuse of provider packages, one of the main risks client packages face is breaking changes [21]. Breaking changes are backward-incompatible changes performed by the provider package that renders the client package build defective (e.g., a change in a provider\u2019s API). When client packages configure package managers to automatically accept updates on a range of provider package versions, the breaking change will have the serious consequence of catching clients off guard. For example, in npm, where most of the packages follow the Semantic Versioning specification [23], clients adopt configurations that automatically update minor and patch releases of their providers. In principle, these release types should not contain any breaking changes, as the semantic version posits that only major updates should contain breaking changes. However, minor or patch releases occasionally introduce breaking changes and generate unexpected errors in the client packages when these breaking changes manifest on clients. Due to the transitive nature of the dependencies in package managers, unexpected breaking changes can potentially impact a large proportion of the dependency network, preventing several packages from performing a successful build.\n\nResearch has shown that providers occasionally incorrectly use the Semantic Versioning specification [15]. In the npm ecosystem, prior research has shown that provider packages indeed publish releases containing breaking changes [14, 15, 18, 19]. However, such studies provide limited information regarding the prevalence of these breaking changes, focusing on API breaking changes without clarifying how the client packages solve the problems they cause. In this article, we fill this gap by conducting an empirical study of npm projects hosted on GitHub, verifying the frequency and types of the breaking changes that manifest as defects in client packages and how clients recover from them. npm is the main package manager for the JavaScript programming language, with more than 1 million packages. An estimated 97% of web applications come from npm [1], making it the most extensive dependency network [9]. We employed mixed methods to identify and analyze the types of manifesting breaking changes\u2014changes in a provider release that render the client\u2019s build defective\u2014and how client packages deal with them in their projects. This article does not study cases in which a breaking change does not manifest itself in other projects. Our research answers the following questions:\n\nRQ1. To what extent do breaking changes manifest themselves in client packages?\nWe analyzed 384 packages selected using a random sampling approach (95% confidence level and \u00b15% confidence interval) to select client packages with at least one provider. 
We found that manifesting breaking changes impacted 11.7% of all client packages (regardless of their releases) and 13.9% of their releases. In addition, 2.6% of providers introduced manifesting breaking changes.\n\nRQ2. What changes in the provider packages manifest a breaking change?\nThe main causes of manifesting breaking changes were feature modifications, change propagation among dependencies, and data type modifications. We also verified that an equal proportion of manifesting breaking changes was introduced in minor and patch releases (approximately 44% in each release type). Providers fixed most of the manifesting breaking change cases introduced in minor and patch releases (46.4% and 61.5%, respectively). Finally, manifesting breaking changes were documented in issue reports, pull requests, or changelogs in 78.1% of cases.\nRQ3. How do client packages recover from manifesting breaking changes?\n\nClient packages recovered from manifesting breaking changes in 39.1% of the cases, and their recovery took about 134 days when providers did not fix the break or when clients recovered first. When providers released a fix to a manifesting breaking change, they took a median of 7 days. Upgrading the provider is the most frequent way client packages recover from a manifesting breaking change.\n\nThis article contributes to the literature by providing quantitative and qualitative empirical evidence about the phenomenon of manifesting breaking changes in the npm ecosystem. Our qualitative study may help developers understand the types of changes that manifest defects in client packages and which strategies are used to recover from breaking changes. We also provide several suggestions about how clients and providers can enhance the quality of their release processes. As an additional contribution, we created pull requests for real manifesting breaking change cases that had not yet been resolved, half of which were merged.\n\n2 DEFINITIONS, SCOPE, AND MOTIVATING EXAMPLES\n\nThis section defines terms used in this article and describes motivating examples for our research.\n\n2.1 Glossary Definitions\n\nIn the following, we describe the terms and definitions that we use in the article, based on related work [7, 11, 17].\n\n- **Provider package release** is the package release that provides features and resources for use by other package releases. In Figure 1, the package express is a provider of ember-cli, body-parser is a provider of express, and so on. We refer to a provider package $P$ as a transitive provider when we want to emphasize that $P$ has other provider packages. For instance, in Figure 1, body-parser is a provider of express; body-parser also has bytes as a provider. In this scenario, we consider body-parser to be a transitive provider.\n\n- **Client package release** is the package release that uses features and resources exposed by provider package releases. In Figure 1, express is a client of body-parser, body-parser is a client of bytes, and so on.\n\n- **Direct provider release** is the one directly used by its client, that is, the package that the client explicitly declares as a dependency. In Figure 1, express is a direct provider of ember-cli, and bytes is a direct provider of body-parser.\n\n- **Indirect provider release** is a package release that at least one of its providers uses. In other words, it is a provider of at least one of the direct client\u2019s providers. 
In Figure 1, both body-parser and bytes are indirect providers of ember-cli, and bytes is an indirect provider of express.\n\n- **Transitive provider release** is the package release between the one that introduced a breaking change and the client. For example, if a breaking change is introduced by bytes, in Figure 1, and affects client ember-cli, both packages express and body-parser are transitive providers. This is because the breaking change transited through these packages (body-parser and express) to arrive at client ember-cli. The transitive providers are all also impacted by the breaking change.\n\u2022 **Version statement:** A client can specify its provider\u2019s versions on `package.json`, a metadata file used by npm to specify providers and their versions, among other purposes. The version statement contains the accepted version of a provider. For example, the version statement in the following metadata `{\"dependencies\": {\"express\": \"^4.10.6\"}}` defines that the client requires express on version `^4.10.6`.\n\n\u2022 **Version range:** On the version statement a client can specify a range of versions/releases accepted by its provider. There are three types of ranges:\n - **All (>=, or *):** Using this range, the client specifies that all new provider releases are supported/accepted and downloadable, even the ones with breaking changes.\n - **Caret (^):** With this range, the client specifies that all new provider releases that contain new features and bug fixes are supported/accepted and downloadable; breaking changes must be avoided. This is the default range used by npm when a dependency is installed.\n - **Tilde range (~):** This range specifies that all new provider releases that only contain bug fixes are supported/accepted and downloadable; breaking changes and new features must be avoided.\n - **Steady range:** This range always resolves to a specific version and is also known as specific range. That is, the versioning statement has no range on it but rather a specific version. npm allows installation with a steady range using the command line option `--save-exact`.\n\n\u2022 **Implicit and explicit update:** An implicit update happens when the client receives a new provider version due to the range version in the `package.json`. For a version statement defined with a range of versions, for example, `^4.10.6`, an implicit update happens when npm installs a version 4.10.9 that matches the range. An explicit update takes place when the client manually updates the versioning statement directly in the `package.json`.\n\n\u2022 **Manifesting breaking changes** are provider changes that manifest as a fault on the client package, ultimately breaking the client\u2019s build. The adopted definition of breaking change by the prior literature [3\u20136, 8, 15, 19, 21] includes cases that are not considered breaking changes (e.g., a change in an API that is not effectively used by a client package). 
Conversely, manifesting breaking changes include cases that are not covered by the prior definitions of breaking change (e.g., because the provider package is used in a way that is not intended by the provider developer, a semantic-version-compliant change introduced by a new release of this provider causes an expected error in the client package).\n\n### 2.2 Motivating Examples\n\nWe found the following two examples of manifesting breaking changes in our manual analysis (on each of the following Listings, red lines have been removed from the source code, whereas blue lines have been inserted into the source code). Our manual analysis (Section 3.2.1) consists of executing the client tests suite for its releases and analyzing all executions that run into an error.\n\nThe client `assetgraph-builder@7.0.0` has a provider `assetgraph@6.0.0` that has a provider `terser@^4.0.0`, but, due to a range of versions, npm installed `terser@4.6.10`. Release 4.3.0 of terser introduces a change that, by default, enables the wrapping of functions on parsing, as shown in Listing 1.\n\n```javascript\n// terser@4.2.1 without default wrapping behavior\nfoo(function(){});\n\n// terser@4.3.0 default wrapping behavior\nfoo((function(){}));\n```\n\nListing 1. Diff between terser@4.2.1 and terser@4.3.0 default behavior.\n\n[1]https://github.com/terser/terser/compare/v4.2.1..v4.3.0.\nThis change breaks the assetgraph-builder@7.0.0\u2019s tests. Once this feature is turned into a default behavior, the client assetgraph-builder@8.0.0 adopts its test to make it compatible with the terser\u2019s behavior, as shown in Listing 2.\n\n```javascript\nexpect(\n javaScriptAssets[0].text,\n 'to match',\n - /SockJS=[\\s\\S]*define\\(\"main\",function\\(\\)\\{\\}\\);/\n + /SockJS=[\\s\\S]*define\\(\"main\",\\(?function\\(\\)\\{\\}\\) ?\\);/\n);\n```\n\nListing 2. Diff with assetgraph@8.0.0 client\u2019s tests adjusting to breaking change.\n\nSometimes, provider changes can break a client long after their introduction. This occurred in the client package ember-cli-chartjs@2.1.1. In Figure 2, the release 1.0.4 of ember-cli-qunit (left-tree) introduced a change that did not lead to a breaking change. However, almost 3 years later, ember-cli-qunit was used together with release 1.3.1 of the provider broccoli-plugin (middle-tree), and a breaking change manifested.\n\nIn November 2015, the provider ember-cli-qunit@1.0.4 fixed an error in its code, changing the returned object type of function lintTree, as shown in Listing 3. Despite being a type change, it did not break the client when it was released, and this fix was retained in further releases of ember-cli-qunit.\n\n```javascript\nlintTree: function(type, tree) {\n // Skip if useLintTree === false.\n if (this.options['ember-cli-qunit'] && ... ) {\n return tree;\n + // Fakes an empty broccoli tree\n + return { inputTree: tree, rebuild: function() { return []; } };}\n```\n\nListing 3. 
ember-cli-qunit@1.0.4 object type change.\n\nAlmost 3 years later, in August 2018, the provider broccoli-plugin@1.3.1 was released (middle-tree in Figure 2) to fix a bug, as in Listing 4.\n\n```javascript\nfunction isPossibleNode(node) {\n - return typeof node === 'string' ||\n - (node !== null && typeof node === 'object')\n + var type = typeof node;\n```\n\n---\n\n2 https://github.com/terser/terser/issues/496.\n3 https://github.com/assetgraph/assetgraph-builder/commit/e4140416e7feaa3d088cf3ad0229fd677ff36dbc.\n4 https://github.com/ember-cli/ember-cli-qunit/commit/6fdfe7d.\n5 https://github.com/broccolijs/broccoli-plugin/commit/3f9a42b.\nRelease 1.3.1 of the broccoli-plugin package experienced a manifesting breaking change due to a fix in the provider ember-cli-qunit@1.0.4, which was released almost 3 years prior. This manifesting breaking change occurred because the ember-cli-chartjs\u2019 dependency tree evolved over time due to the range versions, as shown in Figure 2, causing the break. When the package ember-cli-chartjs@2.1.1 was installed in April 2020 (the date of our analysis), the installation failed due to the integration of broccoli-plugin@1.3.1 changes into ember-cli-qunit. Fifteen days later, ember-cli-qunit@1.4.3 fixed the issue when the ember-cli-qunit\u2019s object type was changed again. During the 15-day period when the manifesting breaking change remained unresolved, broccoli-plugin received about 384k downloads from npm. This scenario shows that even popular and mature projects can be affected by breaking changes. Although we recognize that the download count does not necessarily reflect the popularity of a package, we use this metric as an illustrative example of how many client packages might have been impacted by a provider package.\n\n3 STUDY DESIGN\n\nThis section describes how we collected our data (Section 3.1) and the motivation and approach for each RQ (Section 3.2).\n\n3.1 Data Collection\n\n3.1.1 Obtaining Metadata from npm Packages. The first part of Figure 3 shows our approach for sampling the database. We initially gathered all the metadata files (i.e., package.json files) from the published packages in the npm registry between December 20, 2010, and April 01, 2020, accounting for 1,233,944 packages. This range refers to the oldest checkpoint that we could retrieve and the most recent one when we started this study. We ignored packages that did not have any providers in the package.json since they cannot be considered client packages and will therefore not suffer breaking changes. After filtering packages without a provider, our dataset comprises 987,595 package.json metadata files. For each release of each package, we recorded the timestamp of the release and the name of the providers with their respective versioning statements.\n\nWe parsed all the versioning statements and determined the resolved provider version at the time of each client release. Prior works have adopted similar approaches when studying dependency management [7, 29]. For each provider in each client release, we retrieved the most recent provider version that satisfied the range specified by the client in that release, i.e., the resolved version. Using this resolved version, we determined whether a provider changed its version between the two client releases. 
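To illustrate this resolution step, the following minimal Node.js sketch (using the npm `semver` package; the provider publish history shown is hypothetical) computes, for a given version statement and client release date, the provider version that npm would have resolved:

```javascript
// Illustrative only: resolving the provider version a client release would have
// received at its release time. Requires the npm 'semver' package.
const semver = require('semver');

// Hypothetical publish history of one provider: version -> publish date (ISO).
const providerReleases = {
  '4.10.6': '2015-01-10',
  '4.10.9': '2015-03-02',
  '4.11.0': '2015-05-20',
  '5.0.0': '2016-02-01',
};

function resolveAtTime(versionStatement, clientReleaseDate) {
  // Only versions already published at the client's release date are candidates.
  const available = Object.keys(providerReleases)
    .filter((v) => providerReleases[v] <= clientReleaseDate);
  // Most recent available version that satisfies the client's version statement.
  return semver.maxSatisfying(available, versionStatement);
}

console.log(resolveAtTime('^4.10.6', '2015-06-01')); // '4.11.0' (caret: minor and patch updates)
console.log(resolveAtTime('~4.10.6', '2015-06-01')); // '4.10.9' (tilde: patch updates only)
console.log(resolveAtTime('4.10.6', '2015-06-01'));  // '4.10.6' (steady: no implicit updates)
```

The example also mirrors the range types from Section 2.1: a caret range accepts new minor and patch releases, a tilde range accepts only new patch releases, and a steady version accepts no implicit updates.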
In other words, we reproduced the adopted versions of all providers by resolving the provider version at the release time of the client.\n\nTo further refine our sample, we analyzed two criteria in the associated package.json snapshot with the latest version of the client packages in our dataset:\n\n6https://github.com/broccolijs/broccoli-merge-trees/issues/65.\n7https://github.com/ember-cli/ember-cli-qunit/commit/59ca6ad.\n(1) The `package.json` snapshot should have a non-empty entry for the \u201cscript test\u201d field, and the entry should differ from the default: `Error: no test specified`. We specified this criterion in order to run the automated tests that were part of our method to detect manifesting breaking changes. In total, 488,805 packages remained after applying this criterion.\n\n(2) The `package.json` snapshot should have an entry containing the package\u2019s repository URL, as we wanted to retrieve information from the package codebase. After applying this criterion, 410,433 packages remained in our dataset.\n\n### 3.1.2 Running Clients\u2019 Tests.\n\nGiven the size of our dataset (more than 410,000 client packages), we ran tests on a random sample. At a 95% confidence level and \u00b15% confidence interval, we randomly selected 384 packages. Our sample has a median of 5.5 releases and 9 direct providers per package. We chose to study a random sample since our manual analysis is slow to run over a large dataset (Section 3.1.3); we spent a month executing our method in our sample. We did not ignore packages based on the number of releases or providers or any other metric. We performed a manual check on all selected packages that had fewer than four releases (130 out of 384) by checking their repositories and aiming to remove packages that are not real projects, lack tests, lack code, are example projects, and so forth. When we removed one package, we sampled another one following the two criteria described above.\n\nThe second part of Figure 3 depicts our approach to running the test scripts for each release of the 384 clients. For each client package, we cloned its repository\u2014all client repositories are hosted on GitHub\u2014and restored the work tree of all releases using their respective release tags (e.g., \u201cv1.0.0\u201d). For releases that are not tagged, we used their provided timestamp in the `package.json` metadata to restore the work tree (i.e., we matched the release timestamp and the closest existing commit in the master branch). We conducted an analysis and verified that tags and timestamps point to the same commit in 94% of releases with tags; thus, checkout based on timestamps is reliable for untagged releases.\n\nAfter restoring the work tree of a client release, we updated all versioning statements in the associated `package.json` entry with the specific resolved provider version (see Section 3.1.1). We then excluded a file called `package-lock.json`, which locks the providers\u2019 and indirect providers\u2019 versions. We also executed the associated tests on a release of the client package whenever a provider package changed in that release, as this can potentially introduce a manifesting breaking\nchange. A provider change can be (1) a provider added into the `package.json` or (2) the resolved version of a provider changed between the previous and current release of the client package.\n\nWe sought to reproduce the same build environment that existed when the provider changed. 
Therefore, before executing the tests of the client packages, we performed a best-effort procedure to identify the Node.js version that was adopted by the client package at the time the provider changed. This was necessary because a new major version of Node.js is released every 6 months.\(^8\) As we wanted to reproduce the test results with respect to the time when the client package published its release, we changed the Node.js version before executing the client package tests. We selected the Node.js version using two different approaches. Our preferred approach was to select the same Node.js version as the one specified in the `engines → node` field of the `package.json` file.\(^9\) This field allows developers to manually specify the Node.js version that runs the code associated with the build of a specific release. When this field was not set, we selected the latest Node.js version available\(^10\) at the time of the client package release. Therefore, we changed the Node.js version and then executed the install script and the release's tests using the `npm install` and `npm test` commands, respectively. If the install or test commands failed due to incompatibilities with the selected Node.js version or took more than 10 minutes, we changed to the previous major release of Node.js until the install and test commands succeeded. We used the **Node Version Manager (NVM)** tool to switch Node.js versions. Additionally, we also changed the npm version according to the Node.js version. npm is the package manager for Node.js packages and executes the `install` and `test` scripts. We performed the same procedure to select the npm version to use during the installation and test runs. Finally, we executed the install/test scripts and saved the results (success or error) for each client release.

After executing the install/test scripts of the 384 client packages in our sample, we discarded 33 packages because errors prevented the execution of the install/test script in any of their releases: 15 clients did not have one of the required files; 11 had invalid test scripts (e.g., `"test": "no test"`); 4 listed some required files in the `.gitignore` file, which specifies untracked files that git should ignore;\(^11\) 2 required specific database configurations that could not be set up; and 1 package required a key to access a server. We randomly replaced these 33 packages following the aforementioned criteria.

Table 1 shows the results of the execution of the install/test scripts of the 384 client packages and their 3,230 releases. Since the providers' versions associated with 2,727 releases did not change, the tests of those releases were not executed.
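A simplified sketch of this environment-selection loop is given below (assuming nvm is installed; the candidate version list, which in our procedure starts from the `engines → node` field or the latest Node.js version available at release time, is passed in by the caller).

```javascript
const { execSync } = require('child_process');

// Try the candidate Node.js major versions from newest to oldest until
// `npm install` and `npm test` both succeed within the 10-minute limit.
// Selecting a Node.js version through nvm also selects its bundled npm.
function runInstallAndTest(repoDir, candidateMajors /* e.g., [12, 10, 8] */) {
  for (const major of candidateMajors) {
    try {
      const run = (script) =>
        execSync(`bash -c '. "$NVM_DIR/nvm.sh" && nvm exec ${major} ${script}'`, {
          cwd: repoDir,
          timeout: 10 * 60 * 1000, // 10 minutes, as in our procedure
        });
      run('npm install');
      run('npm test');
      return { node: major, result: 'success' };
    } catch (err) {
      // Incompatibility with this Node.js version or a timeout:
      // fall back to the previous major release and try again.
    }
  }
  return { result: 'error' };
}
```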
Finally, we consider as possible manifesting breaking changes cases in which all client packages and releases failed the install/test scripts.\n\nA replication package including our client packages\u2019 sample, instruments, scripts, and identified manifesting breaking changes is available for download at [https://doi.org/10.5281/zenodo.5558085](https://doi.org/10.5281/zenodo.5558085).\n\n---\n\n\\(^8\\) [https://github.com/nodejs/node#release-types](https://github.com/nodejs/node#release-types).\n\n\\(^9\\) [https://docs.npmjs.com/files/package.json#engines](https://docs.npmjs.com/files/package.json#engines).\n\n\\(^10\\) [https://nodejs.org/en/download/releases](https://nodejs.org/en/download/releases).\n\n\\(^11\\) [https://git-scm.com/docs/gitignore](https://git-scm.com/docs/gitignore).\n3.1.3 Manual Check on Failure Cases: Detecting Manifesting Breaking Changes. For all failure cases (203 clients and 1,276 releases) on the execution of install/test scripts, we manually analyzed which ones were true cases of manifesting breaking changes. To identify breaking changes that manifest themselves in a client package, we leveraged the output logs (logs generated by npm when executing the install and test scripts) generated as the result of executing the method described in Section 3.1.2 (see the second part of Figure 3). For each failed test result, we obtained the error description and the associated stack trace. We then differentiated failed test results caused by a related issue with the client package (e.g., an introduced bug by the client) from those caused by a change in the provider package (e.g., a change in the return type of a provider\u2019s function). From the obtained stack traces, we determined whether any function of a provider package was called and manually investigated the positive cases. During our manual investigation, we sought to confirm that the test failure was caused by a manifesting breaking change introduced by the provider package.\n\nThe first author was responsible for running the tests and identifying the manifesting breaking changes and related releases and commits. The first author also manually analyzed each of the manifesting breaking changes and recorded the following information about each of them: the number of affected versions of the client, whether any documentation mentions the manifesting breaking change, the responsible package for addressing the breaking change (provider or client), the client version impacted by the manifesting breaking change, the provider version that introduced the breaking change, and a textual description about the causes for the breaking change manifestation (e.g., \u201cThe provider function was renamed by mistake,\u201d \u201cThe provider normalizeurl@1.0.0 introduce[d] a new function and the client assetgraph use[d] it. But the client forgot to update the provider version in package.json,\u201d \u201cThe provider inserts an \u2018in a null body request\u2019\u201d). During this process, several rounds of discussions were performed among the authors to refine the analysis, using continuous comparison [22] and negotiated agreement [13]. In the negotiated agreement process, the researchers discussed the rationale they used to categorize each code until reaching consensus [13]. 
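The stack-trace screening step described above can be illustrated with a small, hypothetical helper that checks whether any frame of a failing test points into the provider's code under `node_modules`.

```javascript
// Hypothetical helper (not our actual instrumentation): flag a failing test whose
// stack trace contains at least one frame inside the given provider's code.
function referencesProvider(stackTrace, providerName) {
  return stackTrace
    .split('\n')
    .some((frame) => frame.includes(`node_modules/${providerName}/`));
}

// Example: referencesProvider(err.stack, 'socket.io') returns true when a frame
// points into node_modules/socket.io/, marking the failure for manual inspection.
```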
More specifically, we leveraged the recorded information about each manifesting breaking change to derive a consistent categorization of the introduced breaking changes (RQ2 and RQ3) and to guide new iterations of the manual analysis.\n\nMore specifically, the following set of actions was performed during our manual investigation:\n\n- **Analyze the execution flow:** To determine whether the associated function with the test failure occurred in the provider or the client code, we leveraged the stack traces to identify which function was called when the test failed. In particular, we instrumented the code of the provider and the client packages to output any necessary information to analyze the execution flow. We analyzed the variable contents by adding a call to the `console.log()` and `console.trace()` functions in each part of the code where the client package calls a function of the provider. For example, suppose the following error appeared: \u201cTypeError: my-Object.callback is not a function.\u201d To discover the variable content, we use the command `console.log(myObject)` to check whether myObject variable was changed, was null, or received other values.\n\n- **Analyze the status of the Continuous Integration (CI) pipeline:** We compared the status of the CI pipeline between the originally built release and the status of the CI pipeline at the time of our manual investigation. Since the source code of the client package remains the same between the original release and the installed version in our analysis, we use the difference between the status of the CI pipeline as additional evidence that the test failure was caused by a provider version change. Not all clients had CI pipelines, but when they did, it was helpful.\n\u2022 **Search for client fixing commits:** We manually searched for recovering commits in the history of commits between the installed and previous releases of the client package. Whenever a recovery commit was identified (by reading the commit message), we determined whether the error was due to the client or the provider code. For example, we observed cases in which a client updated a provider in the release with failed tests. We also observed that, in the following commits, the provider was downgraded and the commit message was \u201cdowngrade provider\u201d or \u201cfix breaking change.\u201d In these cases, we considered the test failure as caused by a manifesting breaking change.\n\n\u2022 **Search for related issue reports and pull requests:** We hypothesized that a manifesting breaking change would affect different clients that, in turn, would either issue a bug report or perform a fix followed by a pull request to the codebase of the provider package. Therefore, we searched for issue reports and pull requests with the same error message obtained in our stack trace. We then collected detailed information about the error to confirm whether it was due to a manifesting breaking change introduced by the provider package.\n\n\u2022 **Previous and subsequent provider versions:** If the test error was caused by a manifesting breaking change, downgrading to the previous provider version or upgrading to a subsequent provider version might fix the error, if the provider already fixed it. *Subsequent provider versions* means all provider versions that fit the versioning statement and are greater than the provider version that introduced the manifesting breaking change (i.e., the adopted provider version when the test failed). 
In this case, we uninstalled the current version and installed the previous and subsequent versions and executed the test scripts again. For example, if the client specified a provider `p` as `{\"p\": \"^1.0.2\"}` that brought about a breaking change in the version, for example, `1.0.4`, we installed `p@1.0.2`, `p@1.0.3`, and `p@1.0.5` to verify whether the error persisted for those versions.\n\n### 3.2 Research Questions: Motivation, Approach\n\nThis section contains the motivation and the approach for each of the research questions.\n\n#### 3.2.1 RQ1. To What Extent Do Manifesting Breaking Changes Manifest in Client Packages?\n\n**Motivation:** By default, npm sets the caret range as a default versioning statement that automatically updates minor and patch releases. Hence, manifesting breaking changes that are introduced in minor and patch releases can inadvertently cause downtime in packages that are downloaded hundreds of thousands of times per day, affecting a large body of software developers. Understanding the prevalence of manifesting breaking changes in popular software ecosystems such as npm is important to help developers assess the risks of accepting automatic minor and patch updates. Although prior studies have focused on the frequency of API breaking changes [3], breaking changes can occur for different reasons. Determining the prevalence of a broader range of breaking change types remains an open research problem.\n\n**Approach:** For all cases that resulted in an error on the install/test script, we determined the type of error (client, provider, not discovered). We calculated, out of the 384 packages and 3,230 releases, the percentage of cases that we confirmed as manifesting breaking change. Considering all the providers on the client\u2019s latest releases, we calculated the percentage of providers that introduced manifesting breaking changes. In addition, we calculated how many times (number of releases) each provider introduced at least one manifesting breaking change.\n3.2.2 RQ2. What Problems in the Provider Package Cause a Manifesting Breaking Change?\n\n**Motivation:** Prior studies about breaking changes in the npm ecosystem are restricted to APIs\u2019 breaking changes [14]. However, other issues that provider packages can introduce in minor and patch releases can manifest a breaking change. To support developers to reason about manifesting breaking changes, it is important to understand their root causes.\n\n**Approach:** In this RQ, we analyzed the type of changes introduced by provider packages that bring about a manifesting breaking change. With the name and version of the provider packages, we manually analyzed the provider\u2019s repository to find the exact change that caused a break. We used the following approaches to find the specific changes introduced by providers:\n\n- **Using diff tools:** We used diff tools to analyze the introduced change between two releases of a provider. For example, suppose that a manifesting breaking change was introduced in the release `provider@1.2.5`. In this case, we retrieved the source code of previous versions, e.g., `provider@1.2.4`, and performed the diff between these versions to manually inspect the changed code.\n\n- **Analyzing provider\u2019s commits:** We used the provider\u2019s commits to analyze the changes between releases. 
For a manifesting breaking change in the provider `p`, we verified its repository and manually analyzed the commits ahead or behind the release tag commit that introduced a manifesting breaking change.\n\n- **Analyzing changelogs:** Changelogs contain information on all relevant changes in the history of a package. We used these changelogs to understand the introduced changes in a release of a client package and to verify whether any manifesting breaking change fix was described.\n\nWe also looked at issue reports and pull requests for explanations of the causes of manifesting breaking changes. After discovering the provider changes that introduced breaking changes, we analyzed, categorized, and grouped common issues. For example, all related issues to changing object types were grouped into a category called *Object type changed*. Furthermore, we analyzed the Semantic Version level that introduced and fixed/recovered the manifesting breaking changes in both the provider and client packages to verify the relationship between manifesting breaking changes and non-major releases.\n\nWe analyzed the version numbering of releases that fixed a manifesting breaking change and where manifesting breaking changes were documented (changelogs, issue reports, etc.). Furthermore, we analyzed the depth of the dependency tree of the provider that introduced a manifesting breaking change, since 25% of npm packages had at least 95 transitive dependencies in 2016 [10].\n\n3.2.3 RQ3. How Do Client Packages Recover from a Manifesting Breaking Change?\n\n**Motivation:** A breaking change may impact the client package through an *implicit* or *explicit* update. A client recovery is identified by an update to its code, by waiting for a new provider\u2019s release, or by performing a downgrade/upgrade in the provider\u2019s version. Breaking changes may be caused by either a *direct* or *indirect* provider since the client packages depend on a few direct providers and many indirect ones [11]. A breaking change may cascade to transitive dependencies if it remains unfixed. Even if the client packages can recover from the breaking change by upgrading to a newer version of the provider package, the client packages can manually resolve incompatibilities that might exist [12]. Understanding how breaking changes manifest in client packages can help developers understand how to recover from them.\n**Approach:** We retrieved all information for this RQ from the clients\u2019 repositories. We searched for information about the error and how the client packages recovered from the manifesting breaking change. The following information was analyzed:\n\n- **Commits:** We manually checked the subsequent commits of the client packages that were pushed to their repositories after the provider release that introduced the respective manifesting breaking change. In particular, we searched for commits that touched the `package.json` file. In the file history, we checked if the provider was downgraded, upgraded, replaced, or removed.\n\n- **Changelogs:** We analyzed the client changelogs and release notes looking for mentions of provider updates/downgrades. About 48% of clients maintained a changelog or release notes in their repositories.\n\n- **Pull requests/issue reports:** We searched for pull requests and issue reports in the client repository that contained information about the manifesting breaking changes. 
For example, we found pull requests and issue reports with \u201cUpdate provider\u201d and \u201cFix provider error\u201d in the title.\n\nFor each manifesting breaking change case, we recovered the provider\u2019s dependency tree. For example, in our second motivating example (Section 2), we recovered the dependency tree from the client to the package that introduced the manifesting breaking change, which resulted in `broccoli-asset-rev`\u2192`broccoli-filter`\u2192`broccoli-plugin` (Figure 2). We investigated how many breaking change cases were introduced by direct and indirect providers, when the manifesting breaking change was introduced and fixed/recovered, which package fixed/recovered from it, and how it was fixed/recovered. We also verified how client packages changed the provider\u2019s versions and how the associated documentation with manifesting breaking changes related to the time to fix it.\n\n### 3.3 Scope and Limitations\n\nAs our definition of manifesting breaking changes includes cases that are not included by the prior definitions of breaking changes (see Section 2.1), this article does not intend to provide a direct comparison between these two phenomena. As a result, the stated research questions do not indicate the proportion of manifest breaking changes that are, in fact, breaking changes as defined by prior literature (e.g., an API change by the provider). In addition, since provider packages are rarely accompanied by any formal specification of their intended behavior, it is impossible at the scale of our study to differentiate errors that manifest in the client package due to breaking changes from those that manifest due to an idiosyncratic usage of the provider by the client package. Therefore, the results of the stated RQs cannot be used to assess whether a client package could fix its build by simply updating to a newer version of the provider.\n\n### 4 RESULTS\n\nThis section presents the associated findings for each RQ.\n\n#### 4.1 RQ1. How Often Do Manifesting Breaking Changes Occur in the Client Package?\n\n**Finding 1:** 11.7% of the client packages (regardless of their releases) and 13.9% of the client releases were impacted by a manifesting breaking change. From all 384 client packages, 45 (11.7%) suffered a failing test from a manifesting breaking change in at least one release. From 3,230 client releases for which the tests were executed, 1,276 failed, and all errors were manually analyzed. In 450 (13.9%) releases, the error was raised by the provider packages, characterizing a manifesting breaking change. In 86 (2.7%) releases, we could not identify which package raised the error.\nTable 2. Results of Releases\u2019 Analyses\n\n| Results | Releases (#) | (%) |\n|-------------------------------|--------------|------|\n| Success | 1,954 | 60.5 |\n| Fail | | |\n| Client\u2019s errors | 479 | 14.8 |\n| manifesting breaking changes | 450 | 13.9 |\n| Breaking due to external changes | 261 | 8.1 |\n| Errors not identified | 86 | 2.7 |\n| Total | 3,230 | 100 |\n\nWe detected that 261 (8.1%) releases suffered a particular error type that we call *breaking due to external change*. These releases used a provider that relied on data/resources from an external API/service (e.g., Twitter) that were no longer available, impacting all clients\u2019 releases. The provider cannot fix this error, because it does not own the resource. 
These cases imply that detecting manifesting breaking changes by running the clients' tests can introduce false positives, which we simply ignored during our manual analyses. We also considered cases in which a provider package was removed from npm as *breaking due to external change*. Table 2 shows the results of the analyses by releases.

**Finding 2:** 92.2% of providers introduced a single manifesting breaking change. In our sample, 47 (92.2%) of 51 providers introduced a single release with a manifesting breaking change, and 4 providers introduced two releases with manifesting breaking changes. We detected 55 unique manifesting breaking change cases introduced by providers, some of which impacted multiple clients. For example, the breaking change exhibited in the *Incompatible Providers' Versions* classification (Finding 3) impacted six clients. Therefore, 64 manifesting breaking change cases manifested in the client packages. Finally, there were 1,909 providers on all clients' latest versions, and the percentage of providers that introduced a manifesting breaking change was 2.6% (51 of 1,909).

- About 11.7% of clients and 13.9% of their releases suffered from manifesting breaking changes.
- We detected failing tests due to about 2% of the providers with changes.
- Over 90% of the providers that introduced manifesting breaking changes did so in just a single release.

4.2 RQ2. What Issues in the Provider Package Caused a Breaking Change to Manifest?

**Finding 3:** We found eight categories of issues. We grouped each manifesting breaking change into eight categories, depending on its root cause (issue). Table 3 presents each category, the number of occurrences, and the number of impacted client releases.

In the following, we describe each category and present an example that we found during our manual analysis.

- **Feature change:** Manifesting breaking changes in this category are related to modifications of provider features (e.g., the default value of variables). An example happens in request@2.17.0—this version was removed from npm, but the introduced change remained in the package—when developers introduced a new decision rule into their code\(^\text{12}\) as shown in Listing 5.

\(^{12}\)https://github.com/request/request/commit/d05b6ba.
Table 3. The Identified Categories of Manifesting Breaking Changes

| Category | Cases (#) | Cases (%) | Releases (#) | Releases (%) |
|----------------------------------|-----------|-----------|--------------|--------------|
| Feature change | 25 | 39.1 | 101 | 22.4 |
| Incompatible providers' versions | 15 | 23.4 | 64 | 14.2 |
| Object type changed | 9 | 14.1 | 213 | 47.3 |
| Undefined object | 5 | 7.8 | 28 | 6.2 |
| Semantically wrong code | 5 | 7.8 | 14 | 3.1 |
| Failed provider update | 2 | 3.1 | 24 | 5.3 |
| Renamed function | 2 | 3.1 | 2 | 0.4 |
| File not found | 1 | 1.6 | 4 | 0.9 |
| **Total** | 64 | | 450 | |

Listing 5. Example of a manifesting breaking change categorized as feature change.

```javascript
debug('emitting complete', self.uri.href)
+ if (response.body == undefined && !self._json) {
+   response.body = "";
+ }
self.emit('complete', response, response.body)
```

In Listing 5, the provider request assigns an empty string to the `response.body` variable instead of preserving `response.body` with its default `undefined` value.

- **Incompatible providers' versions:** In this category, the client breaks because of a change in an indirect provider.
An example happens in the packages `babel-eslint` and `escope`, where `escope` is an *indirect* provider of `babel-eslint`.

```javascript
- },
- visitClass: {
+ }, {
+   key: 'visitClass',
    value: function visitClass(node) {
```

Listing 6. Incompatible providers' versions example.

The release `escope@3.4` introduced the change presented in Listing 6. This change impacted the package `babel-eslint`, even though `escope` was not a direct provider of `babel-eslint`. This manifesting breaking change remained unresolved for a single day, during which `babel-eslint` received about 80k downloads from npm.

- **Object type changed:** We detected nine (14.1%) cases in which the provider changed the type of an object, resulting in a breaking change in the client packages.

```javascript
this.setup();
- this.sockets = [];
+ this.sockets = {};
this.nsps = {};
this.connectBuffer = [];
}
var socket = nsp.add(this, function() {
- self.sockets.push(socket);
+ self.sockets[socket.id] = socket;
self.nsps[nsp.name] = socket;
```

Listing 7. Object type changed example.

\textsuperscript{13}https://github.com/babel/babel-eslint/issues/243.
\textsuperscript{14}https://github.com/estools/escope/issues/99#issuecomment-178151491.
In Listing 7, the provider socket.io@1.4.0 turned an array into an object.\textsuperscript{15} This simple change broke many of socket.io's clients, even the package karma,\textsuperscript{16} a browser test runner, which was forced to update its code\textsuperscript{17} and publish karma@0.13.19. During the single day that the manifesting breaking change remained unresolved, karma was downloaded about 146k times from npm.

- **Undefined object:** In this category, an undefined object causes a runtime exception that breaks the provider, which propagates the exception to the client package.

```javascript
+ app.options = app.options || {};
  app.options.babel = app.options.babel || {};
  app.options.babel.plugins = app.options.babel.plugins || [];
```

Listing 8. Undefined object code example.

This error happened in the provider ember-cli-htmlbars-inline-precompile@0.1.3, which solved it as shown in Listing 8.\textsuperscript{18}

- **Failed provider update:** In this category, provider A updates its provider B, but provider A does not update its code to work with the new provider B. We detected two cases of this category. In addition to an explicit update, one provider A from this category specified its provider B with an accept-all range (`>=`). Over time, its provider B published a major release that introduced a manifesting breaking change. Despite provider A specifying an accept-all range, it did not account for the implicit update of provider B, and the client suffered an error.

- **Semantically wrong code:** Manifesting breaking changes in this category happen when the provider writes semantically wrong code, generating an error in its runtime process\textsuperscript{19} and affecting the client. These errors could be caught at compile time in a compiled language, but in JavaScript they happen at runtime. This occurred in the provider front-matter@0.2.0 and in four other cases.

```javascript
const separators = ['---', '= yaml =']
- const pattern = pattern = '^('
+ const pattern = '^('
+   '((= yaml =)|(---))'
```

Listing 9. 
Semantically wrong code example.

In Listing 9, the provider repeated the variable name (pattern) in its declaration, which generated a semantic error. Although this error can be easily detected and fixed, as the provider eventually did\textsuperscript{20} in Listing 9, the provider took almost 1 year to fix it (front-matter@0.2.2). Meanwhile, front-matter received about 366 downloads in that period.

- **Renamed function:** The manifesting breaking changes in this category occur when functions are renamed. Our analysis revealed two cases in which functions were renamed. The first renaming case is our first motivating example (Section 2); we describe the second one below.

```javascript
- RedisClient.prototype.send_command = function (command, args, callback) {
-   var args_copy, arg, prefix_keys;
+ RedisClient.prototype.internal_send_command = function (command, args, callback) {
+   var arg, prefix_keys;
```

Listing 10. Renamed function code example.

\textsuperscript{15}https://github.com/socketio/socket.io/commit/b73d9be.
\textsuperscript{16}https://github.com/socketio/socket.io/issues/2368.
\textsuperscript{17}https://github.com/karma-runner/karma/commit/3ab78d6.
\textsuperscript{18}https://github.com/ember-cli/ember-cli-htmlbars-inline-precompile/pull/5/commits/b3faf95.
\textsuperscript{19}https://hacks.mozilla.org/2017/02/a-crash-course-in-just-in-time-jit-compilers/.
\textsuperscript{20}https://github.com/jxson/front-matter/commit/f16fc01.
Table 4. Manifesting Breaking Changes in Each Semantic Version Level

| Levels | (#) | (%) |
|-------------|-----|-------|
| Major | 3 | 4.7 |
| Minor | 28 | 43.75 |
| Patch | 28 | 43.75 |
| Pre-release | 5 | 7.8 |
| Total | 64 | 100 |

The provider `redis@2.6.0-1` renamed a function, as in Listing 10.\(^{21}\) However, this function was used in a client package `fakeredis`,\(^{22}\) which broke with this change. Client package `fakeredis@1.0.3` recovered from this error by downgrading to `redis@2.6.0-0`.\(^{23}\) In the 5-day period within which the manifesting breaking change was not fixed, `fakeredis` received about 2.3k downloads from npm.

- **File not found:** In the cases in this category, the provider removes a file or adds it to the version control ignore list (`.gitignore`) and the client tries to access it. In the unique case of this category in our sample, the provider referenced a file that was added to the ignore list.

**Finding 4:** Manifesting breaking changes are often introduced in patch releases. As shown in Table 4, of the 64 cases of manifesting breaking changes we analyzed, 3 cases were introduced in major releases, 28 in minor releases, 28 in patch releases, and 5 in pre-releases. Although we only analyzed manifesting breaking changes from minor and patch releases, in three cases the manifesting breaking changes were introduced at major levels in an indirect provider, which transitively affected client packages—as in the `jsdom@16` case (see Section 2).

Pre-releases precede a stable release and are considered unstable; anything may change until a stable version is released.\(^{24}\) In all detected breaking changes in pre-releases, the providers introduced unstable changes in pre-releases and propagated these changes to stable versions.
An example is the pre-release `redis@2.6.0-1` (described in Section 3.2.2), whose rename of a function propagated to the stable version and caused a failure in the client packages.\n\n**Finding 5:** Manifesting breaking change fixes/recoveries are introduced by both clients and/or providers. We searched to identify which package fixed/recovered from the manifesting breaking changes\u2014client or provider\u2014and at which level the fixed/recovered release was published, as depicted in Figure 4.\n\nFigure 4 shows that client packages recover from nearly half of the manifesting breaking changes introduced in minor updates. In turn, 76.9% of the manifesting breaking changes that are introduced by providers in a minor release are fixed in a patch release. Providers fix the majority of the manifesting breaking changes introduced in patch releases (46.4% of the time), typically through a patch release (61.5%).\n\n**Finding 6:** 21.9% of the manifesting breaking changes are not documented. Although clients and providers often document the occurrence or repair of a manifesting breaking change in issue reports, pull requests, or changelogs, more than one-fifth of the manifesting breaking changes are undocumented.\n\n\\(^{21}\\)https://github.com/NodeRedis/node-redis/commit/861749f.\n\n\\(^{22}\\)https://github.com/NodeRedis/node-redis/issues/1030#issuecomment-205379483.\n\n\\(^{23}\\)https://github.com/hdachev/fakeredis/commit/01d1e99.\n\n\\(^{24}\\)https://semver.org/#spec-item-9.\nTable 5 shows that client and provider packages documented manifesting breaking changes in 78.1% of all manifesting breaking changes. Out of all cases that have documentation, 70% have more than one type of documentation. For example, the provider received an issue report, fixed the manifesting breaking change, and documented it in a changelog. Documenting manifesting breaking changes and their fixes supports client recovery (Section 3.2.3).\n\n**Finding 7:** 57.8% of the manifesting breaking changes are introduced by an indirect provider. Indirect providers might also introduce manifesting breaking changes, which can then propagate to the client. Table 6 shows the depth level in the dependency tree of each provider that introduced a manifesting breaking change. About 42.2% of manifesting breaking changes are introduced by a direct provider in the client\u2019s `package.json`. These providers are the ones the client directly installs and that perform function calls in their own code; they are in the first depth level of the dependency tree.\n\nManifesting breaking changes introduced by indirect providers in the depth level greater than 1 represent 57.8% of the cases. Six cases are in the third depth level and a single one is in the fourth depth level. Clients do not install these providers directly; rather, they come from the direct provider. In these cases, the manifesting breaking change may be totally unclear to client packages, since they are typically unaware of such providers (or have no direct control over their installation).\nTable 7. 
Packages Fixing/Recovering from the Error\n\n| Fixed by/Recovered from | (#) | (%) |\n|-------------------------|-----|-----|\n| Provider | 32 | 50 |\n| Client | 13 | 20.3|\n| Transitive provider | 12 | 18.8|\n| Client + Transitive provider | 25 | 39.1|\n| Not fixed/recovered | 7 | 10.9|\n| Total | 64 | 100 |\n\n- The most frequent issues with provider packages that introduced manifesting breaking changes were feature changes, incompatible providers, and object type changes.\n- Provider packages introduced these manifesting breaking changes at similar rates in minor and patch releases.\n- Most of the fixed manifesting breaking changes by providers were fixed in patch releases.\n- Manifesting breaking changes are documented in 78.1% of the cases, mainly on issue reports.\n- Indirect providers introduced manifesting breaking changes in most cases.\n\n4.3 RQ3. How Do Client Packages Recover from a Manifesting Breaking Change?\n\nFinding 8: Clients and transitive providers recover from breaking changes in 39.1% of cases. In the dependency tree, the transitive provider is located between the provider that introduced the manifesting breaking change and the client where it manifested (see Section 2.1). Table 7 shows which package fixed/recovered from each manifesting breaking change case. The provider packages fixed the majority of the manifesting breaking changes. Since they introduced the breaking change, theoretically this was the expected behavior. Client packages recovered from the manifesting breaking change in 20.3% of cases, and transitive providers recovered from manifesting breaking changes in 18.8% of cases. When the provider who introduced a manifesting breaking change does not fix it, the transitive provider may fix it and solve the client\u2019s issue.\n\nSince transitive providers are also clients of the providers that introduced the manifesting breaking change, clients (clients and transitive providers) recovered from these breaking changes in 39.1% of cases. This observation suggests that client packages occasionally have to work on a patch when a manifesting breaking change is introduced since in 39.1% of the cases clients and transitive providers need to take actions to recover from the manifesting breaking change.\n\nFinding 9: Transitive providers fix manifesting breaking changes faster than other packages: When a manifesting breaking change is introduced, it should be fixed by either the provider who introduced it or a transitive provider. In a few cases, the client package will also recover from it. Table 8 shows the time that each package takes to fix the breaking change. In general, manifesting breaking changes are fixed in 7 days by provider packages. Even in this relatively short period of time, many direct and indirect clients are affected.\n\nTransitive providers fix manifesting breaking changes faster than clients and even providers. Since the manifesting breaking change only exists when it is raised in the client packages, transitive providers break first and need a quick fix; transitive providers usually spent 4 days to fix a break. Meanwhile, providers that introduced the manifesting breaking change take a median of 7 days to introduce a fix. In cases where the provider neglected to introduce a fix or took longer than the client, client packages took a comparably lengthy 134 days (mean 286; SD 429) to recover from a\nmanifesting breaking change. 
According to Table 7, the direct providers and transitive providers fixed most of the manifesting breaking changes, about 68.8%, because clients can be slow to recover.

However, because transitive providers are also clients, we can analyze the time that clients and transitive providers spend to fix/recover from a manifesting breaking change. Clients and transitive providers recovered from a manifesting breaking change in around 82 days.

**Finding 10:** Upgrading is the most frequent way to recover from a manifesting breaking change. Table 9 describes how clients recovered from breaking changes. In 48 cases, the provider version was changed. In most cases (71.4%), client packages upgraded their providers' version. We analyzed all cases where clients and transitive providers recovered from the manifesting breaking change by changing the provider's version before the provider fixed the error. We observed an upgrade in 12 (52.2%) of 23 such cases. Thus, in more than half of the cases where the client and transitive providers fixed/recovered from the manifesting breaking change, the provider package had newer versions available, but the client was not yet using any follow-up release from the provider.

The number of downgrades in transitive providers may explain why they recover from manifesting breaking changes faster than the client packages. Since transitive providers are also providers, they should fix the manifesting breaking change as soon as possible to avoid propagating the error it causes. Consequently, downgrading to a stable release of the provider is the most frequent way for transitive providers to recover from a manifesting breaking change. Finally, the provider is replaced or removed in a small proportion of cases when a breaking change is raised (about 7.2% for both cases combined).

**Finding 11:** To recover from manifesting breaking changes, clients often change the adopted provider version without changing the range of automatically accepted versions. When a breaking change manifests itself, clients often update the provider's version. Figure 5 shows when the clients and transitive providers updated their providers' versions.

We verified that transitive providers never set a steady version of their provider: when a breaking change manifests in a transitive provider, it is using a range in the provider's version. However, a single transitive provider changed the range from a caret range to a steady one (e.g., ^1.2.1 → 1.2.1) to recover from the manifesting breaking change. Nevertheless, when the clients used a caret range and a breaking change manifested, in 38.5% of the cases they downgraded the provider to a steady version.

The majority of the manifesting breaking changes were introduced when the clients and transitive providers used the caret range (`^`). It is the default range statement that npm inserts into the package.json when a provider is added as a dependency of a client package. In more than half of the cases, these clients changed the provider's version to another caret range. The accept-all ranges (`>=` or `*`) were less commonly used initially and even less common when updating.

In 60.5% of cases, clients and transitive providers retained the range type and updated it. That is, the range type (all, caret, tilde, or steady) was kept, but the provider was updated/downgraded.
For example, a client package specifies a provider p@^1.2.0 and receives a breaking change in p@1.3.2. Whenever the provider fixes the code, the client package will update it to, for example, p@^1.4.0, but will not change it to another range type, such as an all, tilde, or steady range.

- Client packages (including transitive providers) recovered from manifesting breaking changes in 39.1% of cases.
- Providers fixed manifesting breaking changes faster than client packages recovered from them; when clients did recover, they usually did so by changing the adopted provider version, preferring upgrades over downgrades.
- The adopted provider version can be upgraded or downgraded after a breaking change, but in around 60% of cases clients did not change the range type.

5 DISCUSSION

This section discusses the implications of our findings for dependency management practices (Section 5.1) and the best practices that clients and providers can follow to mitigate the impact caused by manifesting breaking changes (Section 5.2). We also discuss the manifestation of breaking changes and the aspects of Semantic Versioning in the npm ecosystem (Section 5.3).

5.1 Dependency Management

When managing dependencies, client packages can use dependency bots on GitHub, such as Snyk and Dependabot, to receive automatic pull requests when there is a new provider release [27]. These bots continuously check for new versions and for fixes to providers' bugs/vulnerabilities. They open pull requests in the client's repository that update the package.json and include changelogs and information about the provider's new version. Mirhosseini and Parnin [16] show that packages using such bots update their dependencies 1.6x faster than through manual verification.

Additionally, tools such as JSFIX [20] can be helpful when upgrading provider releases, especially those that include manifesting breaking changes or major releases. The JSFIX tool was designed to adapt the client code to the new provider release, offering a safe way to upgrade providers.

We verified that a small percentage of the clients recovered from manifesting breaking changes by removing or replacing the provider (cf. Finding 10), which may be difficult when several features or resources from the provider package are used by the client [2]. Instead, client packages tend to temporarily downgrade to a stable provider version. To ease the process of upgrading/downgrading providers and to avoid surprises, clients should search the provider changelogs for significant changes. As we verified in Finding 6, most manifesting breaking changes are documented in changelogs, issue reports, or pull requests. Dependency bots could also analyze the content of changelogs and issue reports to raise red flags, such as notifications, about documentation that mentions a manifesting breaking change.

Finally, client packages may use a `package-lock.json` file to better manage dependencies. We observed in Finding 7 that indirect providers (those at depth 2 or deeper in the dependency tree) are responsible for 57.8% of the manifesting breaking changes that affect a client package. Using a `package-lock.json` file, client packages can stay aware of all of the providers' versions of the latest successful build.
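To make this concrete, the following hypothetical excerpts contrast what a client declares with what a lock file pins (package names and versions are made up).

```javascript
// With only the range below, `npm install` may silently adopt a newer provider
// release; with the lock file, `npm ci` reinstalls the exact direct and indirect
// provider versions of the last known-good build.
const packageJson = {
  dependencies: {
    "some-provider": "^1.2.0" // accepts any 1.x minor/patch update
  }
};

const packageLock = {
  lockfileVersion: 2,
  packages: {
    "node_modules/some-provider": { version: "1.2.3" },
    // Indirect providers are pinned as well, which matters because most
    // manifesting breaking changes came from them (Finding 7).
    "node_modules/some-indirect-provider": { version: "4.0.1" }
  }
};
```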
When a provider is upgraded due to the range of versions and the new release manifests a breaking change on the client side, the client can still install all of the providers\u2019 versions that successfully built the client.\n\n### 5.2 Best Practices\n\nSeveral issues found in our manual classification of manifesting breaking changes (Section 3.2.2) could be avoided through the use of static analysis tools. Errors classified as *Semantically Wrong Code* and *Rename function* are typically captured by such tools. Both client and provider developers can use such tools. For a dynamic language such as JavaScript, these tools can help avoid some issues [26]. Options for JavaScript include `jshint`, `jslint`, and `standard`. T\u00f3masd\u00f3ttir et al. [26] and T\u00f3masd\u00f3ttir et al. [25] show that developers use linters mainly to prevent errors, bugs, and mistakes.\n\nDue to the dynamic nature of JavaScript, however, static analysis tools cannot verify inherited objects\u2019 properties. They do not capture errors classified as *Change one rule*, *Object type change*, and *Undefined object*, as well as *Rename Function* in functions of objects\u2019 properties. Thus, developers should be concerned about creating test cases that run their code along with the functionality of providers, as only then will they (client developers) find breaking changes that affect their own code. Many available frameworks, such as `mocha`, `chai`, and `ava`, support these tasks. These tests should also be executed on integrated environments every time the developer commits and pushes new changes. For this case, several tools are available, such as `Travis`, `Jenkins`, `Drone CI`, and `Codefresh`. Using linters and continuous integration systems, developers can catch most of these errors before releasing a new version.\n\nFinally, a good practice for npm packages is to keep a changelog or to document breaking changes and their fixes in issue reports and pull requests. This practice should continue and be more widely adopted, since currently around a fifth of providers do not do it (c.f., Finding 6). This would also help the development of automated tools (e.g., bots) for dealing with breaking changes. Providers could create issue reports and pull request templates to allow clients to specify consistent descriptions of issues they found.\n\n### 5.3 Breaking Changes Manifestation and Semantic Versioning\n\nBreaking changes often occur in the npm ecosystem and impact client packages (c.f., Finding 1). Most of the manifesting cases come from indirect providers, that is, providers from the second level\nor deeper in the dependency tree. Findings from Decan et al. [10] show that in 2016 half of the client packages in npm had at least 22 transitive dependencies (indirect providers), and a quarter had at least 95 transitive dependencies. In this context, clients may face challenges in diagnosing where the manifesting breaking changes came from, because when a manifesting breaking change is introduced by an indirect provider, the client may not know this provider.\n\nOur results show that provider packages introduce manifesting breaking changes in minor and patch levels, which in principle should only contain backward-compatible updates according to the Semantic Versioning specification. Semantic Versioning is a recommendation that providers can choose to use or not [4, 8]. 
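To illustrate the risk, the default caret range automatically admits exactly the release levels in which most of the observed manifesting breaking changes were introduced; the check below uses the `semver` package, and the version numbers are illustrative.

```javascript
const semver = require('semver');

// Under the default caret range, minor and patch releases are adopted implicitly;
// Table 4 shows that 56 of the 64 observed cases (87.5%) were introduced at these levels.
console.log(semver.satisfies('1.3.0', '^1.2.0')); // true  (minor update, adopted automatically)
console.log(semver.satisfies('1.2.4', '^1.2.0')); // true  (patch update, adopted automatically)
console.log(semver.satisfies('2.0.0', '^1.2.0')); // false (major update, requires an explicit change)
```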
If providers do not comply with Semantic Versioning, several errors might be introduced, as we observed in Finding 4 that all manifesting breaking changes in pre-releases were propagated to stable releases (c.f., Finding 4). One hypothesis is that providers might be unaware of the correct use of the Semantic Versioning rules, which may explain why they propagated the unstable changes to stable releases. Finally, npm could provide badges where provider packages would be able to explicitly show that they are aware of and adhere to the Semantic Versioning. Trockman [24] claims that developers use visible signals (specifically on GitHub) like badges to indicate project quality. This way, clients could make a better choice about their providers and prefer those aware of Semantic Versioning.\n\n6 RELATED WORK\n\nThis section describes related work regarding breaking changes in npm and other ecosystems.\n\n**Breaking changes in npm:** Bogart et al. [5] present a survey about the stability of dependencies in the npm and CRAN ecosystem. The authors interviewed seven package maintainers about software changes. In this paper, interviewees highlighted the importance of adhering to Semantic Versioning to avoid issues with dependency updates. More recently, the authors investigated policies and practices in 18 software ecosystems, finding that all ecosystems share values such as stability and compatibility but differ on other values [4]. Kraaijeveld [14] studied API breaking changes in three provider packages. The author uses 3k client packages, parsing the providers\u2019 and clients\u2019 files to detect API breaking changes and their impact on clients. This work identified that 9.8% to 25.8% of client releases are impacted by API breaking changes.\n\nMezzetti et al. [15] present a technique called type regression testing that verifies the type of a returned object from an API and compares it with the returned type in another provider release. The authors chose the 12 most popular provider packages and their major releases, applying the technique in all patch/minor releases belonging to the first major update. They verified type regression in 9.4% of the minor or patch releases. Our research focused on any kind of manifesting breaking changes and we analyzed both client and provider packages, with 13.9% of releases impacted by manifesting breaking changes.\n\nMujahid et al. [19] focus on detecting break-inducing versions of third-party dependencies. The authors analyzed 290k npm packages. They flagged each downgrade in the provider version as a possible breaking change. These provider versions were tested using client tests and the authors identified 4.1% of fails after an update, which resulted in a downgrade. Similar to these authors, we resolved each client\u2019s providers for a release, but we ran the tests whenever at least one provider version changed.\n\nM\u00f8ller et al. [17] present a tool that uses breaking change patterns described by providers and fixes the client code. They analyzed a dataset with 10 of the most used npm packages and searched for breaking changes described in changelogs. We can compare our classification (Finding 3) with theirs. They found 153 cases of breaking changes that were introduced in major releases. They claim that most of the breaking changes (85%) are related to specific package API\npoints, such as modules, properties, and function changes. 
Considering our classification (Finding 3), feature changes, object type changed, undefined object, and renamed function can also be classified as changes in the package API and, if so, we claim that 64.06% of manifesting breaking changes are package API related.\n\n**Breaking changes in other ecosystems:** Brito et al. [6] studied 400 providers from the Maven repository for 116 days. The provider packages were chosen by popularity on GitHub and the authors looked for commits that introduced an API breaking change during that period. Developers were asked about the reasons for breaking changes that occurred. Our article presents similar results: the authors claim that New Feature is the most frequent way a breaking change is introduced, while we claim that Feature Change is the main breaking change type (Finding 3). Also, the authors similarly detected that breaking changes are frequently documented on changelogs (Finding 6).\n\nFoo et al. [12] present a study about API breaking changes in the Maven, PyPI, and RubyGems ecosystems. The study focuses on detecting breaking changes by computing a diff between the code of two releases. They found API-breaking changes in 26% of provider packages, and their approach suggests automatic upgrades for 10% of the packages. Our approach goes beyond API breaking changes; we found that 11.7% of the client packages are impacted by manifesting breaking changes.\n\n### 7 THREATS TO VALIDITY\n\n**Internal validity:** When a breaking change was detected, we verified the type of change that the provider package introduced and collectively grouped the changes into categories. However, some cases might fall into more than one category. For example, a provider package changes the type of an object to change/improve its behavior. This case might fall into Feature change and Object type changed. So, we categorized the case in the category that most represents the error. In this case, since the object is changed by a feature change, the most appropriate category would be Feature change.\n\nThe error cases that we categorized as breaking due to external change are the ones in which the clients or providers use\u2014or depend on\u2014external data/resources from sites and APIs that changed over time (see Finding 1). These cases represent about 8.1% of the client\u2019s releases, and in these cases, we could not search for manifesting breaking changes because we could not execute the release tests. After all, the data/resource needed by the test were no longer available. So, about 8% of client releases might be impacted by breaking changes, but we could not analyze them.\n\n**Construct validity:** In our approach to detecting breaking changes, we only performed an analysis when the client tests failed. If a client used a provider version that had a breaking change but the client did not call the function that causes the breaking change or did not have tests to exercise that code, we could not detect the breaking change. This is why we call all of our cases manifesting breaking changes.\n\nTherefore, we might not have detected all API breaking changes, as we were able to detect only API name changes and API removal. Parameter changes may not be detected because JavaScript allows making a call to an API with any number of parameters.\\(^{25}\\)\n\nWe restored the working tree index in the respective commit tagged by the developer for each release. We listed all tags in the repository, and we used the checkout with the respective tag. 
However, for untagged releases we performed a checkout in the timestamp referenced in the package.json. We trusted the timestamp once we verified that the tags and timestamp point to the same commit in 94% of cases for tagged repositories.\n\n\\(^{25}\\)https://eloquentJavaScript.net/03_functions.html#p_kzCivbonMM.\nLastly, we did not mention the file `npm-shrinkwrap.json` in our study. This file is intended to work like the file `package-lock.json` when controlling transitive dependency updates, but it may be published along with the package. However, `npm` strongly recommend avoiding its use. Also, the existence of `npm-shrinkwrap.json` files does not play any major role in our study, as they do not affect our results, based on our adopted research method. We did not include them in our study.\n\n**External validity:** We randomly selected client packages that varied in release numbers, clients, providers, and size. However, since we only analyzed `npm` packages hosted at GitHub projects, our findings cannot be directly generalized to other settings. It is also important to state that representativeness can also be limited because `npm` increases the number of packages and releases daily. Future work can replicate our study in other platforms and ecosystems. Finally, since the number of projects in our sample is small, we do not have enough statistical power to perform hypothesis tests around results that involve package-level comparisons.\n\n**Conclusion validity:** Conclusion validity relates to the inability to draw statistically significant conclusions due to the lack of a large enough data sample. However, as our research used a qualitative approach, we mitigate any potential conclusion threat by conducting a sanity check on repositories of all client packages with fewer than four releases. This guarantees that all packages are intended for use in production (Section 3.1.2). Finally, all of the manifesting breaking changes that we claim in our work were manually analyzed to ensure they are legitimate breaking changes that impact clients in the real world (Section 3.1.3).\n\n8 CONCLUSIONS\n\nSoftware reuse is a widely adopted practice, and package ecosystems such as `npm` support reusing software packages. However, breaking changes are a negative side effect of software reuse. Breaking changes and their impacts are studied in the literature in several software ecosystems [3, 6, 18, 28]. A few papers examine breaking changes in the `npm` ecosystem from the client packages perspective, i.e., executing the client tests to verify the impact of breaking changes [5, 15, 19]. In this work, we analyzed manifesting breaking changes in the `npm` ecosystem from the client and provider perspectives, providing an empirical analysis regarding breaking changes in minor and patch levels.\n\nFrom the client\u2019s perspective, we analyzed the impact of manifesting breaking changes. We found that 11.7% of clients are impacted by such changes and offer some advice to help clients and automated tool developers discover, avoid, and recover from manifesting breaking changes. Clients can use dependency bots to accelerate the process of upgrading their providers, and clients can look at changelog files for any non-desired updating, such as breaking changes. From the provider\u2019s perspective, we analyzed the most frequent causes of manifesting breaking changes. 
We found that the most common causes were when providers changed some rules/behaviors on features that had been stable over the last releases, when an object type changed, and when there were unintentionally undefined objects at runtime. Maintainers should pay attention during code review phases regarding these issues. Future research can look into the correlation among package characteristics and metrics with breaking change occurrence.\n\nREFERENCES\n\n[1] 2018. This year in JavaScript: 2018 in review and npm\u2019s predictions for 2019. (Dec 2018). https://blog.npmjs.org/post/180868064080/this-year-in-javascript-2018-in-review-and-npms.html.\n\n[2] Hussein Alrubaye and Mohamed Wiem Mkaouer. 2018. Automating the detection of third-party java library migration at the function level. In Proceedings of the 28th Annual International Conference on Computer Science and Software Engineering (CASCON\u201918). 60\u201371.\n[3] Christopher Bogart, Christian K\u00e4stner, James Herbsleb, and Ferdian Thung. 2016. How to break an API: Cost negotiation and community values in three software ecosystems. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE\u201916). 109\u2013120. https://doi.org/10.1145/2950290.2950325\n\n[4] Chris Bogart, Christian K\u00e4stner, James Herbsleb, and Ferdian Thung. 2021. When and how to make breaking changes: Policies and practices in 18 open source software ecosystems. ACM Trans. Softw. Eng. Methodol. 30, 4, Article 42 (July 2021), 56 pages. https://doi.org/10.1145/3447245\n\n[5] C. Bogart, C. K\u00e4stner, and J. Herbsleb. 2015. When it breaks, it breaks: How ecosystem developers reason about the stability of dependencies. In 2015 30th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW\u201915). 86\u201389. https://doi.org/10.1109/ASEW.2015.21\n\n[6] A. Brito, L. Xavier, A. Hora, and M. T. Valente. 2018. Why and how Java developers break APIs. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER\u201918). Campobasso, Mulise, Italy, 255\u2013265.\n\n[7] F. R. Cogo, G. A. Oliva, and A. E. Hassan. 2019. An empirical study of dependency downgrades in the npm ecosystem. IEEE Transactions on Software Engineering (Nov. 2019), 1\u201313.\n\n[8] A. Decan and T. Mens. 2019. What do package dependencies tell us about semantic versioning? IEEE Transactions on Software Engineering (May 2019), 1226\u20131240.\n\n[9] Alexandre Decan, Tom Mens, and Maelick Claes. 2016. On the topology of package dependency networks: A comparison of three programming language ecosystems. In Proceedings of the 10th European Conference on Software Architecture Workshops (ECSAW\u201916). Article 21, 4 pages. https://doi.org/10.1145/2993412.3003382\n\n[10] A. Decan, T. Mens, and M. Claes. 2017. An empirical comparison of dependency issues in OSS packaging ecosystems. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER\u201917). 2\u201312.\n\n[11] Alexandre Decan, Tom Mens, and Philippe Grosjean. 2019. An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empirical Software Engineer 24, 1 (Feb. 2019), 381\u2013416. https://doi.org/10.1007/s10664-017-9589-y\n\n[12] Darius Foo, Hendy Chua, Jason Yeo, Ming Yi Ang, and Asankhaya Sharma. 2018. Efficient static checking of library updates. 
In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 791\u2013796. https://doi.org/10.1145/3236024.3275535\n\n[13] D. Garrison, Martha Cleveland-Innes, Marguerite Koole, and James Kappelman. 2006. Revisiting methodological issues in transcript analysis: Negotiated coding and reliability. Internet and Higher Education 9, 1 (2006), 1\u20138.\n\n[14] Michel Kraaijeveld. 2017. Detecting Breaking Changes in JavaScript APIs. Master\u2019s thesis. Dept. Soft. Tech., Delft University of Technology, Delft, Netherlands. http://resolver.tudelft.nl/uuid:56e646dc-d5c7-482b-8326-90e0de4ea419.\n\n[15] Gianluca Mezzetti, Anders M\u00f8ller, and Martin Toldam Torp. 2018. Type regression testing to detect breaking changes in Node.js libraries. In Proceedings of the 32nd European Conference on Object-Oriented Programming (ECOOP\u201918) (Leibniz International Proceedings in Informatics (LIPIcs)). 7:1\u20137:24.\n\n[16] S. Mirhosseini and C. Parnin. 2017. Can automated pull requests encourage software developers to upgrade out-of-date dependencies? In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE\u201917). 84\u201394.\n\n[17] Anders M\u00f8ller, Benjamin Barslev Nielsen, and Martin Toldam Torp. 2020. Detecting locations in JavaScript programs affected by breaking library changes. Proc. ACM Program. Lang. 4, OOPSLA, Article 187 (Nov. 2020), 25 pages. https://doi.org/10.1145/3428255\n\n[18] Anders M\u00f8ller and Martin Torp. 2019. Model-based testing of breaking changes in Node.js libraries. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 409\u2013419. https://doi.org/10.1145/3338906.3338940\n\n[19] Suhail Mujahid, Rabe Abdalkareem, Emad Shihab, and Shane McIntosh. 2020. Using others\u2019 tests to identify breaking updates. In International Conference on Mining Software Repositories. https://doi.org/10.1145/3379597.3387476\n\n[20] Benjamin Barslev Nielsen, Martin Toldam Torp, and Anders M\u00f8ller. 2021. Semantic patches for adaptation of JavaScript programs to evolving libraries. In Proc. 43rd International Conference on Software Engineering (ICSE\u201921).\n\n[21] S. Raemaekers, A. van Deursen, and J. Visser. 2014. Semantic versioning versus breaking changes: A study of the maven repository. In 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation. 215\u2013224. https://doi.org/10.1109/SCAM.2014.30\n\n[22] Anselm Strauss and Juliet Corbin. 1998. Basics of Qualitative Research Techniques. Thousand Oaks, CA: Sage Publications.\n\n[23] Jacob Stringer, Amjed Tahir, Kelly Blincoe, and Jens Dietrich. 2020. Technical lag of dependencies in major package managers. In Proceedings of the 27th Asia-Pacific Software Engineering Conference (APSEC\u201920). 228\u2013237. https://doi.org/10.1109/APSEC51365.2020.00031\n\n[24] Asher Trockman. 2018. Adding sparkle to social coding: An empirical study of repository badges in the npm ecosystem. In 2018 IEEE/ACM 40th International Conference on Software Engineering: Companion (ICSE-Companion\u201918). 524\u2013526.\n[25] K. F. T\u00f3masd\u00f3ttir, Maur\u00edcio Aniche, and Arie Deursen. 2018. The adoption of JavaScript linters in practice: A case study on ESLint. *IEEE Transactions on Software Engineering* PP (Sept. 2018), 26. https://doi.org/10.1109/TSE.2018.2871058\n\n[26] K. F. 
T\u00f3masd\u00f3ttir, M. Aniche, and A. van Deursen. 2017. *Why and How JavaScript Developers Use Linters*. Master\u2019s thesis. Dept. Soft. Tech., Delft University of Technology, Delft, Netherlands.\n\n[27] Mairieli Wessel, Bruno Mendes De Souza, Igor Steinmacher, Igor S. Wiese, Ivanilton Polato, Ana Paula Chaves, and Marco A. Gerosa. 2018. The power of bots: Characterizing and understanding bots in OSS projects. *Proceedings of the ACM on Human-Computer Interaction* 2, CSCW (2018), 1\u201319.\n\n[28] Jooyong Yi, Dawei Qi, Shin Hwei Tan, and Abhik Roychoudhury. 2013. Expressing and checking intended changes via software change contracts. In *Proceedings of the 2013 International Symposium on Software Testing and Analysis* (ISSTA\u201913). 1\u201311. https://doi.org/10.1145/2483760.2483772\n\n[29] Ahmed Zerouali, Eleni Constantinou, Tom Mens, Gregorio Robles, and Jesus Gonzalez-Barahona. 2018. An empirical analysis of technical lag in npm package dependencies. https://doi.org/10.1007/978-3-319-90421-4_6\n\nReceived 19 November 2021; revised 27 October 2022; accepted 8 November 2022", "source": "olmocr", "added": "2025-06-23", "created": "2025-06-23", "metadata": {"Source-File": "/home/nws8519/git/adaptation-slr/studies_pdfs/025_venturini.pdf", "olmocr-version": "0.1.76", "pdf-total-pages": 26, "total-input-tokens": 66754, "total-output-tokens": 21181, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 3737, 1], [3737, 7767, 2], [7767, 10956, 3], [10956, 14625, 4], [14625, 16769, 5], [16769, 19849, 6], [19849, 22557, 7], [22557, 26280, 8], [26280, 30473, 9], [30473, 34195, 10], [34195, 37896, 11], [37896, 41521, 12], [41521, 44739, 13], [44739, 47543, 14], [47543, 51077, 15], [51077, 54366, 16], [54366, 55917, 17], [55917, 59366, 18], [59366, 62112, 19], [62112, 64536, 20], [64536, 68623, 21], [68623, 72746, 22], [72746, 76602, 23], [76602, 80580, 24], [80580, 86148, 25], [86148, 87465, 26]]}}
{"id": "d9c52abedbf4d07960b139cd02507138e707b984", "text": "When and Why Developers Adopt and Change Software Licenses\n\nChristopher Vendome\\textsuperscript{1}, Mario Linares-V\u00e1squez\\textsuperscript{1}, Gabriele Bavota\\textsuperscript{2}, Massimiliano Di Penta\\textsuperscript{3}, Daniel M. German\\textsuperscript{4}, Denys Poshyvanyk\\textsuperscript{1}\n\n\\textsuperscript{1}The College of William and Mary, VA, USA \u2014 \\textsuperscript{2}Free University of Bolzano, Italy \u2014 \\textsuperscript{3}University of Sannio, Italy \u2014 \\textsuperscript{4}University of Victoria, BC, Canada\n\nAbstract\u2014Software licenses legally govern the way in which developers can use, modify, and redistribute a particular system. While previous studies either investigated licensing through mining software repositories or studied licensing through FOSS reuse, we aim at understanding the rationale behind developers\u2019 decisions for choosing or changing software licensing by surveying open source developers. In this paper, we analyze when developers consider licensing, the reasons why developers pick a license for their project, and the factors that influence licensing changes. Additionally, we explore the licensing-related problems that developers experienced and expectations they have for licensing support from forges (e.g., GitHub).\n\nOur investigation involves, on one hand, the analysis of the commit history of 16,221 Java open source projects to identify the commits where licenses were added or changed. On the other hand, it consisted of a survey\u2014in which 138 developers informed their involvement in licensing-related decisions and 52 provided deeper insights about the rationale behind the actions that they had undertaken. The results indicate that developers adopt licenses early in the project\u2019s development and change licensing after some period of development (if at all). We also found that developers have inherent biases with respect to software licensing. Additionally, reuse\u2014whether by a non-contributor or for commercial purposes\u2014is a dominant reason why developers change licenses of their systems. Finally, we discuss potential areas of research that could ameliorate the difficulties that software developers are facing with regard to licensing issues of their software systems.\n\nIndex Terms\u2014Software Licenses, Mining Software Repositories, Empirical Studies\n\nI. INTRODUCTION\n\nSoftware licenses are the legal mechanism used to determine how a system can be copied, modified, or redistributed. Software licenses allow a third party to utilize code as long as they adhere to the conditions of the license. In particular, open source licenses are those that comply with the Open Source Definition [4]. Specifically, the goal of these licenses is to facilitate further copying, modifying, and distributing software as long as a set of ten conditions are met (such as free redistribution and availability of source code).\n\nFor software to be open source, its creators must choose an open source license. However, there is a large number of open source licenses in use today. They range from highly restrictive (such as the General Public License\u2014GPL\u2014family of licenses) to ones with very few restrictions (such as the MIT license). The choice of a license will determine if, and how, a given open source software can be reused. This is especially true for libraries that are expected to be integrated and distributed with the software that uses them. 
Furthermore, the choice of a license might also be affected by the dependencies used (e.g., software that uses a library under the GPL requires the software to be GPL also, while software that uses a library under the MIT license can be under any license, including commercial).\n\nAt some point, the creators of open source software must choose a license that: 1) expresses the developers' philosophy; 2) meets their deployment goals; and 3) is consistent with the licenses of the components reused by that software. However, choosing a license is not an easy process. Developers do not necessarily have a clear idea of the exact consequences of licensing (or not licensing) their code under a specific license; for instance, developers ask questions on Question & Answer (Q&A) websites looking for advice on issues such as how to redistribute code released under a dual license (e.g., question 2758409 in Stack Overflow [19] and question 139663 in the StackExchange site for programmers [28]). Also, the problem of license incompatibility between components is not trivial (see [15] for a detailed description of this problem).\n\nDuring the evolution of a software system, its license might change. In our previous work [30], we empirically showed\u2014for software hosted in GitHub\u2014that license changes are a common phenomenon. Stemming from the results that we previously captured by analyzing licenses and their changes in software repositories [30], the goal of this work is to understand when and why changes in licensing happen. Specifically, this paper reports the results of a survey of 138 developers with the aim of understanding (i) when developers consider adding a license to their project, (ii) why they choose a specific license for their projects, and (iii) the factors influencing license changes. The 138 participants are the respondents from a set of 2,398 invitees, i.e., 5.75% of the invitees. We identified such developers by sampling 16,221 Java projects on GitHub, and then subsetting to 1,833 projects where the license changed over time. Of these 138 developers, 52 offered insights into the aforementioned questions, while the remaining developers reinforced that licensing decisions are not necessarily made by all contributors, but by the subset who are the copyright holders. The main findings of this study are the following:\n\n1) Developers frequently license their code early, but the main rationale for delaying licensing is usually to wait until the first release;\n\n2) Developers have strong intrinsic beliefs that affect their choice of licenses. Also, open source foundations, such as the Apache Software Foundation, the Free Software Foundation, and the Eclipse Foundation, exert a powerful influence on the choice of a license;\n\n3) We observed that the change of a system's license(s) is predominantly influenced by the need to facilitate reuse (mostly in commercial systems);\n\n4) Developers experience difficulties in understanding licensing terms and dealing with incompatible licenses.\n\nII. RELATED WORK\n\nOur work is mainly related to (i) the automatic identification and classification of licensing in software artifacts, (ii) empirical studies investigating license adoption and license evolution, and (iii) qualitative studies on software licensing. Table I presents prior work in licensing by reporting the main purpose of each study and the corresponding dataset used.\n\nA.
Identifying and Classifying Software Licensing\n\nAutomatic identification of software licensing has been widely explored before. To the best of our knowledge, the FOSSology project [17] was the first one aimed at solving the problem of license identification by extracting the licensing information of projects and using machine learning for classification. Another representative project is the ASLA tool by Tuunanen et al. [29], which showed an 89% accuracy with respect to classifying the licenses of files in FOSS systems.\n\nThe current state-of-the-art automated tool for license identification, Ninka, was proposed by German et al. [16]. Ninka relies on pattern-matching in order to identify licensing statements and return the license name and version (e.g., Apache-2.0). The evaluation of Ninka indicated a precision of 95%.\n\nSince software is not always distributed with or as source code, the traditional approaches for license identification that are based on the parsing of the licensing statements are not always applicable (byte-code or binaries do not inherently contain licensing information). To ameliorate this problem, Di Penta et al. [9] proposed an approach that uses code search and textual analysis to automatically identify the licensing of jars. The approach automatically queried Google Code Search by extracting information from decompiled code. Additionally, German et al. investigated the ability to identify FOSS licensing in conjunction with proprietary licensing by analyzing 523,930 archives [12].\n\nIn this paper, we rely on Ninka [16] for license identification, since it is the current state-of-the-art technique. However, our work does not aim to improve upon license identification or classification, but, rather, to understand the rationale behind licensing decisions.\n\nB. Empirical Studies on Licenses Adoption and Evolution\n\nDi Penta et al. [10] investigated license migration during the evolution and maintenance of six FOSS projects. While the authors were unable to find a generalizable pattern among the projects, the results suggested that both version and type of license were modified during the systems\u2019 life cycles.\n\nGerman et al. [15] investigated the way in which developers handle license incompatibilities by analyzing 124 FOSS packages and from this investigation they constructed a model that outlines the advantages and disadvantages of certain licenses as well as their applicability. Additionally, German et al. [13] conducted an empirical study to (i) understand the extent to which package licensing and source code files were consistent and (ii) evaluate the presence of licensing issues due to the dependencies among the packages. The authors investigated 3,874 packages of the Fedora-12 Linux distribution and they confirmed a subset of licensing issues with the developers at Fedora. Manabe et al. [21] analyzed FreeBSD, OpenBSD, Eclipse, and ArgoUML in order to identify changes in licensing. The authors found that each of the four projects exhibited different patterns of changes in licensing.\n\nGerman et al. analyzed fragments of cloned code between the Linux Kernel and both OpenBSD and FreeBSD [14]. They investigated the extent to which terms of the licenses were adhered during the cloning of these code fragments. Similarly, Wu et al. [31] found that cloned files have a potential to be inconsistent in terms of licenses (e.g., one has a license, while the other does not). 
The paper describes the types of inconsistencies and illustrates the problem, and the difficulty of resolving it, through an empirical study of Debian 7.5.\n\nThe empirical study most closely related to this work is our previous work [30], which analyzed license usage and license changes over 16,221 projects and sought to extract rationale from commit messages and issue tracker discussions. The results indicated a lack of documentation of licensing in both sources. While sharing the same motivation, this work is novel as it investigates when and why developers choose to license a project or change licensing (as opposed to the extent to which these changes occur) and presents rationale from a survey conducted with actual developers of the projects from our dataset, instead of relying just on the rationale from the issue tracker discussions or from commit messages.\n\nTABLE I: Prior work on software licensing, reporting the main purpose of each study and the corresponding dataset.\n\n| Study | Purpose | Dataset |\n|----------------|-------------------------------------------------------------------------|--------------------------|\n| German et al. | Investigate the presence of license incompatibilities | 3,874 packages |\n| Di Penta et al.| Investigate license evolution during a system's maintenance and evolution | 6 systems |\n| German et al. | Investigate the way in which developers address incompatible licensing | 124 systems |\n| German et al. | Investigate licensing between copied code fragments in Linux and two BSD distributions | 3 systems |\n| Manabe et al. | Investigate license change patterns within FOSS systems | 4 systems |\n| Singh et al. | Investigate the reasons for the adoption of a particular FOSS license | 5,307 projects |\n| Sojer et al. | Investigate reuse and legal implication of Internet code | 686 developers |\n| Sojer et al. | Investigate FOSS code reuse | 869 developers |\n| Vendome et al. | Investigate license usage and changes in FOSS systems and the rationale in the revision history and issue tracker | 16,221 systems |\n\nC. Qualitative Studies on Software Licensing\n\nSingh and Phelps [25] studied the reasons behind the adoption of a specific license in a FOSS project. Their results suggest that such a choice is mainly driven by social factors\u2014the adoption of a license in a new project is based on the licenses adopted by socially close existing projects (e.g., projects from the same ecosystem). Their work considered license adoption from a social networking perspective to see how the \u201clicensor\u201d may be influenced toward a particular license(s) based on social proximity. Our work does not investigate latent social connections between developers or the projects to which they contributed. Instead, we directly surveyed the developers to understand their reasoning for adopting a particular license.\n\nSojer et al. conducted a survey with 869 developers regarding reuse of open source code and the legal implications of the resulting code [26]. One key finding was that industry and academic institutions did not prioritize knowledge regarding licensing and reuse. The authors compared a self-assessment to a questionnaire on licensing and found a discrepancy between perceived knowledge and actual understanding of licensing. Additionally, Sojer et al. conducted a survey of 686 practitioners regarding reuse of FOSS code and found that licensing of FOSS code was the second largest impediment to reuse [27].
While the authors point to possible reasons for this observation, our study specifically aims to understand the reasons for choosing and changing licenses, as well as the types of problems that practitioners face due to licensing.\n\nIII. DESIGN OF THE STUDY\n\nThe goal of our study is to investigate when developers consider licensing issues and the reasons why developers pick or change licensing in FOSS projects. The context consists of software projects, i.e., the change history of 16,221 Java FOSS projects mined from GitHub, and subjects, i.e., 138 practitioners contributing to a subset of the mined projects.\n\nA. Research Questions\n\nWe aim at answering the following research questions:\n\n- **RQ1.** When and why do developers first assert a license for their project? This research question first examines when developers commit a license to at least one file in FOSS projects hosted on GitHub (i.e., the project goes from no licensing to at least one license). We complement this analysis with questions for developers to understand the actual rationale behind the empirical observations.\n\n- **RQ2.** When and why do developers change the licensing of their project? This research question relies on an analysis similar to that of the previous question, but it specifically investigates licensing changes (i.e., the change from license $A$ to license $B$).\n\n- **RQ3.** What are the problems that developers face with licensing and what support do they expect from a forge? This question aims at understanding the problems that developers experience with licensing, to better support them. Additionally, we are interested in understanding the expectations that developers may have for licensing support incorporated by forges.\n\nIn order to answer our research questions, we consider two perspectives: (i) evidence collected by analyzing projects' change history; and (ii) evidence collected by surveying developers. Both perspectives are explained in the following.\n\nB. Analysis of the Projects' Change History\n\nTo investigate when developers pick or change licensing, we mined the entire commit history of 16,221 public Java projects on GitHub. We first queried GitHub, using the public API [2], to generate project information for all of the publicly available projects. We extracted a comprehensive list of 381,161 Java projects by mining the project information of over twelve million projects, and we locally cloned all of the Java repositories, which consumed a total of 6.3 TB of storage space. Due to the computation time required by the underlying infrastructure, we randomly sampled 16,221 projects and analyzed the licensing of all of their file revisions at commit-level granularity (1,731,828 commits spanning 4,665,611 files). Table II reports statistics about size attributes of the analyzed dataset and the overall number of different licenses considered in our study.\n\nWe relied upon the MARKOS code analyzer [7] to extract the licensing throughout each project's revision history. The code analyzer incorporates the Ninka license classifier [16] in order to identify the licensing statements and classify the license by family and version (when applicable) for each file. The code analyzer mined the change log of the 16,221 projects and extracted the commit hash, date, author, file, commit message, the change to the file (Addition, Modification, or Deletion), a license-change flag (Boolean value), and the license name and version (reported as a list when multiple licenses are detected).\n\nThe data extraction step for the 16,221 projects took almost 40 days, and a total of 1,731,828 commits spanning 4,665,611 files were analyzed. In the case of the BSD and CMU licenses, we only reported a variant of either license, since Ninka was unable to identify the particular version. In the case of GPL and LGPL, it is possible for the license to have an exception that allows developers to pick future versions of that license, and we annotate such licenses with a \u201c+\u201d (e.g., GPL-2.0+ signifies that the terms of GPL-3.0 can also be used).\n\nTo identify licensing changes, we followed the same procedure exploited in our previous work [30]. In particular, we identify a commit $c_i$ as responsible for introducing a license in a code file $F$ if before $c_i$ Ninka did not identify any license in $F$, while after $c_i$ a license in $F$ is retrieved (i.e., a No License $\rightarrow$ Some License transition on $F$). Instead, we consider $c_i$ as a licensing change if the license type and/or version detected by Ninka on $F$ before $c_i$ is different from the one detected after $c_i$ (i.e., Some License $\rightarrow$ Some Other License transitions).
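To make these two transition rules concrete, the following is a minimal sketch (not the MARKOS/Ninka tooling itself) that scans the chronological sequence of per-commit license observations for a single file and flags license introductions and license changes; the data layout and names are illustrative assumptions.

```python
def classify_transitions(history):
    """history: chronologically ordered list of (commit_hash, licenses)
    pairs for one file, where `licenses` is the set of license identifiers
    detected in that revision (empty set if no license was detected).
    Returns (introductions, changes) as lists of commit hashes."""
    introductions, changes = [], []
    previous = None
    for commit, licenses in history:
        if previous is not None:
            if not previous and licenses:
                # No License -> Some License: the commit introduces a license.
                introductions.append(commit)
            elif previous and licenses and licenses != previous:
                # Some License -> Some Other License: type and/or version changed.
                changes.append(commit)
        previous = licenses
    return introductions, changes

# Example: a file that starts unlicensed, gains the Apache-2.0 license,
# and is later re-licensed under GPL-3.0+.
example = [
    ("c1", set()),
    ("c2", {"Apache-2.0"}),  # introduction
    ("c3", {"Apache-2.0"}),
    ("c4", {"GPL-3.0+"}),    # license change
]
print(classify_transitions(example))  # (['c2'], ['c4'])
```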
C. Analysis of the Developers' Survey\n\nTo investigate the reasons why developers add/change the license(s) of their systems, we surveyed the developers who made licensing changes in the systems to which they contributed. To find potential developers for our survey, we utilized the results of our quantitative analysis. From the 16,221 projects that we analyzed, we found 1,833 projects that had experienced either a delayed initial license addition (i.e., the No License $\rightarrow$ Some License transition happened after the first project commit) or a licensing change (i.e., Some License $\rightarrow$ Some Other License) over their change history. We included both scenarios to understand the rationale behind both RQ$_1$ and RQ$_2$, which required a change in licensing. For each of these projects, we used the version control history to extract the set of all its contributors. From the 1,833 projects with licensing changes, we identified a total of 2,398 valid developer e-mail addresses, whose owners we targeted as potential participants for our study. By valid, we mean that we filtered out contributor e-mail addresses matching the following two patterns\u2014\u201c[user]@localhost.*\u201d or \u201c[user]@none.*\u201d\u2014since they pointed to clearly invalid domains. We also removed developers of the Android framework, since the framework has always been licensed under the Apache license. The 2,398 developers were invited via e-mail to fill in an online survey hosted on Qualtrics [5] (the survey answers were all anonymous). This e-mail invitation included (i) a link to the survey, and (ii) a description of the specific licensing addition/change(s) we observed in their project's history. After being contacted, some developers offered further insights regarding these changes by directly responding to our email. In total, we emailed 2,398 individuals and received 138 responses to the survey and 15 follow-up emails in which developers volunteered additional information.
Overall, we had a response rate of 5.75% of the developers we contacted.\n\nThe survey consisted of seven questions (Q1-Q7); Q7 was optional (only 12 participants answered it). Tables III and IV list the survey questions and the responses of the developers. Q1 and Q2 were dichotomous questions. These questions were used to ensure that the respondents were involved in determining the project\u2019s licensing. If a respondent did not answer \u201cyes\u201d to Q2, the survey ended for the participant. Out of 138 participants, 62 responded \u201cno\u201d to the Q2 and so they were ineligible for the remaining questions (Q3-Q7). Questions Q3 to Q6 were multiple choice questions and included the \u201cOther\u201d option. If the respondents chose \u201cOther\u201d, they could further elaborate using an open-ended field. Question Q7 was optional and open-ended. We chose to make it optional, because some developers may not agree that the forge should be responsible for features supporting licensing. Out of 138 respondents, 76 developers were eligible for the entire survey (Q1-Q7) as per their response to Q2, but only 52 of those individuals completed the survey.\n\nSince questions Q3-Q7 also included open-ended responses, we relied on a formal grounded-theory [8] coding of the open-ended responses. Three authors read all the responses and categorized each response that represented the developer\u2019s rationale. The categories from the three authors were analyzed and merged during a second round to obtain a final taxonomy of categories. The Tables in Section IV present the final results of the grounded-theory process.\n\nIV. RESULTS\n\nThis section discusses the achieved results answering the three research questions formulated in Section III-A.\n\nA. When are the licenses added to FOSS projects?\n\nFig. 1 shows the distribution of the number of commits in which licenses were introduced into the projects within our dataset (e.g., a license introduced during the tenth commit will be represented by the number 10). We present the raw commits in log scale due to outliers from large commit histories. At least 25% (first quartile) of the projects were licensed in the first commit (Fig. 1). The median was also at two commits and third quartile was at five commits. This observation indicates that FOSS projects are licensed very early in the change history with over 75% of the projects having a license by the fifth commit. Assuming (but this might not be always the case) that the observed history corresponds to the entire project history, this result suggests that licensing is important to developers. It is interesting to note that the mean commit number for adding a license is 21 and the maximum value is 8623 commits. These two values are indicators of a long tail with a small number of projects that consider licensing late in the change history.\n\nSummary for RQ$_1$ (Project History Results): we observed that developers consider licensing early in the change histories of FOSS projects. While there are projects that assert a license after a larger number of commits, 75% of our dataset had a license asserted within the first five commits. Thus, the data suggests that most of the projects adopt licenses among the very first commit activities.\n\nB. Why are licenses added to FOSS projects?\n\nTable III reports the responses to Question 3 (Q3) of our survey in which we tried to ascertain the rationale behind the initial project licensing. 
30.8% of developers indicated that the community influences the initial licensing. One explanation for the high prevalence of this response is that certain FOSS communities stipulate and enforce that a particular license must be used. For example, the Apache Software Foundation requires that its projects, and the code contributed to them, are licensed under the Apache-2.0 license. In turn, the Free Software Foundation promotes the use of the GPL and LGPL families of licenses.\n\n19.2% of developers chose the license with the goal of making their project reusable in commercial applications. These responses also indicate a bias toward more permissive licenses, which facilitate such usage, while restrictive licenses can discourage it, since they require that a system be licensed under the same terms. This finding provides a partial explanation for the trend toward more permissive licenses we observed in our previous work [30].\n\nThe results of our survey also show that licensing-related decisions are impacted by inherent developer bias. 15.4% of developers supplied answers that we categorized as moral-ethical-beliefs. An example of this category was the response by one developer indicating, \u201cI always use GPL-3.0 for philosophical reasons.\u201d Similarly, a different developer echoed this comment, stating \u201cI always licence GPL, moral reasons.\u201d\n\nSatisfying a dependency constraint (i.e., the need to use a license based on the license of dependencies) was a relevant reason (9.6%: 7.7% picking the explicit option and 1.9% giving an \u201cOther\u201d response categorized as dependency constraint). This result is important, since little work has been done to analyze licensing across software dependencies. This problem also poses challenges in identifying both all of the necessary dependencies and the license(s) of those dependencies. Some automated build frameworks like Maven [6] or Gradle [3] attempt to ameliorate this difficulty by listing dependencies in a file that drives the building process (e.g., the Project Object Model file in Maven). However, licensing is not a required field in those files.\n\nThe remaining answers to this question described situations in which the license was inherited from the initial founders and persisted over time. Also, some companies have policies that dictate a specific licensing convention. In the latter case, the respondent indicated that \u201ccompany (...) policy is Apache-2.0\u201d (company name omitted for privacy). It was also interesting to see that nobody chose a license based on requests by outsiders.\n\nLastly, we identified a category in licensing changes that related to license adoption rather than changes. 7.7% of developers responded to our question on licensing changes by indicating that the license was missing and was added in a later commit. For this case, we added (License Addition) to the category for Q4 in Table III. The developers noted that \u201cSetting the license was just forgotten in the first place\u201d and \u201cAccidentally didn\u2019t include explicit licence in initial commit\u201d. These cases are also important, since they can create inconsistencies within the system or mislead non-contributors into thinking that the project is unlicensed or licensed under incompatible terms.
This result further reinforces that developers view early license adoption as important, but the lack of a license may be a mistake.\n\n**Summary for RQ1 (Survey Results):** the initial licensing is predominantly influenced by the community to which a developer is contributing. Subsequently, commercial reuse is a common factor, which may reinforce the prevalence of permissive license usage. While reuse is a consideration, non-contributors do not seem to impact the initial licensing choice. We also found that the inclusion of a particular dependency can impact the initial licensing of a project.\n\n**C. When are licenses changed in FOSS projects?**\n\nFig. 2 shows the distribution of when licenses were changed in the projects within our dataset (i.e., Some License→Some Other License). As in the previous section, we present the raw commit number in which the changes occurred (log scale due to outliers from large commit histories). Interestingly, the minimum value was the second commit (i.e., a license changed right after its addition in the first commit). More generally, 25% of license changes occur in the first 100 commits. The median value is 559 commits, while the mean is 3,993 commits. The third quartile (2,086 commits), considerably smaller than the mean, suggests a long tail of license changes occurring late in the projects' change histories. The maximum commit number with a license change was commit 56,746. Such extreme values explain why the mean is much larger than the median. Overall, the data suggests that certain projects change licenses early in the change history; however, license changes are much more prevalent in later commits.\n\n**Summary for RQ2 (Project History Results):** we observed that developers change licensing later in the change history of FOSS projects. While there are projects that change licensing early, our first quartile was 100 commits and the third quartile was 2,086 commits, demonstrating that more substantial development occurred before the licensing changed.\n\n**D. Why are licenses changed in FOSS projects?**\n\nTable III shows the responses to Question 4 (Q4) of our survey, in which we investigated the rationale behind license changes. Allowing reuse in commercial software was the most common reason behind licensing changes (32.7%). This option was also the second most prevalent for choosing the initial license (19.2% of developers). Combining these two results, it is clear that the current license of a project is heavily affected by its need to be reused commercially. As previously stated, this result qualitatively supports the observation from our previous work [30], where we observed that projects tend to migrate toward less-restrictive licenses.\n\n7.7% of developers changed licensing due to community influence. This response was a more significant factor for the initial choice of licensing, but it further emphasizes the impact that a community can exert. One developer commented, \u201ccommunity influence (contributing to Apache\u2019s projects)\u201d. Similarly, two developers commented about the influence the Eclipse Foundation exercised over license changes in their projects. Interestingly, one developer reported: \u201cI wanted to use the most common one for OSS Java projects\u201d. This response suggests that a particular license may pick up more momentum and spread for a particular language. Interestingly, we observed that 7.7% of the developers were willing to change the licensing due to requests from non-contributors.
The fact that this response was more prevalent for changing licensing than for choosing the initial license may be influenced by outsiders waiting until a project is stable or mature before inquiring about particular licensing.\n\nWe also observed that both a change in the license(s) of a dependency and the use of a new dependency prompted developers to change licenses (5.8% of developers in both cases). This observation further demonstrates the difficulty or impact that dependencies can have with respect to licensing. It also suggests that there could be inconsistencies between the licensing of a system and its dependencies.\n\nMoral-ethical-beliefs are also a reason for 5.8% of developers. Interestingly, we observed both the beliefs of developers and the beliefs of a philanthropist funding the project's development. While one developer acknowledged, \u201cI simply wanted to pick a \u2018free\u2019 license and chose Apache without much consideration,\u201d another developer indicated that \u201cPhilanthropic funders encouraged us to move to GPL3, as well as our own internal reflection on this as we came to understand GPL3 better.\u201d In the former example, it is notable that the developer's concern was not the impact of the Apache license in particular, but rather that the primary motivator was simply having a free (i.e., FOSS) license. The latter indicates that the individuals funding a project can influence its licensing. While the developers were not coerced to change to the GPL-3.0, they were still influenced by the beliefs of the individuals funding the system's development.\n\n**Summary for RQ2 (Survey Results):** developers seem to change licensing to support reuse in commercial systems. While community influence still impacts changing licensing, it appears to be a less significant factor than it is for license adoption. Based on our survey results, the reasons behind changing licensing are more diverse and more evenly distributed among the topics than we observed in the selection of the initial license.\n\n**E. What are the problems that developers face with licensing and what support do they expect from a forge?**\n\nTable IV shows the results for Questions 5-7 (Q5-Q7), which investigate both the problems that developers experience with licensing and the licensing support they expect from the forge.\n\nIn Q5, we investigated the problems related to licensing that developers have experienced. 23 out of 52 developers (44.2%) explicitly mentioned \u201cNo problem\u201d in the \u201cOther\u201d field. For those who recognized problems, the main reason was the inability of others to use the project due to its license (17.3%). The fact that developers consider this a problem suggests that they are interested in allowing broad access to their work. However, they may be constrained by desired protections (e.g., the patent protection offered by Apache-2.0 or GPL-3.0) or by external factors, like the licensing of dependencies (external since the developers cannot change those licenses).\n\nAdditionally, developers indicated that choosing the correct license was difficult for them (13.5%). The legalistic nature of these licenses can lead to misinterpretations by developers. For example, the Apache Foundation states on its webpage that \u201cThe Apache Software Foundation is still trying to determine if this version of the Apache License is compatible with the GPL\u201d [1]. Additionally, 5.8% of developers indicated that they experienced misunderstandings with respect to license compatibility.
To make matters worse, 9.6% of the developers experienced compatibility problems with dependencies. Therefore, developers not only faced difficulty while determining the appropriate license, but they also misunderstood the compatibility among licenses and experienced incompatibility between their project's licensing and a desired dependency's licensing.\n\nDevelopers also experienced difficulties with their users misinterpreting or not understanding the terms of their license. One developer stated that \u201cUsers do not read/understand the license, even though it is a most simple one.\u201d This result poses two possible problems\u2014either users (i.e., developers looking to reuse the code) ignore the actual licensing text, or they struggle to interpret even the easier licenses. The former would demonstrate a bigger problem, in that users do not take licensing seriously, while the latter demonstrates that the difficulty in understanding licensing extends beyond just the most legalistic licenses. Reinforcing the second scenario, another developer noted the problem was \u201cJust the usual challenges of talking with potential commercial partners who do not understand the GPL at all\u201d. The phrasing \u201cthe usual challenges\u201d suggests that the developer had repeated experience with partners unable to understand licensing. This is not necessarily an isolated case, but rather a potentially widespread experience shared by other developers.\n\nRegarding the support provided by the forge, in this case GitHub, we investigated the impact of a feature added to help document the license of a project\u2014see Q6 in Table IV. This feature was added as a response to criticism from some practitioners [24]. While 36.5% of developers did not have access to the feature at the time they created their project, the interesting result is that more than half (51.9%) of developers were not influenced by the availability of such a tool. Additionally, the \u201cOther\u201d responses indicated that the feature would not have had an impact on their choice (3.8%), and a single developer specifically chose not to license her project, leading to a combined 58% of developers who were unaffected by this feature. Thus, our data suggests that this GitHub feature did not affect/influence developers when licensing (or not licensing) software hosted on GitHub.\n\nFinally, we received 11 responses to our optional question (Q7) concerning whether forges should provide features that assist with the licensing of their software. Since GitHub has been criticized by practitioners [24] for a lack of licensing consideration, this question seeks to understand the features that practitioners expect from a forge to this end. 10 out of 11 participants answered \u201cNone\u201d. Of those 10 developers, only one explained that a third-party tool should handle license compatibility analysis. The respondent indicated that the ideal tool would use the various forges and build frameworks to form a dependency graph of license compatibility, stating the following:\n\n\u201cThis is the job of a 3rd party tool IMO since neither github nor forge do or should own all open source deps. A 3rd party tool ideally would know about github, bitbucket, etc + poms and pom license fields, etc and form a comprehensive dep-graph license compat view given a node.\u201d\n\nAnother developer noted, \u201cNone.
From our perspective it really isn\u2019t that hard to put copyright and licence notices in our source files.\u201d This comment is interesting since it conflicts with results from Q4, where developers indicated that licenses were sometimes missing or an incorrect license was used.\n\nThe only developer wishing support from the forge indicated a desire for a license compatibility checker and a license selection wizard. This developer described the desire for two particular features, stating the following:\n\n\u201c1) License compatibility checker - verify the license of your project with the license of included support software (gems, libraries, includes) and alert user to potential conflicts. This could also be used for the use case that you want to adopt a piece of software to add to an existing project - is it compatible? 2) License selection wizard - when you begin a project, the wizard can ask you a series of questions (do you want to allow commercial use, do you require mods to be licensed the same as original, etc) and then suggest a license for the project.\u201d\n\nTABLE III: Survey questions and responses on licensing involvement, initial license choice, and license changes (Q1-Q4).\n\n| Question/Answer | #D | % |\n|-----------------|----|---|\n| Q1. Were you involved in changes occurring to parts of the system that underwent license changes? | 138 | |\n| Yes | 75 | 54.3% |\n| No | 63 | 45.7% |\n| Q2. Were you involved in determining the license or change in license of the project or some of its files? | 138 | |\n| Yes | 76 | 53.7% |\n| No | 62 | 46.3% |\n| Q3. How did you determine/pick the initial license for your project or files in your project? | 52 | |\n| Dependency constraint | 4 | 7.7% |\n| Community influence (e.g., contributing to Apache projects) | 16 | 30.8% |\n| Requests by non-contributors to reuse your code | 0 | 0% |\n| Interest of reuse for commercial purposes | 10 | 19.2% |\n| Other (please specify) | 22 | 42.3% |\n| \u2014 Closed-source | 1 | 1.9% |\n| \u2014 Company-policy | 2 | 3.8% |\n| \u2014 Dependency-constraint | 1 | 1.9% |\n| \u2014 Inherit-license | 3 | 5.8% |\n| \u2014 Moral-ethical-belief | 8 | 15.4% |\n| \u2014 Project-Specific | 2 | 3.8% |\n| \u2014 Social-trend | 2 | 3.8% |\n| \u2014 None | 3 | 5.8% |\n| Q4. What motivated or caused the change in license? | 52 | |\n| License of dependencies changed | 3 | 5.8% |\n| Using a new library imposing specific licensing constraints | 3 | 5.8% |\n| Allow reuse in commercial software | 17 | 32.7% |\n| Requests by non-contributors to reuse your code | 4 | 7.7% |\n| Other (please specify) | 25 | 48.1% |\n| \u2014 Change-to-license-text | 2 | 3.8% |\n| \u2014 Community-influence | 4 | 7.7% |\n| \u2014 Fix-incorrect-licenses | 1 | 1.9% |\n| \u2014 Improve-clarity | 1 | 1.9% |\n| \u2014 Missing-license (License Adoption) | 4 | 7.7% |\n| \u2014 Moral-Ethical-belief | 3 | 5.8% |\n| \u2014 More-permissive-license | 1 | 1.9% |\n| \u2014 New-license-version | 2 | 3.8% |\n| \u2014 Personal-Preference/Project-specific | 1 | 1.9% |\n| \u2014 Private-to-public-project | 1 | 1.9% |\n| \u2014 Promote-Reuse | 1 | 1.9% |\n| \u2014 Unclear | 1 | 1.9% |\n| \u2014 None | 3 | 5.8% |\n\nTABLE IV: Survey questions and responses on licensing problems and expected forge support (Q5-Q7).\n\n| Question/Answer | #D | % |\n|-----------------|----|---|\n| Q5. What problems (if any) have you experienced due to license selection in terms of code reuse? | 52 | |\n| My license was not compatible with desired dependencies | 5 | 9.6% |\n| Others were unable to use my project unless I re-licensed it | 9 | 17.3% |\n| A dependency changed licenses and was no longer compatible | 1 | 1.9% |\n| There was a misunderstanding of compatibility between licensing terms of two licenses | 3 | 5.8% |\n| Choosing the correct license was difficult/confusing | 7 | 13.5% |\n| Other (please specify) | 27 | 52.9% |\n| \u2014 Code-unavailability | 1 | 1.9% |\n| \u2014 Lack-of-understanding-by-Users | 2 | 3.8% |\n| \u2014 Unique-New-License | 1 | 1.9% |\n| \u2014 No problems | 23 | 44.2% |\n| Q6. Did GitHub's mechanism for licensing impact your decision on licensing your project? | 52 | |\n| Yes, it caused me to license my project | 3 | 5.8% |\n| No, I already planned on licensing | 27 | 51.9% |\n| No, I did not want to license at project creation | 1 | 1.9% |\n| Such a mechanism was not yet available when I created my project | 19 | 36.5% |\n| Other (please specify) | 2 | 3.8% |\n| \u2014 No impact | 2 | 3.8% |\n| Q7. What kind of support would you expect from the forge/GitHub to help you managing licenses and licensing compatibility issues in your software? | 11 | |\n| None | 10 | 90.9% |\n| License Checker and License Selection Wizard | 1 | 9.1% |\n\nWhile only one developer wanted support from the forge, this single developer's comments seem to address many of the problems and difficulties with respect to licensing for which we found evidence in Q6 of the survey.\n\n**Summary for RQ3 (Survey Results):** although 44.2% of the developers surveyed indicated that they have not experienced problems with licensing, the remaining respondents provided a diverse set of answers. These primarily related to license incompatibility or difficulty understanding licensing. Lastly, the survey indicated that GitHub's mechanism to encourage or aid in licensing was either not necessary or not available to the surveyed developers. We also found that most developers did not expect support from the forge, although one did indicate the desire for a third-party tool. However, one developer did express interest in support from the forge, and that developer's comments aligned with our results regarding the problems that developers actually faced.\n\n**V. LESSONS AND IMPLICATIONS**\n\n**Intrinsic beliefs of the developers.** The first important observation is that the participants have a bias toward FOSS licensing from an ethical perspective. 52% of the respondents indicated (Q6) that they planned on licensing the project prior to creation; only 6% of the respondents (Q6) were influenced to license their project by GitHub's licensing feature (i.e., a combo list with license names). Similarly, the \u201cOther\u201d responses regarding the reason for a project's initial licensing (Q3) indicated a sense of obligation. For example, one developer said: \u201cIt was the only moral and ethical choice\u201d.\n\n**Delayed licensing.** Developers do not necessarily decide to open source a project from the beginning, and may delay doing so.
While we empirically observed early license adoption in general, one developer wrote in an email that they waited to choose a license: \u201cthis project just didn\u2019t have a license on day 1 and it was added at first release.\u201d Similarly, one developer responded to the survey that licensing changed due to \u201cchange private to public project\u201d. This observation suggests that licensing is still important to these developers, but it may not be considered relevant until the project reaches a certain level of maturity. Thus, there is a need for tools to add and verify the licensing information of a system at any given point in time.\n\n**Community and organizational influence.** Our results indicate that communities, and in particular FOSS foundations (such as the Apache Software, Eclipse, and Free Software foundations), exert a powerful influence on the choice of a license by developers. About 31% of the participants responded that the initial licensing is done by following the community's specific licensing guidelines. Improving or developing on top of existing software from a foundation mostly requires using the same license, in line with the foundation's philosophy.\n\n**License misunderstanding.** The survey stresses the need for aid in explaining licenses and the implications of their use. About 20% of the respondents highlighted that licensing is confusing and/or hard to understand (Q5): 13.5% of respondents indicated that developers\u2014both the authors and the users\u2014find licensing confusing or difficult (Q5), and 6% of developers also noted that there were misunderstandings about license compatibility. Additionally, one \u201cOther\u201d respondent stated, \u201cUsers do not read/understand the license, even though it is a most simple one,\u201d which suggests that misunderstandings occur both among the developers themselves and among their users.\n\n**Reuse for commercial distribution.** The results regarding licensing changes indicated that commercial usage of code is a concern in the open source community. We found that practitioners used permissive licenses to facilitate commercial distribution; in some cases, they changed to a more permissive license for this purpose.\n\n**Dependency influence.** A software system must choose its dependencies so as to avoid conflicts due to incompatibilities between the system's license(s) and the depending components' license(s). Similarly, others will choose to use a particular software system based on its license. Thus, the change of a license in a system has the potential of creating a chain reaction: those that use it might need to change their license, or drop it as a dependency; for the system changing license, the potential pool of reusable components will change accordingly\u2014it might need to drop a different dependency, or it might be able to add a dependency with a previously incompatible license.
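The chain reaction described above can be illustrated with a toy dependency check. The sketch below is purely illustrative: the compatibility table is a made-up, drastically simplified stand-in (real compatibility depends on license version, linking, and distribution, as discussed in [15]), and the names are hypothetical rather than part of any existing tool.

```python
# Toy illustration of propagating license constraints through dependencies.
# The compatibility map is deliberately oversimplified and is NOT legal advice.
COMPATIBLE_DEPS = {
    # project license -> dependency licenses it can (roughly) include
    "GPL-3.0": {"GPL-3.0", "LGPL-3.0", "Apache-2.0", "MIT"},
    "Apache-2.0": {"Apache-2.0", "MIT"},
    "MIT": {"MIT"},
}

def conflicting_dependencies(project_license, dependency_licenses):
    """Return the dependencies whose licenses fall outside the (toy)
    compatibility set of the project's declared license."""
    allowed = COMPATIBLE_DEPS.get(project_license, set())
    return {dep: lic for dep, lic in dependency_licenses.items()
            if lic not in allowed}

# Example: relicensing a project from GPL-3.0 to MIT suddenly turns an
# existing GPL-licensed dependency into a conflict (the "chain reaction").
deps = {"libfoo": "GPL-3.0", "libbar": "MIT"}
print(conflicting_dependencies("GPL-3.0", deps))  # {}
print(conflicting_dependencies("MIT", deps))      # {'libfoo': 'GPL-3.0'}
```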
**Forge's support.** Most of our respondents do not expect any licensing support from the forge. It is likely that the individuals who benefit the most from licensing support in the forge are those who are looking to reuse software. This is supported by our results, which indicate that the license(s) of dependencies is an important consideration, since it might impact the ability to reuse the dependency or require a change of the license(s) of the software that uses it. Thus, compliance-oriented features may aid developers in ensuring that they can legally reuse software.\n\nFinally, our results demonstrate that external factors like community, license prevalence, and the licenses of dependencies have an important impact on licensing.\n\nA feature provided by the forge to suggest licenses based on a project's domain could benefit practitioners. Since developers indicated that licensing is difficult, a more informative feature could help practitioners determine the appropriate licensing. For instance, the current licensing support feature provided by GitHub is not particularly informative for developers. Basically, it provides a link to choosealicense.com, but does not provide further guidance to the developer. Also, it does not cover issues related to compatibility at all. Moreover, applications within the same domain may use some of the same dependencies or require similar grants for redistribution and reuse. To better support developers, a forge could include a domain analysis feature to detect similar applications [22] and suggest to the developer/maintainer the license used by similar systems (if no other criteria have been considered, such as community or dependencies).\n\nVI. THREATS TO VALIDITY\n\nThreats to construct validity relate to the relationship between theory and observation, and are mainly due to possible imprecision in extracting licensing information and in the results of the developer survey. In order to identify the licenses, we relied on Ninka [16], which has been empirically evaluated and shown to achieve a precision of 95% when it is able to identify the license (85% of the time in the same study). In order to classify the free responses, we conducted a formal Grounded Theory analysis with two-author agreement. In particular, all of the responses were read and categorized by three authors, and the agreement of two of them was considered necessary. Another threat concerns the fact that, possibly, GitHub could have mirrored only a fraction of the projects' change history; hence, it is possible that the first commits in GitHub may not correspond to the first commits in the projects' history. Finally, the response rate of our study is 5.75%, below the response rate often achieved in survey studies [18], i.e., 10%. However, explicitly targeting original developers is usually challenging because many of them may no longer be active, their e-mail addresses may be invalid, or they may no longer use the e-mail addresses we collected.\n\nThreats to internal validity relate to internal, confounding factors that could bias the results of our study. In analyzing both license introductions and licensing changes, we considered the commit in which we observed the phenomenon as a single instance, to ensure we did not introduce duplicates. We only excluded developers of projects from the Android framework, since the project has always been Apache licensed; therefore, we did not introduce a bias while selecting developers. To address the lack of coverage of our original options in our survey, we added a free-form \u201cOther\u201d option to each question. In addition, we only presented the full survey to developers who indicated that they were involved in the licensing decision(s). Another possible threat to internal validity concerns the fact that, possibly, the 138 respondents decided to participate in the survey because they had a greater interest in licensing problems than others.
However, results shown in Section IV suggest that this is not the case; e.g., respondents include people who are directly involved in licensing but did not necessarily experience any licensing problems.\n\nThreats to external validity relate to the ability to generalize the results of the study, and we do not assert that these observations are representative of the FOSS community as a whole. While we randomly sampled the projects from GitHub, we only did so for Java projects. Thus, other languages and forges may exhibit different behavior, and the developers of those projects may hold different beliefs. However, GitHub is the most popular forge, with a large number of public repositories. A larger evaluation on multiple forges and on projects in other languages is necessary to understand when licenses are adopted and changed in the general case. Additionally, we surveyed the actual developers of these projects. While we do not claim that the rationale is complete, the conclusions represent explicit feedback as opposed to inferred understanding; therefore, the rationale we report is a definitive, if partial, subset. We do not claim that these results apply in the context of closed source systems, since we required source code to identify licensing.\n\nFinally, to limit this threat to external validity, we examined the diversity of our data set using the metrics proposed by Nagappan et al. [23]. To understand the diversity, we matched the projects in our dataset against the projects mined by Boa [11], finding 1,556 project names that matched between the two datasets. We used these 1,556 projects to calculate our diversity score across six dimensions. The results were 0.45 for programming language, 0.99 for developers, 1.00 for project age, 0.99 for number of committers, 0.96 for number of revisions, and 0.99 for number of programming languages, suggesting that our dataset is diverse except for the programming language score (which is impacted by our selection of Java projects). Overall, our score was 0.35, which suggests that we cover over a third of FOSS projects with 9.5% of our dataset.\n\nVII. Conclusions\n\nWe investigated when and why developers adopt and change licenses during the evolution of FOSS Java projects on GitHub. To this aim, we conducted a survey with developers who contributed changes to projects that included licensing changes. We observed that developers typically adopt a license within the first few commits, suggesting that developers consider licensing an important task. Similarly, we observed that most licensing changes appear after a non-negligible period of development, as visible from the observed history. We then explored the reasons for the initial licensing, license changes, and problems experienced by developers with respect to software licensing. We observed that developers view licensing as an important yet non-trivial concern for their projects. License implications and compatibility are not always clear, and this lack of clarity can lead to changes. Additionally, there are external factors influencing the projects\u2019 licensing, such as community, purpose of usage (i.e., commercial systems), and use of third-party libraries. While developers did not strongly indicate an expectation of licensing support from the forge, it is evident that third-party tools or features within the forge would help developers deal with licensing decisions and changes.\n\nAcknowledgements\n\nWe would like to thank all the open source developers who took time to participate in our survey. 
Specifically, we would like to acknowledge developers who provided in-depths answers and responded to follow-up questions. This work is supported in part by NSF CAREER CCF-1253837 grant. Massimiliano Di Penta is partially supported by the Markos project, funded by the European Commission under Contract Number FP7-317743. Any opinions, findings, and conclusions expressed herein are the authors\u2019 and do not necessarily reflect those of the sponsors.\nREFERENCES\n\n[1] Apache License, Version 2.0 (current) https://www.apache.org/licenses/. Last accessed: 2015/03/23.\n\n[2] GitHub API. https://developer.github.com/v3/. Last accessed: 2015/01/15.\n\n[3] Gradle. https://gradle.org/.\n\n[4] Open Source Definition http://opensource.org/osd.\n\n[5] Qualtrics http://www.qualtrics.com/.\n\n[6] Apache. Apache maven project. https://maven.apache.org/.\n\n[7] G. Bavota, A. Ciemniewska, I. Chulani, A. De Nigro, M. Di Penta, D. Galletti, R. Galoppini, T. F. Gordon, P. Kedziora, I. Lener, F. Torelli, R. Pratola, J. Pukacki, Y. Rebahi, and S. G. Villalonga. The market for open source: An intelligent virtual open source marketplace. In 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering, CSMR-WCRE 2014, Antwerp, Belgium, February 3-6, 2014, pages 399\u2013402, 2014.\n\n[8] J. Corbin and A. Strauss. Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology, 13(1):3\u201321, 1990.\n\n[9] M. Di Penta, D. M. Germ\u00e1n, and G. Antoniol. Identifying licensing of jar archives using a code-search approach. In Proceedings of the 7th International Working Conference on Mining Software Repositories, MSR 2010 (Co-located with ICSE), Cape Town, South Africa, May 2-3, 2010, Proceedings, pages 151\u2013160, 2010.\n\n[10] M. Di Penta, D. M. Germ\u00e1n, Y. Gu\u00e9h\u00e9neuc, and G. Antoniol. An exploratory study of the evolution of software licensing. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE 2010, Cape Town, South Africa, 1-8 May 2010, pages 145\u2013154, 2010.\n\n[11] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen. Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In 35th International Conference on Software Engineering, ICSE \u201913, San Francisco, CA, USA, May 18-26, 2013, pages 422\u2013431, 2013.\n\n[12] D. M. Germ\u00e1n and M. Di Penta. A method for open source license compliance of java applications. IEEE Software, 29(3):58\u201363, 2012.\n\n[13] D. M. Germ\u00e1n, M. Di Penta, and J. Davies. Understanding and auditing the licensing of open source software distributions. In The 18th IEEE International Conference on Program Comprehension, ICPC 2010, Braga, Minho, Portugal, June 30-July 2, 2010, pages 84\u201393, 2010.\n\n[14] D. M. Germ\u00e1n, M. Di Penta, Y. Gu\u00e9h\u00e9neuc, and G. Antoniol. Code siblings: Technical and legal implications of copying code between applications. In Proceedings of the 6th International Working Conference on Mining Software Repositories, MSR 2009 (Co-located with ICSE), Vancouver, BC, Canada, May 16-17, 2009, Proceedings, pages 81\u201390, 2009.\n\n[15] D. M. Germ\u00e1n and A. E. Hassan. License integration patterns: Addressing license mismatches in component-based development. In 31st International Conference on Software Engineering, ICSE 2009, May 16-24, 2009, Vancouver, Canada, Proceedings, pages 188\u2013198, 2009.\n\n[16] D. M. Germ\u00e1n, Y. Manabe, and K. Inoue. 
A sentence-matching method for automatic license identification of source code files. In ASE 2010, 25th IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium, September 20-24, 2010, pages 437\u2013446, 2010.\n\n[17] R. Gobeille. The FOSSology project. In Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR 2008 (Co-located with ICSE), Leipzig, Germany, May 10-11, 2008, Proceedings, pages 47\u201350, 2008.\n\n[18] R. M. Groves. Survey Methodology, 2nd edition. Wiley, 2009.\n\n[19] J. Hartsock. jquery, jquery ui, and dual licensed plugins (dual licensing) [closed] http://stackoverflow.com/questions/2758409/jquery-jquery-ui-and-dual-licensed-plugins-dual-licensing. Last accessed: 2015/02/15.\n\n[20] Y. Manabe, Y. Hayase, and K. Inoue. Evolutional analysis of licenses in FOSS. In Proceedings of the Joint ERCIM Workshop on Software Evolution (EVOL) and International Workshop on Principles of Software Evolution (IWPSE), Antwerp, Belgium, September 20-21, 2010, pages 83\u201387. ACM, 2010.\n\n[21] Y. Manabe, Y. Hayase, and K. Inoue. Evolutional analysis of licenses in FOSS. In Proceedings of the Joint ERCIM Workshop on Software Evolution (EVOL) and International Workshop on Principles of Software Evolution (IWPSE), Antwerp, Belgium, September 20-21, 2010., pages 83\u201387, 2010.\n\n[22] C. McMillan, M. Grechanik, and D. Poshyvanyk. Detecting similar software applications. In Proceedings of the 34th International Conference on Software Engineering, ICSE \u201912, pages 364\u2013374, Piscataway, NJ, USA, 2012. IEEE Press.\n\n[23] M. Nagappan, T. Zimmermann, and C. Bird. Diversity in software engineering research. In Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE\u201913, Saint Petersburg, Russian Federation, August 18-26, 2013, pages 466\u2013476, 2013.\n\n[24] S. Phipps. Github needs to take open source seriously http://www.infoworld.com/d/open-source-software-github-needs-take-open-source-seriously-208046.\n\n[25] P. Singh and C. Phelps. Networks, social influence, and the choice among competing innovations: Insights from open source software licenses. Information Systems Research, 24(3):539\u2013560, 2009.\n\n[26] M. Sojer, O. Alexy, S. Kleinknecht, and J. Henkel. Understanding the drivers of unethical programming behavior: The inappropriate reuse of internet-accessible code. J. of Management Information Systems, 31(3):287\u2013325, 2014.\n\n[27] M. Sojer and J. Henkel. Code reuse in open source software development: Quantitative evidence, drivers, and impediments. Journal of the Association for Information Systems, 11(12):868\u2013901, 2010.\n\n[28] J. T. Confusion about dual license (mit/gpl) javascript for use on my website http://programmers.stackexchange.com/questions/139663/confusion-about-dual-license-mit-gpl-javascript-for-use-on-my-website. Last accessed: 2015/02/15.\n\n[29] T. Tuunanen, J. Koskinen, and T. K\u00e4rkk\u00e4inen. Automated software license analysis. Autom. Softw. Eng., 16(3-4):455\u2013490, 2009.\n\n[30] C. Vendome, M. Linares-V\u00e1squez, G. Bavota, M. Di Penta, D. M. Germ\u00e1n, and D. Poshyvanyk. License usage and changes: A large-scale study of Java projects on GitHub. In The 23rd IEEE International Conference on Program Comprehension, ICPC 2015, Florence, Italy, May 18-19, 2015. IEEE, 2015.\n\n[31] Y. Wu, Y. Manabe, T. Kanda, D. M. Germ\u00e1n, and K. Inoue. 
A method to detect license inconsistencies in large-scale open source projects. In The 12th Working Conference on Mining Software Repositories MSR 2015, Florence, Italy, May 16-17, 2015. IEEE, 2015.", "source": "olmocr", "added": "2025-06-23", "created": "2025-06-23", "metadata": {"Source-File": "/home/nws8519/git/adaptation-slr/studies_pdfs/026_vendome.pdf", "olmocr-version": "0.1.76", "pdf-total-pages": 10, "total-input-tokens": 33641, "total-output-tokens": 14048, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 5892, 1], [5892, 12403, 2], [12403, 18367, 3], [18367, 24087, 4], [24087, 29998, 5], [29998, 36544, 6], [36544, 41443, 7], [41443, 47875, 8], [47875, 54171, 9], [54171, 60723, 10]]}}
{"id": "eb990205818de64d2f99bc05a4dcd368447a8313", "text": "License usage and changes: a large-scale study on GitHub\n\nChristopher Vendome1 \u00b7 Gabriele Bavota2 \u00b7 Massimiliano Di Penta3 \u00b7 Mario Linares-V\u00e1squez1 \u00b7 Daniel German4 \u00b7 Denys Poshyvanyk1\n\nPublished online: 6 June 2016\n\u00a9 Springer Science+Business Media New York 2016\n\nAbstract Open source software licenses determine, from a legal point of view, under which conditions software can be integrated and redistributed. The reason why developers of a project adopt (or change) a license may depend on various factors, e.g., the need for ensuring compatibility with certain third-party components, the perspective towards redistribution or commercialization of the software, or the need for protecting against somebody else\u2019s commercial usage of the software. This paper reports a large empirical study aimed at quantitatively and qualitatively investigating when and why developers adopt or change software licenses. Specifically, we first identify license changes in 1,731,828 commits, representing the entire history of 16,221 Java projects hosted on GitHub. Then, to understand the rationale of license changes, we perform a qualitative analysis on 1,160 projects written in seven different programming languages, namely C, C++, C#, Java, Javascript, Python, and Ruby\u2014following an open coding approach inspired by grounded theory\u2014on commit messages and issue tracker discussions concerning licensing topics, and whenever possible, try to build traceability links between discussions and changes. On one hand, our results highlight how, in different contexts, license adoption or changes can be triggered by various reasons. On the other hand, the results also highlight a lack of traceability of when and why licensing changes are made. This can be a major concern, because a change in the license of a system can negatively impact those that reuse it. In conclusion, results of the study trigger\n\nCommunicated by: Lin Tan\n\nChristopher Vendome\ncgvendome@email.wm.edu\n\n1 The College of William and Mary, Williamsburg, VA, USA\n2 Free University of Bozen-Bolzano, Bozen-Bolzano, Italy\n3 University of Sannio, Benevento, Italy\n4 University of Victoria, British Columbia, Canada\nthe need for better tool support in guiding developers in choosing/changing licenses and in keeping track of the rationale of license changes.\n\n**Keywords** Software licenses \u00b7 Mining software repositories \u00b7 Empirical studies\n\n## 1 Introduction\n\nIn recent and past years, the diffusion of Free and Open Source Software (FOSS) projects is increasing significantly, along with the availability of forges hosting such projects (e.g., SourceForge\\(^1\\) or GitHub\\(^2\\)) and foundations supporting and promoting the development and diffusion of FOSS (e.g., the Apache Software Foundation,\\(^3\\) the GNU Software Foundation,\\(^4\\) or the Eclipse Software Foundation\\(^5\\)). The availability of FOSS projects is a precious resource for developers, who can reuse existing assets, extend/evolve them, and in this way create new work productively and reduce costs. For example, a blog post by IBM\\(^6\\) outlines the reasons pushing companies to reuse open source code: \u201cYes, this [the cost factor] is one of the most important factors that attract not only the small companies or start-up\u2019s but also the big corporations these days\u201d. 
This can happen not only in the context of open source projects, but it is more and more frequent in commercial projects. In a survey conducted by Black Duck,\\(^7\\) it was found that 78 % of the companies use open source code (double the share from 2010), 93 % claimed an increase in open source reuse, 64 % contribute to open source development, and over 55 % indicated a lack of formal guidance when utilizing open source code. The findings by Black Duck demonstrate two key implications: i) commercial reuse of open source code has been increasing, and ii) in general, there is a lack of oversight in how this reuse occurs.\n\nNevertheless, whoever is interested in integrating FOSS code in their software project (and redistributing it along with the project itself), or modifying existing FOSS projects to create a new work\u2014referred to as \u201cderivative work\u201d\u2014must be aware that such activities are regulated by *software licenses* and in particular by the specific FOSS license of the project being reused. In order to license software projects, developers either add a *licensing statement* to source code files (as a comment at the beginning of each file) and/or include a textual file containing the license statement in the project source code root directory or in its sub-directories.\n\nGenerally speaking, FOSS licenses can be classified into *restrictive* (also referred to as \u201ccopyleft\u201d or \u201creciprocal\u201d) and *permissive* licenses. A restrictive license requires developers to use the same license to distribute new software that incorporates software licensed under such a restrictive license (i.e., the redistribution of the derivative work must be licensed under the same terms); meanwhile, permissive licenses allow re-distributors to incorporate the reused software under a different license (Singh and Phelps 2009, Free Software Foundation 2015). The **GPL** (in all of its versions) is a classic example of a restrictive license. In Section 5 of the **GPL-3.0**, the license addresses code modification stating that \u201cYou must license the entire work, as a whole, under this License to anyone who comes into possession of a copy\u201d (http://www.gnu.org/licenses/gpl.html). The **BSD** licenses are examples of permissive licenses. For instance, the **BSD 2-Clause** license has two clauses that detail the use, redistribution, and modification of licensed code: (i) the source must contain the copyright notice and (ii) the binary must reproduce the copyright notice and contain the disclaimer in the documentation (http://opensource.org/licenses/BSD-2-Clause).\n\n---\n\n1. [http://sourceforge.net](http://sourceforge.net)\n2. [https://github.com](https://github.com)\n3. [https://www.apache.org](https://www.apache.org)\n4. [http://www.gnu.org](http://www.gnu.org)\n5. [http://www.eclipse.org/](http://www.eclipse.org/)\n6. [https://www.ibm.com/developerworks/community/blogs/6e6f6d1b-95c3-46df-8a26-b7efd8ee4b57/entry/why_big_companies_are_embracing_open_source119?lang=en](https://www.ibm.com/developerworks/community/blogs/6e6f6d1b-95c3-46df-8a26-b7efd8ee4b57/entry/why_big_companies_are_embracing_open_source119?lang=en)\n7. [https://www.blackducksoftware.com/future-of-open-source](https://www.blackducksoftware.com/future-of-open-source)\n\nWhen developers (or organizations) decide to make a project available as open source, they can license their code under one or many different existing licenses. 
The choice may be dictated by the set of dependencies that the project has (e.g., what libraries it uses), since those dependencies might impose specific licensing constraints on those that reuse them. For instance, if a project links (statically) some **GPL** code, then it must be released under the same **GPL** version; failing to fulfill such a constraint could create a potential legal risk. Also, as shown by Di Penta et al. (2010), the choice of the licenses in a FOSS project may have a massive impact on its success, as well as on the projects using it. For example\u2014as happened for the IPFilter project (http://www.openbsd.org/faq/pf)\u2014a highly restrictive license may prevent others from redistributing the project (in the case of IPFilter, this caused its exclusion from the OpenBSD distributions). An opposite case is that of the MySQL connect drivers, originally released under **GPL-2.0**, whose license was modified with an exception (Oracle http://www.mysql.com/about/legal/licensing/foss-exception/) to allow the driver\u2019s inclusion in other software released under some open source licenses, which would otherwise be incompatible with the **GPL** (e.g., the original **Apache** license). In summary, the choice of the license\u2014or even a decision to change an existing license\u2014is a crucial crossroads in the software evolution of every FOSS project.\n\nIn order to encourage developers to think about licensing issues early in the development process, some forges (e.g., GitHub) have introduced mechanisms such as the possibility of picking the project license at the time the repository is created. Also, there are some Web sites (e.g., http://choosealicense.com) helping developers to choose a license. Furthermore, there are numerous research efforts aimed at supporting developers in classifying source code licenses (Gobeille 2008; Germ\u00e1n et al. 2010b) and identifying licensing incompatibilities (Germ\u00e1n et al. 2010a). Even initiatives such as the Software Package Data Exchange (SPDX) (http://spdx.org) have been aimed at proposing a formal model to document the license of a system. However, despite the effort put in by the FOSS community, researchers, and independent companies, it turns out that developers usually do not have a clear idea of the exact consequences of licensing (or not) their code under a specific license, or they are unsure about specific scenarios, for example, how to re-distribute code released under a dual license, among other issues (Vendome et al. 2015b).\n\n**Paper Contributions** This paper reports the results of a large empirical study aimed at quantitatively and qualitatively investigating when and why licenses change in open source projects, and to what extent it is possible to establish traceability links between licensing-related discussions and changes. First, we perform a quantitative analysis on 16,221 Java projects hosted on GitHub. To conduct this study, we first mined the entire change history of the projects, extracting the license name (e.g., **GPL** or **Apache**) and version (e.g., v1, v2), when applicable, from each of the 4,665,611 files involved in a total of 1,731,828 commits. Starting from this data, we provide quantitative evidence on (i) the diffusion of licenses in FOSS systems, (ii) the most common license-change patterns, and (iii) the traceability of license changes to both commit messages and issue tracker discussions. 
After that, following an open coding approach inspired by grounded theory (Corbin and Strauss 1990), we qualitatively analyze a sample of commit messages and issue tracker discussions likely related to license changes. Such a qualitative analysis has been performed on 1,160 projects written in seven different languages: 159 C, 91 C++, 78 C#, 324 Java, 166 Javascript, 147 Python, and 195 Ruby projects. The results of this analysis provide a rationale on why developers adopt specific license(s), both for initial licensing and for licensing changes.\n\nThe study reported in this paper poses its basis on previous work aimed at exploring license incompatibilities (Germ\u00e1n et al. 2010a), license changes (Di Penta et al. 2010), license evolution (Manabe et al. 2010), and integration patterns (Germ\u00e1n and Hassan 2009). Building upon previous work on licensing analysis, this paper:\n\n1. Constitutes, to the best of the authors\u2019 knowledge, the largest study aimed at analyzing the change patterns in licensing of software systems (earlier work was limited to the analysis of up to six projects Manabe et al. 2010; Di Penta et al. 2010).\n2. To the best of our knowledge, it is the first work aimed at explaining the rationale of license changes by means of a qualitative analysis of commit notes and issue tracker discussions.\n\nThe achieved results suggest that determining the appropriate license of a software project is far from trivial and that a community\u2019s usage and expectations can influence developers when picking a license. We also observe that licensing expectations may be different based on the programming language. Although choosing a license is considered important for developers, even from early releases of their projects, forges and third party-tools provide little or no support to developers when performing licensing-related tasks, e.g., picking a license, declaring the license of a project, changing license from a restrictive one towards a more permissive one (or vice versa) and, importantly, keeping track of the rationale for license changes. For example, during the creation of a new repository, GitHub allows the user to select an initial license from a list of commonly used ones, but offers no guidance on the implications of such a choice, and simply redirects the user to http://choosealicense.com/; aside from this, GitHub offers no support for licensing management. Also, there is a lack of consistency and standardization in the mechanism that should be used for declaring a license (e.g., putting it in source code heading comments, separate license files, README files, etc.). Moreover, the legal nature of the licenses exacerbate this problem since the implications and grants or restrictions are not always clear for developers when the license is present. Last, but not least, the currently available Software Configuration Management (SCM) technology provides no support to trace licensing-related discussions and decisions onto actual changes, whereas such traceability links can be useful to understand the impact of such decisions.\n\n**Paper Structure** The paper is organized as follows. Section 2 relates this work to the existing literature on licensing analysis. Section 3 describes the study design and details the data analysis procedure. Results are reported and discussed in Section 4. Lessons learned from the study results are summarized in Section 5, while Section 6 discusses the threats to the study\u2019s validity. 
Finally, Section 7 concludes the paper and outlines directions for future work.\n2 Related Work\n\nOur work is mainly related to (i) techniques and tools for automatically identifying and classifying licenses in software artifacts, and (ii) empirical studies focusing on different aspects of license adoption and evolution.\n\n2.1 Identifying and Classifying Software Licenses\n\nThe problem of license identification has firstly been tackled in the FOSSology project (Gobeille 2008) aimed at building a repository storing FOSS projects and their licensing information and using a machine learning approach to classify licenses. Tuunanen et al. (2009) proposed ASLA, a tool aimed at identifying licenses in FOSS systems; the tool has been shown to determine licenses in files with 89% accuracy.\n\nGerm\u00e1n et al. (2010b) proposed Ninka, a tool that uses a pattern-matching based approach for identifying statements that characterize various licenses. Given any text file as an input, Ninka outputs the license name and version. In the evaluation reported by the authors, Ninka achieved a precision $\\sim 95\\%$ while detecting licenses. Ninka is currently considered the state-of-the-art tool in the automatic identification of software licenses.\n\nWhile the typical license classification problem arises when source code is available, in some cases, source code is not available\u2014i.e., only byte code or binaries are available\u2014and the goal is to identify whether the byte code has been produced from source code under a certain license. To this aim, Di Penta et al. (2010) combined code search and textual analysis to automatically determine a license under which jar files were released. Their approach automatically infers the license from decompiled code by relying on the Google Code search engine. Note that, differently from the previous techniques, the approach in Di Penta et al. (2010) is only able to identify the license family (e.g., GPL) without specifying the version (e.g., 2.0).\n\n2.2 Empirical Studies on Licenses Adoption and Evolution\n\nDi Penta et al. (2010) investigated\u2014on six open source projects written in C, C++ and Java\u2014the migration of licenses over the course of a project\u2019s lifetime. The study suggests that licenses changed version and type during software evolution, but there was no generic patterns generalizable to the six analyzed FOSS projects. Also, Manabe et al. (2010) analyzed the changes in licenses but of FreeBSD, OpenBSD, Eclipse, and ArgoUML, finding that each project had different evolution patterns.\n\nGerm\u00e1n and Hassan (2009) analyzed 124 open source packages exploited by several applications to understand how developers deal with license incompatibilities. Based on this analysis, they built a model outlining when specific licenses are applicable and what are their advantages and disadvantages. Later, Germ\u00e1n et al. (2010a) presented an empirical study focused on the binary packages of the Fedora-12 Linux distribution aimed at (i) understanding if licenses declared in the packages were consistent with those present in the source code files, and (ii) detecting licensing issues derived by dependencies between packages; they were able to find some licensing issues confirmed by Fedora.\n\nGerm\u00e1n et al. (2009) analyzed the presence of cloned code fragments between the Linux Kernel and two distributions of BSD, i.e., OpenBSD and FreeBSD. The aim was to verify whether the cloning was performed in accordance to the terms of the licenses. 
Results\nshow that, in most cases, these code-migrations were admitted since they went from less restrictive licenses towards more restrictive ones.\n\nWu et al. (2015) investigated license inconsistencies between cloned files. They performed an empirical study on Debian 7.5 to demonstrate the ways in which licensing can become inconsistent between the file clones (e.g., the removal of a license in one of the clone pairs).\n\nIn our previous work (Vendome et al. 2015a), we focused our analysis only on Java projects. In this work, we expand our analysis to include six new languages\u2014C, C++, C#, Javascript, Python, and Ruby. Also, our new grounded theory analysis features a categorization of commit messages and issue discussions into seven categories, in turn further detailed in a total of 27 sub-categories. In addition to extracting new support and rationale, we also defined new sub-categories and subsequently distilled lessons from this new data. For example, we observed that asserting a license is not standardized or consistent across languages, and it would benefit developers to have a consistent means of documenting and presenting the license of a system within a forge.\n\nVendome et al. (2015b) conducted a survey with developers that contributed to projects that had experienced changes in licensing to understand the rationale for adopting and changing licensing. The survey results indicated that facilitating commercial reuse is a common reason for license changes. Also the survey highlighted that, in general, developers have a lack of understanding of the legal implications of open source licenses, highlighting the need for recommenders aimed at supporting them in choosing and changing licenses.\n\nWhile we share similar goals with prior related work\u2014understanding insights into license usage and migration\u2014our analysis is done on a much larger scale, including a (i) quantitative analysis on 16,221 Java projects, and (ii) a qualitative analysis upon a sample of commit messages and issue tracker discussions from 1,160 projects written in seven different programming languages. The latter allowed us to perform in-depth analysis of the rationale behind license usages and migrations.\n\n3 Design of the Empirical Study\n\nThe goal of our study is to investigate license adoption and evolution in FOSS projects, with the purpose of understanding the overall rationale behind picking a particular license or changing licenses and of determining the underlying license change patterns. The perspective is of researchers interested in understanding what are the main factors leading towards specific license adoption and change. The context consists of (i) the change history of 16,221 Java open source projects mined from GitHub, which will be used to quantitatively investigate the goals of the study, and (ii) commit messages and issue tracker discussions from 1,160 projects written in seven different programming languages (i.e., C, C++, C#, Java, JavaScript, Python, and Ruby), which will be exploited for qualitative analysis.\n\n3.1 Research Questions\n\nWe aim at answering the following research questions:\n\n1. **RQ1** What is the usage of different licenses by projects in GitHub? This research question examines the proportions of different types of licenses that are introduced by FOSS projects hosted in GitHub. 
In doing this, we should consider that GitHub is a relatively young forge (launched in April 2008), which has seen exponential growth in the number\nof projects over the past few years, and that most of the projects it hosts are young in terms of the first available commit or the date that the repository was created.\n\n2. **RQ\u2082** What are the most common licensing change patterns? Our second research question investigates the popular licensing change patterns in the GitHub Open Source community with the aim of driving out\u2014from a qualitative point of view\u2014the rationale behind such change patterns (e.g., satisfying dependency constraints).\n\n3. **RQ\u2083** To what extent are licensing changes documented in commit messages or issue tracker discussions? This research question investigates on whether licensing changes in a system can be traced to commit messages or issues\u2019 discussions.\n\n4. **RQ\u2084** What rationale do these sources contain for the licensing changes? This research question investigates the rationale behind the particular change in license(s) from a developer\u2019s perspective.\n\nWe address our four research questions by looking at the licensing phenomenon from two different points of view, namely (i) a **quantitative** analysis of the licenses under which projects were released, their changes across their evolution history, and the ability to match these changes to either commit messages or issue tracker discussions; and (ii) a **qualitative** analysis of licensing-related discussions made by developers over the issue trackers and of the way in which developers documented licensing changes through commit messages.\n\nFor the quantitative analysis of licensing changes, we are interested in analyzing license migration patterns that fall in the following three categories:\n\n- **No license \u2192 some License(s) \u2013 N2L.** This reflects the case in which developers realized the need for a license and added a licensing statement to files;\n- **some License(s) \u2192 No license \u2013 L2N.** In this case, for various reasons, licensing statements have been removed from source code files; for example, because a developer accidentally added a wrong license/license version;\n- **some License(s) \u2192 some other License(s) \u2013 L2L.** This is the most general case of a change in licensing between distinct licenses.\n\nTo address **RQ\u2081**, **RQ\u2082**, and **RQ\u2083**, we perform a quantitative analysis by mining the version history of 16,221 Java projects, while to address **RQ\u2084** we perform a qualitative analysis on the commit messages and issue tracker discussion of the 1,160 projects written in seven different programming languages. In the following subsections, we describe the two kinds of analysis in detail.\n\n### 3.2 Quantitative Analysis\n\nIn order to generate the dataset to be used in the study, we mined the version history of 16,221 Java projects publicly available on GitHub. GitHub hosts over twelve million Git repositories covering many popular programming languages, and provides a public API (https://developer.github.com/v3/) that can be used to query and mine project information. 
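As a rough illustration of this mining step (a minimal sketch under our own assumptions, not the script used in the study), a list of candidate Java repositories could be gathered through the public API along the following lines; the `MY_TOKEN` placeholder and the helper name are ours:

```python
import requests

# Hypothetical personal access token; unauthenticated calls are heavily rate-limited.
HEADERS = {"Authorization": "token MY_TOKEN"}

def list_java_repositories(max_repos=1000):
    """Walk GitHub's public repository listing and keep Java projects.

    Illustrative only: the study enumerated over twelve million repositories
    before filtering down to 381,161 Java projects.
    """
    java_repos, since = [], 0
    while len(java_repos) < max_repos:
        resp = requests.get("https://api.github.com/repositories",
                            params={"since": since}, headers=HEADERS)
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break
        for repo in page:
            # The summary listing omits the language, so fetch the full record.
            details = requests.get(repo["url"], headers=HEADERS).json()
            if details.get("language") == "Java":
                java_repos.append(details["clone_url"])
        since = page[-1]["id"]
    return java_repos
```

Each returned clone URL can then be passed to `git clone` for the local, full-history analysis described next.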
Also, the Git version control system allows for local cloning of the entire repository, which facilitates the comprehensive analysis of the project change-history and thus of the license changes happened in each commit.\n\nTo extract data for our quantitative analysis, we first identified a comprehensive list of projects hosted on GitHub by implementing a script exploiting GitHub\u2019s APIs. The computation of the comprehensive list resulted in over twelve million projects. Since the infrastructure we use for license extraction only supports Java systems (as it will be explained later), we filtered out all systems that were not written in Java, obtaining a list\nof 381,161 Java projects hosted on GitHub. We cloned all 381,161 git repositories locally for a total of 6.3 Terabytes of storage space. In our analysis, we randomly sampled 16,221 projects due to the computation time of the aforementioned infrastructure.\n\nOnce the Git repositories had been cloned, we used a code analyzer developed in the context of the MARKOS European project (Bavota et al. 2014) to extract license information at commit-level granularity. The MARKOS code analyzer uses the Ninka license classifier (Germ\u00e1n et al. 2010b) to identify and classify licenses contained in all the files hosted under the version control system of each project. For each of the 16,221 projects in our study, the MARKOS code analyzer mined the change log, producing the following information for each commit:\n\n1. **Commit Id:** The identifier of the commit that is currently checked out from the Git repository and analyzed;\n2. **Date:** The timestamp associated with the commit;\n3. **Author:** The person responsible for the commit;\n4. **Commit Message:** The message attached to the commit;\n5. **File:** The path of the files committed;\n6. **Change to File:** A field to indicate whether each file involved in the commit was Added, Deleted, or Modified;\n7. **License Changed:** A boolean value indicating whether the particular file has experienced a change in license in this commit with respect to its previous version. This feature applies to modified files only. In the case of an addition or deletion of a file, this field is set to false;\n8. **License:** The name and version (e.g., GPL-2.0) of each license applied to the file.\n\nThe computation of such information for all 16,221 projects took almost 40 days, and resulted in the analysis of a total of 1,731,828 developers\u2019 commits involving 4,665,611 files. Note that for the BSD and CMU licenses Ninka was not able to correctly identify its variants (reporting it as BSD var and CMU var). Additionally, the GPL and the LGPL may contain a \u201c+\u201d after the version number (e.g., 3.0+), which represents a clause in the license granting the ability to use future versions of the license (i.e., the GPL-2.0+ would allow for utilization under the terms of the GPL-3.0). Also, we have values of \u201cno license\u201d and \u201cunknown\u201d, which represents the case that no license was attached to the file or Ninka was unable to determine the license.\n\nTo determine whether there is a trend in the proportions of adopted licenses over the observed years, we used the Augmented Dickey-Fuller (ADF) test (Dickey and Fuller 1979, 1981). This test is widely used to test stationarity of time series. 
The test can be used to reject two different null hypotheses $H_0$: the time series is not significantly stationary or $H_0$: the time series is not significantly explosive; the latter can be used to determine whether there is a significantly increasing trend in the time series. In our statistical tests, we considered a significance level of 0.05 (i.e., we rejected null hypotheses for $p$-values < 0.05).\n\nWe quantitatively analyzed the collected data by presenting descriptive statistics about the license adoption and the most common atomic license changes that we found. The latter are defined as the commits in which we detected a specific kind of license change within at least one source code or textual file. For example, given a commit with three files experiencing the licensing change No license $\\rightarrow$ Apache-2.0, and 10 files with GPL-2.0 $\\rightarrow$ GPL-3.0, the atomic license changes from that commit are one No License $\\rightarrow$ Apache-2.0 change and one GPL-2.0 $\\rightarrow$ GPL-3.0 change. We prefer not to count the number of changes at file level as it was done in previous work (Di Penta et al. 2010) to avoid inflating our analysis because of large commits and to make comparable commits performed on both small\nand large projects. It is possible that this coarse-grained analysis may fail to capture some license changes, for example due to a change in licensing of a dependency, although also in this case, in principle, the licensing changes should be reflected at project level when appropriate.\n\nIn the end, we identified a total of 1,833 projects with atomic license changes out of our dataset of 16,221 projects. This subset of projects was used to investigate license change traceability. Intuitively, we require the presence of license changes in order to determine how well changes in licensing are documented in either the commit messages or issue tracker discussion. Therefore, we used a web crawler to identify, among these 1,833 projects, those using the GitHub issue tracker, finding a total of 1,586 projects having at least one issue on it. To link the licensing changes to commit messages/issue reports, we performed both string matching and date matching between either the commit messages or the issue tracker discussions and the extracted licensing information (e.g., license name or date that license was committed). We decided to rely on commit messages and issue discussions because (i) these two sources of information are publicly available for the considered subject projects; and (ii) both commit messages and issue discussions are likely to report, with a different level of detail, the rationale behind a specific change implemented (or just considered in the case of issues) by developers, including changes related to software licenses.\n\n3.3 Qualitative Analysis\n\nOur qualitative analysis aims at answering RQ4 and it is based on manual inspection and categorization of commit messages and issue tracker discussions. Since we do not have limitations in terms of the project\u2019s programming language to analyze (unlike the quantitative analysis), we performed our qualitative analysis on commit messages and issue tracker discussions from a set of 1,160 projects written in seven different languages: 159 C, 91 C++, 78 C#, 324 Java, 166 Javascript, 147 Python, and 195 Ruby projects. 
Note that the choice of the languages considered in our study is not random: we focused on seven of the ten most popular programming languages during 2014 and 2015 (Zapponi http://githut.info; Cass http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages).\n\nThe considered projects were instead selected by applying the following procedure. Firstly, from our list of twelve million repositories, we extracted those written in the seven languages of interest. Then, we extracted only the repositories satisfying the following two criteria: (i) they were not forks of the main repository, and (ii) they had at least one star (i.e., at least one user expressed appreciation for the repository) or watcher (i.e., at least one user asked to receive notification about changes made in the repository). These selection criteria were used to exclude from our analysis personal repositories (e.g., the website of a GitHub user) that might have biased our results. However, it is important to note that for Java, we considered the comprehensive list of all 381,161 projects. In our initial investigation of Java projects (Vendome et al. 2015a), we observed the need for refinement that was thus adopted for the additional six languages, because we observed a high proportion of false positive commit messages and issues discussions. Thus, the filtering sought to improve the generated taxonomy.\n\nThen, we extracted the change log of the cloned projects in order to analyze them and identify the commit messages likely related to licensing. In total, 103,128,211 commits were considered. To identify commit messages likely related to license changes, we adopted a case-insensitive keyword-based filtering based on the critical words exploited by Ninka during license identification, and augmented them with license names. The detailed set of keywords being used for this matching is reported in Table 1. In some cases, our\nkeyword-filters included bi-grams composed of the license type and version, since some license types (e.g., apache) produced a very large amount of false positive discussions when they were considered alone (e.g., all the commit message talking about Apache projects).\n\nIn the end, the keyword-based filtering allowed us to identify a total of 746,874 commit messages (742,671 for Java, which amounted to approximately $\\sim 1\\%$ of the overall commits for Java). Given the high number of relevant commits, we sampled 20\\% of the commits found for each language as object of our manual inspection. However, we set a minimum threshold of 100 commits per language, and a maximum threshold of 500. These thresholds were adopted to ensure representativeness for each of the studied language, while keeping the manual analysis effort reasonable. Note that our sampling is statistically significant with a 95\\% confidence interval $\\pm 10\\%$ or better. This resulted in a total of 1,413 commits to be inspected. It is worth noting that for Java projects, in addition to the 500 sampled commit messages matching the keywords in Table 1, we also considered 224 randomly sampled commit messages from the commits of the 1,833 projects in which we identified (in our quantitative analysis) an instance of an atomic license change, because we were interested in investigating the reasons behind such changes. Clearly, this was not possible for the systems written in other programming languages that, as said before, were not part of our quantitative analysis. 
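The following is a minimal sketch of this keyword-based filtering and bounded sampling, under our own assumptions (the keyword list is abbreviated and the helper names are illustrative, not the study's implementation):

```python
import random
import re

# Abbreviated, illustrative keyword set; the study used Ninka's critical words
# augmented with license names and bi-grams such as "apache 2.0".
LICENSE_KEYWORDS = ["license", "licence", "copyright", "gpl", "lgpl",
                    "mit/x11", "bsd", "apache 2.0", "epl"]
PATTERN = re.compile("|".join(re.escape(k) for k in LICENSE_KEYWORDS), re.IGNORECASE)

def filter_commit_messages(messages):
    """Keep commit messages matching at least one licensing-related keyword."""
    return [m for m in messages if PATTERN.search(m)]

def sample_for_inspection(matched, rate=0.20, minimum=100, maximum=500, seed=1):
    """Sample 20% of the matched messages, bounded to [100, 500] per language."""
    size = min(max(int(len(matched) * rate), minimum), maximum)
    size = min(size, len(matched))  # cannot draw more than are available
    random.seed(seed)
    return random.sample(matched, size)
```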
The number of sampled commits by each programming language is reported in the second column of Table 2.\n\n| Language | #of commits | # of issue tracker discussions |\n|----------|-------------|--------------------------------|\n| C | 227 | 30 |\n| C# | 100 | 6 |\n| C++ | 139 | 12 |\n| Python | 130 | 41 |\n| Java | 724 | 273 |\n| JavaScript | 122 | 79 |\n| Ruby | 195 | 45 |\n| Overall | 1,637 | 486 |\nConcerning the issue tracker discussions, we built a Web crawler collecting the information present in all issue trackers of the studied projects. In particular, for each issue, our crawler collected (i) its title and description, (ii) the text of each comment added to it, (iii) and the date the issue was opened and closed (when applicable). Then, in order to find the relevant issues (i.e., those presenting discussions about software licenses), we used a keyword search mechanism aimed at matching, in the issue title, keywords related to licensing (as previously explained for the commit messages). By applying this procedure, we identified a total of 486 issue discussions potentially related to licensing, as shown in the third column of the Table 2.\n\nAfter collecting commit messages and issue discussions, in order to analyze and categorize them, we followed an open coding process inspired by the Grounded Theory (GT) principles formulated by Corbin and Strauss (1990). This analysis of commit messages and issue tracker discussions aimed at finding the rationale for licensing changes; in particular, we aimed at answering the following two sub-questions: What are the reasons pushing developers to associate a particular license to their project? and What causes them to migrate licenses or release their project under a new license (i.e., co-licensing)?\n\nTo perform the open coding, we distributed the commit messages and the issue tracker discussions among the authors such that two authors were randomly assigned to each message (a message can be a commit message or an entire issue tracker discussion). After each round of open coding in which the authors independently created classifications for the messages, we met to discuss the coding identified by each of us, and we refined them into categories. Note that during each round the categories defined in previous rounds were refined accordingly to the new knowledge created from the additional manual inspections and from the authors\u2019 discussions. Overall, the open coding concerned (i) 1,413 randomly selected licensing-related commit messages identified via the keywords-based mechanism; (ii) the 224 commit messages from the Java systems\u2019 commits where a licensing change was observed in our quantitative analysis; and (iii) the 486 issue tracker discussions matching licensing-related keywords. The output of our open coding procedure is a set of categories and group explaining why licenses are adopted and changed. We qualitatively discuss the findings of this analysis in Section 4.4, presenting our categories classification and examples of commit messages and issue tracker discussions belonging to the various categories.\n\n3.4 Dataset Diversity Analysis\n\nTo get an idea of the external validity of our dataset, we measured the diversity metric proposed by Nagappan et al. (2013) for our dataset by matching the list of our mined projects from GitHub to the list of available projects from Boa (Dyer et al. 2013). 
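As a toy illustration of this name-matching step (the project names below are invented and stand in for our GitHub sample and the Boa list):

```python
# Hypothetical project-name sets standing in for the two datasets.
github_projects = {"junit", "commons-lang", "toy-app", "jetty"}
boa_projects = {"junit", "commons-lang", "jetty", "argouml"}

matched = github_projects & boa_projects
coverage = len(matched) / len(github_projects)
print(f"{len(matched)} matched projects ({coverage:.1%} of our sample)")
```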
Given the different datasets exploited in the context of our quantitative and qualitative analysis, we discuss the diversity metrics separately.\n\n3.4.1 Quantitative Analysis\n\nWe were able to match by name 1,556 out of the 16,221 projects exploited in our quantitative analysis against the names of the projects in the diversity metric dataset by Nagappan et al. (2013). This subset was used in the computation of the diversity metric, obtaining a score.\n\n---\n\n8We looked for the target keywords only in the issue titles, because we found that including the issue descriptions in the search generates a considerable number of false positives.\nof 0.35, indicating that around 10% of our dataset covers just over a third of the open source projects according to six dimensions: programming language, developers, project age, number of committers, number of revisions, and number of programming languages. The dimensional scores are 0.45, 0.99, 1.00, 0.99, 0.96, 0.99, respectively, suggesting that our subset covers the relevant dimensions for our analysis. However, the focus on Java projects limits the programming language score, affecting the overall score.\n\nAnother important aspect to evaluate is the representativeness of the licenses present in our dataset with respect to those diffused in the FOSS community. The Open Source Initiative (OSI) specifies a list of 70 approved licenses, indicating the ones reported in the first column of Table 3 as the most commonly used in FOSS software (they do not specify any order). The second column of Table 3 reports the top licenses as extracted from the FLOSSmole\u2019s SourceForge snapshot of December 2009 (Howison et al.), while the third column shows the top licenses as extracted from our sample of GitHub projects exploited for the quantitative analysis.\n\nThe licenses declared by OSI as the most commonly used were also the most commonly found in our dataset (BSD 2 and 3 fall both in the BSD type). In the comparison between our dataset and SourceForge, while the order of diffusion for the different licenses is not exactly the same, six of the top eight licenses in SourceForge are also present in our dataset (all but Public Domain and Academic Free License). This analysis, together with the diversity metric, suggests that the dataset we exploited in our quantitative analysis is representative of Open Source systems.\n\nTable 4 reports the year of the first commit date for each of the 16,221 considered projects. This table clearly shows the exponential growth of GitHub until 2012, confirming what already was observed by people in the GitHub community (Doll http://tinyurl.com/muyxkru). While GitHub also experienced exponential growth in 2013 (https://octoverse.github.com/), our dataset does not mirror this fact. This is due to a design choice we made while randomly choosing the projects to clone. In particular, we cloned projects during January 2014, excluding projects with a commit history less than one year from the set of 381,161 Java projects (i.e., projects with the first commit performed no later than January 2013). This was needed since, in the context of RQ$_2$, we are interested in observing migration patterns occurring over the projects\u2019 history. Thus, projects having a very short commit history were not likely to be relevant for the purpose of this study. Moreover, since in RQ$_1$ we are interested in observing licenses\u2019 usage in the context of the GitHub\u2019s drastic\n\n| OSI popular license (unordered) | SourceForge (Dec. 
2009) | Our Github data set (Quant. Analys.) |\n|--------------------------------|-------------------------|-------------------------------------|\n| Apache-2 Lic | GNU Public Lics | GNU Public Lics |\n| BSD 2-Clause Lic | Lesser GNU Public Lics | Apache Lics |\n| BSD 3-Clause Lic | BSD Lics | Lesser GNU Public Lics |\n| GNU Public Lics | Apache Lics | MIT Lic |\n| Lesser GNU Public Lics | Public Domain | Eclipse Public Lic |\n| MIT Lic | MIT Lic | Comm. Dev. and Dist. Lic |\n| Mozilla Public Lic 2 | Academic Free Lic | Mozilla Public Lic |\n| Comm. Dev. and Dist. Lic | Mozilla Public Lics | BSD Lics |\n| Eclipse Public Lic | | |\nexpansion, we decided to exclude the 60 projects having the first commit in 2013 from our analysis due to the severe lack of representation in our sample despite the continued growth of GitHub.\n\n### 3.4.2 Qualitative Analysis\n\nSimilarly, we were able to match by name 471 out of the 1,160 projects (against the names of the projects in the diversity metric dataset (Nagappan et al. 2013)) from which we manually investigated commit messages and issue discussions in our qualitative analysis. As done for the quantitative analysis, we considered the matched subset for the computation of the diversity metric, obtaining a score of 0.32, indicating that \\( \\sim 40\\% \\) of our dataset covers just under a third of the open source projects according to six dimensions: programming language, developers, project age, number of committers, number of revisions, and number of programming languages. The dimensional scores are 0.43, 0.99, 1.00, 0.99, 0.94, and 1.0, respectively. Intuitively, these scores are directly impacted by the limited number of projects that we were able to match. However, we still observe relatively high diversity scores suggesting that our qualitative analysis is representative for a substantial portion of the open source systems.\n\n### 3.5 Replication Package\n\nThe working data set of our study is available at: http://www.cs.wm.edu/semeru/data/EMSE15-licensing. It includes (i) the lists of projects and their urls, (ii) the issues tracker and commit data, (iii) the analysis scripts, and (iv) a summary of the achieved results.\n\n### 4 Study Results\n\nThis section discusses the achieved results answering the four research questions formulated in Section 3.1.\n\n#### 4.1 RQ 1: What is the Usage of Different Licenses in GitHub?\n\nFigure 1 depicts the percentage of licenses that were first introduced into a project in the given year, which we refer to as relative license usage. We only report the first occurrence of each license committed to any file of the project. To ease readability, the bars are grouped by permissive (dashed bars) or restrictive licenses (solid bars). Additionally, we omit data prior to 2002 due to the limited number of projects created during those years in our sampled dataset (see Table 4).\nFor the year 2002, we observed that restrictive licenses and permissive licenses had been used approximately equally with a slight bias towards using restrictive licenses. Although the LGPL-2.1 and LGPL-2.1+ variants are restrictive licenses, they are less restrictive than their GPL counter-part. The LGPL specifically aimed at ameliorating licensing conflicts that arose when linking code to a non-(L)GPL library. Instead, the various versions of the GPL license require the system to change its license to the same version of the GPL, or else the component would not legally be able to be redistributed together with the project source code. 
Thus, it suggests a bias toward using less restrictive licenses even among the mostly used copyleft licenses. By the subsequent year (2003), a clear movement towards using less restrictive licenses can be seen with the wider adoption of the MIT/X11 license as well as the Apache-1.1 license. Additionally, we observe that the LGPL is still prominent, while the CMU, CPL-1.0, and GPL-2.0+ licenses were declining.\n\nDuring the following five years (2004\u20132008), the Apache-2.0, CDDL-1.0, EPL-1.0, GPL-3.0, LGPL-3.0, and DWTFYW-2 licenses were created. For the same observation period, Bavota et al. found that the Apache ecosystem grew exponentially (Bavota et al. 2013). This observation explains the rapid diffusion of the Apache-2.0 license among FOSS projects. We observed a growth that resulted in the Apache-2.0 license accounting for approximately 41% of licensing in 2008. Conversely, we observed a decline in the relative usage of both GPL and LGPL licenses. These two observations suggest a clear shift toward permissive licenses, since \\( \\sim 60\\% \\) of licenses attributed were permissive starting from 2003 (with small drops in 2007 and 2009).\n\nAnother interesting observation was that the newer version of the GPL (GPL-3.0 or GPL-3.0+) had a lower relative usage compared to its earlier version until 2011. Additionally, the adoption rate was more gradual than for the Apache-2.0 license that appears to supersede Apache-1.1 license. However, the LGPL-3.0 and LGPL-3.0+ do not have more\npopularity than prior versions in terms of adoption, despite the relative decline of the LGPL-2.1\u2019s usage starting in 2010. Our manual analysis of commits highlighted explicit reasons that pushed some developers to choose the LGPL license. For instance, a developer of the hibernate-tools project when committing the addition of the LGPL-2.1+ license to her project wrote:\n\n*The LGPL guarantees that Hibernate and any modifications made to Hibernate will stay open source, protecting our and your work.*\n\nThis commit note indicates that LGPL-2.1+ was chosen as the best option to balance the freedom for reuse and guarantee that the software will remain free.\n\nConversely, we observed the abandonment of old licenses and old license versions as newer FOSS licenses are introduced. For example, Apache-1.1 and CPL-1.0 become increasingly less prevalent or no longer used among the projects. In both cases, a newer license appears to replace the former license. While the Apache-2.0 offers increased protections (e.g., protections for patent litigation), the EPL-1.0 and the CPL-1.0 are the same license, with the only difference that IBM is replaced by the Eclipse Foundation as the steward of the license. Thus, the two licenses are intrinsically the same from a legal perspective, and most likely projects migrated from the CPL to the EPL; this would explain why the EPL adoption grew as the CPL usage shrunk.\n\nFinally, we observed fluctuations in the adoption of the MIT/X11 license. As the adoption of permissive licenses grew with the introduction of the Apache-2.0 license, it first declined in adoption and was followed by growth to approximately its original adoption. Ultimately, we observed a stabilization of the MIT/X11 usage at approximately 10% starting in 2007.\n\nIn order to determine whether the proportions for a given license exhibited a stationary trend, or a clearly increasing trend over the observed years, we performed ADF-tests as explained in Section 3.2. 
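As an illustration of this check, the sketch below runs the stationarity test on a made-up series of yearly adoption proportions (the data and parameter choices are ours, not the study's scripts; the explosive-trend variant corresponds to the complementary one-sided alternative, e.g., R's adf.test with alternative="explosive"):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Invented yearly proportions for one license (2002-2012), for illustration only.
yearly_share = np.array([0.08, 0.12, 0.18, 0.22, 0.28, 0.33, 0.41, 0.40, 0.43, 0.45, 0.47])

# maxlag is fixed because the series is short; the study's exact settings may differ.
adf_stat, p_value, *_ = adfuller(yearly_share, maxlag=1, autolag=None)
ALPHA = 0.05

if p_value < ALPHA:
    print(f"stationary trend (p={p_value:.3f})")   # H0 of non-stationarity rejected
else:
    print(f"no evidence of a stationary trend (p={p_value:.3f})")
```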
Results are reported in Table 5, where significant $p$-values (shown in bold face) in the second column indicate that the series is stationary ($H_0$ rejected), while significant $p$-values in the third column indicate that the series has an explosive, i.e., clearly increasing, trend ($H_{0e}$ rejected). The results indicate that:

- Almost no license exhibits a stationary trend. The results only show a significant stationary trend for the zend-2.0 license, which is not particularly popular, and a marginal significance for CMU, CPL-1.0 and GPL-1.0+.
- Confirming the discussion above, we have a clearly increasing trend not only for permissive licenses such as Apache-2.0 and MIT/X11 but also for new versions of restrictive licenses facilitating the integration with other licenses (in particular, GPL-3.0, which eases the compatibility with the Apache license, as well as LGPL-2.0, which facilitates compatibility when code is integrated as a library). We also see an increase for DWTFYW-2.0, but, as will be discussed in Section 5, this is likely due to cases in which developers do not have a clear idea about the license to be used.

Table 5 The results of the augmented Dickey-Fuller test to determine stationary or explosive trends in the license usage

| License | Stationary trend (p-value) | Explosive trend (p-value) |
|---------------|---------------------------|---------------------------|
| Apache-1.1 | 0.14 | 0.86 |
| Apache-2.0 | 0.98 | **0.02** |
| BSD | 0.73 | 0.27 |
| CDDL v1 | 0.42 | 0.58 |
| CMU | 0.05 | 0.95 |
| CPL-1.0 | 0.43 | 0.57 |
| EPL-1.0 | 0.07 | 0.93 |
| DWTFYW-2.0 | 0.99 | **0.01** |
| MPL-1.0 | 0.90 | 0.10 |
| MPL-1.1 | 0.32 | 0.68 |
| NPL-1.1 | 0.55 | 0.45 |
| svnkit+ | 0.78 | 0.22 |
| zend-2.0 | **0.01** | 0.99 |
| MIT/X11 | 0.97 | **0.03** |
| GPL-1.0+ | 0.05 | 0.95 |
| GPL-2.0 | 0.67 | 0.33 |
| GPL-2.0+ | 0.66 | 0.34 |
| GPL-3.0 | 0.98 | **0.02** |
| GPL-3.0+ | 0.69 | 0.31 |
| LGPL-2.0 | 0.99 | **0.01** |
| LGPL-2.0+ | 0.67 | 0.33 |
| LGPL-2.1 | 0.35 | 0.65 |
| LGPL-2.1+ | 0.54 | 0.46 |
| LGPL-3.0 | 0.63 | 0.37 |
| LGPL-3.0+ | 0.52 | 0.48 |

**Summary for RQ 1** For the analyzed Java projects, we observed a clear trend towards using permissive licenses like Apache-2.0 and MIT/X11. Additionally, the permissiveness or restrictiveness of a license can impact the adoption of newer license versions, where permissive licenses are more rapidly adopted. Conversely, restrictive licenses seem to maintain a greater ability to survive in usage as compared to permissive licenses, which tend to become superseded. Restrictive (GPL-3.0) or semi-restrictive (LGPL-2.0) licenses that facilitate integration with other licenses also exhibit an increasing trend. Finally, we observed a stabilization in the license adoption proportions of particular licenses, despite the exponential growth of the GitHub code base.

### 4.2 RQ 2: What are the Most Common Licensing Change Patterns?

We analyzed commits where a license change occurred with a two-fold goal: (i) to analyze license change patterns to understand both the prevalence and types of licensing changes affecting software systems, and (ii) to understand the rationale behind these changes. Overall, we found 204 different atomic license change patterns. To analyze them, we identified the patterns having the highest proportion across projects (i.e., global patterns) and within a project (i.e., local patterns).
We sought to distinguish between dominant global patterns (Table 6) and dominant local patterns (Table 7) to study, on the one hand, the overall trend of licensing changes and, on the other hand, specific phenomena occurring in certain projects.

The global patterns were extracted by identifying and counting the presence of a pattern only once per project and then aggregating the counts over all projects. For instance, 823 projects in our dataset experienced at least one change (each) from No license → Apache-2.0, thus the final count (globally) for the pattern is 823. The most dominant global patterns were either a change from no license or an unknown license to a particular license, or a change from a particular license to no license or an unknown license. Table 6 shows the top ten global patterns. We observe that the inclusion of Apache-2.0 was the most common pattern for unlicensed or unknown code. This is likely due to the specific programming language (i.e., Java) exploited by the sample of projects we quantitatively analyzed.

Table 6 Top ten global atomic license change patterns

| Top Patterns (Overall) | Frequency |
|------------------------|-----------|
| no license or unknown → Apache-2.0 | 823 |
| Apache-2.0 → no license or unknown | 504 |
| no license or unknown → GPL-3.0+ | 269 |
| GPL-3.0+ → no license or unknown | 181 |
| no license or unknown → MIT/X11 | 163 |
| no license or unknown → GPL-2.0+ | 113 |
| GPL-2.0+ → no license or unknown | 111 |
| MIT/X11 → no license or unknown | 98 |
| no license or unknown → EPL-1.0 | 94 |
| no license or unknown → LGPL-2.1+ | 91 |

| Top Migration Patterns Between Licenses | Frequency |
|----------------------------------------|-----------|
| GPL-3.0+ → Apache-2.0 | 25 |
| GPL-2.0+ → GPL-3.0+ | 25 |
| Apache-2.0 → GPL-3.0+ | 24 |
| GPL-2.0+ → LGPL-2.1+ | 22 |
| GPL-3.0+ → GPL-2.0+ | 21 |
| LGPL-2.1+ → Apache-2.0 | 16 |
| GPL-2.0+ → Apache-2.0 | 15 |
| Apache-2.0 → GPL-2.0+ | 13 |
| MPL-1.1 → MIT/X11 | 11 |
| MIT/X11 → Apache-2.0 | 11 |

Table 6 also shows the most common global migrations when focusing on changes that happened between different licenses. We observe that the migration towards the more permissive Apache-2.0 was a dominant change among the top ten atomic license changes for global license migrations. An interesting observation is the license upgrade and downgrade between GPL-2.0+ and GPL-3.0+. GPL-3.0 is considered by the Free Software Foundation to be compatible with the Apache-2.0 license.9 Due to the large usage of the Apache license in Java projects, this pattern is quite expected. However, the migration GPL-3.0+ → GPL-2.0+ is interesting, since it not only still allows the project to be redistributed as GPL-3.0, but also allows usage under GPL-2.0, which is less restrictive.

9http://gplv3.fsf.org/wiki/index.php/Compatible licenses

Regarding the local patterns (Table 7), the frequencies were computed by first identifying the most frequent (i.e., dominant) pattern in each project, and then counting the number of times a specific pattern is the most frequent across the whole dataset. For instance, the GPL-1.0+ → GPL-3.0+ pattern is the most frequent in 36 projects from our dataset. Table 7 summarizes the most common local migrations.
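To illustrate the two counting schemes just described, the following is a minimal sketch, assuming that the list of atomic license change patterns observed in each project has already been extracted; it is not the authors' mining infrastructure, and the project names and pattern strings are purely illustrative.

```python
# Hedged sketch, not the study's code: computing the global frequency of a
# pattern (counted at most once per project, cf. Table 6) and the local
# frequency (in how many projects a pattern is the dominant one, cf. Table 7).
from collections import Counter

# Illustrative input: atomic license change patterns observed per project.
patterns_per_project = {
    "project-a": ["no license -> Apache-2.0", "no license -> Apache-2.0"],
    "project-b": ["no license -> Apache-2.0", "GPL-2.0+ -> GPL-3.0+"],
    "project-c": ["GPL-2.0+ -> GPL-3.0+", "GPL-2.0+ -> GPL-3.0+",
                  "no license -> MIT/X11"],
}

# Global counts: each pattern contributes once per project, then the counts
# are aggregated over all projects.
global_counts = Counter()
for patterns in patterns_per_project.values():
    global_counts.update(set(patterns))

# Local counts: identify the dominant pattern of each project, then count in
# how many projects each pattern is dominant.
local_counts = Counter(
    Counter(patterns).most_common(1)[0][0]
    for patterns in patterns_per_project.values()
)

print(global_counts.most_common())  # e.g., [('no license -> Apache-2.0', 2), ...]
print(local_counts.most_common())
```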
The migrations appear to be toward a less restrictive license or license version. The low frequency of the *atomic license change* local patterns indicates that migrating licenses is non-trivial. It can also introduce problems with respect to reuse. For example, we observed a single project where GPL-1.0+ code was changed to LGPL-2.0+ a total of nine times. LGPL is less restrictive than GPL when the code is used as a library. Thus, if parts of the system are GPL, the developer must comply with the more restrictive and possibly incompatible constraints.

Table 7 Top ten local atomic license change patterns

| Pattern | Frequency |
|---------|-----------|
| GPL-2.0+ → GPL-3.0+ | 36 |
| GPL-2.0+ → LGPL-3.0+ | 15 |
| LGPL-3.0+; Apache-2.0 → Apache-2.0 | 12 |
| GPL-3.0+; Apache-2.0 → Apache-2.0 | 12 |
| GPL-2.0+ → LGPL-2.1+ | 10 |
| GPL-1.0+ → LGPL-2.0+ | 9 |
| GPL-2.0+ → GPL-3.0+ | 9 |
| GPL-3.0+ → Apache-2.0 | 8 |
| GPL-3.0+ → GPL-2.0+ | 8 |
| GPL-3.0+ → LGPL-3.0+ | 8 |

Until now, we considered *atomic license changes* among any file in the repository. This was needed since most of the analyzed projects lack a specific file (e.g., license.txt) declaring the project license. To extract the declared project license, we considered a file in the top level directory named *license*, *copying*, *copyright*, or *readme*. When focusing only on projects including such files, we extracted 24 different change patterns. Table 8 illustrates the top eight licensing changes between particular licenses (i.e., we excluded no license or unknown license from this table) for declared project licenses. We only considered the top eight, since five other patterns were tied as the next group of change patterns. We observe that the change from Apache-2.0 → MIT/X11 was the most prevalent license change pattern, and the co-licensing of MIT/X11 with Apache-2.0 is the second most prevalent one. Interestingly, this pattern was not dominant in our file-level analysis, although the Grounded Theory analysis provided us support for this pattern. The MIT/X11 license was used to allow commercial reuse, while still maintaining the open source nature of the project.

Table 8 Top eight license change patterns between licenses for declared project licenses

| Pattern | Frequency |
|---------|-----------|
| Apache-2.0 → MIT/X11 | 12 |
| Apache-2.0 → MIT/X11; Apache-2.0 | 8 |
| GPL-2.0+ → GPL-3.0+ | 7 |
| MIT/X11 → Apache-2.0 | 6 |
| GPL-3.0+ → Apache-2.0 | 6 |
| MIT/X11; Apache-2.0 → Apache-2.0 | 5 |
| Apache-2.0 → GPL-3.0+ | 5 |
| GPL-3.0+ → MIT/X11 | 3 |

The pattern of GPL-2.0+ → GPL-3.0+ (Top-3 in Table 8) was expected since it was tied for the most prevalent among global atomic license changes. Similarly, the patterns of MIT/X11 → Apache-2.0, GPL-3.0+ → Apache-2.0, and Apache-2.0 → GPL-3.0+ were also among the top eight global changes. Another notable observation is that license changes frequently happen toward permissive licenses.
Excluding the five changes from Apache-2.0 → GPL-3.0+, the remaining changes in the top eight are either a licensing change from a restrictive (or copyleft) license to a permissive license or a licensing change between two different permissive licenses.

**Summary for RQ 2** The key insight from the analysis of atomic license change patterns observed in the studied Java projects is that licenses tend to migrate toward less restrictive licenses.

### 4.3 RQ 3: To What Extent are Licensing Changes Documented in Commit Notes or Issue Tracker Discussions?

Table 9 reports the results of the identification of traceability links between licensing changes and commit messages/issue tracker discussions. We found a clear lack of traceability between license changes in both the commit message history and the issue tracker. In both data sources, we first extracted the instances (i.e., commit messages and issue tracker discussion comments) where the keyword “license” appears or where a license name was mentioned (e.g., “Apache”). In the former case, we are identifying potential commits or issues that are related to licensing, while the latter attempts to capture those related to specific types of licenses.

Table 9 Traceability links between licensing changes and commit messages/issue tracker discussions

| Data source | Linking query | Links |
|-------------|-------------------------------------------------------------------------------|--------|
| Commit Messages | Commits with the keyword “license” | 70,746 |
| | Commits containing new license name | 519 |
| | Commits containing new license name and the keyword “license” | 399 |
| Issue Tracker Comment Matching | Comments from closed issues containing the keyword “license” | 0 |
| | Comments from closed issues containing the new license | 0 |
| | Comments from closed issues containing the new license and the keyword “license” | 0 |
| | Comments from open issues containing the keyword “license” | 68 |
| | Comments from open issues containing the new license | 712 |
| | Comments from open issues containing the new license and the keyword “license” | 16 |
| Issue Tracker Date-based Matching | Closed comments opened before license change and closed before or at license change | 197 |
| | Open comments open before the license change | 2,241 |
| | Comments from closed issues open before the license change and closed before or at the license change with keyword “license” | 0 |
| | Comments from open issues open before the license change with keyword “license” | 0 |
| Issue Commit Matching | Comments in closed issues containing the keyword “Fixed #[issue_num]” | 66,025 |
| | Comments in open issues containing the keyword “Fixed #[issue_num]” | 3,407 |
| | Comments in closed issues containing the commit hash where the license change occurs | 0 |
| | Comments in open issues containing the commit hash where the license change occurs | 1 |

By using the first approach, we retrieved 70,746 commits and 68 issues; while looking for license names, we identified 519 commits and 712 issues. However, these numbers are inflated by false positives (e.g., “Apache” can relate to the license or it can relate to one of the Apache Software Foundation’s libraries). For this reason, we then looked for commit messages and issue discussions containing both the word “license” as well as the name of a license.
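The following is a minimal sketch of such a textual filter, assuming simple regular-expression matching over commit messages (the same idea applies to issue tracker comments); it is not the authors' actual linking scripts, and the list of license names is a small illustrative subset.

```python
# Hedged sketch, not the study's scripts: the three keyword-based linking
# queries summarized in Table 9, applied to a single commit message.
# The license-name list below is illustrative, not the full list used in the study.
import re

LICENSE_NAMES = ["Apache", "MIT", "BSD", "GPL", "LGPL", "EPL", "MPL", "CDDL"]
keyword_re = re.compile(r"\blicen[cs]e", re.IGNORECASE)       # matches "license"/"licence"
name_re = re.compile("|".join(LICENSE_NAMES), re.IGNORECASE)  # matches any license name

def linking_queries(text: str) -> dict:
    """Return which of the three Table 9 queries the given text matches."""
    has_keyword = bool(keyword_re.search(text))
    has_name = bool(name_re.search(text))
    return {
        "keyword": has_keyword,            # e.g., the 70,746 commits in Table 9
        "license_name": has_name,          # e.g., the 519 commits
        "both": has_keyword and has_name,  # e.g., the 399 commits
    }

print(linking_queries("Changed license to Apache v2"))
# -> {'keyword': True, 'license_name': True, 'both': True}
```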
This resulted in a drop of the linked commit messages to 399 and in zero issue discussions. Such results highlight that license changes are rarely documented by developers in commit messages and issues.\n\nWe also investigated whether relevant commits and issues could be linked together. We linked commit messages to issues when the former explicitly mentions fixing a particular issue (e.g., \u201cFixed #7\u201d would denote issue 7 was fixed). We observed that this technique resulted in a large number of pairs between issues and commits; thus, our observation of a lack of license traceability is not simply an artifact of poor traceability for these projects. To further investigate the linking, we extracted the commit hashes where a license change occurred and attempted to find these hashes in the issue tracker\u2019s comments. Since the issue tracker comments contain the abbreviated hash, we truncated the hashes appropriately prior to linking. Our results indicated only one match for an open issue and zero matches for closed issues.\n\nFinally, we attempted to link changes to issues by matching date ranges of the issues to the commit date of the license change. The issue had to be open prior to the change and if the issue had been closed the closing date must have been after the change. However, we did not find any matches with a date-based approach.\n\n**Summary for RQ 3** For the analyzed Java projects, both the issue tracker discussions and commit messages yielded very minimal traceability to license changes, suggesting that the analysis of licensing requires fine-grained approaches analyzing the source code.\n\n### 4.4 RQ 4: What Rationale do These Sources Contain for the Licensing Changes?\n\nIn this section, we firstly present the taxonomy that resulted from the open coding of commit messages and issue tracker discussions. As explained in Section 3, this analysis has been performed on 1,637 commit messages and 486 issue tracker discussions from 1,160 projects written in seven programming languages, and aims at modeling the rationale of license adoption and changes. Secondly, we present our findings when looking at the commits that introduce atomic license changes in the analyzed Java projects.\n\n#### 4.4.1 Analyzing Commit Messages and Issue Discussions\n\nTable 10 reports the categories obtained in the open coding process. In total, we grouped commit messages and issue tracker discussions into 28 categories, and organized them into seven groups that will be described in detail in the rest of this section. Additionally, 430 commits and 161 issue discussions identified by means of pattern matching as potentially related to licensing were classified as false positives. This is mainly due to the wide range of matching keywords that we used for our filtering (see Section 3) to identify as many commits/issues as possible. 
Finally, for 16 commits and two issue discussions that were\nTable 10 Categories defined through open coding for the Issue tracker discussion comments and Commit notes\n\n| Category | C | C++ | C# | Java | Javascript | Python | Ruby | Overall |\n|---------------------------|-----|------|-----|------|------------|--------|------|---------|\n| | I C | I C | I C | I C | I C | I C | I C | I C |\n| Generic license additions | | | | | | | | |\n| Choosing license | 1 | 0 | 0 | 0 | 0 | 6 | 0 | 2 | 0 | 1 | 0 | 11 | 0 |\n| License added | 1 | 22 | 3 | 19 | 0 | 15 | 25 | 75 | 22 | 34 | 9 | 34 | 1 | 33 | 59 | 232 |\n| License change | | | | | | | | |\n| License change | 2 | 14 | 1 | 8 | 1 | 5 | 3 | 14 | 4 | 9 | 2 | 6 | 2 | 18 | 15 | 74 |\n| License upgrade | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |\n| License rollback | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 3 |\n| Removed licensing | 0 | 3 | 0 | 3 | 0 | 4 | 0 | 6 | 1 | 8 | 0 | 2 | 0 | 3 | 1 | 29 |\n| Changes to copyright | | | | | | | | |\n| Copyright added | 0 | 6 | 0 | 3 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 15 |\n| Copyright update | 2 | 24 | 0 | 7 | 1 | 6 | 5 | 89 | 2 | 7 | 2 | 4 | 1 | 8 | 13 | 138 |\n| License fixes | | | | | | | | |\n| Link broken | 7 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 16 | 0 | 1 | 0 | 19 | 0 | 46 | 0 |\n| License mismatch | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |\n| Fix licensing | 4 | 2 | 0 | 1 | 0 | 2 | 1 | 3 | 2 | 0 | 0 | 1 | 2 | 1 | 9 | 10 |\n| License file modification | 0 | 11 | 0 | 8 | 0 | 14 | 0 | 0 | 1 | 11 | 1 | 7 | 1 | 29 | 3 | 80 |\n| Missing licensing | 1 | 1 | 0 | 0 | 0 | 3 | 2 | 0 | 7 | 0 | 12 | 0 | 4 | 1 | 26 | 5 |\n| License compliance | | | | | | | | |\n| Compliance discussion | 1 | 9 | 0 | 5 | 1 | 1 | 0 | 1 | 0 | 3 | 0 | 1 | 0 | 0 | 2 | 20 |\n| Derivative work inconsistency | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |\n| Add compatible library | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 3 | 3 |\n| Removed third-party code | 3 | 13 | 1 | 8 | 0 | 1 | 0 | 1 | 0 | 2 | 0 | 4 | 0 | 3 | 4 | 32 |\n| License compatibility | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |\n| Reuse | 1 | 1 | 1 | 0 | 0 | 17 | 0 | 1 | 0 | 10 | 0 | 1 | 0 | 0 | 21 | 1 |\n| Dep. license added | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |\n| Dep. license issue | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |\n| Clarifications/Discussions| | | | | | | | |\n| License clarification | 2 | 0 | 2 | 1 | 1 | 0 | 19 | 0 | 2 | 1 | 4 | 0 | 2 | 0 | 32 | 2 |\n| Terms clarification | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 7 | 0 |\n| Verify licensing | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 |\n| License agreement | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |\n| Request for a license | | | | | | | | |\n| Licensing request | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 6 | 0 | 11 | 0 |\n| License output for the end user | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |\n\nrelated to licensing, it was not possible, based on the available information, to perform a clear categorization. 
Thus, they were excluded from this study.\n\nIn the following, we discuss examples related to the various groups of categories.\nGeneric License Additions This group of categories concerns cases in which a license was added in a file, project or component where it was not present, as well as discussions related to choosing the license to be added in a project. One typical example of commit message, related to the very first introduction of a software license into the repository, mentioned:\n\n\u201cAdded a license page to TARDIS.\u201d (https://github.com/tardis-sn/tardis/commit/07b2a072d89d45c386d5f988f04435d76464750e)\n\nOther commit messages falling in this category were even precise in reporting the exact license committed into the repository, e.g.,:\n\n\u201cAdd MIT license. Rename README to include rst file extension.\u201d (https://github.com/Schevo/schevorecipe.db/commit/b73bef14adeb7c87c002a908384253c8f686c625)\n\nFinally, commit messages automatically generated by the GitHub\u2019s licensing feature were present, e.g.,:\n\n\u201cCreated LICENSE.md.\u201d\n\nWhile commit messages show the addition of a license to a project, they do not provide the rationale behind the specific choice. This can be found, sometimes, in the discussions carried out by the developers in the issue trackers to establish the license under which their project would be released. For example, one of the issue discussions we analyzed was titled \u201cAdd LICENSE file\u201d (https://github.com/rosedu/web-workshops/issues/1) in the project web-workshops, and the issue opener explained the need for (i) deciding the license to adopt and (ii) involve all projects\u2019 contributors in such a decision:\n\n\u201cA license needs to be chosen for this repo. All contributors need to agree with the chosen license. A list of contributors is enclosed below.\u201d\n\nDoubts and indecision about which license to adopt were also evident in several of the issue discussions that we manually analyzed:\n\n\u201cWhat license to use? BSD, GNU GPL, or APACHE?\u201d (https://github.com/kovmarci86/d3-armory/issues/5)\n\nInterestingly, one developer submitted an issue for the project InTeX entitled \u201cDual license under LGPL and EPL\u201d (https://github.com/mtr/intex/issues/1) that related to adding a new license to balance code reuse of the system, while avoiding \u201ccontagious\u201d licensing (the term \u201ccontagious\u201d was used by the original developer of the system). The developer commented:\n\n\u201cYour package is licensed under GPL. I\u2019m not a lawyer but as far as I understand the intention of the GPL, all LaTeX documents compiled with the InTeX package will have to be made available under GPL, too. [...] I think, you want users to publish changes they did at your code. A dual license under LGPL and EPL would ensure that a) changes on your code have to be published along with a binary publication and b) that your code can be used in GPL and non-GPL projects. See JGraphT\u2019s relicensing for more background.\u201d\n\nThis response demonstrates a potential lack of understanding regarding the license implications to compiled LaTeX and proposes dual-licensing as a solution. However, the original\ndeveloper also indicates a lack of legal background and is not willing to offer a dual-license based on his understanding stating:\n\n\"Thank you for your interest. I not a lawyer myself either, but my intentions are:\n\n1. 
I want changes to the source code of InTeX to be made available so that others can benefit from them too.\n\n2. I do not want any \u201ccontagious\u201d copyright of documents compiled with InTeX. However, I\u2019ve always thought of InTeX as a (pre)compiler, and given this GPL FAQ answer, I think licensing the compiler\u2019s source code under GPL does not limit or affect the copyright of the documents it is used to process.\n\nUnless you can prove me wrong about this, I will close this issue.\"\n\nThus, the developer responds by providing his understanding of the GPL by referencing a response by GNU regarding compiled Emacs. However, the developer does indicate an openness to adding a new license if the GPL would in fact be applied to generated LaTeX documents. This example is particularly interesting, since it shows the original developer\u2019s rationale for picking the GPL as well as the difficulty that developers have with respect to licensing.\n\n**License Change** This group of categories concerns cases in which (i) a licensing statement was changed from one license towards a different one; (ii) a license was upgraded towards a new version, e.g., from GPL-2.0 to GPL-3.0; (iii) cases of license rollback (i.e., when a license was erroneously changed, and then a rollback to the previous license was needed to ensure legal compliance); and (iv) cases in which for various reasons developers removed a previously added license.\n\nMost commit messages briefly document the performed change, e.g., \u201cSwitched to a BSD-style license\u201d, \u201cSwitch to GPL\u201d. Some others, partially report the rationale behind the change:\n\n\u201cThe NetBSD Foundation has granted permission to remove clause 3 and 4 from their software\u201d\n\nThe commit message explains that permission has been granted for the license change by the NetBSD Foundation. However, the committer does not explain the reason for the removal of the two clauses. Other commits are instead very detailed in providing full picture of what happened in terms of licensing:\n\n\u201cRelicensed CZMQ to MPLv2 - fixed all source file headers - removed COPYING/COPYING.LESSER with GPLv3 and LPGv3 + exceptions - added LICENSE with MPLv2 text - removed ztree class which cannot be relicensed - (that should be reintroduced as foreign code wrapped in CZMQ code).\u201d\n\nThe commit message from the project CZMQ (https://github.com/zeromq/czmq/commit/eabe063c2588cde0af90e5ae951a2798b7c5f7e4) is very informative, reporting the former license (i.e., GPL-3.0 and LGPL-3.0), the new license (i.e., MPL-2.0), and the changes applied in the repository to ensure compliance to the new licensing terms (e.g., the removal of the ztree class). This license change demonstrates a move towards a more permissive license, which has been shown to be prevalent in our study of Java projects.\n\n10http://www.gnu.org/licenses/old-licenses/gpl-2.0-faq.html#CanIUseGPLToolsForNF\nWe also found commit messages reporting the rationale behind specific license changes, such as the following commit from the project nimble (https://github.com/bradleybeddoes/nimble/commit/e1e273ff18730d2f8e0d7c2af1951970e676c8d1):\n\n\u201cChange in project License from AGPL 3.0 to Apache 2.0 prior to first public release. Several factors influenced this decision the largest being community building and making things as easy as possible for folks to get started with the project. 
We don\u2019t however believe Open Source == Free and will continue to investigate the best way to commercialize this. Restrictive copy-left licenses aren\u2019t however the answer.\u201d\n\nWhile the developers want to enable external developers to reuse the system, they are also interested in commercializing the software product. The developers acknowledge that copy-left licenses do not meet their needs.\n\nFor License Rollback, we observed that the project PostGIS reverted back licensing to a custom license (https://github.com/postgis/postgis/commit/4eb4127299382c971ea579c8596cc41cb1c089bc). The commit does not offer rationale since it simply states:\n\n\u201cRestore original license terms.\u201d\n\nFrom the analysis of the commit emerged that the author had re-licensed the system under the GPL earlier and subsequently reverted back the licensing to his own custom license. However, it is not clear if this rollback was due to a misappropriation of GPL, an incompatibility in the system, or to other factors.\n\nAdditionally, we found commit messages illustrating that license removals do not necessarily indicate that the licensing of the system was removed. For instance:\n\n\u201cRemoving license as it is declared elsewhere\u201d (https://github.com/ros/ros_comm/commit/e451639226e9fe4eebc997962435cc454687567c)\n\n\u201cRemove extra LICENSE files\nOne repository, one license. No need to put these on the box either.\u201d (https://github.com/openatv/enigma2/commit/b4dfdf09842b3dcacb2a6215fc040f7ebbbb3c03)\n\n\u201cRemove licenses for unused libraries\u201d (https://github.com/ttop/cuanto/commit/a1e58f2c93de40ab304c494e05853957c549fd44)\n\nIn these cases, the system contains redundant or superfluous license files that can be removed. This observation highlights that strictly analyzing the license changes that have happened in the history of a software system could (wrongly) suggest that the system has migrated toward closed-source. The third commit message, instead, indicates that licenses were removed due to unused code, which required those licenses. Such cases, in which a project is adopting unnecessary licenses due to third-party libraries no longer needed, should be carefully managed since it may discourage other developers to reuse the project, especially if the unnecessary licenses are restrictive.\n\nChanges to Copyright This group of categories includes commits/issues related to simple changes/additions applied to the copyright statement, like copyright year, or authors. Changes to a list of author names occur to indicate names of people who provided a substantial contribution to the project, therefore claiming their ownership. Previous work indicated that often such additions occur in correspondence of large changes performed by contributors whose names are not mentioned yet in the copyright statement (Penta and Germ\u00e1n\nChanges to copyright years have also been previously investigated, and are often added to allow claiming right on source code modified in a given year (Di Penta et al. 
2010).\n\n**License Fixes** This group of categories is related to changes in the license mainly due to various kinds of mistakes or formatting issues, as well as to cases in which a licensing statement was accidentally missing (note that this is different to cases of license addition in which the license was originally intended to be absent from the project).\n\nFor example, in this group, we observed cases of issues discussing *license mismatch*, where developers found conflicting headers or conflicts between the declared license and the license headers. In the former case, a developer posted an issue to the project *gtksourcecompletion*\u2019s issue tracker (https://github.com/chuchiperriman/gtksourcecompletion/issues/1):\n\n> \u201cThe license states that this is all LGPL-3, but the copyright headers of the source files say otherwise (and some are missing). Is this intentional, or should these all be under the same license? I\u2019ve included licensecheck output below.\u201d\n\nSubsequently, the issue poster listed files in the system with *GPL*, *LGPL*, and no copyright. Additionally, he indicated cases where the Free Software Foundation address was incorrect as well. We observed a similar situation in another project: a developer opened the issue \u201cLICENSE file doesn\u2019t match license in header of svgeezy.js\u201d (https://github.com/benhowdle89/svgeezy/issues/20) to svgeezy\u2019s issue tracker and stated:\n\n> The LICENSE file specifies the MIT license, but the header in svgeezy.js says it\u2019s released under the WTFPL. Which is the correct license?\n\nIn this second case, we observe that the declared license and source header are not consistent. However, the issue has not been resolved at the time of writing this paper and so we cannot report the resolution or any feedback offered by the original developers of the system.\n\nOther interesting cases are the ones related to the fix of *missing licenses*. Often developers are made aware of missing licenses via the issue tracker by projects\u2019 users reporting the issue. Sometimes, the complete project may be unlicensed, leading to discussions like the one titled \u201cGNU LGPL license is missing\u201d from the project *rcswitch-pi* (https://github.com/r10r/rcswitch-pi/issues/17):\n\n> Under which license is this source code published? This project is heavily based on wiring-pi and rc-switch: rc-switch: GNU Lesser GPL wiring-pi: GNU Lesser GPL\n> The GNU Lesser GPL should be added: http://www.gnu.org/licenses/lgpl.html\n\nBased on the project\u2019s characteristics (i.e., its foundations on previously existing projects), the developer recommends the addition of the missing LGPL license.\n\nThe commits and issues falling in the *License File Modification* category are related to changes applied to the license file type or name. For example, developers may change the license file from the default `LICENSE.md` file generated by GitHub to a `.txt` or `.rtf`. Additionally, developers change the file name often to make it more meaningful as illustrated in this commit message of the project *Haml* (https://github.com/haml/haml/commit/537497464612f1f5126a526e13e661698c86fd91):\n\n> \u201cRenamed the LICENSE to MIT-LICENSE so you don\u2019t have to open the file to find out what license the software is released under. 
Also wrapped it to 80 characters because I\u2019m a picky [edited]\u201d (quote edited for language)\nOther typical changes concern the renaming of the `COPYRIGHT` file to `LICENSE` or the move of the license file in the project\u2019s root directory. These cases do not indicate changes towards a different license or in general any change to the license semantics, but only in the way in which the license is presented.\n\n**License Compliance** This group of categories is probably the most interesting to analyze, and concerns categories related to discussions and changes because of license compliance. Specifically, other than generic compliance discussions, there are cases in which (i) a derivative work\u2019s legal inconsistency was spotted or discussed; (ii) a compatible library is added to replace another incompatible library from a licensing point of view; (iii) third-party code is completely removed when no legally-compliant alternative was possible; (iv) cases of discussion related to license compatibility in the context of reuse; and (v) cases in which an added dependency or an existing dependency has conflicts with the current license.\n\nA very interesting example is the issue discussion entitled \u201cUsing OpenSSL violates GPL licence\u201d in the project SteamPP (https://github.com/seishun/SteamPP/issues/1). Surprisingly, the developer of the project initially commented:\n\n> gnutls and libnss have terrible documentation and I don\u2019t consider this a priority issue anyway. If you would like to submit a pull request, then be my guest.\n\nDespite this initial reaction, the OpenSSL library was replaced by Crypto++ within a week in order to meet the licensing requirements.\n\nExamples of third-party libraries removed due to licensing issues are also prevalent in commit messages, e.g., :\n\n> \u201cRemove elle(1) editor, due to an incompatible license.\u201d (https://github.com/booster23/minixwall/commit/342171fa9e9d769ce4aa48525142a569b34962f7)\n\nThe incompatibility in this case was due to `elle`\u2019s clause explicitly reporting: \u201cNOT be sold or made part of licensed products.\u201d. Additionally, we saw the commit from the project wkhtmltopdf-qt-batch, where files were removed due to a recommendation by the project\u2019s legal staff: \u201cRemove some files as instructed by Legal department\u201d (https://github.com/alexkoltun/wkhtmltopdf-qt-batch/commit/9b142a07a7576afa15ba458e97935aac5921ef8d). This shows that license compliance may not be always straightforward to developers and that they may need to rely on legal council in order to determine whether licensing terms have been met.\n\nWe also observed changes in the system\u2019s licensing aimed at satisfying compliance with third-party code in the project gubg (https://github.com/gfannes/gubg.deprecated/commit/4d291ef433f0596dbd09d5733b25d27b3a921cf4):\n\n> Changed the license to LGPL to be able to use the msgpack implementation in GET nv.\n\nSimilarly, we found issue tracker discussions about conflicting licenses or about the compatibility of licenses between the project and third-party libraries. Interestingly, there was an issue opened by a non-contributor of the project android-sensorium (https://github.com/fmetzger/android-sensorium/issues/11), stating:\n\n> Google Play Services (GMS) is proprietary, hence not compatible with GNU LGPL. 
(The jar inside the Android library referred to in the project.properties).\n\n> F-Droid.org publishes the ....o3gm package, but we cant publish this without removing this library.\nThus, the license incompatibility not only created a potential license violation for the project but also prevented the non-contributor from cataloging the system among projects hosted on F-Droid (https://f-droid.org/), a well-known forge of open source Android apps.\n\nAdditionally, we observed issues related to reuse, where one contributor suggests a dual license to allow for greater reuse in other applications. The contributor of the project python-hpilo (https://github.com/seveas/python-hpilo/issues/85) stated,\n\n*Due to incompatibility between GPLv3 and Apache 2.0 it is hard to use python-hpilo from, for instance, OpenStack. It would therefore be helpful if the project code could also be released under a more permissive license, like for instance Apache 2.0 (which is how OpenStack is licensed)*\n\nThe other contributors subsequently utilized the thread to vote and ultimately agreed upon the dual license. Not only does this example indicate the consideration for reuse but it also demonstrates that licensing decisions are determined by all copyright holders and not by a single developer. It is also important to note that GPL-3.0 and Apache-2.0 are not considered incompatible by the Free Software Foundation.\n\nConversely, we also observed an interesting discussion in which the issue posted in the project patchelf (https://github.com/NixOS/patchelf/issues/37) asked \u201cIs it possible for you to change GPL to LGPL? It would help me using your software.\u201d. The developer posting the question was developing a system licensed under the BSD license with which GPL would not be compatible. A contributor refused to change licensing by stating: \u201cGPL would not be compatible\u201d. Moreover, one of the contributors explained that changing licensing is non-trivial by responding:\n\n*It wouldn\u2019t be easy to change the license, given that it contains code from several contributors, who would all need to approve of the change.*\n\nAgain, this response highlights the importance for all contributors to approve a license change. However, reaching an agreement among all contributors might be far from trivial, due to personal biases developers could have with respect to licensing (Vendome et al. 2015b).\n\nWe also observed a case related to derivative work, where the license differed from the original system\u2019s licensing (category: Derivative Work Inconsistency). A developer created the issue \u201cOrigin and License Issue\u201d for the project tablib (https://github.com/kennethreitz/tablib/issues/114) to which he offered support, but first noted:\n\n*While tablib is MIT-licensed, there are several potential provenance and license issues with Oo, XLS and XLSX formats that tablib embeds. I have collected some of these potential issues here. This is at best ... byzantine. [...] https://bitbucket.org/ericgazoni/openpyxl/ is reported as being derived from PHPExcel which is LGPL-licensed at https://github.com/PHPOffice/PHPExcel but openpyxl is not LGPL but MIT-licensed. 
If this is really derived then there is a possible issue as the license may be that of the original not of the derivative.*\n\nThe issue poster lists the various components used with their licensing to point out incompatibility issues, and in particular those related to the derivative code that the system utilizes.\n\n**Clarifications/Discussions** This group of categories contains issues related to clarifying the project\u2019s licensing, the terms or implications of the licensing, and the agreement between\ncontributors made in a Contributor License Agreement (CLA). License Clarification were about the actual license of the project and typically occurred when the system did not contain a license file (i.e., a declared project license). For example, one project\u2019s user created the issue \u201cPlease add a LICENSE file\u201d for the Mozilla\u2019s project 123done (https://github.com/mozilla/123done/issues/139) stating:\n\nThe repo is public, but it\u2019s not easy to find out how I\u2019m allowed to use or share the code.\n\nCould you add a LICENSE file to make it easier for users to understand how you\u2019d like it to be used?\n\nSimilarly, another project, pyelection, has the issue \u201cWhat license is this code released under?\u201d (https://github.com/alex/pyelection/issues/1) with no further comments from the poster. Thus, we observe that developers use the issue tracker as a mean to understand the licensing and request an explicit licensing file.\n\nAnother surprising issue discussion is related to understanding the terms of a license. The issue was posted to the neunode\u2019s issue tracker (https://github.com/snakajima/neunode/issues/5) by an external developer looking to reuse the code and asked:\n\nWe are impressed with what you\u2019ve done with neu.Node and are interested in using it for offline mapping applications. However, we work at a company that has more than 1M$ in revenue. Your license terms say MIT for companies with less than 1M$ in revenue (which is not an approach I\u2019ve seen before). Please could you clarify the license terms for a company that is larger that that? We\u2019re trying to make some decisions on our direction at the moment, so a quick response would be appreciated if possible.\n\nInterestingly, the license terms set conditions based on the money value of the company looking to reuse the code. In this case, the external developer\u2019s company exceeds the threshold. The original developer indicates that his software is intended to benefit the developer community as a whole, and more specifically students and individuals. The original developer gave two options: (i) a large check without maintenance support, or (ii) detail descriptions of the product, a compelling argument for giving a free license to reuse the system, and acknowledgment in the description. Thus, the original developer is not interested to financial gain (though, he could reasonably be convinced at the right price), but rather wants to support the open source community and receive credit for his work.\n\nWe identified a category of License Agreement. This scenario arises when an external developer to the project submits some code contribution to the project, and the project contributors require that developer to complete a Contributor License Agreement (CLA) to avoid licensing/copyright disputes. We observed a discussion related to updating the textual information of the project\u2019s CLA with respect to country designations (http://github.com/adobe/brackets/issues/8337). 
Similarly, in our previous Java study (Vendome et al. 2015a), a developer submitted a patch, but it could not be merged into the system until that developer filled out the CLA (https://github.com/FasterXML/jackson-module-jsonSchema/issues/35). A CLA makes it explicit that the author of a contribution is granting the recipient project the right to reuse and further distribute such contribution (Brock 2010). Thus, it prevents the contributed code from becoming a ground for a potential lawsuit.\n\nRequest for a License This group contains issue discussions in which a developer asks for a license or a license file. While these are similar to reuse, it differs since the developers\ndo not necessarily state that they want to reuse the system, since it is possible that they want to contribute as well. Thus, these are more generic requests for the developer to attribute a license to the system without explaining the reason for such a request. For example, we found the issue titled \u201cNo license included in repository\u201d for the project jquery-browserify (https://github.com/jmars/jquery-browserify/issues/20) in which the poster commented:\n\nWould you consider adding a license to the repository? It\u2019s currently missing one and according to TOS.\n\n[Not posting a license] means that you retain all rights to your source code and that nobody else may reproduce, distribute, or create derivative works from your work. This might not be what you intend.\n\nEven if this is what you intend, if you publish your source code in a public repository on GitHub, you have accepted the Terms of Service which do allow other GitHub users some rights. Specifically, you allow others to view and fork your repository.\n\nIf you want to share your work with others, we strongly encourage you to include an open source license.\n\nIf you don\u2019t intend on putting a license up that\u2019s fine, but if you do want to use an open source license please do so. I\u2019d be happy to fork/PR for you if you just let me know which license you want to put in (MIT/BSD/Apache/etc.)\n\nThis comment demonstrates that licensing also impacts derivative work and can prevent other developers from contributing to a system. This is an important distinction, since findings and prior work (Vendome et al. 2015a, b; Sojer and Henkel 2010) demonstrate that licensing could be an impediment to reuse and not an impediment to contribute towards a project/system.\n\nLicense Output for the End User This category describes a unique case where an issue was posted regarding the output of the license to the end user. The issue stated:\n\n\u201cThis output could be read by monitoring tools, for example to automatically warn about expiration (although Phusion also emails expiration warnings, the desired upfront time for the warning is not configurable like that).\u201d (http://github.com/phusion/passenger/issues/1482)\n\nUnlike the previous categories, this issue relates to end user licensing the software. The contributor of the system suggests the inclusion of a feature to aid in monitoring the license expiration. Interestingly, this category shows that developers also consider licensing from the impact on the \u201cclient\u201d using the system. 
This aspect of understanding the impact of licensing on the \u201cclient\u201d or end user has also been unexplored in prior studies.\n\n4.4.2 Analysing Commits Implementing Atomic License Changes in Java Systems\n\nIn this analysis, we specifically targeted commit messages where a licensing change occurred so that we could understand the rationale behind the change. We did not apply a keyword for these commit messages since we knew they were commits related to changes in licensing. When reading these commits, we also included the atomic license change pattern that was observed at that particular commit to add context. We observed new support for the existing categories and the results are reported in Table 11. We refer to new support as commit messages indicating new rationale for the existing categories.\nTable 11 Categories defined through open coding for the commit messages in which a license change occurred\n\n| Category | Commits |\n|--------------------------------|---------|\n| Generic license additions | |\n| Choosing license | 0 |\n| License added | 63 |\n| License change | |\n| License change | 9 |\n| License upgrade | 1 |\n| License rollback | 1 |\n| License removal | 19 |\n| Changes to copyright | |\n| Copyright added | 0 |\n| Copyright update | 1 |\n| License fixes | |\n| Link broken | 0 |\n| License mismatch | 0 |\n| Fix missing licensing | 9 |\n| License file modification | 0 |\n| Missing licensing | 1 |\n| License compliance | |\n| Compliance discussion | 0 |\n| Derivative work inconsistency | 0 |\n| Add compatible library | 0 |\n| Removed third-party code | 1 |\n| License compatibility | 0 |\n| Reuse | 0 |\n| Dep. license added | 0 |\n| Dep. license issue | 0 |\n| Clarifications/Discussions | |\n| License clarification | 0 |\n| Terms clarification | 0 |\n| Verify licensing | 0 |\n| License agreement | 0 |\n| Request for a license | |\n| Licensing request | 0 |\n| License output for the end user| |\n| Output licensing | 0 |\n\nAs for the License Change group of categories, we observed general messages indicating a license change occurred and in some cases explicitly stating the new license, such as the following commit messages:\n\n\u201cRewrite to get LGPL code.\u201d\n\n\u201cChanged license to Apache v2\u201d\nThese two commit messages do not offer rationale, but they at least indicate the new license that has been attributed to the system. So, a developer inspecting the change history would be able to accurately understand the particular license change.\n\nSince we observed many instances of no license $\\rightarrow$ some license, the prevalence of License Added was expected. However, these License Added commit messages resembled the License Change messages since they often did not include a clear rationale (i.e., while being part of the License Added category, their level of detail was similar to the License Change category). For example, a developer asserted the Apache-2.0 license to the headers of the source files across his project, but his commit message simply stated:\n\n\u201cEnforce license\u201d\n\nIn the case of License Removal, we observed that licenses were removed due to code clean up, files deletion, and dependencies removal. For example, we observed the removal of the GPL-2.0 license with the following commit message,\n\n\u201cNo more smoketestclientlib\u201d\n\nIt indicates the removal of a previously exploited library. 
Additionally, licenses were removed as developers cleaned up the project.\n\nFix Missing Licensing is related to a license addition, but it occurred when the author intended to license the file, but forgot either in the initial commit or in the commit introducing the licensing. For example, one commit message stated:\n\n\u201cAdded missing Apache License header.\u201d\n\nThis indicates that the available source code may inaccurately seem unlicensed.\n\nAdditionally, License Upgrade refers to license change, where the version of the license is modified to the most recent. In this particular case, we observed a change from GPL-2.0+ to GPL-3.0+. The commit message stated:\n\n\u201c...Change copyright header to refer to version + 3 of the GNU General Public License and to point readers at the + COPYING3 file and the FSF\u2019s license web page.\u201d\n\nWhile the commit message describes the version change, it does not supply rationale. Instead, the message is a log of the changes.\n\nAn important observation from the second round of our analysis was the ambiguity of commit messages. For example, we observed a commit classified as Copyright Update stating,\n\n\u201cUpdated copyright info.\u201d\n\nHowever, this commit corresponded to a change in licensing from GPL-2.0 to LGPL-2.1+. This case both illustrates the lack of detail offered by developers in commit messages, and it illustrates that an update can be more significant than adding a header or changing a copyright year.\n\nSince we sampled commits from all Java projects, it was infeasible to sample a larger representative number of commit messages. Thus, augmenting the second round by considering commits in which an atomic license change occurred benefited the taxonomy by targeting relevant commits better. However, we were able to sample statistically representative sample sizes in this work due to pre-filtering the projects. The results corroborate the representativeness, since we observed the same categories.\nAnother important observation that appears to support the supposition from our traceability analysis that developers remove licensing related issues from the issue tracker is that we found links that were removed in the period of time between our crawling and our data analysis. These were categorized as *Link Broken* and amounted to 45 of the overall issues. It is also possible that these cases represent developers that utilize external bug tracking systems as well.\n\n**Summary for RQ 4** While our open coding analysis, based on grounded theory, indicated some lack of documentation (e.g., prevalence of false positives) and poor quality in documentation with respect to licensing in both issue tracker discussion and commits notes, we formally categorized the available rationale. We also found that the rationale may be incomplete or ambiguously describe the underlying change (e.g., \u201cUpdated copyright info\u201d representing a change between different licenses). Finally, we observed that issue trackers also served as conduits for project authors and external developers to discuss licensing.\n\n## 5 Lessons and Implications\n\nThe analysis of the commit messages and issue tracker discussions highlighted that the information offered with respect to licensing choice/change is very often quite limited. A developer interested in reusing code would be forced to check the source code of the component to understand the exact licensing or to ask for clarification (using the issue tracker, for example). 
Additionally, the reason behind the change is not usually well documented. This detail is particularly important when a system uses external/third-party libraries, since a license may change during the addition or removal of those libraries.

Our open coding analysis also stresses the need for better licensing traceability and for aid in explaining license grants/restrictions. We found several instances in which the issue tracker was used by external developers who sought to reuse the code to ask for clarifications regarding licensing. For example, we observed that developers interpret the implications of licensing differently, which generates misunderstandings in terms of reuse. This suggests that code reuse is problematic for developers due to licensing. Therefore, our study demonstrates a need for clear and explicit licensing information for the projects hosted on a forge.

Similarly, we observed that external developers would request a license, since the projects appeared to be unlicensed; however, a subset of these requests were due to licensing being attributed in a different manner than external developers expected (e.g., as part of the *gemspec* file for Ruby projects and not a *LICENSE* file). We also observed developers adding license files to parent directories as opposed to headers in the source code, as well as appending the license name to the license file (e.g., *LICENSE* would be renamed *LICENSE.MIT*). This way of declaring a license is particularly common in GitHub projects, where the system asks the developer(s) to choose a license when a project is created and then creates the *LICENSE* file in the project's root directory.

These observations indicate a lack of standardization in how licensing is expressed, both among projects in the same language and across projects in different languages. It suggests that developers need a standardized mechanism to declare the license of a software project. Third-party tools or forges could support developers by maintaining this standardized documentation automatically.

Another important observation is the type of difficulty that developers have with the licensing of third-party code and the ways in which they achieve compliance. We observe in both the issue discussions and commit messages that libraries are removed due to incompatible licensing terms. Conversely, libraries are also chosen due to the particular license of their source code. This aspect can be important for open source developers who aim for a wide adoption of their systems, since their choice in licensing can directly impact the adoption of their libraries. Therefore, we foresee that library/code recommenders based on open source code bases should be license-aware. This consideration applies, for example, to approaches recommending code examples or libraries by sensing the developers' context (Cubranic et al. 2005; Holmes and Murphy 2005; Ponzanelli et al. 2013, 2014). In other words, on the one hand, the project's license should be a relevant part of the context; on the other hand, code search engines (e.g., Grechanik et al. 2010; McMillan et al. 2012a, b, c, 2011, 2013) should consider the target code license as a constraint in the search.

The lack of traceability of licensing changes is also important for researchers investigating software licensing on GitHub.
While we cannot generalize to other features, it does suggest that commit message analysis may be largely incomplete with respect to the details of the licensing-related changes made during a commit. One way for developers to improve this documentation is to take advantage of summarization tools such as ARENA (Moreno et al. 2014) and ChangeScribe (Cort\u00e9s-Coy et al. 2014; Linares-V\u00e1squez et al. 2015). While ARENA analyzes and documents licensing changes at the release level, ChangeScribe automatically generates commit messages; however, using ChangeScribe would require extending it to analyze licensing changes at the commit level. Another option is that forges (and software tools in general) verify that every file contains a license and that every project properly documents its license (this feature could be optional). In summary, such support would greatly improve traceability between license changes and their rationale, and would enforce consistency across repositories. Also, it would be beneficial for developers using another project to be informed when a licensing change occurs. For example, a developer could mark specific projects as dependencies and receive automated notifications when particular changes occur. This would be very beneficial for licensing, since a change in the license of a dependency could result in license incompatibilities.\n\nThe open coding of commit messages and issue tracker discussions also suggests that commercial usage of code is a concern in the open source community. Currently, the MIT/X license and the Apache license seem to be the most prominent licenses for this purpose. Indeed, the quantitative analysis of the Java projects also showed a trend towards the use of permissive licenses. The lack of a license is an important consideration in open source development, since it suggests that the code may in fact be closed source (or copyrighted by the original author). We observed such issues in discussions related to the lack of licensing, since it hindered reuse. Indeed, sometimes developers initiate an open source project without attributing a license to it. This is partly because they lack a deep knowledge of how licensing (dis)allows certain types of reuse of their code (Vendome et al. 2015b), but also because there is limited support for the task of choosing the most suitable license for a project. Existing tool support, such as Choose A License, helps users in choosing a license, but the tool is completely context-insensitive with respect to the constraints imposed. Better, context-sensitive tool support is provided by the Markos project (Bavota et al. 2014), but it mainly provides the list of compatible licenses for a given component.\n\n11 http://choosealicense.com\n\n6 Threats to Validity\n\nThreats to construct validity concern the relationship between theory and observation, and relate to possible measurement imprecision when extracting the data used in this study. In mining the Git repositories, we relied on both the GitHub API and the git command line utility. These are both tools under active development and have a community supporting them. Additionally, the GitHub API is the primary interface to extract project information. We cannot exclude imprecision due to the implementation of this API. In terms of license classification, we rely on Ninka, a state-of-the-art approach that has been shown to have 95% precision (Germ\u00e1n et al. 2010b); however, it is not always capable of identifying the license (15% of the time in that study).
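Ninka (Germán et al. 2010b) identifies licenses by splitting the licensing statement of a file into sentences, normalizing them, and matching them against a database of known license sentences. The toy approximation below only mimics that idea with a handful of hand-written sentences; it is a sketch for illustration, not Ninka's actual implementation or sentence database.

```python
import re

# Tiny, illustrative "sentence database"; the real one covers many licenses
# and many normalized sentence variants per license.
KNOWN_SENTENCES = {
    "licensed under the apache license version 2 0": "Apache-2.0",
    "permission is hereby granted free of charge to any person obtaining a copy": "MIT",
    "under the terms of the gnu general public license": "GPL family",
}

def normalize(text):
    # Crude stand-in for the normalization step: lowercase and collapse
    # every run of non-alphanumeric characters into a single space.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def identify(header_text):
    """Return the set of licenses whose known sentences occur in the header."""
    text = normalize(header_text)
    return {lic for sentence, lic in KNOWN_SENTENCES.items() if sentence in text}

header = '''
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
'''
print(identify(header) or {"UNKNOWN"})  # -> {'Apache-2.0'}
```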
Concerning the open coding performed in the context of RQ4, we identified, through stratified sampling, a sample of commit messages and issue tracker discussions large enough to ensure an error of \u00b110% with a confidence level of 95%. The sample was drawn from candidate commit messages and discussions identified by means of pattern matching, using the keywords of Table 1. Although we aimed to build a comprehensive set of licensing-related keywords, it is possible that we missed licensing-related discussions not matching any of these keywords.\n\nThreats to internal validity can be related to confounding factors, internal to our study, that could have affected the results. For the atomic licensing changes, we reduced the threat of having the project size as a confounding factor by representing the presence of a particular change at each commit. A license change is typically handled as a single event rather than a recurring one. By using commit-level analysis, we prevent the number of files from inflating the results so that they do not inappropriately suggest that large numbers of changes occurred in a project. To analyze the changes across projects, we took a binary approach of analyzing the presence of a pattern. Therefore, a particular project would not dominate our results due to its size. To limit the subjectiveness of the open coding, classifications were always performed by two of the authors, and then every case of conflicting classification was discussed as explained in Section 3.3.\n\nThreats to external validity concern the ability to generalize the observations in our study. Our quantitative study is based on the analysis of over 16K Java projects. This makes us confident that our findings have good generalizability for Java systems, while they cannot be extended to systems written in other programming languages. Our qualitative study was instead performed on commit messages and issue discussions extracted from software systems written in seven different languages. However, the generalizability of our qualitative results is limited to the seven considered languages and is further constrained by the relatively low number of considered systems (i.e., 1,160), due to the manual effort required for the identification of the rationale behind licensing decisions (as well as the limited number of potential repositories with license-related commit messages or issue discussions).\n\nGitHub\u2019s exponential growth and popularity as a public forge indicate that it represents a large portion of the open source community. While the exponential growth or relative youth of projects can be seen as impacting the data, these two characteristics represent the growth of open source development and should not be discounted. Additionally, GitHub contains a large number of repositories, but it may not necessarily be a comprehensive set of all open source projects or even all Java projects. However, the large number of projects in our dataset (and the relatively high diversity metric values shown in Section 3.4) gives us enough confidence in the obtained findings. Further evaluation of projects across other open source repositories (and other programming languages for the quantitative part) would be necessary to validate our observations in a more general context. It is also important to note that our observations only consider open source projects.
Since we need to extract licenses from source code, we did not consider any closed source projects and we cannot assert that any of our results would be representative in closed source projects.\n\n7 Conclusions\n\nThis paper reported an empirical study aimed at analyzing, from a quantitative and qualitative point of view, the adoption and change of licenses in open source projects hosted on GitHub. The study consists of (i) a quantitative part, in which we studied license usage and licensing changes in a set of 16,221 Java projects hosted on GitHub, and (ii) a qualitative analysis in which we analyzed commit messages and issue tracker discussions from 1,160 projects hosted on GitHub and developed using seven most popular programming languages (i.e., C, C++, C#, Java, Javascript, Python, and Ruby).\n\nThe quantitative analysis on the Java projects aimed at (i) providing an overview of the kinds of licenses being used over time by different projects, (ii) analyzing licensing changes, and (iii) identifying traceability links between licensing changes and licensing-related discussions. Results indicated that:\n\n\u2013 New license versions were quickly adopted by developers. Additionally, new license versions of restrictive licenses (e.g., GPL-3.0 vs GPL-2.0) favored longer survival of earlier versions, unlike the earlier version of permissive licenses that seem to disappear;\n\u2013 Licensing changes are predominantly toward or between permissive licenses, which ease some kind of derivative work and redistribution, e.g., within commercial products;\n\u2013 There is a clear lack of traceability between discussions and related license changes.\n\nThe qualitative analysis was based on an open coding procedure inspired by grounded theory (Corbin and Strauss 1990), and aimed at categorizing licensing-related discussions and commits. The results indicate that:\n\n\u2013 Developers post questions to the issue tracker to ascertain the project\u2019s license and/or the implications of the license suggesting that licensing is difficult;\n\u2013 There is a lack of standardization or consistency in how licensing is attributed to a system (both within the same programming language and across different programming languages), which causes misunderstandings or confusion for external developers looking to reuse a system;\n\u2013 Developers, in general, do not supply detailed rationale nor document changes in the commit messages or issue tracker discussions;\n\u2013 License compatibility can impact both the adoption and removal of a third-party library due to issues of license compliance.\n\nThis work is mainly exploratory in nature as it is aimed at empirically investigating license usage and licensing changes from both quantitative and qualitative points of view. Nevertheless, there are different possible uses one can make of the results of this paper. Our results indicate that developers frequently deal with licensing-related issues, highlighting the need for developing (semi)automatic recommendation systems aimed at supporting\nlicense compliance verification and management. Additionally, tools compatible or integrated within the forge to support licensing documentation, change notification, education (i.e., picking the appropriate license), and compatibility would benefit developers attempting to reuse code. While working in this direction, one should be aware of possible factors that could influence the usage of specific licenses and the factors motivating licensing changes. 
This paper provides solid empirical results and analysis of such factors from real developers.\n\nFuture work in this area should aim at (i) extending the study by performing a larger quantitative and qualitative analysis on more projects, and (ii) performing a deeper investigation into the rationale for licensing changes, for example, by performing an analysis of dependencies in software projects and relating such analysis with the changes being performed. Last, but not least, as discussed in Section 5, it would be useful to incorporate licensing analysis into existing software recommender systems. Such recommenders could not only rely on the local project\u2019s context, but also exploit rationale from previous licensing changes to produce recommendations.\n\nAcknowledgments This work is supported in part by NSF CAREER CCF-1253837 grant. Massimiliano Di Penta is partially supported by the Markos project, funded by the European Commission under Contract Number FP7-317743. Any opinions, findings, and conclusions expressed herein are the authors\u2019 and do not necessarily reflect those of the sponsors.\n\nReferences\n\n123done issue 139 https://github.com/mozilla/123done/issues/139\nandroid-sensorium issue 11 https://github.com/fmetzger/android-sensorium/issues/11\nBavota G, Canfora G, Di Penta M, Oliveto R, Panichella S (2013) The evolution of project inter-dependencies in a software ecosystem: The case of apache:280\u2013289\nBavota G, Ciemniewska A, Chulani I, De Nigro A, Di Penta M, Galletti D, Galoppini R, Gordon TF, Kedziora P, Lener I, Torelli F, Pratola R, Pukacki J, Rebahi Y, Villalonga SG (2014) The market for open source: an intelligent virtual open source marketplace. In: 2014 software evolution week - IEEE conference on software maintenance, reengineering, and reverse engineering, CSMR-WCRE 2014, Antwerp, Belgium February 3-6, 2014, pp 399\u2013402\nbrackets issue 8337. http://github.com/adobe/brackets/issues/8337\nBrock A (2010) Project harmony: inbound transfer of rights in FOSS projects. Intl. Free and Open Source Software Law Review 2(2):139\u2013150\nCass S. The 2015 top ten programming languages. http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages\nCorbin J, Strauss A (1990) Grounded theory research: procedures, canons, and evaluative criteria. Qual Sociol 13(1):3\u201321\nCort\u00e9s-Coy LF, Linares-V\u00e1squez M, Aponte J, Poshyvanyk D (2014) On automatically generating commit messages via summarization of source code changes. In: 2014 IEEE 14th international working conference on source code analysis and manipulation (SCAM), IEEE, pp 275\u2013284\nCuanto commit. https://github.com/ttop/cuanto/commit/a1e58f2c93de40ab304c494e05853957c549fd44\nCubranic D, Murphy GC, Singer J, Booth K. S. (2005) Hipikat: a project memory for software development. IEEE Trans Softw Eng 31(6):446\u2013465\nCzmq commit. https://github.com/zeromq/czmq/commit/eabe063c2588cde0af90e5ae951a2798b7c5f7e4\nd3-armory issue 5. https://github.com/kovmarci86/d3-armory/issues/5\nDi Penta M, Germ\u00e1n DM, Antoniol G (2010) Identifying licensing of jar archives using a code-search approach. In: Proceedings of the 7th international working conference on mining software repositories, MSR 2010 (Co-located with ICSE), Cape Town, South Africa May 2\u20133, 2010, Proceedings, pp 151\u2013160\nDi Penta M, Germ\u00e1n DM, Gu\u00e9h\u00e9neuc Y, Antoniol G (2010) An exploratory study of the evolution of software licensing. 
In: Proceedings of the 32nd ACM/IEEE international conference on software engineering - Volume 1, ICSE 2010 Cape Town, South Africa, 1\u20138 May 2010, pp 145\u2013154\nDickey DA, Fuller WA (1979) Distributions of the estimators for autoregressive time series with a unit root. J Am Stat Assoc 74:427\u2013431\n\nDickey DA, Fuller WA (1981) Likelihood ratio statistics for autoregressive time series with a unit root. Econometrica 49(4):1057\u20131072\n\nDoll B The octoverse in 2012. http://tinyurl.com/muyxkru. Last accessed: 2015/01/15\n\nDyer R, Nguyen HA, Rajan H, Nguyen TN (2013) Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In: 35th international conference on software engineering, ICSE \u201913, San Francisco, CA USA, May 18\u201326, 2013, pp 422\u2013431\n\nenigma2 commit. https://github.com/openatv/enigma2/commit/b4dfdf09842b3dcacb2a6215fc040f7ebbbb3c03\n\nFree Software Foundation (2015) Categories of free and nonfree software. https://www.gnu.org/philosophy/categories.html. Last accessed: 2015/01/15\n\nF-Droid. https://f-droid.org/. Last accessed: 2015/01/15\n\nGerm\u00e1n DM, Hassan AE (2009) License integration patterns: addressing license mismatches in component-based development. In: 31st international conference on software engineering, ICSE 2009, May 16-24, 2009, Vancouver, Canada, Proceedings, pp 188\u2013198\n\nGerm\u00e1n DM, Di Penta M, Gu\u00e9h\u00e9neuc Y, siblings G. Antoniol. (2009) Code technical and legal implications of copying code between applications. In: Proceedings of the 6th international working conference on mining software repositories, MSR 2009 (Co-located with ICSE), Vancouver, BC Canada May 16-17, 2009 Proceedings, pp 81\u201390\n\nGerm\u00e1n DM, Di Penta M, Davies J (2010a) Understanding and auditing the licensing of open source software distributions. In: The 18th IEEE international conference on program comprehension, ICPC 2010, Braga, Minho, Portugal, June 30-July 2 2010, pp 84\u201393\n\nGerm\u00e1n DM, Manabe Y, Inoue K (2010b) A sentence-matching method for automatic license identification of source code files. In: ASE 2010, 25th IEEE/ACM international conference on automated software engineering, Antwerp Belgium, September 20\u201324 2010, pp 437\u2013446\n\nGitHub API. https://developer.github.com/v3/. Last accessed: 2015/01/15\n\nGNU General Public License (2015). http://www.gnu.org/licenses/gpl.html. Last accessed: 2015/01/15\n\ngtksourcecompletion issue 1. https://github.com/chuchiperriman/gtksourcecompletion/issues/1\n\nGobeille R (2008) The FOSSology project. In: Proceedings of the 2008 international working conference on mining software repositories, MSR 2008 (Co-located with ICSE), Leipzig, Germany May 10\u201311, 2008 Proceedings, pp 47\u201350\n\nGrechanik M, Fu C, Xie Q, McMillan C, Poshyvanyk D, Cumby C (2010) A search engine for finding highly relevant applications. In: Proceedings of the 32Nd ACM/IEEE international conference on software engineering - Volume 1, ICSE \u201910, New York, NY, USA ACM, pp 475\u2013484\n\ngubg commit https://github.com/gfannes/gubg.deprecated/commit/4d291ef433f0596dbd09d5733b25d27b3a921cf4\n\nHolmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: 27th international conference on software engineering (ICSE 2005), 15\u201321 May 2005 St. Louis, Missouri USA, pp 117\u2013125\n\nHowison J, Conklin M, Crowston K FLOSSmole: a collaborative repository for FLOSS research data and analyses. 
IJITWE\u201906 1:17\u201326\n\nHaml commit https://github.com/haml/haml/commit/537497464612f1f5126a526e13e661698c86fd91\n\nIntex issue 1 https://github.com/mtr/intex/issues/1\n\njackson-module-jsonschema issue 35 https://github.com/FasterXML/jackson-module-jsonSchema/issues/35\n\njquery-browserify issue 20 https://github.com/jmars/jquery-browserify/issues/20\n\nLinares-V\u00e1squez M, Cort\u00e9s-Coy LF, Aponte J, Poshyvanyk D (2015) ChangeScribe: A tool for automatically generating commit messages. In: 37th IEEE/ACM international conference on software engineering (ICSE\u201915), formal research tool demonstration, page to appear\n\nManabe Y, Hayase Y, Inoue K (2010) Evolutional analysis of licenses in FOSS. In: Proceedings of the joint ERCIM workshop on software evolution (EVOL) and international workshop on principles of software evolution (IWPSE), Antwerp, Belgium, September 20\u201321, 2010, pp 83\u201387 ACM\n\nMcMillan C, Grechanik M, Poshyvanyk D, Xie Q, Fu C (2011) Portfolio: finding relevant functions and their usage. In: Proceedings of the 33rd international conference on software engineering, ICSE \u201911, New York, NY, USA, ACM\n\nMcMillan C, Grechanik M, Poshyvanyk D (2012a) Detecting similar software applications, pp 364\u2013374\n\nMcMillan C, Grechanik M, Poshyvanyk D, Fu C, Xie Q (2012b) Exemplar: A source code search engine for finding highly relevant applications. IEEE Trans Softw Eng 38(5):1069\u20131087\nMcMillan C, Hariri N, Poshyvanyk D, Cleland-Huang J, Mobasher B (2012c) Recommending source code for use in rapid software prototypes. In: Proceedings of the 34th international conference on software engineering, ICSE '12, Piscataway, NJ, USA, IEEE Press, pp 848\u2013858\n\nMcmillan C, Poshyvanyk D, Grechanik M, Xie Q, Fu C. (2013) Portfolio: searching for relevant functions and their usages in millions of lines of code. ACM Trans Softw Eng Methodol 22(4):37:1\u201337:30\n\nminixwall commit https://github.com/booster23/minixwall/commit/342171fa9e9d769ce4aa48525142a569b34962f7\n\nMoreno L, Bavota G, Di Penta M, Oliveto R, Marcus A, Canfora G (2014) Automatic generation of release notes. In: Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, (FSE-22), Hong Kong, China November 16\u201322 2014, pp 484\u2013495\n\nNagappan M, Zimmermann T, Bird C (2013) Diversity in software engineering research. In: Joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE'13, Saint Petersburg, Russian Federation, August 18\u201326 2013, pp 466\u2013476\n\nneunode issue 5 https://github.com/snakajima/neunode/issues/5\n\nNimble commit https://github.com/bradleybeddoes/nimble/commit/e1e273ff18730d2f8e0d7c2 af1951970e676c8d1\n\nOracle MySQL - FOSS License Exception. http://www.mysql.com/about/legal/licensing/foss-exception/. Last accessed: 2015/01/15\n\nPassenger issue 1482 http://github.com/phusion/passenger/issues/1482\n\npatchelf issue 37 https://github.com/NixOS/patchelf/issues/37\n\nPenta MD, Germ\u00e1n DM (2009) Who are source code contributors and how do they change? In: 16th working conference on reverse engineering, WCRE 2009, 13\u201316 October 2009, Lille France, pp 11\u201320\n\nPF: The OpenBSD Packet Filter http://www.openbsd.org/faq/pf Last accessed: 2015/01/15\n\nPonzanelli L, Bacchelli A, Lanza M (2013) Leveraging crowd knowledge for software comprehension and development. 
In: 17th european conference on software maintenance and reengineering, CSMR 2013, Genova, Italy, March 5\u20138 2013, pp 57\u201366\n\nPonzanelli L, Bavota G, Di Penta M, Oliveto R, Lanza M (2014) Mining stackoverflow to turn the IDE into a self-confident programming prompter. In: 11th working conference on mining software repositories, MSR 2014, Proceedings, May 31 - June 1 Hyderabad, India, pp 102\u2013111\n\nPostgis commit https://github.com/postgis/postgis/commit/4eb4127299382c971ea579c8596cc41cb1c089bc\n\npyelection issue 1 https://github.com/alex/pyelection/issues/1\n\npython-hpilo issue 85 https://github.com/seveas/python-hpilo/issues/85\n\nrcswitch-pi issue 17 https://github.com/r10r/rcswitch-pi/issues/17\n\nRos-comm commit https://github.com/ros/ros_comm/commit/e451639226e9fe4eebc997962435cc454687567c\n\nschevorecipe.db commit https://github.com/Schevo/schevorecipe.db/commit/b73bef14adeb7c87c002a908384253c8f686c625\n\nSingh P, Phelps C (2009) Networks, social influence, and the choice among competing innovations: Insights from open source software licenses. Inf Syst Res 24(3):539\u2013560\n\nSojer M, Henkel J (2010) Code reuse in open source software development: Quantitative evidence, drivers, and impediments. J Assoc Inf Syst 11(12):868\u2013901\n\nSoftware Package Data Exchange (SPDX) http://spdx.org Last accessed: 2015/01/15\n\nState of the Octoverse in 2012 https://octoverse.github.com/ Last accessed: 2015/01/15\n\nSteampp issue 1 https://github.com/seishun/SteamPP/issues/1\n\nsvgeezy issue 20 https://github.com/benhowdle89/svgeezy/issues/20\n\ntablib issue 114 https://github.com/kennethreitz/tablib/issues/114\n\nTardis commit https://github.com/tardis-sn/tardis/commit/07b2a072d89d45c386d5f988f04435d76464750e\n\nThe BSD 2-Clause License. http://opensource.org/licenses/BSD-2-Clause. Last accessed: 2015/01/15\n\nTuunanen T, Koskinen J, K\u00e4rkk\u00e4inen T (2009) Automated software license analysis. Softw Autom Eng 16(3-4):455\u2013490\n\nVendome C, Linares-V\u00e1squez M, Bavota G, Di Penta M, Germ\u00e1n DM, Poshyvanyk D (2015a) License usage and changes: A large-scale study of Java projects on GitHub. In: The 23rd IEEE international conference on program comprehension, ICPC 2015, Florence, Italy, May 18\u201319, 2015. IEEE\n\nVendome C, Linares-V\u00e1squez M, Bavota G, Di Penta M, German DM, Poshyvanyk D (2015b) When and why developers adopt and change software licenses. In: The 31st IEEE international conference on software maintenance and evolution, ICSME 2015 Bremen, Germany, September 29 - October 1, 2015, pages 31\u201340 IEEE\nChristopher Vendome is a fourth year Ph.D. student at the College of William & Mary. He is a member of the SEMERU Research Group and is advised by Dr. Denys Poshyvanyk. He received a B.S. in Computer Science from Emory University in 2012 and he received his M.S. in Computer Science from The College of William & Mary in 2014. His main research areas are software maintenance and evolution, mining software repositories, software provenance, and software licensing. He is member of the IEEE and ACM.\n\nGabriele Bavota is an assistant professor at the Free University of Bolzano-Bozen. received (cum laude) the Laurea in Computer Science from the University of Salerno (Italy) in 2009 defending a thesis on Traceability Management, advised by Prof. Andrea De Lucia. He received the PhD in Computer Science from the University of Salerno in 2013. 
Form January 2013 to October 2014 he has been a research fellow at the Department of Engineering of the University of Sannio. His research interests include software maintenance and evolution, refactoring of software systems, mining software repositories, empirical software engineering, and information retrieval.\nMassimiliano Di Penta is an associate professor at the University of Sannio, Italy since December 2011. Before that, he was assistant professor in the same University since December 2004. His research interests include software maintenance and evolution, mining software repositories, empirical software engineering, search-based software engineering, and service-centric software engineering. He is currently involved as principal investigator for the University of Sannio in a European Project about code search and licensing issues (MARKOS - www.markosproject.eu).\n\nMario Linares-V\u00e1squez is a Ph.D. candidate at the College of William and Mary advised by Dr. Denys Poshyvanyk, and co-founder of liminal ltda. He received his B.S. in Systems Engineering from Universidad Nacional de Colombia in 2005, and his M.S. in Systems Engineering and Computing from Universidad Nacional de Colombia in 2009. His research interests include software evolution and maintenance, software architecture, mining software repositories, application of data mining and machine learning techniques to support software engineering tasks, and mobile development. He is member of the IEEE and ACM.\nDaniel M. German is an Associate Professor at the University of Victoria in Victoria, Canada. He received his Ph.D. degree in Computer Science from University of Waterloo in Canada. His research interests are in software engineering. In particular, software evolution, open source and intellectual property.\n\nDenys Poshyvanyk is an Associate Professor at the College of William and Mary in Virginia. He received his Ph.D. degree in Computer Science from Wayne State University in 2008. He also obtained his M.S. and M.A. degrees in Computer Science from the National University of Kyiv-Mohyla Academy, Ukraine and Wayne State University in 2003 and 2006, respectively. His research interests are in software engineering, software maintenance and evolution, program comprehension, reverse engineering, software repository mining, source code analysis and metrics. 
He is a member of the IEEE and ACM.", "source": "olmocr", "added": "2025-06-23", "created": "2025-06-23", "metadata": {"Source-File": "/home/nws8519/git/adaptation-slr/studies_pdfs/027_vendome.pdf", "olmocr-version": "0.1.76", "pdf-total-pages": 41, "total-input-tokens": 96226, "total-output-tokens": 33542, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 2169, 1], [2169, 5707, 2], [5707, 9904, 3], [9904, 13656, 4], [13656, 17064, 5], [17064, 20544, 6], [20544, 24154, 7], [24154, 28035, 8], [28035, 32074, 9], [32074, 34328, 10], [34328, 37963, 11], [37963, 41867, 12], [41867, 44112, 13], [44112, 46256, 14], [46256, 49928, 15], [49928, 53253, 16], [53253, 55923, 17], [55923, 58832, 18], [58832, 62461, 19], [62461, 66113, 20], [66113, 71904, 21], [71904, 74861, 22], [74861, 77946, 23], [77946, 81225, 24], [81225, 84722, 25], [84722, 88083, 26], [88083, 91553, 27], [91553, 95182, 28], [95182, 98478, 29], [98478, 100520, 30], [100520, 103580, 31], [103580, 107347, 32], [107347, 111262, 33], [111262, 115205, 34], [115205, 118474, 35], [118474, 122483, 36], [122483, 127060, 37], [127060, 131521, 38], [131521, 132680, 39], [132680, 133856, 40], [133856, 134754, 41]]}}
{"id": "bbd4aa855a9647b306e6be994a6940715662fa28", "text": "Understanding the Usage, Impact, and Adoption of Non-OSI Approved Licenses\n\nR\u00f4mulo Meloca1, Gustavo Pinto2, Leonardo Baiser1, Marco Mattos1, Ivanilton Polato1, Igor Scaliante Wiese1, Daniel M German3\n\n1Federal University of Technology \u2013 Paran\u00e1 (UTFPR), 2University of Par\u00e1 (UFPA), 3University of Victoria\n\nABSTRACT\n\nThe software license is one of the most important non-executable pieces of any software system. However, due to its non-technical nature, developers often misuse or misunderstand software licenses. Although previous studies reported problems related to licenses clashes and inconsistencies, in this paper we shed the light on an important but yet overlooked issue: the use of non-approved open-source licenses. Such licenses claim to be open-source, but have not been formally approved by the Open Source Initiative (OSI). When a developer releases a software under a non-approved license, even if the interest is to make it open-source, the original author might not be granting the rights required by those who use the software. To uncover the reasons behind the use of non-approved licenses, we conducted a mix-method study, mining data from 657K open-source projects and their 4,367K versions, and surveying 76 developers that published some of these projects. Although 1,058,554 of the project versions employ at least one non-approved license, non-approved licenses account for 21.51% of license usage. We also observed that it is not uncommon for developers to change from a non-approved to an approved license. When asked, some developers mentioned that this transition was due to a better understanding of the disadvantages of using an non-approved license. This perspective is particularly important since developers often rely on package managers to easily and quickly get their dependencies working.\n\nCCS CONCEPTS\n\n\u2022 Software and its engineering \u2192 Open source model;\n\nKEYWORDS\n\nOpen Source Software, Software license, OSI approved\n\nACM Reference Format:\n\nR\u00f4mulo Meloca1, Gustavo Pinto2, Leonardo Baiser1, Marco Mattos1, Ivanilton Polato1, Igor Scaliante Wiese1, Daniel M German3. 2018. Understanding the Usage, Impact, and Adoption of Non-OSI Approved Licenses. In Proceedings of MSR \u201918: 15th International Conference on Mining Software Repositories, Gothenburg, Sweden, May 28\u201329, 2018 (MSR \u201918), 11 pages. https://doi.org/10.1145/3196398.3196427\n\n1 INTRODUCTION\n\nThe software licenses are one of the most important non-executable part of any software system [5]. Particularly relevant to open-source software (OSS), open-source licenses not only drive how one can use an OSS but also ensure to what extent others can reuse it [19]. Similarly to software code, software licenses change [27] and evolve [25]. Software relicensing is, indeed, commonplace in open-source software world [7]. As an example, Facebook recently relicensed four key open-source softwares from BSD + Patents to MIT license1. According to them, this change was motivated by an unhappy community looking for alternatives under permissive licenses. This concern, however, pertains not only to large software companies that maintain open-source softwares, since software license is a common good of any open-source software. 
Therefore, there is no surprise that software licensing is an active research field [1, 4, 16, 23].\n\nDespite of its importance, developers do not fully understand problems related to license usage [1], such as the lack of licenses or license inconsistencies. The way developers develop software only exacerbates this problem, since simple actions such as copying a code snippet from the web has the potential of infringing a software license [12, 13]. This issue becomes even more relevant in the open-source era, where a constant flow of new open-source software is born at a regular basis [10]. That is, developers have a myriad of codebases to refer to, but the way they do might infringe a software license (and consequently the whole chain of software that depends on it).\n\nAnother relevant yet not fully understood problem is the use of open-source licenses that have not been approved by OSI, the Open Source Initiative (see Section 2 for details). Such software licenses were not formally approved by an open-source regulator and, therefore, has not been vetted to be open-source. Currently, OSI maintains a list with 83 approved open-source software licenses2. All these licenses went through a rigorous review process, and not all licenses submitted are approved (e.g., the CC0 license3 has been submitted but was not approved). According to their website, the purpose of the OSI\u2019s license review process is to \u201c(1) Ensure approved licenses conform to the Open Source Definition (OSD), (2) Identify appropriate License Proliferation Category, (3) Discourage vanity and duplicative Licenses\u201d4. Furthermore, because OSI defined what open source is (the Open Source Definition) it claims that \u201conly software licensed under an OSI-approved Open Source license should be labeled \u2018Open Source\u2019 software.\u201d5\n\n1https://code.facebook.com/posts/300798627056246\n2https://opensource.org/licenses/alphabetical\n3https://opensource.org/faq#cc-zero\n4https://opensource.org/approval\n5https://opensource.org/faq\nIn this study, we investigate to what extent software licenses that do not provide open-source guarantees (or \u201cnon-approved licenses\u201d for short) are used in open-source projects published on package managers. Package managers are particularly relevant to license usage due to at least two reasons: (1) they are growing up faster in terms of number of libraries available and packages published [3, 28], and (2) since packages obey a standardized architecture [22], installing and reusing a third-party package comes with no pain. Therefore, packages published in package managers might have a higher number of dependencies than those that do not rely on a package manager. As we shall see in Section 4, on average, a package at NPM has 4.80 dependencies (3rd Quartile: 5, Max: 792).\n\nIn this paper we study three well-known package managers: NPM (Node Package Manager), RubyGems and CRAN (The Comprehensive R Archive Network). For each one of these package managers, we downloaded and investigated all packages available on them. After this process, we ended up with a comprehensive list of 657,811 software packages scattered through the three well-known, long lived package managers. Specifically, we investigated 510,964 NPM packages, 11,366 CRAN packages, and 135,481 RubyGems packages. 
Still, in order to provide an evolutionary perspective of the license usage on these packages, we studied 4,367,440 different packages versions (3,539,494 on NPM and 816,580 on RubyGems, and 11,366 on CRAN. We manually analyzed each license employed in each one of these package versions.\n\nThis paper makes the following contributions:\n\n1. We conducted the largest study on licenses usage and evolution targeting \u223c660k packages (and their 4.3 million versions) published in three well-known package managers (NPM, RubyGems and CRAN).\n2. We studied the impact of the use of non-approved licenses comprehending the whole dependency chain.\n3. We deployed a survey with 76 package publishers (package developers, owners, or authors) to understand how and why do they use non-approved licenses.\n\n2 BACKGROUND ON OPEN-SOURCE LICENSES\n\nThe Open Source Definition [17], published by OSI defines 10 properties that a software license must satisfy to be called Open Source. OSI has also established an approval process, through which a license will be approved as Open Source. As of today, only 83 licenses have been approved (although many other have been submitted). Other organizations also approve licenses to be open source, such as the Free Software Foundation (FSF), and the Debian Foundation (these two call them Free Software Licenses\u2014with one exception, the NASA Open Source Agreement 1.3, all OSI approved licenses are considered free software by the FSF7).\n\nIn the scope of this paper, we consider licenses approved by OSI only. This decision was motivated by the fact that differently than FSF, which can both develop and approve licenses, OSI does not develop \u2014 only approves \u2014 licenses. Since a license can be submitted by anyone interested in contributing to open-source, the community participation, a crucial aspect of modern open-source software [2, 18], is much more strong at the OSI side.\n\nTo better understand the approval process and the implications of not using an OSI approved license, we conducted a semi-structured interview with an OSI\u2019s board member. According to him, anybody can submit a license for OSI approval. During the certification process, everyone is invited to participate in the review and discussion about the license. The goal of the certification process is to make sure that the submitted license meets all criteria stated at the Open-Source Definition. If the licenses satisfies the requirements set by the Open-Source definition, the license is approved.\n\nOne of the main benefits of using an OSI approved license is the guarantee that OSI\u2014and the open source community at large\u2014has vetted the license and that the license is widely known. Therefore, the community can understand, trust, and use a license. Otherwise, if there was no OSI, everyone could develop a new license and claim that it was open-source; this would require that those using the software hire lawyers to understand such license.\n\nThis means that, even if some license is very popular in other domains, such as the Create Commons Zero (CC0) license, software released under CC0 is not open-source software. According to the board member, more importantly, this threat applies recursively: \u201cif project \u2018A\u2019 (which uses an OSI approved license) depends on project \u2018B\u2019 (which does not use an OSI approved license), this would be as dangerous as project \u2018A\u2019 not using an OSI approved license\u201d. 
Nevertheless, if one is interested in publishing software assets only (such as data or images), such open-source data can be safely released under CC0 (the requirements of the OSD do not apply to assets). A similar issue occurs when one does not state any license. In this case, the original author has not granted any rights to the recipient of the software. That is, without permission from the original author, no one can use, redistribute, or create derivative works, which is clearly the opposite of the open-source concept.\n\n3 METHOD\n\nIn this section we present our research questions and method, the data gathered, and our ground definitions.\n\n3.1 Research Questions\n\nThe main goal of this study is to gain an in-depth understanding of non-approved open-source licenses. We designed the following three research questions to guide our research:\n\n- **RQ1**: How common are non-approved licenses on software packages?\n- **RQ2**: What is the impact of non-approved licenses on the package managers ecosystem?\n- **RQ3**: Why do developers adopt non-approved licenses?\n\nTo answer these questions, we conducted a two-phase study, adopting a sequential mixed-method approach. First, we collected data about license usage and evolution on a corpus of \u223c660k software packages (Section 3.2). After that, we performed a survey targeting 76 package publishers (Section 3.3).\n\n3.2 First study: mining license usage\n\n3.2.1 Package and Package Managers. In our first study, we mined license information of software packages hosted in three well-known, long-lived package managers: NPM, RubyGems, and CRAN. The package managers studied have the following characteristics:\n\n- **NPM** manages and indexes Node.js packages. Node.js is a JavaScript runtime environment. The NPM package manager was launched in 2009 and, as of October 2017, it contains over 521K packages. Although it offers support for maintaining packages on-site (it has a version control system), most of the packages available on it are maintained elsewhere (e.g., GitHub). To submit a package to NPM, a user must create an account and push the package using the NPM software utility.\n\n- **RubyGems** manages and indexes Ruby packages. RubyGems was launched in 2009 and, as of October 2017, it contains over 192K packages. It also offers support for maintaining packages on-site, but most of the packages published are maintained elsewhere (e.g., GitHub). RubyGems distributes binaries (i.e., a gem file) through its web interface. Anyone interested in submitting a package to RubyGems must create an account and push the package using the gem software utility.\n\n- **CRAN** manages and indexes R packages. Unlike NPM and RubyGems, CRAN distributes both the source and binary code of the packages published on it. CRAN was launched in 1998 and, as of October 2017, it contains over 11K packages. Anyone interested in submitting a package to CRAN needs to create an account and submit the package through the CRAN web interface.\n\nThese package managers host several well-known and non-trivial software packages, including React on NPM, Rails on RubyGems, and ggplot2 on CRAN. Packages in these package managers are downloaded millions of times per month. For instance, in September 2017 alone, the NPM packages BlueBird\\(^\\text{11}\\), React\\(^\\text{12}\\), and Lodash\\(^\\text{13}\\) were downloaded more than 69 million times in total (18 mi, 6 mi, and 45 mi, respectively).
Package managers also make package releases (i.e., new versions) available. Table 1 presents the distribution of versions per package. As we can see, 56% of the packages published at NPM have up to three versions (58% on RubyGems, and 75% on CRAN). Packages with 10 or more versions are also common (17% on NPM, 16% on RubyGems, but 0.5% on CRAN). Generally speaking, CRAN has fewer package versions than NPM and RubyGems.\n\n3.2.2 Data Collection. We created an infrastructure to download, extract data, and match dependencies between package versions. Our infrastructure downloaded metadata for all packages available on the three package managers. Both NPM and RubyGems provide an API to collect relevant data\\(^\\text{14}\\). Our infrastructure gathers CRAN metadata by navigating through its public HTML files. For CRAN and NPM, we collected our data on September 7th, 2017. We collected RubyGems metadata on September 15th, 2017. Table 2 depicts the metadata downloaded for each package version in each package manager.\n\n| # of Versions | CRAN | NPM | RubyGems |\n|---------------|------|-----|----------|\n| 1 | 8,848| 150,546| 42,668 |\n| 2 | 1,942| 80,243| 22,720 |\n| 3 | 360 | 55,028| 15,089 |\n| 4 | 140 | 39,890| 10,743 |\n| 5 | 67 | 30,192| 7,688 |\n| 6 | 38 | 22,886| 5,814 |\n| 7 | 30 | 18,190| 4,549 |\n| 8 | 12 | 15,105| 3,550 |\n| 9 | 17 | 12,000| 2,870 |\n| \u226510 | 67 | 86,884| 19,790 |\n\nAfter downloading the metadata, our infrastructure validated whether a (downloaded) package \\(X\\) depends on an (also downloaded) package \\(Y\\). We validated dependencies using the version number stated in package \\(X\\) and the version number defined in package \\(Y\\). The three package managers use the notion of delimiters to express a range of possible versions that are compatible with a given package. Examples of delimiters include the characters \u201c\\(>\\)\u201d, \u201c\\(<\\)\u201d, \u201c\\(^\\sim\\)\u201d, and \u201c\\(^\\wedge\\)\u201d. For example, a package \\(X\\) that depends on the \u2018react\u2019 package can declare a dependency as \u201creact@\\(^\\sim\\)15.0.0\u201d, which indicates that package \\(X\\) depends on any version compatible with react@15.0.0. In addition, in NPM and RubyGems, package publishers can use the \u201c\\(x\\)\u201d character to specify a small range of versions (e.g., 1.1.x or 1.x). To match dependencies, we selected the first version available that matched the pattern. As an example, the NPM package \u2018gulp\u2019, version \u20182.6.0\u2019 (gulp@2.6.0 for short), depends on package event-stream@3.0.x. As a result, our infrastructure successfully matched package gulp@2.6.0 to its event-stream@3.0.0 dependency. This matching procedure is important for the impact analysis (RQ2).\n\nWe downloaded data using three Google Cloud Platform VMs. We used one dual-core VM with 7.5Gb of main memory and 20Gb of SSD, and two single-core VMs with 3.5Gb of main memory and 10Gb of hard disk. After downloading, our dataset occupied 1.2Gb of disk space (1.1Gb of NPM data, 4.6Mb of CRAN data, and 182Mb of RubyGems data).
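As a concrete illustration of the matching step just described, the sketch below resolves a dependency declared with an "x" wildcard (e.g., event-stream@3.0.x) to the first published version that fits the pattern. It deliberately ignores the richer "~", "^", and comparison operators that the real package managers support, and the list of published versions is an assumption for the example.

```python
def matches(pattern, version):
    """True if a concrete version (e.g., '3.0.0') fits an 'x' wildcard pattern (e.g., '3.0.x')."""
    p_parts, v_parts = pattern.split("."), version.split(".")
    if len(v_parts) < len(p_parts):
        return False
    return all(p.lower() == "x" or p == v for p, v in zip(p_parts, v_parts))

def resolve(pattern, published):
    """Return the first published version matching the pattern, as done in the study."""
    return next((v for v in published if matches(pattern, v)), None)

# gulp@2.6.0 declares a dependency on event-stream@3.0.x;
# the available versions below are assumed for illustration.
published_event_stream = ["2.1.7", "3.0.0", "3.0.2", "3.1.0"]
print(resolve("3.0.x", published_event_stream))  # -> 3.0.0
```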
The infrastructure used as well as the data collected can be found at the companion website\\(^\\text{15}\\).\n\nTable 3 shows the distribution of number of licenses per package version.\n\n| # of Licenses | CRAN | NPM | RubyGems |\n|---------------|----------|----------|----------|\n| 0 | 0 | 369,914 | 394,582 |\n| 1 | 5,346 | 3,158,391| 419,095 |\n| 2 | 5,881 | 10,287 | 2,411 |\n| 3 | 130 | 669 | 355 |\n| 4 | 6 | 222 | 29 |\n| 5 | 1 | 11 | 61 |\n| 6 | 2 | 0 | 46 |\n| 10 | 0 | 0 | 1 |\n\nAs we can see, the majority of packages have a single license. Interestingly, no package with no license could be found at CRAN. This happens because CRAN does not publish packages without the selection of a license\\(^\\text{16}\\). Still, package versions with two or more licenses are common. For instance, the package `sixarm_ruby_unaccent@1.1.2`, published at RubyGems, was released with 10 licenses (they are: apache-2.0, artistic-2.0, bsd-3-clause, cc-by-nc-sa-4.0, agpl-3.0, gpl-3.0, lgpl-3.0, mit, mpl-2.0, and ruby).\n\nTable 4 presents the number of dependencies per version package. Approximately 29% of NPM package versions have no dependencies (39% for CRAN and 30% for RubyGems, respectively).\n\n| # of Dependencies | CRAN | NPM | RubyGems |\n|-------------------|----------|----------|----------|\n| 0 | 6,435 | 1,047,089| 258,810 |\n| 1 | 1,782 | 537,283 | 194,312 |\n| 2 | 1,701 | 412,121 | 143,616 |\n| 3 | 1,517 | 322,234 | 84,679 |\n| 4 | 1,183 | 241,449 | 51,338 |\n| 5 | 978 | 180,349 | 31,424 |\n| 6 | 733 | 139,429 | 22,698 |\n| 7 | 521 | 111,070 | 13,720 |\n| 8 | 436 | 85,631 | 11,302 |\n| 9 | 323 | 69,024 | 8,699 |\n| \u226510 | 1,060 | 472,466 | 32,879 |\n\nAlthough the average number of dependencies per package version is 3.8, outliers were found. For instance, the CRAN package `seurat@2.0.1` has 41 dependencies, the RubyGems package `aws-sdk-resources@3.1.0` has 105 dependencies, and the NPM package `primeng-custom@4.0.0-beta.1` has 500 dependencies.\n\n3.2.3 License Groups. As aforementioned, we downloaded metadata for 657,811 software packages (510,964 NPM packages, 11,366 CRAN packages, and 135,481 RubyGems packages), spanning 4,367,440 versions (3,539,494 on NPM and 816,580 on RubyGems, and 11,366 on CRAN). When analyzing the licenses with which each version was released, we found that some of them included typos or wrong names. This happened because NPM and RubyGems allow one to fill the license field with any information. We then manually normalized each license found.\n\nThe normalization process was conducted in pairs, followed by conflict resolution meetings. For each license, two authors checked if it (1) was approved by OSI, (2) was not approved but was defined somewhere else, i.e., in the Software Package Data Exchange\\(^\\text{17}\\), (3) was not approved neither not defined anywhere else. Licenses not found at OSI list neither at SPDX were allocated in the Other category. To check whether the license was already defined, we searched for its specification on blog posts, Q&A websites, and mailing lists. If the formal specification of a license was not found, the license was included on the non-approved license group. After this process, we ended up with six license groups, namely:\n\n- **OSI licenses**: Any licenses approved by OSI. For this case, we also fixed small issues, such as trivial typos. 
As an example, we successfully normalized the \"apache 2\" license to its correct form, \"apache-2.0\".\n- **Incomplete licenses**: Licenses that are probably approved, but for which we could not resolve some issues. For instance, package publishers often omit the version number, e.g., \"bsd\" or \"lgpl\", so we could not be sure about which license version was used.\n- **SPDX (but not OSI) licenses**: These are the licenses listed in the SPDX License List\\(^\\text{18}\\) that were not formally approved by OSI. This group includes popular, well-defined licenses, such as the \"Do What the Fuck You Want to Public License\" (WTFPL) or the \"Creative Commons Zero\" (CC0) license.\n- **Missing or Absence of a license**: We aggregated in this group package versions without any license at all, i.e., when package publishers left the license field empty or explicitly filled it with the word NONE. This is a sub-category of copyright licenses because, as discussed in Section 2, when no license is declared, the original author retains all rights.\n- **Other**: Any licenses with unrecognizable typos, wrong names, or even curses. Examples include: the \"d\" license and the \"Not specified\" license. Additionally, we included in this group licenses for which the package publisher put an external link in the license information. We did not inspect each such file individually, and this data was not included in any of the analyses we conducted because it represents less than 0.5%.\n- **Copyright licenses**: This occurs when package publishers explicitly mention that they retain the copyright. Examples include the \"my own\" license, the \"(c) Copyright\" license, or the \"all rights reserved\" license.\n\nAt the end of this normalization process, we ended up with 973 distinct licenses (758 at NPM, 46 at CRAN, and 336 at RubyGems).\n\n\\(^{15}\\)https://github.com/rmeloca/EcosystemsAnalysis\n\n\\(^{16}\\)https://cran.r-project.org/web/packages/policies.html\n\n\\(^{17}\\)https://spdx.org/\n\n\\(^{18}\\)https://spdx.org/licenses/\n\nNon-approved licenses comprise all license groups except OSI licenses and Incomplete licenses.\n\n### 3.3 Second study: a survey with package publishers\n\nIn our second study, we deployed a survey with package publishers of the NPM package manager. We focused on this package manager because (1) the email addresses of the package publishers could be recovered and (2) packages in this package manager exhibit the greatest number of dependencies, and are thus more likely to affect or be affected if a license inconsistency is found. We used the following criteria to identify our population: we selected the package publishers of package versions released under a non-approved license with at least one dependency. This ensures that the irregularity propagates to other packages. After applying the criteria, we obtained 385 package publishers from different projects.\n\nOur survey was based on the recommendations of Smith et al. [21], employing principles for increasing survey participation, such as sending personalized invitations, allowing participants to remain completely anonymous, and asking closed and direct questions as much as possible. Our survey had 14 questions (three of which were open), grouped into three broad interests: demographics (e.g., what is your gender? and what is your profession?), understanding non-approved adoption (e.g., why did you choose it? and are you aware of the implications?), and usage frequency (e.g., how often do you use non-approved licenses?
and how often do you not declare a license?). The open questions were analyzed in pairs, followed by conflict resolution meetings. Participation was voluntary and the estimated time to complete each survey was 5-10 minutes. When sending our invitation email, 8 messages were not delivered due to technical reasons. We received 76 responses, representing a 20% response rate. The survey is available at: https://goo.gl/Jiuwzp.\n\n### 4 RESULTS\n\nIn this section, we report the results of our study grouped by each research question.\n\n#### 4.1 RQ1. How common are non-approved licenses on software packages?\n\nAfter the normalization process, we found a total of 973 distinct licenses. These licenses were declared a total of 4,369,024 times. The number of license declarations is higher than the number of package versions, given that one package version often employs more than one license (as shown in Table 3). Table 5 shows the distribution of each license group.\n\nAs we can see, non-approved licenses (all license groups defined in Section 3.2.3 except OSI licenses and Incomplete licenses) were used 858,311 times, which corresponds to roughly 20% of the overall license usage. Most of them, nevertheless, are related to the absence of a license. We found 764,496 package versions without any license declaration (which accounts for 89% of the non-approved license usage). In particular, on RubyGems, missing licenses correspond to 48% of the total licenses used (10.41% on NPM).\n\nWe also studied license usage from an evolutionary perspective. In order to provide a general overview, Table 6 groups evolution patterns of license changes. We pairwise analyzed all available versions in order to verify how many times a license changed from one group to another. The results show that package versions, regardless of the package manager, tend to keep the license used across their versions. Therefore, the main diagonal always has the highest values.
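The pairwise analysis behind Table 6 can be pictured with a small sketch: for each package, consecutive versions are compared and a transition matrix over the license groups of Section 3.2.3 is accumulated. The package data below is invented for illustration; only the counting logic reflects the described procedure.

```python
from collections import Counter

# Illustrative input: for each package, the license group of its successive
# versions, using the groups of Section 3.2.3 (OSI, INC, SPDX, MISS, OTH, COP).
package_versions = {
    "pkg-a": ["MISS", "MISS", "OSI"],  # adds an approved license in its third release
    "pkg-b": ["OSI", "OSI", "OSI"],    # keeps the same (approved) license
    "pkg-c": ["OSI", "MISS"],          # drops its license declaration
}

transitions = Counter()
for groups in package_versions.values():
    for prev, curr in zip(groups, groups[1:]):  # pairwise over consecutive versions
        transitions[(prev, curr)] += 1

for (prev, curr), count in transitions.most_common():
    print(f"{prev} -> {curr}: {count}")
```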
For instance, at NPM we found that 311,455 package versions without any license associated still had this non-approved license in the next version.\n\n#### Table 5: License Groups on Package Versions\n\n| Group | CRAN | NPM | RubyGems | TOTAL |\n|---------------------|------|-------|----------|--------|\n| OSI | 15,724 | 3,009,782 | 403,693 | 3,429,199 |\n| INCOMPLETE | 34 | 73,647 | 7,833 | 81,514 |\n| SPDX but not OSI | 162 | 30,688 | 6,215 | 37,065 |\n| MISSING | 8 | 400,618 | 396,178 | 796,804 |\n| OTHER | 220 | 10,978 | 4,953 | 16,151 |\n| COPYRIGHT | 0 | 7,106 | 1,185 | 8,291 |\n\n#### Table 6: Patterns of license evolution\n\n| NPM From\\To | OSI | INC | SPDX | MISS | OTH | COP |\n|-------------|-----|-----|------|------|-----|-----|\n| OSI | 2,576,692 | 3,012 | 2,060 | 2,125 | 423 | 116 |\n| INC | 4,573 | 61,535 | 44 | 144 | 363 | 205 |\n| SPDX | 2,153 | 26 | 25,489 | 182 | 78 | 56 |\n| MISS | 8,911 | 321 | 256 | 337,711 | 87 | 23 |\n| OTH | 502 | 345 | 99 | 51 | 9,231 | 241 |\n| COP | 200 | 212 | 58 | 19 | 267 | 6,424 |\n\n| RubyGems From\\To | OSI | INC | SPDX | MISS | OTH | COP |\n|------------------|-----|-----|------|------|-----|-----|\n| OSI | 336,639 | 505 | 574 | 380 | 553 | 37 |\n| INC | 854 | 6,575 | 99 | 5 | 116 | 0 |\n| SPDX | 618 | 82 | 5,095 | 51 | 270 | 1 |\n| MISS | 8,112 | 329 | 279 | 324,153 | 185 | 10 |\n| OTH | 808 | 119 | 272 | 9 | 4,197 | 14 |\n| COP | 50 | 1 | 1 | 5 | 15 | 1,029 |\n\nSince the changes from approved to non-approved are the most relevant ones to our study, we counted how many times a package version changed from an OSI-approved license to a non-approved license, and vice-versa. We identified these changes in 12,491 packages at RubyGems and 24,075 packages at NPM. Among these package, on RubyGems, 10,442 package versions changed from a non-approved to an approved license. In this case, the publishers corrected their wrong license as presented in Table 8.\n\nInterestingly, the number of changes from an approved to a non-approved license was much lesser. On RubyGems, we found only 2,049 package versions that changed from an approved license to a non-approved license. A similar behavior occurred at NPM. The number of changes from a non-approved license is much greater than the opposite: (16,339 package versions changed from a non-approved license to an approved one, whereas 7,736 package\nversions changed from an approved to a non-approved one). As an example, when upgrading from zorg@0.0.1 to zorg@0.0.10, the NPM package changed from the know \"ISC\" license to no license at all. We did not performed this analysis at CRAN because it does not provide such information.\n\nTo provide a more fine-grained perspective about the evolution patterns, we analyzed the top 10 most common changes from an approved license to a non-approved license, and vice-versa. Table 7 presents the evolution patterns, focusing on changes from an approved to a non-approved license. The majority of changes observed were when changing from MIT license to no license at all (1,286 instances found on NPM, and 248 on RubyGems). The effects of a missing license are exactly the opposite a developer might think: it applies the copyright instead of opening the source code. Therefore, the migration from a missing license to the MIT license can be explained as a correction of this effect, specially due to the permissive characteristics of such license. 
This evidence is supported by Almeida [1] and by our findings that developers might not fully understand the process of licensing software.\n\nTable 7: The 10 Most Common License Evolution Patterns: From Approved to Non-Approved\n\n| NPM Evolution Pattern | # | RubyGems Evolution Pattern | # |\n|-----------------------|---|----------------------------|---|\n| mit \u2192 missing | 1,286 | mit \u2192 missing | 248 |\n| isc \u2192 missing | 604 | apache-2.0 \u2192 missing | 85 |\n| apache-2.0 \u2192 missing | 116 | bsd-3-clause \u2192 missing | 33 |\n| bsd-2-clause \u2192 missing | 37 | lgpl-2.0 \u2192 missing | 4 |\n| gpl-3.0 \u2192 missing | 20 | gpl-3.0 \u2192 missing | 4 |\n| bsd-3-clause \u2192 missing | 19 | bsd-2-clause \u2192 missing | 2 |\n| gpl-2.0 \u2192 missing | 12 | gpl-2.0 \u2192 missing | 2 |\n| lgpl-3.0 \u2192 missing | 9 | lgpl-3.0 \u2192 missing | 1 |\n| fair \u2192 missing | 9 | ms-pl \u2192 missing | 1 |\n| mpl-2.0 \u2192 missing | 7 | \u2014 | \u2014 |\n\n**RQ1 Summary.** We found 1,058,554 package versions (24.23%) released under non-approved licenses. Packages published on RubyGems are the most affected ones (55% of them employed a non-approved license). Missing licenses (the lack of any license) are widespread. When a license change occurs, most package versions keep the same license, although changes from a non-approved to an approved license, and vice-versa, are common.\n\nTable 8: The 10 Most Common License Evolution Patterns: From Non-Approved to Approved\n\n| NPM Evolution Pattern | # | RubyGems Evolution Pattern | # |\n|-----------------------|---|----------------------------|---|\n| missing \u2192 mit | 6,667 | missing \u2192 mit | 6,556 |\n| missing \u2192 isc | 831 | missing \u2192 apache-2.0 | 614 |\n| missing \u2192 apache-2.0 | 633 | missing \u2192 gpl-3.0 | 239 |\n| missing \u2192 bsd-3-clause | 262 | missing \u2192 gpl-2.0 | 153 |\n| missing \u2192 gpl-3.0 | 137 | missing \u2192 bsd-3-clause | 133 |\n| missing \u2192 bsd-2-clause | 91 | missing \u2192 lgpl-3.0 | 86 |\n| missing \u2192 gpl-2.0 | 85 | missing \u2192 bsd-2-clause | 81 |\n| missing \u2192 lgpl-3.0 | 61 | missing \u2192 artistic-2.0 | 73 |\n| missing \u2192 mpl-2.0 | 49 | missing \u2192 agpl-3.0 | 33 |\n| missing \u2192 agpl-3.0 | 35 | missing \u2192 lgpl-2.1 | 31 |\n\n#### 4.2 RQ2. What is the impact of non-approved licenses on the package manager ecosystems?\n\nTo understand the impact of a non-approved license, we calculated two types of metrics (irregular and affected) at three different granularities (graph order).\n\n- **Irregular.** A package is called irregular if at least one of its versions has a direct dependency on a package released under a non-approved license. An irregular package can affect other packages that depend on it.\n\n- **Affected.** A package is affected if at least one of its versions has a direct or indirect dependency on a package that is irregular. A direct dependency is when a parent package (affected) depends on its child (irregular); an indirect dependency is when there is more than one level between the affected and the irregular package.\n\nWith these metrics, we analyzed the whole dependency graph of all package versions; a simplified sketch of how both sets can be computed over such a graph is shown below. Table 9 shows the impact of non-approved licenses in terms of packages, versions, and dependencies. In terms of packages, although NPM has more irregular and affected packages in absolute numbers, RubyGems presents a higher proportion of irregular (46% vs 18%) and affected (55% vs 38%) packages than NPM, meaning that almost half of the packages published on RubyGems are irregular.
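The following sketch illustrates how the two sets could be derived from the dependency graph. It is a simplified, hypothetical implementation, not the study's tooling: the graph is assumed to be available as an adjacency map from each package version to its direct dependencies, the non-approved check is injected as a predicate, and all names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative sketch: computes the "irregular" and "affected" sets over a dependency graph.
public class LicenseImpact {

    /** deps maps a package version to the versions it directly depends on. */
    static Set<String> irregular(Map<String, List<String>> deps, Predicate<String> nonApproved) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            // Irregular: at least one direct dependency is released under a non-approved license.
            if (e.getValue().stream().anyMatch(nonApproved)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Affected: depends directly or indirectly on an irregular version. */
    static Set<String> affected(Map<String, List<String>> deps, Set<String> irregular) {
        // Invert the graph once so we can walk from irregular versions up to their dependents.
        Map<String, Set<String>> dependents = new HashMap<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            for (String dep : e.getValue()) {
                dependents.computeIfAbsent(dep, k -> new HashSet<>()).add(e.getKey());
            }
        }
        Set<String> result = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>(irregular);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String parent : dependents.getOrDefault(current, Set.of())) {
                if (result.add(parent)) {
                    queue.add(parent);
                }
            }
        }
        return result;
    }
}
```

Walking upwards from the irregular versions visits each edge at most once, which keeps the analysis feasible even for the millions of dependencies reported in Table 9.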
The low number of packages, versions, and dependencies affected at CRAN is because CRAN prevents the absence of licenses by requiring package publishers to choose at least one license from its license list. Again, when we projected the impact including the indirect dependencies of each package version, the impact on NPM is higher than on RubyGems, because NPM packages have more versions.\n\nTo provide a more detailed example, Figure 1 shows a fragment of the dependency graph of the package request@0.8.1.0. This particular package has 23,205 direct dependents (6,840 of them irregular) and 42,938 indirect dependents (parents). Moreover, we omitted the regular direct dependencies from Figure 1. In the figure, solid edges are regular dependencies and dotted edges are irregular dependencies. Double-border vertexes are regular package versions, whereas single solid-border ones are irregular.\n\nTable 9: Impact caused by non-approved licenses in each package manager\n\n| Graph Order | Metric | CRAN | NPM | RubyGems |\n|-------------|--------|------|-----|----------|\n| Packages | # | 11,366 | 510,964 | 135,481 |\n| | Irregular | 1082 | 78,224 | 62,967 |\n| | Proportion | 0.095 | 0.153 | 0.464 |\n| | Affected | 1455 | 194,741 | 75,475 |\n| | Proportion | 0.128 | 0.381 | 0.557 |\n| Versions | # | 11,366 | 3,539,494 | 816,580 |\n| | Irregular | 35 | 690,703 | 440,443 |\n| | Proportion | 0.003 | 0.195 | 0.539 |\n| | Affected | 36 | 1,619,248 | 520,967 |\n| | Proportion | 0.003 | 0.457 | 0.637 |\n| Dependencies | # | 1,086 | 15,521,508 | 1,765,288 |\n| | Irregular | 59 | 1,364,281 | 1,088,298 |\n| | Proportion | 0.054 | 0.087 | 0.616 |\n\nDotted-border vertexes represent affected packages. Notice that a package might be irregular and affected at the same time.\n\nWe also observed that, in this fragment of the graph, three packages have a non-approved (missing) license associated with them: \"assert-plus\", \"verror\", and \"extsprintf\". It is worth mentioning that the packages \"assert-plus\" and \"extsprintf\" are considered regular packages because they do not depend on any package version released under a non-approved license.\n\nFigure 1: Example of an affected package version dependency tree\n\nAnother example occurs on the RubyGems package manager: the package activesupport, currently at version 4.2.6, was downloaded 174,538,434 times over its entire life cycle, but version 4.0.0, released on 25 June 2013, depended on the unlicensed packages minitest@4.2.0, multi_json@1.3.3, thread_safe@0.1.0 and tzinfo@0.3.37 (activesupport also depended on the MIT-licensed package i18n@0.6.4). This particular version was downloaded 3,107,216 times and was used directly by 1,093 other published packages and by 16,526 packages when taking into account both direct and indirect dependencies. The package activesupport is a toolkit extracted from the Rails framework\u2019s core.\n\nTo provide an extra perspective on the impact of non-approved licenses, we compared the irregular and affected counts with those caused by incomplete licenses. We chose incomplete licenses because they can be interpreted as wrong licenses, since they lack a correct license name or version.\n\nTable 10 presents the most common incomplete licenses per package manager; the sketch below illustrates how a declared license string can be assigned to the groups used throughout this section.
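For illustration, the sketch below shows how a declared license string could be mapped to the groups used in Tables 5-11. The actual study normalizes license names manually (Section 3.2.3); the rule order and the injected sets of OSI identifiers, SPDX identifiers, and version-less license families are assumptions made for this example.

```java
import java.util.Locale;
import java.util.Set;

// Illustrative sketch: assigns a declared license string to one of the license groups.
public class LicenseGroups {

    enum Group { OSI, INCOMPLETE, SPDX_NOT_OSI, MISSING, OTHER, COPYRIGHT }

    // The real study normalizes names manually; the injected sets are placeholders.
    static Group classify(String declared, Set<String> osiIds, Set<String> spdxIds,
                          Set<String> versionlessFamilies) {
        if (declared == null || declared.trim().isEmpty()) {
            return Group.MISSING;                       // no license information at all
        }
        String id = declared.trim().toLowerCase(Locale.ROOT);
        if (osiIds.contains(id)) {
            return Group.OSI;                           // e.g. "mit", "apache-2.0"
        }
        if (spdxIds.contains(id)) {
            return Group.SPDX_NOT_OSI;                  // valid SPDX id, but not OSI-approved
        }
        if (versionlessFamilies.contains(id)) {
            return Group.INCOMPLETE;                    // e.g. "gpl" or "bsd" without a version
        }
        if (id.contains("copyright") || id.contains("all rights reserved")) {
            return Group.COPYRIGHT;
        }
        return Group.OTHER;                             // typos, jokes, project-specific terms
    }
}
```

Under these assumptions, a string such as "bsd" would fall into the INCOMPLETE group, whereas an empty declaration is counted as MISSING.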
Among the most common incomplete licenses, we observed that package publishers use a number of licenses while omitting their version.\n\nTable 10: Top 10 Incomplete Licenses\n\n| CRAN License | # | NPM License | # | RubyGems License | # |\n|--------------|---|-------------|---|------------------|---|\n| agpl | 12 | bsd | 59,132 | bsd | 4,280 |\n| bsd | 11 | gpl | 7,904 | gpl | 1,783 |\n| cecill | 6 | lgpl | 2,747 | lgpl | 1,067 |\n| mpl | 2 | epl | 1,173 | agpl | 304 |\n| epl | 2 | mpl | 854 | artistic | 166 |\n| bsl | 1 | agpl | 832 | epl | 71 |\n| \u2014 | \u2014 | free | 218 | mpl | 50 |\n| \u2014 | \u2014 | ibm | 216 | free | 36 |\n| \u2014 | \u2014 | apl | 194 | osl | 26 |\n| \u2014 | \u2014 | cecill | 179 | afl | 16 |\n\nIn this sense, Table 11 presents the impact of Incomplete licenses. It is worth mentioning that, even if we consider incomplete licenses as inconsistent licenses, non-approved licenses (Table 9) presented a higher impact than Incomplete licenses; for instance, the number of irregular packages caused by non-approved licenses is 62,154 against 63,329 irregular packages caused by Incomplete licenses on RubyGems (the ratio of the difference, 813/362, is almost 2.5 times higher). If we compare the affected versions on RubyGems, the impact of non-approved licenses is almost 69 times higher than that of Incomplete licenses. In general, we also found that NPM is more affected by Incomplete licenses than RubyGems.\n\nFinally, CRAN packages were highly impacted by Incomplete licenses, which is mostly due to the lack of a license version. This behavior makes \u223c11% of CRAN packages irregular, which affects almost 15% of the published packages.\n\nWe recognize that non-approved licenses are dangerous both to package authors (publishers on package managers) and to users \u2013 who create, but do not explicitly publish, a package with direct dependencies on published packages \u2013 because of the uncertainty about whether the dependencies of the package to be published are regular or not. In fact, package publishers should look at the whole dependency chain. However, a few factors might explain the presence of such irregularities in package managers, such as the height of the package dependency tree and the presence of newcomers in the open source community, who might not be completely aware of license constraints.\n\nTable 11: Impact caused by Incomplete licenses in each package manager\n\n| Graph Order | Metric | CRAN | NPM | RubyGems |\n|-------------|--------|------|-----|----------|\n| Packages | # | 11,366 | 510,964 | 135,481 |\n| | Irregular | 1,256 | 94,515 | 63,329 |\n| | Proportion | 0.110 | 0.184 | 0.467 |\n| | Affected | 1,480 | 197,626 | 75,455 |\n| | Proportion | 0.130 | 0.386 | 0.556 |\n| Versions | # | 11,366 | 3,539,494 | 816,580 |\n| | Irregular | 38 | 825,520 | 443,072 |\n| | Proportion | 0.003 | 0.233 | 0.542 |\n| | Affected | 38 | 1,639,430 | 520,836 |\n| | Proportion | 0.003 | 0.463 | 0.637 |\n| Dependencies | # | 1,086 | 15,521,508 | 1,765,288 |\n| | Irregular | 62 | 1,759,643 | 1,098,489 |\n| | Proportion | 0.057 | 0.113 | 0.622 |\n\n**RQ2 Summary:** Non-approved licenses impact packages from NPM and RubyGems, making packages irregular and affecting both their direct and indirect dependents. Non-approved licenses can be considered more harmful than incomplete licenses since their impact is higher in terms of the number of irregular and affected packages and versions in each license group.\n\n#### 4.3 RQ3. Why do developers adopt non-approved licenses?\n\nTo answer this question, we report the results of our survey with 76 package publishers.
Our respondent population is 94% male and 96% work in the software development industry. About 53% of them have created or contributed to up to 30 open-source projects (18% have created or contributed to more than 100 open-source projects). Still, 48% of the respondents believe that about 20% of these created/contributed open-source projects use a non-approved license. More interesting, however, is the fact that 27% of the respondents have no idea how many of the projects they contribute to use a non-approved license. Similarly, in Section 4.1, we showed evidence that about 18% of the package versions studied use a non-approved license.\n\nWhen we asked why they use a non-approved license, we found that 26 of the respondents do not care about the specific license terms. Along this line, one respondent mentioned that \"I chose WTFPL license because I really don\u2019t care about who and how use my modules. I share my code with people and it\u2019s a pleasure for me to just know if someone finds it useful. Maybe if I wrote something really great like Facebook\u2019s React I would think about fame\". Also, 17 respondents acknowledged that using a non-approved license was a naive decision: \"I thought I was appropriate\". Still, small projects seem to be more prone to be licensed under a non-approved license. Yet, 5 respondents are aware that a non-approved license makes sense when licensing non-software projects, for instance, \"Because it fits the content of the repository best (it is not a source code repository, but contains only data)\". Finally, some developers adopt non-approved licenses because they claim they are simpler (6 occurrences) or more open (4 occurrences); for instance, one respondent said that she likes \"the idea of WTFPL. Makes everything pretty clear. You just do what you want.\"\n\nRight afterwards, we asked whether they are aware of the implications of using a non-approved license; 43% of the respondents mentioned a lack of awareness. For those who mentioned being aware of the implications, we asked them to cite one example of an implication. Among the answers, we found that developers believe that a non-approved license might limit the adoption of their software (12 occurrences). As an example, one respondent said that \"If you use a license others have never heard of, others are less likely to contribute and/or may be wary of using you software.\" Code theft was also a recurring implication, mentioned by 7 respondents. Finally, one respondent raised the fact that the main implication of using a non-approved license is that \"it can\u2019t be automatically recognized by machines to categorize software under any license which may exclude the software from search results\". This is particularly interesting, since GitHub helps project owners choose a correct license for their repositories.
However, the GitHub help documentation also highlights to developers that they are responsible for defining the correct license, as we can see in this paragraph: \"GitHub provides the information on an as-is basis and makes no warranties regarding any information or licenses provided on or through it, and disclaims liability for damages resulting from using the license information.\"\n\nIn the following five questions, we asked how often they (Q9) investigate whether the license they chose conforms with the licenses of the packages their project depends on, (Q10) do not declare a license, (Q11) use a non-approved license, (Q12) use a copyright license in an open-source software project, and (Q13) use more than one license (either approved or not). Figure 2 shows the results.\n\nThis figure shows several interesting findings. First, we can see that 46% of respondents \"Never\" or \"Rarely\" take into account the license used in the software\u2019s dependencies. We believe this is an important result because, as we discussed in Section 2, license inconsistencies directly impact any project that depends on them. With similar implications, 11% of the respondents \u201cAlways\u201d or \u201cVery Often\u201d do not declare a license. One respondent even mentioned that she \u201cFrequently forget to declare any license and it seems unimportant.\u201d Similarly, 25% of the respondents \u201cAlways\u201d or \u201cVery Often\u201d use a non-approved license. Finally, 94% mentioned that they \u201cNever\u201d or \u201cRarely\u201d use more than one license (either approved or not). One respondent mentioned that the reason why she uses more than one license is related to the fork-based model: \u201cTypoPRO is a collection(!) of fonts and each font already has its distinct Open Source software license from their upstream vendor. So, TypoPRO stays under (this union) set of upstream licenses.\u201d\n\n**RQ3 Summary.** 26 respondents do not care about the license used. Some respondents believe that non-approved licenses are more open and simpler to use. Among the implications, 12 respondents believe that non-approved licenses can limit the adoption of their software. 46% of the respondents do not take the license into account when choosing a package dependency.\n\n### 5 IMPLICATIONS\n\nThis research has implications for different kinds of stakeholders. Three of these possible groups are discussed below.\n\n**Package managers.** Since we observed that both NPM and RubyGems do not require developers to state a license, many packages published on these package managers either (1) do not use any license or (2) state a wrong or incomplete license name (RQ1). This problem not only hinders researchers from conducting in-depth studies on license usage, but also has the potential to confuse software developers interested in using the software package. Package managers, therefore, might introduce mechanisms to prevent the introduction of wrong (or even non-existing) license names.\n\n**Researchers.** Although software licensing is an established research topic, our notion of non-approved licenses had not yet been fully explored (RQ1) and its implications were unclear (RQ2). Researchers can expand our comprehension of non-approved licenses in many ways. First, researchers could introduce mechanisms to automatically detect the use of non-approved licenses. Still, since packages tend to propagate their licenses over releases (RQ1), researchers can create techniques to avoid non-approved license propagation.\n\n**CS Professors.**
Educators can also benefit from the findings of this study. Since software licensing is a commonly misunderstood topic among software developers [1], software engineering professors could bring problems related to license usage to the classroom and invite students to discuss possible solutions or compare them with the perceptions of professional software developers (RQ3). Similarly, in order to make software licenses more appealing to aspiring software engineers, professors can use our license inconsistency graph (RQ2) in advanced data-structure classes and invite students to understand license inconsistencies in complex and deeper graphs.\n\n### 6 THREATS TO VALIDITY\n\nIn a study of such proportions, there are always many limitations and threats to validity. First, we could not retrieve data from 2,140 packages (1,079 NPM packages, 1,052 RubyGems packages, and 9 CRAN packages). This happened because these packages\u2019 metadata could not be located. However, these packages represent only 0.04% of the whole universe of packages in our study.\n\nSecond, the normalization process was manual and, therefore, error-prone. We mitigated this threat through pair review: each author independently analyzed the same set of licenses, with subsequent conflict resolution meetings. Both the original and normalized license sets are available for future analysis. We chose not to analyze the external FILE licenses because most of these package versions are hosted on GitHub and would require manually searching for the license file in the repositories. At CRAN, 1,391 package versions have a file license declared; on NPM, 19,010; and RubyGems has more than 20,000 package versions using the FILE license.\n\nThird, one might argue that the packages we studied might be full of simple, trivial software projects. However, packages available on package managers are often more mature when compared to software projects hosted on other coding websites such as GitHub, which are often personal or class projects [9].\n\nFourth, we rely on the licenses approved by OSI. Even if a license is commonplace \u2014 for instance, we found 4,927 package versions using the creative commons zero (CC0) license (104 at CRAN, 3,022 at NPM and 1,801 at RubyGems) \u2014 we still consider such licenses as non-approved. Although we are aware that many other institutions such as the Free Software Foundation (FSF) and the Debian Foundation approve licenses, we decided to stick to OSI approval because: (1) licenses can be submitted for OSI approval by anyone interested, and (2) licenses approved by OSI are commonly used \u2014 as shown in Table 5, only a few licenses found in our dataset were not recognized by OSI.\n\nFinally, we did not double-check whether the license stated at the package manager was, indeed, the same as the one declared on the official package website. We chose not to validate the license used for two reasons: first, the package publisher (who is often a core member of the project) is in charge of declaring the license used in a given published version; that is, no one would be better placed than the package publisher to state the correct license. Second, our study covers hundreds of thousands of software packages.
These software packages are often hosted on a third-party coding website (e.g., GitHub or BitBucket), and these websites store license information in different ways. For example, GitHub shows the license name on the project\u2019s front page (if its algorithm succeeds in inferring the license, which is not always the case); BitBucket, on the other hand, does not explicitly demand any license when creating a repository. Additionally, if the project has a proper license file, it will display the license on the project\u2019s cover page. This problem is only exacerbated when considering license information per version release. Therefore, due to the lack of standards and our substantial sample size, performing such a manual process would be prohibitive.\n\n### 7 RELATED WORK\n\nRecent studies investigated license inconsistencies, a concept similar to our notion of non-approved licenses. Since non-approved licenses also introduce inconsistencies, one can see non-approved licenses as a subset of license inconsistencies. However, we believe that the implications of non-approved licenses are greater than the known problems related to license inconsistencies.\n\nTo the best of our knowledge, our work is the first to analyze the usage and adoption of Non-Approved licenses. We also discussed the impact of Non-Approved licenses compared to incomplete licenses in the package manager context, which has attracted increasing attention from practitioners and researchers, since NPM, CRAN and RubyGems are growing fast and becoming increasingly popular. We summarize the related work in terms of license maintenance and evolution and license inconsistencies.\n\nDi Penta et al. [4] proposed a method to track the evolution of software licensing and investigated its relevance on six open source projects. Most of the inconsistencies found were related to files without a license. Vendome et al. [24, 27] conducted a large empirical study investigating when and why developers adopt or change software licenses. Recently, Vendome et al. [26] performed another large-scale empirical study on the change history of over 51K FOSS systems to investigate the prevalence of known license exceptions, presenting a categorization and a machine-learning-based detection algorithm to identify license exceptions. Santos [20] analyzed a set of 756 projects from the FLOSSmole repository of Sourceforge.net data that had changed their source code distribution allowances. The author found 88 projects with a \u201cnone\u201d license \u2013 which might leave projects exposed and legally unattended \u2013 and 55 cases where projects changed from having a license to having none.\n\nGerman et al. [8] investigated how the licenses declared in packages are consistent with the source code files in the Fedora ecosystem. Manabe et al. [15] extended this work by proposing a graph visualization to understand those relationships. They found that GPL licenses are more likely to include other licenses, while Apache licenses tend to contain files only under the same license. The authors reported changes from a valid license to none and some cases where a non-valid license was changed to a valid license.\n\nWu et al. [30, 31] investigated license inconsistencies caused by re-distributors that removed or modified the license header in the source code. The authors described and categorized different types of license inconsistencies, proposing a method to detect them in the Debian ecosystem.
The authors found that, on average, more than 24% of package relationships have a \u201cnone\u201d license between them; however, this effect was not discussed. Wu et al. [29] also studied whether license inconsistency issues are properly solved by analyzing two versions of Debian and investigating the evolution patterns of license inconsistencies, which disappear when the downstream projects get synchronized.\n\nLee et al. [14] compared machine-based algorithms to identify potential license violations and guide non-experts in manually inspecting violations. The authors reported that the accuracy of crowds is comparable to that of experts and to the machine learning algorithm. It is interesting to note that approximately 25% of files from 227 projects (79.4% of the projects analyzed) did not have any license.\n\nAlmeida et al. [1] conducted a survey with 375 developers to understand whether they understand the violations and assumptions of three popular open source licenses (GNU GPL 3.0, GNU LGPL 3.0 and MPL 2.0), both alone and in combination. The authors confronted the answers with experts\u2019 opinions and found that the answers were consistent in 62% of 42 cases. Although previous work on understanding software licenses pointed to \u201cNone\u201d as a frequent choice for files and packages, none of these scenarios addressed this aspect.\n\nVan der Burg et al. [23] proposed an approach to construct and analyze the Concrete Build Dependency Graph (CBDG) of a software system by tracing system calls at build-time. Through a case study of seven open source systems, the authors showed that the constructed CBDGs can accurately classify sources as included in or excluded from deliverables with 88%-100% precision and 98%-100% recall, and can uncover license compliance inconsistencies in real software systems. German and Di Penta [6] presented a method for open source license compliance of Java applications. The authors implemented a tool called Kenen to mitigate any potential legal risk for developers that reuse open source components. Kapitsaki et al. [11] compared tools that are used to detect licenses of software components and avoid license violations, classifying them into three types: license information identification from source code and binaries, software metadata stored in code repositories, and license modeling and associated reasoning actions.\n\n### 8 CONCLUSION\n\nIn this paper we conducted a large-scale study on non-approved licenses in terms of usage, impact, and adoption. A non-approved license is any license not approved by the OSI, the Open Source Initiative. Software released under a non-approved license cannot be claimed to be open-source (the original author retains all rights). Non-approved licenses include licenses with typos, wrong names, or even curses, as well as missing licenses (e.g., when package publishers do not fill in the license information).\n\nWhen mining data from ~657k open-source projects, we observed that hundreds of non-approved licenses exist. About 24% of the packages released used at least one of these non-approved licenses. The majority of the non-approved licenses found are, in fact, the absence of a license. Still, we found that package publishers tend to propagate the same license through package versions. Non-approved licenses impact packages from NPM and RubyGems more than Incomplete licenses in terms of the number of irregular and affected packages and versions.
Finally, when we asked package publishers about non-approved licenses, we found that 46% of the respondents do not take the license into account when choosing a package dependency, and some respondents believe that non-approved licenses are more open and simpler to use. On the other hand, 12 respondents believe that non-approved licenses may limit the adoption of their software.\n\nFor future work, we plan to investigate the evolution of non-approved licenses in a fine-grained way (e.g., through commits instead of version releases). This would deepen our understanding of why non-approved licenses are adopted. Still, since CRAN developers might have a more diverse background (e.g., biologists, mathematicians, among others), we plan to get in touch with them to understand their motivations behind the usage of non-approved licenses.\n\nACKNOWLEDGMENTS\n\nThis work is supported by Funda\u00e7\u00e3o Arauc\u00e1ria; CNPq (406308/2016-0 and 430642/2016-4); PROPESP/UFPA; and FAPESP (2015/24527-3).\n\nREFERENCES\n\n[1] D. A. Almeida, G. C. Murphy, G. Wilson, and M. Hoye. 2017. Do Software Developers Understand Open Source Licenses?. In 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). 1\u201311. https://doi.org/10.1109/ICPC.\n\n[2] Jailton Coelho and Marco Tulio Valente. 2017. Why Modern Open Source Projects Fail. In 25th International Symposium on the Foundations of Software Engineering (FSE). 186\u2013186.\n\n[3] Eirini Kalliamvakou and Tom Mens. 2017. An Empirical Comparison of Developer Retention in the RubyGems and Npm Software Ecosystems. Innov. Syst. Softw. Eng. 13, 2-3 (Sept. 2017), 101\u2013115. https://doi.org/10.1007/s11334-017-0030-4\n\n[4] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2016. An in-depth study of the promises and perils of mining GitHub. Empirical Software Engineering 21, 5 (2016), 2035\u20132071. https://doi.org/10.1007/s10664-015-9393-5\n\n[5] Karl Fogel. 2017. Producing Open Source Software: How to Run a Successful Free Software Project (second ed.). O'Reilly Media. http://www.producingoss.com/.\n\n[6] D. German and M. Di Penta. 2012. A Method for Open Source License Compliance of Java Applications. IEEE Software 29, 3 (May 2012), 58\u201363. https://doi.org/10.1109/MS.2012.50\n\n[7] Daniel M. German and Jes\u00fas M. Gonz\u00e1lez-Baralona. 2009. An Empirical Study of the Reuse of Software Licensed under the GNU General Public License. Springer Berlin Heidelberg, Berlin, Heidelberg, 185\u2013198. https://doi.org/10.1007/978-3-642-02032-2_17\n\n[8] D. M. German, M. Di Penta, and J. Davies. 2010. Understanding and Auditing the Licensing of Open Source Software Distributions. In 2010 IEEE 18th International Conference on Program Comprehension. 84\u201393. https://doi.org/10.1109/ICPC.2010.48\n\n[9] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2016. An in-depth study of the promises and perils of mining GitHub. Empirical Software Engineering 21, 5 (2016), 2035\u20132071. https://doi.org/10.1007/s10664-015-9393-5\n\n[10] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2014. The Promises and Perils of Mining GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). 92\u2013101.\n\n[11] Georgia M. Kapitsaki, Nikolaos D. Tselikas, and Ioannis E. Foukarakis. 2015. An insight into license tools for open source software systems.
Journal of Systems and Software 102 (2015), 72 \u2013 87. https://doi.org/10.1016/j.jss.2014.12.050\n\n[12] Cory Kapser and Michael W. Godfrey. 2008. \u201cCloning considered harmful\u201d considered harmful. patterns of cloning in software. Empirical Software Engineering 13, 6 (2008), 645\u2013692.\n\n[13] Miryung Kim, L. Bergman, T. Lau, and D. Notkin. 2004. An ethnographic study of copy and paste programming practices in OOPL. In Empirical Software Engineering, 2004. ISESE \u201904. Proceedings. 2004 International Symposium on. 83\u201392.\n\n[14] Sanghoon Lee, Daniel M. German, Seung-won Hwang, and Sunghun Kim. 2015. Crowdsourcing Identification of License Violations. Journal of Computing Science and Engineering 9, 4 (2015), 190\u2013203.\n\n[15] Yuki Manabe, Daniel M. German, and Katsuro Inoue. 2014. Analyzing the Relationship between the License of Packages and Their Files in Free and Open Source Software. Springer Berlin Heidelberg, Berlin, Heidelberg, 51\u201360. https://doi.org/10.1007/978-3-642-55129-4_6\n\n[16] Trevor Maryka, Daniel M. German, and Germ\u00e1n Poo-Caama\u00f1o. 2015. On the Variability of the BSD and MIT Licenses. Springer International Publishing, Cham, 146\u2013156. https://doi.org/10.1007/978-3-319-17837-0_14\n\n[17] OSD. 2018. The Open Source Definition (Annotated). (2018). https://opensource.org/osd-annotated\n\n[18] Gustavo Pinto, Igor Steinmacher, and Marco Aur\u00e9lio Gerosa. 2016. More Common Than You Think: An In-depth Study of Casual Contributors. In IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016 - Volume 1. 112\u2013123. https://doi.org/10.1109/ICPC.\n\n[19] Lawrence Rosen. 2004. Open Source Licensing: Software Freedom and Intellectual Property Law. Prentice Hall PTR, Upper Saddle River, NJ, USA.\n\n[20] Carlos Denner dos Santos. 2017. Changes in free and open source software licenses: managerial interventions and variations on project attractiveness. Journal of Internet Services and Applications 8, 1 (07 Aug 2017), 11. https://doi.org/10.1186/s13174-017-0062-3\n\n[21] E. Smith, R. Loftin, E. Murphy-Hill, C. Bird, and T. Zimmermann. 2013. Improving developer participation rates in surveys. In 2013 6th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE). 89\u201392. https://doi.org/10.1109/CHASE.2013.6614738\n\n[22] Diomidis Spinellis. 2012. Package Management Systems. IEEE Software 29, 2 (2012), 84\u201386.\n\n[23] Sander van der Burg, Eelco Dolstra, Shane McIntosh, Julius Davies, Daniel M. German, and Armijn Hemel. 2014. Tracing Software Build Processes to Uncover License Compliance Inconsistencies. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering (ASE \u201914). ACM, New York, NY, USA, 731\u2013742. https://doi.org/10.1145/2642937.2643013\n\n[24] Christopher Vendome, Gabriele Bavota, Massimiliano Di Penta, Mario Linares-V\u00e1squez, Daniel M. German, and Denys Poshyvanyk. 2017. License usage and changes: a large-scale study on gitHub. Empirical Software Engineering 22, 3 (01 Jun 2017), 1537\u20131577. https://doi.org/10.1007/s10664-016-9438-4\n\n[25] Christopher Vendome, Gabriele Bavota, Massimiliano Di Penta, Mario Linares-V\u00e1squez, Daniel M. Germ\u00e1n, and Denys Poshyvanyk. 2017. License usage and changes: a large-scale study on gitHub. 
Empirical Software Engineering 22, 3 (2017), 1537\u20131577.\n\n[26] Christopher Vendome, Mario Linares-Vasquez, Gabriele Bavota, Massimiliano Di Penta, Daniel M. German, and Denys Poshyvanyk. 2017. Machine Learning-based Detection of Open Source License Exceptions. In Proceedings of the 39th International Conference on Software Engineering (ICSE \u201917). IEEE Press, Piscataway, NJ, USA, 118\u2013129. https://doi.org/10.1109/ICSE.2017.19\n\n[27] Christopher Vendome, Mario Linares-Vasquez, Gabriele Bavota, Massimiliano Di Penta, Daniel M. German, and Denys Poshyvanyk. 2015. When and Why Developers Adopt and Change Software Licenses. In Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) (ICSME \u201915). IEEE Computer Society, Washington, DC, USA, 31\u201340. https://doi.org/10.1109/ICSM.2015.7332449\n\n[28] Erik Wittern, Philippe Suter, and Shriram Rajagopalan. 2016. A Look at the Dynamics of the JavaScript Package Ecosystem. In Proceedings of the 13th International Conference on Mining Software Repositories (MSR \u201916). ACM, New York, NY, USA, 351\u2013361. https://doi.org/10.1145/2901739.2901743\n\n[29] Yuhao Wu, Yuki Manabe, Daniel M. German, and Katsuro Inoue. 2017. How are Developers Treating License Inconsistency Issues? A Case Study on License Inconsistency Evolution in FOSS Projects. Springer International Publishing, Cham, 69\u201379. https://doi.org/10.1007/978-3-319-57735-7_8\n\n[30] Y. Wu, Y. Manabe, T. Kanda, D. M. German, and K. Inoue. 2015. A Method to Detect License Inconsistencies in Large-Scale Open Source Projects. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories. 324\u2013333. https://doi.org/10.1109/MSR.2015.37\n\n[31] Yuhao Wu, Yuki Manabe, Tetsuya Kanda, Daniel M. German, and Katsuro Inoue. 2017. Analysis of license inconsistency in large collections of open source projects. Empirical Software Engineering 22, 3 (01 Jun 2017), 1194\u20131222. https://doi.org/10.1007/s10664-016-9487-8
On the Extent and Nature of Software Reuse in Open Source Java Projects\n\nLars Heinemann, Florian Deissenboeck, Mario Gleirscher, Benjamin Hummel, and Maximilian Irlbeck\n\nInstitut f\u00fcr Informatik, Technische Universit\u00e4t M\u00fcnchen, Germany\n{heineman,deissenb,gleirsch,hummelb,irlbeck}@in.tum.de\n\nAbstract. Code repositories on the Internet provide a tremendous amount of freely available open source code that can be reused for building new software. It has been argued that only software reuse can bring the gain of productivity in software construction demanded by the market. However, knowledge about the extent of reuse in software projects is only sparse. To remedy this, we report on an empirical study about software reuse in 20 open source Java projects with a total of 3.3 MLOC. The study investigates (1) whether open source projects reuse third party code and (2) how much white-box and black-box reuse occurs. To answer these questions, we utilize static dependency analysis for quantifying black-box reuse and code clone detection for detecting white-box reuse from a corpus with 6.1 MLOC of reusable Java libraries. Our results indicate that software reuse is common among open source Java projects and that black-box reuse is the predominant form of reuse.\n\n1 Introduction\n\nSoftware reuse involves the use of existing software artifacts for the construction of new software [9]. Reuse has multiple positive effects on the competitiveness of a development organization. By reusing mature software components, the overall quality of the resulting software product is increased. Moreover, the development costs as well as the time to market are reduced [7, 11]. Finally, maintenance costs are reduced, since maintenance tasks concerning the reused parts are \u201coutsourced\u201d to other organizations. It has even been stated that there are few alternatives to software reuse that are capable of providing the gain of productivity and quality in software projects demanded by the industry [15].\n\nToday, practitioners and researchers alike fret about the failure of reuse in form of a software components subindustry as imagined by McIlroy over 40 years ago [13]. Newer approaches, such as software product lines [2] or the development of product specific modeling languages and code generation [8], typically focus on reuse within a single product family and a single development organization. However, reuse of existing third party code is\u2014from our observation\u2014a common practice in almost all software projects of significant size. Software repositories on the Internet provide a tremendous amount of freely reusable source code, frameworks and libraries for many recurring problems. Popular examples are the frameworks for web applications provided by the Apache Foundation and the Eclipse platform for the development of rich client applications. Due to its ubiquitous availability in software development, the Internet itself has become an interesting reuse repository for software projects [3, 6]. Search engines like Google Code Search\\footnote{http://www.google.com/codesearch} provide powerful search capabilities and direct access to millions of source code files written in a multitude of programming languages.
Open source software repositories like Sourceforge\\footnote{http://sourceforge.net}, which currently hosts almost a quarter million projects, offer the possibility for open source software projects to conveniently share their code with a world-wide audience.\n\n**Research problem.** Despite the widely recognized importance of software reuse and its proven positive effects on quality, productivity and time to market, it remains largely unknown to what extent current software projects make use of the extensive reuse opportunities provided by code repositories on the Internet. Literature is scarce on how much software reuse occurs in software projects. It is also unclear how much code is reused in black-box or white-box fashion. We consider this lack of empirical knowledge about the extent and nature of software reuse in practice problematic and argue that a solid basis of data is required in order to assess the success of software reuse.\n\n**Contribution.** This paper extends the empirical knowledge about the extent and nature of code reuse in open source projects. Concretely, we present quantitative data on reuse in 20 open source projects that was acquired with different types of static analysis techniques. The data describes the reuse rate of each project and the relation between white-box and black-box reuse. The provided data helps to substantiate the academical discussion about the success or failure of software reuse and supports practitioners by providing them with a benchmark for software reuse in 20 successful open source projects.\n\n## 2 Terms\n\nThis section briefly introduces the fundamental terms this study is based on.\n\n**Software reuse.** In this paper, we use a rather simple notion of software reuse: software reuse is considered as the utilization of code developed by third parties besides the functionality provided by the operating system and the programming platform.\n\nWe distinguish between two reuse strategies, namely *black-box* and *white-box* reuse. Our definitions of these strategies follow the notions from [17].\n\n**White-box reuse.** We consider the reuse of code to be of the white-box type, if it is incorporated in the project files in source form, *i.e.*, the internals of the reused code are exposed to the developers of the software. This implies that the\ncode may potentially be modified. The reuse rate for white-box reuse is defined as the ratio between the amount of reused lines of code and the total amount of lines of code (incl. reused source code).\n\n**Black-box reuse.** We consider the reuse of code to be of the black-box type, if it is incorporated in the project in binary form, *i.e.*, the internals of the reused code are hidden from the developers and maintainers of the software. This implies that the code is reused *as is*, *i.e.*, without modifications. For black-box reuse the reuse rate is given by the ratio between the size of the reused binary code and the size of the binary code of the whole software system (incl. 
reused binary code).\n\n## 3 Methodology\n\nThis section describes the empirical study that was performed to analyze the extent and nature of software reuse in open source projects.\n\n### 3.1 Study Design\n\nWe use the Goal-Question-Metric template from [20] for defining this study:\n\nWe analyze *open source projects* for the purpose of *understanding the state of the practice in software reuse* with respect to *its extent and nature* from the viewpoint of *the developers and maintainers* in the context of *Java open source software*.\n\nTo achieve this, we investigate the following three research questions.\n\n**RQ 1 Do open source projects reuse software?** The first question of the study asks whether open source projects reuse software at all, according to our definition.\n\n**RQ 2 How much white-box reuse occurs?** For those projects that do reuse existing software, we ask how much of the code is reused in a white-box fashion as defined in Section 2. We use as metrics the number of copied lines of code from external sources as well as the reuse rate for white-box reuse.\n\n**RQ 3 How much black-box reuse occurs?** We further ask how much of the code is reused in a black-box fashion according to our definition. For this question we use as metrics the aggregated byte code size of the reused classes from external libraries and the reuse rate for black-box reuse. Although not covered by our definition of software reuse, we separately measure the numbers for black-box reuse of the Java API, since one could argue that this is also a form of software reuse.\n\n### 3.2 Study Objects\n\nThis section describes how we selected the projects that were analyzed in the study and how they were preprocessed in advance to the reuse analyses.\nTable 1. The 20 studied Java applications\n\n| System | Version | Description | LOC | Size (KB) |\n|-------------------------|---------------|------------------------------|---------|-----------|\n| Azureus/Vuze | 4.504 | P2P File Sharing Client | 786,865 | 22,761 |\n| Buddi | 3.4.0.3 | Budgeting Program | 27,690 | 1,149 |\n| DavMail | 3.8.5-1480 | Mail Gateway | 29,545 | 932 |\n| DrJava | stable-20100913-r5387 | Java Programming Env. | 160,256 | 6,199 |\n| FreeMind | 0.9.0 RC 9 | Mind Mapper | 71,133 | 2,352 |\n| HSQLDB | 1.8.1.3 | Relational Database Engine | 144,394 | 2,032 |\n| iReport-Designer | 3.7.5 | Visual Reporting Tool | 338,819 | 10,783 |\n| JabRef | 2.6 | BibTeX Reference Manager | 109,373 | 3,598 |\n| JEdit | 4.3.2 | Text Editor | 176,672 | 4,010 |\n| MediathekView | 2.2.0 | Media Center Management | 23,789 | 933 |\n| Mobile Atlas Creator | 1.8 beta 2 | Atlas Creation Tool | 36,701 | 1,259 |\n| OpenProj | 1.4 | Project Management | 151,910 | 3,885 |\n| PDF Split and Merge | 0.0.6 | PDF Manipulation Tool | 411 | 17 |\n| RODIN | 2.0 RC 1 | Service Development | 273,080 | 8,834 |\n| soapUI | 3.6 | Web Service Testing Tool | 238,375 | 9,712 |\n| SQuirreL SQL Client | Snapshot-20100918-1811 | Graphical SQL Client | 328,156 | 10,918 |\n| subsonic | 4.1 | Web-based Music Streamer | 30,641 | 1,050 |\n| Sweet Home 3D | 2.6 | Interior Design Application | 77,336 | 3,498 |\n| TV-Browser | 3.0 RC 1 | TV Guide | 187,216 | 6,064 |\n| YouTube Downloader | 1.9 | Video Download Utility | 2,969 | 99 |\n| **Overall** | | | **3,195,331** | **100,085** |\n\nSelection Process. We chose 20 projects from the open source software repository Sourceforge as study objects. Sourceforge is the largest repository of open source applications on the Internet. 
It currently hosts 240,000 software projects and has 2.6 million users.\\(^3\\)\n\nWe used the following procedure for selecting the study objects.\\(^4\\) We searched for Java projects with the development status *Production/Stable*. We then sorted the resulting list descending by number of weekly downloads. We stepped through the list beginning from the top and selected each project that was a standalone application, purely implemented in Java, based on the Java SE Platform and had a source download. All of the 20 study objects selected by this procedure were among the 50 most downloaded projects. Thereby, we obtained a set of successful projects in terms of user acceptance. The application domains of the projects were diverse and included accounting, file sharing, e-mail, software development and visualization. The size of the downloaded packages (zipped files) had a broad variety, ranging from 40 KB to 53 MB.\n\nTable 1 shows overview information about the study objects. The *LOC* column denotes the total number of lines in Java source files in the downloaded and preprocessed source package as described below. The *Size* column shows the bytecode sizes of the study objects.\n\nPreprocessing. We deleted test code from the projects following a set of simple heuristics (e.g. folders named test/tests). In few cases, we had to remove code that was not compilable. For one project we omitted code that referenced a commercial library.\n\n\\(^3\\) [http://sourceforge.net/about](http://sourceforge.net/about)\n\n\\(^4\\) The project selection was performed on October 5th, 2010.\nTable 2. The 22 libraries used as potential sources for white-box reuse\n\n| Library | Description | Version | LOC |\n|--------------------------|--------------------------------------|---------|---------|\n| ANTLR | Parser Generator | 3.2 | 66,864 |\n| Apache Ant | Build Support | 1.8.1 | 251,315 |\n| Apache Commons | Utility Methods | 5/Oct/2010 | 1,221,669 |\n| log4j | Logging | 1.2.16 | 68,612 |\n| ASM | Byte-Code Analysis | 3.3 | 3,710 |\n| Batik | SVG Rendering and Manipulation | 1.7 | 366,507 |\n| BCEL | Byte-Code Analysis | 5.2 | 48,166 |\n| Eclipse | Rich Platform Framework | 3.5 | 1,404,122 |\n| HSQLDB | Database | 1.8.1.3 | 157,935 |\n| Jaxen | XML Parsing | 1.1.3 | 48,451 |\n| JCommon | Utility Methods | 1.0.16 | 67,807 |\n| JDOM | XML Parsing | 1.1.1 | 32,575 |\n| Berkeley DB Java Edition | Database | 4.0.103 | 367,715 |\n| JFreeChart | Chart Rendering | 1.0.13 | 313,268 |\n| JGraphT | Graph Algorithms and Layout | 0.8.1 | 41,887 |\n| JUNG | Graph Algorithms and Layout | 2.0.1 | 67,024 |\n| Jython | Scripting Language | 2.5.1 | 252,062 |\n| Lucene | Text Indexing | 3.0.2 | 274,270 |\n| Spring Framework | J2EE Framework | 3.0.3 | 619,334 |\n| SVNKit | Subversion Access | 1.3.4 | 178,953 |\n| Velocity Engine | Template Engine | 1.6.4 | 70,804 |\n| Xerces-J | XML Parsing | 2.9.0 | 226,389 |\n| **Overall** | | | **6,149,439** |\n\nWe also added missing libraries that we downloaded separately in order to make the source code compilable. We either obtained the libraries from the binary package of the project or from the library\u2019s website. In the latter case we chose the latest version of the library.\n\n3.3 Study Implementation and Execution\n\nThis section details how the study was implemented and executed on the study objects. 
All automated analyses were implemented in Java on top of our open source quality analysis framework ConQAT\\(^5\\), which provides\u2014among others\u2014clone detection algorithms and basis functionality for static code analysis.\n\nDetecting White-Box Reuse. As white-box reuse involves copying external source code into the project\u2019s code, the sources of reuse are not limited to libraries available at compile time, but can virtually span all existing Java source code. The best approximation of all existing Java source code is probably provided by the indices of the large code search engines, such as Google Code Search or Koders. Unfortunately, access to these engines is typically limited and does not allow to search for large amounts of code, such as the 3 MLOC of our study objects. Consequently, we only considered a selection of commonly used Java libraries and frameworks as potential sources for white-box reuse. We selected 22 libraries which are commonly reused based on our experience with both own development projects and systems we analyzed during earlier studies. The libraries\n\n\\(^5\\) [http://www.conqat.org](http://www.conqat.org)\nare listed in Table 2 and comprise more than 6 MLOC. For the sake of presentation, we treated the Apache Commons as a single library, although it consists of 39 individual libraries that are developed and versioned independently. The same holds for Eclipse, where we chose a selection of its plug-ins.\n\nTo find potentially copied code, we used our clone detection algorithm presented in [5] to find duplications between the selected libraries and the study objects. We computed all clones consisting of at least 15 statements with normalization of formatting and identifiers (type-2 clones), which allowed us to also find partially copied files (or files which are not fully identical due to further independent evolution), while keeping the rate of false positives low. All clones reported by our tool were also inspected manually, to remove any remaining false positives.\n\nWe complemented the clone detection approach by manual inspection of the source code of all study objects. The size of the study objects only allows a very shallow inspection, based on the names of files and directories (which correspond to Java packages). For this we scanned the directory trees of the projects for files residing in separate source folders or in packages that were significantly different from the package names used for the project itself. The files found this way were then inspected and their source identified based on header comments or a web search. Of course this step only can find large scale reuse, where multiple files are copied into a project and the original package names are preserved (which are typically different from the project\u2019s package names). However, during this inspection we are not limited to the 22 selected libraries, but potentially can find other reused code as well.\n\nDetecting Black-Box Reuse. The primary way of black-box reuse in Java programs is the inclusion of libraries. Technically, these are Java Archive Files (JAR), which are zipped files containing the byte code of the Java types. Ideally, one would measure the reuse rate based on the source code of the libraries. However, obtaining the source code for such libraries is error-prone as many projects do not document the exact version of the used libraries. In certain cases, the source code of libraries is not available at all. 
To avoid these problems and prevent measurement inaccuracies, we performed the analysis of black-box reuse directly on the Java byte code stored in the JAR files.\n\nWhile JAR files are the standard way of packaging reusable functionality in Java, the JAR files themselves are not directly reused.\\footnote{In addition to JAR files, Java provides a package concept that resembles a logical modularization concept. Packages, however, cannot directly be reused.} They merely represent a container for Java types (classes, interfaces, enumerations and annotations) that are referenced by other types. Hence, the type is the main entity of reuse in Java. Our black-box reuse analysis determines which types from libraries are referenced from the types of the project code. The dependencies are defined by the Java Constant Pool [12], a part of the Java class file that holds information about all referenced types. References are method calls and all type usages, induced e.g., by local variables or inheritance. Our analysis transitively traverses the dependency graph, i.e., also those types that are indirectly referenced by reused types are included in the resulting set of reused types. The analysis approach ensures that in contrast to counting the whole library as reused code, only the subset that is actually referenced by the project is considered. The rationale for this is that a project can incorporate a large library but use only a small fraction of it. To quantify black-box reuse, the analysis measures the size of the reused types by computing their aggregated byte code size; a simplified sketch of this traversal is given below. The black-box analysis is based on the BCEL library\\footnote{http://jakarta.apache.org/bcel} that provides byte code processing functionality.\n\nOur analysis can lead to an overestimation of reuse as we always include whole types although only specific methods of a type may actually be reused. Moreover, a method may reference certain types but the method itself could be unreachable. On the other hand, our approach can lead to an underestimation of reuse as the implementations of interfaces are not considered as reused unless they are discovered on another path of the dependency search. Details regarding this potential error can be found in the section that discusses the threats to validity (Section 6).\n\nAlthough reuse of the Java API is not covered by our definition of software reuse, we also measured reuse of the Java API, since potential variations in the reuse rates of the Java API are worthwhile to investigate. Since every Java class inherits from \\texttt{java.lang.Object} and thereby (transitively) references a significant part of the Java API classes, even a trivial Java program exhibits\u2014according to our analysis\u2014a certain amount of black-box reuse. To determine this baseline, we performed the analysis for an artificial minimal Java program that only consists of an empty \\texttt{main} method. This baseline of black-box reuse of the Java API consisted of 2,082 types and accounted for about 5 MB of byte code. We investigated the reason for this rather large baseline and found that \\texttt{Object} has a reference to \\texttt{Class} which in turn references \\texttt{ClassLoader} and \\texttt{SecurityManager}. These classes belong to the core functionality for running Java applications. Other referenced parts include the Reflection API and the Collection API. Due to the special role of the Java API, we captured the numbers for black-box reuse of the Java API separately.
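As an illustration of this traversal, the sketch below computes the aggregated byte-code size of all library types reachable from a project's own types. It is a simplified sketch, not the study's implementation: the constant-pool lookup that BCEL performs is abstracted behind a hypothetical TypeIndex interface, and all names are ours.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: transitive closure of type references, as used to quantify black-box reuse.
public class BlackBoxReuse {

    /** Abstracts the byte-code lookup (the study uses BCEL to read the constant pool). */
    interface TypeIndex {
        Set<String> referencedTypes(String typeName); // types named in the constant pool
        long byteCodeSize(String typeName);           // size of the .class file in bytes
        boolean isProjectType(String typeName);       // true for the project's own code
    }

    /** Returns the aggregated byte-code size of all library types reachable from the project. */
    static long reusedBytes(Set<String> projectTypes, TypeIndex index) {
        Set<String> visited = new HashSet<>(projectTypes);
        Deque<String> queue = new ArrayDeque<>(projectTypes);
        long reused = 0;
        while (!queue.isEmpty()) {
            for (String ref : index.referencedTypes(queue.poll())) {
                if (visited.add(ref)) {
                    queue.add(ref);
                    if (!index.isProjectType(ref)) {
                        reused += index.byteCodeSize(ref); // count each reused type once
                    }
                }
            }
        }
        return reused;
    }
}
```

Because every visited type is counted at most once, a library pulled in by many project classes does not inflate the measured reuse; only the subset of types actually reachable from the project contributes to the aggregated size.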
All black-box reuse analyses were performed with a Sun Java Runtime Environment for Linux 64 Bit in version 1.6.0.20.\n\n4 Results\n\nThis section contains the results of the study in the order of the research questions.\n\n4.1 RQ 1: Do Open Source Projects Reuse Software?\n\nThe reuse analyses revealed that 18 of the 20 projects do reuse software from third parties, i.e., of the analyzed projects 90\\% reuse code. \\texttt{HSQLDB} and \\texttt{YouTube Downloader} were the only projects for which no reuse\u2014neither black-box nor white-box\u2014was found.\n4.2 RQ 2: How Much White-Box Reuse Occurs?\n\nWe attempt to answer this question by a combination of automatic techniques (clone detection) and manual inspections. The clone detection between the code of the study objects and the libraries from Table 2 reported 337 clone classes (i.e., groups of clones) with 791 clone instances all together. These numbers only include clones between a study object and one or more libraries; clones within the study objects or the libraries were not considered. As we had HSQLDB both in our set of study objects and the libraries used, we discarded all clones between these two.\n\nManual inspection of these clones led to the observation that, typically, all clones are in just a few of the file pairs which are nearly completely covered by clones. So, the unit of reuse (as far as we found it) is the file/class level; single methods (or sets of methods) were not copied. Most of the copied files were not completely identical. These changes are caused either by minor modifications to the files after copying them to the study objects, or (more likely) due to different versions of the libraries used. As the differences between the files were minor, we counted the entire file as copied if the major part of it was covered by clones.\n\nBy manual inspection of the study objects we found entire libraries copied in four of the study objects. These libraries were either less well-known (GNU ritopt), no longer available as individual project (microstar XML parser), or not released as an individual project but rather extracted from another project (OSM JMapViewer). All of these could not be found by the clone detection algorithm, as the corresponding libraries were not part of our original set.\n\nThe results for the duplicated code found by clone detection and the code found during manual inspection are summarized in Table 3. 
The results for the duplicated code found by clone detection and the code found during manual inspection are summarized in Table 3. The clone detection and manual inspection columns are given in LOC; the last column gives the overall amount of white-box reused code relative to the project's size.

| System | Clone Detection (LOC) | Manual Inspection (LOC) | Overall Percent |
|-------------------------|-----------------------|-------------------------|-----------------|
| Azureus/Vuze | 1,040 | 57,086 | 7.39% |
| Buddi | — | — | — |
| DavMail | — | — | — |
| DrJava | — | — | — |
| FreeMind | — | — | — |
| HSQLDB | — | — | — |
| iReport-Designer | 298 | — | 0.09% |
| JabRef | — | 7,725 | 7.06% |
| JEdit | 7,261 | 9,333 | 9.39% |
| MediathekView | — | — | — |
| Mobile Atlas Creator | — | 2,577 | 7.02% |
| OpenProj | — | 87 | 0.06% |
| PDF Split and Merge | — | — | — |
| RODIN | — | 382 | 0.14% |
| soapUI | — | 2,120 | 0.89% |
| SQuirreL SQL Client | — | — | — |
| subsonic | — | — | — |
| Sweet Home 3D | — | — | — |
| TV-Browser | — | 513 | 0.27% |
| YouTube Downloader | — | — | — |
| Overall | 11,701 | 76,721 | n.a. |

For 11 of the 20 study objects no white-box reuse whatsoever could be proven. For another 5, reuse is below 1%. However, there are also 4 projects with white-box reuse in the range of 7% to 10%. The overall LOC numbers in the last row indicate that the amount of code resulting from copying entire libraries far outnumbers the code reused by more selective copy&paste.

4.3 RQ 3: How Much Black-Box Reuse Occurs?

Figure 1 illustrates the absolute bytecode size distributions between the project code (own), the reused parts of the libraries (3rd party) and the Java API, ordered in descending order by the total amount of bytecode. The horizontal line indicates the baseline usage of the Java API. The reuse of third party libraries ranged between 0 MB and 42.2 MB. The amount of reuse of the Java API was similar among the analyzed projects and ranged between 12.9 MB and 16.6 MB. The median was 2.4 MB for third party libraries and 13.3 MB for the Java API. The project iReport-Designer reused the most functionality in a black-box fashion, both from libraries and from the Java API. The project with the smallest extent of black-box reuse was YouTube Downloader.

Figure 2 is based on the same data but shows the relative distributions of the bytecode size. The projects are ordered in descending order by the total amount of relative reuse. The relative reuse from third party libraries ranged from 0% to 61.7% with a median of 11.8%. The relative amount of reused code from the Java API ranged between 23.0% and 99.3% with a median of 73.0%. Overall (third party and Java API combined), the relative amount of reused code ranged between 41.3% and 99.9% with a median of 85.4%. The project iReport-Designer had the highest black-box reuse rate. YouTube Downloader used the most code from the Java API relative to its own code size. For 19 of the 20 projects, the amount of reused code was larger than the amount of own code. Of the overall amount of reused code in the sample projects, 34% stemmed from third party libraries and 66% from the Java API.

Figure 3 illustrates the relative byte code size distributions between the own code and third party libraries, i.e., without considering the Java API as a reused library. The projects are ordered in descending order by reuse rate. The relative amount of reused library code ranged from 0% to 98.9% with a median of 45.1%. For 9 of the 20 projects the amount of reused code from third party libraries was larger than the amount of own code.
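For clarity, the relative figures reported for Figures 2 and 3 are plain ratios over the measured byte-code sizes; the numbers in the snippet below are partly made up for illustration.

```python
def reuse_shares(own_mb, third_party_mb, java_api_mb):
    """Relative reuse as plotted in Figures 2 and 3 (all sizes in MB of byte code)."""
    total = own_mb + third_party_mb + java_api_mb
    return {
        "third_party": third_party_mb / total,
        "java_api": java_api_mb / total,
        "reused_overall": (third_party_mb + java_api_mb) / total,
        # Figure 3 view: the Java API is not counted as a reused library
        "libraries_only": third_party_mb / (own_mb + third_party_mb),
    }

# Illustrative project: 4.0 MB own code (made up), plus the reported median values of
# 2.4 MB reused library code and 13.3 MB reused Java API code.
print(reuse_shares(own_mb=4.0, third_party_mb=2.4, java_api_mb=13.3))
```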
5 Discussion

The data presented in the previous sections lead to interesting insights into the current state of open source Java development, but also open up new questions that were not part of our study setup. We discuss both in the following sections.

5.1 Extent of Reuse

Our study reveals that software reuse is common among open source Java projects, with black-box reuse as the predominant form. None of the 20 projects analyzed has less than 40% black-box reuse when the Java API is included. Even without the Java API, the median reuse rate is still above 40% and only 4 projects are below the 10% threshold. In contrast, white-box reuse is found in only about half of the projects and never exceeds 10% of the code.

This difference can probably be explained by the increased maintenance effort that is commonly associated with white-box reuse, as described by Jacobson et al. [7] and Mili et al. [14]. The detailed results of RQ 2 also revealed that larger parts consisting of multiple files were mostly copied if either the originating library was no longer maintained or the files were never released as an individual library. In both cases the project's developers have to maintain the reused code themselves anyway, which removes the major criticism of white-box reuse.

It also seems that the amount of code reused from third party libraries seldom exceeds the amount of code reused from the Java API. The only projects for which this is not the case are iReport-Designer, RODIN and soapUI; the first two are built on the NetBeans and Eclipse platforms, respectively, which provide rich functionality on top of the Java API.

Based on our data, it is obvious that the early visions of reusable components that only have to be connected by small amounts of glue code, and that would lead to reuse rates beyond 90%, are not realistic today. On the other hand, the reuse rates we found are high enough to have a significant impact on the development effort. We expect that software reuse, as it is also fostered by the open source movement, contributes substantially to the rich set of applications available today.

5.2 Influence of Project Size on Reuse Rate

The amount of reuse varies considerably across projects. While PDF Split and Merge is just a very thin wrapper around existing libraries, there are also large projects with (relatively) small reuse rates (e.g., less than 10% for Azureus when the Java API is not counted).

Motivated by a study by Lee and Litecky [10], we investigated a possible correlation between code size and reuse rate in our data set. Their study was based on a survey with 73 samples in the domain of commercial Ada development and found a negative influence of software size on the rate of reuse. For the reuse rate without the Java API (only third party code) we found a Spearman correlation coefficient of 0.05 with the size of the project's own code (two-tailed p-value: 0.83). Thus, we cannot infer any dependence between these values. If we use the overall reuse rate (including the Java API), the Spearman coefficient is -0.93 (p-value < 0.0001), which indicates a significant and strong negative correlation. This confirms the result of [10] that project size typically reduces the reuse rate.
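The rank correlation used here is standard; a minimal sketch of the computation, assuming SciPy is available and using placeholder values since the per-project numbers are not reproduced in this form:

```python
from scipy.stats import spearmanr

# Placeholder per-project values: own code size (kLOC) and overall reuse rate (%).
own_size_kloc = [12, 45, 230, 8, 90, 310, 27, 150]
reuse_rate    = [88, 72, 55, 95, 60, 48, 81, 52]

rho, p_value = spearmanr(own_size_kloc, reuse_rate)  # two-tailed p-value by default
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```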
5.3 Types of Reused Functionality

It is interesting to investigate what kind of functionality is actually reused by software. We therefore tried to categorize all reused libraries into groups of common functionality. To this end, we analyzed the purpose of each reused library and divided the libraries into seven categories (e.g., Networking, Text/XML, Rich Client Platforms or Graphics/UI). To determine to what extent a certain type of functionality is reused, we employed the black-box reuse detection algorithm presented in Section 3.3 to calculate, for each library, the amount of bytecode that is reused inside a project.

We observed that there is no predominant type of reused functionality and that nearly all projects reuse functionality belonging to more than one category. We believe there is no significant insight to report beyond the observation that reuse is spread across the categories and is not concentrated on a single purpose.

6 Threats to Validity

This section discusses potential threats to the internal and external validity of the results presented in this paper.

6.1 Internal Validity

The amount of reuse measured fundamentally depends on the definition of software reuse and the techniques used to measure it. We discuss possible flaws that can lead to an overestimation of the actual reuse, an underestimation, or otherwise threaten our results.

Overestimation of reuse. The measurement of white-box reuse used the results of a clone detection, which can contain false positives. Thus, not all reported clones indicate actual reuse. To mitigate this, we manually inspected the clones found. Additionally, for both the automatically and the manually found duplicates, it is not known a priori whether the code was copied into the study objects or rather from them. We therefore verified all findings manually, for example by checking the header comments, to ensure that the code was actually copied from the library into the study object.

Our estimation of black-box reuse is based on static references in the bytecode. We consider a class as completely reused if it is referenced, which may not be the case. For example, the method holding the reference to another class might never be called. Another possibility would be to use dynamic analysis and execution traces to determine the amount of reused functionality. However, this approach has the disadvantage that only a finite subset of all execution traces could be considered, leading to a potentially large underestimation of reuse.

Underestimation of reuse. The application of clone detection was limited to a fixed set of libraries. Thus, copied code could be missed because the source it was taken from was not included in our comparison set. Additionally, the detector might miss actual clones (low recall) due to weak normalization settings. To address this, we chose settings that yield higher recall (at the cost of precision). The manual inspection of the study objects' code for further white-box reuse is inherently incomplete; due to the large amounts of code, only the most obvious copied parts could be found.

The static analysis used to determine black-box reuse misses certain dependencies, such as method calls performed via Java's reflection mechanism or classes that are loaded based on configuration information. Additionally, our analysis cannot penetrate the boundaries created by Java interfaces. The actual implementations used at run-time (and their dependencies) might therefore not be included in our reuse estimate.
To mitigate this, one could search for an implementing class and include the first match into the further dependency search and the result set. However, preliminary experiments showed that this approach leads to a large overestimation. For example a command line program that references an interface that is also implemented by a UI class could lead us to the false conclusion that the program reuses UI code.\n\nThere are many other forms of software reuse that are not covered by our approach. One example are reusable generators. If a project uses a code generator to generate source code from models, this would not be detected as a form of reuse by our approach. Moreover, there are many other ways in which software components can interact with each other besides use dependencies in the source code. Examples are inter-process communication, web services that utilize other services via SOAP calls, or the integration of a database via an SQL interface.\n\n6.2 External Validity\n\nWhile we tried to use a comprehensible way of sampling the study objects, it is not clear to what extent they are representative for the class of open source Java programs. First, the choice of Sourceforge as source for the study objects could bias our selection, as a certain kind of open source developers could prefer other project repositories (such as Google Code). Second, we selected the projects from the 50 most downloaded ones, which could bias our results.\n\nAs the scope of the study are open source Java programs, transferability of the results to other programming languages or commercially developed software is unclear. Especially the programming language is expected to have a huge impact on reuse, as the availability of both open source and commercial reusable code heavily depends on the language used.\n\n7 Related Work\n\nSoftware reuse is a research field with an extensive body of literature. An overview of different reuse approaches can be found in the survey from Krueger [9]. In the\nfollowing, we focus on empirical work that aims at quantifying the extent of software reuse in real software projects.\n\nIn [18], Sojer et al. investigate the usage of existing open source code for the development of new open source software by conducting a survey among 686 open source developers. They analyze the degree of code reuse with respect to developer and project characteristics. They report that software reuse plays an important role in open source development. Their study reveals that a mean of 30% of the implemented functionality in the projects of the survey participants is based on reused code. Since Sojer et al. use a survey to analyze the extent of code reuse, the results may be subject to inaccurate estimates of the respondents. Our approach analyzes the source code of the projects and therefore avoids this potential inaccuracy. Our results are confirmed by their study, since they also report that software reuse is common in open source projects.\n\nHaefliger et al. [4] analyzed code reuse within six open source projects by performing interviews with developers as well as inspecting source code, code modification comments, mailing lists and project web pages. Their study revealed that all sample projects reuse software. Moreover, the authors found that by far the dominant form of reuse within their sample was black-box reuse. In the sample of 6 MLOC, 55 components which in total account for 16.9 MLOC were reused. Of the 6 MLOC, only about 38 kLOC were reused in a white-box fashion. 
The developers also confirmed that this form of reuse occurs only infrequently and in small quantities. Their study is related to ours, however the granularity for the black-box analysis was different. While they treated whole components as reusable entities, we measured the fraction of the library that is actually used. Since they use code repository commit comments for identifying white-box reuse, their results are sensitive with regards to the accuracy of these comments. In contrast, our method utilizes clone detection and is therefore not dependent on correct commit comments. Their study confirms our finding that black-box is the by far predominant form of reuse.\n\nIn [16], Mockus investigates large-scale code reuse in open source projects by identifying components that are reused among several projects. The approach looks for directories in the projects that share a certain fraction of files with equal names. He investigates how much of the files are reused among the sample projects and identify what type of components are reused the most. In the studied projects, about 50% of the files were used in more than one project. Libraries reused in a black-box fashion are not considered by his approach. While Mockus\u2019 work quantifies how often code entities are reused, our work quantifies the fraction of reused code compared to the own code within projects. Moreover, reused entities that are smaller than a group of files are not considered. However, their results are in line with our findings regarding the observation that code reuse is commonly practiced in open source projects.\n\nIn [10], Lee et al. report on an empirical study that investigates how organizations employ reuse technologies and how different criteria influence the reuse rate in organizations using Ada technologies. They surveyed 500 Ada professionals from the ACM Special Interest Group on Ada with a one-page questionnaire.\nThe authors determine the amount of reuse with a survey. Therefore their results may be inaccurate due to subjective judgement of the respondents. Again, our approach mitigates this risk by analyzing the source code of the project.\n\nIn [19], von Krogh et al. report on an exploratory study that analyzes knowledge reuse in open source software. The authors surveyed the developers of 15 open source projects to find out whether knowledge is reused among the projects and to identify conceptual categories of reuse. They analyze commit comments from the code repository to identify accredited lines of code as a direct form of knowledge reuse. Their study reveals that all the considered projects do reuse software components. Our observation that software reuse is common in open source development is therefore confirmed by their study. Like Haefliger et al., Krogh et al. rely on commit comments of the code repository with the already mentioned potential drawbacks.\n\nBasili et al. [1] investigated the influence of reuse on productivity and quality in object-oriented systems. Within their study, they determine the reuse rate for 8 projects developed by students with a size ranging from about 5 kSLOCs to 14 kSLOCs. While they report reuse rates in a similar range as those from our results, they analyzed rather small programs written by students in the context of the study. 
In contrast to that, we analyzed open source projects.\n\n8 Conclusions and Future Work\n\nSoftware reuse, often called the holy grail of software engineering, has certainly not been found in the form of reusable components that simply need to be plugged together. However, our study not only shows that reuse is common in almost all open source Java projects but also that significant amounts of software are reused: Of the analyzed 20 projects 9 projects have reuse rates of more than 50%\u2014even if reuse of the Java API is not considered. Reassuringly, these reuse rates are to a great extent realized through black-box reuse and not by copy&pasting source code.\n\nWe conclude that in the world of open-source Java development, high reuse rates are not a theoretical option but are achieved in practice. Especially, the availability of reusable functionality, which is a necessary prerequisite for reuse to occur, is well-established for the Java platform.\n\nAs a next step, we plan to extend our studies to other programming ecosystems and other development models. In particular, we are interested in the extent and nature of reuse for projects implemented in legacy languages like COBOL and PL/1 on the one hand and currently hyped languages like Python and Scala on the other hand. Moreover, our future studies will include commercial software systems to investigate to what extent the open-source development model promotes reuse.\n\nAcknowledgment\n\nThe authors want to thank Elmar Juergens for inspiring discussions and helpful comments on the paper.\nReferences\n\n1. Basili, V., Briand, L., Melo, W.: How reuse influences productivity in object-oriented systems. Communications of the ACM 39(10), 116 (1996)\n2. Clements, P., Northrop, L.M.: Software Product Lines: Practices and Patterns, 6th edn. Addison-Wesley, Reading (2007)\n3. Frakes, W., Kang, K.: Software reuse research: Status and future. IEEE Transactions on Software Engineering 31(7), 529\u2013536 (2005)\n4. Haefliger, S., Von Krogh, G., Spaeth, S.: Code Reuse in Open Source Software. Management Science 54(1), 180\u2013193 (2008)\n5. Hummel, B., Juergens, E., Heinemann, L., Conradt, M.: Index-Based Code Clone Detection: Incremental, Distributed, Scalable. In: ICSM 2010 (2010)\n6. Hummel, O., Atkinson, C.: Using the web as a reuse repository. In: Morisio, M. (ed.) ICSR 2006. LNCS, vol. 4039, pp. 298\u2013311. Springer, Heidelberg (2006)\n7. Jacobson, I., Griss, M., Jonsson, P.: Software reuse: architecture, process and organization for business success. Addison-Wesley, Reading (1997)\n8. Kelly, S., Tolvanen, J.-P.: Domain-Specific Modeling. Wiley, Chichester (2008)\n9. Krueger, C.: Software reuse. ACM Comput. Surv. 24(2), 131\u2013183 (1992)\n10. Lee, N., Litecky, C.: An empirical study of software reuse with special attention to Ada. IEEE Transactions on Software Engineering 23(9), 537\u2013549 (1997)\n11. Lim, W.: Effects of reuse on quality, productivity, and economics. IEEE Software 11(5), 23\u201330 (2002)\n12. Lindholm, T., Yellin, F.: Java virtual machine specification. Addison-Wesley Longman Publishing Co., Inc., Boston (1999)\n13. McIlroy, M., Buxton, J., Naur, P., Randell, B.: Mass produced software components. In: Software Engineering Concepts and Techniques, pp. 88\u201398 (1969)\n14. Mili, H., Mili, A., Yacoub, S., Addy, E.: Reuse-Based Software Engineering: Techniques, Organizations, and Controls. Wiley Interscience, Hoboken (2001)\n15. Mili, H., Mili, F., Mili, A.: Reusing software: Issues and research directions. 
IEEE Transactions on Software Engineering 21(6), 528–562 (1995)
16. Mockus, A.: Large-scale code reuse in open source software. In: FLOSS 2007 (2007)
17. Ravichandran, T., Rothenberger, M.: Software reuse strategies and component markets. Communications of the ACM 46(8), 109–114 (2003)
18. Sojer, M., Henkel, J.: Code Reuse in Open Source Software Development: Quantitative Evidence, Drivers, and Impediments. JAIS (to appear, 2011)
19. von Krogh, G., Spaeth, S., Haefliger, S.: Knowledge Reuse in Open Source Software: An Exploratory Study of 15 Open Source Projects. In: HICSS 2005 (2005)
20. Wohlin, C., Runeson, P., Höst, M.: Experimentation in software engineering: An introduction. Kluwer Academic, Dordrecht (2000)
{"id": "b23184d213af49339a95d5dff5e95b2853bf1df6", "text": "On the impact of using trivial packages: an empirical case study on npm and PyPI\n\nRabe Abdalkareem1 \u00b7 Vinicius Oda1 \u00b7 Suhaib Mujahid1 \u00b7 Emad Shihab1\n\nPublished online: 9 January 2020\n\u00a9 Springer Science+Business Media, LLC, part of Springer Nature 2020\n\nAbstract\nCode reuse has traditionally been encouraged since it enables one to avoid re-inventing the wheel. Due to the npm left-pad package incident where a trivial package led to the breakdown of some of the most popular web applications such as Facebook and Netflix, some questioned such reuse. Reuse of trivial packages is particularly prevalent in platforms such as npm. To date, there is no study that examines the reason why developers reuse trivial packages other than in npm. Therefore, in this paper, we study two large platforms npm and PyPI. We mine more than 500,000 npm packages and 38,000 JavaScript applications and more than 63,000 PyPI packages and 14,000 Python applications to study the prevalence of trivial packages. We found that trivial packages are common, making up between 16.0% to 10.5% of the studied platforms. We performed surveys with 125 developers who use trivial packages to understand the reasons and drawbacks of their use. Our surveys revealed that trivial packages are used because they are perceived to be well implemented and tested pieces of code. However, developers are concerned about maintaining and the risks of breakages due to the extra dependencies trivial packages introduce. To objectively verify the survey results, we validate the most cited reason and drawback. We find that contrary to developers\u2019 beliefs only around 28% of npm and 49% PyPI trivial packages have tests. However, trivial packages appear to be \u2018deployment tested\u2019 and to have similar test, usage and community interest as non-trivial packages. On the other hand, we found that 18.4% and 2.9% of the studied trivial packages have more than 20 dependencies in npm and PyPI, respectively.\n\nKeywords Trivial packages \u00b7 JavaScript \u00b7 Node.js \u00b7 Python \u00b7 npm \u00b7 PyPI \u00b7 Code reuse \u00b7 Empirical studies\n\n1 Introduction\nCode reuse, in the form of combining related functionalities in packages, has been encouraged due to the fact that it can reduce the time-to-market, improve software quality and\n\nCommunicated by: Arie van Deursen\n\nRabe Abdalkareem\nrab_abdu@encs.concordia.ca\n\nExtended author information available on the last page of the article.\nboost overall productivity (Basili et al. 1996; Lim 1994; Mohagheghi et al. 2004). Therefore, it is no surprise that platforms such as Node.js encourage reuse and attempt to facilitate code sharing, often delivered as packages or modules\\(^1\\) that are available on package management platforms, such as the Node Package Manager (\\textit{npm}) and Python Package Index (\\textit{PyPI}) (npm 2016; Bogart et al. 2016).\n\nHowever, it is not all good news. There are many cases where code reuse has had negative effects, leading to an increase in maintenance costs and even legal action (McCamant and Ernst 2003; Orsila et al. 2008; Inoue et al. 2012; Abdalkareem et al. 2017a). For example, an incident of code reuse of a JavaScript package called left-pad, which was used by Babel, caused interruptions to some of the largest Internet sites, e.g., Facebook, Netflix, and Airbnb. 
Many referred to the incident as the case that 'almost broke the Internet' (Macdonald 2016; Williams 2016). The incident led to many heated discussions about code reuse, sparked by David Haney's blog post "Have We Forgotten How to Program?" (Haney 2016).

While the real reason for the left-pad incident was that npm allowed authors to unpublish packages (a problem which has since been resolved (npm Blog 2016)), it raised awareness of the broader issue of taking on dependencies for trivial tasks that can be easily implemented (Haney 2016). In our previous work (Abdalkareem et al. 2017), we defined and examined trivial packages in npm, and discovered a number of relevant findings:

- Trivial JavaScript packages tend to be small in size and less complex.
- Trivial packages are prevalent, making up approximately 16.8% of all the packages on npm.
- JavaScript developers generally use trivial packages since they believe that trivial packages provide them with well tested and implemented code; however, they are concerned about the management of the extra dependencies.

In addition, we found that in some cases these trivial JavaScript packages can have their own dependencies, imposing significant overhead.

However, one major limitation of the original work was its deep focus on JavaScript and npm in particular (Abdalkareem et al. 2017). For example, questions about the existence of trivial packages (and how they are defined) in other package management platforms remain. Also, whether the perceived advantages (e.g., that trivial packages are well tested) and disadvantages (e.g., management of additional dependencies) of using trivial packages generalize beyond JavaScript developers remains unanswered.

Hence, this paper extends our previous work (Abdalkareem et al. 2017) to strengthen the empirical evidence on the use of trivial packages by replicating and extending our study on the Python Package Index (PyPI). We chose to examine the PyPI package management platform since 1) Python is one of the most popular general purpose programming languages, 2) Python has only one main well-established package platform, PyPI, and 3) PyPI is a mature package management platform that has been in existence for more than twelve years. Our extended study provides the following key additions:

- We extended our study of the npm package management platform and increased the npm dataset from 231,092 to 501,001 packages.
- We provide a definition of PyPI trivial packages and examine the prevalence of trivial packages in the Python ecosystem.
- We surveyed 37 Python developers to investigate the reasons for and drawbacks of using trivial packages in the PyPI package management platform.
- We examine the top reasons for and drawbacks of using PyPI trivial packages based on the developer survey.

---

¹ In this paper, we use the term package to refer to a software library that is published on the studied package management platforms.

Altogether, our study involves more than 500,000 npm packages and 38,000 JavaScript applications, and 63,000 PyPI packages and 14,000 Python applications. The study also contains survey results from 125 JavaScript and Python developers.
Our findings indicate that:\n\n**The definition of trivial packages is the same in JavaScript and Python** The developers from the two different package management platforms tended to have the same definition of trivial packages. While we found in the original paper (Abdalkareem et al. 2017) that npm trivial packages are packages that have $\\leq 35$ LOC and a McCabe\u2019s cyclomatic complexity $\\leq 10$, we also found that PyPI trivial packages have the same definition.\n\n**Trivial packages are common and popular in both, npm and PyPI management platforms** Of the 501,001 npm and 63,912 PyPI packages in our dataset, 16.0% and 10.6% of them are trivial packages. Moreover, of the 38,807 JavaScript and 14,717 Python applications on GitHub, 26.1% and 6.9% of them directly depend on one or more trivial packages.\n\n**JavaScript and Python developers differ in their perception of trivial packages** Only 23.9% of JavaScript developers considered the use of trivial packages as bad, whereas, 70.3% of Python developers consider the use of trivial package as a bad practice.\n\n**Developers believe that trivial packages provide them with well implemented/tested code and increase productivity** At the same time, the increase in dependency overhead and the risk of breakage of their applications are the two most cited drawbacks.\n\n**Developers need to be careful which trivial packages they use** Our empirical findings show that many trivial packages have their own dependencies. In npm, 43.2% of trivial packages have at least one dependency and 18.4% of trivial packages have more than 20 dependencies. In PyPI, 36.8% of trivial packages have at least one dependency, and 2.9% have more than 20 dependencies.\n\nTo facilitate the replicability of our work, we make our dataset and the anonymized developer responses publicly available (Abdalkareem et al. 2019).\n\n### 1.1 Paper Organization\n\nThe paper is organized as follows: Section 2 provides the background and introduces our datasets. Section 3 presents how we determine what a trivial package is. Section 4 examines the prevalence of trivial packages and their use in JavaScript and Python applications. Section 5 presents the results of our developer surveys, presenting the reasons and perceived drawbacks for developers who use trivial packages. Section 6 presents our quantitative validation of the most commonly cited reason for and drawback of using trivial packages. The implications of our findings are noted in Section 7. We discuss the related works in Section 8, the limitations of our study in Section 9, and present our conclusions in Section 10.\n2 Background and Case Studies\n\nIn this section, we provide background on the two studied package management platforms, npm and PyPI. We also provide an overview of the dataset collected and used in the rest of our study.\n\n2.1 Node Package Manager (npm)\n\nJavaScript is used to write client and server side applications. The popularity of JavaScript has steadily grown, thanks to popular frameworks such as Node.js and an active developer community (Bogart et al. 2016; Wittern et al. 2016). JavaScript projects can be classified into two main categories: JavaScript packages that are used in other applications or JavaScript applications that are used as standalone software. The Node Package Manager (npm) provides tools to manage JavaScript packages.\n\nTo perform our study, we gather two datasets from two sources. 
We obtain JavaScript packages from the npm registry and applications that use npm packages from GitHub.\n\n**npm Packages:** Since we are interested in examining the impact of \u2018trivial packages\u2019, we mined the latest version of all the JavaScript packages from npm as of September 30, 2017. For each package we obtained its source code from the npm registry. In total, we mined 549,629 packages.\n\n**GitHub JavaScript Applications:** We also want to examine the use of the npm packages in JavaScript applications. Therefore, we mined all of the JavaScript applications on GitHub. To obtain a list of JavaScript applications, we extracted all the applications identified as JavaScript application from the GHTorrent dataset (Gousios et al. 2014). Then, to ensure that we are indeed only obtaining the JavaScript applications from GitHub, and not npm packages, we compare the URL of the GitHub repositories from GHTorrent to all of the URLs we obtained from npm for the packages. If a URL from GitHub was also in npm, we flagged it as being an npm package and removed it from the application list.\n\nTo determine that an application uses npm packages, we looked for the \u2018package.json\u2019 file, which specifies (amongst others) the npm package dependencies used by the application.\n\nFinally, to eliminate dummy applications that may exist in GitHub, we choose non-forked applications with more than 100 commits and more than 2 developers. Similar filtering criteria were used in prior work by Kalliamvakou et al. (2014). In total, we obtained 115,621 JavaScript applications and after removing applications that did not use the npm platform, we were left with 38,807 JavaScript applications.\n\n2.2 Python Package Index (PyPI)\n\nPyPI is the official package management platform for the Python programming language. Python is one of the most popular programming language today, mainly due to its strong community support and versatility, i.e., Python is used in many different domains from game development to server side applications (Vasilescu et al. 2015; Ray et al. 2014). Once again, we distinguish between Python packages, which are used in Python applications and standalone Python applications, which typically use Python packages. Similar to the case of JavaScript, we gather two datasets from two sources to perform our study. We obtain Python packages from the PyPI registry and applications that use PyPI packages from GitHub.\n**PyPI Packages:** We collected the latest versions of the Python packages from PyPI in order to determine which packages are \u2018trivial packages\u2019. PyPI contains around 118,324 packages (Libraries.io 2017), as of September 30, 2017. In total, we were able to obtain 116,905 packages from the PyPI registry since some packages did not exist anymore.\n\n**GitHub Python Applications:** To examine the usage of \u2018trivial packages\u2019 in Python applications, we mined all of the Python applications hosted on GitHub provided by the GHTorrent dataset (Gousios et al. 2014). We followed the same aforementioned process used to gather JavaScript applications, to ensure that we are indeed only obtaining the Python applications from GitHub, and not PyPI package repositories. In a nutshell, we compare the URL of the GitHub repositories to the URLs we obtained from PyPI for the packages. If a URL from GitHub was also in PyPI, we flagged it as being a PyPI package and removed it from the application list. In total, we obtained 14,717 Python applications that are hosted on GitHub. 
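As a concrete illustration of this selection step, the sketch below combines the URL-based exclusion of package repositories with the activity filter described for the JavaScript applications. The field names are illustrative and assume the repository metadata has already been obtained, e.g., from GHTorrent.

```python
def select_applications(candidate_repos, registry_repo_urls):
    """Keep GitHub repositories that look like real applications, not packages.

    candidate_repos    -- iterable of dicts with illustrative fields:
                          'url', 'is_fork', 'n_commits', 'n_developers'
    registry_repo_urls -- set of repository URLs extracted from npm/PyPI package
                          metadata; a match means the repository is itself a package
    """
    applications = []
    for repo in candidate_repos:
        if repo["url"] in registry_repo_urls:   # published package, not an application
            continue
        if repo["is_fork"]:                     # forks are treated as dummy copies
            continue
        if repo["n_commits"] > 100 and repo["n_developers"] > 2:
            applications.append(repo)
    return applications
```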
In addition, to eliminate dummy or immature Python applications that may exist in GitHub, we performed the filtering steps as we did for the JavaScript application. We choose non-forked Python applications with more than 100 commits and more than 2 developers.\n\n### 3 Defining Trivial Packages\n\nAlthough what a trivial package is has been loosely defined in the past (e.g., in blogs (Hemanth 2015; Harris 2015)), we want a more precise and objective way to determine trivial packages. To determine what constitutes a trivial package, we conducted two separate surveys, one for each of the studied package management platforms (npm and PyPI). We mainly asked participants what they considered to be a trivial package and what indicators they used to determine if a package is trivial or not. We conducted two different surveys since: 1) the two studied package management platforms serve different programming languages, 2) developers from the two package management platforms may have different perspective of what they consider to be \u2018trivial packages\u2019.\n\nFor each package management platform (npm and PyPI), we devised an online survey that presented the source code of 16 randomly selected packages that range in size between 4 - 250 JavaScript/Python lines of code (LOC). Participants were asked to 1) indicate if they thought the package was trivial or not and 2) specify what indicators they use to determine a trivial package. We opted to limit the size of the selected packages in the surveys to a maximum of 250 JavaScript/Python LOC since we did not want to overwhelm the participants with the review of excessive amounts of code.\n\nWe asked the survey participants to indicate trivial packages from the list of packages provided. We provided the survey participants with a loose definition of what a trivial package is, i.e., a package that contains code that they can easily code themselves and hence, is not worth taking on an extra dependency for. Figure 1 shows an example of a trivial JavaScript package, called *is-Positive*, which simply checks if a number is positive. The survey questions were divided into three parts: 1) questions about the participant\u2019s development\n\n```javascript\nmodule.exports = function (n) {\n return toString.call(n) === '[object Number]' && n > 0;\n};\n```\n\nFig. 1 Package is-Positive on npm\nbackground, 2) questions about the classification of the provided packages, and 3) questions about what indicators the participant would use to determine a trivial package. For the npm survey, we sent the survey to 22 developers and colleagues that were familiar with JavaScript development and received a total of 12 responses. We also sent the PyPI survey to 18 developers and colleagues that were familiar with Python development and received a total of 13 responses. It is important to note that we sent the two surveys to different groups of developers, to make sure that the participants in one survey are not biased through their experience of participating in the other (i.e., first) survey.\n\n**Participants\u2019 Background and Experience:** The first four columns of Table 1 show the background of participants in the npm survey. Of the 12 respondents, 2 are undergraduate students, 8 are graduate students, and 2 are professional developers. 
Ten of the 12 respondents have at least 2 years of JavaScript experience and half of the participants have been developing with JavaScript for more than five years.\n\nThe last four columns of Table 1 show the background of participants in the PyPI survey. Of the 13 participants in this survey, 9 identified themselves as graduate students and 4 as professional developers working in industry; 7 participants had more than 5 years of Python development experience, 2 respondents had between 3 to 5 years, 3 others had 2 to 3 years of experience, and finally one person had less than 1 year of Python practice. We were happy to have the majority of our respondents be well-experienced with Python.\n\n**Result:** We asked participants of the two surveys to list what indicators they use to determine if a package is trivial or not and to indicate all the packages that they considered to be trivial. Of the 12 participants in the JavaScript survey, 11 (92%) state that the complexity of the code and 9 (75%) state that size of the code are indicators they use to determine a trivial package. Also, 3 (20%) mentioned that they used code comments and other indicators (e.g., functionality) to indicate if a package is trivial or not. The results of the Python survey reveal that 9 (69%) of the developers use size of the code and 9 (69%) of them use complexity of the code as the main indicators to determine trivial packages. Also, 7 (54%) of the participants stated that they use source code comments to determine trivial Python packages and 3 (23%) of the participants mentioned some other indicators that they can use to identify a trivial package. For example one participant related a trivial Python package as \u201cIf it\u2019s only one function\u201d.\n\n| npm | Experience in JavaScript | # | Developers\u2019 position | # | PyPI | Experience in python | # | Developers\u2019 position | # |\n|-----|--------------------------|---|----------------------|---|-----|----------------------|---|----------------------|---|\n| <1 | 2 | | Undergrad Student | 2 | <1 | 1 | | Undergrad Student | 0 |\n| 2 \u2013 3| 3 | | Graduate Student | 8 | 2 \u2013 3| 3 | | Graduate Student | 9 |\n| 3 \u2013 5| 1 | | Professional Developer| 2 | 3 \u2013 5| 2 | | Professional Developer| 4 |\n| >5 | 6 | | \u2013 | \u2013 | >5 | 7 | | \u2013 | \u2013 |\n| Total| 12 | | Total | 12| Total| 13 | | Total | 13|\nSince it is clear that size and complexity are the most common indicators of trivial packages and they are a universal measure that can be measured for both, JavaScript and Python, we use these two measures to determine trivial packages. It should be mentioned that participants could provide more than one indicator, hence the percentages above sum to more than 100%.\n\nNext, we analyze all of the packages that were marked as trivial from the two surveys. Our main goal of this analysis is to find which values of the size and complexity metrics are indicative of trivial packages.\n\n**npm Survey Responses:** In total, we received 69 votes for the 16 packages. We ranked the packages in ascending order, based on their size, and tallied the votes for the most voted packages. We find that 79% of the votes consider packages with less than 35 lines of code to be trivial. We also examine the complexity of the packages using McCabe\u2019s cyclomatic complexity, and find that 84% of the votes marked packages that have a total complexity value of 10 or lower to be trivial. 
It is important to note that although we provide the source code of the packages to the participants, we do not explicitly provide the size or the complexity of the packages to the participants to not bias them towards any specific metrics.\n\n**PyPI Survey Responses:** we received 89 votes for the 16 packages. Similar to the case of npm, we ranked the packages in ascending order, based on their size, and tallied the votes for the most voted packages. We find that 76.4% of the votes consider packages that are equal or less than 35 lines of code to be trivial. We also examine the complexity of the packages using McCabe\u2019s cyclomatic complexity, and find that 79.8% of the votes marked packages that have a total complexity value of 10 or lower to be trivial Python package. Similar to npm, we also did not provide any metric values for the packages to avoid bias.\n\nBased on the aforementioned findings, we used the two indicators JavaScript/Python LOC \u2264 35 and complexity \u2264 10 to determine trivial packages in our dataset. Hence, we define trivial JavaScript/Python packages as \\( \\{ X_{LOC} \\leq 35 \\cap X_{Complexity} \\leq 10 \\} \\), where \\( X_{LOC} \\) represents the JavaScript/Python LOC and \\( X_{Complexity} \\) represents McCabe\u2019s cyclomatic complexity of package \\( X \\). Although we use the aforementioned measures to determine trivial packages, we do not consider this to be the only possible way to determine trivial packages.\n\n### 4 How Prevalent are Trivial Packages?\n\nIn this section, we want to know how prevalent trivial packages are. We examine prevalence from two aspects: the first aspect is from package management platforms (npm and PyPI) perspective, where we are interested in knowing how many of the packages on these two\npackage management platforms are trivial. The second aspect considers the use of trivial packages in JavaScript and Python applications.\n\nTo identify trivial packages in our two datasets, we calculate the LOC and complexity of all the npm and PyPI packages. For the LOC, we calculate the number of lines of source code after removing white space and source code comments. As for the complexity, we use McCabe\u2019s complexity since it is widely used in industry and academia (Ebert and Cain 2016). Then, for each package, we removed test code since we are mostly interested in the actual source code of the packages. To identify and remove the test code, similar to prior work (Gousios et al. 2014; Tsay et al. 2014; Zhu et al. 2014), we look for the term \u201ctest\u201d (and its variants such as \u2018tests\u2019 and/or \u2018TEST_code\u2019) in the file names and file paths. To calculate the LOC and the complexity of every package in our datasets, we use the Understand tool by SciTools (https://scitools.com/). Understand is a source code analysis tool that provides various code metrics and has been extensively used in other work (e.g., Rahman et al. 2019; Castelluccio et al. 2019).\n\n4.1 How Many of npm\u2019s & PyPI\u2019s Packages are Trivial?\n\n**npm**: We use the two measures, LOC and complexity, to determine trivial packages, which we now use to quantify the number of trivial packages in our dataset. Our dataset contained a total of 549,629 npm packages. For each package, we calculated the number of JavaScript code lines and removed packages that had zero LOC, which removed 48,628 packages. 
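To make the classification rule concrete, here is a minimal sketch for the Python side. It assumes the radon package for the LOC and McCabe complexity metrics (the study itself used the Understand tool for both languages), and the helper name is hypothetical.

```python
from pathlib import Path
from radon.raw import analyze           # raw metrics, including source lines of code
from radon.complexity import cc_visit   # McCabe cyclomatic complexity per block

LOC_THRESHOLD, COMPLEXITY_THRESHOLD = 35, 10

def is_trivial_package(package_dir):
    """Apply the 'LOC <= 35 and total complexity <= 10' rule to a Python package.

    Test files are skipped, mirroring the procedure described above; packages with
    zero LOC are reported as None so they can be excluded from the analysis.
    """
    total_loc, total_complexity = 0, 0
    for path in Path(package_dir).rglob("*.py"):
        if "test" in str(path).lower():          # exclude test code (file name or path)
            continue
        source = path.read_text(encoding="utf-8", errors="ignore")
        total_loc += analyze(source).sloc        # source lines, no blanks or comments
        total_complexity += sum(block.complexity for block in cc_visit(source))
    if total_loc == 0:                           # dummy or empty package
        return None
    return total_loc <= LOC_THRESHOLD and total_complexity <= COMPLEXITY_THRESHOLD
```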
We eliminated npm packages that have zero LOC since they are dummy or empty packages that developers publish for various reasons, for example to reserve a unique package name. This left us with a final number of 501,001 packages.

Out of the 501,001 npm packages we mined, 80,232 (16.0%) are trivial packages. In addition, we examined the growth of trivial packages in npm. Figure 2 shows the percentage of trivial to all packages published on npm per month. We see an increasing trend in the number of trivial packages published over time, before the growth of trivial packages stabilized around the beginning of 2015. Overall, approximately 14.0% of the packages added every month are trivial packages. We investigated the spike around March 2016 and found that it corresponds to the time when npm disallowed the un-publishing of packages (npm Blog 2016).

In addition, to see the effect of the left-pad incident on the number of published trivial packages, we investigated the number of trivial npm packages published before and after the incident. Out of the 216,309 npm packages published before the left-pad incident, 34,750 (16.1%) are trivial packages. Of the 284,692 packages published after the incident, 45,482 (16.0%) are trivial packages.

**PyPI:** For the PyPI dataset, we are also interested in separating the trivial packages from the others in terms of LOC and complexity. To do so, we mined the 116,905 packages available on the PyPI registry. However, a package on PyPI can be released/distributed in different formats, and not all of them could be processed: we found that 42,242 PyPI packages are platform exclusive (e.g., Windows .exe or Mac .dmg distributions) or corrupted compressed .gz files that we could not analyze. This left us with 74,663 PyPI packages for which we measured LOC and complexity. We then removed packages that had zero LOC, which removed another 10,751 packages; again, we do not want to count empty packages that exist on PyPI for various reasons, such as learning how to publish packages on PyPI.

Our analysis reveals that out of the 63,912 PyPI packages we analyzed, 6,759 (10.6%) are trivial packages. We again examined the growth of trivial packages in PyPI. Figure 3 shows the percentage of trivial to all packages published on PyPI per month for the period between 2011 and 2017. We see a slight increase in the trend of publishing trivial packages on the PyPI platform, and that trend starts to decrease in late 2013. We also found that approximately 11% of the packages added every month are trivial packages.

We also looked at the percentage of trivial to all packages published before and after the left-pad incident. Out of the 33,335 PyPI packages published prior to the incident, 3,717 (11.2%) are trivial packages, while 3,042 (10.0%) of all packages published after the incident are trivial.

4.2 How Many Applications Depend on Trivial Packages?

**JavaScript Applications:** Just because trivial packages exist on npm does not mean that they are actually being used. We therefore also examine the number of applications that use trivial packages. To do so, we examine the package.json file, which contains all the dependencies that an application installs from npm.
However, in some cases, an application may install a package but not use it. To avoid counting such instances, we parse the JavaScript code of all the examined applications and use regular expressions to detect the required dependency statements, which indicates that the application actually uses the package in its code\\(^2\\). Finally, we measured the number of packages that are trivial in the set of packages used by the applications. Note that we only consider npm packages since it is the most popular package manager for JavaScript packages and other package managers only manage a subset of packages (e.g., Bower (2012) only manages front-end/client-side frameworks, libraries and modules). We find that of the 38,807 applications in our dataset, 10,139 (26.1%) directly depend on at least one trivial package.\n\n**Python Applications:** Similar to the case of JavaScript, we also analyzed the Python applications that depend on trivial packages. In contrast to JavaScript\u2019s availability of a \u2018packages.json\u2019 file, analyzing Python applications presents some challenges to fully identify a given script\u2019s dependency set for the reasons described previously on Section 4.1. We statically parse the source code after relevant \u201cimport\u201d like clauses, along with other statements that allow for verifying that the packages are effectively being put in use (i.e., the package is both supposed to be installed and its functions/definitions are indeed being called, rather than merely being just imported and not used). To facilitate this analysis, we use the popular snakefood (http://furius.ca/snakefood/) tool. The tool generates dependency graphs from Python code through parsing the Abstract Syntax Tree of the Python files. Our analysis showed that out of the 14,717 examined Python applications, 1,024 (6.9%) were found to depend on one or more trivial PyPI package.\n\n---\n\n5 Survey Results\n\nWe surveyed developers to understand the reasons for and the drawbacks of using trivial packages. We used a survey because it allows us to obtain first-hand information from the developers who use these trivial packages. In order to select the most relevant participants, we sent out the survey to developers who use trivial packages. We used Git\u2019s pickaxe command on the lines that contain the required dependency statements in the JavaScript and Python applications. Doing so helped us identify the name and email of the developer who introduced the trivial package dependency.\n\n\\(^2\\)Note that if a package is required in the application, but does not exist, it will break the application.\nSurvey Participants: To mitigate the possibility of introducing misunderstood or misleading questions, we initially sent the survey to two developers and incorporated their minor suggestions to improve the survey. For npm participants, we sent the survey to 1,055 JavaScript developers from 1,696 applications. To select the developers, we ranked them based on the number of trivial packages they use. We then took a sample of 600 developers that use trivial packages the most, and another 600 of those that indicated the least use of trivial packages. The survey was emailed to the 1,200 selected developers, however, since some of the emails were returned for various reasons (e.g., the email account does not exist anymore, etc.), we could only reach 1,055 developers. We also sent the survey to all Python developers after filtering out the invalid and duplicated developers\u2019 emails. 
We successfully sent the survey to 460 Python developers who introduced trivial PyPI packages in 1,024 Python applications in our dataset.

We designed the survey using Google Forms. The survey listed the trivial package and the application in which we detected it. In total, we received 125 developer responses. First, we received 88 responses from the JavaScript developers, which translates to a response rate of 8.3%. Our survey response rate is higher than the typical 5% response rate reported for questionnaire-based software engineering surveys (Singer et al. 2008). The left part of Table 2 shows the JavaScript experience and the position of the developers. The majority (67) of the respondents have more than 5 years of experience, 14 have between 3 and 5 years, and 7 have 1 to 3 years of experience. As for the position of the survey respondents, 83 of the 88 identified as developers working either in industry (68) or as full-time independent developers (15). The remaining 5 identified as casual developers (2) or other (3), including one student and two developers working in executive positions at npm.

Second, we received 37 survey responses from the Python developers, yielding a response rate of 8.04%, which is again in line with what has been observed in other software engineering studies (Singer et al. 2008). The right part of Table 2 shows the Python experience and position of the developers. The vast majority of the respondents (92%) reported more than five years of Python development experience; the remaining 3 respondents reported between 3 and 5 years. Regarding their current position, 27 respondents identified themselves as developers working in industry and 4 as full-time independent developers. The rest identified as casual developers (1) or other (5), including researchers and students.

| npm: Experience in JavaScript | # | npm: Developers' Position | # | PyPI: Experience in Python | # | PyPI: Developers' Position | # |
|---|---|---|---|---|---|---|---|
| 1 - 3 years | 7 | Industrial | 68 | 1 - 3 years | 0 | Industrial | 27 |
| > 3 - 5 years | 14 | Independent | 15 | > 3 - 5 years | 3 | Independent | 4 |
| > 5 years | 67 | Casual | 2 | > 5 years | 34 | Casual | 1 |
| – | – | Other | 3 | – | – | Other | 5 |
| Total | 88 | Total | 88 | Total | 37 | Total | 37 |

The fact that most of the respondents are experienced JavaScript and Python developers gives us confidence in our survey responses.

5.1 Do Developers Consider Trivial Packages Harmful?

The first question of our survey is: "Do you consider the use of trivial packages as bad practice?" The reason for asking this question so bluntly is that it allows us to gauge, in a very direct way, how the developers felt about the issue of using trivial packages. We provided three possible replies: Yes, No, or Other, the last of which offered a text box to elaborate. Figure 4 shows the distribution of responses from both JavaScript and Python developers. Of the 88 JavaScript participants, 51 (57.9%) stated that they do NOT consider the use of trivial packages as bad practice.
Another 21 (23.9%) stated that they indeed think that using trivial package is a bad practice. The remaining 16 (18.2%) stated that it really depends on the circumstances, such as the time available, how critical a piece of code is, and if the package used has been thoroughly tested.\n\nContrary to the case of JavaScript, 26 (70.3%) of the Python developers who responded to our survey generally consider the use of trivial packages as bad practice. Only 3 (8.1%) of survey participants stated that they do not think that using trivial package is a bad practice. The remaining 8 (21.6%) indicate that it really depends on the circumstances. For example, P-PyPI 3 states: \u201cIf the language doesn\u2019t provide such common, inherently useful functionality then fixing this oversight by the use of a third-party library is only reasonable. Moreover, little functionality is actually \u2018trivial\u2019. It may be short to implement but most likely a mistake in it will introduce a bug into the program as surely as a mistake in something \u2018non-trivial\u2019.\u201d\n\n\n\nFig. 4 Developer responses to the question \u201cis using a trivial package bad?\u201d Most JavaScript developers answered no, whereas most Python developers answered yes.\n5.2 Why Do Developers Use Trivial Packages?\n\nWhile we have answered the question as to whether developers say using trivial packages is a bad practice, what we are most interested in is why do developers resort to using trivial packages and what do they view as the drawbacks of using trivial packages. Therefore, the second part of the survey asks participants to list the reasons why they resort to using trivial packages. To ensure that we do not bias the responses of the developers, the answer fields for these questions were in free-form text, i.e., no predetermined suggestions were provided. We then analyze separately the responses from the two surveys (JavaScript and Python). After gathering all of the responses, we grouped and categorized the responses in a two-phase iterative process. In the first phase, two of the authors carefully read the participant\u2019s answers and independently came up with a number of categories that the responses fell under. Next, they discussed their groupings and agreed on the extracted categories. Whenever they failed to agree on a category, the third author was asked to help break the tie. Once all of the categories were decided, the same two authors went through all the answers again and independently classified them into their respective categories. For the majority of the cases, the two authors agreed on most categories and the classifications of the responses. To measure the agreement between the two authors, we used Cohen\u2019s Kappa coefficient (Cohen 1960). The Cohen\u2019s Kappa coefficient has been used to evaluate inter-rater agreement levels for categorical scales, and provides the proportion of agreement corrected for chance. The resulting coefficient is scaled to range between -1 and 1, where a negative value means less than chance agreement, zero indicates exactly chance agreement, and a positive value indicates better than chance agreement (Fleiss and Cohen 1973). In our categorization, the level of agreement measured between the authors was of 0.90 and 0.83 for the npm survey and PyPI survey, respectively, which is considered to be excellent inter-rater agreement.\n\nTable 3 shows the reasons for using trivial packages, as reported by respondents from both JavaScript and Python surveys. 
As we can see from Table 3, the two most cited reasons (i.e., well-implemented & tested and increased productivity) are the same for both the npm and PyPI package management platforms. However, when it comes to the three less common reasons, there is a slight difference between npm and PyPI; most notably, the reason that trivial packages provide better performance was not mentioned in the PyPI survey responses.

Table 3 Reasons for using trivial packages in npm and PyPI

| Reason | Description | npm #Resp. | % | PyPI #Resp. | % |
|-------------------------------|-----------------------------------------------------------------------------|------------|------|-------------|------|
| Well-implemented & tested | Participants state that trivial packages are effectively implemented and tested. | 48 | 54.6% | 20 | 54.1% |
| Increased productivity | Trivial packages reduce the time needed to implement existing source code. | 42 | 47.7% | 12 | 32.4% |
| Well-maintained code | It eases source code maintenance, since other developers maintain the trivial package. | 8 | 9.1% | 2 | 5.4% |
| Improved readability & reduced complexity | Using trivial packages improves source code quality in terms of readability and reduces complexity. | 8 | 9.1% | 5 | 13.5% |
| Better performance | Trivial packages improve the performance of web applications compared to the use of large frameworks. | 3 | 3.4% | 0 | 0.0% |
| No reason | – | 7 | 8.0% | 7 | 18.9% |

Next, we discuss each of the reasons presented in Table 3 in more detail:

R1. **Well-implemented & tested:** The most cited reason for using trivial packages is that they provide well implemented and tested code. More than half of the responses mentioned this reason, with 54.6% and 54.1% of the responses from JavaScript and Python, respectively. In particular, although it may be easy for developers to code these trivial packages themselves, it is more difficult to make sure that all the details are addressed, e.g., one needs to carefully consider all edge cases. Some example responses that mention these issues are stated by participants P-npm 68, P-npm 4, and P-PyPI 5, who cite their reasons for using trivial packages as follows: P-npm 68: “Tests already written, a lot of edge cases captured [...].”, P-npm 4: “There may be a more elegant/efficient/correct/cross-environment-complatable solution to a trivial problem than yours”, and P-PyPI 5: “They have covered extra cases that I would not do or thought initially.”

R2. **Increased productivity:** The second most cited reason is the improved productivity that using trivial packages enables, with 47.7% and 32.4% of the responses from JavaScript and Python, respectively. Trivial tasks or not, writing code on your own requires time and effort; hence, many developers view the use of trivial packages as a way to boost their productivity. In particular, early on in a project, a developer does not want to worry about small details; they would rather focus their efforts on implementing the more difficult tasks. For example, participants P-npm 13 and P-npm 27 from the JavaScript survey state: P-npm 13: “[...] and it does save time to not have to think about how best to implement even the simple things.” & P-npm 27: “Don’t reinvent the wheel! if the task has been done before.”. Another example from the Python survey, participant P-PyPI 17, states: “Often I do write the code myself. And then package it into a re-usable module so that I don’t have to write it again later. And again. And again... 
At this point, whether the module is authored by myself or someone else is mostly irrelevant. What’s relevant is that I get to avoid repeatedly implementing the same functionality for each new project.”

The aforementioned are clear examples of how developers would rather not code something, even if it is trivial. Of course, this comes at a cost, which we discuss later.

R3. **Well-maintained code:** A less common (9.1% and 5.4% of the responses from JavaScript and Python), but still cited, reason for using trivial packages is the fact that the maintenance of the code need not be performed by the developers themselves; in essence, it is outsourced to the community or the contributors of the trivial packages. For example, participants P-npm 45 and P-PyPI 1 state, P-npm 45: “Also, a highly used trivial package is probable to be well maintained.” and P-PyPI 1: “The simple advantages are that they may be trivial AND used by many people and therefore potentially maintained by developers.” Even tasks such as bug fixes are dealt with by the contributors of the trivial packages, which is very attractive to the users of the trivial packages, as reported by participant P-npm 80: “[...], leveraging feedback from a larger community to fix bugs, etc.”

R4. **Improved readability & reduced complexity:** Participants also reported that using trivial packages improves the readability and reduces the complexity of their code, with 9.1% and 13.5% of the responses for the two package management platforms. For example, P-npm 34 states: “immediate clarity of use and readability for other developers for commonly used packages[...].” & P-npm 47 states: “Simple abstract brings less complexity.” Python developers report the same advantage of using trivial packages. For example, P-PyPI 5 states that “Code clarity. When many two liners become one liners it saves space. Its the whole point of batteries included mentally...”

R5. **Better performance:** A few of the JavaScript participants (3.4%) stated that using trivial packages improves performance since it alleviates the need for their application to depend on large frameworks. Notably, the load time of trivial packages compared to larger JavaScript packages is small, which speeds up the overall load time of the applications. For example, P-npm 35 states: “[...] you do not depend on some huge utility library of which you do not need the most part.” While JavaScript developers reported that trivial packages improve performance, the Python developers did not make such a claim. One explanation for this is that JavaScript is used to develop front-end applications, which are often sensitive to performance, i.e., load time, whereas Python is used to implement applications in a wide variety of domains.

Overall, the developer responses show that developers from the two package management platforms perceive the use of trivial packages differently. Only a small percentage (8.0%) of the respondents from JavaScript stated that they do not see a reason to use trivial packages. However, for Python, 18.9% of the respondents believe that there are no advantages to using trivial packages.

5.3 Drawbacks of Using Trivial Packages

In addition to knowing the reasons why developers resort to trivial packages, we wanted to understand the other side of the coin: what they perceive to be the drawbacks of their decision to use these packages. The drawbacks question was part of our survey, and we followed the same aforementioned process to analyze the survey responses. In the case of the drawbacks, the Cohen’s Kappa agreement measure was 0.86 and 0.91 for npm and PyPI, respectively, which is considered to be an excellent agreement.
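To make the agreement computation concrete, the following is a minimal sketch of how Cohen’s Kappa can be computed for two raters’ labels over the same items; the category names and example labels are purely illustrative and are not taken from our actual coding.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labelling the same items with nominal categories."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels only (not our real coding of survey answers).
rater_1 = ["productivity", "tested", "tested", "readability", "tested"]
rater_2 = ["productivity", "tested", "maintained", "readability", "tested"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.71
```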
Table 4 lists the drawbacks mentioned by the survey respondents, along with a brief description and the frequency of each drawback.

Table 4 Drawbacks of using trivial packages in npm and PyPI

| Drawback | Description | npm #Resp. | % | PyPI #Resp. | % |
|------------------------|-----------------------------------------------------------------------------|------|------|--------|------|
| Dependency overhead | Using trivial packages results in a dependency mess that is hard to update and maintain. | 49 | 55.7% | 25 | 67.6% |
| Breakage of applications | Depending on a trivial package could cause the application to break if the package becomes unavailable or has a breaking update. | 16 | 18.2% | 12 | 32.4% |
| Decreased performance | Trivial packages decrease the performance of applications, which includes the time to install and build the application. | 14 | 15.9% | 3 | 8.1% |
| Slows development | Finding a relevant and high quality trivial package is a challenging and time consuming task. | 11 | 12.5% | 4 | 10.8% |
| Missed learning opportunities | The practice of using trivial packages leads to developers not learning and experiencing writing code for trivial tasks. | 8 | 9.1% | 0 | 0% |
| Security | Using trivial packages can open a door for security vulnerabilities. | 7 | 8.0% | 5 | 13.5% |
| Licensing issues | Using trivial packages could cause licensing conflicts. | 3 | 3.4% | 2 | 5.4% |
| No drawbacks | – | 7 | 8.0% | 3 | 8.1% |

As we can see from the table, the top two most cited drawbacks (i.e., dependency overhead and breakage of applications) are the same for both npm and PyPI. However, for the less cited drawbacks, npm developers cited decreased performance, development slow down and missed learning opportunities as the next set of drawbacks, whereas in PyPI, the developers consider security, development slow down and decreased performance as the next set of drawbacks. It is worth noting, however, that there is very little difference between the individual drawbacks (e.g., security vs. development slow down) within the two package management platforms (i.e., npm and PyPI). Next, we discuss each of the drawbacks in more detail:

D1. **Dependency overhead:** The most cited drawback of using trivial packages is the increased dependency overhead, e.g., keeping all dependencies up to date and dealing with complex dependency chains, that developers need to bear (Bogart et al. 2016; Mirhosseini and Parnin 2017). This situation is often referred to as ‘dependency hell’, especially when the trivial packages themselves have additional dependencies. This drawback came through clearly in many comments, which account for 55.7% of the responses from JavaScript developers. For example, P-npm 41 states: “[...] people who don’t actively manage their dependency versions could [be] exposed to serious problems [...]” & P-npm 40: “Hard to maintain a lot of tiny packages”. For Python developers, the percentage of responses related to dependency overhead is high (67.6%) as well. 
Some example responses from Python developers that mention these issues are stated by participants P-PyPI 2, P-PyPI 4 & P-PyPI 13 who state that: P-PyPI 2: \u201c...it\u2019s more difficult to distribute something with a dependency that doesn\u2019t come with Python.\u201d, P-PyPI 4: \u201cLots of brittle dependencies.\u201d & P-PyPI 13: \u201cWhen your projects consist of a lot trivial modules, it becomes almost impossible to track their update and some time you might forget what even they do.\u201d Hence, while trivial packages may provide well-implemented/tested code and improve productivity, developers are clearly aware that the management of the additional dependencies is something they need to deal with.\nD2. **Breakage of applications:** Developers also worry about the potential breakage of their application due to a specific package or version becoming unavailable. JavaScript developers stated this issue in 18.2% of the responses while the percentage is 32.4% for Python developers. For example, in the left-pad issue, the main reason for the breakage was the removal of left-pad, P-npm 4 states: \u201cObviously the whole \u2018left-pad crash\u2019 exposed an issue\u201d & P-PyPI 22 states: \u201cpotential for breaking (NPM leftpad situation)\u201d. However, since that incident, npm has disabled the possibility of a package being removed (npm Blog 2016). Although disallowing the removal solves part of the problem, packages can still be updated, which may break an application. This issue was clear from one of the responses, P-PyPI 7, who stated \u201cPotential for breaking changes from version to version.\u201d For a non-trivial package, it may be worth it to take the risk, however, for trivial packages, it may not be worth taking such a risk.\n\nD3. **Decreased performance:** This issue is related to the dependency overhead drawback. Developers mentioned that incurring the additional dependencies slowed down the build and run time and increased application installation times (15.9% and 8.1%). For example, P-npm 64 states: \u201cToo many metadata to download and store than a real code.\u201d & P-npm 34 states: \u201c[...], slow installs; can make project noisy and unintuitive by attempting to cobble together too many disparate pieces instead of more targeted code.\u201d Another Python developer, P-PyPI 1, states: \u201cIf the modules are not so ubiquitous, then needing the dependency is a real drag as one will have to install it. Also, the same job done with your own may run much faster and be easier to understand. As mentioned earlier, in some cases it is not just the fact that the trivial package adds a dependency, but in some cases the trivial package itself depends on additional packages, which negatively impacts performance even further.\n\nD4. **Slows development:** In some cases, the use of trivial packages may actually have a reverse effect and slow down development with 12.5% & 10.8% of responses from JavaScript and Python developers. For example, as P-npm 23 and P-npm 15 state: P-npm 23: \u201cCan actually slow the team down as, no matter how trivial a package, if a developer hasn\u2019t required it themselves they will have to read the docs in order to double check what it does, rather than just reading a few lines of your own source.\u201d & P-npm 15: \u201c[...], we have the problem of locating packages that are both useful and \u201ctrustworthy\u201d [...].\u201d It can be difficult to find a relevant and trustworthy package. 
Even if others try to build on your code, it is much more difficult to go fetch a package and learn it, rather than read a few lines of your code. Python developers also agree on this issue, for example P-PyPI 15 states \u201cIf finding, reading, and understanding the documentation of a module takes longer than reading its implementation, the hiding of functionality in third-part trivial modules obscures the source base.\u201d\n\nD5. **Missed learning opportunities:** In certain cases reported by only JavaScript developers (9.1%), the use of these trivial packages is seen as a missed learning opportunity for developers. For example, P-npm 24 states: \u201cSometimes people forget how to do things and that could lead to a lack of control and knowledge of the language/technology you are using\u201d. This is a clear example of where just using a package, rather than coding the solution yourself, will lead to less knowledge about the code base. In contrast to JavaScript developers, Python developers seem to not to be worried about this issue since the use of trivial packages is not as common within the Python developer community as JavaScript developers.\n\nD6. **Security:** In some cases the trivial packages may have security flaws that make the application more vulnerable. This is an issue pointed out by a few developers (8.0% and 13.5%), for example, as P-npm 15 mentioned earlier, it is difficult to find\npackages that are trustworthy. Also, P-npm 57 mentions: \u201cIf you depend on public trivial packages then you should be very careful when selecting packages for security reasons\u201d & P-PyPI 3 states \u201cmore dependencies, greater likelihood of not knowing of how code actually works at lower level, security issues.\u201d As in the case of any dependency one takes on, there is always a chance that a security vulnerability could be exposed in one of these packages.\n\nD7. Licensing issues (3.4%): In some cases from both responses (3.4% and 5.4% for JavaScript and Python), developers are concerned about potential licensing conflicts that trivial packages may cause. For example, P-npm 73 states: \u201c[...], possibly license-issues\u201d, P-npm 62: \u201c[...], there is a risk that the \u2018trivial\u2019 package might be licensed under the GPL must be replaced anyway prior to shipping.\u201d P-PyPI 23 also mentions \u201cCan be licensing hell.\u201d\n\nIn general, we observe similar concerns regarding the use of trivial packages in the two software managements platforms studied. There were also approximately 8% of the responses in both package management platforms that stated they do not see any drawbacks with using trivial packages.\n\n6 Putting Developer Perceptions Under the Microscope\n\nThe developer surveys provided us with valuable insights on why developers use trivial packages and what they perceive to be their drawbacks. Whether there is empirical evidence to support their perceptions remains unexplored. Thus, we examine the most commonly cited reason for using trivial packages, i.e., the developers\u2019 belief that trivial packages are well tested, and drawback, i.e., the impact of additional dependencies, based on our findings in Section 5.\n\n6.1 Examining the \u2018Well Tested\u2019 Perception\n\nAs shown in Table 3, more than half of the responses from the studied package management platforms indicate that they use trivial packages because developers believe that they are well implemented and tested. 
However, is this really the case - are trivial packages really well tested? In this section, we want to examine whether this belief has any grounds or not.\n\n6.1.1 Node Package Manager (npm)\n\nnpm requires that developers provide a test script name with the submission of their packages (listed in the package.json file). In fact, 73.7% (59,110 out of 80,232) of the trivial packages in our dataset have some test script name listed. However, since developers can provide any script name under this field, it is difficult to know if a package is actually tested.\n\nWe examine whether a npm package is really well tested and implemented from two aspects; first, we check if a package has tests written for it. Second, since in many cases, developers consider packages to be \u2018deployment tested\u2019, which means that the trivial\npackages are used by many developers, we also consider the usage of a package as an indicator of it being well tested and implemented (Zambonini 2011). To carefully examine whether a package is really well tested and implemented, we use the npm online search tool (known as npms (Cruz and Duarte 2017)) to measure various metrics related to how well the packages are tested, used and valued. To provide its ranking of the packages, npms mines and calculates a number of metrics based on development (e.g., tests) and usage (e.g., no. of downloads) data. We use three metrics measured by npms to validate the \u2018well tested and implemented\u2019 perception of developers, which are:\n\n1) **Tests:** considers the tests\u2019 size, coverage percentage and build status for a project. We looked into the npms source code and found that the Tests metric is calculated as: \n\\[ \\text{testsSize} \\times 0.6 + \\text{buildStatus} \\times 0.25 + \\text{coveragePercentage} \\times 0.15. \\]\nWe use the Tests metric to determine if a package is tested and how trivial packages compare to non-trivial packages in terms of how well tested they are. One example that motivates us to investigate how well tested a trivial package is the response by P-npm 68, who says: \u201cTests already written, a lot edge cases captured [...].\u201d\n\n2) **Community interest:** evaluates the community interest in the packages, using the number of stars on GitHub & npm, forks, subscribers and contributors. Once again, we find through the source code of npms that Community interest is simply the sum of the aforementioned metrics, measured as: \n\\[ \\text{starsCount} + \\text{forksCount} + \\text{subscribersCount} + \\text{contributorsCount}. \\]\nWe use this metric to compare how interested the community is in trivial and non-trivial packages. We measure the community interest since developers view the importance of the trivial packages as evidence of its quality as stated by P-npm 56, who says: \u201c[...] Using an isolated module that is well-tested and vetted by a large community helps to mitigate the chance of small bugs creeping in.\u201d\n\n3) **Download count:** measures the mean downloads for the last three months. Again, the number of downloads of a package is often viewed as an indicator of the package\u2019s quality; as P-npm 61 mentions: \u201cthis code is tested and used by many, which makes it more trustful and reliable\u201d.\n\nAs an initial step, we calculate the number of trivial packages that have a Tests value greater than zero, which means trivial packages that have some tests. We find that only 28.4% of the trivial packages have tests, i.e., a Tests value > 0. 
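For illustration, the two npms formulas quoted above can be written as a small sketch; the function and parameter names are ours and the example values are invented, so this is only an illustration of the scoring, not npms internals.

```python
def tests_score(tests_size, build_status, coverage_percentage):
    """npms-style Tests score, using the weights quoted above (all inputs in [0, 1])."""
    return tests_size * 0.6 + build_status * 0.25 + coverage_percentage * 0.15

def community_interest(stars, forks, subscribers, contributors):
    """npms-style Community interest: a plain sum of the four counts."""
    return stars + forks + subscribers + contributors

# Illustrative values only.
print(tests_score(tests_size=0.5, build_status=1.0, coverage_percentage=0.8))  # 0.67
print(community_interest(stars=120, forks=14, subscribers=9, contributors=5))  # 148
```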
In addition, we compare the values of the Tests, Community interest and Download count for trivial and non-trivial packages. Our focus is on the values of the aforementioned metrics for trivial packages; however, we also present the results for non-trivial packages to put our results in context.

Figure 5 shows the bean-plots for the Tests, Community interest and Download count. In all cases except the Tests value, trivial packages have, on median, smaller Community interest and Download count values compared to non-trivial packages. Fig. 5a shows that for the Tests metric, trivial packages have, on median, a similar value as non-trivial packages. That said, we observe from Fig. 5a that the distribution of the Tests metric is similar for both trivial and non-trivial packages. Most packages have a Tests value of zero, then there are small pockets of packages that have values of approx. 0.30, 0.6, 0.9 and 1.0. In the case of the Community interest and Download count metrics, once again, we see similar distributions, although clearly the median values are lower for trivial packages.

---

\(^3\)It is important to note that the motivation and full derivation (e.g., why they put a weight of 0.15 on the test coverage, etc.) of the metrics is beyond the scope of this paper. We refer interested readers to the npms documentation for more details (Cruz and Duarte 2017). To make our paper self-sufficient, we include how the metrics are calculated here.

To examine whether the difference in metric values between trivial and non-trivial packages is statistically significant, we performed a Mann-Whitney test to compare the two distributions, considering a difference significant at a $p$-value < 0.05. We also use Cliff’s Delta ($d$), which is a non-parametric effect size measure, to interpret the effect size between trivial and non-trivial packages. As suggested in Grissom and Kim (2005), we interpret the effect size value to be small for $d < 0.33$ (positive as well as negative values), medium for $0.33 \leq d < 0.474$ and large for $d \geq 0.474$.

Table 5 shows the $p$-values and effect size values. We observe that in all cases the differences are statistically significant; however, the effect size is small. The results show that although the majority of trivial packages do not have tests written for them, and have statistically lower Community interest and Download count values than non-trivial packages, the effect size of these differences is small.

Table 5 Mann-Whitney Test ($p$-value) and Cliff’s Delta ($d$) for trivial vs. non-trivial packages in npm

| Metrics | $p$-value | $d$ |
|------------------|-----------|--------------|
| Tests | 2.2e-16 | $-0.222$ (small) |
| Community interest | 2.2e-16 | $-0.225$ (small) |
| Downloads count | 2.2e-16 | $-0.261$ (small) |

### 6.1.2 Python Package Index (PyPI)

Since PyPI does not collect any metadata to show whether a Python package is tested or not, we use other data sources to examine the well tested perception. To do so, we use two ways to examine whether Python packages are tested or not: 1) we use the source code of the packages that are hosted on GitHub, and 2) we rely on information about Python packages collected by the open source service libraries.io (https://libraries.io/). libraries.io monitors and collects the metadata of open source packages across 36 different package management platforms. It falls under the CC-BY-SA 4.0 license and has been used in other research work (e.g., Decan et al. 2018a, b). We obtain the extracted metadata information related to the PyPI package management platform.
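Both the npm comparisons above and the PyPI comparisons below follow the same procedure: a Mann-Whitney test for significance and Cliff’s Delta for effect size. A minimal sketch of this procedure, using SciPy for the test, a direct implementation of Cliff’s Delta, and invented sample values, is shown below.

```python
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    """Cliff's Delta: (#pairs with x > y  -  #pairs with x < y) / (len(xs) * len(ys))."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def interpret(d):
    """Thresholds from Grissom and Kim (2005), as used above."""
    d = abs(d)
    return "small" if d < 0.33 else "medium" if d < 0.474 else "large"

# Invented metric values for trivial vs. non-trivial packages (illustration only).
trivial = [0, 0, 3, 5, 8, 12]
non_trivial = [0, 4, 9, 15, 22, 40]

stat, p = mannwhitneyu(trivial, non_trivial, alternative="two-sided")
d = cliffs_delta(trivial, non_trivial)
print(p < 0.05, round(d, 3), interpret(d))
```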
Once again, we examine the testing perception in three complementary ways.

1) **Tests:** we examine if the package has any test code written. Since there is no standard way to determine that a Python application has tests (e.g., there exist more than 100 Python testing tools (https://wiki.python.org/moin/PythonTestingToolsTaxonomy)), we manually investigate whether the PyPI package contains test code or not. The idea is that if developers write tests, then they will put these tests in the package repository. One example that motivated us to look for the test code of a package is the response of P-PyPI 11, who stated “Shorter code overall, well-tested code for fundamental tasks helps smooth over language nits”.

Since this is a heavily manual process, we decide to examine a representative sample of the packages. Therefore, we take a statistically representative sample from the 6,759 Python packages that we identify as trivial Python packages (Section 4.1). The sample is selected randomly, with the sample size chosen to attain a 5% confidence interval at a 95% confidence level. This sampling process results in 364 PyPI trivial packages. Then, two of the authors manually examine the code bases of the sampled packages, looking for test code, to identify the packages that have tests. After that, we measure Cohen’s Kappa coefficient to evaluate the level of agreement between the two annotators (Cohen 1960). As a result of this process, we find the level of agreement between the two authors to be 0.97, which is considered to be excellent agreement. Finally, the two authors discuss the cases that they do not agree on and come to an agreement.

2) **Community interest:** evaluates the community interest in the packages, using the number of stars on GitHub, forks, subscribers and contributors. We adopted the same formula defined by npms, which is basically the sum of the aforementioned metrics, measured as: \( \text{starsCount} + \text{forksCount} + \text{subscribersCount} + \text{contributorsCount} \). We use this metric to compare how interested the community is in trivial and non-trivial packages. We measure the community interest since developers view the importance of the trivial packages as evidence of their quality.

3) **Usage count:** represents the number of applications that use a package. The more applications using a Python package, the more popular that package is, which may also indicate that the package is of high quality. For example, P-PyPI 11 indicated “The simple advantages are that they may be trivial AND used by many people and therefore potentially maintained by developers.” Hence, we use the usage count metric since a package that many developers use in their applications is likely to be of higher quality. To calculate the number of Python applications that use PyPI trivial packages, we use the libraries.io dataset that provides a list of Python applications and the packages they depend on. For each PyPI package in our dataset, we count the number of Python applications that use that package.

We found that out of the 364 sampled trivial Python packages that we manually examined, 185 (50.82%) packages do not have test code in them, while 179 (49.18%) of the examined packages have test code written in them. It is important to note that our analysis only examines whether a trivial package has tests or not; whether these tests are actually effective is a completely different issue, and is one of the reasons for examining the other two metrics, Community interest and Usage count.
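For reference, the sample of 364 packages described above is what the standard finite-population sample-size formula yields for a 95% confidence level and a 5% confidence interval; a small sketch, assuming the usual z = 1.96 and p = 0.5, follows.

```python
import math

def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Finite-population sample size for a given margin of error (confidence interval)."""
    numerator = population * z**2 * p * (1 - p)
    denominator = margin**2 * (population - 1) + z**2 * p * (1 - p)
    return math.ceil(numerator / denominator)

print(sample_size(6759))  # 364, matching the sample of trivial PyPI packages above
```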
Figure 6 shows the bean-plots for the Community interest and Usage count values for trivial and non-trivial Python packages in our dataset. The figures show that, in both cases, trivial Python packages have, on median, smaller Community interest and Usage count values compared to non-trivial packages. We observe from Fig. 6a that, for the Community interest metric, the median value is clearly lower for trivial packages. Figure 6b shows that the distribution of the Usage count metric is similar for both trivial and non-trivial packages. Once again, we examine whether the difference in metric values between trivial and non-trivial packages is statistically significant. We performed a Mann-Whitney test to compare the two distributions and determine if the difference is statistically significant. We also use Cliff’s Delta ($d$) to measure the effect size between PyPI trivial and non-trivial packages. Table 6 shows the $p$-values and effect size values. We observe that in the cases of community interest and usage count, the differences are statistically significant, and the effect size is small and negligible, respectively.

Table 6 Mann-Whitney Test ($p$-value) and Cliff’s Delta ($d$) for trivial vs. non-trivial packages in PyPI

| Metrics | $p$-value | $d$ |
|------------------|-----------|------------|
| Community interest | 2.2e-16 | $-0.251$ (small) |
| Usage count | 0.004557 | $-0.039$ (negligible) |

6.2 Examining the ‘Dependency Overhead’ Perception

As discussed in Section 5, the top cited drawback of using trivial packages is that developers need to take on and maintain extra dependencies, i.e., dependency overhead. Examining the impact of dependencies is a complex and well-studied issue (e.g., de Souza and Redmiles 2008; Decan et al. 2016; Abate et al. 2009) that can be examined in a multitude of ways. We choose to examine the issue from both the application and the package perspectives.

6.2.1 Application-level Analysis

When compared to coding trivial tasks themselves, using a trivial package imposes extra dependencies. One of the most problematic aspects of managing dependencies for applications is when these dependencies are updated, causing a potential to break the application. Therefore, as a first step, we examined the number of releases for trivial and non-trivial packages. The intuition here is that developers need to put in extra effort to ensure the proper integration of new releases. The bean-plots in Figs. 7 & 8 show the distribution of the number of releases for our studied package management platforms. Figure 7a shows that trivial packages on npm have fewer releases than non-trivial packages (the median is 1 for trivial and 2 for non-trivial packages). However, when we examine the number of different release types, we found that trivial and non-trivial npm packages have similar numbers of minor and major releases (Fig. 7c & b). As for the patch releases, trivial npm packages have fewer patch releases. In Fig. 8a, we also observe that trivial packages on PyPI have fewer releases than non-trivial packages. We again examine the number of releases of PyPI packages based on the release type. 
Figures 8b, c, and d show the distribution of minor, major, and patch releases for trivial and non-trivial PyPI packages. From Fig. 8b and c, we do not see any difference between trivial and non-trivial packages for the minor and major releases. As for the patch releases, we observe that trivial PyPI packages have a smaller number of patch releases. The fact that trivial packages are updated less frequently may be attributed to the fact that they provide less functionality, and hence need to be updated less often. In addition, to examine whether the differences in the distribution of the type of releases between trivial and non-trivial packages are statistically significant, we performed a Wilcoxon test. We also use Cliff’s Delta ($d$) to examine the effect size. Table 7 shows the $p$-values and the effect sizes for all the release types for npm and PyPI. It shows that for all the release types the differences are statistically significant, having $p$-values < 0.05. Also, the effect size values are small or negligible.

Table 7 Wilcoxon test ($p$-value) and Cliff’s Delta ($d$) per release type for trivial vs. non-trivial packages in npm and PyPI

| Release type | npm $p$-value | npm $d$ (small) | PyPI $p$-value | PyPI $d$ (small) |
|--------------|---------------|-------------|---------------|-------------|
| All | 2.2e-16 | -0.2016 | 2.2e-16 | -0.2995 |
| Minor | 2.2e-16 | -0.0823 | 2.2e-16 | -0.2447 |
| Major | 2.2e-16 | -0.1185 | 2.2e-16 | -0.1276 |
| Patch | 2.2e-16 | -0.1985 | 2.2e-16 | -0.2729 |

Fig. 7 Distribution of different types of releases for trivial and non-trivial npm packages

Next, we examined how developers choose to deal with the updates of trivial packages. One way that application developers reduce the risk of a package impacting their application is to ‘version lock’ the package. For example, in JavaScript applications that use npm packages, version locking a dependency/package means that it is not updated automatically, and that only the specific version mentioned in the package.json file is used. As stated in a few responses from our survey, e.g., P-npm 8: “[...] Also, people who don’t lock down their versions are in for some pain”. In general, there are different types of version locks, i.e., only updating major releases, updating patches only, updating minor releases or no lock at all, which means the package automatically updates. The version locks are specified in a configuration file next to every package name; for example, npm defines them in the package.json file. We examined the frequency at which trivial and non-trivial packages are locked. For npm, we find that, on average, trivial packages are locked 26.3% of the time, whereas non-trivial packages are locked 28.2% of the time. The Wilcoxon test also shows that the difference is statistically significant with a $p$-value $< 0.05$ ($p$-value $= 9.116e-07$). On the other hand, in PyPI, we find that, on average, trivial packages are locked 31.7% of the time, whereas non-trivial packages are locked 36.2% of the time. Also, the Wilcoxon test shows that the difference is statistically significant with $p$-value $= 9.707e-08$.

Our findings show that trivial packages are locked less often than non-trivial packages in npm, and the same is true in PyPI, where trivial packages are locked less than non-trivial packages. In both cases, however, we find that there is not a large difference between the percentage of packages (trivial vs. non-trivial) being locked.
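To make the notion of a version lock concrete, the sketch below classifies the dependency specifiers found in a package.json manifest; the example manifest is invented and the classification rules are a deliberate simplification of npm’s full semver-range syntax.

```python
import json

def classify(spec):
    """Coarsely classify an npm version specifier (a simplification of semver ranges)."""
    if spec and spec[0].isdigit():
        return "locked"            # exact version, e.g. "4.17.1": never auto-updated
    if spec.startswith("~"):
        return "patch updates"     # e.g. "~4.17.1": patch releases allowed
    if spec.startswith("^"):
        return "minor updates"     # e.g. "^4.17.1": minor and patch releases allowed
    return "unlocked/other"        # "*", ">=1.0.0", "latest", git URLs, ...

# Invented manifest for illustration only.
manifest = json.loads("""{
  "dependencies": {"left-pad": "1.3.0", "lodash": "^4.17.1", "debug": "~2.6.9", "mkdirp": "*"}
}""")
for name, spec in manifest["dependencies"].items():
    print(name, spec, "->", classify(spec))
```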
### 6.2.2 Package-level Analysis

At the package level, we investigate the direct and indirect dependencies of trivial packages. In particular, we would like to determine if the trivial packages have their own dependencies, which makes the dependency chain even more complex. For each trivial and non-trivial package on npm, we install it and then count the actual number of (direct and indirect) dependencies that the package requires. Doing so allows us to know the true (direct and indirect) dependencies that each package requires. Note that simply looking into the package.json file and the `require` statements will provide the direct dependencies, but not the indirect dependencies. Hence, we downloaded all the packages in our npm dataset, mock installed\(^4\) them and built the dependency graph for the npm platform.

\(^4\)We modified the npm code to intercept the install call and counted the installations needed for every package.

Similarly, for PyPI, we count the actual number of (direct and indirect) dependencies that each package requires. To do so, we leveraged the metadata provided by Valiev et al. (2018). In their study, Valiev et al. extracted the list of direct and indirect dependencies of each package on PyPI. We resort to using the data provided in Valiev et al. (2018) since it is recently extracted and covers the history of PyPI for more than six years. We then read the dependencies of each package and build a dependency graph for the PyPI platform.

Figure 9 shows the distribution of dependencies for trivial and non-trivial packages for npm and PyPI. Since most trivial packages have no dependencies, the median is zero. Therefore, we bin the trivial packages based on the number of their dependencies and calculate the percentage of packages in each bin.

Table 8 shows the percentage of packages and their respective number of dependencies for both npm and PyPI. We observe that the majority of npm trivial packages (56.9%) have zero dependencies, 21% have between 1-10 dependencies, 3.8% have between 11-20 dependencies, and 18.4% have more than 20 dependencies. The table also shows that PyPI trivial packages do not have as many dependencies as the npm packages. In fact, 63.2% of PyPI packages have zero dependencies and approx. 34% of trivial packages have between 1-20 dependencies. Only approx. 3% of the PyPI trivial packages have more than 20 dependencies. Interestingly, the table shows that some of the trivial packages in npm have many dependencies, which indicates that, indeed, trivial packages can introduce significant dependency overhead. It also shows that PyPI trivial packages have a small number of dependencies. One explanation for such a difference is that the Python language has a more mature standard API that provides most of the needed utility functionalities.

Table 8 Percentage of packages vs. the number of dependencies (direct & indirect) used in the npm and PyPI package management platforms

| Packages | npm: 0 | npm: 1-10 | npm: 11-20 | npm: >20 | PyPI: 0 | PyPI: 1-10 | PyPI: 11-20 | PyPI: >20 |
|------------|------|------|-------|------|------|------|-------|------|
| Trivial | 56.9% | 21% | 3.8% | 18.4% | 63.2% | 29.6% | 4.3% | 2.9% |
| Non-trivial | 37.1% | 24.1% | 6.8% | 32.1% | 42.5% | 39.4% | 10.7% | 7.4% |

Trivial packages have fewer releases and are less likely to be version locked than non-trivial packages. That said, developers should be careful when using trivial packages, since in some cases, trivial packages can have numerous dependencies. In fact, we find that 43.4% of npm trivial packages have at least one dependency and 18.4% of npm trivial packages have more than 20 dependencies, while 36.8% of PyPI trivial packages have at least one dependency and 2.9% of PyPI trivial packages have more than 20 dependencies.
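The direct vs. indirect distinction used throughout this section can be illustrated with a small sketch that walks a dependency graph and counts all transitive dependencies of a package; the toy graph and package names below are for illustration only and do not reflect our mined dataset.

```python
def all_dependencies(package, graph):
    """Return the set of direct and indirect (transitive) dependencies of a package."""
    seen = set()
    stack = list(graph.get(package, []))      # direct dependencies
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))  # follow indirect dependencies
    return seen

# Toy dependency graph: package -> list of direct dependencies (illustrative).
graph = {
    "is-odd": ["is-number"],
    "is-number": ["kind-of"],
    "kind-of": [],
}
deps = all_dependencies("is-odd", graph)
print(len(deps), sorted(deps))  # 2 ['is-number', 'kind-of']
```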
7 Relevance and Implications

A common question that is asked of empirical studies is: so what? What are the implications of your findings? Why would practitioners care about your findings? We discuss the relevance of our study to the developer community, based on the responses of our survey, and highlight some of the implications of our study.

7.1 Relevance: Do Practitioners Care?

At the start of the study, we were not sure how practically relevant our study of trivial packages would be. However, we were surprised by the interest of developers in our study. In fact, one of the developers (P-npm 39) explicitly mentioned the lack of research on this topic, stating “There has not been enough research on this, but I’ve been taking note of people’s proposed “quick and simple” code to handle the functionality of trivial packages, and it’s surprised me to see the high percentage of times the proposed code is buggy or incomplete.”

Moreover, when we conducted our studies, we asked respondents if they would like to know the outcome of our study and, if so, to provide us with an email address. Of the 125 JavaScript and Python respondents, 81 (approx. 65%) provided their email for us to share the outcomes of our study with them. Some of these respondents hold very high level leadership roles in npm. To us, this is an indicator that our study and its outcomes are of high relevance to the JavaScript and Python development communities.

7.2 Implications of Our Study

Our study has a number of implications for both software engineering practice and research.

7.2.1 Practical Implications

A direct implication of our findings is that trivial packages are commonly used by others, perhaps indicating that developers do not view their use as a bad practice, especially JavaScript developers. Moreover, developers should not assume that all trivial packages are well implemented and tested, since our findings show otherwise. npm developers need to expect more trivial packages to be submitted, making the task of finding the most relevant package even harder. Hence, the issue of how to manage and help developers find the best packages needs to be addressed. For example, P-npm 15 indicated that “... we have the problem of locating packages that are both useful and ‘trustworthy’ in an ever growing sea of packages.” To some extent, npms has been recently adopted by npm to specifically address the aforementioned issue. Developers highlighted that the lack of a decent core or standard JavaScript library causes them to resort to trivial packages. Often, they do not want to install large frameworks just to leverage small parts of the framework, hence they resort to using trivial packages. For example, P-npm 35 states: “especially in JavaScript relieves you from thinking about cross browser compatibility for special cases/coming up with polyfills and testing all edge cases yourself. Basically it’s a substitute for the missing standard library. 
And you do not depend on some huge utility library of which you do not need the most part\u201d & P-PyPI 23 \u201cUsually an indication of the inadequacy of the standard library. This seems particularly so of JavaScript where you might find yourself using many such modules.\u201d Therefore, there is a need by the JavaScript community to create a standard JavaScript API or library in order to reduce the dependence on trivial packages. This issue of creating such a standard JavaScript library is under much debate (Fuchs 2016).\n\n7.2.2 Implications for Future Research\n\nOur study mostly focused on determining the prevalence, reasons for and drawbacks of using trivial packages in two large package management platforms npm and PyPI. Based on our findings, we find a number of implications and motivations for future work. First, our survey respondents indicated that the choice to use trivial packages is not black or white. In many cases, it depends on the team and the application. For example, one survey respondent stated that on his team, less experienced developers are more likely to use trivial packages, whereas the more experienced developers would rather write their own code for trivial tasks. The issue here is that the experienced developers are more likely to trust their own code, while the less experienced are more likely to trust an external package. Another aspect is the maturity of the application. As some of the survey respondents pointed out, they are much more likely to use trivial packages early on in the development life cycle, so they do not waste time on trivial tasks and focus on the more fundamental tasks of their application. Once their application matures, they start to look for ways to reduce dependencies since they pose potential points of failure for their application. Our study motivates future work to examine the relationship between team experience and application maturity and the use of trivial packages.\n\nSecond, survey respondents also pointed out that using trivial packages is seen favourably compared to using code from Questions & Answers (Q&A) sites such as StackOverflow or Reddit. For example, P-npm 84 stated that \u201cI\u2019d have to do research on how to solve a particular problem, peruse questions and answers on StackOverflow, Reddit, or Coderanch, and find the most recent and readable solution among everything I\u2019ve found, then write it myself. Why go through all of this work when you can simply \u2018require()\u2019 someone else\u2019s solution and continue working towards your goal in a matter of seconds?\u201d When compared\nto using code on StackOverflow, where the developer does not know who posted the code, who else uses it or whether the code may have tests or not, using a trivial package that is on npm and/or PyPI is seen as much better option. In this case, using trivial packages is not seen as the best choice, but it is certainly a better choice. Although there have been many studies that examined how developers use Q&A sites such as StackOverflow (Abdalkareem et al. 2017a, b; Wu et al. 2018; Baltes and Diehl 2018), we are not aware of any studies that compare code reuse from Q&A sites and trivial packages. Our findings indicate the need for such a study.\n\n8 Related Work\n\nIn this section, we discuss the work that is related to our study. 
We divided the related work to work related to code reuse in general and work studied software ecosystems.\n\n8.1 Studies of Code Reuse\n\nPrior research on code reuse has shown its many benefits, which include improving quality, development speed, and reducing development and maintenance costs (Mockus 2007; Lim 1994; Mohagheghi et al. 2004; Basili et al. 1996). For example, Sojer and Henkel (2010) surveyed 686 open source developers to investigate how they reuse code. Their findings show that more experienced developers reuse source code and 30% of the functionality of open source software (OSS) projects reuse existing components. Developers also reveal that they see code reuse as a quick way to start new projects. Similarly, Haefliger et al. (2008) conducted a study to empirically investigate the reuse in open source software, and the development practices of developers in OSS. They triangulated three sources of data (developer interviews, code inspections and mailing list data) of six OSS projects. Their results showed that developers used tools and relied on standards when reusing components. Mockus (2007) conducted an empirical study to identify large-scale reuse of open source libraries. Their study shows that more than 50% of source files include code from other OSS libraries. On the other hand, the practice of reusing source code has some challenging drawbacks including the effort and resource required to integrate reused code (Di Cosmo et al. 2011). Furthermore, a bug in the reused component could propagate to the target system (Dogguy et al. 2011). While our study corroborates some of these findings, the main goal is to define and empirically investigate the phenomenon of reusing trivial packages, in particular in JavaScript and Python applications.\n\n8.2 Studies of Software Ecosystems\n\nIn recent years, analyzing the characteristics of ecosystems in software engineering has gained momentum (Bavota et al. 2013; Bloemen et al. 2014; Manikas 2016; Decan et al. 2016). For example, in a recent study, Bogart et al. (2015) and Bogart et al. (2016) empirically studied three ecosystems, including npm, and found that developers struggle with changing versions as they might break dependent code. Wittern et al. (2016) investigated the evolution of the npm ecosystem in an extensive study that covers the dependence between npm packages, download metrics and the usage of npm packages in real applications. One of their main findings is that npm packages and updates of these packages are steadily growing. More than 80% of packages have at least one direct dependency.\nOther studies examined the size characteristics of packages in an ecosystem. German et al. (2013) studied the evolution of the statistical computing project GNU R, with the aim of analyzing the differences between code characteristics of core and user-contributed packages. They found that user-contributed packages are growing faster than core packages. Additionally, they reported that user-contributed packages are typically smaller than core packages in the R ecosystem. Kabbedijk and Jansen (2011) analyzed the Ruby ecosystem and found that many small and large projects are interconnected. Decan et al. (2018b) investigated the evolution of package dependency networks for seven packaging ecosystems. Their findings reveal that the studied packaging ecosystems grow over time in term of number of published and updated packages. 
They also observed that there is an increasing number of transitive dependencies for some packages.\n\nOther works investigate the challenges of using external packages of a software ecosystem including; identify conflicts between JavaScript package (Patra et al. 2018), examine how pull requests help developers to upgrade out-of-date dependencies in their applications (Mirhosseini and Parnin 2017), study the usage of repository badges in the npm ecosystem (Trockman et al. 2018), and the usage of dependency graph to discover hidden trend in an ecosystem (Kula et al. 2018).\n\nIn many ways, our study complements the previous work since, instead of focusing on all packages in an ecosystem, we specifically focus on trivial packages and we studied them in two different package management platforms npm and PyPI. Moreover, we examine the reasons developers use trivial package and what they view as their drawbacks. We study the reuse of trivial packages, which is a subset of general code reuse. Hence, we do expect there to be some overlap with prior work. Like many empirical studies, we confirm some of the prior findings, which is a contribution on its own (Hunter 2001; Seaman 1999). Moreover, our paper adds to the prior findings through, for example, our validation of the developers\u2019 assumptions. Lastly, we do believe our study fills a real gap since 65% of the participants said they wanted to know our study outcomes.\n\n9 Threats to Validity\n\nIn this section, we discuss the threats to the validity of our case study.\n\n9.1 Internal Validity\n\nInternal validity concerns factors that may have influenced our results such as our datasets collection process. To study the reasons for and drawback of using trivial packages, we surveyed developers. There is potential that our survey questions may have influenced the replies from the respondents. However, to minimize such influence, we made sure to ask for free-form responses and we publicly share our survey and all of our anonymized survey responses (Abdalkareem et al. 2019). Moreover, the way we asked the survey questions might have affected the response from our respondents, causing their responses to advocate or not advocate the use of trivial packages. To reduce this bias, we ensure participants\u2019 anonymity. Also, our study may be impacted by the fact that an overlap does not exist between the developer groups who participated in the two user studies (i.e., defining trivial packages and understanding developers\u2019 perception about the use of trivial packages). We find that the second survey served as a confirmation of the observations made by the first survey participants, however, given that these are two different populations, they may have reported on different observations.\nWe removed test code from our dataset to ensure that our analysis only considers production source code. We identified test code by searching for the term \u2018test\u2019 (and its variants e.g., \u2018TEST_code\u2019) in the file names and file paths. Even though this technique is widely accepted in the literature (Gousios et al. 2014; Tsay et al. 2014; Zhu et al. 2014), to confirm whether our technique is correct, i.e., files that have the term \u2018test\u2019 in their names and paths actually contain test code, we took a statistically significant sample of the packages to achieve a 95% confidence level and a 5% confidence interval and examined them manually. 
We found that all of the examined cases indeed contain test code.

In addition, to examine the well-tested perception for the PyPI trivial packages, the first two authors manually examined the source code of the trivial packages to classify whether they have test code written or not. To ensure the validity of our classification, we measured the classification agreement between the two authors. We found the classification agreement between the two authors to be excellent (Cohen’s Kappa value of 0.97).

9.2 Construct Validity

Construct validity considers the relationship between theory and observation, in case the measured variables do not measure the actual factors. To define trivial packages, we surveyed 12 JavaScript and 13 Python developers. However, we find that there was consensus on what is considered a trivial package. Although our analysis shows that packages with \( \leq 35 \) LOC and a complexity \( \leq 10 \) are trivial packages, we believe that other definitions are possible for trivial packages. That said, of the 125 survey participants that we emailed about using trivial packages, only 2 mentioned that a flagged package is not a trivial package (even though it fit our criteria). To us, this is a confirmation that our definition applies in the vast majority of cases, although clearly it is not perfect.

In addition, to determine what is considered to be a trivial package, we conducted an experiment with JavaScript and Python developers who are mostly students (undergraduate and graduate students) with some professional experience. While this may not represent professional developers per se (Sjoberg et al. 2002), prior work has shown that experiments with students provide the same results as experiments with professional developers in the software engineering domain (Salman et al. 2015; Höste et al. 2000).

To identify the JavaScript and Python applications that we examine in our study, we rely on the metadata provided by the GHTorrent dataset (Gousios et al. 2014). Thus, our selection of JavaScript and Python applications heavily depends on the correctness of the applications’ programming language listed in GHTorrent.

We use the LOC and cyclomatic complexity of the code to determine trivial packages. In some cases, these may not be the only measures that need to be considered to determine a trivial package. For example, some of the trivial packages have their own dependencies, which may need to be taken into consideration. Our experience tells us that most developers only look at the package itself and not at its dependencies when determining if it is trivial or not. That said, when we replicated this questionnaire with another set of participants from the Python language community, we found that developers seem to confirm our definition of trivial JavaScript/Python packages (Abdalkareem et al. 2019).

Based on our user study, we defined trivial npm packages as packages that have \( \leq 35 \) LOC and cyclomatic complexity \( \leq 10 \). However, one threat to this definition is that a cyclomatic complexity of 10 may be considered high for a package to be trivial. 
To examine this concern, we calculate the cyclomatic complexity of all the non-trivial packages in our dataset and found that on average non-trivial npm packages have a cyclomatic complexity of 803, which indicates\nthat 10 Cyclomatic complexity value in our definition is still significantly smaller compared to the one for non-trivial packages.\n\nTo study trivial packages in the PyPI package management platform, we were able to extract 63,912 packages. Collecting more packages may provide more details about trivial packages on the PyPI package management platform. Also, to identify the Python applications that use PyPI trivial packages, we use the snakefood tool (http://furius.ca/snakefood/) to extract the applications dependencies. Hence, we are limited by the accuracy of snakefood in extracting the used packages in Python applications.\n\nIn our study, to understand why developers use trivial packages, we conducted two user surveys with JavaScript and Python developers. These two surveys were performed on different dates, and as a consequence, may affect the outcome of the survey results. However, given that these two package management platforms are independent, we envision that the impact of this date shift is not significant.\n\nIn our study, to identify developers who used trivial packages in their applications, we use regular expressions to identify these packages. This process may flag the wrong package by the developers. To mitigate this threat, during our analysis, we make sure that we extract the right packages through several rounds of manual checking of the results. In addition, none of the developers that we contacted indicated that she/he does not use the identified packages, which serves as a slight confirmation that our methodology is not incorrect.\n\nIn our study on npm, we used npms to measure various quantitative metrics related to testing, community interest and download counts. Our measurements are only as accurate as npms, however, given that it is the main search tool for npm, we are confident in the npms metrics. We also use libraries.io to calculate the community interested and the usage count metrics for PyPI packages, and our measurements are as accurate as libraries.io. We resort to use the libraries.io data since it has been used on other prior work (e.g., Decan et al. 2018a, b). In addition, we use the dataset provided by Valiev et al. (2018) to measure the direct and indirect dependencies of the packages on PyPI.\n\nIn our analysis, we also use different R packages to perform our analysis, our analysis may be impacted by the accuracy of these used R packages. To mitigate this threat we make our dataset and used tools available online (Abdalkareem et al. 2019).\n\n9.3 External Validity\n\nExternal validity considers the generalization of our findings. All of our findings were derived from open source JavaScript applications and npm packages and its replication on Python and PyPI packages. Even though we believe that the two studied package management platforms are amongst the most commonly used ones, our findings may not generalize to other platforms or ecosystems. That said, historical evidence shows that examples of individual cases contributed significantly in areas such as physics, economics, social sciences and even software engineering (Flyvbjerg 2006). 
We believe that strong empirical evidence is built from both studies on individual cases and studies on large samples.

Our list of reasons for and drawbacks of using trivial packages is based on a survey of 88 JavaScript and 37 Python developers. Although this is a large number of developers, our results may not hold for all developers. A different sample of developers may result in a different list or ranking of advantages and disadvantages. To mitigate the risk due to this sampling, we contacted developers from different applications, and, as the responses show, most of them are experienced developers.

We do not distinguish between the domains of the studied packages, which may impact the findings. However, to help mitigate any bias, we analyzed more than 500,000 npm and 74,663 PyPI packages that cover a wide range of package domains. Lastly, our study is based on open source applications that are hosted on GitHub; therefore, our study may not generalize to other open source or commercial applications.

10 Conclusion

The use of trivial packages is an increasingly popular trend in software development (Abdalkareem et al. 2017; Abdalkareem 2017). Like any development practice, it has its proponents and opponents. The goal of our study is to extend our understanding of the use of trivial packages. We examine the prevalence, reasons, and drawbacks of using trivial packages in different package management platforms. Thus, we consider trivial packages in PyPI in addition to the previously studied npm (Abdalkareem et al. 2017).

Our results indicate that trivial packages are commonly and widely used in JavaScript and Python applications. We also find that while the majority of JavaScript developers in our study do not oppose the use of trivial packages, the majority of Python developers believe that using trivial packages could be harmful. Additionally, based on the developers' responses, developers from the two package management platforms stated that the main reason for using trivial packages is that they are considered to be well implemented and tested. They do cite the additional dependency overhead as a drawback of using these trivial packages. Our empirical study showed that considering trivial packages to be well tested is a misconception, since more than half of the studied trivial packages do not even have tests. However, these trivial packages seem to be 'deployment tested' and have Community interest and Download/Usage count values similar to non-trivial packages. In addition, we find that some of the trivial packages have their own dependencies. In our studied dataset, 18.4% of the npm and 2.9% of the PyPI trivial packages have more than 20 dependencies. Hence, developers should be careful about which trivial packages they use.

Based on our findings, we provide the following practical suggestions for software developers:

– Developers should not assume that trivial packages are well-tested and implemented, since we found that only 28.4% of npm and 49.2% of PyPI trivial packages have test code.
– Because trivial packages can have their own dependencies, developers should be aware that using these trivial packages would increase the dependency overhead of their applications.

Acknowledgments The authors are grateful to the many survey respondents who dedicated their valuable time to respond to our surveys.
Also, the authors would like to thank the anonymous reviewers and the editor for their thoughtful feedback and suggestions that help us improve our study.\n\nReferences\n\nAbate P, Di Cosmo R, Boender J, Zacchiroli S (2009) Strong dependencies between software components. In: Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, ESEM \u201909, IEEE Computer Society, pp 89\u201399\n\nAbdalkareem R (2017) Reasons and drawbacks of using trivial npm packages: The developers\u2019 perspective. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, ACM, pp 1062\u20131064\nAbdalkareem R, Nourry O, Wehaibi S, Mujahid S, Shihab E (2017) Why do developers use trivial packages? an empirical case study on npm. In: Proceedings of the 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE \u201917, ACM, pp 385\u2013395\n\nAbdalkareem R, Oda V, Mujahid S, Shihab E (2019) On the impact of using trivial packages: An empirical case study on npm and pypi. https://doi.org/10.5281/zenodo.3095009\n\nAbdalkareem R, Shihab E, Rilling J (2017) On code reuse from Stack Overflow: An exploratory study on Android apps. Inf Softw Technol 88(C):148\u2013158\n\nAbdalkareem R, Shihab E, Rilling J (2017) What do developers use the crowd for? a study using Stack Overflow. IEEE Softw 34(2):53\u201360\n\nBaltes S, Diehl S (2018) Usage and attribution of Stack Overflow code snippets in gitHub projects. Empirical Software Engineering\n\nBasili VR, Briand LC, Melo WL (1996) How reuse influences productivity in object-oriented systems. Commun ACM 39(10):104\u2013116\n\nBavota G, Canfora G, Penta MD, Oliveto R, Panichella S (2013) The evolution of project inter-dependencies in a software ecosystem: The case of Apache. In: Proceedings of the 2013 IEEE International Conference on Software Maintenance, ICSM \u201913, IEEE Computer Society, pp 280\u2013289\n\nBlais M snakefood: Python Dependency Graphs. http://furius.ca/snakefood/. (accessed on 09/23/2018)\n\nBloemen R, Amrit C, Kuhlmann S, Ord\u00f3\u00f1ez Matamoros G (2014) Gentoo package dependencies over time. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR \u201914, ACM, pp 404\u2013407\n\nBogart C, Kastner C, Herbsleb J (2015) When it breaks, it breaks: How ecosystem developers reason about the stability of dependencies. In: Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering Workshop, ASEW \u201915, IEEE Computer Society, pp 86\u201389\n\nBogart C, K\u00e4stner C, Herbsleb J, Thung F (2016) How to break an API: Cost negotiation and community values in three software ecosystems. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE \u201916, ACM, pp 109\u2013120\n\nBower (2012) Bower a package manager for the web. https://bower.io/. (accessed on 08/23/2016)\n\nCastelluccio M, An L, Khomh F (2019) An empirical study of patch uplift in rapid release development pipelines. Empir Softw Eng 24(5):3008\u20133044\n\nCohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20:37\u201346\n\nCruz A, Duarte A (2017) npms. https://npms.io/. (accessed on 02/20/2017)\n\nde Souza CRB, Redmiles DF (2008) An empirical study of software developers\u2019 management of dependencies and changes. 
In: Proceedings of the 30th International Conference on Software Engineering, ICSE \u201908, ACM, pp 241\u2013250\n\nDecan A, Mens T, Constantinou E (2018a) On the impact of security vulnerabilities in the npm package dependency network. In: International Conference on Mining Software Repositories\n\nDecan A, Mens T, Grosjean P (2018b) An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empirical Software Engineering\n\nDecan A, Mens T, Grosjean P et al (2016) When github meets CRAN: an analysis of inter-repository package dependency problems. In: Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering, volume 1 of SANER \u201916, IEEE, pp 493\u2013504\n\nDi Cosmo R, Di Ruscio D, Pelliccione P, Pierantonio A, Zacchiroli S (2011) Supporting software evolution in component-based FOSS systems. Sci Comput Program 76(12):1144\u20131160\n\nDogguy M, Glondu S, Le Gall S, Zacchiroli S (2011) Enforcing type-Safe linking using inter-package relationships. Studia Informatica Universalis 9(1):129\u2013157\n\nEbert C, Cain J (2016) Cyclomatic complexity. IEEE Softw 33(6):27\u201329\n\nFleiss JL, Cohen J (1973) The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 33:613\u2013619\n\nFlyvbjerg B (2006) Five misunderstandings about case-study research. Qual Inq 12(2):219\u2013245\n\nFuchs T (2016) What if we had a great standard library in JavaScript? \u2013 medium. https://medium.com/@thomasfuchs/what-if-we-had-a-great-standard-library-in-javascript-52692342ee3f.pw7d4cq8j. (accessed on 02/24/2017)\n\nGerman D, Adams B, Hassan A (2013) Programming language ecosystems: the evolution of R. In: Proceedings of the 17th European Conference on Software Maintenance and Reengineering, CSMR \u201913, IEEE, pp 243\u2013252\n\nGousios G, Vasilescu B, Serebrenik A, Zaidman A (2014) Lean ghtorrent: Github data on demand. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR \u201914, ACM, pp 384\u2013387\n\nGrissom RJ, Kim JJ (2005) Effect sizes for research: A broad practical approach. Lawrence Erlbaum Associates Publishers\nHaefliger S, Von Krogh G, Spaeth S (2008) Code reuse in open source software. Manag Sci 54(1):180\u2013193\n\nHaney D (2016) Npm & left-pad: Have we forgotten how to program? http://www.haneycodes.net/npm-left-pad-have-we-forgotten-how-to-program/. (accessed on 08/10/2016)\n\nHarris R (2015) Small modules: it\u2019s not quite that simple. https://medium.com/@Rich_Harris/small-modules-it-s-not-quite-that-simple-3ca532d65de4. (accessed on 08/24/2016)\n\nHemanth HM (2015) One-line node modules -issue#10- sindresorhus/ama. https://github.com/sindresorhus/ama/issues/10. (accessed on 08/10/2016)\n\nH\u00f6st M, Regnell B, Wohlin C (2000) Using students as subjects\u2014a comparative study of students and professionals in lead-time impact assessment. Empir Softw Eng 5(3):201\u2013214\n\nHunter JE (2001) The desperate need for replications. J Consum Res 28(1):149\u2013158\n\nInoue K, Sasaki Y, Xia P, Manabe Y (2012) Where does this code come from and where does it go? - integrated code history tracker for open source systems -. In: Proceedings of the 34th International Conference on Software Engineering, ICSE \u201912, IEEE Press, pp 331\u2013341\n\nKabbedijk J, Jansen S (2011) Steering insight: An exploration of the Ruby software ecosystem. 
In: Proceedings of the Second International Conference of Software Business, ICSOB \u201911, Springer, pp 44\u201355\n\nKalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014) The promises and perils of mining gitHub. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR \u201914, ACM, pp 92\u2013101\n\nKula RG, Roover CD, German DM, Ishio T, Inoue K (2018) A generalized model for visualizing library popularity, adoption, and diffusion within a software ecosystem. In: 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering, volume 00 of SANER \u201918, pp 288\u2013299\n\nLibraries.io. Libraries.io - the open source discovery service. https://libraries.io/. (accessed on 05/20/2018)\n\nLibraries.io (2017) Pypi. https://libraries.io/pypi. (accessed on 03/08/2017)\n\nLim WC (1994) Effects of reuse on quality, productivity, and economics. IEEE Softw 11(5):23\u201330\n\nMacdonald F (2016) A programmer almost broke the Internet last week by deleting 11 lines of code. http://www.sciencealert.com/how-a-programmer-almost-broke-the-internet-by-deleting-11-lines-of-code. (accessed on 08/24/2016)\n\nManikas K (2016) Revisiting software ecosystems research: a longitudinal literature study. J Syst Softw 117:84\u2013103\n\nMcCamant S, Ernst MD (2003) Predicting problems caused by component upgrades. In: Proceedings of the 9th European Software Engineering Conference Held Jointly with 11th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ESEC/FSE \u201903, ACM, pp 287\u2013296\n\nMirhosseini S, Parnin C (2017) Can automated pull requests encourage software developers to upgrade out-of-date dependencies? In: Proceedings of the 32Nd IEEE/ACM International Conference on Automated Software Engineering, ASE \u201917, IEEE Press, pp 84\u201394\n\nMockus A (2007) Large-scale code reuse in open source software. In: Proceedings of the First International Workshop on Emerging Trends in FLOSS Research and Development, FLOSS \u201907, IEEE Computer Society, p 7\u2013\n\nMohagheghi P, Conradi R, Killi OM, Schwarz H (2004) An empirical study of software reuse vs. defect-density and stability. In: Proceedings of the 26th International Conference on Software Engineering, ICSE \u201904, IEEE Computer Society, pp 282\u2013292\n\nnpm (2016) What is npm? \u2014 node package managment documentation. https://docs.npmjs.com/getting-started/what-is-npm. (accessed on 08/14/2016)\n\nnpm Blog T (2016) The npm blog changes to npm\u2019s unpublish policy. http://blog.npmjs.org/post/141905368000/changes-to--unpublish-policy. (accessed on 08/11/2016)\n\nOrsila H, Geldenhuys J, Ruokonen A, Hammouda I (2008) Update propagation practices in highly reusable open source components. In: Proceedings of the 4th IFIP WG 2.13 International Conference on Open Source Systems, OSS \u201908, pp 159\u2013170\n\nPatra J, Dixit PN, M. Pradel (2018) Conflictjs: Finding and understanding conflicts between javaScript libraries. In: Proceedings of the 40th International Conference on Software Engineering, ICSE \u201918, ACM, pp 741\u2013751\n\nPython Python testing tools taxonomy - python wiki. https://wiki.python.org/moin/PythonTestingToolsTaxonomy. (accessed on 05/16/2018)\n\nRahman MT, Rigby PC, Shihab E (2019) The modular and feature toggle architectures of google chrome. Empir Softw Eng 24(2):826\u2013853\n\nRay B, Posnett D, Filkov V, Devanbu P (2014) A large scale study of programming languages and code quality in gitHub. 
In: Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE \u201914, ACM, pp 155\u2013165\nSalman I, Misirli AT, Juristo N (2015) Are students representatives of professionals in software engineering experiments? In: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1 of ICSE \u201915, . IEEE, pp 666\u2013676\n\nSciTools Understand tool. https://scitools.com/. (accessed on 04/16/2019)\n\nSeaman CB (1999) Qualitative methods in empirical studies of software engineering. IEEE Trans Softw Eng 25(4):557\u2013572\n\nSinger J, Sim SE, Lethbridge TC (2008) Software engineering data collection for field studies. In: Guide to Advanced Empirical Software Engineering. Springer, london, pp 9\u201334\n\nSjoberg DIK, Anda B, Arisholm E, Dyba T, Jorgensen M, Karahasanovic A, Koren EF, Vokac M (2002) Conducting realistic experiments in software engineering. In: Proceedings International Symposium on Empirical Software Engineering, IEEE, pp 17\u201326\n\nSojer M, Henkel J (2010) Code reuse in open source software development Quantitative evidence, drivers, and impediments. J Assoc Inf Syst 11(12):868\u2013901\n\nTrockman A, Zhou S, K\u00e4stner C, Vasilescu B (2018) Adding sparkle to social coding: an empirical study of repository badges in the npm ecosystem. In: Proceedings of the International Conference on Software Engineering, ICSE \u201918, ACM\n\nTsay J, Dabbish L, Herbsleb J (2014) Influence of social and technical factors for evaluating contribution in gitHub. In: Proceedings of the 36th International Conference on Software Engineering, ICSE \u201914, ACM, pp 356\u2013366\n\nValiev M, Vasilescu B, Herbsleb J (2018) Ecosystem-level determinants of sustained activity in open-source projects A case study of the pyPi ecosystem. In: Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE \u201918. ACM\n\nVasilescu B, Yu Y, Wang H, Devanbu P, Filkov V (2015) Quality and productivity outcomes relating to continuous integration in gitHub. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE \u201915, ACM, pp 805\u2013816\n\nWilliams C (2016) How one developer just broke Node, Babel and thousands of projects in 11 lines of JavaScript. http://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos. (accessed on 08/24/2016)\n\nWittern E, Suter P, Rajagopalan S (2016) A look at the dynamics of the javaScript package ecosystem. In: Proceedings of the 13th International Conference on Mining Software Repositories, MSR \u201916, ACM, pp 351\u2013361\n\nWu Y, Wang S, Bezemer C-P, Inoue K (2018) How do developers utilize source code from Stack Overflow? Empirical Software Engineering\n\nZambonini D (2011) A Practical Guide to Web App Success, chapter 20. Five Simple Steps. (accessed on 02/23/2017). In: Gregory O (ed)\n\nZhu J, Zhou M, Mockus A (2014) Patterns of folder use and project popularity: A case study of gitHub repositories. In: Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM \u201914, ACM, pp 30:1\u201330:4\n\nPublisher\u2019s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.\n\nRabe Abdalkareem is a postdoctoral fellow in the Software Analysis and Intelligence Lab (SAIL) at Queen\u2019s University, Canada. He received his Ph.D. 
in Computer Science and Software Engineering from Concordia University, Montreal, Canada. His research investigates how the adoption of crowdsourced knowledge affects software development and maintenance. Abdalkareem received his masters in applied Computer Science from Concordia University. His work has been published at premier venues such as FSE, ICSME and Mobile-Soft, as well as in major journals such as TSE, IEEE Software, EMSE and IST. Contact him at rab_abdu@encs.concordia.ca; http://users.encs.concordia.ca/rababdu.\nVinicius Oda is a MASc. student in the Department of Computer Science and Software Engineering at Concordia University, Montreal. His research interests include Software Engineering, Software Ecosystems, and Mining Software Repositories, among others.\n\nSuhaib Mujahid is a Ph.D. student in the Department of Computer Science and Software Engineering at Concordia University. He received his masters in Software Engineering from Concordia University (Canada) in 2017. He obtained his Bachelors in Information Systems at Palestine Polytechnic University. His research interests include wearable applications, software quality assurance, mining software repositories and empirical software engineering. You can find more about him at http://users.encs.concordia.ca/smujahi.\n\nEmad Shihab is an Associate Professor and Concordia University Research Chair in the Department of Computer Science and Software Engineering at Concordia University. His research interests are in Software Engineering, Mining Software Repositories, and Software Analytics. His work has been published in some of the most prestigious SE venues, including ICSE, ESEC/FSE, MSR, ICSME, EMSE, and TSE. He serves on the steering committees of PROMISE, SANER and MSR, three of the leading conferences in the software analytics areas. His work has been done in collaboration with and adopted by some of the biggest software companies, such as Microsoft, Avaya, BlackBerry, Ericsson and National Bank. He is a senior member of the IEEE. 
His homepage is: http://das.encs.concordia.ca.

Affiliations

Rabe Abdalkareem¹ · Vinicius Oda¹ · Suhaib Mujahid¹ · Emad Shihab¹

Vinicius Oda
v_oda@encs.concordia.ca

Suhaib Mujahid
s_mujahi@encs.concordia.ca

Emad Shihab
eshihab@encs.concordia.ca

¹ Data-Driven Analysis of Software (DAS) Lab, Department of Computer Science and Software Engineering, Concordia University, Montréal, Canada
{"id": "e739998161c907734c4fdd7848e9759a630933d8", "text": "Reuse and maintenance practices among divergent forks in three software ecosystems\n\nJohn Businge\\textsuperscript{1,2} \u00b7 Moses Openja\\textsuperscript{3} \u00b7 Sarah Nadi\\textsuperscript{4} \u00b7 Thorsten Berger\\textsuperscript{5,6}\n\nAccepted: 25 October 2021 / Published online: 4 March 2022\n\u00a9 The Author(s) 2022\n\nAbstract\nWith the rise of social coding platforms that rely on distributed version control systems, software reuse is also on the rise. Many software developers leverage this reuse by creating variants through forking, to account for different customer needs, markets, or environments. Forked variants then form a so-called software family; they share a common code base and are maintained in parallel by same or different developers. As such, software families can easily arise within software ecosystems, which are large collections of interdependent software components maintained by communities of collaborating contributors. However, little is known about the existence and characteristics of such families within ecosystems, especially about their maintenance practices. Improving our empirical understanding of such families will help build better tools for maintaining and evolving such families. We empirically explore maintenance practices in such fork-based software families within ecosystems of open-source software. Our focus is on three of the largest software ecosystems existence today: Android, .NET, and JavaScript. We identify and analyze software families that are maintained together and that exist both on the official distribution platform (Google play, nuget, and npm) as well as on GitHub, allowing us to analyze reuse practices in depth. We mine and identify 38 software families, 526 software families, and 8,837 software families from the ecosystems of Android, .NET, and JavaScript, to study their characteristics and code-propagation practices. We provide scripts for analyzing code integration within our families. Interestingly, our results show that there is little code integration across the studied software families from the three ecosystems. Our studied families also show that techniques of direct integration using git outside of GitHub is more commonly used than GitHub pull requests. Overall, we hope to raise awareness about the existence of software families within larger ecosystems of software, calling for further research and better tools support to effectively maintain and evolve them.\n\nKeywords Clone-and-own \u00b7 Change propagation \u00b7 Variant synchronisation \u00b7 Empirical study \u00b7 Variant developers \u00b7 Version control systems \u00b7 Pull requests \u00b7 Cherry-picking changes \u00b7 Rebasing changes \u00b7 Squashing changes \u00b7 Software product lines \u00b7 Variants\n\nCommunicated by: Federica Sarro\n\n\\textsuperscript{1} John Businge\njohnxu21@gmail.com\n\nExtended author information available on the last page of the article.\n1 Introduction\n\nThe increased popularity of social-coding platforms such as GitHub made forking a powerful mechanism to easily clone software repositories for creating new software. A developer may fork a mainline repository into a new forked repository, often transforming governance over the latter to a new developer, while preserving the full revision history and establishing traceability information. 
While forking allows isolated development and independent evolution of repositories, the traceability allows comparing the revision histories, for instance, to determine whether one repository is ahead of the other (i.e., contains changes not yet integrated in the other). It also allows easier commit propagation across the repositories.

Many studies on forking exist, often focusing on the reasons and outcomes (Nyman et al. 2012; Robles and González-Barahona 2012; Viseur 2012; Nyman and Lindman 2013; Nyman and Mikkonen 2011; Zhou et al. 2018; Zhou et al. 2019; 2020) or on the community dynamics as influenced by forking (Gamalielsson and Lundell 2014). The community typically distinguishes between two kinds of forks (Zhou et al. 2020): social forks that are created for isolated development with the goal of contributing back to the mainline, and divergent forks that are created for splitting off a new development branch, often to steer the development in another direction without intending to contribute back, while leveraging the mainline project that defines or adheres to some standards (Sung et al. 2020). Divergent forks are more relevant for supporting large-scale software reuse—the focus of this paper.

Studies on divergent forks usually rely on general heuristics to identify as many forks as possible, without systematically verifying that these are indeed divergent forks. Additionally, when studying code propagation techniques, existing studies do not consider the intricacies of git to identify the possible types of code propagation (e.g., offline git rebasing without using GitHub at all), but focus only on pull requests. To address the first challenge of identifying divergent forks, we use the insight that there are particular ecosystems that have a systematic way of publishing "members" of the ecosystem. For example, most Android apps are published on the Google Play store. Similarly, most Eclipse plug-ins are distributed on the Eclipse marketplace. The advantage of such ecosystems is that each member has a unique ID that identifies it. Thus, given an open-source GitHub repository and its fork, we can verify whether the fork is actually an independent version of the original mainline (which is a core criterion of a divergent fork) by checking that both the mainline and the fork are listed as separate entries in the corresponding distribution platform. To address the second challenge of considering the git intricacies, we design a technique that identifies the majority of code propagation techniques on Git and GitHub by leveraging all commit metadata. Inspired by the notion of software families (a.k.a., program families (Parnas 1976; Czarnecki 2005; Dubinsky et al. 2013; Apel et al. 2013; Krueger and Berger 2020b; Stanculescu et al. 2015; Berger et al. 2020))—portfolios of managed and similar software systems in an application domain—we use the term software family, or family for short, to refer to a mainline repository and its corresponding divergent forks. We refer to each family member as a variant.

We present a large-scale empirical study on reuse and maintenance practices via code propagation among software families in software ecosystems. We take the above considerations into account and study three large-scale ecosystems in different technological spaces: Android, JavaScript, and .NET. Android is one of the largest and most successful software ecosystems with substantial software reuse (Mojica et al. 2014; Li et al. 2016; Sattler et al.
2018; Berger et al. 2014). The JavaScript ecosystem distributes its packages through npm, which is by far the largest package manager with over 1.82M package distributions.\footnote{As seen on Libraries.io by June 2021} The .NET ecosystem has a package management system, nuget, that is moderately large with over 261K packages.\footnote{As seen on Libraries.io by June 2021} As such, our three selected ecosystems vary in their nature (apps versus packages), their programming languages (Java, JavaScript, and C#), and their sizes (in terms of their distribution platforms).

Our study addresses two main research questions:

**RQ1** *What are the characteristics of software families in our ecosystems?*

We investigate general characteristics of the families and their variants, including the number of variants per family and the divergence of application domains, developer ownership, and variant popularities within the families. We also determine the frequencies of variant maintenance, looking at release numbers. This allows putting the studied maintenance and co-evolution practices into context.

**RQ2** *How are software families maintained and co-evolved in our ecosystems?*

To determine management practices, we investigate how code is propagated between the mainline and its divergent forks in the family. For example, are pull requests used as the main propagation technique? Is code propagated only from the mainline to the forks, or is there propagation in the other direction, too? We study the code propagation mechanisms used as well as the kinds of changes being propagated.

To the best of our knowledge, our work is the first to provide a large-scale in-depth study of code-propagation practices in divergent forks. Understanding the code-propagation strategies exercised by developers can help in building better tool support for software customization and code reuse. We analyze pairs of mainline and fork open source projects whose package releases are available in the package distribution platforms of the three ecosystems: Android comprising 38 software families, .NET comprising 526 software families, and JavaScript comprising 8,837 software families.

Our results show that the majority (82%) of forks we study are owned by developers different from those of the mainline within a family. This distinction of ownership gives us confidence that we are studying real divergent forks. Interestingly though, we find little code propagation across all the mainline–fork pairs in the three ecosystems we studied. The most used code propagation technique is *git merge/rebase*, which is used in 33% of Android mainline-fork pairs, 11% of JavaScript pairs, and 18% of .NET pairs. We find that cherry picking is less frequently used, with only 9%, 0.9%, and 2.5% of Android, JavaScript, and .NET pairs using it, respectively. Among the three pull request integration mechanisms we studied (merge, rebase, and squash), the most used pull request integration mechanism is the merge option in the direction of fork → mainline, where 2.4%, 7%, and 11% of the pairs in Android, JavaScript, and .NET use this strategy. We find that integrating commits using squashed or rebased pull requests is rare in all three ecosystems. Overall, we find that when code propagation occurs, it seems that fork developers perform this propagation directly through *git* and outside of GitHub's built-in pull request mechanism.
This observation implies that simply relying on pull requests to understand code propagation practices in divergent forks is not enough.

In summary, this work makes the following contributions:

- We propose leveraging the main distribution platforms of three ecosystems to precisely identify divergent forks. We devise a technique for identifying families in these ecosystems by using data both from GitHub and the respective distribution platform.
- In contrast to previous studies on code propagation strategies that either focused only on pull requests or on directly comparing commit IDs, we are the first to study code propagation while considering pull requests with the options of squash / rebase as well as git rebased and cherry-picked commits.
- We analyze the prevalence of code propagation within software families as well as the types of propagation strategies used.
- We synthesize implications of our results for code reuse tools.
- We provide an online appendix (2020) containing our datasets, intermediate results, and the scripts to trace code propagation between any mainline-fork pair.

An earlier version of this work appeared as a conference paper (Businge et al. 2018). It focused on analyzing code propagation at the commit level within only the Android ecosystem. It also provided preliminary insights on the reasons why different app variants exist. This article extends the conference paper as follows. First, we extend our analysis with two more ecosystems of moderate to large scale. Second, we substantially improve our identification of code integration methods by not focusing solely on pull requests or direct comparison of commit IDs. Instead, we are the first to consider most types of code propagation techniques, including rebasing, squashing, and cherry-picking commits. Third, we contribute a toolchain for analyzing code propagation between any mainline–fork pair. Fourth, we provide more discussion of the implications of our results.

Parts of RQ1 for the JavaScript ecosystem have been previously presented as a workshop paper (Businge et al. 2020). In this article, our additional contributions for RQ1 for the JavaScript ecosystem are the following. First, we refine the JavaScript dataset by ensuring that the mainline-fork pairs exist both on GitHub and the npm package manager. To this end, we eliminate a total of 2,456 mainline-fork pairs where either the mainline or the fork was deleted from GitHub, but their package releases still existed on the npm package manager. Second, we provide a more detailed description of how the dataset was collected and provide the full refined dataset in the replication package. Third, we create an additional dataset of new families from the .NET ecosystem. Fourth, in addition to the new characteristic of variant ownership as well as more illustrative graph comparisons, we discuss the characteristics of the mainline–fork pairs across all three ecosystems.

2 Background on Code Propagation Strategies

We now discuss the mechanisms offered by GitHub and similar social-coding platforms to propagate code among different repositories. We describe characteristics of these mechanisms and the kind of metadata they generate, which an automated identification technique can potentially rely on.

While a mainline and a forked repository are under no obligation to synchronize any changes, developers commonly propagate their code changes (e.g., new features or bug fixes) among repositories via commit integration (Jiang et al. 2017; Openja et al. 2020).
For tracing such propagation, however, the metadata provided by GitHub is not always reliable. For instance, Kalliamvakou et al. (2014) and Kononenko et al. (2018) found a large number of pull requests appearing as not merged while they were actually merged. The authors find that it is not uncommon for destination repositories to resolve pull requests outside GitHub. This is why our work considers both commit integration through GitHub and commit integration directly using git, but outside GitHub.

Table 1 Changes of commit metadata during the different kinds of code propagation with GitHub or Git facilities

| Metadata changed | Merge (PR) | Squash (PR) | Rebase (PR) | Cherry-pick (Git) | Merge (Git) | Rebase (Git) |
|------------------|------------|-------------|-------------|-------------------|-------------|--------------|
| Commit ID        | No         | Yes         | Yes         | Yes               | No          | No           |
| Author Name      | No         | Yes         | No          | No                | No          | No           |
| Author Date      | No         | Yes         | No          | No                | No          | No           |
| Committer Name   | No         | Yes/No      | Yes/No      | Yes/No            | No          | No           |
| Committer Date   | No         | Yes         | Yes         | Yes               | No          | No           |
| Commit Message   | No         | Yes         | No          | No                | No          | No           |
| File details     | No         | No          | No          | No                | No          | No           |

Yes: metadata change
No: no change of metadata

In the following, we describe code propagation using GitHub and git facilities. Table 1 provides details on the relationship between commits across forked repositories based on the respective code propagation technique used. To collect the information in this table, we read the official references (Vandehey 2019)\(^2\)\(^3\) and online resources\(^4\) as well as created toy repositories to mimic the various integration scenarios in order to verify this information. We use these insights for creating our code propagation traceability technique described in Section 3.3.

\(^2\)https://www.atlassian.com/git/tutorials/merging-vs-rebasing
\(^3\)https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-request-merges
\(^4\)https://cloudfour.com/thinks/squashing-your-pull-requests/

2.1 Propagation with GitHub Facilities

A pull request has a head ref, which is the reference for the source repository and the branch a developer wants to pull commits from; we refer to it as the source branch. A pull request also has a base ref, which is the reference for the destination repository into which the pulled commits are integrated; we refer to it as the destination branch for clarity. The source and destination branches may belong to the same repository or to different repositories. When studying code propagation in a software family, we are mainly interested in pull requests from one source repository in the family to another destination repository in the same family.

Once a pull request is submitted on GitHub, a developer can use its user interface to integrate the commits in the pull request into the destination branch using one of these three options: (i) merge the pull request commits, (ii) rebase the pull request commits, and (iii) squash the pull request commits.

- **Merge pull request commits** is the default. When the developer chooses this option, the commit history in the destination branch will be retained exactly as it is. As can be seen from Table 1, the metadata of the integrated commits from the source branch remain
However, a new merge commit will be created in the destination branch to \u201ctie together\u201d the histories of both branches (GitHub 2020).\n\n- **Rebase and merge pull request commits**: When the integrator selects the Rebase and merge option on a pull request on GitHub, all commits from the source branch are replayed onto the destination branch and integrated without a merge commit. From Table 1, we can see that using this integration technique, the commit metadata between source and destination preserves the author name, author date, and commit message but alters the commit ID, committer name, and committer date. The committer name becomes the name of the developer from the destination repository who rebased and merged the pull request. Note that if the developer who submitted the pull request is coincidentally the same as the developer who integrates it (e.g., because the developer works on both repositories), then the committer name will remain the same (GitHub 2020).\n\n- **Squash and merge pull request commits**: When the integrator selects the Squash and merge option on a pull request on GitHub, the pull request\u2019s commits are squashed into a single commit. Instead of seeing all of a contributor\u2019s commits from the source branch, the commits are squashed into one commit and included in the commit history of the destination branch. Apart from the file details, all other commit meta data changes. The committer name changes unless, similar to above, the original committer and the developer merging the pull request are the same (GitHub 2020).\n\n### 2.2 Propagation with Git Facilities (Cherry Pick, Merge, and Rebase Commits)\n\nA developer may also not rely on the GitHub user interface and instead choose to integrate commits from a source branch into a destination branch outside GitHub using one of the git integration commands. The integrator has to first locally fetch commits from the source branch (for example mainline) that contains the commits they wish to integrate into their branch. They then perform the integration locally using one of four options outlined below ((i) git merge, (ii) git rebase, (iii) git cherry-pick, and (iv) other Git commands that rewrite commit history) and afterwards, push the changes to their corresponding GitHub repository.5\n\n- **Git cherry-pick commits**: Cherry picking is the act of picking a commit from one branch and integrating it into another branch. Commit cherry picking can, for example, be useful if a mainline developer creates a commit to patch a pre-existing bug. If the fork developer cares only about this bug patch and not other changes in the mainline, then they can cherry pick this single commit and integrate it into their fork. As shown in Table 1, the author name, author date, commit message, and file details of the cherry picked commit remain the same in the destination branch. The commit ID, committer name, and committer date however do change. 
Note that the committer name may remain the same if the integrator is the same developer who performed the original commit in the source branch.

- **Git merge commits**: Like in the pull request merge, git merge also preserves all the commit metadata and creates an extraneous new merge commit in the destination branch that ties together the histories of both branches.

- **Git rebase commits**: Rebasing is an act of moving commits from their current location (following an older commit) to a new head (newest commit) of their branch (Chacon and Straub 2014b). Git rebase deviates slightly from rebasing pull requests on GitHub as it does not change the committer information. To better understand git rebase, let us explain it with an illustration based on the experiments we carried out. On the left-hand side of Fig. 1, we have a mainline repository and a fork repository where each repository made updates to the code through commits C3 and C4 in the mainline and commits F1 and F2 in the fork. The fork developer observes that the new updates in the mainline are interesting and decides to integrate them using rebasing. After rebasing, the commit history will look like the right side of Fig. 1. Notice that the IDs and the order of the integrated commits C3 and C4 in the fork branch are unchanged. However, the IDs of commits F1 and F2 change to F1' and F2'. In this case, Git rebase is like the fork developer saying "Hey, I know I started this branch last week, but other people made changes in the meantime. I don't want to deal with their changes coming after mine and maybe conflicting, so can you pretend that I made [my changes] today?" (Vandehey 2019).

- **Other Git commands that rewrite commit history**: Git has a number of other tools that rewrite commit history, including changing commit messages, commit order, or splitting commits (Chacon and Straub 2014a). These commands include `git commit --amend`, `git rebase -i HEAD~N`, and `git merge --squash`, among others. Most of these commands significantly change the history and the metadata of commits. If the integrator uses any of these commands in the destination repository, then there is no straightforward way to match the integrated commits across the two repositories (Chacon and Straub 2014a).

\(^5\)https://www.atlassian.com/git/tutorials/merging-vs-rebasing

### 3 Methodology

Our goal is to improve the empirical understanding of maintenance practices, specifically code propagation in software families. We identify and analyze software families by using data from both GitHub and the distribution platforms of the three ecosystems.

#### 3.1 Identifying Software Families

Given the different nature of our studied ecosystems in terms of what information each distribution platform stores and how this information is accessed, we employ different techniques to identify Android families versus JavaScript and .NET families. Figure 2 shows an overview of this process. We extract families in the Android ecosystem from GitHub and Google Play, while the families in .NET and JavaScript are extracted from Libraries.io.\(^6\)

\(^6\)https://libraries.io/

3.1.1 Identifying Android Families

We are interested in identifying families of real Android apps that are evidently used by end users. Taking all GitHub repositories with Android apps into account would also include toy apps or course assignments. To this end, we identify source repositories of apps that also exist on Google Play.
We mainly match GitHub repositories and Google Play apps via their unique identifier—the package name contained in the app manifest file (AndroidManifest.xml). Such manifest files also declare the app's components, necessary permissions, and required hardware and Android version. As such, each Android app in a software family must have a unique package name, which excludes any forked repositories where the package name was not modified. More specifically, we identify Android families using a relatively conservative filtering approach as follows.

1. Using GitHub's REST API v3, we identify 79,338 mainline repositories matching the following criteria: (1) is not a fork; (2) the repository contains the word "Android" in the name/description/readme; (3) has been forked at least twice; (4) was created before 01/07/2019 (we mined on 14/12/2019, so we used the date 01/07/2019 to obtain repositories that have some history); (5) has an AndroidManifest.xml file; (6) has a description or readme.md file; and (7) has a number of forks $\geq 2$ to reduce the chance of finding student assignments (Munaiah et al. 2017).

2. To ensure that we are collecting real-world apps, we check if the identified mainline repositories exist on Google Play. From each repository's AndroidManifest.xml file, we extract the app's package name and check its existence on Google Play. In total, we find 7,423 mainline repositories representing an actual Google Play app (Businge et al. 2017).

3. We filter out duplicate mainline repositories containing `AndroidManifest.xml` files with the same package name. Such duplicates easily arise when an app's source code is copied without forking. Since package names are unique on Google Play, only one of these duplicate repositories can actually correspond to the Google Play app. We manually select one repository from these duplicates by considering repository popularity (number of forks and stars on GitHub), repository and app descriptions on both GitHub and Google Play, as well as the developer name on GitHub and Google Play. In some cases, the Google Play app description conveniently linked to the GitHub repository. As a result of this step, we discard 1,232 repositories and are left with 6,191 mainline repositories.

4. To ensure that we study repositories with enough development history, we filter out mainlines with fewer than six commits in their lifetime, according to the median number of commits in GitHub projects found by prior work (Kalliamvakou et al. 2014). This leaves us with 4,337 mainline repositories.

5. We filter out mainline repositories without any active forks, i.e., whose forks have no commit after the forking date and were probably abandoned. This leaves us with 1,166 mainline repositories, which have a total of 12,025 active forks altogether.

6. We remove forks that have the same package name as their mainline. If no forks remain for a given mainline, we also remove this mainline. For the forks with different package names than their corresponding mainline, we check the existence of the fork's package name on Google Play in order to ensure that the fork is also a real (and different) Android app. This leaves us with 69 app families comprising 95 forks.

7. Finally, by manual inspection, we filter out forked repositories whose app package name points to a Google Play app that is not the correct app.
This analysis is based on the observation that, sometimes, fork developers copy code including the `AndroidManifest.xml` from another app without changing the package name. This practice results in the forked app's package name pointing to an app that exists on Google Play, but that is not the one hosted in the GitHub repository. We inspect the `Readme.md` and unique commit messages in the GitHub repository and the respective Google Play description page. Eliminating all mismatched apps leaves a total of 38 app families comprising 54 forked apps—our final dataset to answer the research questions.

3.1.2 Identifying JavaScript and .NET Families

A family in the JavaScript and .NET ecosystems comprises packages of libraries or applications written in the respective language. Similar to the Android ecosystem, we only consider packages that exist as source-code repositories on GitHub and on the ecosystem's main distribution channels: `npm` and `nuget`. The metadata of a package release on the package managers of `npm` or `nuget` is similar. On both package managers, a package's metadata includes: the source repository of the package (GitHub, GitLab, BitBucket), the number of dependent projects/packages, the number of dependencies, the number of package releases, and the package contributors. Fortunately, most of the data of 37 package managers for different ecosystems can be found in one central location, `Libraries.io`, which is a platform that periodically collects all data from different package managers. In addition to the metadata for a specific package on a given package manager, Libraries.io also extends the package metadata with more information from GitHub. For example, it stores a Fork boolean field, which indicates whether the corresponding repository of a package is a fork. Such a field can help us identify forked repositories that have published their packages. Note that this is different from the Android ecosystem, where such explicit traceability does not exist, which is why we first mine repositories from GitHub and then keep only those that are published on Google Play. In contrast, for .NET and JavaScript, we mine the families directly from Libraries.io. We extract the families from the latest Libraries.io data dump, release 1.6.0, which was released on January 12, 2020. The meta-model for the data in the Libraries.io data dump can be found online.7 We extract .NET and JavaScript families from Libraries.io with the following steps (a sketch of this extraction is shown after the list):

1. Using the package's field Platform, we keep only the packages that are distributed on the nuget and npm package managers.
2. Next, we use the field Fork (boolean) to identify repositories that are forks, and use the field Fork Source Name with Owner to identify the fork repository name as well as the parent repository (mainline). We extract all fork repositories that map to published packages on nuget and npm.
3. Next, we merge the sets of packages from Step 1 and Step 2 to identify only packages that form mainline–fork pairs (i.e., where both the fork repository and its corresponding mainline from the set in Step 2 have their packages present in the set in Step 1). Using the GitHub API, we then verify that the mainline is indeed the parent of the divergent fork and that both still exist on GitHub, so as to eliminate wrong pairs (e.g., those that have been deleted from GitHub). From the .NET ecosystem, we identify a total of 526 software families having a total of 590 mainline–fork pairs. From the JavaScript ecosystem, we identify a total of 8,837 software families having a total of 10,357 mainline–fork pairs. Similar to Android families, a family in .NET and JavaScript contains at least one mainline and one or more variant forks.

7 https://libraries.io/data
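To make these steps concrete, the following is a minimal sketch of how mainline–fork pairs could be extracted from the Libraries.io data dump with pandas. The file name and column names are assumptions based on the description above, not the exact schema of the dump used in our study.

```python
# Illustrative sketch of Steps 1-3 above against the Libraries.io data dump.
# The CSV file name and column names ("Platform", "Fork", "Repository Name
# with Owner", "Fork Source Name with Owner") are assumptions, not the exact
# schema used in the study.
import pandas as pd

projects = pd.read_csv("projects_with_repository_fields.csv", low_memory=False)

# Step 1: keep only packages distributed on npm or nuget.
packages = projects[projects["Platform"].isin(["NPM", "NuGet"])]

# Step 2: packages whose repository is a fork, together with their mainline repository.
forks = packages[packages["Fork"] == True]

# Step 3: keep fork packages whose mainline repository also publishes a package.
mainline_repos = set(packages["Repository Name with Owner"].dropna())
pairs = forks[forks["Fork Source Name with Owner"].isin(mainline_repos)]

# Each remaining row is one candidate mainline-fork pair; whether both
# repositories still exist would then be verified through the GitHub API.
print(len(pairs), "candidate mainline-fork pairs")
```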
3.2 Identifying Family Characteristics (RQ1)

We now describe how we identify characteristics of the identified families and their variants (i.e., mainlines and forks) for our three ecosystems.

We define and calculate various metrics as follows. Note that, given the different nature of these ecosystems and the type of information available for each, some metrics are specific to only some of the ecosystems. For example, FamilySize is a metric we can calculate for all variants in all three ecosystems. On the other hand, given the difference in nature of Android variants and JavaScript/.NET packages, we need to calculate variant popularity differently across the ecosystems (downloads and reviews versus dependents and dependencies).

In the following, we discuss the goal of each metric and how we calculate it. Overall, we look at metrics that fall into general characteristics of variants, variant maintenance activity, variant ownership, and variant popularity. For repositories in the Android ecosystem, we extract the metrics from GitHub and the Google Play store. For repositories in the .NET and JavaScript ecosystems, we extract the metrics from GitHub and Libraries.io.

Table 3 in Section 4 summarizes all metrics (and provides their values).

3.2.1 General Characteristics

**Family Size** We record the number of variants (metric *FamilySize* in Table 3) for all families in the three ecosystems. Note that a family with *FamilySize* = 2 has one mainline and one fork, while a family with *FamilySize* = 3 has one mainline and two forks.

**Variant Package Dependencies** Software ecosystems provide a huge bazaar of software that can be reused through explicit package dependencies (Decan et al. 2019). Since a divergent fork inherits functionality from the mainline and may also continuously synchronize with the mainline to acquire new changes, one would expect that the number of package dependencies for a mainline and its fork would be the same. However, it would be interesting to see cases where they are not the same. In this context, for example, if the fork has more dependencies, it could mean that the fork is implementing new features that are not in the mainline. We extract the number of dependencies from Libraries.io. For Android, we extracted the dependencies from the apps' Gradle files on GitHub.

**Android variant categories** Using the variant's metadata available on Google Play, we also determine its variant category (e.g., Business, Finance, Productivity) and extract its description. We also record whether the variants are listed under the same category on Google Play, which helps us understand the nature of the variants in a family.

3.2.2 Identifying Maintenance Activities (JavaScript & .NET only)

A repository with many releases shows that it is being actively maintained, since each release indicates either bug fixes or new features being introduced. To this end, we are interested in seeing the relationship between the mainline and the fork in terms of the number of package releases on the package distribution platforms. We collect the number of package releases for variants in the .NET and JavaScript ecosystems from Libraries.io.
The metrics related to variant maintenance activity are *PackageReleasesMLV* for the mainline variants and *PackageReleasesFV* for the fork variants. Unfortunately, the package manager for variants in the Android ecosystem (the Google Play store) does not keep a release history for the applications, and therefore we cannot extract variant releases from there. An alternative for collecting the variant releases in the Android ecosystem is to collect them from the repositories themselves using the GitHub API. Unfortunately, we found that using the GitHub API to collect the list of releases of a repository returns no releases for most of the repositories, even when a repository has releases. For example, we can see that the Android divergent fork imaeses / k-9\(^8\) has releases. However, when we access the fork using the GitHub API for a list of releases\(^9\), we can see that it returns an empty list. To this end, we decided not to collect package releases for the variants in the Android ecosystem.

\(^8\)https://github.com/imaeses/k-9/releases
\(^9\)https://api.github.com/repos/imaeses/k-9/releases

3.2.3 Identifying Variant Ownership Characteristics

We would like to identify whether the mainline and fork variant have common owners. This is interesting to study since it determines whether variant forks are started by the owners of the mainlines or by different developers who are not involved in the mainline. We define the owner of a repository as a contributor who has access rights to integrate changes into the repository (i.e., a repository committer). As we explained in Section 2, based on the different kinds of commit integration techniques, it might be difficult to identify the original repository of a given commit (especially in cases where a mainline has many forks). To this end, we identify a repository committer (owner) as one who has merged at least one pull request, since we are certain that only contributors who have access rights to a repository can integrate changes. We consider that the mainline and a fork variant have common owners if there exists at least one common owner between them. With this criterion, both the mainline and the fork variant should have at least one common developer (not a bot) who has merged a pull request in both repositories. This means that our ownership criterion relies on each variant having merged at least one pull request. Since we have very few variant pairs in the Android ecosystem, this would further reduce the very small dataset of variant pairs. To this end, we apply the described method only to the variants of the .NET and JavaScript ecosystems, which have moderately large to very large datasets of variant pairs, and use a different criterion, explained later, to identify the owners of Android variants. Since all the variants are published on Google Play, each variant has an owner. We identify only 89 of the 590 mainline–fork pairs in the .NET ecosystem where both the mainline and fork variant had any merged PR by a real developer. For the JavaScript ecosystem, we identify only 89 of the 10,357 mainline–fork pairs where both the mainline and fork variant had any merged PR by a real developer.

For the variant pairs in the Android ecosystem, we employ another method to identify ownership that covers the whole dataset. We mine ownership from the Google Play store.
For the variant pairs in the Android ecosystem, we employ another method to identify ownership that covers the whole dataset: we mine ownership from the Google Play store. On Google Play, each variant has a developer id (dev id) attribute, which is the name of the developer or company (owner) that uploads the variant and its updates to the marketplace.

3.2.4 Identifying Variant Popularity

We want to understand the popularity of the variants we are studying, in terms of whether they are widely used in their respective ecosystems. We extract the popularity metrics from the distribution platform of each of our studied ecosystems. We use a different popularity measure for variants in the Android ecosystem than for those from .NET and JavaScript.

- **Android variants**: For the variants in the Android ecosystem, we define two popularity metrics for the number of downloads on Google Play, DownloadsMLV and DownloadsFV, for the mainline and divergent fork, respectively. We also define two popularity metrics for the number of reviews on Google Play, ReviewsMLV and ReviewsFV, for the mainline and divergent fork, respectively.

- **JavaScript and .NET variants**: For variants in these two ecosystems, we record the number of other packages in the JavaScript and .NET ecosystems that depend on the mainline and the fork variants (DependentPackagesMLV and DependentPackagesFV, respectively). We also record the number of other projects on GitHub that depend on the mainline and fork variant (DependentProjectsMLV and DependentProjectsFV, respectively). All of a variant's dependent packages/projects are extracted from Libraries.io. The package and project dependents are a good way of measuring popularity since they give an indication of which other packages/projects are interested in the functionality provided by the variant.

3.3 Identifying Code Propagation (RQ2)

Answering RQ2 requires determining whether and how any code was propagated among the variants of a software family. To identify code propagation, we rely on categorizing commits in the history of the mainline and the forks based on the possible types of code propagation we discussed in Section 2.

Figure 3 illustrates the relationship between variants in the same family. Specifically, we demonstrate the relationship between the commits in the mainline variant of a family and any of its divergent forks. We identify two broad categories of commits: (1) common commits, which exist in both the mainline variant and the forked variant and represent either the starting commits that existed before the forking date or propagated commits, and (2) unique commits, which exist only in one variant. For each (mainline variant, fork variant) pair in a family, we first identify common commits and then identify unique commits, as follows.

3.3.1 Identifying Common Commits

To ensure we correctly categorize commits, we perform the following steps in this exact order. Once a commit is categorized in one step, we do not need to analyze it again in the following steps. We consider only the default repository branch (master/main) for both the mainline and the forks.

- **Inherited commits:** The fork date is the point in time at which the fork variant is created. At that point, all commits in the fork are the same as those in the mainline, and we refer to them as *InheritedCommits*. In Fig. 3, the *InheritedCommits* are the purple commits 1, 2, and 3.
To extract these commits for either variant, we collect all the commits from the first commit in the history until the fork date.

**Pull-Request commits:** We first collect the merged pull requests in each repository and identify the pull requests whose source and destination branches belong to the analyzed repository pair. The GitHub API `:owner/:repo/pulls/:pull_number` provides all the information of a given pull request. One can identify the source and destination branches using the pull request object's `['head']['repo']['full_name']` and `['base']['repo']['full_name']` fields from the returned JSON response, respectively. Based on the source and destination information, we can always identify the direction of the pull request as `fork → mainline` or `mainline → fork`, as shown in Fig. 3. For each pull request, we collect the pull request commits `pr_commits` using the GitHub API `:owner/:repo/pulls/:pull_number/commits`. Regardless of how a pull request gets integrated, the commit information in the source repository is always identical to that in `pr_commits`. Thus, we can always identify the pull request commits in the source repository by comparing the IDs of the commits in `pr_commits` to those in the history of the source repository. The tricky part is identifying the integrated commits in the destination repository. Based on the information discussed in Section 2 and summarized in Table 1, we can identify the pull request commits in the destination repository as follows:

- **Merged pull request commits:** Based on Table 1, the commit IDs of pull request commits integrated using the default merge option do not change. Thus, to identify these commits, we simply compare the IDs of the `pr_commits` to those in the commit history of the destination repository.

- **Rebased pull request commits:** Recall from Table 1 that integrated commits from a rebased pull request have different commit IDs on the destination branch. Thus, we identify the rebased commits in the destination branch by comparing the remaining unchanged commit metadata, such as author name, author date, commit message, and file details.

- **Squashed pull request commits:** As part of a squashed pull request's metadata, GitHub records the ID of the squashed commit on the destination branch in the `merge_commit_sha` attribute.\(^{10}\) Using this ID, we can identify the exact squashed commit in the destination repository. For extra verification, we also compare the changed files of all commits in the pull request with the changed files in the identified squashed commit.

\(^{10}\)https://developer.github.com/v3/pulls/

**Git commits:** After identifying all commits related to pull requests, we now analyze any remaining unmatched commits to identify whether they might have been propagated directly through Git commands. Recall from Section 2 that this includes merged, rebased, and cherry-picked commits.

- **Git cherry-picked commits:** We locate cherry-picked commits in the source and destination commit histories by comparing the following commit metadata: commit ID, author name, author date, commit message, and file names and file changes. We can also identify the source and the destination branches of the cherry-picked commits by looking at the committer dates of the matched commits.
We mark the commit with the earlier committer date as being from the source branch and the one with the later date as being in the destination branch.

- **Git merged and Git rebased commits:** At this point, we have already identified all integrated pull request commits as well as cherry-picked commits. Thus, any remaining commits that have the same ID in the histories of both variants must have been propagated through git merge or git rebase. As shown in Table 1 and Fig. 1, any commits integrated through git rebase have exactly the same ID and metadata in both the source and destination branch. Similarly, commits integrated through git merge also have exactly the same information. While we could differentiate git-merged and git-rebased commits by finding merge commits (those with two parents) and marking any commits between the merge commit and the common ancestor as commits that are integrated through git merge, this differentiation is not important for our purposes. We are only interested in marking both types of commits as propagated commits. Thus, for our purposes, we identify commits integrated via Git rebase or Git merge, but do not differentiate between them. Similar to pull requests, both types of commits may be pulled from either branch into the other. However, unlike pull requests, it is not possible to identify which variant a propagated commit originated from. This is because of the nature of distributed version-control systems, where commits can be in multiple repositories, but there is no central record identifying the commits' origin. Since it is common for commits to be pulled from the mainline and pushed into the fork repository as a result of the fork trying to keep in sync with the new changes in the mainline, we assume that all commits that we identify as integrated through git merge or git rebase are pulled from the mainline variant and pushed into the fork variant.

### 3.3.2 Identifying Unique Commits

To identify the unique commits between the mainline and the fork, we use the `compare` GitHub API.\(^{11}\) The `compare` API compares the mainline branch and the fork branch and, among other items, returns the diverged commits: the number of commits a given branch (say, the mainline branch) is ahead of the other branch (the fork branch), as well as the number of commits it is behind the other. The commits by which the mainline branch is ahead of the fork branch are the commits unique to the mainline, while the commits by which the mainline is behind the fork are the commits unique to the fork.

\(^{11}\)https://docs.github.com/en/rest/reference/repos#compare-two-commits

### 3.3.3 Verifying our Commit Categorization Methods

We verify our methods of identifying common commits for the different commit propagation techniques discussed in Section 3.3.1 in two phases: we first test our scripts on six toy projects we created ourselves, where we intentionally include at least one example of each commit propagation technique and verify that the commits are correctly categorized. Second, we manually analyze some of the results of our scripts on a sample of six real mainline–fork pairs from each ecosystem that are part of our data collection, and for which we provide all details in our online appendix. From an earlier version of this work in the conference paper (Businge et al. 2018), we had noticed that integrated pull requests between mainlines and variant forks were very rare.
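As a concrete illustration of the unique-commit extraction described in Section 3.3.2, the minimal sketch below queries the compare API for a mainline–fork pair. The repository pair and branch name are placeholders, and an authenticated token is assumed.

```python
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"token {os.environ.get('GITHUB_TOKEN', '')}"}


def unique_commit_counts(mainline: str, fork: str, branch: str = "master") -> dict:
    """Compare a mainline branch with the corresponding fork branch.

    `mainline` and `fork` are "owner/repo" strings from the same fork network
    (placeholders here). The comparison is run on the mainline repository, with
    the fork branch given in "owner:branch" notation.
    """
    m_owner, m_repo = mainline.split("/")
    f_owner, _ = fork.split("/")
    url = f"{GITHUB_API}/repos/{m_owner}/{m_repo}/compare/{branch}...{f_owner}:{branch}"
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    cmp = resp.json()
    return {
        # Commits the fork has that the mainline does not (unique_FV).
        "unique_fork_commits": cmp["ahead_by"],
        # Commits the mainline has that the fork does not (unique_MLV).
        "unique_mainline_commits": cmp["behind_by"],
    }


if __name__ == "__main__":
    # Hypothetical pair based on the running example in this section.
    print(unique_commit_counts("k9mail/k-9", "imaeses/k-9"))
```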
Because variant forks have a very limited number of integrated commits, when testing our scripts we also use social forks, which have many integrated commits with their mainline counterparts, in addition to the variant forks. In this section, we discuss only the following three pairs, which we show in Table 2:

| Pair | Technique | # PRs | # Commits |
|------|-----------|-------|-----------|
| **Android** | | | |
| dashevo / dash-wallet (D), sambarboza / dash-wallet (S) | PR Merged | 3 | 13 |
| | PR Squashed | 43 | 194 |
| | PR Rebased | 2 | 6 |
| | PR Unclassified | 26 | 167 |
| | Git Merge/rebase | n/a | 405 |
| | Git Cherry-pick | n/a | 0 |
| | Total | 74 | 785 |
| **.NET** | | | |
| flagbug / YoutubeExtractor (D), Kimmax / SYMMExtractor (S) | PR Merged | 2 | 2 |
| | PR Squashed | 0 | 0 |
| | PR Rebased | 0 | 0 |
| | PR Unclassified | 0 | 0 |
| | Git Merge/rebase | n/a | 3 |
| | Git Cherry-pick | n/a | 1 |
| | Total | 2 | 6 |
| **JavaScript** | | | |
| TerriaJS / terriajs (S), bioretics / rer3d-terriajs (D) | PR Merged | 9 | 101 |
| | PR Squashed | 0 | 0 |
| | PR Rebased | 0 | 0 |
| | PR Unclassified | 0 | 0 |
| | Git Merge/rebase | n/a | 1,825 |
| | Git Cherry-pick | n/a | 10 |
| | Total | 9 | 1,936 |

For the first two mainline–fork pairs in the table, S = source (fork) and D = destination (mainline). For the last mainline–fork pair, S = source (mainline) and D = destination (fork).

- (dashevo / dash-wallet, sambarboza / dash-wallet): The repository sambarboza / dash-wallet is a social fork. The mainline dashevo / dash-wallet has a total of 445 PRs. Our scripts identify that 74 of these 445 pull requests were integrated from the fork repository sambarboza / dash-wallet into the mainline repository dashevo / dash-wallet. We show the details of these 74 PRs in Table 2. Our technique identified that 3 of the 74 PRs were integrated using the PR merge option (altogether having a total of 13 commits), 43 of the 74 PRs were integrated using the PR squash option (a total of 194 commits), 2 of the 74 PRs used the PR rebase option (a total of 6 commits), and the integration option of the remaining 26 PRs was unclassified (a total of 167 commits). We identified a total of 405 commits that were integrated using the *git* merge/rebase integration option, and no commit was integrated using the *git* cherry-pick option.

- *(flagbug / YoutubeExtractor, Kimmax / SYMMExtractor):* The repository Kimmax / SYMMExtractor is a *variant fork*. The mainline flagbug / YoutubeExtractor has a total of 32 pull requests. Our scripts identify that 2 of the 32 PRs were integrated from the fork repository Kimmax / SYMMExtractor into the mainline repository flagbug / YoutubeExtractor (see details in Table 2). The two PRs were integrated using the PR merge option, with a total of two commits integrated. We also identified a total of three commits that were integrated using the *git* merge/rebase integration option and one commit that was integrated using the *git* cherry-pick option.

- *(TerriaJS / terriajs, bioretics / rer3d-terriajs):* The repository bioretics / rer3d-terriajs is a *variant fork*. The fork bioretics / rer3d-terriajs has a total of 10 pull requests. Our scripts identify that 9 of the 10 pull requests were integrated from the mainline TerriaJS / terriajs into the fork bioretics / rer3d-terriajs. The 9 PRs had a total of 101 commits. There were no commits integrated using the PR squash and PR rebase options.
A total of 1,825 were integrated using the option *git* merge/rebase integration option and only 10 commits integrated using *git* cherry-pick option.\n\nGiven the above results of our scripts, we select some of the identified code propagation techniques and manually verify them. For each analyzed mainline\u2013fork pair, we randomly sample a pull request from each identified pull request integration technique that were returned by our scripts. We manually analyze those sampled pull requests and their commits, including the commit metadata to verify the correctness of the identified propagation technique. For each of these sampled pull requests, we also randomly select two commits and manually analyze them to make sure they have been correctly classified. For example, in the pair [getodk / collect (D), lognaturel / collect (S)] (lognaturel / collect is a social fork), our script reveals that the commits in the pull requests numbered 3531, 3462 and 3434 were integrated using merging, squashing and rebasing, respectively. We manually verify that these pull requests have been in fact integrated using these techniques by looking at their commit metadata. Similarly, for the pair [dashevo / dash-wallet (D), sambarboza / dash-wallet (S)] (sambarboza / dash-wallet is a social fork), we verify that the commits in the pull requests number 421, 333, and 114 were integrated using merging, squashing, and rebasing, respectively. We also look at the results returned by integration outside GitHub (*git* merge/rebase and *git* cherry-pick). For example, our results indicate that the pair [FredJul/Flym (D), Etuldan/spaRSS (S)] (Etuldan/spaRSS is a variant fork), has no commits integrated using pull requests but had 34 and five commits integrated using *git* merge/rebase and *git* cherry-picking, respectively. We manually verify these five latter commits and confirm their correctness.\n\nAs the pair dashevo / dash-wallet, sambarboza / dash-wallet from Table 2 shows, there were some pull requests that our scripts were not able to classify. As part of our manual verification, we find that the GitHub API indicates that they are integrated into the destination repository since their *merge* date is not *null*. On deeper investigation, we discover that all the unclassified pull request commits were integrated.\ninto a different branch from the master branch. For example, pull requests 514 and 512 from the fork sambarboza/dash-wallet were both integrated in the branch evonet-develop on the mainline repository. We also observed that both pull requests had an integration build test failure (Travis CI). This explains why the commits are missing in the history of the master branch and why our scripts could not classify those integrated commits.\n\nOne would wonder if we have a threat to construct validity since we do not consider the commit integration into other branches other than the default (main/master). For example, the scenario we presented above of unclassified pull requests that were integrated in the development branch (\u201cstaging\u201d), but that were missing in the main branch since they failed the integration build test. If any of the 167 are integrated from the staging branch into the master branch using any of the integration techniques that do not completely rewrite the commit history (i.e., PR merge/squash/rebase, git merge/rebase/cherry-pick), then our script would always identify them as commits that were integrated between the mainline and the fork using the git merge/rebase option. 
Given this, our script minimizes the threat to validity posed by the unclassified pull requests.

Our manually verified data for both the toy projects and the real projects gives us confidence that our scripts can correctly identify the commits integrated through the different integration mechanisms in any mainline–fork pair of any repository.

3.3.4 Fork Variability Percentage

To quantify how much a fork differs from its mainline, we define a metric variability percentage as follows:

\[
\text{VariabilityPercentage} = \frac{\text{uniqueFV}}{\text{uniqueFV} + \text{CommonCommits}} \times 100 \tag{1}
\]

where \(\text{CommonCommits} = \text{Pull Request commits} + \text{Git commits} + \text{InheritedCommits}\), as shown in Fig. 3. \(\text{VariabilityPercentage}\) measures the percentage of unique commits in a fork relative to all the commits in that fork. A lower percentage means that most of the changes in the fork are either starting commits (i.e., the fork did not make many changes after the fork date) or merged commits that are propagated from/to the mainline. Both cases indicate that the functionality in the fork does not differ much from that in the mainline. On the other hand, a higher \(\text{VariabilityPercentage}\) indicates more specific customizations in the fork.

4 Variant Family Characteristics (RQ1)

We now present the characteristics of our identified software families within the ecosystems. Table 3 shows all the metrics we defined, together with their values.

4.1 General Variant Characteristics

- **Family Size (*FamilySize*).** Figure 4 shows the number of variants (i.e., family size) in each of the variant families of the three ecosystems we studied.

  We can see that the distributions of family sizes for all three ecosystems are right-skewed, with most families having two members. Specifically, 28 (73%) of the 38 Android families, 7,731 (87%) of the 8,837 JavaScript families, and 475 (90%) of the 526 .NET families have only two variants.
  The three distributions also show that larger families are rather rare in all three ecosystems, but that the largest family sizes we observe are part of the JavaScript ecosystem. When identifying variant families from the different ecosystems, we observe that, although Android is considered one of the largest known ecosystems (Mojica et al. 2014; Li et al. 2016; Sattler et al. 2018), identifying its variant families is rather difficult compared to the software packaging ecosystems (JavaScript and .NET) we studied. In the Android ecosystem, it is not compulsory to record the source repository of an Android variant on Google Play. To this end, we went through the lengthy process described in Section 3.1.1, applying a number of heuristics on GitHub repositories to identify families.

Table 3 Summary statistics (mean, min, median, max) of the metrics for the variants in the three ecosystems

| Metric | Mean | Min | Median | Max | Description |
|--------------------------------|------|-----|--------|------|-------------------------------------------------|
| **FamilySize** | | | | | |
| Android apps | 2.4 | 2 | 2 | 7 | Number of variants in an Android family |
| .NET apps | 2.1 | 2 | 2 | 7 | Number of variants in a .NET family |
| JavaScript apps | 2.2 | 2 | 2 | 16 | Number of variants in a JavaScript family |
| **Package Dependencies** | | | | | |
| PackageDependenciesMLV | 40.4 | 0 | 26 | 140 | Number of mainline variant package dependencies on Android |
| | 2.3 | 0 | 1 | 49 | Number of mainline variant package dependencies on .NET |
| | 11.8 | 0 | 7 | 267 | Number of mainline variant package dependencies on JavaScript |
| PackageDependenciesFV | 22 | 0 | 22 | 81 | Number of fork variant package dependencies on Android |
| | 2.0 | 0 | 1 | 25 | Number of fork variant package dependencies on .NET |
| | 9.8 | 0 | 6 | 605 | Number of fork variant package dependencies on JavaScript |
| **App Popularity (Android)** | | | | | |
| DownloadsMLV | 2,211K | 1 | 50K | 100M | Number of downloads of the mainline variant from Google Play |
| DownloadsFV | 5,479K | 5 | 1K | 100K | Number of downloads of the fork variant from Google Play |
| ReviewsMLV | 27K | 0 | 547 | 631K | Number of reviews of the mainline variant on Google Play |
| ReviewsFV | 2.8K | 0 | 45 | 161K | Number of reviews of the fork variant on Google Play |
| **App Popularity (.NET & JavaScript)** | | | | | |
| DependentPackagesMLV | 106 | 0 | 0 | 27K | Number of packages that depend on the mainline app on .NET |
| | 80 | 0 | 2 | 26K | Number of packages that depend on the mainline app on JavaScript |
| DependentPackagesFV | 0.4 | 0 | 0 | 19 | Number of .NET packages that depend on the fork app on .NET |
| | 1.7 | 0 | 0 | 2K | Number of JavaScript packages that depend on the fork app on JavaScript |
| DependentProjectsMLV | 133 | 0 | 0 | 33K | Number of .NET projects that depend on the mainline app on GitHub |
| | 140 | 0 | 0 | 83K | Number of JavaScript projects that depend on the mainline app on GitHub |
| DependentProjectsFV | 0.5 | 0 | 0 | 82 | Number of .NET projects that depend on the fork app on GitHub |
| | 2 | 0 | 0 | 5K | Number of JavaScript projects that depend on the fork app on GitHub |

Table 3 (continued)

| Metric | Mean | Min | Median | Max | Description |
|-------------------------------|------|-----|--------|------|-------------------------------------------------|
| **App Maintenance (.NET & JavaScript)** | | | | | |
| PackageReleasesMLV | 14.6 | 1 | 2 | 188 | Number of mainline variant package releases on .NET |
| | 15 | 1 | 8 | 1117 | Number of mainline variant package releases on JavaScript |
| PackageReleasesFV | 3.6 | 1 | 2 | 54 | Number of fork variant package releases on .NET |
| | 4 | 1 | 2 | 341 | Number of fork variant package releases on JavaScript |

MLV = mainline variant; FV = fork variant

Fig. 4 Distribution of family sizes (number of variants in a family) of the three ecosystems. A variant family contains one mainline variant and one or more fork variants. The presented data corresponds to 38, 8,837, and 526 software families (Android, JavaScript, and .NET, respectively). Note that the y-axes of Figs. 4b and c are presented on logarithmic scales, and the axes of the figures are presented on different scales for visibility purposes.

- **Variant Package Dependencies**: In Fig. 5, we present scatter plots of the mainline dependencies versus the fork dependencies. Figures 5a to c show the number of fork variant package dependencies (y-axis) versus the number of mainline variant package dependencies (x-axis) for Android, .NET, and JavaScript variants, respectively. A point in any of the scatter plots represents the number of package dependencies of a given fork variant (y-axis) and the number of package dependencies of the counterpart mainline variant (x-axis). In all scatter plots, it is not surprising that the number of package dependencies for a fork and its corresponding mainline are correlated. This confirms that fork variants inherit the original dependencies of the mainline. However, we also observe points in all the scatter plots where one variant has more dependencies than the other. This means that the variant with more package dependencies has functionality that is not included in the counterpart variant. Although the observation is more prominent for the mainline variant, since we see many points below the diagonal lines for the two graphs (the forks do not keep in sync with the mainline), it is interesting that we also have some fork variants with more dependencies. Follow-up studies could investigate what new functionalities related to the used dependencies are being introduced in the variants, and why.

**Fig. 5** Scatter plots of mainline and fork variant dependencies on other packages in the ecosystems. The datasets comprise 54 mainline–fork pairs for Android, 590 mainline–fork pairs for .NET, and 10,357 mainline–fork pairs for JavaScript. **Note**: The graphs are presented on different scales for visibility purposes.

- **Android variant categories:**

  Figure 6 shows the distribution of variants in the different categories on Google Play. We can see that 12 of the 54 forks (22%) are listed in a different category from the mainline, which suggests that these variants serve different purposes. However, the majority of pairs include variants in the same category.

### 4.2 Variant Maintenance Activity (JavaScript & .NET)

Figure 7 shows the release distributions for both the mainline and the fork variants in the JavaScript and .NET ecosystems. Each point on the x-axis represents a pair, and we sort the pairs by the number of mainline package releases. Figure 7a shows that the majority of mainline variants have multiple releases. Specifically, 5,888 of the 8,835 (67%) mainline variants have $\geq 5$ package releases on the JavaScript package manager. The fork variants have fewer, but still multiple, releases.
Specifically, 2,389 of the 10,357 fork variants (23%) have $\geq 5$ package releases on the JavaScript package manager. Interestingly, from the plot we also observe a number of forks having more releases than their mainlines. Looking at Fig. 7b, for the .NET variants, we observe a distribution similar to that of the JavaScript variants in Fig. 7a. These results are interesting, since they indicate that developers of forked variants usually do not make a one-off package distribution. They are continuously distributing new releases of their packages, further emphasizing that these are indeed variant forks.

**Fig. 6** Relationship between the variant categories listed on Google Play for each variant in the Android mainline–fork pairs. *Same* = the mainline–fork pair shares the same category and *Different* = the mainline–fork pair has different categories.

**Observation 1–RQ1:** Families in fact exist in our three software ecosystems. We collected 38, 526, and 8,837 different families (Android, .NET, and JavaScript, respectively). While both the mainlines and forks have multiple releases, the number of mainline releases is significantly higher than that of the forks. Still, the forks are usually not one-shot releases, and some even have more releases than their mainlines.

### 4.3 Variant Ownership Characteristics

Figure 8 shows the percentage of common owners in the mainline–fork variant pairs of our three studied ecosystems. For the Android variants, the analysis is based on all the data we collected (54 mainline–fork variant pairs). However, for the .NET and JavaScript variants we only analysed a subset of the .NET and JavaScript mainline–fork pairs, respectively, due to the criterion we set out for identifying variant ownership in Section 3.2. From Fig. 8, we can see roughly the same percentages of common (Yes) and non-common (No) owners across the three ecosystems. Overall, our results imply that the majority of forked variants are started and maintained by developers different from those maintaining the mainline counterparts.

**Observation 2–RQ1:** The majority of the mainline–fork variant pairs for the three ecosystems we investigated are owned by different developers (91% for Android variants, 95% of JavaScript variants and 92% of .NET variants). This implies that the majority of forked variants in our datasets are started and maintained by developers different from those maintaining the mainline counterparts.

Fig. 8 Variant owners for the mainline–fork variant pairs in the three ecosystems. Yes = the mainline–fork variant pair has common developers and No = the mainline–fork variant pair does not have common developers. The datasets comprise 54 mainline–fork variant pairs from Android, 985 from JavaScript, and 89 from .NET. Note: The graphs are presented on different scales for visibility purposes.

4.4 Variant Popularity Characteristics

Figure 9 shows the variant popularity for the variants in the three ecosystems of Android, JavaScript, and .NET.

- **Android variants**: Figure 9a shows the variant downloads distribution for both the mainline and fork variants, where each point on the x-axis represents a pair and we sort the pairs by the number of mainline downloads. We observe that the majority of the mainline variants are quite popular: 27 of the 38 mainline variants (71%) have $\geq 10K$ downloads.
For fork variant popularity in terms of downloads, we observe that 10 of the 54 fork variants (19%) have $\geq 10K$ downloads. We believe it is natural that the mainline variants are more popular than their fork counterparts, since we assume they have been released first on Google Play.\(^{12}\) Figure 9b shows the variant reviews distribution for both the mainline and fork variants, where each point on the x-axis represents a pair and we sort the pairs by the number of mainline reviews. We observe a distribution for the number of reviews similar to that observed for the number of downloads. This is not surprising, since previous studies have found downloads and reviews to be correlated (Businge et al. 2019). Overall, the variant popularity we observe gives us confidence that our data set consists of real variants.

\(^{12}\)Note that Google Play does not keep the release history of its variants, so it is not possible to obtain the first listing date of each variant.

Fig. 9 Distributions of mainline and fork variants' popularity metrics for the variants in the three ecosystems of Android, JavaScript and .NET. The datasets comprise 54 mainline–fork pairs for Android, 10,357 mainline–fork pairs for JavaScript, and 590 mainline–fork pairs for .NET.

- **JavaScript and .NET variants**: In Figs. 9c–f we present the popularity graphs for the variants in the two ecosystems of .NET and JavaScript. Figure 9c shows the dependent packages distributions for both the mainline and fork variants, where each point on the x-axis represents a pair and we sort the pairs by the number of mainline dependent packages. We observe that the majority of mainline variants are quite popular: 6,157 of the 10,357 mainline variants (59%) have at least two dependent packages. For fork variants, we observe that 1,624 of the 10,357 fork variants (16%) have at least two dependent packages. Figure 9d shows the dependent projects distributions for both the mainline and fork variants in the JavaScript ecosystem. Each point on the x-axis represents a pair and we sort the pairs by the number of mainline dependent projects. We also observe a distribution for the number of dependent projects similar to that observed for the number of dependent packages. The remaining two graphs, Figs. 9e and f, show the same data for the .NET ecosystem, and both show similar trends to those observed for JavaScript.

Comparing the popularity across all the ecosystems, we observe that the mainline variants are more popular than their fork variant counterparts. This is not surprising, since the forks are clones of the mainline. However, from Fig. 9, in all the three ecosystems, it is interesting to observe a few fork variants being more popular than their mainline counterparts. In a follow-up study it would be interesting to investigate possible explanations for why these variants are more popular than their mainline counterparts. Comparing the popularity of the variants in the JavaScript and .NET ecosystems, we observe that on average the variants in the JavaScript ecosystem are more popular than the variants in the .NET ecosystem. We also observe that the fork variants in the .NET ecosystem are less popular (have fewer dependent packages/projects) than those in the JavaScript ecosystem.
In a follow-up study it would also be interesting to investigate why variants in JavaScript families are more popular than variants in .NET families, and also why the fork variants in the JavaScript families are more popular than the fork variants in the .NET families.

Tables 4 and 5 present a few examples showing the variant popularity (for all three ecosystems) and the variant maintenance activities (for .NET and JavaScript only). In the mainline and fork columns of Table 5, we use the package names of the variants, since the repository names on GitHub were too long. In both tables, we present two interesting examples of variant pairs that we randomly picked: (1) **abandoned mainlines**: the first variant pair in each of the ecosystems has the fork variant more popular than the mainline. When we compared the last release dates of the variants in all the ecosystems, we observed that the mainlines seem to have been abandoned while the fork variants continued to evolve. This is the reason the fork variants are more popular. In Table 5 we can also see that these fork variants have more releases than their mainlines. (2) **Co-evolution**: with the second pair in each of the ecosystems, we present another interesting case, where both the mainline and the fork variant are continuously being maintained and where both are popular. In these cases, it would be interesting to study the co-evolution of the variants in both technical and social aspects. Technical: for example, investigating whether the variants are complementary or competing. Social: what can we learn about the variant communities?

Table 4 Example mainline–fork pairs from the Android ecosystem showing statistics on popularity

| mainline | fork | mainline downloads | fork downloads | mainline reviews | fork reviews |
|----------|------|--------------------|----------------|------------------|--------------|
| TobyRich / app-smartplane-android | TailorToys / app-powerup-android | 10K | 100K | 106 | 1,034 |
| opendatakit / collect | kobotoolbox / collect | 1,000K | 100K | 3,049 | 1,527 |

Table 5 Example mainline–fork pairs from the .NET and JavaScript ecosystems showing statistics on the popularity and maintenance activities

| | mainline | fork | mainline dependent packages | fork dependent packages | mainline package releases | fork package releases |
|------|----------|------|-----------------------------|-------------------------|---------------------------|-----------------------|
| .NET | Flurl.Signed | Flurl.Http.Signed | 3 | 10 | 6 | 10 |
| | Ninject | Portable.Ninject | 638 | 19 | 75 | 14 |
| JS | selenium | selenium-server | 97 | 2,046 | 2 | 51 |
| | gulp-istanbul | gulp-babel-istanbul | 5,867 | 11 | 24 | 14 |

JS = JavaScript

Observation 3–RQ1: Although the mainline variants are more popular, which is not surprising, there are quite a number of fork variants that are also popular. We also observe a few of the fork variants being more popular than their mainline counterparts. This again tells us that the forks we are studying are indeed variant forks, being used by a community of other developers (in the case of .NET and JavaScript variants) and, for Android variants, being downloaded and installed on users' phones. We have pointed out some interesting research directions that can be investigated in follow-up studies.

5 Code Propagation in the Software Families (RQ2)

So far, we have analyzed the characteristics of the software families across our three ecosystems.
Our results from RQ1 give us confidence that the fork variants in our data set are indeed variant forks. In RQ2, we present the results of how variants in the same family co-evolve. Specifically, we are interested in their code propagation practices, to understand whether the variants evolve separately or propagate code between each other after the forking date. We present the results of code propagation between family variants in terms of propagated commits, while differentiating the propagation mechanisms we explained in Sections 2 and 3.3. Recall that these commit types determine the various code propagation strategies (e.g., pull requests versus direct integration through git).

Tables 6, 7, 8 and 9 show the metrics we use in this RQ to measure the types of propagated commits in the ecosystems of Android, JavaScript, and .NET. Where applicable, we specify the direction of the propagated code, i.e., mainline→fork or fork→mainline. Recall from Section 3.3.1 that we do not differentiate between git merge and git rebase commits and that we assume that all integrated git merge and git rebase commits are in the direction mainline→fork. This is why Tables 7 and 8 show only one metric, gitPullMLV-FV, to represent these two commit integration types. Tables 6–9 show the summary of the descriptive statistics of all the metrics we use to investigate code propagation at the commit level for all three ecosystems of Android, JavaScript, and .NET.

Table 6 Summary statistics of the pull-request-based code propagation metrics for the mainline–fork pairs in the Android, .NET, and JavaScript ecosystems

| Metric | Mean | Min | Median | Max | Description |
|-------------------------|------|-----|--------|------|--------------------------------------------------|
| **Android variants** | | | | | |
| mergedPRsMLV-FV | 0.31 | 0 | 0 | 15 | Number of merged PRs from the mainline to the fork variant. |
| mergedPRsFV-MLV | 0.09 | 0 | 0 | 4 | Number of merged PRs from the fork to the mainline variant. |
| prMergedCommitsMLV-FV | 8.33 | 0 | 0 | 427 | Number of merged PR commits from the mainline to the fork variant. |
| prMergedCommitsFV-MLV | 0.57 | 0 | 0 | 28 | Number of merged PR commits from the fork to the mainline variant. |
| prSquashedMLV-FV | 0 | 0 | 0 | 0 | Number of squashed PRs from the mainline to the fork variant. |
| prSquashedFV-MLV | 0 | 0 | 0 | 0 | Number of squashed PRs from the fork to the mainline variant. |
| prRebasedMLV-FV | 0 | 0 | 0 | 0 | Number of rebased PRs from the mainline to the fork variant. |
| prRebasedFV-MLV | 0 | 0 | 0 | 0 | Number of rebased PRs from the fork to the mainline variant. |
| **.NET variants** | | | | | |
| mergedPRsMLV-FV | 0 | 0 | 0 | 3 | Number of merged PRs from the mainline to the fork variant. |
| mergedPRsFV-MLV | 0.2 | 0 | 0 | 13 | Number of merged PRs from the fork to the mainline variant. |
| prMergedCommitsMLV-FV | 0.2 | 0 | 0 | 30 | Number of merged PR commits from the mainline to the fork variant. |
| prMergedCommitsFV-MLV | 1.2 | 0 | 0 | 207 | Number of merged PR commits from the fork to the mainline variant. |
| prSquashedMLV-FV | 0 | 0 | 0 | 0 | Number of squashed PRs from the mainline to the fork variant. |
| prSquashedFV-MLV | 0 | 0 | 0 | 5 | Number of squashed PRs from the fork to the mainline variant. |
| prSquashedCommitsFV-MLV | 0.1 | 0 | 0 | 14 | Number of squashed PR commits from the fork to the mainline variant. |
| prRebasedMLV-FV | 0 | 0 | 0 | 0 | Number of rebased PRs from the mainline to the fork variant. |
| prRebasedFV-MLV | 0 | 0 | 0 | 0 | Number of rebased PRs from the fork to the mainline variant. |

Table 6 (continued)

| Metric | Mean | Min | Median | Max | Description |
|-------------------------------|------|-----|--------|-----|-----------------------------------------------------------------------------|
| **JavaScript variants** | | | | | |
| mergedPRs<sub>MLV-FV</sub> | 0 | 0 | 0 | 26 | Number of merged PRs from the mainline to the fork variant. |
| mergedPRs<sub>FV-MLV</sub> | 0.4 | 0 | 0 | 4 | Number of merged PRs from the fork to the mainline variant. |
| prMergedCommits<sub>MLV-FV</sub> | 0.1 | 0 | 0 | 399 | Number of merged PR commits from the mainline to the fork variant. |
| prMergedCommits<sub>FV-MLV</sub> | 0.57 | 0 | 0 | 28 | Number of merged PR commits from the fork to the mainline variant. |
| prSquashed<sub>MLV-FV</sub> | 0 | 0 | 0 | 2 | Number of squashed PRs from the mainline to the fork variant. |
| prSquashed<sub>FV-MLV</sub> | 0 | 0 | 0 | 21 | Number of squashed PRs from the fork to the mainline variant. |
| prSquashedCommits<sub>MLV-FV</sub> | 0.4 | 0 | 0 | 52 | Number of squashed PR commits from the mainline to the fork variant. |
| prSquashedCommits<sub>FV-MLV</sub> | 0 | 0 | 0 | 109 | Number of squashed PR commits from the fork to the mainline variant. |
| prRebased<sub>MLV-FV</sub> | 0 | 0 | 0 | 2 | Number of rebased PRs from the mainline to the fork variant. |
| prRebased<sub>FV-MLV</sub> | 0 | 0 | 0 | 3 | Number of rebased PRs from the fork to the mainline variant. |
| prRebasedCommits<sub>MLV-FV</sub> | 0.4 | 0 | 0 | 4 | Number of rebased PR commits from the mainline to the fork variant. |
| prRebasedCommits<sub>FV-MLV</sub> | 0 | 0 | 0 | 25 | Number of rebased PR commits from the fork to the mainline variant. |

5.1 Pull Request Propagation (Commit Integration Inside GitHub)

We present the results of the pull request integration techniques (merge, rebase and squash, as well as the unclassified PRs) for the mainline–fork pairs in all three ecosystems of Android, JavaScript, and .NET. In Table 6 we present the summary statistics, and in Table 7 we present the detailed statistics. We also present the distributions of the integrations in both directions in Fig. 10.

Figure 10 shows box plots of the distributions of the different PR integration techniques. For example, for the variants in the Android ecosystem, the distributions of the PR integrations in both directions, mainline → fork and fork → mainline, are shown in Fig. 10a. There was only one pull request in each direction of integration. Both pull requests were integrated using the PR merge option. There was no PR integrated using any of the other PR integration options. We can see that in all the boxplots the majority of the mainline–fork variant pairs have zero PRs integrated in either direction.
This implies that most of the pairs do not integrate PRs between themselves.

Table 7 Number of mainline–fork pairs, pull requests, and commits involved in code propagation in our datasets of 54, 10,357, and 590 mainline–fork pairs from the ecosystems of Android, JavaScript, and .NET, respectively

| | Mainline → Fork | | | Fork → Mainline | | |
|------------------|-------|-----|---------|-------|-----|---------|
| | Pairs | PRs | Commits | Pairs | PRs | Commits |
| **Android variants** | | | | | | |
| PR Merged | 1 | 1 | 5 | 1 | 2 | 427 |
| Rebased | 0 | 0 | 0 | 0 | 0 | 0 |
| Squashed | 0 | 0 | 0 | 0 | 0 | 0 |
| Unclassified | 0 | 0 | 0 | 0 | 0 | 0 |
| Git Cherry-pick | 5 | n/a | 250 | 4 | n/a | 136 |
| gitPullMLV-FV | 18 | n/a | 13,198 | n/a | n/a | n/a |
| **.NET variants** | | | | | | |
| PR Merged | 9 | 13 | 96 | 67 | 139 | 721 |
| Rebased | 0 | 0 | 0 | 0 | 0 | 0 |
| Squashed | 0 | 0 | 0 | 13 | 21 | 72 |
| Unclassified | 0 | 0 | 0 | 3 | 3 | 9 |
| Git Cherry-pick | 15 | n/a | 99 | 16 | n/a | 138 |
| gitPullMLV-FV | 106 | n/a | 5,601 | n/a | n/a | n/a |
| **JavaScript variants** | | | | | | |
| PR Merged | 99 | 162 | 1,862 | 724 | 1,394 | 4,523 |
| Rebased | 1 | 1 | 4 | 11 | 13 | 67 |
| Squashed | 5 | 6 | 72 | 132 | 250 | 1,048 |
| Unclassified | 7 | 10 | 33 | 23 | 32 | 134 |
| Git Cherry-pick | 95 | n/a | 275 | 91 | n/a | 251 |
| gitPullMLV-FV | 1,180 | n/a | 40,001 | n/a | n/a | n/a |

For example, for the Android apps, in the first row, in the direction of mainline→fork only 1 fork variant merged 1 PR from the mainline, containing 5 commits, and in the direction of fork→mainline only 1 mainline merged 2 PRs, containing 427 commits.

Table 7 shows the detailed statistics behind these distributions. For example, in the top section of Table 7 (Android variants), in the first row, we observe 1 of the 54 mainline–fork variant pairs that integrated 1 PR, having a total of 5 commits, using the merge pull request option, in the direction of mainline→fork. In the same row, in the direction of fork→mainline, we observe 1 mainline–fork pair that integrated 2 PRs, having a total of 427 commits, using the merge pull request option.

We can see that for Android variants only 1 of the 54 (1.9%) mainline–fork pairs integrated commits using the merge pull request option. We observe more or less similar trends for the mainline–fork variant pairs in the other two ecosystems. For the JavaScript mainline–fork variant pairs, we observe 99 of the 10,357 mainline–fork variant pairs (1%) integrating commits using the merge pull request option in the direction of mainline→fork, and 724 of the 10,357 mainline–fork pairs (7%) in the direction of fork→mainline. We observe very few mainline–fork variant pairs in the JavaScript software packaging ecosystem integrating commits using the pull request squash/rebase options in either integration direction. For the mainline–fork variant pairs in the .NET ecosystem, we observe 9 of the 590 mainline–fork pairs (1.5%) and 67 of the 590 mainline–fork pairs (11.3%) integrating commits using the merge pull request option in the direction of mainline→fork and fork→mainline, respectively. We did not observe any commits integrated using the rebased pull request option in either integration direction, while for the commits integrated using the squash pull request option, we only observed integration in the direction of fork→mainline, accounting for 13 of the 590 mainline–fork pairs (2%).

We observe that there are more mainline–fork variant pairs integrating commits in the direction of fork→mainline than in the direction of mainline→fork, irrespective of the PR integration option used. For Android variants we observed 1 pair in either direction (1.9% each); for JavaScript variants we have 867 of 10,357 mainline–fork pairs (8.4%) in the direction of fork→mainline compared to 105 of 10,357 mainline–fork pairs (1%) in the direction of mainline→fork. Regarding the pull request integration options, we can see that the merge pull request option is clearly the most frequently used in all integration directions and in all the three ecosystems. In all three ecosystems, the squash and rebase options are rarely used. However, comparing the two PR options, squash and rebase, we observe that the squash PR option is used more often.

Table 8 Git based (outside GitHub) code propagation practices, at commit level, for the 54, 10,357, and 590 mainline–fork pairs in the Android, JavaScript, and .NET ecosystems, respectively

| Metric | Mean | Min | Median | Max | Description |
|-------------------------|------|-----|--------|------|-----------------------------------------------------------------------------|
| **Android variants** | | | | | |
| gitCherrypickedMLV-FV | 4.6 | 0 | 0 | 168 | Number of git cherry-picked commits from the mainline to the fork variant. |
| gitCherrypickedFV-MLV | 2.5 | 0 | 0 | 75 | Number of git cherry-picked commits from the fork to the mainline variant. |
| gitPullMLV-FV | 244 | 0 | 0 | 6567 | Number of git merged/rebased commits from the mainline to the fork variant. |
| **.NET variants** | | | | | |
| gitCherrypickedMLV-FV | 1.5 | 0 | 0 | 42 | Number of git cherry-picked commits from the mainline to the fork variant. |
| gitCherrypickedFV-MLV | 0.4 | 0 | 0 | 148 | Number of git cherry-picked commits from the fork to the mainline variant. |
| gitPullMLV-FV | 9.5 | 0 | 0 | 2,317 | Number of git merged/rebased commits from the mainline to the fork variant. |
| **JavaScript variants** | | | | | |
| gitCherrypickedMLV-FV | 4.6 | 0 | 0 | 168 | Number of git cherry-picked commits from the mainline to the fork variant. |
| gitCherrypickedFV-MLV | 0 | 0 | 0 | 70 | Number of git cherry-picked commits from the fork to the mainline variant. |
| gitPullMLV-FV | 3.7 | 0 | 0 | 6,035 | Number of git merged/rebased commits from the mainline to the fork variant. |

Table 9 Unique commits and variability percentage for the 54, 10,357, and 590 mainline–fork pairs in the Android, JavaScript, and .NET ecosystems, respectively

| Metric | Mean | Min | Median | Max | Description |
|-----------------|-------|-----|--------|--------|-----------------------------------------------------------------------------|
| **Android variants** | | | | | |
| unique<sub>MLV</sub> | 1,122 | 0 | 228 | 18,961 | Number of unique commits in the mainline variant in a given mainline–fork pair. |
| unique<sub>FV</sub> | 98.3 | 1 | 16 | 1,646 | Number of unique commits in the fork variant in a given mainline–fork pair. |
| InheritedCommits | 1,884 | 10 | 755 | 29,110 | Number of common commits between a given fork and the mainline variant. |
| VariabilityPercentage | 15 | 0 | 2.7 | 93.8 | Percentage of unique commits according to (1). |
| **.NET variants** | | | | | |
| unique<sub>MLV</sub> | 102.2 | 0 | 3 | 10,789 | Number of unique commits in the mainline variant in a given mainline–fork pair. |
| unique<sub>FV</sub> | 16.2 | 0 | 5 | 605 | Number of unique commits in the fork variant in a given mainline–fork pair. |
| InheritedCommits | 224.5 | 0 | 42.1 | 20,538 | Number of common commits between a given fork and the mainline variant. |
| VariabilityPercentage | 20 | 0 | 11 | 99 | Percentage of unique commits according to (1). |
| **JavaScript variants** | | | | | |
| unique<sub>MLV</sub> | 33.5 | 0 | 3 | 10,223 | Number of unique commits in the mainline variant in a given mainline–fork pair. |
| unique<sub>FV</sub> | 12.8 | 0 | 5 | 1,229 | Number of unique commits in the fork variant in a given mainline–fork pair. |
| InheritedCommits | 111.5 | 14 | 32 | 66,861 | Number of common commits between a given fork and the mainline variant. |
| VariabilityPercentage | 22.3 | 0 | 14 | 99 | Percentage of unique commits according to (1). |

**Observation 1–RQ2**: Code propagation using PRs is rarely used in the mainline–fork variant pairs from the three ecosystems that we studied. Unsurprisingly, we have observed that PRs in the direction of fork → mainline are more numerous than those in the direction of mainline → fork. However, although the numbers are low, there are some PRs in the direction of mainline → fork. We have also observed that, in all the three ecosystems, the most used integration option is by far the merge PR option. The squash and rebase PR options are less frequently used in mainline–fork variant pairs in all the three ecosystems, although the squash PR option is used more than the rebase PR option. The low numbers could be attributed to the fact that the fork variants are created not to submit PRs but to diverge away from the mainline to solve a different problem. A follow-up user study could investigate the motivation behind fork variant creation and why there is limited collaboration between mainline and fork variants.

5.2 Git Propagation (Commit Integration Outside GitHub)

In this section we present the results of commit integration outside GitHub, relating to `git cherry-pick` and `git merge/rebase` (gitPullMLV-FV). The summary statistics of these two commit integration techniques are presented in Table 8, and the detailed results corresponding to them are presented in Table 7. We first present the results of `git cherry-pick`, followed by the results of `git merge/rebase`.

- **git cherry-pick commit integration**: As we stated in Section 3.3, commits can be cherry-picked in two directions: mainline→fork or fork→mainline. The two metrics `gitCherrypickedMLV-FV` and `gitCherrypickedFV-MLV` (in Table 8) correspond to these two commit integration directions, mainline→fork and fork→mainline, respectively, in the three ecosystems. In Fig. 11 we present boxplot distributions corresponding to the results in Table 8. We can see that all the distributions only show outliers, meaning that most pairs do not have cherry-picked commits. The detailed statistics in Table 7 reveal the same results.
For example, in the upper part of Table 7, presenting the Android variants, we can see that there are only 5 of the 54 mainline–fork pairs (9%) that integrated a total of 250 commits in the direction of mainline→fork. In the direction of fork→mainline there were 4 of the 54 mainline–fork pairs (7.4%) integrating a total of 136 commits. Like the results of pull request integration presented earlier, we can clearly see that commit integration using `git cherry-pick` is rarely used in the mainline–fork variant pairs in all the three ecosystems we have studied. Unlike pull request integration, where the developer has to sync upstream or downstream all the new changes, with `git cherry-pick` the developer has to search for specific commits to integrate. This requires first looking into the pool of new changes and identifying the ones of interest to cherry-pick. If the mainline and the fork variant have diverged to solve different problems, then finding the interesting commits in the new changes might be laborious. We hypothesize that this could be one of the reasons why so few cherry-picked commits are observed in mainline–fork variant pairs in the three ecosystems. A follow-up study to confirm or refute this hypothesis would add value to this study.

- **`git merge/rebase` commit integration**: In Table 8 we can see the metric `gitPullMLV-FV` representing the `git merge/rebase` commit integration in the direction of mainline→fork, in the three ecosystems. Again, we can see that the medians for this metric in all the three ecosystems are zero. Figure 11 shows three boxplots of the distributions of the `gitPullMLV-FV` metric for the mainline–fork variant pairs in the three ecosystems. From the boxplots, we can also observe that the medians are all zero. In Table 7 we present the detailed statistics for the metric `gitPullMLV-FV`. For Android mainline–fork variant pairs, we observe 18 of the 54 mainline–fork pairs (33%) with a total of 13,198 commits being integrated in the direction of mainline→fork. For .NET mainline–fork variant pairs, we observe 106 of the 590 mainline–fork pairs (18%) with a total of 5,601 commits being integrated in the direction of mainline→fork. And finally, for JavaScript mainline–fork variant pairs, we observe 1,180 of the 10,357 mainline–fork pairs (11%) with a total of 40,001 commits being integrated in the direction of mainline→fork. We can see that, although `git merge/rebase` is still rarely used in the mainline–fork variants in all the three ecosystems, it is used more than the other two options of pull requests and `git cherry-pick`. We can conclude that `git merge/rebase` is the most used code integration mechanism between the variants in variant families. Again, we speculate that the lack of integration in mainline–fork variant pairs could be a result of the variants diverging to solve problems different from those being solved by their mainline counterparts.

**Observation 2–RQ2**: Like integration using PRs, the `git merge/rebase` and `git cherry-pick` integration techniques are also infrequently used by the variants in the three ecosystems. However, we observe that integration using `git merge/rebase` is the most commonly used integration mechanism between the mainline–fork variants in all the three ecosystems, and it occurs in the integration direction of mainline→fork.
In general, a follow-up study to investigate why most variants do not share code would reveal the reasons for the low levels of integration.

### 5.2.1 Fork Variability Percentage

This section presents the results of the variability percentage (metric `VariabilityPercentage`) for the fork variants in the three ecosystems. In Table 9, we present the summary statistics for the metrics used to calculate `VariabilityPercentage` in (1). Figure 12 presents the distributions of the metric `VariabilityPercentage` for the fork variants in the three ecosystems. We can see that the medians are 2.7%, 11%, and 14% for the variants in the three ecosystems of Android, .NET, and JavaScript, respectively. A high value of the metric `VariabilityPercentage` implies that the fork differs from its mainline counterpart. For the fork variants in the Android ecosystem, we observe that quite a number of the forks, 35 of the 54 (35%), have a high `VariabilityPercentage` (≥ 10%). For the fork variants from the .NET ecosystem, we also observe that the majority of the forks, 281/590 (53%), have a high `VariabilityPercentage` (≥ 10%). Lastly, for the fork variants in the JavaScript ecosystem, we also observe that the majority of the forks, 6,076/10,357 (58%), have a relatively high `VariabilityPercentage` (≥ 10%).

Fig. 12 Distribution of the fork variability percentage (`VariabilityPercentage`) for the variants in the three ecosystems. The datasets comprise 54, 10,357, and 590 fork variants from the ecosystems of Android, JavaScript, and .NET, respectively.

Observation 3–RQ2: The majority of the fork variants in the three ecosystems of Android, JavaScript, and .NET differ considerably from their mainline counterparts (i.e., they have high numbers of unique commits). The finding that fork variants differ from their mainlines supports our earlier finding of limited commit integration in the mainline–fork variant pairs in the three ecosystems.

5.3 Summary

We have presented the results of code propagation practices among mainline–fork variant pairs from the three ecosystems of Android, .NET, and JavaScript. Overall, in all the studied mainline–fork variant pairs of the three ecosystems, we observe infrequent code propagation, regardless of the type of propagation mechanism or its direction. The most used code propagation technique is `git merge/rebase`, which is used in 33% of Android mainline–fork pairs, 11% of JavaScript pairs, and 18% of .NET pairs. For integration using pull requests, developers more often integrate code in the direction of fork → mainline than in the direction of mainline → fork, in all the mainline–fork variant pairs. Code integration in the direction of mainline → fork is often done using the `merge` pull request option or `git merge/rebase` outside GitHub. Moreover, the `squash` and `rebase` pull request options are less frequently used in mainline–fork variant pairs, although the `squash` PR option is used more than the `rebase` pull request option. Finally, by comparing the fork variability percentages, we observed a high percentage difference between the fork variants and their mainline counterparts, indicated by the higher number of unique commits. These results are consistent across all the variants of the three ecosystems (i.e., Android, JavaScript, and .NET) that we studied.
5.3 Summary\n\nWe have presented the results of code propagation practices among mainline\u2013fork variant pairs from the three ecosystems of Android, .NET, and JavaScript. Overall, in all the studied mainline\u2013fork variant pairs of the three ecosystems, we observe infrequent code propagation, regardless of the propagation mechanism or direction. The most used code propagation technique is `git merge/rebase`, which is used in 33% of Android mainline\u2013fork pairs, 11% of JavaScript pairs, and 18% of .NET pairs. For integration using pull requests, developers integrate code in the direction of fork \u2192 mainline more often than in the direction of mainline \u2192 fork, in all the mainline\u2013fork variants. Code integration in the direction of mainline \u2192 fork is often done using the `merge` pull request option or `git merge/rebase` outside GitHub. Moreover, the `squash` and `rebase` pull request options are less frequently used in mainline\u2013fork variant pairs, although the `squash` PR option is used more than the `rebase` pull request option. Finally, by comparing the fork variability percentage, we observed a high percentage difference between the fork variants and their mainline counterparts, indicated by the higher number of unique commits. These results are consistent across all the variants of the three ecosystems (i.e., Android, JavaScript, and .NET) that we studied. Our findings potentially indicate that the fork variants are created with the intention of diverging away from the mainline to solve a different problem (i.e., with no intention to sync in any way with the original mainline). Future studies could investigate the motivation behind the creation of fork variants and why there is limited collaboration between mainline and fork variants.\n\n6 Discussion and Implications\n\nThe observations from our two research questions have several implications for future research on the co-evolution of software families and for respective tool support.\n\n**Implications for Identifying Variant Forks** As opposed to previous studies that relied on heuristics applied to GitHub repositories to identify variant forks, in this study we ensure that all members of a variant family represent different variants in the respective marketplace (Google Play, npm, and NuGet). Relying only on heuristics applied to GitHub repositories to find variant forks may yield false positives (i.e., a fork classified as a variant fork that is actually a social fork). The method for identifying divergent forks can be reused by other researchers interested in studying variant families in other ecosystems, including operating-system packages (e.g., Debian packages (Berger et al. 2014)) and ecosystems established for other programming languages. In fact, most popular programming languages today, such as JavaScript, Java, PHP, .NET, Python, and many more, have their own package managers available that host hundreds of thousands of packages. More details on these package managers can be found on Libraries.io, the platform we used to identify and extract details about variant families from the JavaScript and .NET ecosystems. Libraries.io references packages from over 37 package managers, from which one can obtain software families in the different ecosystems.\n\n**Implications for Forking Studies** Observation 1\u2013RQ2 and Observation 2\u2013RQ2 suggest that, in our studied divergent forks, direct integration using `git` outside of GitHub is more commonly used than GitHub pull requests. *This implies that simply relying on pull requests to understand code propagation practices in divergent forks is not enough.* Furthermore, it seems that integration using `git rebase` is common, as per Observation 2\u2013RQ2. Rebasing complicates the git history, and empirical studies that do not consider rebasing may report skewed, biased, and inaccurate observations (Paix\u00e3o and Maia 2019). Thus, in addition to looking beyond pull requests when studying code propagation, *studies must also consider rebased commits.* In this paper, we contribute reusable tooling for identifying these rebased commits; a minimal git-level sketch of the underlying idea is shown below.
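The following is not our actual toolchain, only a minimal sketch of the underlying idea using git\u2019s built-in patch-equivalence machinery: commits propagated via rebase or cherry-pick receive new commit IDs, but the patch-id (a hash of the diff) usually survives, so patch-equivalent commits can still be matched across a mainline\u2013fork pair. Remote and branch names are illustrative.

```bash
# Fetch both sides of a mainline-fork pair.
git fetch origin        # the fork
git fetch upstream      # the mainline

# Symmetric difference with --cherry-mark: commits marked '=' have a
# patch-equivalent counterpart on the other side (likely rebased or
# cherry-picked despite a different SHA); the remaining commits are
# unique to their side.
git log --oneline --left-right --cherry-mark origin/main...upstream/main

# The equivalence test behind this can also be run for a single commit:
git show <commit-sha> | git patch-id --stable
```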
**Implications for Integration Support Tools** Regardless of the integration technique used, our findings based on the variants from the three ecosystems studied suggest that code propagation rarely happens between a fork and its mainline. In our datasets, we observe 35% of 54 mainline\u2013fork pairs, 21% of 590 mainline\u2013fork pairs, and 11.5% of 10,357 mainline\u2013fork pairs that integrated commits using at least one of the commit integration techniques in the three ecosystems of Android, .NET, and JavaScript, respectively. The lack of integration may be problematic, since the fork variants may rely on the correct functionality of the existing code from the mainline. This means that any bugs that exist in the mainline will also exist in these forks, unless bug fixes are propagated from one variant to the other. However, current integration techniques (Lillack et al. 2019; Krueger and Berger 2020a; Krueger et al. 2020) do not necessarily facilitate finding such bug fixes. For example, code integration using pull requests and `git merge/rebase` may not be the best fit when integrating changes in variant forks, since they involve syncing all the changes missing in the current branch upstream or downstream. Alternatively, cherry-picking is probably more suitable for bug fixes, since the developer can choose the exact commits they want to integrate. However, GitHub\u2019s current setup does not make it easy to identify commits to cherry-pick without digging through the branch\u2019s history to identify relevant changes since the last code integration. As a result of the difficulty of finding commits to cherry-pick, developers may end up fixing the same bugs, which results in duplicated effort and wasted time. To check whether such duplication of effort occurs in our dataset, we looked at the unique commits of the variants and indeed found that developers independently update files shared by the variants. For example, in the mainline\u2013fork variant pair (`k9mail / k-9, imaeses / k-9`), the shared file `ImapStore.java`\(^{13}\) has been touched by 15 different developers in 142 commits in the mainline variant, while in the fork variant it has been touched by one developer in 9 different commits. It is possible that these developers could be fixing similar bugs existing in these shared artifacts. Moreover, the study of Jang et al. (2012) reports that during the parallel maintenance of cloned code, a bug found in one clone can exist in other clones and thus needs to be fixed multiple times. Furthermore, as a result of different developers changing shared files, it is possible that these developers do not integrate code because of \u201cfear of merge conflict.\u201d In relation to this conjecture, several studies have reported that merging diverged code between repositories is very laborious as a result of merge conflicts (Stanciulescu et al. 2015; Brun et al. 2011; de Souza et al. 2003; Perry et al. 2001; Sousa et al. 2018; Mahmood et al. 2020; Silva et al. 2020). To this end, it would be interesting for future research to interview the developers of our forks (and further forks) to determine whether the lack of support for cherry-picking bug fixes or specific functionality does indeed contribute to the lack of code propagation. In that case, developing a patch recommendation tool that informs developers of possibly interesting changes as soon as they are introduced in one variant and recommends them to other variants in a family could help save developers\u2019 effort. The recent work by Ren et al. (2018), which focused on providing the mainline with facilities to explore non-integrated changes in forks to find opportunities for reuse, is one step in this direction. Our work opens up more opportunities for applying such tools since, as mentioned above with respect to identifying divergent forks, we provide a technique for identifying such forks by combining information from GitHub and the ecosystem\u2019s main delivery platform, and we identify various other ecosystems where a similar strategy can be adopted. A git-level sketch of how such candidate changes could be surfaced for a single shared file is shown below.\n\n\(^{13}\) src/com/fsck/k9/mail/store/ImapStore.java. Same path for both mainline and fork.
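As a concrete illustration (not our tooling), the parallel activity on the shared `ImapStore.java` file and the candidate commits a recommendation tool might surface can be approximated with plain git commands. The sketch assumes the fork clone has the mainline registered as the `upstream` remote; the branch names are illustrative.

```bash
FILE=src/com/fsck/k9/mail/store/ImapStore.java

# Who touched the shared file on this side, and in how many commits?
git shortlog -s HEAD -- "$FILE"
git rev-list --count HEAD -- "$FILE"

# Commits on the fork's branch that touch the file and have no
# patch-equivalent counterpart in the mainline -- candidates that a patch
# recommendation tool could propose for integration (and vice versa when
# run from the mainline's side).
git log --oneline --cherry-pick --right-only upstream/main...main -- "$FILE"
```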
Finally, the limited sharing of changes can give rise to quality issues. We did not specifically investigate the propagation of test cases, which also might not be propagated. Developing techniques for propagating test cases within families could significantly enhance the quality of variants within families. The potential of test-case propagation has recently been pointed out in a preliminary study by Mukelabai et al. (2021).\n\n**Implications for Future Research**\n\nOur work is the first to perform a large-scale empirical study on the practices used to manage software families within software ecosystems. Our results give rise to the following open research questions that could be addressed in follow-up studies to further understand the evolution of such families.\n\n1. **More than two variants in a family:** In the results of RQ1, we showed that quite a number of families had a FamilySize of more than two variants (i.e., a mainline with two or more fork variants). However, in this study we concentrated only on the practices used to manage mainline\u2013fork pairs. For example, we did not look at fork\u2013fork pairs in a given family, nor at the holistic evolution of families that have more than two variants. It would be interesting to extend the study to those families and study the evolution of the family as a whole.\n\n2. **Variant dependencies:** In RQ1, we observed that in some variant pairs in all three ecosystems, either the mainline or the fork variant in the pair has more dependencies than the other. This suggests that the variant with more dependencies implements new functionality, relating to the extra dependencies, that is missing in the counterpart. It would be interesting to investigate what this new functionality is and why it is missing in the counterpart variant. Another interesting research direction relating to dependencies would be to investigate whether some variants in a family have updated their code to depend on new releases of the common dependencies, while other variants in the same family still depend on the old releases. Updating code to adopt a new release of a dependency may involve fixing incompatibilities, especially if the new release involves a breaking change. To avoid effort duplication, a tool could be developed to help transplant patches (related to the incompatibility fixes) to other variants in the family that have not yet migrated their code to the API-breaking release of the common dependency.\n\n3. **Limited sharing of changes in unique commits:** In RQ2 we observed that there is limited sharing of the changes in the unique commits between the mainline\u2013fork variant pairs in the three ecosystems. We hypothesized that one of the possible reasons could be the variants diverging from each other to solve different problems. We also noted that fork variants could be created to support a new technology, serve a different community, target different content, or support a feature frozen in the mainline. Fork variants created for these reasons are likely to have little to share with their mainline variants. It would be interesting to carry out a mixed-methods study combining quantitative analyses and user studies to verify our hypothesis.\n\n4. 
**Impediments in co-evolving variants in software families:** As in the study of Robles and Gonz\u00e1lez-Barahona (2012), in our dataset we also observed that some mainline\u2013fork variant pairs continue to co-exist, while in others one of the variants in the pair is abandoned as the other continues to evolve. A follow-up study could be conducted to investigate the impediments to co-evolving these variants. Inspiration can be drawn from studies of the co-evolution of the Eclipse platform and its third-party plug-ins (Businge et al. 2012a, 2013, 2010, 2012b, 2015; Businge et al. 2019; Kawuma et al. 2016).\n\n7 Related Work\n\nWe discuss related work on (i) variant forking, (ii) code propagation in forked projects, and (iii) general studies on forking.\n\n7.1 Variant Forking\n\nTo understand the variants in our variant families, we explored the reasons why forks were created. While there are existing studies on variant forks, most of these were done in the pre-GitHub days of SourceForge, before the advent of social coding environments (Nyman et al. 2012; Robles and Gonz\u00e1lez-Barahona 2012; Viseur 2012; Nyman and Lindman 2013; Laurent 2008; Nyman and Mikkonen 2011). These studies reported controversial perceptions around variant forks in the pre-GitHub days (Chua 2017; Dixion 2009; Ernst et al. 2010; Nyman and Mikkonen 2011; Nyman 2014; Raymond 2001). However, Zhou et al. (2020) recently reported that these perceptions have changed with the advent of GitHub. In the pre-GitHub days, variant forks were frequently considered risky to projects, since they could fragment a community and lead to confusion among developers and users. Jiang et al. (2017) state that, although forking is controversial in the traditional open source software (OSS) community, it is encouraged and is a built-in feature in GitHub. The authors further report that developers carry out social forking to submit pull requests, fix bugs, add new features, and keep copies. Zhou et al. (2020) also report that most variant forks start as social forks. Robles and Gonz\u00e1lez-Barahona (2012) comprehensively studied a carefully filtered list of 220 potential forks of different projects that were referenced on Wikipedia. The authors assume that a fork is significant if a reference to it appears in the English Wikipedia. They found that technical reasons and discontinuation of the original project were the most common reasons for creating variant forks, accounting for 27.3% and 20%, respectively. More recently, Zhou et al. (2020) interviewed 18 developers of variant forks on GitHub to understand reasons for forking in more modern social coding environments that explicitly support forking. The authors report that the motivations they observed align with the above prior studies.\n\nAll the above works studied forks for any type of project, not limited to a specific technological space (e.g., web applications or mobile apps). Our paper is different in that it triangulates data from GitHub and the ecosystems\u2019 delivery platforms (e.g., Google Play) to study real-world variants. Specifically, we study the characteristics of the variants with RQ1 and, different from both studies (Zhou et al. 2020; Robles and Gonz\u00e1lez-Barahona 2012), we investigate additional phenomena, such as code propagation, with RQ2.\n\nAnother difference between the current study and the study of Zhou et al. (2020) is the heuristics the two studies employ to determine variant forks. Zhou et al. 
(2020) classify forks on GitHub as variant forks using the following heuristics: (i) they contain the phrase \u201cfork of\u201d in the description, (ii) they received at least three external pull requests, (iii) they have at least 100 unique commits, (iv) they have at least one year of development, and (v) they have changed their name. In our work, we use the external validation of a fork being listed on Google Play under a different package name, and we use the description there to verify that the app is indeed a variant of the mainline.\n\n7.2 Code Propagation Practices\n\nThere are only a few studies that have investigated code integration between a given repository and its forks. Stanciulescu et al. (2015) studied forking on GitHub using a case study of Marlin, an open source firmware for 3D printers. The authors observed that many forked variants share their changes with the mainline. However, their work does not differentiate between social and variant forks. Thus, we do not know whether this observed prevalent code propagation is simply due to the fact that these are social forks created with the main goal of contributing back to the original project (Zhou et al. 2019). In our current paper, we are interested only in variant forks. Recently, Zhou et al. (2020) observed that only 16% of their 15,306 studied variant forks ever synchronized or merged changes with their mainline repository. However, based on their discussed threats to validity, it seems that the authors relied only on common commit IDs to identify shared commits. As we explained in Section 2, there are several integration techniques that result in propagated commits having different commit IDs. Thus, relying only on the commit ID may result in missing other shared commits. To mitigate this problem, our work identifies integrated commits that preserve the commit ID as well as those that may have been integrated using techniques that change the commit ID. Another study on code propagation practices is the work of Kononenko et al. (2018). The authors considered three types of commit integration: GitHub merge, cherry-pick merge, and commit squashing. Compared to their study, the only technique we do not study is commit squashing, while we additionally examine techniques the authors did not consider: the GitHub rebase and squash pull request options, as well as git merge and rebase.\n\nCode propagation practices do not necessarily have to be in the context of forks. For example, German et al. (2016) investigated how Linux uses Git. The authors stated that code changes are difficult to track because of the proliferation of code repositories and because developers modify (\u201crebase\u201d) and filter (\u201ccherry-pick\u201d) the history of these changes to streamline their integration into the repositories of other developers. To this end, the authors presented a method, \textit{continuousMining}, that crawls all known git repositories of a project multiple times a day to record and analyze all change-sets of a project. The authors state that \textit{continuousMining} not only yields a complete git history, but also captures phenomena that are difficult to study otherwise, such as rebasing and cherry-picking. 
While we do not continuously capture the \u201clive\u201d history of a software project, we are able to capture rebased and cherry-picked commits in the context of forked projects by relying on the commit metadata, after a thorough investigation of how this metadata changes depending on the propagation strategy.\n\n7.3 Other Studies About Forking\n\nGamalielsson and Lundell (2014) studied the long-term sustainability of Open Source software communities in Open Source software projects involving a fork. The authors\u2019 study was based on the LibreOffice project, which is a fork of the OpenOffice project. They wanted to understand how Open Source software communities were affected by a fork. The authors undertook an analysis of the LibreOffice project and the related OpenOffice and Apache OpenOffice projects by reviewing documented project information, performing a quantitative analysis of project repository data, and drawing on first-hand experiences from contributors in the LibreOffice community. Their results strongly suggested a long-term sustainable LibreOffice community, with no signs of stagnation in the LibreOffice project 33 months after the fork. They also reported that good practice with respect to governance of Open Source software projects is perceived by community members as a fundamental challenge for establishing sustainable communities. Nyman (2014) interviewed developers to understand their views on forking. His findings from the interviews differentiate good forks, which are those that (i) revive abandoned programs, (ii) experiment with and customize existing programs, or (iii) minimize tyranny and resolve disputes by allowing involved parties to develop their own versions of the program, from bad forks, which are those that (i) create confusion among users or (ii) add extra work for developers (including both duplication of effort and increased work if attempting to maintain compatibility).\n\n8 Threats to Validity\n\n**Internal Validity** We identify five issues that could threaten the internal validity of our results: (1) In Section 3.1, the heuristics used for app family data identification in Steps 2 & 6 resulted in mismatches in the mapping of some of the forks on GitHub and Google Play. We mitigated the threat by carrying out a thorough manual analysis in Section 3.1\u2013Step 7 and discarding the mismatched apps. Some of the steps we carried out during the Android variants\u2019 data collection are manual, and any errors in those could affect our results. (2) Although we did not observe any cases where the developer changed the message in cherry-picked commits, we acknowledge that our algorithm will not be able to identify such cases; instead, our algorithm will identify them as unique commits in the respective variants. (3) We also acknowledge that our toolchain may miss some commits that are integrated using more than one integration technique. For example, in Section 3.3.3, we presented the unclassified merged pull requests, which the GitHub API listed as merged even though they were not merged into the master branch. We discovered that these pull requests were integrated into a branch other than the master branch but had all failed the build integration tests. To this end, when integrating commits in the direction of fork \u2192 mainline, as a \u201cbest practice\u201d, developers may wish to first integrate the commits into a separate branch (say, a staging branch), perform an integration test, and only then integrate them into the master branch (a sketch of this workflow is shown below). 
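A minimal sketch of this staging-branch workflow, with illustrative remote and branch names and a hypothetical test script, is shown below.

```bash
# Integrate the contributed commits into a staging branch first.
git checkout -b staging master
git merge --no-ff fork/feature-branch
./run-integration-tests.sh   # hypothetical test script

# Only if the tests pass, integrate into master.
git checkout master
git merge --no-ff staging
```

If the second step instead cherry-picks the commits onto master, the commit metadata changes, which is the case discussed next.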
However, even when following this \u201cbest practice\u201d, the developer may first integrate into the staging branch using one commit integration technique and thereafter integrate the same commits into the master branch using a different technique that changes the original integrator\u2019s metadata (for example, cherry-picking). In that case, our toolchain will miss such commits. (4) In Section 2.2, we also stated that our scripts are not able to identify the integrated commits if the integrator uses git commands that rewrite the commit history. However, as we stated in Section 3.3.3, we believe that the practice of rewriting contributions from the community is likely to be rare with experienced developers, since rewriting changes commit authorship. (5) In Step 6 of Section 3.1, we eliminated all Android mainlines that did not have at least one fork with a different package name on the Google Play store. This means that we eliminated fork variants that were created for markets other than Google Play. However, unlike Google Play, where an app\u2019s package name can be used as a unique ID, other markets, such as anzhi, apkmirror, and appsapk, do not implement this strategy, which means we cannot easily identify the correct app for a given GitHub repository. Therefore, we intentionally focus only on Android apps that are distributed on the Google Play store, which limits the number of Android families we are able to identify.\n\n**Construct Validity** The calculation of the variability percentage of the fork variants treats commits the same way irrespective of the number of files touched. For example, a commit that has touched 100 files is treated the same as one that has touched just one file. While this may be misleading, the measure provides some indication of unique development activity.\n\n**External Validity** We analyzed only 54 Android mainline\u2013fork variant pairs, while there exist millions of Android applications on Google Play and other Android markets, which means that our results might not be representative of all Android applications. However, we also analyzed mainline\u2013fork variant pairs from two other ecosystems, which show similar results and behavior.\n\n### 9 Conclusion\n\nWe presented a large-scale exploratory study on reuse and maintenance practices via code propagation between variant forks and their mainline counterparts in software ecosystems. Our subject ecosystems cover different technological spaces: Android, JavaScript, and .NET. As part of our study, we also designed a systematic method to identify real variant forks.\n\nWe identified and analyzed families of variants that are maintained together and that exist both on the official package distribution platforms (Google Play, NuGet, and npm) and on GitHub, allowing us to analyze reuse practices in depth. For variants in a given ecosystem, we mined both sources of information\u2014GitHub and the package distribution site\u2014to study their characteristics, including their variations and code-propagation practices. In the Android ecosystem we identified 38 software families with a total of 54 mainline\u2013fork pairs, in the .NET ecosystem 526 software families with 590 mainline\u2013fork pairs, and in the JavaScript ecosystem 8,837 software families with 10,357 mainline\u2013fork pairs. We provide a toolchain for analyzing code integration between any mainline\u2013fork variant pair. 
Regardless of the integration technique used, our findings suggest that code integration rarely happens between a fork and its mainline. In our study, in the Android ecosystem, we observed only 19 of the 54 mainline\u2013fork pairs (35%) that integrated commits using at least one of the commit integration techniques we discussed. In the .NET ecosystem, we observed a total of 126 of the 590 mainline\u2013fork pairs (21%) that integrated commits using at least one of the commit integration techniques. In the JavaScript ecosystem, we observed a total of 1,189 of the 10,357 mainline\u2013fork pairs (11.5%) that integrated commits using at least one of the commit integration techniques.\nOverall, we analyzed variant forks on GitHub for two main reasons: (1) many previous studies focused on social forks, and (2) the few studies on variant forks were conducted in the pre-GitHub days of SourceForge. In the future, it would be interesting to investigate a middle ground between variant forks and social forks. For example, one could investigate whether the practices observed in the variant forks differ from those of social forks.\n\nAcknowledgements We thank Serge Demeyer for comments on earlier drafts of this work.\n\nJohn Businge\u2019s work is supported by the FWO-Vlaanderen and F.R.S.-FNRS via the EOS project 30446992 SECO-ASSIST. Thorsten Berger\u2019s work is supported by the Swedish Research Council and the Wallenberg Academy. Sarah Nadi\u2019s research was undertaken, in part, thanks to funding from the Canada Research Chairs Program.\n\nOpen Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article\u2019s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article\u2019s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.\n\nReferences\n\nOnline appendix (2020) https://github.com/johnxu21/emse2020\n\nGitHub Inc (2020) About pull request merges. https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-request-merges\n\nApel S, Batory D, Kastner C, Saake G (2013) Feature-oriented software product lines. Springer, Berlin\n\nBerger T, Pfeiffer R, Tartler R, Dienst S, Czarnecki K, Wasowski A, She S (2014) Variability mechanisms in software ecosystems. Inf Softw Technol 56(11):1520\u20131535\n\nBerger T, Stegh\u00f6fer JP, Ziadi T, Robin J, Martinez J (2020) The state of adoption and the challenges of systematic variability management in industry. Empir Softw Eng 25:1755\u20131797\n\nBrun Y, Holmes R, Ernst MD, Notkin D (2011) Proactive detection of collaboration conflicts. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering, ESEC/FSE \u201911. Association for Computing Machinery, New York, pp 168\u2013178. 
https://doi.org/10.1145/2025113.2025139\n\nBusinge J, Decan A, Zerouali A, Mens T, Demeyer S (2020) An empirical investigation of forks as variants in the npm package distribution. In: Papadakis M, Cordy M (eds) Proceedings of the 19th Belgium-Netherlands software evolution workshop, BENEVOL 2020, Luxembourg, December 3-4, 2020, CEUR Workshop Proceedings, vol. 2912. CEUR-WS.org. http://ceur-ws.org/Vol-2912/./paper1.pdf\n\nBusinge J, Kawuma S, Bainomugisha E, Khomh F, Nabaasa E (2017) Code authorship and fault-proneness of open-source android applications: An empirical study. In: Proceedings of the 13th international conference on predictive models and data analytics in software engineering, PROMISE. ACM, New York, pp 33\u201342. https://doi.org/10.1145/3127005.3127009\n\nBusinge J, Kawuma S, Openja M, Bainomugisha E, Serebrenik A (2019) How stable are eclipse application framework internal interfaces? In: 2019 IEEE 26th international conference on software analysis, evolution and reengineering (SANER). pp 117\u2013127. https://doi.org/10.1109/SANER.2019.8668018\n\nBusinge J, Openja M, Kavaler D, Bainomugisha E, Khomh F, Filkov V (2019) Studying android app popularity by cross-linking github and google play store. In: SANER\n\nBusinge J, Openja M, Nadi S, Bainomugisha E, Berger T (2018) Clone-based variability management in the android ecosystem. In: 2018 IEEE international conference on software maintenance and evolution, ICSME 2018, Madrid, Spain, September 23-29, 2018, pp 625\u2013634\n\nBusinge J, Serebrenik A, van den Brand M (2012) Compatibility prediction of eclipse third-party plug-ins in new eclipse releases. In: 12th IEEE international working conference on source code analysis and manipulation, SCAM 2012, Riva del Garda, Italy, September 23-24, 2012, pp 164\u2013173\n\nBusinge J, Serebrenik A, van den Brand M (2012) Survival of eclipse third-party plug-ins. In: 28th IEEE international conference on software maintenance, ICSM 2012, Trento, Italy, September 23-28, 2012, pp 368\u2013377. https://doi.org/10.1109/ICSM.2012.6405295\nBusinge J, Serebrenik A, van den Brand M (2013) Analyzing the eclipse API usage: Putting the developer in the loop. In: 17th European conference on software maintenance and reengineering, CSMR 2013, Genova, Italy, March 5-8, 2013. pp 37\u201346\n\nBusinge J, Serebrenik A, van den Brand MGJ (2010) An empirical study of the evolution of Eclipse third-party plug-ins. In: EVOL-IWPSE\u201910. ACM, pp 63\u201372\n\nBusinge J, Serebrenik A, van den Brand MGJ (2015) Eclipse API usage: the good and the bad. Softw Qual J 23(1):107\u2013141. https://doi.org/10.1007/s11219-013-9221-3\n\nChacon S, Straub B (2014) git tools - rewriting history. https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History\n\nChacon S, Straub B (2014) Pro Git Apress\n\nChua BB (2017) A survey paper on open source forking motivation reasons and challenges. In: Alias RA, Ling PS, Bahri S, Finnegan P, Sia CL (eds) 21st Pacific Asia conference on information systems, PACIS 2017, Langkawi, Malaysia, July 16-20, 2017. p 75\n\nCzarnecki KBan\u00e2tre JP, Fradet P, Giavitto JL, Michel O (eds) (2005) Overview of generative software development. Springer, Berlin\n\nDecan A, Mens T, Grosjean P (2019) An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empir Softw Eng 24(1):381\u2013416. https://doi.org/10.1007/s10664-017-9589-y\n\nDixion J (2009) Different kinds of open source forks \u2013 salad, dinner, and fish. 
https://jamesdixon.wordpress.com/2009/05/13/different-kinds-of-open-source-forks-salad-dinner-and-fish/\n\nDubinsky Y, Rubin J, Berger T, Duszynski S, Becker M, Czarnecki K (2013) An exploratory study of cloning in industrial software product lines. In: CSMR\n\nErnst NA, Easterbrook SM, Mylopoulos J (2010) Code forking in open-source software: a requirements perspective. arXiv:1004.2889\n\nGamalielsson J, Lundell B (2014) Sustainability of open source software communities beyond a fork: How and why has the libreoffice project evolved? J Syst Softw 89:128\u2013145. https://doi.org/10.1016/j.jss.2013.11.1077. http://www.sciencedirect.com/science/article/pii/S0164121213002744\n\nGerman DM, Adams B, Hassan AE (2016) Continuously mining distributed version control systems: An empirical study of how linux uses git. Empir Softw Eng 21(1):260\u2013299\n\nJang J, Agrawal A, Brumley D (2012) Redebug: Finding unpatched code clones in entire OS distributions. In: IEEE symposium on security and privacy, SP 2012, 21-23 May 2012, San Francisco, California, USA. IEEE Computer Society, pp 48\u201362. https://doi.org/10.1109/SP.2012.13\n\nJiang J, Lo D, He J, Xia X, Kochhar PS, Zhang L (2017) Why and how developers fork what from whom in github. Empir Softw Eng 22(1):547\u2013578. https://doi.org/10.1007/s10664-016-9436-6\n\nKalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014) The promises and perils of mining github. In: MSR\n\nKawuma S, Businge J, Bainomugisha E (2016) Can we find stable alternatives for unstable eclipse interfaces? In: 2016 IEEE 24th international conference on program comprehension (ICPC), pp. 1\u201310. https://doi.org/10.1109/ICPC.2016.7503716\n\nKononenko O, Rose T, Baysal O, Godfrey M, Theisen D, de Water B (2018) Studying pull request merges: a case study of shopify\u2019s active merchant. In: Proceedings of the 40th international conference on software engineering: software engineering in practice, ICSE-SEIP \u201918. Association for Computing Machinery, New York, pp 124\u2013133. https://doi.org/10.1145/3183519.3183542\n\nKrueger J, Berger T (2020) Activities and costs of re-engineering cloned variants into an integrated platform. In: 14th international working conference on variability modelling of software-intensive systems (VaMoS)\n\nKrueger J, Berger T (2020) An empirical analysis of the costs of clone- and platform-oriented software reuse. In: 28th ACM SIGSOFT international symposium on the foundations of software engineering (FSE)\n\nKrueger J, Mahmood W, Berger T (2020) Promote-pl: A round-trip engineering process model for adopting and evolving product lines. In: 24th ACM international systems and software product line conference (SPLC)\n\nLaurent AS (2008) Understanding open source and free software licensing. O\u2019Reilly Media, Newton\n\nLi L, Martinez J, Ziadi T, Bissyand\u00e9 TF, Klein J, Traon YL (2016) Mining families of android applications for extractive spl adoption. In: SPLC\n\nLillack M, Stanciulescu S, Hedman W, Berger T, Wasowski A (2019) Intention-based integration of software variants. In: 41st international conference on software engineering (ICSE)\n\nMahmood W, Chagama M, Berger T, Hebig R (2020) Causes of merge conflicts: A case study of elasticsearch. In: 14th international working conference on variability modelling of software-intensive systems (VaMoS)\nMojica IJ, Adams B, Nagappan M, Dienst S, Berger T, Hassan AE (2014) A large scale empirical study on software reuse in mobile apps. 
IEEE Softw 31(2):78\u201386\n\nMukelabai M, Berger T, Borba P (2021) Semi-automated test-case propagation in fork ecosystems. In: 43rd international conference on software engineering, new ideas and emerging results track (ICSE/NIER)\n\nMunaiah N, Kroh S, Cabrey C, Nagappan M (2017) Curating GitHub for engineered software projects. Empir Softw Eng 22(6):3219\u20133253\n\nNyman L (2014) Hackers on forking. In: Proceedings of The international symposium on open collaboration. pp 1\u201310\n\nNyman L, Lindman J (2013) Code forking, governance, and sustainability in open source software. Technol Innov Manag Rev 3:7\u201312\n\nNyman L, Mikkonen T (2011) To fork or not to fork: Fork motivations in sourceforge projects. In: Open source systems: grounding research. pp 259\u2013268\n\nNyman L, Mikkonen T, Lindman J, Foug\u00e8re M (2012) Perspectives on code forking and sustainability in open source software. In: Open source systems: long-term sustainability. pp 274\u2013279\n\nOpenja M, Adams B, Khomh F (2020) Analysis of modern release engineering topics: \u2013 a large-scale study using stackoverflow \u2013. In: 2020 IEEE international conference on software maintenance and evolution (ICSME). pp 104\u2013114. https://doi.org/10.1109/ICSME46990.2020.00020\n\nPaix\u00e3o M, Maia P (2019) Rebasing in code review considered harmful: A large-scale empirical investigation. In: 2019 19th international working conference on source code analysis and manipulation (SCAM). pp 45\u201355\n\nParnas DL (1976) On the design and development of program families. IEEE Trans Softw Eng 2(1):1\u20139. https://doi.org/10.1109/TSE.1976.233797\n\nPerry DE, Siy HP, Votta LG (2001) Parallel changes in large-scale software development: An observational case study. ACM Trans Softw Eng Methodol 10(3):308\u2013337. https://doi.org/10.1145/383876.383878\n\nRaymond ES (2001) The Cathedral & the Bazaar: Musings on linux and open source by an accidental revolutionary. Newton, O\u2019Reilly Media Inc\n\nRen L, Zhou S, K\u00e4stner C (2018) Poster: forks insight: providing an overview of github forks. In: 2018 IEEE/ACM 40th international conference on software engineering: companion (ICSE-Companion). pp 179\u2013180\n\nRobles G, Gonz\u00e1lez-Barahona JM (2012) A comprehensive study of software forks: dates, reasons and outcomes. In: Open source systems: long-term sustainability. pp 1\u201314\n\nSattler F, von Rhein A, Berger T, Johansson NS, Hard\u00f8 MM, Apel S (2018) Lifting inter-app data-flow analysis to large app sets. Autom Softw Eng 25:315\u2013346\n\nSilva LD, Borba P, Mahmood W, Berger T, Moisakis J (2020) Detecting semantic conflicts via automated behavior change detection. In: 36th IEEE international conference on software maintenance and evolution (ICSME)\n\nSousa M, Dillig I, Lahiri SK (2018) Verified three-way program merge. Proc ACM Program Lang 2(OOPSLA). https://doi.org/10.1145/3276535\n\nde Souza CRB, Redmiles D, Dourish P. (2003) Breaking the code, moving between private and public work in collaborative software development. In: Proceedings of the 2003 international ACM SIGGROUP conference on supporting group work, GROUP \u201903. Association for Computing Machinery, New York, pp 105\u2013114. https://doi.org/10.1145/958160.958177\n\nStanciulescu S, Schulze S, Wasowski A (2015) Forked and integrated variants in an open-source firmware project. 
In: IEEE international conference on software maintenance and evolution (ICSME), ICSME \u201915\n\nSung C, Lahiri SK, Kaufman M, Choudhury P, Wang C (2020) Towards understanding and fixing upstream merge induced conflicts in divergent forks: An industrial case study. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering: software engineering in practice, ICSE-SEIP \u201920. Association for Computing Machinery, New York, pp 172\u2013181. https://doi.org/10.1145/3377813.3381362\n\nVandehey S (2019) Rebase and merge. https://cloudfour.com/thinks/squashing-your-pull-requests/\n\nViseur R (2012) Forks impacts and motivations in free and open source projects. Int J Adv Comput Sci Appl - IJACSA 3(2)\n\nZhou S, St\u0103nciulescu C, Le\u00dfenich O, Xiong Y, Wasowski A, K\u00e4stner C (2018) Identifying features in forks. In: Proceedings of the 40th international conference on software engineering. pp 105\u2013116\n\nZhou S, Vasilescu B, K\u00e4stner C (2019) What the fork: A study of inefficient and efficient forking practices in social coding. In: Proceedings of the 2019 27th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. pp 350\u2013361\n\nZhou S, Vasilescu B, K\u00e4stner C (2020) How has forking changed in the last 20 years? a study of hard forks on github. In: Proceedings of the 42nd international conference on software engineering. Accepted\nPublisher\u2019s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.\n\nJohn Businge is a Postdoctoral fellow in the LORE lab at the University of Antwerp, Belgium. He received his PhD from Eindhoven University of Technology, the Netherlands in 2013. After receiving his PhD, he was a lecturer at Mbarara University of Science and Technology, Uganda. For six months in 2016 he was a Fulbright research scholar at the University of California, Davis in the U.S.A. His research focuses on mining software repositories, clone detection, program analysis, variability management, and empirical software engineering.\n\nMoses Openja is a PhD student and a member of SWAT Lab Polytechnique Montreal, Canada. He received his bachelor\u2019s degree in 2017 from Mbarara University of Science and Technology, Uganda and his masters degree in 2021 from Polytechnique Montreal, Canada. His research area includes software quality of machine learning applications, empirical software Engineering, Software maintenance and evolution in the software ecosystem, and release engineering.\n\nSarah Nadi is an Assistant Professor in the Department of Computing Science at the University of Alberta, and a Tier II Canada Research Chair in Software Reuse. She obtained her Master\u2019s (2010) and PhD (2014) degrees from the University of Waterloo in Canada. Before joining the University of Alberta in 2016, she spent approximately two years as a post-doctoral researcher at the Technische Universit\u00e4t Darmstadt in Germany. Sarah\u2019s research focuses on providing intelligent support for software maintenance and reuse, including creating recommender systems to guide developers through correctly and securely reusing individual functionality from external libraries.\nThorsten Berger is a Professor in Computer Science at Ruhr University Bochum in Germany. 
After receiving the PhD degree from the University of Leipzig in Germany in 2013, he was a Postdoctoral Fellow at the University of Waterloo in Canada and the IT University of Copenhagen in Denmark, and then an Associate Professor jointly at Chalmers University of Technology and the University of Gothenburg in Sweden. He received competitive grants from the Swedish Research Council, the Wallenberg Autonomous Systems Program, Vinnova Sweden (EU ITEA), and the European Union. He is a fellow of the Wallenberg Academy\u2014one of the highest recognitions for researchers in Sweden. He received two best-paper awards and one most influential paper award. His service was recognized with distinguished reviewer awards at the tier-one conferences ASE 2018 and ICSE 2020. His research focuses on model-driven software engineering, program analysis, and empirical software engineering.\n\nAffiliations\n\nJohn Businge\\textsuperscript{1,2} \u00b7 Moses Openja\\textsuperscript{3} \u00b7 Sarah Nadi\\textsuperscript{4} \u00b7 Thorsten Berger\\textsuperscript{5,6}\n\nMoses Openja\nopenjamosesopm@gmail.com\n\nSarah Nadi\nnadi@ualberta.ca\n\nThorsten Berger\nthorsten.berger@rub.de\n\n\\textsuperscript{1} Mbarara University of Science and Technology, Mbarara, Uganda\n\\textsuperscript{2} University of Antwerp, Antwerp, Belgium\n\\textsuperscript{3} SWAT Lab., \u00c9cole Polytechnique de Montr\u00e9al, Montr\u00e9al, Canada\n\\textsuperscript{4} University of Alberta, Edmonton, Canada\n\\textsuperscript{5} Ruhr University Bochum, Bochum, Germany\n\\textsuperscript{6} Chalmers | University of Gothenburg, Gothenburg, Sweden", "source": "olmocr", "added": "2025-06-23", "created": "2025-06-23", "metadata": {"Source-File": "/home/nws8519/git/adaptation-slr/studies_pdfs/031_businger.pdf", "olmocr-version": "0.1.76", "pdf-total-pages": 47, "total-input-tokens": 125231, "total-output-tokens": 35600, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 2852, 1], [2852, 6848, 2], [6848, 10573, 3], [10573, 14119, 4], [14119, 17260, 5], [17260, 20902, 6], [20902, 23286, 7], [23286, 25124, 8], [25124, 28665, 9], [28665, 32136, 10], [32136, 35469, 11], [35469, 39027, 12], [39027, 40663, 13], [40663, 44055, 14], [44055, 47342, 15], [47342, 49748, 16], [49748, 53488, 17], [53488, 56716, 18], [56716, 60040, 19], [60040, 61385, 20], [61385, 63163, 21], [63163, 65423, 22], [65423, 67267, 23], [67267, 68444, 24], [68444, 69055, 25], [69055, 73017, 26], [73017, 76380, 27], [76380, 79612, 28], [79612, 83001, 29], [83001, 86031, 30], [86031, 89922, 31], [89922, 93460, 32], [93460, 94802, 33], [94802, 97313, 34], [97313, 99555, 35], [99555, 102849, 36], [102849, 106793, 37], [106793, 110928, 38], [110928, 114471, 39], [114471, 118370, 40], [118370, 122086, 41], [122086, 125614, 42], [125614, 130299, 43], [130299, 134979, 44], [134979, 139749, 45], [139749, 141549, 46], [141549, 143197, 47]]}}
{"id": "baa3cad94707bd2fb8b7fe0b731c92b851fa8d8e", "text": "Software Reuse in Open Source: A Case Study\n\nAndrea Capiluppi, Brunel University, UK\nKlaas-Jan Stol, Lero (The Irish Software Engineering Research Centre), University of Limerick, Ireland\nCornelia Boldyreff, University of East London, UK\n\nABSTRACT\n\nA promising way to support software reuse is based on Component-Based Software Development (CBSD). Open Source Software (OSS) products are increasingly available that can be freely used in product development. However, OSS communities still face several challenges before taking full advantage of the \u201creuse mechanism\u201d: many OSS projects duplicate effort, for instance when many projects implement a similar system in the same application domain and in the same topic. One successful counter-example is the FFmpeg multimedia project; several of its components are widely and consistently reused in other OSS projects. Documented is the evolutionary history of the various libraries of components within the FFmpeg project, which presently are reused in more than 140 OSS projects. Most use them as black-box components; although a number of OSS projects keep a localized copy in their repositories, eventually modifying them as needed (white-box reuse). In both cases, the authors argue that FFmpeg is a successful project that provides an excellent exemplar of a reusable library of OSS components.\n\nKeywords: Case Study, Component-Based Software Development, Empirical Study, Open Source Software, Quantitative Study, Software Evolution, Software Reuse\n\nINTRODUCTION\n\nReuse of software components is one of the most promising practices of software engineering (Basili & Rombach, 1991). Enhanced productivity (as less code needs to be written), increased quality (since assets proven in one project can be carried through to the next) and improved business performance (lower costs, shorter time-to-market) are often pinpointed as the main benefits of developing software from a stock of reusable components (Sametinger, 1997; Sommerville, 2004).\n\nAlthough much research has focused on the reuse of Off-The-Shelf (OTS) components, both Commercial OTS (COTS) and Open Source Software (OSS), in corporate software production (Li et al., 2009; Torchiano & Morisio, 2004), the reusability of OSS projects in other OSS projects has only recently started to draw the attention of researchers and developers in OSS communities (Lang et al., 2005; Mockus, 2007; Capiluppi & Boldyreff, 2008). A vast amount of code is created daily, modified and stored in OSS repositories, and the inherent\nphilosophy around OSS is indeed promoting reuse. Yet, software reuse in OSS projects is hindered by various factors, psychological and technical. For instance, the project to be reused could be written in a programming language that the hosting project dislikes or is incompatible with; the hosting project might not agree with the design decisions made by the project to be reused; finally, individuals in the hosting project may dislike individuals involved in the project to be reused (Senyard & Michlmayr, 2004). 
A search for the \u201cemail client\u201d topic in the SourceForge repository (http://www.sourcforge.net) produces 128 different projects (SourceForge, 2011): this may suggest that similar features in the same domain are implemented by different projects, and that code and features duplication play a significant role in the production of OSS code.\n\nThe interest of practitioners and researchers in the topic of software reuse has focused on two predominant questions: (1) from the perspective of OSS integrators (Hauge et al., 2007), how to select an OSS component to be reused in another (potentially commercial) software system, and (2) from the perspective of end-users, how to provide a level of objective \u201ctrust\u201d in available OSS components. This interest is based on a sound reasoning; given the increasing amount of source code and documentation created and modified daily, it starts to be a (commercially) viable solution to browse for components in existing code and select existing, working resources to reuse as building blocks of new software systems, rather than building them from scratch.\n\nAmong the reported cases of successful reuse within OSS systems, components with clearly defined requirements, and hardly affecting the overall design (i.e., the \u201cS\u201d and \u201cP\u201d types of systems following the original S-P-E classification by Lehman (1980)) have often proven to be the typically reused resources by OSS projects. Reported examples include the \u201cinternationalization\u201d (often referred to as I18N) component (which produces different output text depending on the language of the system), or the \u201cinstall\u201d module for Perl subsystems (involved in compiling the code, test and install it in the appropriate locations) (Mockus, 2007). To our best knowledge, there is no academic literature about the successful reuse of OSS, and an understanding of internal characteristics of what makes a component reusable in the OSS context is lacking.\n\nThe main focus of this paper is to report on the FFmpeg project (http://ffmpeg.org/), and its build-level components, and to show how some of these components are currently reused in other projects. This project is a cornerstone in the multimedia domain; several dozens of OSS projects reuse parts of FFmpeg, one of the most widely reused being the libavcodec component. In the domain of OSS multimedia applications, libavcodec is the most widely adopted and reused audio/video codec (coding and decoding) resource. Its reuse by other OSS projects is so widespread since it represents a crosscutting resource for a wide range of systems, from single-user video and audio players to converters and multimedia frameworks. As such, FFmpeg represents a unique case (Yin, 2003, p.40), which is why we selected the project for this study.\n\nIn particular, the study is an attempt to evaluate whether the reusability principle of \u201chigh cohesion and loose coupling\u201d (Fenton, 1991; Macro & Buxton, 1987; Troy & Zweben, 1981) has an impact on the evolutionary history of the FFmpeg components.\n\nThis paper makes two contributions:\n\n1. It studies how the size of FFmpeg components evolve: the empirical findings show that the libavcodec component (contained in FFmpeg) is an \u201cevolving and reusable\u201d component (an \u201cE-type\u201d of system) (Lehman, 1980), and as such it poses several interesting challenges when other projects integrate it; and\n\n2. 
It studies how the architecture of FFmpeg components evolve, and how these components evolve when separated from FFmpeg: the empirical findings show two emerging scenarios in the reuse of this resource. On the one hand, the majority of projects that\nreuse the FFmpeg components do so with a \u201cblack-box\u201d strategy (Szyperski, 2002), as such incurring synchronization issues due to the independent co-evolution of the project and the component. On the other hand, a number of OSS projects apply a \u201cwhite-box\u201d reuse strategy, by maintaining a private copy of the FFmpeg components. The latter scenario is further empirically analyzed in order to obtain a better understanding of how the component is not only reused, but also integrated into a host system.\n\nThe remainder of this paper is structured following the guidelines for reporting case study research proposed by Runeson and H\u00f6st (2009). The next section provides relevant background information and an overview of related work on software components and OSS systems. This is followed by a presentation of the research design of our study. After this, the results of the empirical study are presented. Followed by threats to validity of this study. The last section concludes with the key findings and provides directions for future work.\n\nBACKGROUND AND RELATED WORK\n\nThis section presents background and related work that is relevant for the remainder of the paper. The first subsection briefly discusses research on OSS reuse. This is followed by a discussion of Component-Based Software Development (CBSD) and the terminology used in this paper. This is followed by a brief overview of a useful and relevant categorization of components. Since this work considers the evolution of software components, a brief summary of Lehman\u2019s classification of software programs is provided. This section concludes with a brief discussion of related work regarding software decay and architectural recovery.\n\nComponent-Based Software Development and Terminology\n\nAs mentioned, Component-Based Software Development (CBSD) has been proposed as a promising approach to large-scale software reuse. It is important, however, first to define clearly what is meant by the term \u201ccomponent.\u201d The word \u201ccomponent\u201d is often used in the context of CBSD as a reusable piece of software, either Commercial Off-The-Shelf (COTS) or Open Source. For instance, Torchiano and Morisio (2004) have derived the following definition: \u201cA COTS product is a commercially available or open source piece of software that other software projects can reuse and integrate into their own products.\u201d This definition considers a COTS or Open Source software product as an independent unit that can be reused. However, a number of authors have provided more specific definitions; a commonly cited definition can be found in Szyperski (2002, p. 41): \u201cA software component is a unit of composition with contractually specified interfaces and explicit context dependencies only. A software component can be deployed independently and is subject to composition by third parties.\u201d\n\nAs De Jonge (2005) points out, \u201cComponent-Based Software Engineering (CBSE) is mostly concerned with execution-level components (such as COM, CCB, or EJB components).\u201d Szyperski (2002, p. 
3) also speaks of software components as being \u201cexecutable units of independent production, acquisition, and deployment that can be composed into a functioning system.\u201d\n\nIn this paper, following De Jonge (2005) we use the term \u201cbuild-level component.\u201d De Jonge speaks of build-level components as \u201cdirectory hierarchies containing ingredients of an application\u2019s build process, such as source files, build and configuration files, libraries, and so on.\u201d In an earlier paper, De Jonge (2002) uses the term \u201csource code component.\u201d In this context, we interpret the meaning of \u201cbuild-\nlevel\u201d component to be equivalent to the term \u201cmodule,\u201d as used by Clements et al. (2010, p. 29). They indicate that a module refers to a unit of implementation, and as such, can be source code or other implementation artifacts. Eick et al. (2001) also interpret a module to be a directory in the source code file system, which contains several files, though they note that this terminology is not standard. Tran et al. (1999, 2000) considered individual source files as modules. Clements et al. define a \u201ccomponent\u201d to be a runtime entity, which is consistent with the definition by Szyperski. Although important issues are already known when incorporating and reusing whole systems into larger, overarching projects (as in the case of Linux distributions German & Hassan, 2009), in the remainder of this paper, we use the term \u201ccomponent\u201d to refer to build-level component.\n\nComponents can be reused in different ways, as briefly mentioned: black-box reuse and white-box reuse (Szyperski, 2002). Black-box reuse refers to the reuse of a component as-is without any alterations. The component can only be viewed in terms of its input and output. This is typically the case when proprietary (COTS) components are used, as the source code is usually not available for proprietary software. On the other hand, when the component\u2019s source code is available, the integrator can perform white-box reuse. The integrator may make changes to a component to fit his or her intended purpose. Obviously, the availability of the source code makes OSS components particularly suitable for white-box reuse.\n\nThe two scenarios are summarized in Figure 1. As an example, the MPlayer project keeps a copy of the library in its repository (and it eventually modifies, or \u201cforks,\u201d it for its own purposes, in a white-box reuse scenario), while the VLC project, at compilation time, requires the user to provide the location of an up-to-date version of the FFmpeg project (black-box reuse).\n\nResearch on Open Source Software Reuse\n\nThere is a growing body of empirical research the use of OSS components in CBSD (Ayala et al., 2007; Hauge et al., 2009; Capiluppi & Knowles, 2009; Li et al., 2009; Ven & Mannaert, 2008). There is an increasing number of OSS products available, many of which have become viable alternatives to commercial products (Fitzgerald, 2006), and adopting OSS components to build products is a common scenario (Hauge et al., 2010).\n\nResearch on OSS reuse can be classified along two dimensions. The first dimension considers the question who reuses the software. This can either be an Independent Software Vendor (ISV), or other OSS communities. The second dimension considers the software that is reused, in particular the granularity of components. Haefliger et al. 
(2008) identified different granularities of code reuse: algorithms and methods, single lines of code, and components. Components themselves may be of a coarse granularity, i.e., complete software systems. A common example of this is the so-\ncalled \u201cLAMP stack,\u201d (Wikipedia, n.d.) which is an \u201censemble\u201d of Linux, Apache, MySQL, and a scripting language such as Python, Perl, PHP or Ruby. Much of the literature on OSS reuse focuses on such coarse-grained components by ISVs, though it is noteworthy that granularity cannot be measured on a discrete scale but rather a continuous one. German et al. (2007) discuss dependencies between packages (which they define as an installable unit of software), such as found in Linux distributions. They define a model to represent and analyze such dependencies. Other work led by German investigated the issue of licenses when reusing different OSS components (German & Hassan, 2009; German & Gonz\u00e1lez-Barahona, 2009).\n\nOn the other hand, reuse can be done with components of a finer granularity. There are few studies of this, all of which focus on the reuse by other OSS projects. The study presented in this paper also considers components of relatively small granularity, which is why we discuss this related work in more detail. Table 1 provides an overview of the study objectives as well as research methods and samples.\n\nOne of the first studies that quantifies the reuse in Open Source Software is by Mockus (2007). That study focuses on reuse by identifying directories of source code files that share a number (defined by a threshold) of file names; therefore, the study only considers white-box reuse. Mockus studied reuse on a large sample of 38,700 unique projects with 5.3 million unique file name paths. Mockus found that approximately half of the files are used in more than one project, which indicates significant reuse among OSS projects.\n\nHaefliger et al. (2008) conducted a study of 15 OSS projects, six of which were studied in-depth. The goal of this study was an investigation of the influence of several factors identified in the literature on the support of code reuse in OSS development. Factors included standards and tools, quality ratings and certificates, and incentives as found in commercial software development firms. The study shows that all studied projects reuse software, and that black-box reuse was the predominant form.\n\nSojer and Henkel (2010) conducted a survey to investigate quantitatively the relationship between developer and project characteristics on the one hand and the degree of software reuse in OSS projects on the other hand. The survey among 686 OSS developers identified a number of factors, such as developers\u2019 experience in OSS projects that affect software reuse in OSS projects. Unlike other studies, such as the one by Mockus and Haefliger et al. mentioned, this study does not investigate actual reuse within OSS projects, but rather developers\u2019 behavior and opinions on the topic.\n\nHeinemann et al. (2011) studied reuse in a sample of 20 OSS projects written in the Java programming language, using clone detection.\n\n### Table 1. Overview of previous studies of reuse in OSS\n\n| Authors | Study objective | Method and sample |\n|--------------------|---------------------------------------------------------------------------------|--------------------------------------------------------|\n| Mockus et al. (2007) | To identify and quantify large-scale reuse in OSS. 
| Survey of 38,700 projects, 13.2 MLOC |
| Haefliger et al. (2008) | Is code reuse supported in OSS? | Multiple case study, 15 projects, in-depth analysis of 6 projects, 6 MLOC |
| Sojer and Henkel (2010) | How important is code reuse in OSS projects? What are perceived benefits, issues and impediments of code reuse? How is code reuse affected by characteristics of developers and project? | Web-based survey, 686 responses |
| Heinemann et al. (2011) | Do OSS projects reuse software? How much black-box/white-box? | Empirical study, 20 OSS Java projects, 3.3 MLOC |

These clone detection techniques were complemented with manual inspection. Their study investigated whether OSS projects reuse software, and to what extent such reuse happens as white-box and black-box reuse. They found that reuse is common in the OSS Java projects studied, in particular black-box reuse, as previously found by Haefliger et al. (2008). It must be noted that their measurements also counted reuse of the Java standard libraries.

Component Characterization

Components, as defined, can be characterized in different categories depending on their relationships to other components. Lungu et al. (2006) distinguish between four types of (Java) packages. These are:

1. **Silent package**: there are no dependency relations between the package and other packages;
2. **Consumer package**: there is a dependency relation from the package to other packages (that is, the package depends on, or consumes, functionality from other packages);
3. **Provider package**: there is a dependency from other packages to the package (that is, the package provides functionality to other packages);
4. **Hybrid package**: the package is both a consumer and a provider at the same time (that is, it both consumes and provides functionality to and from other packages, respectively).

Though Lungu et al. refer to Java packages, which, they argue, are the main mechanism for the decomposition and modularization of a software system written in Java, we argue that the same four types listed can be used to characterize components as directories containing source code files (as defined in the previous subsection). That is, a provider is a component that provides services to other components (which therefore become dependent upon the provider). Likewise, a consumer relies on functionality provided in other components (and is therefore dependent upon those). Incidentally, Java packages are in fact represented as directories in a source code file system.

Software Evolution and Program Classification

There is continuous pressure on software systems to evolve in order to prevent them from becoming obsolete (Lehman, 1978). Lehman (1980) stated a number of “laws of software evolution” and presented a classification of programs into three classes, S, P and E, which relates to how programs evolve. The three program types are briefly summarized below.

**S-Programs**

Lehman (1980) described S-Programs as: “programs whose function is formally defined by and derivable from a specification.” These are programs that solve a specific problem, which is completely defined. The specification of the problem “directs and controls the programmer in his creation of the program that defines the desired solution” (Lehman, 1980). Changes may of course be made to the program, for instance, to improve resource usage or improve its maintainability. However, such changes must not change the mapping between the input and output.
If changes are made due to a changed specification, it is a different program that solves a new problem. Typical examples of S-type programs are library routines that implement mathematical operations, for instance the sine and cosine functions.\n\n**P-Programs**\n\nP-Programs are programs that implement a solution to a problem that is well-defined but whose implementation must be limited to an approximation to achieve practicality. The problem statement of P-Programs \u201cis a model of an abstraction of a real-world situation, containing uncertainties, unknown, arbitrary criteria and continuous variables\u201d (Lehman, 1980). Whereas the correctness of an S-Program depends on its specification, the value and validity of P-Programs is dependent on the solution\nacquired in a real-world environment. As the environment or world in which the program is used is changing, P-Programs themselves must also change. Examples, as suggested by Lehman, are a software program implementing the game of chess, as well as weather prediction software.\n\n**E-Programs**\n\nThe defining characteristic of the third class of programs, E-Programs, is that the installation of a program itself changes the nature of the problem that it is solving. As Lehman (1980) stated: \u201cOnce the program is completed and begins to be used, questions of correctness, appropriateness and satisfaction arise [\u2026] and inevitably lead to additional pressure for change.\u201d In other words, the environment (or world) in which the program was originally conceived is changing due to the introduction of the program itself. Or, stated in more abstract terms, the introduction of a solution (the software program) to a problem changes the nature of the problem itself. This leads to the need for continuous change to E-type programs. Lehman mentions as examples of such types of programs operating systems and air-traffic control software (Lehman, 1980).\n\n**Software Architecture, Decay and Architectural Recovery**\n\nThe empirical analysis of the FFmpeg components reported below revealed several changes in the components and in their connections to the core of the system: these changes revealed (in at least one case) a decay in how some of the components are internally structured, and externally connected to other components. Therefore this work is also related to the study of software architectures, as it relates to components, and their mutual relationships (Bass et al., 2003).\n\nIt is now widely accepted that a system\u2019s software architecture has different views (IEEE, 2000); well known is the 4+1 view model of architecture (Kruchten, 1995), which defines the logical, development, process, physical views, plus a use-case view. As outlined, our study considers components as directories containing source code files, which would be presented in the development view. One related aspect that was also considered for the present study is about how such structural characteristics decay over time, how components become less cohesive and how the connections between them infringe the original design constraints.\n\nOne important aspect of software architectures and components is modularity (Parnas, 1972): the division of a system into modules (or components) helps in the separation of the functionality and responsibilities of the various modules. 
Reusability is a quality attribute that is directly related to a component’s (or system’s) modularity; examining the inter-component couplings (Bass et al., 2003) may provide valuable insights that help to assess the reusability of a component (or system). The analysis of coupling and cohesion of object-oriented systems has also shown that a good degree of modularity is achieved by observing the “loose coupling and high cohesion” principle for components (Fenton, 1991; Macro & Buxton, 1987; Troy & Zweben, 1981).

As software systems evolve over time, the software engineering literature has firmly established that software architectures and the associated code suffer from software decay (Eick et al., 2001). Perry and Wolf (1992) speak of architectural erosion and architectural drift. The former occurs as a result of violating the (conceptual) software architecture. The latter is due to an insensitivity of stakeholders to the architecture, which may lead to an obscuration of the architecture, which in turn may cause violations of the architecture. As a result, software systems have a progressive tendency to lose their original structure, which makes it difficult to understand and further maintain them (Schmerl et al., 2006). Among the most common discrepancies between the original and the degraded structures, the phenomenon of highly coupled and weakly cohesive modules has been known since 1972 (Parnas, 1972) and is an established topic of research.

Architectural recovery is one of the recognized counter-measures to this decay (Dueñas et al., 1998). Several earlier works have focused on the architectural recovery of proprietary software (Dueñas et al., 1998), closed academic software (Abi-Antoun et al., 2007), COTS-based systems (Avgeriou & Guelfi, 2005) and OSS (Bowman et al., 1999; Godfrey & Lee, 2000; Tran et al., 2000). In all of these studies, systems were selected in a specific state of evolution, and their internal structures were analyzed for discrepancies between the conceptual and concrete architectures (Tran et al., 2000). Researchers have proposed various approaches to address this issue, by proposing frameworks (e.g., Sartipi et al., 2000), methodologies (e.g., Krikhaar et al., 1999) or guidelines and concrete advice to developers (e.g., Tran et al., 2000).

Architectural recovery provides insights into the concrete architecture, which in turn may be of help to developers and integrators. For instance, certain architectural styles (Clements et al., 2010) may be identified, which can provide valuable insights into a system’s quality attributes (Bass et al., 2003; Harrison & Avgeriou, 2011). Recovery is also important to ensure the maintainability of a software product; if the conceptual architecture is not respected, the resulting concrete architecture may become a spaghetti architecture, which can be an obstacle to making necessary changes to the system. In the context of software reuse, and this research in particular, components (as defined) may be identified that can be reused in other systems (i.e., OSS projects).

RESEARCH DESIGN

The study presented in this paper is a quantitative, descriptive case study (Yin, 2003). As Easterbrook et al. (2008) pointed out, there exists some confusion in the software engineering literature over what constitutes a case study, distinguishing between a case study as a “worked example” and a case study as an “empirical method”.
Case studies can also be conducted in different contexts, for instance in industry (\u201cin vivo\u201d) or in a research/laboratory setting (\u201cin vitro\u201d). This study is an empirical, \u201cin vitro\u201d case study of one OSS project, namely FFmpeg. As such, this study presents the description and analysis of a system, and following the classification by Glass et al. (2002) the research approach can therefore be classified as \u201cdescriptive.\u201d\n\nThe remainder of this section proceeds as follows. First, we provide further information on the FFmpeg project. Second, we introduce the research questions that guided the research. Third, we present the definitions to operationalize this research. The section concludes with a discussion of data collection and analysis procedures.\n\nSelection and Description of the FFmpeg System\n\nThis paper presents a case study of reuse of build-level components in the FFmpeg project. We selected this project as an example of software reuse for several reasons:\n\n1. It has a long history of evolution as a multimedia player that has grown and refined several build-level components throughout its life cycle. Some of these components appear like \u201cE\u201d type systems, instead of traditional \u201cS\u201d or \u201cP\u201d types, with lower propensity for software evolution.\n\n2. Several of its core developers have been collaborating also in the MPlayer (http://www.mplayerhq.hu) project, one of the most commonly used multimedia players across OSS communities. Eventually, the libavcodec component has been incorporated (among others from FFmpeg) into the main development trunk of MPlayer, increasing FFmpeg\u2019s visibility and widespread usage.\n\n3. Its components are currently reused on different platforms and architectures, both in static linking and in dynamic linking. Static linking involves the inclusion of source code files or pre-compiled libraries at compile-time, while dynamic linking involves the inclusion of a (shared) binary library at runtime.\n4. Finally, the static-linking reuse of the FFmpeg components presents two opposite scenarios: either a black-box reuse strategy, with \u201cupdate propagation\u201d issues reported when the latest version of a project has to be compiled against a particular version of the FFmpeg components (Orsila et al., 2008); or a white-box reuse strategy.\n\nAs mentioned, the FFmpeg system has successfully become a highly visible OSS project partly due to its components, libavcodec in particular, which have been integrated into a large number of OSS projects in the multimedia domain.\n\nIn terms of a global system\u2019s design, the FFmpeg project does not yet provide a clear description of either its internal design, or how the architecture is decoupled into components and connectors. Nonetheless, by visualizing its source tree composition (de Jonge, 2002), the folders containing the source code files appear to be semantically rich, in line with the definitions of build-level components (de Jonge, 2005), and source tree composition (de Jonge, 2002). The first column of Table 2 summarizes which folders currently contain source code and subfolders within FFmpeg.\n\nAs shown, some components act as containers for other subfolders, apart from source files, as shown in columns two and three, respectively. 
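As an aside, candidate build-level components can in principle be enumerated directly from a checkout of the source tree, by keeping the top-level directories that contain both source files and their own build or configuration files (in line with the definition by de Jonge, 2005). The following minimal Python sketch is our own illustration, not the tooling used in the study; the directory name `ffmpeg-trunk`, the source-file extensions and the build-file names are assumptions made for the example.

```python
import os

SOURCE_EXT = {".c", ".h", ".S", ".asm"}          # assumed source-file extensions
BUILD_FILES = {"Makefile", "configure"}          # assumed build/configuration files

def candidate_components(root):
    """List top-level directories that look like build-level components:
    they contain source files and carry their own build/configuration files."""
    components = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            continue
        has_source = any(
            os.path.splitext(f)[1] in SOURCE_EXT
            for _, _, files in os.walk(path) for f in files
        )
        has_build = any(f in BUILD_FILES for f in os.listdir(path))
        if has_source and has_build:
            components.append(name)
    return components

if __name__ == "__main__":
    # e.g., a local checkout of svn://svn.ffmpeg.org/ffmpeg/trunk
    print(candidate_components("ffmpeg-trunk"))
```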
Typically these subfolders have the role of specifying/restricting the functionalities of the main folder in particular areas (e.g., the libavutil folder, which is further divided into the various supported architectures, such as Intel x86, ARM, PPC, etc.; as mentioned, Lungu et al. (2006) refer to this structural “pattern” as an Archipelago). The fourth column describes the main functionalities of the component. It can be observed that each directory provides the build and configuration files for itself and the subfolders it contains, following the definition of build-level components (de Jonge, 2005). The fifth column of Table 2 lists the month in which the component was first detected in the repository. Apart from the miscellaneous tools component, each of these is currently reused as an OSS component in other multimedia projects as a development library; for example, the libavutil component is currently redistributed as the libavutil-dev package.

Table 2 shows that the main components of this system have originated at different dates, and that the older ones (e.g., libavcodec) are typically subdivided into more directories and files. The libavcodec component was created relatively early in the history of this system (08/2001), and it has now grown to some 220,000 source lines of code (SLOC) alone.

As is visible in the time-line in Figure 2, other components have coalesced since then; each component appears modularized around a specific “function,” according to the “Description” column in Table 2; as such, the components have become more identifiable and hence reusable in other systems (and are in fact repackaged as distinct OSS projects, http://www.libav.org).

### Table 2. The build-level components of FFmpeg

| Component name | Folder count | File count | Description | First detected |
|----------------|--------------|------------|-------------|----------------|
| libavcodec | 12 | 625 | Extensive audio/video codec library | 08/2001 |
| libpostproc | 1 | 5 | Library containing video postprocessing routines | 10/2001 |
| libavformat | 1 | 205 | Audio/video container mux and demux library | 12/2002 |
| libavutil | 8 | 70 | Shared routines and helper library | 08/2005 |
| libswscale | 6 | 20 | Video scaling library | 08/2006 |
| tools | 1 | 4 | Miscellaneous utilities | 07/2007 |
| libavdevice | 1 | 16 | Device handling library | 12/2007 |
| libavfilter | 1 | 11 | Video filtering library | 02/2008 |

Research Questions

This research has been guided by three research questions:

RQ1: How does the size of FFmpeg components evolve?
Rationale: first, we were interested in how the components of FFmpeg behave in terms of their size, when they become available, and whether there is a limit to growth in such components that affects their ability to be reused properly.

RQ2: How does the architecture of FFmpeg components evolve?
Rationale: we were interested in understanding how the various FFmpeg components relate to one another in terms of coupling and cohesion. We consider these measures to be a representation of the software architecture.

RQ3: How do FFmpeg components evolve when separated from FFmpeg (e.g., in white-box reuse)?
Rationale: as mentioned, the FFmpeg components have been reused so far in a black-box or a white-box scenario. OSS components are particularly suitable for white-box reuse due to the availability of the source code. A number of FFmpeg components have in fact been reused using a white-box reuse approach.
Since in such a scenario a copy of the component is made and maintained by a new hosting project, the component is likely to evolve separately from its original host project (i.e., FFmpeg). Therefore, it is interesting to study how FFmpeg components evolve when they are reused as white-box components.\n\nDefinitions and Operationalization\n\nThis section introduces a number of definitions that are relevant to the research presented in this paper. In this paper we use terminology and definitions provided in related and previous studies.\n\nThe previous section already discussed our interpretation of the term component. To summarize, we consider a directory in the source code file system, containing several source code files, to be a build-level component (de Jonge, 2005), which are subsequently used as units of composition. Others have used the word \u201cmodule\u201d for this (e.g., Clements et al., 2010).\n\nIn order to measure the evolution of components and their architectural evolution, we use a number of measurements that have been well established in software engineering measurement literature, namely coupling and cohesion. Coupling is further divided into outbound coupling (fan-out) and inbound coupling (fan-in). Furthermore, we have considered the concept of \u201cconnection\u201d which states whether two components are related or not.\n\u2022 **Coupling**: Coupling is a measure of the degree of interdependence between modules (Fenton, 1991). There are several types of coupling, such as common coupling where modules reference a global data area, control coupling where control data is passed between modules, etc. An extensive classification of types of coupling is presented by Lethbridge and Lagani\u00e8re (2001, p. 323). In this study, we define coupling as the union of \u201croutine call\u201d coupling and \u201cinclusion/import\u201d coupling. Routine call coupling refers to function calls from a component A to a component B. Inclusion/import coupling refers to dependencies expressed using the #include directive of the C preprocessor. We used the Doxygen tool (http://www.doxygen.org/) to extract this information. Since the empirical study is based on the definition of build-level components, two further conversions have been made:\n\n1. The file-to-file and the functions-to-functions couplings have been \u201clifted\u201d (Krikhaar, 1999, p. 38, p. 85) into folder-to-folder couplings, as also done by Tran and Holt (1999); this is graphically illustrated in Figure 3. A stronger coupling link between folder A and B will be found when many elements within A call elements of folder B.\n\n2. Since the behavior of build-level components is studied here, the couplings to subfolders of a component have also been redirected to the component alone; hence a coupling $A \\rightarrow B/C$ (with C being a subfolder of B) is reduced to $A \\rightarrow B$. This is graphically illustrated in Figure 4.\n\n\u2022 **Outbound coupling** (fan-out): for each component, the percentage of couplings directed from any of its elements to elements of other components, as in requests of services. 
A component with a large fan-out, or one “controlling” many components, provides an indication of poor design, since the component is probably performing more than one function.

• **Inbound coupling** (fan-in): for each component, the percentage of couplings directed to it from all the other components, as in “provision of services.” A component with a high fan-in is likely to perform often-needed tasks, invoked by many components, which is regarded as acceptable design behavior.

• **Cohesion**: for each component, the sum of all couplings, in percentage, between its own elements (files and functions).

• **Connection**: distilling the couplings as defined, one could say, in a Boolean manner, whether two folders are linked by a connection or not, disregarding the strength of the link itself. The overall number of these connections for the FFmpeg project is recorded monthly in Figure 5; the connections of a folder to itself are not counted (for the encapsulation principle), while a two-way connection is counted just once (since we are only interested in which folders are involved in a connection).

Data Collection and Analysis

The source code repository (SVN) of FFmpeg was parsed monthly, resulting in some 100 temporal points, after which the tree structures were extracted for each of these points. The monthly extraction of the raw data was achieved by downloading the repository on the first day of each month. As an example, for retrieving the snapshot for 02/2008, the following command was issued:

```
svn -r {2008-02-01} checkout svn://svn.ffmpeg.org/ffmpeg/trunk
```

On the one hand, the number of source folders (but not yet build-level components) of the corresponding tree is recorded in Figure 5. On the other hand, in order to produce an accurate description of the tree structure as suggested by Tran et al. (2000), each month’s data has been further parsed using Doxygen, with the aim of extracting the common coupling among the elements (i.e., source files and headers, and source functions) of the system. Doxygen generates so-called .dot files in the process. Each of these .dot files represents a file (or a class), or a cluster of files, and its couplings towards others in the system. In order to generate the .dot files (and keep them available after the process), the Doxygen configuration file (http://mastodon.uel.ac.uk/IJOSSP2012/Doxygen_base.txt) contains these two settings:

```
HAVE_DOT = YES
DOT_CLEANUP = NO
```

Various scripts are then applied to obtain the summary of function calls (http://mastodon.uel.ac.uk/IJOSSP2012/ffmpeg-2008-02-01-summary_ALL_FUNCTION_CALLS.txt), dependencies and include relationships. The information in the summary files is at the atomic level of functions or files: in order to define inter-relationships between components, these relations are lifted (Krikhaar, 1999) to the level of the build-level components (i.e., folders) that contain them, as was mentioned.

The analysis of size growth has been performed using the SLOCCount tool (Wheeler, n.d.).

For each build-level component summarized in Table 2, a study of its relative change in terms of the contained SLOC along its lifecycle has been undertaken. In addition, a study of the architectural connections has been performed, by analyzing temporally:

1. The number of couplings that were actually involved with elements of the same component (as per the definition of cohesion);
2.
The number of couplings that consisted of links to or from other components (as per the definition of inbound and outbound couplings, respectively).\n\nPrevious studies that present recovered architectures have used \u201cbox-and-line\u201d (or box and arrow) diagrams (e.g., Bowman et al., 1999). We use UML package diagrams (rather than component diagrams) to graphically visualize (build-level) components, as defined in the previous section.\n\nRESULTS AND DISCUSSION\n\nThis section provides the results of the empirical investigation, addressing the three research questions identified in the previous section. First, the size growth of the FFmpeg components is presented (Table 2). This is followed by a presentation of an analysis of the architectural evolution of the components. This section concludes with a discussion of the deployment of libavcodec in other OSS projects.\n\nSize Growth of FFmpeg Components\n\nAs a general result, two different evolutionary patterns can be observed, which have been clustered in the two graphs of Figure 6 and Figure 7; the measures are all relative to the highest values recorded, and they are presented as percentages on the Y-axis. In the top graph, three components (libavcodec, libavutil and libavformat in blue, yellow and red, respectively) show a linear growth as a general trend (relative to the maximum size achieved by each). In the following, these components are referred to as E-type components. On the other hand, the other components in FFmpeg (Table 2) show a more traditional evolution that is typical for library packages, and are referred to as either \u201cS-type\u201d or \u201cP-type\u201d systems (as presented in the background section).\n\nSize Growth in E-Type Components\n\nConsidering the top diagram in Figure 6, the libavcodec component started out as a medium-sized component (18 KSLOCs), but currently its\nsize has reached over 220 KSLOCs, which is an increase of over 1,100%. Also, the libavformat component has moved through a comparable pattern of growth (250% increase), but with a smaller size overall (from 14 to 50 KSLOC). Although reusable resources are often regarded as \u201cS-type\u201d or \u201cP-type\u201d systems, since their evolutionary patterns manifest a reluctance to growth (as in the typical behavior of software libraries), these two components achieve an \u201cE-type\u201d evolutionary pattern even when heavily reused by several other projects. The studied cases appear to be driven mostly by adaptive maintenance (Swanson, 1976), since new audio and video formats are constantly added and refined among the functions of these components.\n\nUsing a metaphor from botany, these software components appear and grow as \u201cfruits\u201d from the main \u201cplant\u201d (\u201ctrunk\u201d in the version control system). Furthermore, these components behave as \u201cclimacteric\u201d fruits (such as bananas), meaning that they ripen off the parent plant (and in some cases, they must be picked in order to ripen; that is, a component needs to\nbe separated from the parent project in order to allow it to mature and evolve). These FFmpeg components have achieved an evolution even when separated from the project they belonged to (i.e., FFmpeg), similarly to climacteric fruits.\n\n**Size Growth in S- and P-Type Components**\n\nThe bottom diagram in Figure 7 details the relative growth of the remaining components. The Figures 6 and 7 show that these remaining components show a more traditional library-style type of evolution. 
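As a brief illustration of the relative measures plotted in Figures 6 and 7, the following Python sketch normalizes a component's monthly SLOC series against its maximum and computes the overall growth percentages quoted in the text. This is our own sketch under assumed input data, not the scripts used in the study; the numeric values are taken from the figures reported above.

```python
def relative_series(sloc_by_month):
    """Normalize a monthly SLOC series to a percentage of its maximum,
    as plotted on the Y-axis of the growth figures."""
    peak = max(sloc_by_month)
    return [100.0 * value / peak for value in sloc_by_month]

def growth_percent(first, last):
    """Overall growth of a component, in percent of its initial size."""
    return 100.0 * (last - first) / first

# Illustrative figures from the text: libavcodec grew from ~18 to ~220 KSLOC,
# libavformat from ~14 to ~50 KSLOC.
print(round(growth_percent(18_000, 220_000)))   # ~1122, i.e., "over 1,100%"
print(round(growth_percent(14_000, 50_000)))    # ~257, i.e., roughly "250%"
```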
Maintenance activities in these components are more likely to be of a corrective or perfective nature (Swanson, 1976). The components libpostproc and libswscale appear to be hardly changing at all, even though they have been formed for several years in the main project (Figure 2). Libavdevice, when created, was already at 80% of its current state; libavfilter, in contrast, although achieving a larger growth, does so since it was created at a very small stage (600 SLOC), which has now doubled (1,400 SLOCs). These resources are effectively library-type of systems, and their reuse is simplified by the relative stability of their characteristics, meaning the type of problem they solve. Using the same metaphor as shown, the components (\u201cfruits\u201d) following this behavior are unlikely to ripen any further once they have been picked. Outside the main trunk of development, these components remain unchanged, even when incorporated into other OSS projects.\n\n**Architectural Evolution of FFmpeg Components**\n\nThe observations related to the growth in size have been used to cluster the components based on their coupling patterns. As mentioned, each of the 100 monthly checkouts of the FFmpeg system were analyzed in order to extract the common couplings of each element (functions or files), and these common couplings were then converted (lifted) into connections between components.\n\nAs observed also with the growth in size, the E-type components present a steadily increasing growth of couplings compared to the more stable S-type and P-type components. In the following section, we will study whether the former also display a more modularized growth pattern, resulting in a more stable and defined behavior.\n\n**Coupling Patterns in E-Type Components**\n\nFigures 8 through 10 present the visualization of the three E-type components identified. For each component, four trends are displayed:\n\n1. The overall amount of its common couplings;\n2. The amount of couplings directed towards its elements (cohesion);\n3. The amount of its outbound couplings (fan-out);\n4. The amount of its inbound couplings (fan-in).\n\nAs seen, these trends are also measured relative to the highest values recorded in each trend, and they present the results in percentages on the Y-axis.\n\nEach component has a continuous growth trend regarding the number of couplings affecting it. The libavutil component has one sudden discontinuity in this growth, which will be later explained. As a common trend, it is also visible that both the libavcodec and libavformat components have a strong cohesion factor, which maintains over the 75% threshold throughout their evolution. In other words, in these two components, more than 75% of the total number of couplings are consistently between internal elements. The cohesion of libavutil, on the other hand, degrades until it becomes very low, revealing a very high fan-in. After the restructuring at around one fifth of its lifecycle (June 2006), this component becomes a provider (Lungu et al., 2006), fully providing services to other components (more than 90% of the overall amount of its couplings \u2013 around\n3,500 \u2013 are either towards its own elements or serving calls from other components).\n\nWhen observing the three components as part of a common, larger system, the changes in one component become relevant to the other components as well. 
For example, the general trend of libavcodec is intertwined with the other two components (i.e., libavutil and libavformat) in the following ways:

1. The overall cohesion decreases during a time interval when no overall couplings (i.e., the blue trend) were added; therefore this attribute has decayed.

2. In parallel with the cohesion decay, the fan-out of libavcodec (top of Figure 5) abruptly increases, topping some 17% at the latest studied point: on closer inspection, this larger fan-out (e.g., requests of services) is increasingly directed towards the libavutil component, which around the same period (middle of Figure 5) experiences a sudden increase of its fan-in (i.e., provision of services).

3. Also, the fan-in of libavcodec decreases: in the first part of its evolution, libavcodec served numerous requests from the libavformat component; throughout the evolution, these links are converted into connections to libavutil instead, decreasing the fan-in of libavcodec.

4. Performing a similar analysis for libavformat, it becomes clear that its fan-out degrades, becoming gradually larger, the reason being an increasingly stronger link to the elements of both libavcodec and libavutil. This form of inter-component dependency is a form of architectural decay (Eick et al., 2001). This has been reproduced for the latest available data point in Figure 11: both libavformat and libavcodec depend heavily on libavutil (1,093 and 1,748 overall couplings, respectively); furthermore, the same two components are also intertwined by 523 calls by libavformat that are served by libavcodec.

Figure 8. Coupling patterns of E-type components. Libavcodec.

Figure 9. Coupling patterns of E-type components. Libavutil.

Figure 11 shows that most of the couplings of these displayed components are amongst themselves; for instance, 68% of the couplings of libavformat (4,051 couplings) are couplings to itself (i.e., its cohesion); 18% (1,093) are to libavutil, and 9% are to libavcodec. Ninety-five per cent of libavformat’s couplings are found within these three components; the remaining 5% are couplings to other components. When comparing these results with the plots in Figures 8 through 10 (especially the one representing the libavcodec component), it becomes clear how its architecture has decayed. In the earliest points, libavcodec represented an excellent component, with a cohesion made of 90% of all its couplings, and a fan-in of 10% of all its couplings. No fan-out was recorded, so essentially libavcodec had no need for services from other components. The latest available point, instead (Figure 11), shows a component that has decayed, that needs more from libavutil (16% of all its couplings), and for which the fan-out has increased to some 18% of its overall couplings.

The graph in Figure 11 shows another result, representing in fact the typical trade-off between encapsulation and decomposition: several of the common files accessed by both libavformat and libavcodec have recently been “relocated” (Tran & Holt, 1999) to a third location (libavutil), which acts as a provider (Lungu et al., 2006) to both. This in turn has a negative effect on reusability; when trying to reuse (some of) the functionality of libavcodec, it will be necessary to include also (some of) the contents of libavutil, since a large amount of calls are issued by libavcodec towards libavutil.
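As a consistency check on the percentages reported above for libavformat (a derivation of ours; the total number of couplings is not stated explicitly in the text), the 4,051 self-couplings amounting to 68% imply a total of roughly 5,960 couplings, which agrees with the other quoted figures:

$$
\frac{4{,}051}{0.68} \approx 5{,}957, \qquad
\frac{1{,}093}{5{,}957} \approx 18\%, \qquad
\frac{523}{5{,}957} \approx 9\%, \qquad
\frac{4{,}051 + 1{,}093 + 523}{5{,}957} \approx 95\%.
$$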
Even worse, when trying to reuse (some of) the functionality of libavformat, it will be necessary to include also (some of the functionality of) libavutil and libavcodec, since the three components are heavily intertwined.

**Coupling Patterns in S- and P-Type Components**

The characteristics of the E-type components as described can be summarized as follows:

- High cohesion;
- Fan-out under a certain threshold; and
- Clear, defined behavior as a component (e.g., a “provider,” as achieved by the libavutil component).

The second cluster of components identified (the “S-” and “P-type”) revealed several discrepancies from the results observed previously. A list of key results is summarized here:

1. As also observed for the growth of components, the number of couplings affecting this second cluster of components reveals a difference of one (libswscale, libavdevice and libavfilter) and even two (libpostproc) orders of magnitude with respect to the E-type components.

2. Slowly growing trends in the number of couplings were observed in libavdevice and libavfilter, but their cohesion remains stable. On the other hand, a high fan-out was consistently observed in both, with values of 0.7 and 0.5, respectively. Observing more closely, these dependencies are directed towards the three E-type components defined above. This suggests that these components are not yet properly designed; this may also be due to their relatively young age. Their potential reuse is therefore contingent on the inclusion of other FFmpeg libraries as well.

To summarize, this second type of components can be classified as slowly growing, less cohesive and more connected with other components in the same system. They can be acceptable reusable candidates, but resolving the inter-connections with other components from the same project could prove difficult.

**Deployment of libavcodec in other OSS Projects**

Although identified as “E-type” components, the three components libavcodec, libavformat and libavutil have been shown to be highly reusable, based on coupling patterns and size growth attributes. This is interesting, as it seems to contradict the expectation that E-type software is less reusable, due to the need to continuously evolve. In order to observe how these components are actually reused and deployed in other hosting systems, this section summarizes the study of the deployment of the libavcodec component in four OSS projects: avifile (http://avifile.sourceforge.net/), avidemux (http://fixounet.free.fr/avidemux/), MPlayer and xine (Freitas, Roitzsch, Melanson, Mattern, Langauf, Petteno et al., 2002).

The selection of these projects for the deployment study is based on their current reuse of these components. Each project hosts a copy of the libavcodec component in its code repository, thereby implementing a white-box reuse strategy of this resource. In other words, these projects maintain their own copy of the libavcodec component. The issue to investigate is whether these hosting projects maintain the internal characteristics of the original libavcodec, hosted in the FFmpeg project. In order to do so, the coupling attributes of this folder have been extracted from each OSS project, and the number of connected folders has been counted, together with the total number of couplings.
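To make this counting step concrete, the following Python sketch lifts hypothetical file-level coupling edges to the component (folder) level and computes the cohesion, fan-out and fan-in percentages for one component, following the definitions given in the research design. This is our own illustration of one possible operationalization, not the scripts used in the study; the example file names and the `component_of` mapping are invented for the example.

```python
from collections import Counter

def lift(file_couplings, component_of):
    """Lift file-to-file couplings to component-to-component couplings
    (cf. Krikhaar, 1999; Tran & Holt, 1999). `file_couplings` is an iterable
    of (source_file, target_file) pairs; `component_of` maps a file path to
    its top-level component (folder)."""
    lifted = Counter()
    for src, dst in file_couplings:
        lifted[(component_of(src), component_of(dst))] += 1
    return lifted

def coupling_profile(lifted, component):
    """Cohesion, fan-out and fan-in of `component`, as percentages of all
    couplings that involve it."""
    cohesion = lifted[(component, component)]
    fan_out = sum(n for (s, d), n in lifted.items() if s == component and d != component)
    fan_in = sum(n for (s, d), n in lifted.items() if d == component and s != component)
    total = cohesion + fan_out + fan_in

    def pct(n):
        return 100.0 * n / total if total else 0.0

    return {"cohesion": pct(cohesion), "fan-out": pct(fan_out), "fan-in": pct(fan_in)}

# Hypothetical usage: files are mapped to their top-level directory.
component_of = lambda path: path.split("/")[0]
edges = [("libavformat/utils.c", "libavformat/avio.c"),
         ("libavformat/utils.c", "libavcodec/avcodec.h"),
         ("libavcodec/h264.c", "libavutil/mem.h")]
print(coupling_profile(lift(edges, component_of), "libavcodec"))
```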
The results are shown in Figure 12.

Each diagram in Figure 12 represents a hosting project: the libavcodec copy presents some degree of cohesion (the re-entrant arrow), and its specific fan-in and fan-out (inward and outward arrows, respectively). The number of connections (i.e., distinct source folders) responsible for the fan-in and fan-out is shown by the number in the (multi-) module diagram in the upper-left and upper-right corners. The following observations can be made:

- The total amount of couplings in each copy is always lower than in the original FFmpeg copy; this means that not the whole FFmpeg project is reused, but only some specific resources.
- In each copy, the ratio $\text{fan-in/fan-out}$ is approximately 2:1. In the xine copy, this is reversed: this is due to the fact that, apparently, xine does not host a copy of the libavformat component.
- For each graph, the connections between libavcodec and libavutil, and between libavcodec and libavformat, have been specifically detailed: the fan-in from libavformat alone typically has the same order of magnitude as all the remaining fan-in.
- The fan-out towards libavutil typically accounts for a much larger ratio. This is a confirmation of the presence of a consistent dependency between libavcodec and libavutil, which therefore must be reused together. The avidemux project moved the necessary libavutil dependencies into its copy of the libavcodec component; therefore no build-level component for libavutil is detectable.

THREATS TO VALIDITY

We are aware of a few limitations of this study, which are discussed below. Threats may occur with respect to construct validity, reliability and external validity. Since we do not seek to establish any causal relationships, we do not discuss threats to internal validity.

**Construct Validity**

Construct validity is concerned with establishing correct operational measures for the concepts that are being studied (Yin, 2003). We used coupling and cohesion measures to represent inter-software-component connections. These measures are widely used within the software engineering literature in relation to software module inter-connectivity. We interpreted the term “component” as “build-level” component, as previously done in other studies (e.g., de Jonge, 2005).

Furthermore, the build-level components presented in Table 2 (though probably accurate) are automatically assigned, but they could be only subcomponents of a larger component (e.g., one composed of both libavutil and libavcodec).

**Reliability**

Reliability is the degree to which the operational aspects of the study, such as data collection and analysis procedures, are repeatable with the same results (Yin, 2003, p. 34). At the time of our study, FFmpeg was hosted in a Subversion repository, which was parsed monthly, as discussed in the research design section. Guba (1981) states that an inquiry can be affected by “instrumental drift or decay,” which may produce effects of instability. In order to guard against this, we have established an audit trail of the data extraction process, which is a recommended practice to establish reliability (Guba, 1981). A snapshot (of the example given in the research design section) is made publicly available (http://mastodon.uel.ac.uk/IJOSSP2012/ffmpeg-2008-02-01.tar.gz).
The generated .dot files (which represent individual files, classes or clusters of files, and contain its couplings to other modules in the system) are also publicly available (http://mastodon.uel.ac.uk/IJOSSP2012/ffmpeg-2008-02-01-dots.tar).\n\n**External Validity**\n\nExternal validity is concerned with the extent to which the results of a study can be generalized. In our study, we have focused on one case study (FFmpeg), which is written mostly in the C programming language. Performing a similar study on a system written in, for instance, an object-oriented language (e.g., C++ or Java), the results could be quite different. However, as outlined in the introduction section, it is not our goal to present generalizations based on our results. Rather, the aim of this paper is to document a successful case of OSS reuse by other OSS projects.\nCONCLUSION AND FUTURE WORK\n\nThis section presents the conclusion of this study followed by directions for future work.\n\nConclusion\n\nEmpirical studies of reusability of OSS resources should proceed in two strands: first, they should provide mechanisms to select the best candidate component to act as a building block in a new system; second, they should document successful cases of reuse, where an OSS component(s) has been deployed in other OSS projects. This paper contributes to the second strand by empirically analyzing the FFmpeg project, whose components are currently widely reused in several multimedia OSS applications. The empirical study was performed on project data for the last eight years of its development, studied at monthly intervals, to determine and extract the characteristics of its size, the evolutionary growth and its coupling patterns, in order to identify and understand the attributes that made its components a successful case of OSS reusable resources. After having studied these characteristics, four OSS projects were selected among the ones implementing a white-box reuse of the FFmpeg components; the deployment and the reuse of these components was studied from the perspective of their interaction with their hosting systems.\n\nIn our case study of FFmpeg, a number of findings were obtained. First, it was found that several of its build-level components make for a good start in the selection of reusable components. They coalesce, grow and become available at various points in the life cycle of this project, and all of them are currently available as building blocks for other OSS projects to use. Second, it was possible to classify (using Lehman\u2019s S-P-E program type categories) at least two types of components: one set presents the characteristics of evolutionary (E-type) systems, with a sustained growth throughout. The other set, albeit with a more recent formation, is mostly unchanged, therefore manifesting the typical attributes of software libraries.\n\nThe two clusters were compared again in the study of the connections between components. The first set showed components with either a clearly defined behavior, or an excellent cohesion of its elements. It was also found that these three components become increasingly mutually connected, which results in the formation of one single super-component. The second set appeared less stable, with accounts of a large fan-out, which suggests a poor design or immaturity of the components.\n\nOne of the reusable resources found within FFmpeg (i.e., libavcodec) was analyzed when deployed into four OSS systems that have reused it using a white-box approach. 
Its cohesion pattern appeared similar to that of the original copy of libavcodec, while it also emerged with more clarity that its reuse is currently facilitated when the libavformat and libavutil components are reused, too. Given that most of the projects reusing the libavcodec library are “dynamically” linking it to their code (i.e., black-box reuse), any change made to the libavcodec library has a propagation issue (Orsila et al., 2008): this means that the linking projects need to adapt their code whenever a new version of libavcodec is released; on the other hand, the projects hosting their own copy of the same library (i.e., white-box reuse) will face less of the propagation issue, since the changes pushed onto the original version of libavcodec will not affect their copies.

Future Work

This work has several open strands to follow: first, it would be interesting to replicate this study on other systems that are currently widely reused. In particular, it is necessary to start defining and distinguishing the reuse of whole systems “as libraries” (such as the project zlib), from the reuse of components within larger projects (such as the component libavcodec within the FFmpeg project). In the first case, the whole project is reused as-is, and it seems likely that only a subset of functions will be reused. In the latter, the implications are more interesting; researchers and practitioners should try to automatically extract libraries that comply with reusability principles, and avoid reusing whole systems.

The second research direction that needs to be addressed concerns the evolution of reusable resources. It needs to address the following questions:

- Do libraries need to remain mostly unchanged to be reusable?
- What are the main issues of forking reusable libraries to avoid the effects of “cascade updates”?

In this respect, OSS developers and interested parties have to produce a strategy for the upgrade of their resources when such resources rely heavily on external libraries.

Thirdly, the example of the components becoming available at different times in FFmpeg shows that other evolving projects might be able to provide a similar response to the OSS communities, by signaling the presence of reusable libraries that could benefit other projects apart from their own.

Finally, the presence of so many available OSS projects implementing similar applications (e.g., our example of over 100 projects implementing an “email client”) should be analyzed further to detect how much code duplication, code cloning or component reuse is visible in these projects.

ACKNOWLEDGMENTS

The authors would like to thank Dr Daniel German for the clarification on the potential conflicts of licenses in the FFmpeg project, Thomas Knowles for the insightful discussions, and Nicola Sabbi for the insider knowledge of the MPlayer system. We thank the anonymous reviewers for their constructive feedback, which has improved this paper. This work was, in part, supported by Science Foundation Ireland grant 10/CE/I1855 to Lero—The Irish Software Engineering Research Centre (www.lero.ie). This paper is a revised version of: Capiluppi, A., Boldyreff, C. & Stol, K. (2011) Successful Reuse of Software Components: A Report from the Open Source Perspective, in: Hissam, S. A., Russo, B., de Mendonça Neto, M. G. & Kon, F. (Eds.) Open Source Systems: Grounding Research, Springer, Advances in Information and Communication Technology (AICT) vol. 365, pp.
159-176.\n\n**ENDNOTES**\n\n1 Of course, a full structural evaluation of these 128 projects should be performed before arguing that no features are reused among these projects.\n\n2 A list of OSS and commercial projects integrating the libavcodec is given and maintained under http://ffmpeg.org/projects.html\n\n3 The term \u201cconnection\u201d is not intended to cover the term \u201cdependency\u201d between packages in a distribution, since this paper only analyses the internal architecture of components.\n\nAndrea Capiluppi is a Lecturer in Software Engineering at University Brunel since May 2012. Before that, he was a Senior Lecturer at the University of East London, from February 2009 to April 2012, and a Senior Lecturer at University of Lincoln, UK, for three years, from January 2006 to February 2009. He has gained a PhD from Politecnico di Torino, Italy, in May 2005, and has held a Researcher position and a Consultant position at the Open University in UK. In November 2003 he was a Visiting Researcher in the GSyC group at the University of Rey Juan Carlos de Madrid, Spain, one of the partners of the project proposal. His publications include some 50 papers, published in leading international conferences and journals, mostly devoted to the Open Source Software topic. He has been a consultant to several industrial companies and has published works where results on FLOSS research have been disseminated in commercial sites. He has taken part in one of the packages of the CALIBRE project, a \u20ac1.5 million pan-European EU research project focused on the use of FLOSS in industry.\n\nKlaas-Jan Stol is a researcher at Lero, the Irish Software Engineering Research Centre, where he has worked since 2008. He holds a PhD in Software Engineering from the University of Limerick, Ireland, and a MSc in Software Engineering from the University of Groningen, the Netherlands. His research interests are in Open Source Software (OSS), software development methods (including OSS development practices), software architecture, component-based software development, software reuse and empirical software engineering.\nCornelia Boldyreff is the Associate Dean (Research and Enterprise) at the School of Architecture, Computing and Engineering at the University of East London. 
She gained her PhD in Software Engineering from the University of Durham. In 2004 she moved to the University of Lincoln to become the first Professor of Software Engineering at the university, where she co-founded and directed the Centre for Research in Open Source Software. She has over 25 years experience in software engineering research and has published extensively on her research in the field. She is a Fellow of the British Computer Society and a founding committee member of the BCSWomen Specialist Group. She has been actively campaigning for more women in SET throughout her career.", "source": "olmocr", "added": "2025-06-23", "created": "2025-06-23", "metadata": {"Source-File": "/home/nws8519/git/adaptation-slr/studies_pdfs/032_capiluppi.pdf", "olmocr-version": "0.1.76", "pdf-total-pages": 26, "total-input-tokens": 72868, "total-output-tokens": 19482, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 2532, 1], [2532, 6672, 2], [6672, 10280, 3], [10280, 13293, 4], [13293, 17345, 5], [17345, 20967, 6], [20967, 25135, 7], [25135, 28917, 8], [28917, 32608, 9], [32608, 35172, 10], [35172, 37709, 11], [37709, 40121, 12], [40121, 42167, 13], [42167, 43259, 14], [43259, 46876, 15], [46876, 48474, 16], [48474, 50940, 17], [50940, 53255, 18], [53255, 57056, 19], [57056, 57988, 20], [57988, 61904, 21], [61904, 65655, 22], [65655, 69953, 23], [69953, 74285, 24], [74285, 77707, 25], [77707, 78460, 26]]}}
{"id": "76d4c40f1a1bc61678d1c01bf1604237d4977b22", "text": "Variant Forks \u2013 Motivations and Impediments\n\nJohn Businge,\u2217 Ahmed Zerouali,\u2021 Alexandre Decan,\u2020 Tom Mens,\u2020 Serge Demeyer,\u2217 and Coen De Roover,\u2021\n\n\u2217University of Antwerp, Antwerp, Belgium\n\u2020University of Mons, Mons, Belgium\n\u2021Vrije Universiteit Brussels, Brussels, Belgium\n\n{ john.businge | serge.demeyer }@uantwerpen.be\n{ alexandre.decan | tom.mens }@umons.ac.be\n{ ahmed.zerouali | coen.de.roover }@vub.be\n\nAbstract\u2014Social coding platforms centred around git provide explicit facilities to share code between projects: forks, pull requests, cherry-picking to name but a few. Variant forks are an interesting phenomenon in that respect, as they permit for different projects to peacefully co-exist, yet explicitly acknowledge the common ancestry. Several researchers analysed forking practices on open source platforms and observed that variant forks get created frequently. However, little is known on the motivations for launching such a variant fork. Is it mainly technical (e.g., diverging features), governance (e.g., diverging interests), legal (e.g., diverging licences), or do other factors come into play? We report the results of an exploratory qualitative analysis on the motivations behind creating and maintaining variant forks. We surveyed 105 maintainers of different active open source variant projects hosted on GitHub. Our study extends previous findings, identifying a number of fine-grained common motivations for launching a variant fork and listing concrete impediments for maintaining the co-existing projects.\n\nIndex Terms\u2014Mainlines, Variants, GitHub, Software ecosystems, Maintenance, Variability\n\nI. INTRODUCTION\n\nThe collaborative nature of open source software (OSS) development has led to the advent of social coding platforms centred around the git version control system, such as GitHub, BitBucket, and GitLab. These platforms bring the collaborative nature and code reuse of OSS development to another level, via facilities like forking, pull requests and cherry-picking. Developers may fork a mainline repository into a new forked repository and take governance over the latter while preserving the full revision history of the former. Before the advent of social coding platforms, forking was rare and was typically intended to compete with the original project [1]\u2013[6].\n\nWith the rise of pull-based development [7], forking has become more common and the community typically characterises forks by their purpose [8]. Social forks are created for isolated development with the goal of contributing back to the mainline. In contract, variant forks are created by splitting off a new development branch to steer development into a new direction, while leveraging the code of the mainline project [9].\n\nSeveral studies have investigated the motivations behind variant forks in the context of OSS projects [1]\u2013[6]. However, most have been conducted before the rise of social coding platforms and it is known that GitHub has significantly changed the perception and practices of forking [8]. In this social coding era, variant projects often evolve out of social forks rather than being planned deliberately [8]. To this end, social coding platforms often enable mainlines and variants to peacefully co-exist rather than compete. 
Little is known on the motivations for creating variants in the social coding era, making it worthwhile to revisit the motivation for creating variant forks (why?).\n\nSocial coding platforms offer many facilities for code sharing (e.g., pull requests and cherry-picking). So if projects co-exist, one would expect variant forks to take advantage of this common ancestry, and frequently exchange interesting updates (e.g., patches) on the common artefacts. Despite advanced code-sharing facilities, Businge et al. observed very limited code integration, using the git and GitHub facilities, between the mainline and its variant projects [10]. This suggests that code sharing facilities in themselves are not enough for graceful co-evolution, making it worthwhile to investigate impediments for co-evolution (how?).\n\nWe therefore explore two research questions:\n\nRQ1: Why do developers create and maintain variants on GitHub? The literature pre-dating git and social coding platforms identified four categories of motivations for creating variant forks: technical (e.g., diverging features), governance (e.g., diverging interests), legal (e.g., diverging licences), and personal (e.g., diverging principles). RQ1 aims to investigate whether those motivations for variant forks are still the same, or whether new factors have come into play.\n\nRQ2: How do variant projects evolve with respect to the mainline? If, despite advanced code sharing facilities, there is limited code integration between the mainline and the variant projects, a possible cause could be related to how the teams working on the variants and the mainline are structured. Therefore, RQ2 investigates the overlap between the teams maintaining the mainline and variant forks, and how these teams interact. As such we hope to identify impediments for co-evolution.\n\nThe investigations are based on an online survey conducted with 105 maintainers involved in different active variant forks hosted on GitHub.\n\nOur contributions are manifold: we identify new reasons for creating and maintaining variant forks; we identify and categorize different code reuse and change propagation practices between a variant and its mainline; we confirm that little code integration occurs between a variant and its mainline, and uncover concrete reasons for this phenomenon. We discuss\nthe implications of these findings and how tools can help to achieve an efficient code integration and collaboration between mainlines and diverging variant forks. Our replication package can be found 1.\n\nII. RELATED WORK\n\nPrevious research has focused on (A) motivations for creating or maintaining variant forks; and (B) interaction between variant forks and their mainline.\n\nA. Motivations for creating or maintaining variant forks\n\nSeveral studies have investigated motivations for creating and maintaining variant forks. However, most of these studies were carried out on SourceForge, pre-dating the advent of social coding platforms like GitHub [1]\u2013[5], [11]. Several of those early studies report perceived controversy around variant forks [5], [12]\u2013[17]. Jiang et al. [18] state that, although forking may have been controversial in the OSS community, it is now encouraged as a built-in feature on GitHub. They further report that developers create social forks of repositories to submit pull requests, fix bugs, and add new features. Zhou et al. [8] conclude that most variant forks started as social forks and that perceptions of forks have changed with the advent of GitHub. 
Robles and Gonz\u00e1lez-Barahona [2] carried out a comprehensive pre-GitHub study on a carefully filtered list of 220 potential forks referenced on Wikipedia. They report motivations and outcomes for forking on these 220 projects.\n\nThe literature has uncovered a number of motivations for creating variants. Below, we present those where both the mainline and variant co-evolve together. The motivation of reviving an abandoned project is not considered in this study, since it does not involve co-evolution of the variant with its mainline.\n\n\u25cb Technical (addition of functionality). Sometimes developers want to include new functionality into the project, but the main developer(s) do not accept the contribution. An example is Poppler, a fork of xpdf relying on the poppler library [2].\n\n\u25cb Governance disputes. Some contributors from the community create a variant project because they feel that their feedback is not heard, or because the maintainers of the mainline are unresponsive or too slow at accepting their patches. A well-known example is Lucid Emacs, a fork of GNU Emacs that was created as a result of the significant delays in bringing out a new version to support the Energize C++ IDE [19].\n\n\u25cb Legal issues. This includes disagreements on the license and trademarks, and changes to conform to rules and regulations. An example is X.Org, which originated from XFree86 [2], [19]. XFree86 was originally distributed under the MIT/X open source license, which is GPL-compatible, but later moved to a license that was not GPL-compatible. This caused many practical problems and a serious uproar in the community, resulting in the X.Org fork.\n\n\u25cb Personal reasons. In some situations, the developer team disagrees on fundamental issues (beyond mere technical matters) related to the software development process and the project. An example is the OpenBSD fork from NetBSD. One of the developers of NetBSD had a disagreement with the rest of the core developers and decided to fork and focus his efforts on OpenBSD [20].\n\nFocusing on variant forks in the Android ecosystem, Businge et al. [21] found that re-branding, simple customizations, feature extension, and implementation of different but related features are the main motivations to create forks of Android apps. Zhou et al. [8] interviewed 18 developers of hard forks on GitHub to understand reasons for forking in social coding environments that explicitly support forking. The motivations they observed align with the findings of the aforementioned studies.\n\nSung et al. [9] investigated variant forks in an industrial case study to uncover the implications of frequent merges from the mainline and the resulting merge conflicts in the variant forks. They implemented a tool that can automatically resolve up to 40% of 8 types of mainline-induced build breaks.\n\nWhile the pre-GitHub studies reported perceived controversy around variant forks, Zhou et al. [8] report that this controversy has reduced with the advent of GitHub. Jiang et al. [18] report that, while forking is considered controversial in traditional OSS communities, it is actually embraced as a built-in feature in GitHub. Our study builds on these previous studies to identify whether the motivations for variant forks are still the same or whether new factors have come into play.\n\nB. Interaction between variant forks and their mainline\n\nWe have only encountered two studies that investigated the interaction between variant forks and mainlines [8], [10]. Zhou et al. 
[8] conducted 18 semi-structured developer interviews. Many respondents indicated being interested in coordination across repositories, either for eventually merging changes back into the mainline, or to monitor activity in the mainline repository and select and integrate interesting updates into their variant project. Businge et al. [10] also investigated the interaction between mainline and variants. The authors quantitatively investigated code propagation among variants and their mainline in three software ecosystems. They found that only about 11% of the 10,979 mainline\u2013variant pairs had integrated code between them. Since the mainlines and variants share a common code base, and with the collaborative maintenance facilities of git and the pull-based development model, one would expect more interactions between the mainline and its variants. We hypothesise that there are some impediments to enable such interactions. Since the two aforementioned studies do not report any such impediments, we decided to carry an exploratory qualitative survey with variant maintainers to identify possible impediments.\n\nIII. STUDY DESIGN\n\nTo understand the motivations behind the creation and maintenance of variant forks we conducted an online survey with maintainers of variant forks. In this section, we explain\nhow we (i) designed the survey protocol; (ii) collected mainline-variant pairs and extracted the maintainers of the variant forks; and (iii) recruited the survey participants.\n\nA. Survey Protocol Design\n\nWe designed a 12-question survey that would last at most 15 minutes. Since we aimed to learn from a large number of projects, we used an online survey as this data collection approach is known to scale well [22]. The survey can be found here\\(^2\\). The questions were designed to cover our two main research questions. 8 of the 12 questions were close-ended and respondents could answer them either via multiple choice or Likert scales. An optional free-text form was provided for 3 of the 8 close-ended questions to allow respondents to share additional thoughts and feedback. The 4 remaining questions are open-ended. All questions were carefully formulated so as not to bias respondents towards a specific answer. We validated them by subjecting them to the critical eye of 7 colleagues and by conducting trial runs of the survey with the same 7 participants.\n\nB. Identifying variant projects and participants\n\nGiven the scope of the survey, we target respondents involved in the creation and maintenance of variant projects. Therefore, we first needed to identify such variants. To this end, we relied on two data sources: Libraries.io and GitHub.\n\nLibraries.io contains metadata about projects distributed through various package registries. We collected the metadata for all projects of some of the largest package registries (npm, Go, Maven, PyPI and Packagist). We relied on this metadata to identify those projects that are variants of another one, following the variant identification method proposed by Businge et al. [10], [23]. We only considered variants that are actively maintained in parallel with their mainline counterparts. We extracted variants for which the mainline\u2013variant pair was created before 2019-04-01 and updated at least once after 2020-04-01 (i.e., active projects). This process yielded 227 mainline\u2013variant project pairs.\n\nWe collected additional mainline-variant pairs from GitHub directly. 
To do so, we searched for mainline projects using the GitHub search endpoint. We looked for popular (> 50 stars and forks), long-lived (created before 2018) and active (still updated in 2020) repositories. We focused on software development repositories whose main language is among the top 17 of most popular languages used in GitHub (e.g., JavaScript, Java, Go, Python, Ruby, C, etc). For all the mainline projects we found, we tried to identify and collect variant forks. This process is subject to a known threat to validity since previous studies revealed that the majority of forks on GitHub are inactive [24], [25] or are social forks [21]. To reduce this threat, we filtered forks based on the following heuristics: \\( \\geq 10 \\) stars, \\( \\geq 10 \\) commits ahead of the mainline, \\( \\geq 5 \\) closed pull requests, diverging README files. We manually verified these remaining forks to ensure they corresponded to variants of the corresponding mainline. This process yielded 264 additional mainline-variant project pairs, leading to a total of 491 collected mainline\u2013variant pairs.\n\nC. Participant Recruitment\n\nBased on this collection of mainline-variant pairs, we identified contributors that had integrated at least one pull request into the variant. We retrieved their public-facing emails (if available) using the GitHub API, while ensuring to respect the GitHub Privacy Statement.\\(^3\\) We individually contacted a total of 762 variant maintainers from the 491 variant projects, and received a total of 105 responses (response rate 14%), representing a total of 105 variant forks (21%). All participants were required to read and accept an informed consent form before taking part in the survey.\n\n\n\nD. Analysis\n\nWe used open card sorting [26], on the 3 open-ended questions to identify common responses reported by the participants. In the analysis, we grouped similar responses from the open-ended questions into themes. We did not start with any pre-defined themes in mind, but instead derived the themes from the open-ended answers, iterating as many times as needed until reaching a saturation point. The first iteration of coding themes was performed by the first author of the paper, and any responses the first author was unsure of were decided by discussion with the second author. Once the first two authors agreed on the themes, a virtual meeting was set with all six authors to discuss the resulting themes and come to a negotiated agreement [27]. This allowed us to remove duplicates and, in some cases, to generalize or specialize themes.\n\n\\(^2\\)10.5281/zenodo.5855808\n\n\\(^3\\)https://docs.github.com/en/github/site-policy/github-privacy-statement\nIV. RQ1: Why do developers create and maintain variants on GitHub?\n\nRQ1 aims to investigate whether new motivations for creating variant forks have changed since the advent of social coding platforms. To do so, we asked the survey participants the following questions:\n\n- **SQ1a**: Was the motivation for creating the variant an individual or a community decision?\n- **SQ1b**: What was the motivation for creating the variant of the mainline project?\n- **SQ1c**: What are the motivation details relating to the motivation in SQ1b?\n\nFor **SQ1c**, we presented a multiple choice question. **SQ1a** presented Likert-scale answer options, while **SQ1b** was an optional open-ended question. For the latter, we coded the responses into themes and categorised common themes. 
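As a concrete illustration of the variant-identification step of Section III-B, the sketch below applies the fork-filtering heuristics (at least 10 stars, at least 10 commits ahead of the mainline, at least 5 closed pull requests) via the public GitHub REST API. This is a minimal Python sketch of our own, not the tooling actually used in the study; helper names such as `candidate_variants` are ours, the README-divergence check and the manual verification step are deliberately omitted, and a real run would need an authenticated client to respect rate limits.

```python
import requests

API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add an Authorization token for real use


def is_candidate_variant(mainline: str, base_branch: str, fork: dict) -> bool:
    """Apply the fork-filtering heuristics of Section III-B to a single fork."""
    # Heuristic 1: the fork has attracted attention of its own (>= 10 stars).
    if fork["stargazers_count"] < 10:
        return False

    # Heuristic 2: the fork has genuinely diverged (>= 10 commits ahead of the mainline).
    head = f"{fork['owner']['login']}:{fork['default_branch']}"
    cmp = requests.get(f"{API}/repos/{mainline}/compare/{base_branch}...{head}",
                       headers=HEADERS).json()
    if cmp.get("ahead_by", 0) < 10:
        return False

    # Heuristic 3: the fork has its own review activity (>= 5 closed pull requests).
    prs = requests.get(f"{API}/search/issues",
                       params={"q": f"repo:{fork['full_name']} is:pr is:closed"},
                       headers=HEADERS).json()
    return prs.get("total_count", 0) >= 5


def candidate_variants(mainline: str) -> list[str]:
    """Return forks of `mainline` that pass the automated heuristics."""
    base_branch = requests.get(f"{API}/repos/{mainline}", headers=HEADERS).json()["default_branch"]
    forks = requests.get(f"{API}/repos/{mainline}/forks",
                         params={"sort": "stargazers", "per_page": 100},
                         headers=HEADERS).json()
    return [f["full_name"] for f in forks if is_candidate_variant(mainline, base_branch, f)]
```

The same structure extends to the activity filter applied to the Libraries.io candidates (pairs created before 2019-04-01 and updated after 2020-04-01), using the `created_at` and `pushed_at` fields returned for each repository.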
When quoting the survey respondents, we refer to them using [R N] notation, where N is the respondent\u2019s ID. The respondents\u2019 answers that include the selection on the multiple choice answers as well as the themes resulting from coding open-ended answers are underlined. The open-ended responses are presented in *italics*. Where applicable, we integrate and compare our findings with related research findings.\n\nA. Results\n\nFig. 2 summarises the responses for **SQ1a** and **SQ1b**. Fig. 2(a) shows that the majority of the participants responded that the decision was individual. Fig. 2(b) shows that the majority ranked highly the technical motivation for creating variants. We also see quite a number of highly ranked motivations of governance and others.\n\nWhile previous studies have investigated the motivations for creating variants, no study has investigated the details of those motivations (**SQ1c**). To identify these details, two optional open-ended questions allowed respondents to provide details on their Likert-scale answer to **SQ1b**. The two questions were (1) *Kindly provide details for your selected answer(s) on the motivation*; and (2) *If there are any links that are documented relating to your choice of answers on motivation detail, kindly point us there*.\n\n100 of the 105 survey respondents answered the optional open-ended question **SQ1c**. Luckily, during the coding process (cf. Section III-D), we were able to identify possible answers of the 5 respondents that did not answer **SQ1c** by comparing the information on the *readme.md* files of the variant and mainlines. 30 of the 105 respondents provided links to documents (pull requests, issues, and blogs) relating to their choice of answers on motivation detail.\n\nFig 3 presents a Sankey diagram summarising the details of the respondents\u2019 choice of motivation based on the coded themes. The figure presents the distribution of the responses to all questions relating to **RQ1** and how these responses relate to each other. The thickness of the edge represents the frequency of respondents between two entities.\n\nFocusing on the axes of decision and motivation, we can confirm the observations from Fig. 2(b) that the majority of respondents had an individual and technical motivation. The majority of respondents that answered the question original developers? selected none implying that the majority of the variants were started by different developers. Since the answers to **SQ1b** were presented on a Likert scale, participants were asked to rank the appropriate motivation(s) to why they created the variant. While coding the motivations details, we identified respondents who ranked highly more than one motivation category and also provided a response in the open-ended question to support each highly ranked motivation category. In this scenario, each highly ranked motivation category would have a motivation detail for the same respondent. At the end we found that 105 of the survey participants chose 145 motivation categories, of which 84 technical, 34 governance, 3 legal and 24 others. Below we present the common motivation themes and some specific responses we found very interesting.\n\n**Technical.** Maintenance is the most frequently mentioned reason for the technical motivation. 
19 of the 84 survey participants who selected technical, mentioned phrases related to performing bug/security fixes.\n\n- [R59] ranked highly both technical and governance and mentioned \u201cThe PR to merge the fork\u2019s new capabilities into the mainline code was too large, [...] and my attempts to incorporate feedback into the PR [...] ended upsetting the primary maintainer who has been studiously ignoring the pull request for three years\u201d. The respondent also provided a GitHub link to his pull request to the mainline. Indeed, we found that the PR was made in February 2018 and was accompanied by a discussion of 218 comments between the mainline maintainer and the respondent. On October 2021, the PR was still open.\n\u2022 \u201cI forked the original project in order to fix a bug. However, the way the original was architected made this very challenging, so I ended up rewriting it instead of submitting a patch to the original.\u201d [R79]\n\nThe next prominent technical motivation detail was different goals. 17 respondents who selected technical, mentioned phrases related to variants present different goals / content / communities / directions:\n\n\u2022 \u201c[We] list websites that accept Bitcoin Cash cryptocurrency, as opposed to the mainline that lists websites with 2 factor authentication.\u201d [R1]\n\n\u2022 \u201cThe original goal of the mainline is completely different from the fork variant.\u201d [R4]\n\n\u2022 \u201cWe wanted to take the project in a different direction\u201d [R100].\n\nAn equally prominent technical motivation detail was new features. 17 respondents who selected technical, mentioned phrases related to introduction of new features not in the mainline:\n\n\u2022 \u201c[...] to add support for a feature I knew would not get merged into the main project.\u201d [R53]\n\n\u2022 \u201cMainline developer only does bugfixes and eventual underlying runtime/SDK upgrades to stay current. He did not add new features due to lack of interest [...]\u201d [R67]\n\n\u2022 \u201cOur variant introduces new experimental functionality that is not yet ready for use in the mainline.\u201d [R80]\n\nAnother technical motivation was customization. 8 respondents who selected technical, mentioned phrases related to variant customizes the mainline features:\n\n\u2022 \u201cThe \u201cbones\u201d were good, but I wanted to add some aesthetics [...] so, I forked it to make it pretty and my own.\u201d [R10]\n\n\u2022 \u201cThe new version is a vectorized, accelerated version of the original.\u201d [R37]\n\n\u2022 \u201c[We] added some syntactic sugar and some improvements by itself [...]\u201d [R42]\n\nThe next technical motivation was unmaintained feature. 8 respondents who selected technical, mentioned phrases related to one of the mainline feature used by the variant is no longer maintained.\n\n\u2022 \u201cThe \u2018shiny\u2019 component of mainline was declared to be no longer maintained around the time I created our fork. [...] I did not like many of the architectural decisions of the original project, I opted to create a fork instead of volunteer to maintain the original.\u201d [R65]. The respondent provided an extra link. An issue about \u2018shiny\u2019 component was opened up in July 2015 and closed in July 2017. The issue contained 93 comments from 35 participants. When closing the issue the maintainer stated that \u201c[...] If somebody or bodies from the community wants to fork the source code and run with it, they have my blessing [...]\u201d. 
The variant was created on August 2017.\n\n\u2022 \u201cThe mainline project had made a radical shift from providing one set of features to a different, disjoint set of features. The maintainer had thought about it very well, but some users (including myself) had built their workflows around one of the old features. For this reason, I lifted that particular feature into a separate project that was also published under a different name to the package index.\u201d [R23]. The respondent also provided us a GitHub issue link, discussing the details. The issue was opened by the variant maintainer on July 2015 and was eventually closed on April 2018. The issue had 33 comments involving 17 participants.\n\n\u2022 \u201cMainline dropped support for a small subset of the code and asked for community support to create a fork to support that subset\u201d [R66].\n\nA final technical motivation was technology. 7 respondents who selected technical, mentioned phrases related to variant created to depend on a different technology.\n\n\u2022 \u201cAdded support for Open Street Maps as an available map provider [...] mainline was not willing to accept this kind of contribution.\u201d [R8]. This was also ranked as a governance.\n\u2022 \u201cThe mainline wasn\u2019t updated to use .NET Core which I was using in my project, so I updated it\u201d [R29]\n\u2022 \u201c[...] to keep the source code compatible with the language/compiler version that we use (Swift / Xcode). [...] if the maintainer of the mainline is supporting a different one, then we could not compile our dependency anymore.\u201d [R54]\n\n**Governance.** After technical, governance is the secondmost popular motivation, with responsiveness being the most prominent governance category. 18 of the 34 respondents who selected governance mentioned phrases related to mainline was unresponsive to pull requests or issues for a long time. Most of the respondents that ranked governance highly as their motivation, also ranked other options of motivations highly. Only 4 of the 34 ranked only governance.\n\n\u2022 \u201c[They] had a series of commits that fixed functionality for newer PHP versions, but never made into a release. After waiting for more than a year for a release, a fork was done just to push a newer release into Composer/Packagist.\u201d [R21]\n\u2022 \u201cWe submitted some bug fixes [...], but didn\u2019t hear back from the maintainer for a while and needed to progress to meet our own goals so we forked. I followed up over email with the maintainer and he merged the patches about a month later, at which point we closed down and archived our fork and returned to using the mainline.\u201d [R15]. Merging back to the original corresponds to one the outcomes of variant forking reported in [2].\n\u2022 \u201c[...] due to lack of response from mainline maintainer (more than months) and need of release. This lead to release of a new variant. [...] there is no intention to submit changes to mainline anymore (even when the first PR was merged into mainline after more than year).\u201d [R56]\n\nThe next governance motivation was feature acceptance. 15 respondents who selected governance, mentioned phrases related to mainline hesitant to or not willing to accept feature.\n\n\u2022 \u201cTECHNICAL: Added support for Open Street Maps as an available map provider. GOVERNANCE: not exactly governance, but mainline was not willing to accept this kind of contribution\u201d [R8]. This was coded as technology in technical. 
The respondent also provided a GitHub PR link containing extra information. The PR included 45 conversations and 15 participants between June 2018 until March 2021 when it was closed.\n\u2022 \u201cMainline was not ready to accept those changes in part because the maintainers were not responsive. Since that time all of the issues have been dealt with and my variant is no longer needed, though the infrastructure for creating a new release of the variant remains in place in the event that it might be needed in the future.\u201d [R44]\n\u2022 \u201c[...] even main repo maintainer was saying he is busy and please use your fork for thing X and Y. We don\u2019t know the exact reason why he stopped maintaining it and also did not allow us to maintain his repo\u201d [R89]. In one of the multiple choice answers, the respondent indicated that the variant was created through a community decision. The respondent also provided an extra link, revealing that three contributors from the community were interested in a couple of new features that were missing in mainline, but the mainline maintainer seemed busy. At the end, two members of the community took over the fork maintenance and introduced the missing features and advertised the additions in the readme.md file of the fork as well as in the issue.\n\n**Others.** The most prominent motivation for others is supporting personal projects. 8 of the 24 respondents who selected others mentioned phrases related to variant was created to support personal projects.\n\n\u2022 \u201c[The] maintainer was not interested in a PR that added functionality needed by a project I\u2019m developing. [It] was considerably easier to add the logic into the [new] library than bolt it on.\u201d [R18]. This was ranked as technical, governance, and others. As we can see in the participant response we have phrases like \u201cadding logic\u201d (new features, technical), \u201cwas not interested in a PR\u201d (feature acceptance, governance), and \u201cfunctionality needed by a project I\u2019m developing\u201d (supporting personal projects, others).\n\u2022 \u201cIn Oct 2017 [...] has changed its API and these changes broke the mainline project. I used this project daily and needed to fix it ASAP. After quick fix I started to add my own features. [...] the mainline project has been fixed and refactored, but my other projects were already depending on my own fork.\u201d [R56]\n\u2022 \u201c[...] to make sure that no matter what happen to the mainline repository, we can maintain source access to this library, which is an essential dependency of our project. [...]\u201d [R54]. This response is in line with Nyman et al. [1] who reported that forking provides a mechanism for safeguarding against despotic decisions by the project lead, who is thus guided in their actions to consider the best interest of the community.\n\nThe next motivation for others was supporting mainline, which was mentioned by 7 respondents who selected others:\n\n\u2022 \u201cWe have a fork that is the \u201cmain fork\u201d, which is [...], and the \u201cdevelopment fork\u201d is [FORKNAME]. In this case, our modeling tool [...] is only maintained as the fork [...] we synchronize everything between both forks while the [FORKNAME] one is mainly used to develop new features, which are then pushed as PRs to the main fork.\u201d [R61]\n\u2022 \u201cPreparation of mainline pull requests. mainline repo should not be spammed by WIP PRs by students. 
Supervisors do coaching and try to improve the quality by the initial mainline pull request. [...] Keeping the PR open on the fork, reduces the number of PRs.\u201d [R73]\n\u2022 \u201cWe needed a repository for tracking our ideas to keep the number of issues of the main repository low.\u201d [R83]. The extra link that was provided revealed that the mainline and variant are owned by the same developer: \u201cthis repository is used by [X] to make his ideas transparent. He collects the issues here to avoid flooding the \u201cofficial\u201d issue tracker. - Refined issues will be migrated to the official issue tracker\u201d.\n\nThe next motivation detail for others was code quality. 3 respondents who selected others, mentioned phrases related to mainline low code quality.\n\u201cThe mainline [...] was clearly written by someone who isn\u2019t a professional software engineer.\u201d [R63]\n\n\u201cThe way the original was architected made this very challenging, so I ended up rewriting it instead of submitting a patch to the original.\u201d [R79]\n\n**Legal.** The motivation of legal was least popular, corresponding to only 3 of the 105 respondents that indicated phrases related to closed source. Below we present their corresponding responses.\n\n- \u201c[The] main reason is creating [an] open source and commercial product which has much more features\u201d [R7].\n This motivation detail was also categorised as: (new features, technical) and (supporting personal projects, others).\n\n- \u201c5 years ago the permissions model for GitHub and Travis is not what it is today. I wanted to use Travis but if I granted Travis access to my primary github account, it would have read access to all the github repos [...], which would expose private customer code. I forked the repo [but] the permissions model has evolved [and I] deleted the fork\u201d [R24].\n\n- \u201cThe founders of the mainline had been absent from the project for several years, but came back and booted the maintainers off and [...] shifted the project to a closed source.\u201d [R36]. The respondent provided a link with extra information showing that three of the maintainers that were booted from the original project and a fourth one from the community joined forces and are now maintaining the variant. The variant currently has over 739 stars, is used by 35 developers, has 101 pull requests and 195 issues.\n\n**B. Discussion and Implications**\n\n**RQ1** mainly focused on determining the motivations for creating and maintaining variants, especially those that are actively being maintained in parallel with their mainline counterparts. We identified that the decision to create the variants is mostly initiated by individuals and less by the community. Our observations thereby confirm the findings in the literature. Our study also extends the state-of-the-art by providing fine-grained reasons for creating and maintaining variants relating to the reported motivations. Furthermore, our study revealed new reasons that have not been reported in literature (categorised as others in our survey) which include: 1) supporting the mainline, 2) variant supporting other personal projects, 3) localization purposes and 4) variant developers not trusting the code quality of the mainline. The reported findings are very useful to guide follow-up studies in investigating the co-evolution of mainline and variant projects.\n\nFig. 3 presented an overview of how the detailed motivations relate to who is involved in creating and maintaining the variants. 
The motivations majorly related to developers outside the core contributors of the mainlines (82%). We also observed quite a significant number of respondents (24%) reporting that the decision to create the variant was initiated by the community. We observed from the open-ended responses that, before the transition from social to variant fork, some variant maintainers engage with the mainline maintainers through discussions in issues and pull requests. This is inline with the Zhou et al. who reported that many variant forks start as social forks [8].\n\nBesides the motivations for creating and maintaining variants, the respondents reported some interesting software reuse practices by the variants, like those categorized in the themes of: different goals, new features, customization, technology, supporting personal projects, supporting upstream, localization. A specific example of [R70] categorized in the different goals theme, stated that in the cryptocurrency world, all applications inherit code from the mother project bitcoin/bitcoin. Downstream applications also monitor their immediate upstream and other in the hierarchy for important updates like bug and security fixes as well as other specific updates. These cryptocurrency applications can be considered as a software family [21] or software ecosystem [28]. Variants are also likely to occur in other dedicated software ecosystems like Eclipse, Atom, Emacs, software library distributions for Java, C, C++, Python, Go, Ruby, and OS distributions for macOS, Linux, Windows, and iOS. To this end, our study opens up different research directions that can aim at deeply investigating different reuse practices in software families and software variants. A deeper understanding of these reuse practices can aid in developing tools that can support more effective software reuse.\n\n**Summary \u2013 RQ1:** Many variant forks start as social forks. The decision to create/maintain the forks is either community-driven (contributing up to 24%) or individual (76%). The majority of the developers (82%) creating the forks are not maintainers of the mainlines. We identified 18 variant creation/maintenance motivation details categorized in the motivations of technical (accounting 58% of the responses), governance (24%), others (16%) and legal (2%). The detailed motivations in the others category are newly introduced since the social coding era.\n\n**V. RQ2: How do variant projects evolve with respect to the mainline?**\n\n**RQ2** aims to identify the impediments for co-evolution between the mainline and variant projects. This question lead to two specific focuses reflecting the who and the how, respectively. The who focus aimed at identifying who are the developers involved in maintaining variants. The how aimed to understand how variant forks evolve w.r.t. the mainline. As for RQ1 we refer to the responses using underlined, italics and [R.N].\n\n**A. Results for the \u201cwho?\u201d focus**\n\nTo understand who is creating and maintaining variant forks, we asked two multiple-choice questions:\n\n- **SQ**\\(_b^2\\): How many of the original developers of the mainline maintained the variant in its first 6 months?\n- **SQ**\\(_b^1\\): Do the variant and mainline have common active maintainers?\n\nFig. 4(a) and Fig. 4(b) summarise the answers to **SQ**\\(_b^2\\) and **SQ**\\(_b^1\\), respectively. 
The majority of the respondents chose the options of none for **SQ**\\(_b^2\\) (none of the creators of the variant were part of the mainline) and no for **SQ**\\(_b^1\\) (they do not have common active maintainers). This implies that\nmost developers involved in the creation and maintenance of variants are not core maintainers of the mainline from where the variant was forked. Fig. 3 reveals the difference in the numbers of participants who selected none for $SQ^2_a$ and no for $SQ^2_b$. Focusing at how responses of $SQ^2_a$\u2014original developers? and $SQ^2_b$\u2014common active maintainers? are associated, one can observe that most respondents that selected option none in $SQ^2_a$ went ahead to select option no in $SQ^2_b$. Other associations between responses of $SQ^2_a$ and $SQ^2_b$ can be observed as well.\n\nAnecdotally, [R36] responded to $SQ^2_a$ that 6\u201310 developers from the mainline were involved in the creation of the variant, and responded to $SQ^2_b$ with the option yes & no\u2014\u201cThey used to have common maintainers in the early stages of the variant, but now the projects have technically diverged away from each other, there are no more common maintainers\u201d. Respondents [R51] and [R57] selected for $SQ^2_a$ the options 6\u201310 and 2\u20135 respectively, while selecting the option no for $SQ^2_b$. This implies that at least two maintainers involved in fork creation are not (or no longer) contributing to the mainline.\n\nSummarising our observations for $SQ^2_a$ and $SQ^2_b$, we conclude that **variant forks are created and maintained by developers different from those in the mainline counterparts.** This observation concurs with the earlier findings of Businge et al. [10].\n\n### B. Results for the \u201chow?\u201d focus\n\nTo understand how variant forks evolve w.r.t. the mainline, we asked two additional questions:\n\n**$SQ^2_c$: Do the variant forks and the upstream still discuss the main directions of the project?**\n\n**$SQ^2_d$: Do the variant developers integrate changes to and from the upstream repository?**\n\nFor $SQ^2_c$ we presented four multiple choice answer options, corresponding to the first four answers reported in Fig. 5, gathering the highest number of responses. We allowed respondents to provide an open-ended answer if they felt that their choice was not among the four proposed options. The open-ended answers were coded into themes (listed in Fig. 5 from variant follows mainline\u2192to variant is a mirror of the mainline). Fig. 5 shows that more than half of the respondents chose the option of never (corresponding to: no, there has never been any discussion since the creation of the variant). Even if there was some discussion, 10.7% of the respondents signal that they technically diverged (corresponding to: \u201cThey used to discuss but not anymore since the projects have technically diverged from each other\u201d). The open-ended answers also revealed variant responses that do not discuss the directions of the project, like mainline hostile to variant, not very active, in contact but rarely discuss and only once.\n\nAn explanation for the high number of variant developers that do not discuss with the mainline developers about the project direction can be derived from the findings of $SQ^2_a$ and $SQ^2_b$. The majority of the variants are created and maintained by developers that are not core developers of the mainline. Also, most of the motivation details in RQ1 could explain the high numbers of never. 
For example we observed that the majority of the variants in the motivation details category of different goals, unmaintained features in the mainline, those having issues with the mainline responsiveness, those whose features will not be accepted by the mainline (feature acceptance), selected never in $SQ^2_c$. We conclude that the reasons for the majority of variant forks not to discuss the project directions with the mainline could be attributed to a diverging range of motivations for creating the variant as well as to the variant creators not being part of the mainline\u2019s core development team.\n\nAnecdotally, 5 respondents indicated phrases related to variant follows mainline. Respondent [R77] indicated that \u201cin the crypto world, the mainline inherits changes from BITCOIN, for example, security commits, and the variant merges those changes in. So the variant is very interested in every change in the Mainline. However, the variant must maintain the specific new features that we added separately, and the Mainline is not interested in helping the Variant do this.\u201d We also observed two interesting cases where the variants merged back to the\nmainline. This is in line with Robles and Gonz\u00e1lez-Barahona [2] who reported that one of the outcomes of forking is the fork merging back.\n\nFor $SQ_2$ we asked respondents two closed-ended questions: (1) How often do the maintainers of the variant integrate the following types of changes from the mainline?; and (2) How often do the maintainers of the variant integrate the following types of changes into the mainline?. We provided Likert-scale options for the two questions. We presented optional follow-up questions with open-ended answers, for each of the two questions, allowing respondents to provide extra information.\n\nFig. 6(a) presents the answers from the respondents on what they value most when integrating changes back from the mainline. The highly scored changes are bug fixes and security fixes. One can observe that most respondents were leaning towards the negative side of the Likert scale, implying that most variants are not interested in integrating changes from the mainline. Fig. 6(b) focuses on integrations from variants towards the mainline. We observe a similar trend to Fig. 6(a), with an even more pronounced negative inclination.\n\nFig. 6(c) and Fig. 6(d) present the coded themes of the extra information gathered from the open-ended answers corresponding to the results in Fig. 6(a) and Fig. 6(b), respectively. Fig. 6(c) summarises the results of 28 respondents who provided the extra information, while Fig. 6(d) summarises the results of only 17 respondents, most likely because most variants do not submit changes to the mainline. The most prominent response in Fig. 6(c) was related to being kept in sync, signaling the desire of variants to keep in sync with the changes made in the mainline. The next prominent response was related to occasionally pull from mainline implying that variants from time to time pull changes made in the mainline. Some respondents mentioned phrases related to specific changes are pulled; for example, [R63] indicated that \u201cIt\u2019s mostly changes that make the library for specific iRobot Roomba models (new ones for example)\u201d. Other respondents mentioned phrases related to everything except specific changes; for example, [R48] mentioned that \u201cAll non-compiler specific changes are pulled\u201d. In Fig. 
6(d) there were two prominent answers: PRs are suggested, for example, \u201cMade PRs with changes but those have just been ignored. They\u2019re still \u201copen\u201d with 0 comments from the mainline dev\u201d [R67]. The other prominent answer is changes are out of scope, for example, \u201cWe use this as a dependency in another project [. . . ] which is often diverging from the language version of the mainline, so there is little reason for us to push this to mainline\u201d [R54].\n\nC. Discussion and Implications\n\nThe results of RQ2 revealed that variants are created and maintained by developers who are not core developers of the mainline. We also observed limited interaction between the mainline and its variant(s). Although there is little code integration overall, integration from mainline to variant is more frequent than from variant to mainline. Our study confirms and extends the findings of Businge et al. [10]: we provide concrete reasons for the limited integration between mainlines and variants, which include:\n\n1) technical divergence: the variant and the mainline pursue different goals, rely on different technologies, or the variant maintains a part of the mainline that has been frozen;\n2) governance disputes: mainlines are unresponsive to pull requests and issues from the variants, or are unwilling or hesitant to accept some features from the variants. One respondent also reported that the mainline is actively hostile to variants as a result of the mainline\u2019s license changing to proprietary;\n3) distinct developers: most variants are maintained by developers who are not part of the core team of the mainline. Furthermore, we observed that the few mainline\u2013variant pairs that do interchange code are mostly interested in patch sets (security fixes and bug fixes).\n\nAlthough maintenance and collaboration have improved through dedicated tooling, especially distributed version control systems like Git [29] and transparency mechanisms on social coding platforms like GitHub [30], these tools are only ideal for social forks, which aim to sync all changes between repositories. For example, code integration using pull requests and git tools like merge/rebase may not be the best fit for integrating changes between mainline and variant forks, since these mechanisms sync the upstream/downstream with all changes missing in the current branch.\n\nThis study reveals that some variant maintainers are only interested in integrating commits with specific changes. A suitable integration mechanism would be commit cherry-picking, since developers can choose the exact commits they want to integrate. However, GitHub\u2019s current setup does not make it easy to identify commits to cherry-pick without digging through the branch\u2019s history for relevant changes since the last code integration. Additionally, even though the variants have diverged from their mainlines, we believe that, since they share common code, some of that shared code will continue to receive maintenance in the form of bug and security fixes; the sketch below illustrates how such a fix could be integrated selectively. 
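To make this selective-integration scenario concrete, the following sketch (our illustration, not a tool from the study) shows how a variant maintainer might scan for unintegrated upstream commits that look like fixes and cherry-pick them individually, instead of merging or rebasing the whole divergent history. The local path, the `upstream` remote name, the `main` branch name, and the keyword list are all assumptions of the sketch.

```python
from git import Repo  # GitPython

# Hypothetical local clone of the variant, with the mainline configured as the
# "upstream" remote.
repo = Repo("/path/to/variant")
repo.remotes.upstream.fetch()

# Commits that exist in the mainline but not yet in the variant's current branch.
unintegrated = repo.iter_commits("HEAD..upstream/main")

# Keep only commits whose message suggests a bug or security fix -- the change
# categories that respondents said they value most (cf. Fig. 6(a)).
KEYWORDS = ("fix", "security", "cve", "vulnerability")
candidates = [c for c in unintegrated
              if any(k in c.message.lower() for k in KEYWORDS)]

for commit in candidates:
    print(f"candidate fix: {commit.hexsha[:10]} {commit.summary}")
    # After review, the maintainer selectively integrates the commit:
    # repo.git.cherry_pick(commit.hexsha)  # may still require conflict resolution
```

Keyword matching on commit messages is, of course, a crude stand-in for the commit-recommendation tooling that the discussion above calls for, but it captures the essence of pulling in only the shared fixes while leaving the rest of the divergent history untouched.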
Since these mainline\u2013variant repository pairs are being maintained by uncommon developers, chances are that these fixes could be missed or they could be fixed at different times by different developers, resulting in duplicated effort.\n\nOur findings are very relevant to code integration tool builders between mainline and variants to prioritise certain categories of mainline\u2013variant pairs by targeting specific changes. Ideally, a tooling would help identify possibly important fixes in commits and recommend these commits to mainline or variant developers to support a more efficient reuse. Some promising studies in this direction have focused on providing the mainline with facilities to explore non-integrated changes in forks to find opportunities for reuse [31] and cross-fork change migration [32]. More experimental ideas have focused on virtual product-line platforms for unified development of multiple variants of a project [33]\u2013[37].\n\n**Summary\u2013RQ2**: Variant forks do not usually interact with the mainline during their co-evolution. The lack of interaction could be attributed to a variety of reasons including: (i) technical divergence, where variants and mainlines are offering different features or implementing different technologies having nothing to share; (ii) governance disputes, where mainlines are unresponsive to the requests from community and also uninterested in some features suggested by the community; (iii) distinct development teams that no longer interact; (iv) diverging licenses, where the mainline variant has changed the license and integration is no longer possible. As a result of these divergences, it is likely that important security or patch updates could be missed or are duplicated.\n\n**VI. Threats to Validity**\n\n**Construct validity**. The response categories for the closed questions in the survey originated from a thorough literature review. The questions were carefully phrased to avoid biasing the respondent towards a specific answer. We validated the questions by consulting seven colleagues from three different universities and through trial runs of the survey with seven participants. Social desirability bias may also have influenced the answers [38]. To mitigate this issue, we informed participants that the responses would be anonymous and evaluated in a statistical form.\n\n**Internal validity**. We used an open coding process to classify the participants responses received from open-ended questions. The coding process is known to lead to increased processing and categorization capacity at the loss of accuracy of the original response. To alleviate this issue lack of accuracy, we allowed more than one code to be assigned to the same answer.\n\n**Generalizability**. Our study is limited to variants of mainline repositories that are hosted on GitHub. We do not claim that our findings generalize to other social coding platforms. In addition, the set of participants we interviewed corresponds to those who decided to make their e-mail public and who accepted to take part in our study. As such, they are not de facto representative of all maintainers of variant forks.\n\n**VII. Conclusions**\n\nThanks to social coding platforms like GitHub, software reuse through forking to create variant projects is on the rise. 
We carried out an exploratory study with 105 maintainers of variants, focusing on answering two key research questions:

1) **Why do developers create and maintain variants on GitHub?** We observed that the motivations reported by studies carried out in the pre-GitHub era still hold. We identified 18 motivation details for variant creation and maintenance, categorized into technical (58% of the responses), governance (24%), other (16%), and legal (2%) motivations. Some of these motivations are newly introduced in the social coding era.

2) **How do variant projects evolve with respect to their mainlines?** We found little interaction between variants and their mainlines during their co-evolution and reported possible impediments behind this lack of interaction. These include: (i) technical (i.e., diverging features), where variants and mainlines pursue different goals or implement different technologies and have nothing to share; (ii) governance (i.e., diverging interests), where mainlines are unresponsive to requests from the community and uninterested in some features the community suggests; (iii) legal (e.g., diverging licenses), where the mainline has changed its license and integration is no longer possible.

Our findings are useful for guiding follow-up studies investigating the co-evolution and reuse practices between mainlines and variants. A deeper understanding of these practices can aid code integration tool builders in developing tools to support more effective software reuse between mainline projects and their variant forks.

**Acknowledgment**

This work is supported by the joint FWO-Vlaanderen and F.R.S.-FNRS Excellence of Science project SECO-ASSIST under Grant number O.0157.18F-RG43.

REFERENCES

[1] L. Nyman, T. Mikkonen, J. Lindman, and M. Fougère, "Perspectives on code forking and sustainability in open source software," in Open Source Systems: Long-Term Sustainability, 2012, pp. 274–279.

[2] G. Robles and J. M. González-Barahona, "A comprehensive study of software forks: Dates, reasons and outcomes," in Open Source Systems: Long-Term Sustainability, 2012, pp. 1–14.

[3] R. Viseur, "Forks impacts and motivations in free and open source projects," International Journal of Advanced Computer Science and Applications, vol. 3, no. 2, February 2012.

[4] L. Nyman and J. Lindman, "Code forking, governance, and sustainability in open source software," Technology Innovation Management Review, vol. 3, pp. 7–12, January 2013.

[5] L. Nyman and T. Mikkonen, "To fork or not to fork: Fork motivations in SourceForge projects," in Open Source Systems: Grounding Research, 2011, pp. 259–268.

[6] J. Gamalielsson and B. Lundell, "Sustainability of open source software communities beyond a fork: How and why has the LibreOffice project evolved?" Journal of Systems and Software, vol. 89, pp. 128–145, 2014.

[7] G. Gousios, M. Pinzger, and A. van Deursen, "An exploratory study of the pull-based software development model," in International Conference on Software Engineering, 2014, pp. 345–355.

[8] S. Zhou, B. Vasilescu, and C. Kästner, "How has forking changed in the last 20 years? A study of hard forks on GitHub," in International Conference on Software Engineering. ACM, 2020, pp. 268–269.

[9] C. Sung, S. K. Lahiri, M. Kaufman, P. Choudhury, and C.
Wang, "Towards understanding and fixing upstream merge induced conflicts in divergent forks: An industrial case study," in International Conference on Software Engineering. ACM, 2020, pp. 172–181.

[10] J. Businge, M. Openja, S. Nadi, and T. Berger, "Reuse and maintenance practices among divergent forks in three software ecosystems," Journal of Empirical Software Engineering, 2021.

[11] A. St. Laurent, Understanding Open Source and Free Software Licensing. O'Reilly Media, 2008.

[12] B. B. Chua, "A survey paper on open source forking motivation reasons and challenges," in Pacific Asia Conference on Information Systems, 2017.

[13] J. Dixon, "Different kinds of open source forks: Salad, dinner, and fish," https://jamesdixon.wordpress.com/2009/05/13/different-kinds-of-open-source-forks-salad-dinner-and-fish/, 2009.

[14] N. A. Ernst, S. M. Easterbrook, and J. Mylopoulos, "Code forking in open-source software: A requirements perspective," ArXiv, vol. abs/1004.2889, 2010.

[15] L. Nyman, "Hackers on forking," in The International Symposium on Open Collaboration, 2014, pp. 1–10.

[16] E. S. Raymond, The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O'Reilly Media, Inc., 2001.

[17] P. Bratach, "Why do open source projects fork?" https://thenewstack.io/open-source-projects-fork/, 2017.

[18] J. Jiang, D. Lo, J. He, X. Xia, P. S. Kochhar, and L. Zhang, "Why and how developers fork what from whom in GitHub," Empirical Softw. Engg., vol. 22, no. 1, pp. 547–578, Feb. 2017.

[19] D. A. Wheeler, "Forking," https://dwheeler.com/oss_fs_why.html#forking, 2009, revised as of July 18, 2015.

[20] T. de Raadt, "Theo de Raadt's dispute w/ NetBSD," https://zeus.theos.com/deraadt/coremail.html, 2006, retrieved October 2021.

[21] J. Businge, M. Openja, S. Nadi, E. Bainomugisha, and T. Berger, "Clone-based variability management in the Android ecosystem," in International Conference on Software Maintenance and Evolution. IEEE, 2018, pp. 625–634.

[22] U. Flick, An Introduction to Qualitative Research. London: Sage Publications, 2014.

[23] J. Businge, A. Decan, A. Zerouali, T. Mens, and S. Demeyer, "An empirical investigation of forks as variants in the npm package distribution," in The Belgium-Netherlands Software Evolution Workshop, ser. CEUR Workshop Proceedings, vol. 2912. CEUR-WS.org, 2020.

[24] J. Businge, M. Openja, D. Kavaler, E. Bainomugisha, F. Khomh, and V. Filkov, "Studying Android app popularity by cross-linking GitHub and Google Play store," in International Conference on Software Analysis, Evolution and Reengineering, 2019, pp. 287–297.

[25] J. Businge, S. Kawuma, E. Bainomugisha, F. Khomh, and E. Nabaasa, "Code authorship and fault-proneness of open-source Android applications: An empirical study," in PROMISE, 2017.

[26] T. Zimmermann, "Card-sorting: From text to themes," in Perspectives on Data Science for Software Engineering. Elsevier, 2016, pp. 137–141.

[27] D. Garrison, M. Cleveland-Innes, M. Koole, and J. Kappelman, "Revisiting methodological issues in transcript analysis: Negotiated coding and reliability," The Internet and Higher Education, vol. 9, pp. 1–8, 2006.

[28] A. Decan, T. Mens, and P.
Grosjean, "An empirical comparison of dependency network evolution in seven software packaging ecosystems," Empirical Softw. Engg., vol. 24, no. 1, pp. 381–416, Feb. 2019.

[29] C. Rodríguez-Bustos and J. Aponte, "How distributed version control systems impact open source software projects," in Working Conference on Mining Software Repositories. IEEE, 2012, pp. 36–39.

[30] L. Dabbish, C. Stuart, J. Tsay, and J. Herbsleb, "Social coding in GitHub: Transparency and collaboration in an open software repository," in Conference on Computer Supported Cooperative Work, 2012, pp. 1277–1286.

[31] L. Ren, S. Zhou, and C. Kästner, "Poster: Forks insight: Providing an overview of GitHub forks," in The International Conference on Software Engineering: Companion (ICSE-Companion), 2018, pp. 179–180.

[32] L. Ren, "Automated patch porting across forked projects," in Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 1199–1201.

[33] M. Antkiewicz, W. Ji, T. Berger, K. Czarnecki, T. Schmorleiz, R. Lämmel, Ş. Stănciulescu, A. Wąsowski, and I. Schaefer, "Flexible product line engineering with a virtual platform," in Companion of the International Conference on Software Engineering, 2014, pp. 532–535.

[34] S. Fischer, L. Linsbauer, R. E. Lopez-Herrejon, and A. Egyed, "Enhancing clone-and-own with systematic reuse for developing software variants," in International Conference on Software Maintenance and Evolution, 2014, pp. 391–400.

[35] L. Montalvillo and O. Díaz, "Tuning GitHub for SPL development: Branching models & repository operations for product engineers," in International Conference on Software Product Lines, 2015, pp. 111–120.

[36] J. Rubin and M. Chechik, "A framework for managing cloned product variants," in International Conference on Software Engineering. IEEE, 2013, pp. 1233–1236.

[37] Ş. Stănciulescu, T. Berger, E. Walkingshaw, and A. Wąsowski, "Concepts, operations, and feasibility of a projection-based variation control system," in International Conference on Software Maintenance and Evolution (ICSME), 2016, pp. 323–333.

[38] A. Furnham, "Response bias, social desirability and dissimulation," Personality and Individual Differences, vol. 7, no. 3, pp. 385–400, 1986.
{"id": "ef7bc2527c08dab159ba96939b946b32aa15f785", "text": "Who, What, Why and How? Towards the Monetary Incentive in Crowd Collaboration: A Case Study of Github\u2019s Sponsor Mechanism\n\nXunhui Zhang, Tao Wang*, Yue Yu*, Qiubing Zeng, Zhixing Li, Huaimin Wang\n{zhangxunhui,taowang2005,yuyue,lizhixing15}@nudt.edu.cn,qiubingzeng@gmail.com,whm_w@163.com\nNational University of Defense Technology\nChangsha, Hunan, China\n\nABSTRACT\nWhile many forms of financial support are currently available, there are still many complaints about inadequate financing from software maintainers. In May 2019, GitHub, the world\u2019s most active social coding platform, launched the Sponsor mechanism as a step toward more deeply integrating open source development and financial support. This paper collects data on 8,028 maintainers, 13,555 sponsors, and 22,515 sponsorships and conducts a comprehensive analysis. We explore the relationship between the Sponsor mechanism and developers along four dimensions using a combination of qualitative and quantitative analysis, examining why developers participate, how the mechanism affects developer activity, who obtains more sponsorships, and what mechanism flaws developers have encountered in the process of using it. We find a long-tail effect in the act of sponsorship, with most maintainers\u2019 expectations remaining unmet, and sponsorship has only a short-term, slightly positive impact on development activity but is not sustainable. While sponsors participate in this mechanism mainly as a means of thanking the developers of OSS that they use, in practice, the social status of developers is the primary influence on the number of sponsorships. We find that both the Sponsor mechanism and open source donations have certain shortcomings and need further improvements to attract more participants.\n\nCCS CONCEPTS\n\u2022 Computer systems organization \u2192 Embedded systems; Redundancy; Robotics; \u2022 Networks \u2192 Network reliability.\n\nKEYWORDS\nsponsor, donation, GitHub, open source, financial support\n\n1 INTRODUCTION\nOpen source development has brought prosperity to software ecosystems. Its characteristics of distributed coordination, free participation, and convenient sharing have led to the emergence of myriad open source projects, large-scale participation of developers, and continuous development of high-quality projects. However, the expansion of project scales has also brought challenges for software maintenance, such as continuously and rapidly increasing feature requests and bug fix reports [37] and an increasing pull request review workload [69]. Although there are many continuous integration (CI) tools and continuous deployment (CD) tools to help reduce the workload of project managers, the complicated and high-pressure maintenance work still subjects them to stress [66]. Past studies have shown that most current open source work is still spontaneously performed by volunteers [22]. They engage in open source work as a hobby, to improve their personal reputations or to learn new technologies. These intrinsic benefits motivate volunteers to make open source contributions [21]. 
However, many core managers and software maintainers would like to secure funding from others for their open source work because of the aforementioned challenges, thereby alleviating the related mental pressure and financial burdens [5, 57, 67].\n\nAt present, there are many ways in which the open source sphere obtains financial support, such as crowdfunding on Kickstarter, project donations on OpenCollective, and issue rewards on BountySource and IssueHunt [49]. However, these are mainly web portals serving open source contributors active in other social coding communities. The separation of development activities and financial support brings problems. First, it is difficult for sponsors to find active developers and open source projects in the open source community. Second, open source contributors need to spend considerable effort on maintaining the financial support platform. In May 2019, GitHub, the world\u2019s most popular software hosting platform, launched the Sponsor mechanism, characterized by deep integration of financial support and the social coding platform. While the Sponsor mechanism supports sponsorship of organizations and projects, it targets mainly individual contributors in the GitHub community. Therefore, unlike past related studies [52, 53], we can explore donation mechanism in the open source sphere from the perspective of individual developers. In this context, this paper aims to explore donation in the open source sphere using the Sponsor mechanism as an example. We conducted an empirical study based on mixed methods and answered the following research questions.\n\nRQ1 Why do individuals participate or not in the Sponsor mechanism?\nFrom the feedback of GitHub developers, we summarized eight reasons for participation among sponsored developers, six reasons for participation among sponsors, and six reasons for not participating in the mechanism among other individuals. The main reason that participants used the Sponsor mechanism was its relationship with open source software (OSS) usage. The main reason for not participating was that developers did not need sponsorship or that they were driven to participate in open source development because of its nonmonetary character. Our findings can help optimize\nthe Sponsor mechanism and attract more participants by satisfying the different motivations of contributors.\n\n**RQ2 How effective is sponsorship in motivating developer OSS activity?**\n\nWe find through quantitative analysis that the sponsor mechanism has provided only a short-term, subtle boost to contributors\u2019 activities. According to the results of the qualitative analysis, most developers agree that sponsorship can provide them with motivation but are not satisfied with the available amounts. In contrast, most sponsors are satisfied with the current mechanism. Our findings shed light on the application of the Sponsor mechanism in the open source sphere and the problems surrounding it. This work helps to rationalize the mechanism to promote greater participation in open source contributions among developers.\n\n**RQ3 Who is likely to receive more sponsorship?**\n\nThe questionnaire results show that making useful OSS contributions and being active are the most critical factors for obtaining more sponsorship. However, according to the quantitative data analysis results, the factor that most affects sponsorship is the developer\u2019s social status in the community. 
Our findings can provide actionable suggestions for developers seeking more sponsorships, while the conflicting results also illuminate the problems with OSS donations.\n\n**RQ4 What are the shortcomings of the Sponsor mechanism?**\n\nThe research reveals that problems with the mechanism include usage deficiencies, object orientation with supported functions, and personalization. Many developers complain that the donations do not apply to open source ecosystems. A more relevant mechanism is needed to promote the healthy and sustainable development of the ecosystem.\n\nThe contributions of this paper are as follows:\n\n- To the best of our knowledge, this is the first in-depth study that comprehensively analyzes the GitHub Sponsor mechanism.\n- We quantitatively and qualitatively analyze the Sponsor mechanism along four dimensions, including developers\u2019 motivation to participate (why), the mechanism\u2019s effectiveness (how), the characteristics of developers who obtain more sponsorships (who), and the mechanism\u2019s shortcomings (what).\n- We provide actionable suggestions to help developers participating in the Sponsor mechanism obtain more sponsorship and feasible advice for improving the mechanism\u2019s effectiveness.\n\nThe remainder of this paper is organized as follows. Section 2 presents the related work, and Section 3 describes the background of the GitHub Sponsor mechanism. Section 4 presents the study design of this paper. In Section 5, we describe the results for each research question. Then, we discuss the findings in Section 6, and describe the threats in Section 7. Finally, in Section 8, we conclude the paper and describe future work.\n\n## 2 RELATED WORK\n\nOpen Innovation in Science (OIS) is a concept, which unifies the two domains of open and collaborative practices in science, i.e., open science (OS) and open innovation (OI) [6]. For OS, the three pillars are accessibility, transparency, and inclusivity, among which the inclusivity (e.g., citizen science) is directly related to the knowledge production process. For OI, various forms of collaborative practice exist, including crowdsourcing, OSS development, etc. Regarding these open initiatives, the motivation and incentives of participation has always been the focus of continuous research [4, 70]. Although there are different views on the relationship between citizen science, crowdsourcing, and OSS development, we follow the relationships described above and present the related work on participation motivation and monetary incentives of the three parts separately.\n\n### 2.1 Citizen science\n\nFor traditional citizen science, the motivation of participants varies greatly depending on the age [2], gender [48], educational background [46], and level of involvement [63]. In many cases, both monetary and non-monetary incentives have a positive effect on participation [9]. However, Wiseman et al. found that non-monetary incentives alone were better for online HCI projects to promote high-quality data from participants [71]. Knowles [38] also confirmed that although monetary incentives enhanced participation, they undermined sustained participation in volunteering initiatives. 
While for some specific projects (e.g., the conservation of species), monetary incentives even have the opposite effect [55].\n\nBecause participants act as sensors to collect data or volunteer their idling computer or brainpower to classify large data sets in the citizen science projects [71], their motivation to participate is primarily intrinsic [15, 43]. However, as motivation to participate varies for different projects, the imposition of monetary incentives can have different effects. Unlike traditional citizen science, OSS development is an open innovation activity requiring deep involvement and a great deal of experience, so the motivation and incentives for participation may vary considerably.\n\n### 2.2 Crowdsourcing\n\nActing as a type of online activity, participants will receive the satisfaction of a given kind of need, be it economic, social recognition, self-esteem, or the development of individual skills [16]. Hossain [34] classified the motivators into extrinsic and intrinsic motivators, where extrinsic motivators include financial motivators (e.g., cash), social motivators (e.g., peer recognition), and organizational motivators (e.g., career development). Intrinsic motivators are directly related to participants\u2019 satisfaction with the task (e.g., enjoyment, fun). Considering the related incentives, Liang et al. [45] highlighted that both intrinsic and extrinsic incentives could increase the effort of participation; however, extrinsic incentives weaken the impact of intrinsic motivation. By comparing paid and unpaid tasks, Mao et al. [47] concluded that monetary incentives make the task processing speed faster, but the quality is reduced. Based on this, Feyisetan et al. [18] improved the paid microtasks more engaging by including sociality features or other game elements. MTurk is a typical and popular crowdsourcing platform based on financial incentives and gamification, where participants are recruited, paid, and rated for their participation in microtasks, which ensure speed and quality at the same time [10]. Unlike MTurk, the contribution to Wikipedia is not incentivized by monetary rewards. Content\ncontribution is more driven by reciprocity, self-development, while community participation relies on altruism, self-belonging, etc [73].\n\nAs can be seen from the related works above, there are many situations of crowdsourcing and different forms of motivation and incentive. However, unlike OSS development, traditional crowdsourcing tasks are mostly micro-tasks, which are relatively simple and require less time. Moreover, there is a clear distinction between the roles, i.e., core developers and external contributors for OSS contributors. Contribution types include code contribution, code review, repository maintenance, management, etc.\n\n2.3 Open source software development\n\nSuccessful OSS initiatives can effectively change the method of software development [30, 39], improve software development efficiency [31, 60], and ensure software quality through effective management [1, 58]. Many projects have emerged along with the increasing number of users participating in the development of the OSS community [28]. In this context, many companies are involved in contributing to open source projects [32]. 
However, they have limited control and influence in day-to-day OSS work and decision processes [35], and OSS still relies on the voluntary participation of crowd labor [17].\n\nMany studies have focused on analyzing individuals\u2019 motivations and the incentives for participating in OSS projects [14, 20, 33, 42, 59, 72]. Von Krogh et al. [68] classified contributors\u2019 motivations into three categories, namely, intrinsic motivation (e.g., ideology and fun), internalized extrinsic motivation (e.g., reputation and own use), and extrinsic motivation (e.g., career and pay). Among developers who volunteer to contribute to open source projects, their motivation is mainly intrinsic or internalized extrinsic motivation [68]. They have full-time jobs and spend some spare time making open source contributions [21]. However, Hars et al. [3] found that being paid can promote continuous contribution from developers with all types of motivation.\n\nCurrently, there are many ways to obtain financial support for open source initiatives, e.g., through donations or bounties. Many studies have focused on the characteristics, impact, or effectiveness of each form of financial support. For example, regarding bounties, Zhou et al. [77] studied the relation between issue resolution and bounty usage and found that adding bounties would increase the likelihood of issue resolution. Acting as a way for recruiting developers, setting bounties attracts those developers who want to make money through open source contributions, which facilitate the completion of complex tasks. However, unlike bounty, the donation is a way of passively obtaining financial support. Regarding open source donation, Krishnamurthy et al. [40] studied the donation to the OSS platform and found the relation between donation level and platform association length and relational commitment. For the donation to OSS, Nakasai et al. [50, 51] analyzed the incentives of individual donors and found that the benefits for donors and software release could promote donations. In contrast, bugs in software will negatively affect the number of donations. However, they only focused on eclipse projects. Overney et al. [53] studied the impact of donations from a broader perspective of open source projects on GitHub, which corresponds to NPM packages and explicitly mentions the way of donation in the README.md files. They found that only a small fraction (mainly active projects) asked for donations, and the number of received donations was mainly associated with project age. Most donations are requested and eventually used for engineering activities. However, there was a slight influence of donation on project activities. Although Overney et al. did a thorough analysis of project-level donation, there lacks analysis of donation towards open source developers. 
Also, we think adding the qualitative analysis from the users\u2019 perspective can confirm the quantitative findings and help understand the pros and cons of system design and use.\n\n3 BACKGROUND\n\n3.1 Terminology\n\nTo help the reader understand the rest of the article, we introduce key terms related to the Sponsor mechanism.\n\n- **Sponsor**: an entity who provides donations to others.\n- **Maintainer**: an entity who can be sponsored (developers who set up a Sponsor profile).\n- **Nonmaintainer**: an entity who has not set up the Sponsors.\n- **Sponsorship**: the donation relationship between a sponsor and a maintainer.\n- **AccountSetUpTime**: the time when maintainers set up the Sponsor profile for their accounts.\n- **FirstSponsorTime**: the time when maintainers receive their first sponsorship.\n\n3.2 Introduction of the Sponsor mechanism\n\nCurrently, in GitHub, the workflow and key elements of sponsorship are shown in Figure 1, where the sponsorship is constructed on the maintainer\u2019s sponsor page by clicking the \u201cselect\u201d button of specific amount. The sponsor page is preset by the maintainer when setting up a Sponsor profile in the related GitHub account, which mainly consists of the following elements.\n\u2022 Personal description: maintainers are free to add text and modify it at any time. The main content can cover basic personal information, project information, why they need to be sponsored, other ways of donation, etc.\n\u2022 Preset goal: maintainers are allowed to set the number of sponsors or sponsorships that they want to get from the Sponsor mechanism and add related descriptions about the goal.\n\u2022 Featured projects: this part lists the related projects that the maintainer currently works on or with the most popularity.\n\u2022 Preset tiers & description: this part contains the tiers set by the maintainer. Sponsors can choose which tier to pay according to the amount and the related description.\n\u2022 Payment choices: sponsors can choose to monthly or one-time customized payment.\n\nAfter choosing the way to construct the sponsorship, sponsors can get the sponsor badge and receive updates from the sponsored maintainer in the future.\n\n3.3 Preliminary analysis\n\nWe conduct a statistical analysis of the use trends of the Sponsor mechanism (Figure 2 shows the number of developers who set up the Sponsor account and how the number of sponsorships changes over time). We can see that the number of developers who set up an account increased sharply around October 2019 (new things inspire people\u2019s interest). At other times, the growth rate shows a downward trend. Meanwhile, the absolute number of participants in this mechanism increased steadily, although the growth rate shows a slight upward trend. Compared to GitHub itself, which has shown a strong increase in its user base [74], the Sponsor mechanism has not attracted as much attention. In this context, we formulate RQ1: Why do individuals participate (or not) in the Sponsor mechanism?\n\nAccording to our manual observation of GitHub developers\u2019 sponsorship pages, we find that developers can spend more time on their open source work if sponsored by others (with examples of this trend being Tim Condon [64] and Super Diana [61]). In short, we consider how the Sponsor mechanism may affect developers\u2019 open source activities. 
In this context, we ask RQ2: How effective is sponsorship in motivating developer OSS activity?

There are some very successful cases of individuals receiving support under the GitHub Sponsor mechanism (e.g., Caleb Porzio, who was sponsored by 1,314 sponsors as of 7 August 2021). However, most Sponsor participants have not been successful, and many have not received any sponsorships at all. According to Figure 3, only 14.1% of maintainers are sponsored at least once. Most people do not receive any sponsorships, despite setting up a Sponsor account. Among sponsors, most (76.3%) sponsor others just one time. Based on the statistical analysis results, we consider which developer characteristics lead to more sponsorships. In this vein, we ask RQ3: Who is likely to receive more sponsorships?

Currently, there are many ways to obtain financial support for open source initiatives, e.g., through donations or bounties. The different types of financial support each have advantages and disadvantages [49]. It falls to participants (especially those who have participated in multiple financial support mechanisms) to judge the reasonableness and effectiveness of each. To better understand users' perceptions of the Sponsor mechanism and thus enrich and improve it, we propose RQ4: What are the shortcomings of the Sponsor mechanism?

4 STUDY OVERVIEW

4.1 Overall research methodology

The overall framework of this paper is shown in Figure 4, with the research methodology consisting of two main parts: data collection and research methods.

4.1.1 Data collection. The data was collected using the GitHub API. The goal was to find different kinds of GitHub users (maintainers, sponsors, and nonmaintainers) and gather their basic information and activities. Here, we focus on how to distinguish the different kinds of users. The acquisition of basic information and details on activities is described in the subsequent section (see Section 4.2) when we introduce each research method in detail. We acquired the different types of users through the following steps.

(1) We used the RESTful API [27] to obtain all users. After that, we queried maintainers using the field hasSponsorsListing of the GraphQL API [26] (a minimal query sketch is shown below). We obtained 60,732,250 users who had not deleted their accounts, among which 7,992 users were individual maintainers.

(2) We used the field `sponsorshipsAsMaintainer` of the GraphQL API [26] to look up all the sponsorships that maintainers had received and the corresponding sponsors.

(3) Using the list of sponsors queried in step (2), we used the field `sponsorshipsAsSponsor` of the GraphQL API [26] to query all the related maintainers. This step supplemented the set of maintainers with Sponsor profiles identified during the query process in step (1).

(4) We repeated steps (2) and (3) until no new maintainers or sponsors appeared.

Through the above steps, we obtained 20,579 users, among which 8,028 are maintainers and 13,555 are sponsors (1,004 users are both maintainers and sponsors of others). We also collected 22,315 sponsorships. All users except maintainers were marked as nonmaintainers.

4.1.2 Research methods. To answer the research questions, we used a combination of quantitative and qualitative analysis.
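Before turning to the individual methods, the following is a concrete illustration of the GraphQL queries used in the data-collection steps above. The field names (hasSponsorsListing, sponsorshipsAsMaintainer) are the ones named in the paper; the token handling, login value, and page size are our own illustrative assumptions rather than the authors' actual script:

```python
import requests

GRAPHQL_URL = "https://api.github.com/graphql"

QUERY = """
query($login: String!, $cursor: String) {
  user(login: $login) {
    hasSponsorsListing
    sponsorshipsAsMaintainer(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        createdAt
        sponsorEntity { ... on User { login } }
      }
    }
  }
}
"""

def sponsorships_as_maintainer(login: str, token: str):
    """Yield (sponsor_login, created_at) for a maintainer, paging through results."""
    headers = {"Authorization": f"bearer {token}"}
    cursor = None
    while True:
        payload = {"query": QUERY, "variables": {"login": login, "cursor": cursor}}
        resp = requests.post(GRAPHQL_URL, json=payload, headers=headers, timeout=30)
        resp.raise_for_status()
        user = resp.json()["data"]["user"]
        # Users without a Sponsor profile are "nonmaintainers" in the paper's terms.
        if not user or not user["hasSponsorsListing"]:
            return
        conn = user["sponsorshipsAsMaintainer"]
        for node in conn["nodes"]:
            sponsor = node["sponsorEntity"] or {}
            yield sponsor.get("login"), node["createdAt"]
        if not conn["pageInfo"]["hasNextPage"]:
            return
        cursor = conn["pageInfo"]["endCursor"]
```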
Regarding our why (RQ1) and what (RQ4) questions, since it was difficult to capture everyone\u2019s reasons for participation or nonparticipation and summarize the shortcomings of the mechanism based on just the platform information, we asked relevant people to complete a questionnaire. For the how (RQ2) and who (RQ3) questions, we collected maintainer-related data, quantitatively analyzed the impact of sponsorship behavior on maintainer open source activity, and explored the correlation between factors and the amount of sponsorship. On this basis, we again conducted a qualitative analysis using a questionnaire. This combination of quantitative and qualitative analysis led to our conclusions. Next, we describe each research method in detail.\n\n4.2 Detailed introduction of research methods\n\n4.2.1 Questionnaire. Since there are three types of interaction between the user and the Sponsor mechanism, namely, interactions with a sponsor, a maintainer, or a nonmaintainer (see Section 3.1), we designed three different online surveys [75]. The surveys for both sponsors and maintainers relate to their expectations for and satisfaction with the Sponsor mechanism. The survey for nonmaintainers relates to their reason for not setting up the Sponsor feature for their account. All the surveys start with an introduction to the research background and purpose. There are two types of questions in each survey.\n\n- Demographic questions designed to obtain participants\u2019 information, including their role in and experience with OSS development (the predefined answers were inspired by prior research [44]).\n- Main questions, designed to gather users\u2019 views on the Sponsor mechanism.\n\nAmong the main questions, there are three kinds.\n\n- Open-ended questions aimed at gathering answers.\n- Rating scale questions soliciting users\u2019 satisfaction and agreement levels.\n- Multiple-choice questions with \u201cOther\u201d text field options aimed at gathering large-scale user feedback while providing additional answers.\n\nWe provide a final, open-ended question to allow participants to talk freely about the Sponsor mechanism. We discussed the questions with software engineering researchers to ensure that the items were well designed for our study and clear enough for participants to answer. Finally, we used SurveyMonkey [62] to deploy our online surveys.\n\nThere were two rounds of each survey: 1) the pilot stage, aimed at gathering answers to the open-ended questions from a limited number of participants, and 2) the full-scale stage, aimed at gathering the votes for each answer from a larger population. The statistics on the two stages can be seen in Table 1.\n\nParticipant recruitment. 
To recruit participants for the two rounds of three different surveys, we took the following steps:\nTable 1: Statistics on the two-stage survey\n\n| Stage | Statistic items | Maintainers | Sponsors | Nonmaintainers |\n|-------|----------------|-------------|----------|----------------|\n| Pilot | #selected participants | 400 | 400 | 400 |\n| | #successful invitations | 394 | 388 | 390 |\n| | #response (%) | 45 (11.4%) | 24 (6.2%) | 9 (2.3%) |\n| | Date for collection | June 8, 2021 - June 15, 2021 |\n| Full-scale | #selected participants | 6,104 | 6,359 | 7,500 |\n| | #successful invitations | 5,951 | 6,224 | 7,343 |\n| | #response (%) | 467 (7.8%) | 396 (6.4%) | 202 (2.8%) |\n| | Date for collection | June 29, 2021 - July 13, 2021 |\n\n# means the number, e.g., #response implies the number of responses\n\n(1) For all three types of users (maintainers, sponsors, nonmaintainers), we filtered out those whose email or name information could not be openly accessed, as these users might not want to receive questionnaires.\n\n(2) For all three types of users, we filtered out those who had not been active in the last month (since May 3, 2021), as they might not have focused on open source work on GitHub in recent days. In this step, we used the GitHub API to obtain users\u2019 recent activity, including the top repositories to which they had contributed in the last month and their last update time (field \u201cupdatedAt\u201d) on GitHub [26].\n\n(3) For nonmaintainers, we selected only users who may be eligible to set up a Sponsor profile based on their location information and the list of countries or regions included under the GitHub Sponsor mechanism [25].\n\n(4) After completing the above three steps, we randomly selected 400 unique individuals of each type without overlap as participants in the pilot stage.\n\n(5) For the full-scale stage, we selected all other maintainers (6,104) and sponsors (6,359) as participants. For nonmaintainers, due to the low response rate in the pilot stage, we filtered users according to the total number of stars of projects owned by developers (collected on 23 June 2021). We selected those with at least ten stars (we assumed that developers with popular projects are more likely to be interested in the Sponsor mechanism and use GitHub very often). After that, we randomly selected 7,500 participants.\n\nResponse and analysis. After selecting the participants, we published the questionnaire online and sent the web address to participants via email. The email invitation contained the basic information of the questionnaire publisher, the reason for the release, the number of questions, and the estimated time required to fill out the questionnaire.\n\nBased on the participants\u2019 feedback of the pilot stage, we designed the questionnaires for the full-scale stage. We removed 1 question for maintainers, 1 question for sponsors, and 2 questions for nonmaintainers due to answers with repetitive content in relation to the answers to other questions. We extracted the essential information from all responses and turned some open questions into multiple-choice questions (3 for maintainers, 3 for sponsors, and 1 for nonmaintainers) through open coding of card sorting method [78] by the first, second and the fifth authors together. To avoid disturbing the participants, we extended the time to collect the responses in this stage relative to that in the pilot stage but did not send a second email reminder. 
At the same time, because different types of participants dedicate different amounts of attention to the Sponsor mechanism, the response rate varies greatly. Nonmaintainers, who do not participate in the Sponsor mechanism, may not care about it and not want to reply to the email.\n\nWhen analyzing the multiple-choice questions, we first calculated the voting rate for each preset option. After that, we manually included the textual response for the \u201cOther\u201d option into the preset taxonomy, if possible, via the closed coding method [78]. If a new topic emerged, we integrated it into the existing taxonomy. When analyzing the last open question (\u201cDo you have anything else to tell us about the Sponsor mechanism?\u201d), we extracted the essential information from the textual response for qualitative analysis. To facilitate analysis, we use [MCx], [SCx], and [OCx] to represent the textual response in the questionnaire for maintainers, sponsors, and nonmaintainers, respectively, where x indicates the serial number of the comment.\n\nThrough the first two questions of each questionnaire, we collected participants\u2019 demographic information, including their status and experience with open source development. For the full-scale stage, the results are shown in Table 2. More than 70% of participants in each category have more than three years of OSS development experience. More than 10% of sponsors have no OSS development experience, which indicates that many sponsors sponsor others solely to support OSS development or maintenance.\n\nTable 2: Demographic information of participants in the full-scale stage\n\n| Questions | Answers | M (%) | S (%) | NM (%) |\n|-----------|---------|-------|-------|--------|\n| Q1: How would you best describe yourself? | Developer working in industry | 62.3 | 80.0 | 65.5 |\n| | Full time independent developer | 16.6 | 10.0 | 8.0 |\n| | Student | 11.6 | 6.9 | 6.5 |\n| | Academic researcher | 3.7 | 3.6 | 16.0 |\n| Q2: How many years of OSS development experience do you have? | Never | 1.1 | 10.2 | 3.0 |\n| | <1 year | 2.2 | 4.6 | 6.5 |\n| | 1-3 years | 10.1 | 14.5 | 12.6 |\n| | 3-5 years | 21.9 | 22.6 | 23.1 |\n| | 5-10 years | 33.6 | 26.9 | 27.1 |\n| | >10 years | 31.2 | 21.3 | 27.6 |\n\nM: maintainer; S: sponsor; NM: nonmaintainer\n\n4.2.2 ITS analysis. The aim of this analysis was to determine when to treat sponsorship as an intervention and how it influences the potential trends in maintainers\u2019 activities (development and discussion activities) from a long-term perspective. Therefore, following the guidelines of previous studies [53, 65, 76], we used the ITS method. The settings of the ITS analysis are shown below.\n\nInterventions: We set both accountSetUpTime and firstSponsorTime (see Section 3.1) as separate interventions. 
We assumed that maintainers may increase their activity after accountSetUpTime to attract others' attention for future sponsorship, or be motivated to increase their open source contributions after firstSponsorTime.

Responses: We set the number of commits (development activity) and the number of discussions (discussion activity) as responses, as they indicate different kinds of activities on GitHub.

Unstable period: Similar to previous studies [53, 65, 76], we set the 15 days before and after an intervention as the unstable period.

Before & after intervention periods: To retain enough analyzable data, we selected maintainers with at least six months of activity before and after interventions, in addition to the unstable period. Therefore, each maintainer has at least $15 \times 2 + 6 \times 2 \times 30 = 390$ days of activity on GitHub.

**Time window:** Each month in the before & after intervention periods is a time window, and the unstable period is also a time window. Therefore, there are $6 \times 2 + 1 = 13$ time windows in all.

The independent variables are as follows.

**Basic items.**
- **intervention:** Binary variable indicating an intervention
- **time:** Continuous variable indicating the time by month from the start of an observation to each time window, with a value range of $[0, 12]$
- **time after intervention:** Continuous variable indicating how many months have passed after an intervention (if there is no intervention, $time\ after\ intervention=0$; otherwise, $time\ after\ intervention=time-6$).

**Developer characteristics.**
- **number of stars before:** Continuous variable measured as the total number of stars of maintainer-owned repositories before the start of each time window
- **in company:** Binary variable indicating whether company information exists at data collection time
- **has goal:** Binary variable indicating whether a maintainer sets a goal for sponsorship at data collection time
- **has another way:** Binary variable indicating whether a maintainer sets other methods for receiving donations at data collection time
- **is hireable:** Binary variable indicating whether a maintainer declares a hireable status at data collection time

**Developer activities.**
- **number of commits before:** Continuous variable measured as the number of commits before the start of each time window
- **number of discussions before:** Continuous variable measured as the number of discussions before the start of each time window

We built a mixed effect linear regression model for the ITS analysis, with a maintainer identifier as the random effect and all the measured factors as fixed effects. A major advantage of the mixed effect model is that it can account for correlated observations within a subject [19]; here, the time windows for the same maintainer tend to follow a similar trend. We used the lmer function of the lmerTest package in R [41] to fit models for the maintainer's commit and discussion activities. For better model performance, we transformed the continuous variables to make them approximately normal and on a comparable scale through log-transformation (plus 0.5) and standardization (mean 0, standard deviation 1) [56]. To reduce the multicollinearity problem, we excluded factors with variance inflation factor (VIF) values $\geq 5$ using the vif function of the car package in R [11]. We report the coefficients and the related $p$ values obtained in this way; a minimal sketch of this preprocessing and fitting step follows below.
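For illustration, here is a minimal Python sketch of this preprocessing and model-fitting step. The paper itself performs the analysis in R (lmer from lmerTest, vif from car, r.squaredGLMM from MuMIn); the sketch below mirrors that workflow with statsmodels, and the data-frame layout, column names, and the example formula are our own assumptions, not the authors' exact specification:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

def log_scale(series: pd.Series) -> pd.Series:
    """Log-transform (plus 0.5), then standardize to mean 0, standard deviation 1."""
    x = np.log(series + 0.5)
    return (x - x.mean()) / x.std()

def fit_its_model(df: pd.DataFrame):
    """Fit a mixed-effects ITS model; df has one row per maintainer per time window.

    Expected (illustrative) columns: maintainer_id, n_commits, n_commits_before,
    n_discussions_before, n_stars_before, intervention (0/1), time,
    time_after_intervention, in_company, has_goal, has_other_way, is_hireable.
    """
    df = df.copy()
    for col in ["n_commits", "n_commits_before", "n_discussions_before", "n_stars_before"]:
        df[col] = log_scale(df[col])

    # Simple one-pass variant of the paper's filter: drop continuous factors with VIF >= 5.
    continuous = ["n_commits_before", "n_discussions_before", "n_stars_before",
                  "time", "time_after_intervention"]
    X = df[continuous].astype(float).values
    kept = [c for i, c in enumerate(continuous)
            if variance_inflation_factor(X, i) < 5]

    terms = ["intervention", "in_company", "has_goal", "has_other_way", "is_hireable"] + kept
    formula = "n_commits ~ " + " + ".join(terms)
    # Maintainer identifier as the random (grouping) effect, the other factors as fixed effects.
    model = smf.mixedlm(formula, df, groups=df["maintainer_id"])
    return model.fit()

# result = fit_its_model(windows_df)  # windows_df is a hypothetical prepared data frame
# print(result.summary())
```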
We also report the explained variance of each factor, which can be interpreted as its effect size relative to the total variance explained by all the factors. For model fit, we report both marginal ($R^2_m$) and conditional ($R^2_c$) R-squared values using the r.squaredGLMM function of the MuMIn package in R [7].

Together with the ITS analysis, we visually present how the responses change over time to show the activity change more intuitively (statistical analysis through visualization). Since there is an unstable period in the ITS analysis, we analyze this period separately using the Wilcoxon paired test method, which is presented in the following section.

**4.2.3 Wilcoxon paired test.** For the ITS analysis, the unstable period is ignored. However, the Sponsor mechanism involves a small amount of money, which may influence maintainer behavior in the short term only. We assume that maintainers may have great fluctuations in OSS activity during the unstable period. We used a paired, nonparametric test method, the Wilcoxon paired test [8]. Through one-sided tests in both directions (alternative=greater and alternative=less) [12], we can see whether the intervention increases or decreases a maintainer's activity. We considered three kinds of interventions: accountSetUpTime, firstSponsorTime, and before and after each sponsorship. We used Cliff's delta ($\delta$) to measure the effect size [29], with $|\delta| < 0.147$ indicating a negligible effect size, $0.147 \leq |\delta| < 0.33$ indicating a small effect size, $0.33 \leq |\delta| < 0.474$ indicating a medium effect size, and $|\delta| \geq 0.474$ indicating a large effect size.

**4.2.4 Hurdle regression analysis.** The starting point of the hurdle regression analysis is a dataset linking maintainer characteristics to the amount of sponsorship received. Therefore, we collected different characteristics of each maintainer heuristically, including basic information, social characteristics, Sponsor mechanism characteristics, developer activities, and project characteristics. For the amount of sponsorship, we used the number of times that a maintainer is sponsored. Next, we present detailed descriptions of the collected variables.

**Developer basic information.**
- **user age:** Continuous variable measured as the time interval in months from the creation of the user account on GitHub until the data collection time
- **in company:** Binary variable indicating whether a maintainer provides information about their company or work situation
- **has email:** Binary variable indicating whether a maintainer publicly provides contact information
- **has location:** Binary variable indicating whether the maintainer discloses geographical location information
- **is hireable:** Binary variable indicating whether a maintainer indicates availability for hire

**Social characteristics.**
- **followers:** Continuous variable measured as the number of followers
- **followings:** Continuous variable indicating how many users the maintainer follows

**Sponsor mechanism characteristics.**
- **min tier:** Continuous variable measured as the minimum number of dollars set by the maintainer for donations
- **max tier:** Continuous variable indicating the maximum donation
- **has goal:** Binary variable indicating whether a maintainer sets a goal for sponsorship
• **has another way**: Binary variable indicating whether a maintainer introduces other modes for receiving donations.
Here, we identified other donation modes by finding links to other funding platforms in the description on the sponsorship page. Other platforms are shown in Table 9, which was compiled according to the collection by Overney et al. [53] and the supported external links of GitHub [24].\n\n\u2022 **introduction richness**: Continuous variable measured as the length of the introduction on the personal sponsorship page.\n\n\u2022 **user age after sponsor account**: Continuous variable indicating the time interval by month (to see how time influences the amount of sponsorship).\n\n**Activity characteristics.**\n\n\u2022 **number of commits**: Continuous variable measured as the total number of commits in GitHub from `accountSetUpTime` until the data collection time.\n\n\u2022 **number of discussions**: Continuous variable measured as the number of comments, including issue comments, pull request comments, and commit comments from `accountSetUpTime` until the data collection time.\n\n**Project characteristics.**\n\n\u2022 **sum star number**: Continuous variable measured as the total number of stars of repositories created by a maintainer.\n\n\u2022 **sum fork number**: Continuous variable indicating the number of forks.\n\n\u2022 **sum watch number**: Continuous variable indicating the number of watchers.\n\n\u2022 **sum top repository star number**: Continuous variable measured as the total number of stars of top repositories that a maintainer contributed in the four months before data collection [23].\n\n\u2022 **number of dependents**: Continuous variable measured as the number of repositories that rely on the project with the most watchers among all projects owned by the maintainer.\n\nWhen building the hurdle regression models, we removed maintainers with less than 3 months of activity after `accountSetUpTime` to reduce the impact of time on sponsorship. We reasoned that sponsors need time to find maintainers to donate to. To reduce the zero-inflation in the response variance, we used hurdle regression [36] by splitting the sample into two parts:\n\n1. **maintainers** who have not received any donations from others, to examine which factors influence whether a maintainer receives donations.\n\n2. **maintainers** with at least 1 sponsorship, to examine how the amount of received donations is influenced by the aforementioned characteristics.\n\nFor the reduction of the multicollinearity problem and the report of results, we use the same methods (see Section 4.2.2).\n\n## 5 RESULTS\n\n### 5.1 RQ1: Why do individuals participate or not in the Sponsor mechanism?\n\nFor this research question, the questionnaire had a dedicated item for each of the three types of participants, i.e., Q3 for maintainers, sponsors, and nonmaintainers. Table A shows the motivations or reasons elaborated by different types of developers in the full-scale stage and the percentage of votes for each option.\n\n#### 5.1.1 Related motivations.\n\nFrom the results, we find that some of the motivations of maintainers and sponsors are related.\n\n**Project use relationship.** For RM1 and RS1, they all indicate that the usage of related projects leads to sponsorship. Some 64.9% of maintainers and 85.8% of sponsors cite this factor as one motivation for participating in the Sponsor mechanism; this consensus puts it in first place on both groups\u2019 motivation lists. 
People think that users should give back to contributors in various ways, among which the Sponsor mechanism serves as a \u201cnice way to say thanks\u201d [MC23] and \u201callow people to easily fund their projects.\u201d [MC20]. From the perspective of sponsors, developers are grateful for the OSS that they use and hope to express their gratitude and, e.g., \u201cshow support for OSS, which I heavily rely on in my daily work. Without OSS, I could not have built a career in data science\u201d [SC3].\n\n**Promotion of continuous OSS contributions.** RM2 and RS2 reflect participants\u2019 uniform motivation to engage in further OSS contributions. Some 63.1% and 78.4% of maintainers and sponsors, respectively, cite this factor as a motivation; this factor thus ranks 2nd among all the enumerated reasons for participation. For open source developers, if they want to devote themselves to open source projects, they need to solve the problem of daily costs and open source maintenance costs (e.g., \u201cI believe in the open source and good-for-humanity idea. I need to get paid only to live a decent life\u201d [MC37]). Therefore, the emergence of the Sponsor mechanism may help them solve the above problems to a certain extent and then invest more time in open source projects (e.g., \u201cI was really hoping to get sponsorship so I could spend more time focusing on developing open source projects\u201d [MC11]). For sponsors, they also hope to inspire contributors to continue to make outstanding contributions (e.g., \u201cmotivate them to do the awesome work\u201d [SC5]).\n\n**Recognition of OSS work.** For RM4 and RS3, they all indicate sponsors\u2019 recognition of maintainers. A total of 39.9% of maintainers and 49% of sponsors cite this factor as a motivation for participation; this motivation ranks 4th and 3rd for these two groups, respectively. For some people, sponsorship is a manifestation of greater recognition by sponsors than income.\n\n**Support for specific features.** For RM7 and RS5, 18.8% of maintainers and 9.4% of sponsors hope that the Sponsor mechanism can help set the agenda for issue resolution priorities, although many people think that OSS should not be related to money (e.g., \u201cIf there was money given by others involved, I would feel pressed to implement whatever they want (like in industry projects). I want FLOSS to be completely independent from corporate requests\u201d [OC5]).\n\n#### 5.1.2 Motivation across different user types.\n\nIn addition to the motivations mentioned above related to the sponsor and maintainer relationship, there are other motivations or reasons related to the kinds of users.\n\n**Maintainers:** More than 60% of these participants chose RM3, but only 13% chose RM8. In the Other option, 4 participants mentioned that they hope sponsorship can cover some of their infrastructure costs. 
Moreover, 28.9% of participants even chose RM5 **Just for fun**. This indicates that different people have different motivations for participating in the Sponsor mechanism.

Table 3: Reasons for participating or not participating in the Sponsor mechanism

| Reason (maintainers) | Votes (%) | Reason (sponsors) | Votes (%) | Reason (nonmaintainers) | Votes (%) |
|----------------------|-----------|-------------------|-----------|-------------------------|-----------|
| RM1 It allows users of my projects to express thanks/appreciation | 64.9 | RS1 Because I benefit from the developer's projects | 85.8 | RO1 No need to be sponsored | 39.3 |
| RM2 Sponsorship can motivate my future | 63.1 | RS2 To encourage the developer to continue the contribution | 78.4 | RO2 I contribute to OSS not for money | 38.3 |
| RM3 Side income for OSS contribution | 60.6 | RS3 To show my recognition of the developer's work | 69.5 | RO3 My work is not worth being sponsored | 28.4 |
| RM4 It can reflect community recognition for my work | 39.9 | RS4 Because I'm interested in the developer's projects | 49.0 | RO4 Never heard of it | 26.4 |
| RM5 Just for fun | 28.9 | RS5 To motivate the developer to work harder on a specific feature | 9.4 | RO5 It's cumbersome | 8.5 |
| RM6 I deserve to be rewarded for my past OSS contribution | 21.8 | RS6 Because I know the developer | 8.9 | RO6 Not available in my region | 2.0 |
| RM7 I am able to prioritize the requirements of sponsors (e.g., fixing bugs) | 18.8 | Other | | Other | 10.4 |
| RM8 It's a way for me to make a living | 13.1 | | | | |
| Other | 1.9 | | | | |

The main reason cited for participation is to obtain or express appreciation for the use of open source projects or to recognize the maintainer's OSS contribution. In turn, such support may promote better contributions. Maintainers seeking to make money tend to obtain extra income rather than a full livelihood through sponsorship. For nonmaintainers, in addition to personal reasons, the mixing of open source projects and money is another critical consideration preventing them from participating.

5.2 RQ2: How effective is sponsorship in motivating developer OSS activity?

We used the following methods for this research question: statistical analysis (visualization), ITS analysis, unstable period analysis based on the Wilcoxon paired test method, and qualitative analysis based on a questionnaire survey. We also explored the two kinds of interventions, namely, accountSetUpTime and firstSponsorTime.

5.2.1 Visualization. Figures 5-8 present the change in activities over time. We can see from the figures that both commit and discussion activities remain stable before and after the intervention. However, during the unstable period, developers tend to be more active than usual. In response to this phenomenon, we analyzed the persistent and transient effects of the interventions using the ITS method and the Wilcoxon paired test method, respectively.

5.2.2 ITS analysis. Table 4 shows the results of the ITS analysis. The results show that the factor with the strongest correlation to OSS activity is the associated historical activity (i.e., number of commits before for the Commit Model, number of discussions before for the Discussion Model). For all four models, the associated historical activity explains more than 80% of the total variance. For the impact of other funding sources, we find that the variance explained by this factor does not exceed 1.1% in any of the four models.
Therefore, it is reasonably clear that the existence of funding sources other than the Sponsor mechanism does not confound our exploration of the association between this mechanism and open source activity.

For the number of commits, we find that for both accountSetUpTime and firstSponsorTime, there is a slight growth trend before the intervention. After the intervention, both show a negative growth trend ($\beta_{\text{time}} + \beta_{\text{time after intervention}} < 0$). Additionally, we find that the intervention itself is negatively correlated with the number of commits ($\beta_{\text{intervention}} < 0$).

For the number of discussions, we find results similar to those for commit activity. The intervention of the Sponsor mechanism changes the original slowly increasing dynamics and reduces discussion activity. Specifically, the intervention has no effect at accountSetUpTime but a slightly negative effect at firstSponsorTime.

Given these results, it is surprising that neither setting up a Sponsor account nor receiving a first sponsorship contributes to growth in the maintainer's commit or discussion activity; instead, there is a slight inhibitory effect. To illuminate this situation, we followed up with a questionnaire to explore maintainers' subjective satisfaction with the Sponsor mechanism and its motivating effect (see Section 5.2.4).

5.2.3 Wilcoxon paired test analysis. Table 5 shows the results of the Wilcoxon paired test and Cliff's delta.

For the number of commits, when the maintainer sets up the Sponsor account, is sponsored for the first time, or receives a new sponsorship, the number of commits after the intervention is significantly higher. For the number of discussions, we find no significant changes around the three kinds of interventions.

This result indicates that sponsor behavior leads to a short-term increase in commit activity. For discussion activity, however, sponsorship does not lead to short-term changes. Unlike the ITS analysis, the Wilcoxon paired test examines changes in activity during the unstable period; its results further suggest that the Sponsor mechanism gives a short-term boost to development activity.

5.2.4 Questionnaire survey. To further explore the effectiveness of the Sponsor mechanism, we surveyed maintainers and sponsors independently to uncover their subjective judgments about the efficacy of the mechanism. To this end, we asked maintainers Q4 ("How satisfied are you with the income from sponsors?") and sponsors Q4 ("As a sponsor, to what extent does your sponsorship meet your expectations?"). We also asked the maintainers directly about their perceptions of the effectiveness of sponsorship incentives (Q5 "To what extent can sponsorship motivate you?"). The results are shown in Figure 9.

For sponsors, we find that 53.7% think that sponsorship meets their expectations fully or a great deal, and only 14.1% report that their expectations are hardly met or not met at all. For maintainers, we find that 50.4% consider that sponsorship motivates them fully or a great deal, but 22.5% think that it does not bring any motivating effect.
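The unstable-period comparison in Section 5.2.3 pairs each maintainer's activity before and after an intervention. As an illustration only, the sketch below runs a paired Wilcoxon signed-rank test (the scipy.stats.wilcoxon routine cited in [12]) and computes Cliff's delta on synthetic data; the sample, window length, and effect are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-pairs."""
    x = np.asarray(x)[:, None]
    y = np.asarray(y)[None, :]
    return ((x > y).sum() - (x < y).sum()) / (x.shape[0] * y.shape[1])

# Hypothetical paired samples: each maintainer's commit count in a fixed-size
# window before and after the first sponsorship.
rng = np.random.default_rng(1)
before = rng.poisson(4, size=200)
after = np.maximum(before + rng.integers(-2, 4, size=200), 0)  # slight upward shift

# Paired signed-rank test (zero differences are dropped by the default method).
stat, p_value = wilcoxon(before, after)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")
print(f"Cliff's delta = {cliffs_delta(after, before):.3f}")
```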
However, in terms of the amount of sponsorship, we find that only 20.7% of maintainers are either satisfied or very satisfied with their income from sponsorship, and 30.1% are dissatisfied or very dissatisfied with the amount.

Table 4: Results of ITS analysis

| Commit Model | Dependent variable: scale(log(number of commits + 0.5)) | Discussion Model | Dependent variable: scale(log(number of discussions + 0.5)) |
|--------------|----------------------------------------------------------|------------------|--------------------------------------------------------------|
| | accountSetUpTime | firstSponsorTime | accountSetUpTime | firstSponsorTime |
| | Coeffs (Err.) | Chisq | Coeffs (Err.) | Chisq |
| Intercept | -0.10** (0.01) | 0.01** (0.01) | 0.01** (0.01) | 0.01** (0.01) |
| scale(log(number of commits before + 0.5)) | 0.59** (0.01) | 5190.72** | 0.58** (0.02) | 1185.38** |
| scale(log(number of discussions before + 0.5)) | -0.02 (0.01) | 3.45* | -0.03 (0.02) | 2.29 |
| scale(log(number of stars before + 0.5)) | -0.06** (0.01) | 55.23** | -0.07** (0.01) | 22.71** |
| has goal (TRUE) | 0.06** (0.01) | 17.43** | 0.07* (0.03) | 5.97* |
| has other way (TRUE) | 0.16** (0.05) | 8.22** | 0.14 (0.09) | 2.36 |
| in company (TRUE) | 0.89** (0.01) | 38.56** | 0.11** (0.03) | 15.60** |
| is hireable (TRUE) | 0.00 (0.01) | 0.02 | 0.01 (0.03) | 0.22 |
| time | 0.02** (0.00) | 96.11** | 0.03** (0.00) | 61.22** |
| intervention (TRUE) | -0.02* (0.01) | 5.66 | -0.09** (0.02) | 25.54** |
| time after intervention | -0.04** (0.00) | 245.92** | -0.05** (0.00) | 97.38** |
| Number of Observations | 75,516 | 20,148 | 75,516 | 20,148 |
| R² | 0.64 | 0.64 | 0.66 | 0.65 |

*** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.1

Figure 9: Results of 5-point Likert scale questions

We think that the main reason for this difference is that sponsors' main motivation to participate is to display their gratitude, inspire others, etc., by giving funds. Therefore, most sponsors are satisfied with their own behavior. For maintainers, although more than half think that sponsorship can be stimulating, only approximately 20% are satisfied with the amount of sponsorship received. This shows that open source sponsorship has a positive effect on some developers, but the amount of monetary reward that can be received through sponsorship is relatively small and unlikely to meet maintainers' expectations.

In terms of short-term effects, the Sponsor mechanism makes a slightly positive contribution to development activity but has no significant impact on discussion activity. However, this impact is not sustained. One possible reason is that the actual amount of support does not meet maintainers' expectations, which makes it difficult for maintainers to rely on sponsorship income to keep investing in open source contributions.

5.3 RQ3: Who is likely to receive more sponsorships?

For this research question, we tried to identify the important factors influencing the amount of sponsorship and to provide further advice to maintainers. We again analyzed and verified the results through a combination of quantitative and qualitative analysis. For the qualitative part, we surveyed both maintainers and sponsors and examined the consistency of their perceptions of sponsorship.

5.3.1 Hurdle regression.
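A hurdle regression separates the outcome into two parts: a binary model for whether a maintainer receives any sponsorship at all and a count model, fit only on maintainers who clear that hurdle, for how many sponsorships are received. The following is a simplified two-part sketch in Python with synthetic data and only two hypothetical predictors; the study's models include many more factors, and a dedicated zero-truncated hurdle implementation for count data [36] would be more faithful than the plain logistic and Poisson parts used here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic maintainer-level data: two log-scaled predictors and a number of
# sponsorships that is zero for most maintainers (hence the hurdle structure).
rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({
    "followers": np.log(rng.pareto(2.0, n) * 50 + 0.5),
    "discussions": np.log(rng.poisson(20, n) + 0.5),
})
logit = -2.0 + 0.8 * (df["followers"] - df["followers"].mean())
sponsored = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))
df["sponsorships"] = np.where(sponsored, rng.poisson(3, n) + 1, 0)

# Part 1: the hurdle (zero vs. nonzero), modeled here with logistic regression.
df["any_sponsorship"] = (df["sponsorships"] > 0).astype(int)
hurdle_part = smf.logit("any_sponsorship ~ followers + discussions", data=df).fit(disp=False)

# Part 2: the count part, fit only on maintainers with at least one sponsorship.
# A zero-truncated count model would be more faithful; plain Poisson keeps the
# sketch short.
positive = df[df["sponsorships"] > 0]
count_part = smf.poisson("sponsorships ~ followers + discussions", data=positive).fit(disp=False)

print(hurdle_part.summary())
print(count_part.summary())
```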
From an overall perspective (see Table 6), the hurdle regression models fit well, with $R^2 = 34\%$ and $R^2 = 39\%$, respectively. Even though 7,465 maintainers have more than 3 months of activity after setting up their Sponsor profile, only 2,750 (36.8%) of them receive at least one sponsorship. Moreover, only 6% receive sponsorships more than 10 times, and only 25 maintainers receive more than 100 sponsorships. Therefore, although many people want to obtain sponsorship, only a small number succeed.

When we consider whether the maintainer receives any sponsorships (columns 2 and 3 of Table 6), the followers factor, representing social status, has the most substantial positive effect, explaining 45.8% of the total variance. However, the factor followings is negatively correlated with the likelihood of receiving sponsorship (effect size: 3.1%). It is likely that, compared to followings, followers better represents the centrality of maintainers in the community, while maintainers with large followings tend to learn more from others in the community. Discussion activity is positively correlated with the likelihood of sponsorship (number of discussions, effect size: 22.7%), whereas commit activity explains only 0.3% of the variance. A possible explanation is that sponsored developers tend to focus more on issues or pull requests submitted by sponsors to give back or attract the attention of others. Commit activity is common among GitHub developers, many of whom may focus only on their own issues. For sponsor tiers, the min tier is negatively correlated with the likelihood of sponsorship acquisition (effect size: 12.3%), while the max tier is positively correlated and explains 5% of the variance. Both tiers have sizable effects but opposite directions of influence. It is likely that many sponsors tend to donate only a little money, so setting a high min tier may cause them to abstain from sponsorship. However, maintainers who want to obtain sponsorships should not undervalue themselves: raising the max tier can increase the possibility of being sponsored. Another thing for maintainers to note is the importance of the introduction text when setting up their Sponsor account. If maintainers introduce themselves at greater length, they are more likely to be sponsored (effect size: 5.1%). Other factors have negligible effects, with explained variances of less than 5%.

When we consider the amount of sponsorship received by maintainers (columns 4 and 5 of Table 6), the social status of maintainers is again positively correlated with the response (followers, effect size: 65.3%), while followings is negatively correlated with the response (effect size: 10.7%). The factor number of discussions explains 9.6% of the total variance. The min tier variable becomes nonsignificant, unlike in the receive sponsorship model. A possible explanation for this result is that the setting of the min tier is not a long-term solution for securing more sponsorship; developers need to focus more on their status and daily activities in the community. Other factors have negligible effects.

5.3.2 Questionnaire. We asked questions of maintainers (Q6 "In which way do you think you can obtain more sponsorships?") and sponsors (Q5 "What kind of developer do you prefer to sponsor?") separately. Table 7 presents the results.

For maintainers.
The results reveal that from the maintainers' perspective, producing useful projects and tools (WM1, WM4) is seen as more likely to draw sponsorships than merely participating in projects (WM5, WM6, WM7, WM8, WM9). One possible reason is that the Sponsor mechanism credits funds to individual accounts, and the sponsorship button on the project homepage must be configured by the owner. Some sponsors who want to donate to a project through the Sponsor mechanism (e.g., those reporting that "I prefer to sponsor projects, not a specific developer" [SC167]) may end up sponsoring only the project's owner.

Some 54.5% of maintainers think that by working hard, they can obtain more sponsorships (WM2). However, some maintainers said sponsorship is simply a matter of popularity (e.g., "Purely popularity basically... OSS Creators from YouTube earn a ton of money" [MC292]; "I think it is mostly a function of being a celebrity so it operates on the same rules" [MC262]). This is probably why 54.1% of the maintainers chose WM3.

More than one option was chosen by 85.6% of the sponsored participants. Moreover, 20.5% chose at least 5 options, which shows that the options we offered are in fact seen as feasible ways of promoting sponsorship among maintainers. Some participants indicated that "Donations just don't work" [MC284] or "It doesn't matter; people take when it's free" [MC281]. These responses suggest that the reasons preventing most people from obtaining the amount of sponsorship they expect are not limited to individual participation characteristics and platform mechanism design; rather, the act of sponsorship itself may not be suitable for the open source sphere. Indeed, 10 participants who selected WM11 indicated that there was no way to obtain more sponsorship.

For sponsors, the vast majority (85.1%) chose WS1, which suggests that most sponsors support developers involved in the open source projects that the sponsors themselves use. This corresponds to the top-ranked way of obtaining sponsorship (WM1) selected by the maintainers, suggesting that the best way to obtain more sponsorship, in the opinion of both maintainers and sponsors, is to create projects that more people use. Similarly, more than half of the participants wanted to sponsor projects of personal interest (WS2) and developers who had made significant contributions (WS3). We find that 31.1% of the sponsors chose to sponsor independent developers (WS5). However, some sponsors said that just being an independent developer is not enough and that the development and maintenance of good open source projects or tools are needed (e.g., "Independent developers with nice tools" [SC30]).

Most sponsors do not consider the act of sponsorship a form of charity: few people reported doing so simply because the person being rewarded was in hardship (WS7) or had not received many rewards (WS6). Likewise, sponsors do not want to reward another developer simply because they know one another (only 15.4% chose WS8, e.g., "It is usually a library I am using in my own project and I know the developer in person" [SC168]).

### Table 6: Result for factors influencing sponsorship

| Dependent variable: receive sponsorship | Coefs (Err.) | Chisq |
|------------------------------------------|--------------|-------|
| (Intercept) | −0.53*** (0.09) | 1.80* (0.07) |
| scale(log(user age + 0.5)) | −0.10* (0.03) | 8.62** |
| in company (TRUE) | −0.26*** (0.06) | 18.08*** |
| has email (TRUE) | −0.03 (0.06) | 0.31 |
| has location (TRUE) | −0.11 (0.09) | 1.41 |
| is hireable (TRUE) | −0.19** (0.06) | 9.70** |
| scale(log(followings + 0.5)) | 0.96*** (0.04) | 545.36*** |
| scale(log(min tier + 0.5)) | −0.19*** (0.03) | 37.39*** |
| scale(log(max tier + 0.5)) | −0.42*** (0.04) | 146.89*** |
| has goal (TRUE) | 0.23*** (0.03) | 59.82*** |
| has other way (TRUE) | 0.18* (0.06) | 8.32* |
| scale(log(user age after sponsor account + 0.5)) | 0.28 (0.22) | 1.54 |
| scale(log(number of commits + 0.5)) | 0.02 (0.03) | 0.40 |
| scale(log(number of discussions + 0.5)) | 0.08 (0.04) | 3.42 |
| scale(log(sum star number + 0.5)) | 0.73*** (0.05) | 270.29*** |
| scale(log(sum top repository star number + 0.5)) | −0.10** (0.04) | 7.48** |
| scale(log(introduction richness + 0.5)) | −0.13** (0.04) | 9.55** |
| scale(log(number of dependents + 0.5)) | 0.25*** (0.03) | 60.84*** |
| Number of Observations | 7,465 | 2,790 |
| delta R² | 0.34 | 0.39 |

### Table 7: Ways of obtaining more sponsorship

| Way_maintainers | Votes (%) | Who_sponsors | Votes (%) |
|-----------------|-----------|--------------|-----------|
| WM1 Producing useful projects | 62.6 | WS1 Developers whose projects I benefit from | 85.1 |
| WM2 Staying active and contributing more in the community | 54.5 | WS2 Developers whose projects I'm interested in | 60.3 |
| WM3 Advertising myself or my work to the community | 54.1 | WS3 Developers who make important contributions | 50.9 |
| WM4 Producing valuable code | 38.5 | WS4 Developers who are active in the community | 42.0 |
| WM5 Getting involved in popular projects | 29.1 | WS5 Independent developers | 31.1 |
| WM6 Getting involved in projects adopted by companies | 25.5 | WS6 Developers who haven't received much sponsorship | 24.1 |
| WM7 Getting involved in long-term projects | 21.6 | WS7 Developers who are in hardship | 18.7 |
| WM8 Getting involved in less maintained yet important projects | 19.1 | WS8 Developers who I know | 15.4 |
| WM9 Getting involved in projects led by companies | 8.8 | WS9 Other | 1.0 |
| WM10 Providing localized content | 7.4 | | |
| WM11 Other | 3.6 | | |

Most maintainers and sponsors think that sponsorship builds on relationships forged through using OSS. Active and meaningful participation in open source contributions can also help maintainers gain more attention. However, the quantitative analysis reveals that the social popularity of the maintainer in the community is the decisive factor in obtaining more sponsorships.

5.4 RQ4: What are the shortcomings of the Sponsor mechanism?

For this research question, we investigated the shortcomings that participants found while using the Sponsor mechanism. We asked both maintainers (Q7) and sponsors (Q6) the question "What are the shortcomings of the Sponsor mechanism?" separately.
Table 8 presents the results.

Among maintainers, 13.1% thought that the Sponsor mechanism was perfect (SM6) and could meet their personal needs well, while among sponsors, 33.1% thought that the mechanism was perfect (SS2). This indicates that the satisfaction of different types of mechanism participants, especially maintainers, varies greatly. The current Sponsor mechanism does not meet maintainers' needs well. The shortcomings include the following main aspects (some of them were resolved by GitHub during the research process).

Discoverability of maintainers. The results reveal that 51.3% of maintainers found it difficult to be discovered by sponsors (SM1); however, based on feedback from sponsors, only 19.6% found it difficult to determine whom they should sponsor (SS3). A larger share (40.1%) found it difficult to assess who urgently needed sponsorship (SS1).

Interactivity of participants. From the results, we find that 29.4% of maintainers thought that the current Sponsor mechanism does not support direct communication with sponsors well (SM2), while 11.8% of sponsors wanted communication support (SS5). Some thought that they should not burden developers by interrupting their normal development process ("I don't want to burden the developers [by asking them] to communicate with sponsors. The sponsor should be string-free" [SC195]).

Payments. Many people, including maintainers and sponsors, highlighted existing payment problems with the Sponsor mechanism, including limited payment options (25.1% of maintainers, SM3), limited sponsorship tiers, inconvenient tax payments (19.3% of maintainers, SM5), and limited payment providers. Some of these shortcomings, e.g., the limited payment options, may have been resolved by GitHub during the research process.

User distinction. A total of 20.7% (SM4) of maintainers and 10.5% (SS6) of sponsors mentioned the lack of a visible distinction between sponsors and other users in project development activities.

Geographical restrictions. From SM7 and SS4, we see that 11% of maintainers and 13.2% of sponsors thought that limited regional support restricts participation. As of 27 July 2021, only 37 regions were supported, leaving many people unable to participate in the mechanism (RO6) and sponsors unable to sponsor as many people as they want (e.g., "Not all organizations I want to support joined GitHub sponsors" [SC192]).

Lack of contribution indicators. Five participants noted that there was a lack of valid OSS contribution indicators. OSS contributions are not limited to commits and pull requests, and a sponsor who is not involved in a project can hardly know who has played a significant role in its development (e.g., "It is not easy to measure my OSS contribution. Sometimes it is just filing issues; other times, it is documentation PRs" [MC350]). Moreover, contributions of small patches to large projects are difficult for others to find and thus are unlikely to gain sponsorships (e.g., "In my case, you will be hard-pressed to get anything for your work when you are making just a little addition to a massive piece of software" [MC379]). Among sponsors, some want to sponsor a project, not individual maintainers (e.g., "I prefer to sponsor projects, not a specific developer" [SC167]).

OSS donations. The Sponsor mechanism itself is an act of donation. On GitHub, sponsorship is primarily for users or organizations that have created a GitHub account.
We find from the results that 16 participants thought that the donation mechanism itself was not suitable for the current open source sphere. Many reasons were cited for this evaluation: people take open source projects for granted, and no one wants to pay for them (e.g., "People still do not like to pay for software" [MC355]); companies that use open source initiatives to gain revenue do not want to give back to the open source project (e.g., "Most companies don't fund any of their open source dependencies" [MC354]); and donations are passive income, and without a regular income, developers have little motivation to work full-time on open source projects (e.g., "Donation makes far less revenue than charging for things" [OC78]).

To address the problems mentioned above, we offer the following actionable suggestions after taking into account the participant feedback.

Discoverability of maintainers.

- Add "Sponsor" buttons for the relevant project or people on the release webpage ("Recognition of sponsors in release of the repository would be something I can think of" [SC217]).
- Add support for integrated development environments (IDEs), allowing developers to discover package dependencies and quickly jump to sponsor pages while developing with IDEs ("Better discoverability and integration with other developer tooling" [SC65]).
- Provide a more straightforward way to show personal OSS contributions (e.g., "Promote efforts like a dashboard" [MC126]).

Interactivity among participants.

- Allow maintainers to configure whether they wish to communicate directly with sponsors. The interaction can be set up in different groups for different sponsors, similar to Patreon's integration solution with Discord [54] (e.g., "Lack of integration with the payment tiers like the Discord integration with Patreon" [MC337]).
- Allow maintainers to configure their own thank-you emails that can be sent automatically when they receive a sponsorship (e.g., "Some kind of thank-you setup where I can send notes, etc." [MC109]).
- Allow maintainers to upload statements disclosing how sponsorship proceeds are spent ("Distribution of the money, especially in FOSS [free and open source software] projects" [MC88]).

Payments.

- Provide clear income and expense statements to the sponsor and maintainer automatically.
- Integrate as many payment providers as possible on the basis of meeting tax requirements.

User distinctions.

- Let maintainers decide, through a configurable option in their personal settings, whether they want to treat sponsors differently from nonsponsors.
- In addition to an option to show distinctions, add configuration options such as what development activities to show and whether to distinguish between sponsors with different sponsorship amounts (e.g., "Developers should be allowed to set permission levels based on sponsorship. E.g., you can only comment or make requests if you're a sponsor (or if the developer directly opts you in, or if you've made contributions to the project, things like that). This would really positively change the culture of GitHub collaboration" [SC212]).

Geographical restrictions. Provide support for more regions.

Lack of contribution indicators. Set up a multidimensional indicator of contributions, and ensure rational allocation of project sponsorship funds.

OSS donations.
Future research should synthesize feedback from all types of open source participants and reconsider how to improve the sponsorship mechanism or design a more appropriate form of open source financial support.

### Table 8: Shortcomings of the Sponsor mechanism

| Shortcoming_maintainers | Votes (%) | Shortcoming_sponsors | Votes (%) |
|-------------------------|-----------|----------------------|-----------|
| SM1 It's hard for others to discover me for sponsorship | 51.3 | SS1 I cannot assess how urgently a developer needs to be sponsored | 40.1 |
| SM2 I can't interact with my sponsors on GitHub (e.g., for expressing appreciation) | 29.4 | SS2 None. It's perfect | 33.1 |
| SM3 Lack of a wide range of payment options (e.g., one-time/yearly/quarterly payment) | 25.1 | SS3 It's hard for me to find the developer I should sponsor | 19.6 |
| SM4 GitHub does not distinctly mark my sponsors (e.g., I cannot easily tell whether an issue submitter is my sponsor) | 20.7 | SS4 It is not supported in many regions | 13.2 |
| SM5 I have to pay taxes | 19.3 | SS5 I can't interact with the developer I sponsored on GitHub | 11.8 |
| SM6 None. It's perfect to me | 13.1 | SS6 I'm not distinctly marked in the projects whose maintainers have been sponsored by me (e.g., when I submit an issue) | 10.5 |
| SM7 It is not supported in many regions | 11.0 | SS7 Other | 8.1 |
| SM8 I can't declare how I dealt with the received money | 10.1 | | |
| SM9 Other | 9.4 | | |

During the research process, GitHub addressed some of these shortcomings, e.g., by adding a one-time payment option.

The shortcomings of the Sponsor mechanism relate to three main aspects. **Usage deficiencies**: participants find it difficult to discover each other, interaction support is lacking, promotion is limited, and payment and billing support is inadequate. **Targeted objects and supported functions**: despite support for organizations and projects, the mechanism mainly targets individuals; sponsors need better support for corporate sponsorship, and maintainers need better support for multicontributor projects. **Personalization**: the Sponsor mechanism needs to be configurable to reflect variation in participant types and motivations.

### 6 DISCUSSION

Through this study of the integrated sponsorship mechanism on the world's most popular open source platform (GitHub), we found that participation in the mechanism has not shown the same rapid growth as participation in open source projects. Meanwhile, there is a long-tail effect in the number of sponsorships obtained by maintainers; i.e., most maintainers obtain few sponsorships or none at all. Compared to the work of Overney et al. [53], this research brings us one step closer to understanding the incentive effect of sponsorship on individual developers by collecting feedback from participants in open source donations, taking the GitHub Sponsor mechanism as an example.

Because this article considers only the Sponsor mechanism, it lacks an overall comparative analysis of all open source sponsorship platforms. However, we think that it still provides some guidance for improving the mechanism itself and for exploring the essence of open source donation.

This paper explored four aspects of the Sponsor mechanism: its who, what, why, and how.
The main findings and insights are as follows.

**Why do individuals participate or not in the Sponsor mechanism?**

Not all open source contributors endorse open source donation. There were more nonparticipants than participants. As with the motivations for participation in traditional citizen science [15, 43] and information-sharing crowdsourcing systems like Wikipedia [73], developers are primarily intrinsically motivated to participate in open source contributions [21]. However, because open source development activities are more complex and require significant maintenance, many contributors are looking for financial support [5, 57, 67]. Among those who support and use the mechanism, relationships built through the use of specific software generally serve as the backbone of sponsorship behavior. In fact, many users want the difference between sponsors and nonsponsors to be reflected in development activities and, in this way, to change the way open source collaboration and participation in open source donation work. Such a change might not be very pleasant and could lead to the open source sphere becoming money driven. We think that making the format personalized and configurable may meet the needs of more people without changing the nature of the open source sphere.

System designers need to consider regional support and improve the user experience (e.g., better access to billing for tax purposes) to make the Sponsor mechanism accessible to more people who want to participate.

**How effective is sponsorship in motivating developer OSS activity?**

In a study of donations to projects, Overney et al. [53] found that donation did not improve engineering activity. In our study, we similarly found that sponsorship has only a short-term positive stimulating effect on maintainers' development activity. The impact does not last, and there is even a slight negative effect in the long term. A possible reason for this result is that most maintainers do not receive sufficient sponsorship through the Sponsor mechanism to be motivated to contribute continuously. This may reflect the characteristics of open source donations: the maintainer passively receives sponsorship from the sponsor, and there is no compulsion for the act of sponsorship to occur. Thus, situations may arise that are similar to that of one of our questionnaire participants, who created heavily used tools but received no sponsorships. When maintainers compare such an outcome with the results of others, it may deal a blow to them and reduce their enthusiasm for making open source contributions.

For system designers, it is important to consider how to design complementary mechanisms, such as adding a ranking list according to the number of received or given sponsorships in the annual report or other locations. In this way, the sponsorship mechanism could become a more continuous driving force, enhancing the impact of sponsorship on developer activities.

**Who is more likely to receive sponsorships?** Participants' subjective perceptions conflict with the actual phenomenon. Participants believe that creating useful open source projects should lead to more sponsorships. However, we find that the most significant factor influencing the amount of sponsorship is social status.
This inconsistent finding illustrates that participants want to express their gratitude or receive appreciation from others through the software usage relationship. However, it is not the case that those who develop sufficiently useful tools receive substantive sponsorship. Given the feedback from participants in our questionnaire, this situation is likely to cause maintainers to complain about a lack of publicity for themselves or about the fact that their work does not lead to more sponsorships. At the same time, developers who make minor contributions to popular projects or outstanding contributions to niche projects may be ignored under this mechanism. Compared with project-oriented donation platforms such as Open Collective and Patreon [53], the Sponsor mechanism targets developers, which allows external contributors who do not own but are actively involved in popular projects to receive donations. However, our results show that sponsors still prefer project-oriented giving; i.e., the core developers or owners of popular and widely used projects are more likely to receive sponsorship. Since some of the money donated to projects is spent on travel and food [53], we think the share of each contributor's contribution needs to be considered to achieve greater equity.

For now, we think that open source developers who want more sponsorship should increase their community visibility through self-promotion and attract more attention by building open source projects that more people use.

**What are the shortcomings of the Sponsor mechanism?** The shortcomings of the Sponsor mechanism fall into three main aspects: usage deficiencies, the targeted objects and supported functions, and personalization problems. At the same time, many developers believe that sponsorship behavior is not suitable for the open source ecosystem: the free nature of OSS leads to an unwillingness to pay. This finding shows that, in addition to the problems with the mechanism itself, donations are not perfectly adapted to the open source ecosystem. The passivity, uncertainty, and instability inherent to donations make it difficult for maintainers to rely on them and continue to make open source contributions over the long term. At the same time, the lack of reasonable evaluations of contributions and funding allocation makes it difficult for sponsors to determine whom to sponsor and by how much. Thus, the bounty approach of "getting paid to do more" is preferred by some people over the donation approach, because they can get paid immediately for their work and have more precise goals [77]. How to retain the advantages of bounties while avoiding money becoming the guiding force of open source development may be a goal of future monetary incentive system design. For more specific system design recommendations, see Section 5.4.

Overall, the Sponsor mechanism is a good attempt and an essential step toward achieving reasonable and effective open source financial support. As of now, the mechanism still needs further improvement to meet the needs of more developers.

### 7 THREATS TO VALIDITY

For the questionnaire, we did not attempt to detect carelessly invalid responses [13]. First, the number of questions is small, the time required to answer is short, and there is no overlap between questions, so it is not feasible to judge the validity of responses from the results alone. Second, we did not include attention-check items, in order to keep participation time short.
However, since users needed to click on our questionnaire and jump to the SurveyMonkey site to respond after receiving the email, we think this ensured the validity of the responses we received to some extent. When conducting the second round of the questionnaire survey, to avoid disturbing participants excessively, we sent it only once and did not send second or third reminder emails. At the same time, people who have not set up a Sponsor account may not care about the mechanism. As a result, the response rate was low.

For the ITS analysis, data should ideally be collected for each factor in each time window. However, because the GitHub API does not provide timestamps for some factors, they were measured only at their values at the time of data collection (e.g., in company), as they do not change frequently.

For the hurdle regression, the models included several factors related to developers' sponsorship. However, other factors may also influence whether a developer obtains sponsorship or how much funding is received. Moreover, the number of sponsorships does not accurately indicate the amount of money that a developer receives from donations, as there exist different tiers and sponsors can withdraw their monthly sponsorship at any time. However, we do not have access to data on the actual donations received by each developer. Developers may also obtain donations from other platforms to maintain related projects; we did not consider such funding in total, nor developers' activities on other platforms.

This paper explored only the effectiveness of the Sponsor mechanism for individual users, but the Sponsor mechanism itself can also be used for organizational accounts. To avoid our analysis being confounded by the impact of such users, we processed our data accordingly. Therefore, the results do not apply to GitHub's organizational accounts. According to our statistics, 92% of users who set up Sponsor profiles are individual users.

### 8 CONCLUSION AND FUTURE WORK

This paper took GitHub's Sponsor mechanism as a case study and used a mixed qualitative and quantitative analysis method to investigate four dimensions of the mechanism. Regarding why developers participate in the Sponsor mechanism, we found that participation is mainly related to the use of OSS. Regarding the mechanism's effectiveness, we found that the Sponsor system has only a short-term effect on development activities and that in the long term there is a slight decrease. We studied who obtains more sponsorships and found that the social status of the maintainer in the community correlates most strongly with this outcome (the more followers, the more sponsorships a developer acquires). Regarding the drawbacks of the mechanism, we found that, in addition to the shortcomings in its use, participants felt that the Sponsor mechanism should better attract and support corporate sponsors. Some people thought that the open source donation method needed to be improved to attract more developers to participate. Overall, we have explored the correlation between donation behavior and developers in open source communities using the GitHub Sponsor mechanism.
In future work, we will further explore the following aspects: 1) the advantages and disadvantages of different open source donation platforms and the effectiveness of their incentives for open source activities and 2) different types of open source financial support and the reasonableness and effectiveness of each mode.

ACKNOWLEDGMENTS

This work is supported by the China National Grand R&D Plan (Grant No. 2020AAA0103504). Thanks to all GitHub users who responded to the questionnaire.

REFERENCES

[1] Mark Aberdour. 2007. Achieving quality in open-source software. IEEE Software 24, 1 (2007), 58–64.
[2] Bethany Alender. 2016. Understanding volunteer motivations to participate in citizen science projects: a deeper look at water quality monitoring. Journal of Science Communication 15, 3 (2016), A04.
[3] Alexander Hars and Shaosong Ou. 2002. Working for free? Motivations for participating in open-source projects. International Journal of Electronic Commerce 6, 3 (2002), 25–59.
[4] Maria J Antikainen and Heli K Vaataja. 2010. Rewarding in open innovation communities—how to motivate members. International Journal of Entrepreneurship and Innovation Management 11, 4 (2010), 440–456.
[5] Ashe Dryden. 2013. The ethics of unpaid labor and the OSS community. https://www.ashedryden.com/blog/the-ethics-of-unpaid-labor-and-the-oss-community. [Online; accessed June 8, 2021].
[6] Susanne Beck, Carsten Bergenholtz, Marcel Bogers, Tiare-Maria Brasseur, Marie Louise Conradsen, Diletta Di Marco, Andreas P Distel, Leonard Dobusch, Daniel Dörler, Agnes Effert, et al. 2020. The Open Innovation in Science research field: a collaborative conceptualisation approach. Industry and Innovation (2020), 1–50.
[7] Kenneth P. Burnham and David R. Anderson. 2002. Model Selection and Multimodel Inference: a Practical Information-Theoretic Approach (2nd ed.). Springer.
[8] G. Canfora, L. Cerulo, M. Cimitile, and MD Penta. 2014. How changes affect software entropy: an empirical study. Empirical Software Engineering 19, 1 (2014), 1–38.
[9] Francesco Cappa, Jeffrey Laut, Maurizio Porfiri, and Luca Giustiniano. 2018. Bring them aboard: rewarding participation in technology-mediated citizen science projects. Computers in Human Behavior 89 (2018), 246–257.
[10] Krista Casler, Lydia Buckel, and Elizabeth Hackett. 2013. Separate but equal? A comparison of participants and data gathered via Amazon's MTurk, social media, and face-to-face behavioral testing. Computers in Human Behavior 29, 6 (2013), 2156–2160.
[11] Jacob Cohen, Patricia Cohen, Stephen G West, and Leona S Aiken. 2013. Applied multiple regression/correlation analysis for the behavioral sciences. Routledge.
[12] The SciPy community. 2008. API Reference of scipy.stats.wilcoxon. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html. [Online; accessed July 31, 2021].
[13] Paul G Curran. 2016. Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology 66 (2016), 4–19.
[14] Paul A David and Joseph S Shapiro. 2008. Community-based production of open-source software: What do we know about the developers who participate? Information Economics and Policy 20, 4 (2008), 364–398.
[15] Margret C Domroese and Elizabeth A Johnson. 2017. Why watch bees? Motivations of citizen science volunteers in the Great Pollinator Project.
Biological Conservation 208 (2017), 40–47.
[16] Enrique Estellés-Arolas and Fernando González-Ladrón-de-Guevara. 2012. Towards an integrated crowdsourcing definition. Journal of Information Science 38, 2 (2012), 189–200.
[17] Yulin Fang and Derrick Neufeld. 2009. Understanding sustained participation in open source software projects. Journal of Management Information Systems 25, 4 (2009), 9–50.
[18] Oluwaseyi Feyisetan, Elena Simperl, Max Van Kleek, and Nigel Shadbolt. 2015. Improving paid microtasks through gamification and adaptive furtherance incentives. In Proceedings of the 24th International Conference on World Wide Web. 333–343.
[19] Andrzej Gałecki and Tomasz Burzykowski. 2013. Linear mixed-effects model. In Linear Mixed-Effects Models Using R. Springer, 245–273.
[20] Rishab Aiyer Ghosh. 2005. Understanding free software developers: Findings from the FLOSS study. Perspectives on Free and Open Source Software 28 (2005), 23–47.
[21] GitHub. 2016. Getting Paid for Open Source Work. https://opensource.guide/getting-paid/. [Online; accessed June 8, 2021].
[22] GitHub. 2017. Open Source Survey. https://opensourcesurvey.org/2017/. [Online; accessed June 8, 2021].
[23] GitHub. 2021. About your personal dashboard. https://docs.github.com/en/github/setting-up-and-managing-your-github-user-account/managing-user-account-settings/about-your-personal-dashboard#finding-your-top-repositories-and-teams. [Online; accessed May 24, 2021].
[24] GitHub. 2021. Displaying a sponsor button in your repository. https://docs.github.com/en/github/administering-a-repository/managing-repository-settings/displaying-a-sponsor-button-in-your-repository. [Online; accessed May 22, 2021].
[25] GitHub. 2021. Invest in the software that powers your world. https://github.com/sponsors. [Online; accessed July 30, 2021].
[26] GitHub. 2021. Reference of GraphQL User API. https://docs.github.com/en/graphql/reference/objects#user. [Online; accessed July 30, 2021].
[27] GitHub. 2021. Reference of RESTful List users API. https://docs.github.com/en/rest/reference/users#list-users. [Online; accessed August 1, 2021].
[28] GitHub. 2021. The 2020 State of the Octoverse. https://octoverse.github.com/. [Online; accessed February 4, 2021].
[29] R. J. Grissom and J. J. Kim. 2007. Effect Sizes for Research: A Broad Practical Approach.
[30] Carl Gutwin, Reagan Penner, and Kevin Schneider. 2004. Group awareness in distributed software development. In Proceedings of the 2004 ACM Conference on Computer Supported Cooperative Work. ACM, Chicago, Illinois, USA, 72–81.
[31] Stefan Haefliger, Georg Von Krogh, and Sebastian Spaeth. 2008. Code reuse in open source software. Management Science 54, 1 (2008), 180–193.
[32] Cynthia Harvey. 2017. 35 Top Open Source Companies. https://www.datamation.com/open-source/35-top-open-source-companies. [Online; accessed February 5, 2021].
[33] Andrea Hemetsberger. 2002. Fostering cooperation on the Internet: Social exchange processes in innovative virtual consumer communities. ACR North American Advances 29 (2002), 354–356.
[34] Mokter Hossain. 2012. Users' motivation to participate in online crowdsourcing platforms. In 2012 International Conference on Innovation Management and Technology Research. IEEE, 310–315.
[35] Javier Luis Cánovas Izquierdo and Jordi Cabot. 2018. The role of foundations in open source projects.
In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Society. ACM, Gothenburg, Sweden, 3–12.
[36] S. Jackman, C. Kleiber, and A. Zeileis. 2008. Regression Models for Count Data in R. Journal of Statistical Software 27, 8 (2008), 1–25.
[37] Jaweria Kanwal and Onaiza Maqbool. 2012. Bug Prioritization to Facilitate Bug Report Triage. Journal of Computer Science and Technology 27 (2012), 397–412.
[38] Bran Knowles. 2013. Cyber-sustainability: towards a sustainable digital future. Lancaster University (United Kingdom).
[39] Bruce Kogut and Anca Metiu. 2001. Open-source software development and distributed innovation. Oxford Review of Economic Policy 17, 2 (2001), 248–264.
[40] Sandeep Krishnamurthy and Arvind K Tripathi. 2009. Monetary donations to an open source software platform. Research Policy 38, 2 (2009), 404–414.
[41] Alexandra Kuznetsova, Per B. Brockhoff, and Rune H. B. Christensen. 2017. lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software 82, 13 (2017), 1–26. https://doi.org/10.18637/jss.v082.i13
[42] Karim R. Lakhani and Robert G. Wolf. 2005. Why Hackers Do What They Do: Understanding Motivation and Effort in Free/Open Source Software Projects. MIT Press, Cambridge.
[43] Lincoln R Larson, Caren B Cooper, Sara Futch, Devyani Singh, Nathan J Shipley, Kathy Dale, Geoffrey S LeBaron, and John Y Takekawa. 2020. The diverse motivations of citizen scientists: Does conservation emphasis grow as volunteer participation progresses? Biological Conservation 242 (2020), 108428.
[44] Huigang Li, Yue Yu, Tao Wang, Gang Yin, Shanshan Li, and Huaimin Wang. 2021. Are You Still Working on This? An Empirical Study on Pull Request Abandonment. IEEE Transactions on Software Engineering (2021), 1–1. https://doi.org/10.1109/TSE.2021.3053403
[45] Debra J Mesch, Patrick M Rooney, Kathryn S Steinberg, and Brian Denton. 2006. The effects of race, gender, and marital status on giving and volunteering in Indiana. Nonprofit and Voluntary Sector Quarterly 35, 4 (2006), 565–587.
[46] Nadia Eghbal. 2015. A handy guide to financial support for open source. https://github.com/nayafia/lemonade-stand/blob/master/README.md. [Online; accessed June 8, 2021].
[47] Keitaro Nakasai, Hideaki Hata, and Kenichi Matsumoto. 2018. Are donation badges appealing? A case study of developer responses to Eclipse bug reports. IEEE Software 36, 3 (2018), 22–27.
[48] Keitaro Nakasai, Hideaki Hata, Saya Onoue, and Kenichi Matsumoto. 2017. Analysis of donations in the Eclipse project. In 8th International Workshop on Empirical Software Engineering in Practice (IWESEP). IEEE, Tokyo, Japan, 18–22.
[49] Cassandra Overney. 2020. Hanging by the Thread: An Empirical Study of Donations in Open Source. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings (Seoul, South Korea) (ICSE '20). Association for Computing Machinery, New York, NY, USA, 131–133. https://doi.org/10.1145/3377812.3382170
[50] Cassandra Overney, Jens Meinicke, Christian Kästner, and Bogdan Vasilescu. 2020. How to Not Get Rich: An Empirical Study of Donations in Open Source. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE '20). Association for Computing Machinery, New York, NY, USA, 1209–1221.
https://doi.org/10.1145/3377811.3380410
[51] Patrícia Tiago, Maria João Gouveia, César Capinha, Margarida Santos-Reis, and Henrique M Pereira. 2017. The influence of motivational factors on the frequency of participation in citizen science activities. Nature Conservation 18 (2017), 61.
[52] Cassandra Overney. 2020. Become a sponsor to Super Diana. https://github.com/sponsors/alphacentauri2. [Online; accessed May 26, 2021].
[53] SurveyMonkey. 1999. https://www.surveymonkey.com/. [Online; accessed May 26, 2021].
[54] Andrew Schofield and Grahame S. Cooper. 2006. Participation in Free and Open Source Communities: An Empirical Study of Community Members' Perceptions. In Open Source Systems, Ernesto Damiani, Brian Fitzgerald, Walt Scacchi, Marco Scotto, and Giancarlo Succi (Eds.). Springer US, Boston, MA, 221–231.
[55] Manuel Sojer and Joachim Henkel. 2010. Code reuse in open source software development: Quantitative evidence, drivers, and impediments. Journal of the Association for Information Systems 11, 12 (2010), 2.
[56] Diana Super. 2020. Become a sponsor to Super Diana. https://github.com/sponsors/0xTim. [Online; accessed May 26, 2021].
[57] Asher Trockman, Shurui Zhou, Christian Kästner, and Bogdan Vasilescu. 2018. Adding Sparkle to Social Coding: An Empirical Study of Repository Badges in the Npm Ecosystem. In Proceedings of the 40th International Conference on Software Engineering (Gothenburg, Sweden) (ICSE '18). Association for Computing Machinery, New York, NY, USA, 511–522. https://doi.org/10.1145/3180155.3180209
[58] Liam Tung. 2020. Redis database creator Sanfilippo: Why I'm stepping down from the open-source project. https://www.zdnet.com/article/redis-database-creator-sanfilippo-why-im-stepping-down-from-the-open-source-project/. [Online; accessed June 8, 2021].
[59] Steven J. Vaughan-Nichols. 2021. Hard work and poor pay stresses out open-source maintainers. https://www.zdnet.com/article/hard-work-and-poor-pay-stresses-out-open-source-maintainers/. [Online; accessed June 8, 2021].
[60] Georg Von Krogh, Stefan Haefliger, Sebastian Spaeth, and Martin W. Wallin. 2012. Carrots and Rainbows: Motivation and Social Practice in Open Source Software Development. MIS Quarterly 36, 2 (June 2012), 649–676.
[61] Jing Wang, Patrick C. Shih, and John M. Carroll. 2015. Revisiting Linus's law: Benefits and challenges of open source software peer review. International Journal of Human-Computer Studies 77 (2015), 52–65. https://doi.org/10.1016/j.ijhcs.2015.01.005
[62] John Willinsky. 2005. The unacknowledged convergence of open source, open access, and open science. First Monday 10, 8 (Aug. 2005). https://doi.org/10.5210/fm.v10i8.1265
[63] Sarah Wiseman, Anna L Cox, Sandy JJ Gould, and Duncan P Brumby. 2017. Exploring the effects of non-monetary reimbursement for participants in HCI research. Human Computation (2017).
[64] Bo Xu, Donald R. Jones, and Bingxia Shao. 2009. Volunteers' involvement in online community based software development. Information & Management 46, 3 (2009), 151–158. https://doi.org/10.1016/j.im.2008.12.005
[65] Bo Xu and Dahui Li. 2015. An empirical study of the motivations for content contribution and community participation in Wikipedia. Information & Management 52, 3 (2015), 275–286.
[66] Yue Yu, Gang Yin, Huaimin Wang, and Tao Wang. 2014. Exploring the Patterns of Social Behavior in GitHub.
In Proceedings of the 1st International Workshop on Crowd-Based Software Development Methods and Technologies (Hong Kong, China) (CrowdSoft 2014). Association for Computing Machinery, New York, NY, USA, 31–36. https://doi.org/10.1145/2666539.2666571
[67] Xunzhao Zhang, Tao Wang, Yue Yu, Quheng Zeng, Zhiying Li, and Huaimin Wang. 2022. Questionnaire design for GitHub Sponsor mechanism. https://doi.org/10.5281/ZENODO.5715824
[68] Yangyang Zhao, Alexander Serebrenik, Yuming Zhou, Vladimir Filkov, and Bogdan Vasilescu. 2017. The impact of continuous integration on other software development practices: A large-scale empirical study. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE.

### A OTHER PLATFORMS BESIDES THE SPONSOR MECHANISM

Table 9: Other platforms for obtaining OSS financial support

| Name | URL |
|-----------------------------|------------------------------------------|
| Bountysource | https://www.bountysource.com |
| Flattr | https://flattr.com |
| IssueHunt | https://issuehunt.io |
| Kickstarter | https://www.kickstarter.com |
| Liberapay | https://liberapay.com |
| Gittip | https://gratipay.com |
| Gratipay | https://gratipay.com |
| OpenCollective | https://opencollective.com |
| Otechie | https://otechie.com |
| Patreon | https://www.patreon.com |
| PayPal | https://www.paypal.com |
| Tidelift | https://tidelift.com |
| Tip4Commit | https://tip4commit.com |
| LFX Mentorship (formerly CommunityBridge) | https://lfx.linuxfoundation.org/tools/mentorship |
| Ko-fi | https://ko-fi.com |