adaptation-slr/models/bertopic_docs.txt

Depended Broke Empirical Study Manifesting Breaking Changes Client Packages DANIEL VENTURINI Federal University Technology UTFPR Brazil FILIPE ROSEIRO COGO Huawei Technologies Canada IVANILTON POLATO Federal University Technology UTFPR Brazil MARCO GEROSA Northern Arizona University NAU United States IGOR SCALIANTE WIESE Federal University Technology UTFPR Brazil Complex systems network dependencies Developers often configure package managers eg npm automatically update dependencies publication new releases containing bug fixes new features dependency release introduces backwardincompatible changes commonly known breaking changes dependent packages may build anymore may indirectly impact downstream packages impact breaking changes dependent packages recover breaking changes remain unclear close gap investigated manifestation breaking changes npm ecosystem focusing cases packages’ builds impacted breaking changes dependencies measured extent breaking changes affect dependent packages analyses show around 12 dependent packages 14 releases impacted breaking change updates nonmajor releases dependencies observed manifesting breaking changes 44 introduced minor patch releases principle backward compatible Clients recovered breaking changes half cases frequently upgrading downgrading provider’s version without changing versioning configuration package manager expect results help developers understand potential impact changes recover CCS Concepts • engineering → evolution Additional Key Words Phrases Breaking changes Semantic Version npm dependency management change impact ACM Reference format Daniel Venturini Filipe Roseiro Cogo Ivanilton Polato Marco Gerosa Igor Scaliante Wiese 2023 Depended Broke Empirical Study Manifesting Breaking Changes Client Packages ACM Trans Softw Eng Methodol 32 4 Article 94 May 2023 26 pages httpsdoiorg1011453576037 work partially supported National Science Foundation Grant Number IIS1815503 CNPqMCTIFNDCT grant 40881220214 MCTICCGIFAPESP 
grant 2021066621 Authors’ addresses Venturini Polato Wiese Federal University Technology UTFPR Campo Mourão Paraná Brazil emails danielventurinialunosutfpredubr ipolatoigorutfpredubr F R Cogo Huawei Technologies Kingston Canada email filipecogogmailcom Gerosa Northern Arizona University NAU Flagstaff AZ email MarcoGerosanauedu 1 INTRODUCTION Complex systems commonly built upon dependency relationships client package reuses functionalities provider packages turn depend packages automate process installing upgrading configuring removing dependencies package managers npm Maven pip Cargo widely adopted Despite many benefits brought reuse provider packages one main risks client packages face breaking changes 21 Breaking changes backwardincompatible changes performed provider package renders client package build defective eg change provider’s API client packages configure package managers automatically accept updates range provider package versions breaking change serious consequence catching clients guard example npm packages follow Semantic Versioning specification 23 clients adopt configurations automatically update minor patch releases providers principle release types contain breaking changes semantic version posits major updates contain breaking changes However minor patch releases occasionally introduce breaking changes generate unexpected errors client packages breaking changes manifest clients Due transitive nature dependencies
package managers unexpected breaking changes potentially impact large proportion dependency network preventing several packages performing successful build Research shown providers occasionally incorrectly use Semantic Versioning specification 15 npm ecosystem prior research shown provider packages indeed publish releases containing breaking changes 14 15 18 19 However studies provide limited information regarding prevalence breaking changes focusing API breaking changes without clarifying client packages solve problems cause article fill gap conducting empirical study npm projects hosted GitHub verifying frequency types breaking changes manifest defects client packages clients recover npm main package manager JavaScript programming language 1 million packages estimated 97 web applications come npm 1 making extensive dependency network 9 employed mixed methods identify analyze types manifesting breaking changes—changes provider release render client’s build defective—and client packages deal projects article study cases breaking change manifest projects research answers following questions RQ1 extent breaking changes manifest client packages analyzed 384 packages selected using random sampling approach 95 confidence level ±5 confidence interval select client packages least one provider found manifesting breaking changes impacted 117 client packages regardless releases 139 releases addition 26 providers introduced manifesting breaking changes RQ2 changes provider packages manifest breaking change main causes manifesting breaking changes feature modifications change propagation among dependencies data type modifications also verified equal proportion manifesting breaking changes introduced minor patch releases approximately 44 release type Providers fixed manifesting breaking change cases introduced minor patch releases 464 615 respectively Finally manifesting breaking changes documented issue reports pull requests changelogs 781 cases RQ3 client packages recover 
manifesting breaking changes Client packages recovered manifesting breaking changes 391 cases recovery took 134 days providers fix break clients recovered first providers released fix manifesting breaking change took median 7 days Upgrading provider frequent way client packages recover manifesting breaking change article contributes literature providing quantitative qualitative empirical evidence phenomenon manifesting breaking changes npm ecosystem qualitative study may help developers understand types changes manifest defects client packages strategies used recover breaking changes also provide several suggestions clients providers enhance quality release processes additional contribution created pull requests real manifesting breaking change cases yet resolved half merged
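The sample of 384 packages quoted for RQ1 (95% confidence level, ±5% interval) matches what Cochran's formula with a finite population correction gives for a population of 410,433 packages. The sketch below is one common way to arrive at that figure. The article does not state which formula was used, so the function name `sampleSize` and its default parameters are assumptions made for illustration.

```javascript
// Cochran's sample-size formula with finite population correction.
// z: z-score for the confidence level (1.96 corresponds to 95%)
// e: margin of error (0.05 corresponds to +/-5%)
// p: assumed proportion (0.5 is the most conservative choice)
function sampleSize(population, z = 1.96, e = 0.05, p = 0.5) {
  // Size needed for an effectively infinite population.
  const n0 = (z * z * p * (1 - p)) / (e * e);
  // Shrink it slightly for a finite population.
  const n = n0 / (1 + (n0 - 1) / population);
  return Math.ceil(n);
}

console.log(sampleSize(410433)); // 384
```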
::::
2 DEFINITIONS SCOPE MOTIVATING EXAMPLES section defines terms used article describes motivating examples research 21 Glossary Definitions following describe terms definitions use article based related work 7 11 17 Provider package release package release provides features resources use package releases Figure 1 package express provider embercli bodyparser provider express refer provider package P transitive provider want emphasize P provider packages instance Figure 1 bodyparser provider express bodyparser also bytes provider scenario consider bodyparser transitive provider Client package release package release uses features resources exposed provider package releases Figure 1 express client bodyparser bodyparser client bytes Direct provider release one directly used client package client explicitly declares dependency Figure 1 express direct provider embercli bytes direct provider bodyparser Indirect provider release package release least one providers uses words provider least one direct client’s providers Figure 1 bodyparser bytes indirect providers embercli bytes indirect provider express Transitive provider release package release one introduced breaking change client example breaking change introduced bytes Figure 1 affects client embercli packages express bodyparser transitive providers breaking change transited packages bodyparser express arrive client embercli transitive providers also impacted breaking change • Version statement client specify provider’s versions packagejson metadata file used npm specify providers versions among purposes version statement contains accepted version provider example version statement following metadata dependencies express 4106 defines client requires express version 4106 • Version range version statement client specify range versionsreleases accepted provider three types ranges Using range client specifies new provider releases supportedaccepted downloadable even ones breaking changes Caret range client specifies new 
provider releases contain new features bug fixes supportedaccepted downloadable breaking changes must avoided default range used npm dependency installed Tilde range range specifies new provider releases contain bug fixes supportedaccepted downloadable breaking changes new features must avoided Steady range range always resolves specific version also known specific range versioning statement range rather specific version npm allows installation steady range using command line option saveexact • Implicit explicit update implicit update happens client receives new provider version due range version packagejson version statement defined range versions example 4106 implicit update happens npm installs version 4109 matches range explicit update takes place client manually updates versioning statement directly packagejson • Manifesting breaking changes provider changes manifest fault client package ultimately breaking client’s build adopted definition breaking change prior literature 3–6 8 15 19 21 includes cases considered breaking changes eg change API effectively used client package Conversely manifesting breaking changes include cases covered prior definitions breaking change eg provider package used way intended provider developer semanticversioncompliant change introduced new release provider causes expected error client package
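The range types and the implicit update described above can be illustrated with a small matcher. This is a simplified sketch, not npm's resolver: `parse`, `compare`, and `satisfies` are hypothetical helpers, only plain `MAJOR.MINOR.PATCH` versions are handled, and the all range is represented by `*` alone. Real projects would normally rely on the `semver` package instead.

```javascript
// Parse "MAJOR.MINOR.PATCH" into numbers; pre-release tags are out of scope.
function parse(version) {
  return version.split(".").map(Number);
}

// Compare two parsed versions: negative, zero, or positive (sort-callback style).
function compare(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Simplified check for the four range types: all, caret, tilde, steady.
function satisfies(version, statement) {
  if (statement === "*") return true;                 // all range
  const v = parse(version);
  const op = statement[0];
  if (op !== "^" && op !== "~") {
    return compare(v, parse(statement)) === 0;        // steady range
  }
  const base = parse(statement.slice(1));
  if (compare(v, base) < 0) return false;             // never below the base
  if (op === "~") {
    return v[0] === base[0] && v[1] === base[1];      // tilde: patch updates only
  }
  // Caret: the leftmost non-zero component must stay fixed.
  if (base[0] > 0) return v[0] === base[0];
  if (base[1] > 0) return v[0] === 0 && v[1] === base[1];
  return v[0] === 0 && v[1] === 0 && v[2] === base[2];
}

// Implicit update from the example above: 4.10.9 matches the caret range.
console.log(satisfies("4.10.9", "^4.10.6")); // true
console.log(satisfies("4.11.0", "~4.10.6")); // false
```

With a caret range a client silently receives minor and patch releases, which is where the manifesting breaking changes studied here enter.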
::::
22 Motivating Examples found following two examples manifesting breaking changes manual analysis following Listings red lines removed source code whereas blue lines inserted source code manual analysis Section 321 consists executing client tests suite releases analyzing executions run error client assetgraphbuilder700 provider assetgraph600 provider terser400 due range versions npm installed terser4610 Release 430 terser introduces change default enables wrapping functions parsing shown Listing 1 javascript terser421 without default wrapping behavior foofunction terser430 default wrapping behavior foofunction Listing 1 Diff terser421 terser430 default behavior 1httpsgithubcomtersertersercomparev421v430 change breaks assetgraphbuilder700’s tests feature turned default behavior client assetgraphbuilder800 adopts test make compatible terser’s behavior shown Listing 2 javascript expect javaScriptAssets0text match SockJSsSdefinemainfunction SockJSsSdefinemainfunction Listing 2 Diff assetgraph800 client’s tests adjusting breaking change Sometimes provider changes break client long introduction occurred client package emberclichartjs211 Figure 2 release 104 embercliqunit lefttree introduced change lead breaking change However almost 3 years later embercliqunit used together release 131 provider broccoliplugin middletree breaking change manifested November 2015 provider embercliqunit104 fixed error code changing returned object type function lintTree shown Listing 3 Despite type change break client released fix retained releases embercliqunit javascript lintTree functiontype tree Skip useLintTree false thisoptionsembercliqunit return tree Fakes empty broccoli tree return inputTree tree rebuild function return Listing 3 embercliqunit104 object type change Almost 3 years later August 2018 provider broccoliplugin131 released middletree Figure 2 fix bug Listing 4 javascript function isPossibleNodenode return typeof node string node null typeof node object var type typeof node 
2 httpsgithubcomterserterserissues496 3 httpsgithubcomassetgraphassetgraphbuildercommite4140416e7feaa3d088cf3ad0229fd677ff36dbc 4 httpsgithubcomembercliembercliqunitcommit6fdfe7d 5 httpsgithubcombroccolijsbroccoliplugincommit3f9a42b Release 131 broccoliplugin package experienced manifesting breaking change due fix provider embercliqunit104 released almost 3 years prior manifesting breaking change occurred emberclichartjs’ dependency tree evolved time due range versions shown Figure 2 causing break package emberclichartjs211 installed April 2020 date analysis installation failed due integration broccoliplugin131 changes embercliqunit Fifteen days later embercliqunit143 fixed issue embercliqunit’s object type changed 15day period manifesting breaking change remained unresolved broccoliplugin received 384k downloads npm scenario shows even popular mature projects affected breaking changes Although recognize download count necessarily reflect popularity package use metric illustrative example many client packages might impacted provider package
::::
3 STUDY DESIGN section describes collected data Section 31 motivation approach RQ Section 32 31 Data Collection 311 Obtaining Metadata npm Packages first part Figure 3 shows approach sampling database initially gathered metadata files ie packagejson files published packages npm registry December 20 2010 April 01 2020 accounting 1233944 packages range refers oldest checkpoint could retrieve recent one started study ignored packages providers packagejson since cannot considered client packages therefore suffer breaking changes filtering packages without provider dataset comprises 987595 packagejson metadata files release package recorded timestamp release name providers respective versioning statements parsed versioning statements determined resolved provider version time client release Prior works adopted similar approaches studying dependency management 7 29 provider client release retrieved recent provider version satisfied range specified client release ie resolved version Using resolved version determined whether provider changed version two client releases words reproduced adopted versions providers resolving provider version release time client refine sample analyzed two criteria associated packagejson snapshot latest version client packages dataset 6httpsgithubcombroccolijsbroccolimergetreesissues65 7httpsgithubcomembercliembercliqunitcommit59ca6ad 1 packagejson snapshot nonempty entry “script test” field entry differ default Error test specified specified criterion order run automated tests part method detect manifesting breaking changes total 488805 packages remained applying criterion 2 packagejson snapshot entry containing package’s repository URL wanted retrieve information package codebase applying criterion 410433 packages remained dataset
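Resolving a provider version as of a client release can be sketched as follows. The release history is hypothetical, the helper names (`resolveAt`, `providerChanged`, `caretSatisfies`) are illustrative, and only caret ranges with MAJOR >= 1 are supported. The sketch also assumes that the "most recent" version means the highest version number published on or before the client release.

```javascript
const parse = (v) => v.split(".").map(Number);

const compareVersions = (a, b) => {
  const x = parse(a), y = parse(b);
  return x[0] - y[0] || x[1] - y[1] || x[2] - y[2];
};

// Caret check only, assuming MAJOR >= 1 (a simplification).
const caretSatisfies = (version, range) => {
  const base = range.slice(1);
  return parse(version)[0] === parse(base)[0] &&
         compareVersions(version, base) >= 0;
};

// Highest provider version that matches the range and was already
// published when the client release came out; null if there is none.
function resolveAt(range, releases, clientReleaseDate) {
  const cutoff = new Date(clientReleaseDate).getTime();
  const candidates = releases
    .filter((r) => new Date(r.published).getTime() <= cutoff)
    .filter((r) => caretSatisfies(r.version, range))
    .sort((a, b) => compareVersions(a.version, b.version));
  return candidates.length ? candidates[candidates.length - 1].version : null;
}

// A provider "changed" between two client releases if it resolves differently.
function providerChanged(range, releases, earlierDate, laterDate) {
  return resolveAt(range, releases, earlierDate) !==
         resolveAt(range, releases, laterDate);
}

// Hypothetical release history of a provider (not data from the study).
const releases = [
  { version: "1.0.0", published: "2015-01-01" },
  { version: "1.1.0", published: "2015-03-01" },
  { version: "1.2.0", published: "2015-06-01" },
  { version: "2.0.0", published: "2015-07-01" },
];

console.log(resolveAt("^1.0.0", releases, "2015-04-15")); // 1.1.0
console.log(providerChanged("^1.0.0", releases, "2015-04-15", "2015-08-01")); // true
```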
::::
312 Running Clients’ Tests Given size dataset 410000 client packages ran tests random sample 95 confidence level ±5 confidence interval randomly selected 384 packages sample median 55 releases 9 direct providers per package chose study random sample since manual analysis slow run large dataset Section 313 spent month executing method sample ignore packages based number releases providers metric performed manual check selected packages fewer four releases 130 384 checking repositories aiming remove packages real projects lack tests lack code example projects forth removed one package sampled another one following two criteria described second part Figure 3 depicts approach running test scripts release 384 clients client package cloned repository—all client repositories hosted GitHub—and restored work tree releases using respective release tags eg “v100” releases tagged used provided timestamp packagejson metadata restore work tree ie matched release timestamp closest existing commit master branch conducted analysis verified tags timestamps point commit 94 releases tags thus checkout based timestamps reliable untagged releases restoring work tree client release updated versioning statements associated packagejson entry specific resolved provider version see Section 311 excluded file called packagelockjson locks providers’ indirect providers’ versions also executed associated tests release client package whenever provider package changed release potentially introduce manifesting breaking change provider change 1 provider added packagejson 2 resolved version provider changed previous current release client package sought reproduce build environment existed provider changed Therefore executing tests client packages performed besteffort procedure identify Nodejs adopted client package time provider changed every 6 months new major version Nodejs released8 wanted reproduce test results respect time client package published release changed Nodejs version executing client 
package tests selected Nodejs version using two different approaches preferred approach select Nodejs version one specified engines → node field packagejson file9 field allows developers manually specify Nodejs version runs associated code build specific release field set selected latest Nodejs version available10 time client package release Therefore changed Nodejs version executed install script released tests using npm install npm test commands respectively install test commands failed due incompatibilities selected Nodejs version took 10 minutes changed previous major release Nodejs install test commands succeeded used Node Version Manager NVM tool exchange Nodejs versions Additionally also changed npm version according Nodejs version npm package manager Nodejs packages executes install test scripts performed procedure select npm version use installation test runs Finally executed installtest scripts saved results success error client release executing installtest scripts 384 client packages sample discarded 33 packages errors allow execution installtest script releases 15 clients one required files 11 invalid test scripts eg test test 4 listed required files gitignore file specifies untracked files git ignore11 2 required specific database configurations could done 1 package required key access server randomly replaced 33 packages following aforementioned criteria Table 1 shows results execution installtest scripts 384 client packages 3230 releases Since associated providers’ version 2727 releases change tests’ releases executed Finally consider possible manifesting breaking changes cases client packages releases failed installtest scripts replication package including client packages’ sample instruments scripts identified manifesting breaking changes available download httpsdoiorg105281zenodo5558085 8 httpsgithubcomnodejsnodereleasetypes 9 httpsdocsnpmjscomfilespackagejsonengines 10 httpsnodejsorgendownloadreleases 11 httpsgitscmcomdocsgitignore 313 Manual 
Check Failure Cases Detecting Manifesting Breaking Changes failure cases 203 clients 1276 releases execution installtest scripts manually analyzed ones true cases manifesting breaking changes identify breaking changes manifest client package leveraged output logs logs generated npm executing install test scripts generated result executing method described Section 312 see second part Figure 3 failed test result obtained error description associated stack trace differentiated failed test results caused related issue client package eg introduced bug client caused change provider package eg change return type provider’s function obtained stack traces determined whether function provider package called manually investigated positive cases manual investigation sought confirm test failure caused manifesting breaking change introduced provider package first author responsible running tests identifying manifesting breaking changes related releases commits first author also manually analyzed manifesting breaking changes recorded following information number affected versions client whether documentation mentions manifesting breaking change responsible package addressing breaking change provider client client version impacted manifesting breaking change provider version introduced breaking change textual description causes breaking change manifestation eg “The provider function renamed mistake” “The provider normalizeurl100 introduced new function client assetgraph used client forgot update provider version packagejson” “The provider inserts ‘in null body request’” process several rounds discussions performed among authors refine analysis using continuous comparison 22 negotiated agreement 13 negotiated agreement process researchers discussed rationale used categorize code reaching consensus 13 specifically leveraged recorded information manifesting breaking change derive consistent categorization introduced breaking changes RQ2 RQ3 guide new iterations manual analysis 
specifically following set actions performed manual investigation Analyze execution flow determine whether associated function test failure occurred provider client code leveraged stack traces identify function called test failed particular instrumented code provider client packages output necessary information analyze execution flow analyzed variable contents adding call consolelog consoletrace functions part code client package calls function provider example suppose following error appeared “TypeError myObjectcallback function” discover variable content use command consolelogmyObject check whether myObject variable changed null received values Analyze status Continuous Integration CI pipeline compared status CI pipeline originally built release status CI pipeline time manual investigation Since source code client package remains original release installed version analysis use difference status CI pipeline additional evidence test failure caused provider version change clients CI pipelines helpful • Search client fixing commits manually searched recovering commits history commits installed previous releases client package Whenever recovery commit identified reading commit message determined whether error due client provider code example observed cases client updated provider release failed tests also observed following commits provider downgraded commit message “downgrade provider” “fix breaking change” cases considered test failure caused manifesting breaking change • Search related issue reports pull requests hypothesized manifesting breaking change would affect different clients turn would either issue bug report perform fix followed pull request codebase provider package Therefore searched issue reports pull requests error message obtained stack trace collected detailed information error confirm whether due manifesting breaking change introduced provider package • Previous subsequent provider versions test error caused manifesting breaking change downgrading 
previous provider version upgrading subsequent provider version might fix error provider already fixed Subsequent provider versions means provider versions fit versioning statement greater provider version introduced manifesting breaking change ie adopted provider version test failed case uninstalled current version installed previous subsequent versions executed test scripts example client specified provider p p 102 brought breaking change version example 104 installed p102 p103 p105 verify whether error persisted versions
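The last step, retrying the tests with previous and subsequent provider versions, amounts to splitting the in-range versions around the release that introduced the break. The sketch below reproduces the p example (range ^1.0.2, break in 1.0.4). The function name `versionsToRetry` and the published-version list are hypothetical, only caret ranges with MAJOR >= 1 are handled, and restricting the previous versions to the range is an assumption.

```javascript
const parse = (v) => v.split(".").map(Number);

const compareVersions = (a, b) => {
  const x = parse(a), y = parse(b);
  return x[0] - y[0] || x[1] - y[1] || x[2] - y[2];
};

// Caret check only, assuming MAJOR >= 1 (a simplification).
const inCaretRange = (version, range) => {
  const base = range.slice(1);
  return parse(version)[0] === parse(base)[0] &&
         compareVersions(version, base) >= 0;
};

// Versions inside the client's range, split around the breaking release.
function versionsToRetry(range, breakingVersion, published) {
  const inRange = published
    .filter((v) => inCaretRange(v, range))
    .sort(compareVersions);
  return {
    previous: inRange.filter((v) => compareVersions(v, breakingVersion) < 0),
    subsequent: inRange.filter((v) => compareVersions(v, breakingVersion) > 0),
  };
}

// Hypothetical list of published versions of provider p.
const published = ["1.0.1", "1.0.2", "1.0.3", "1.0.4", "1.0.5", "2.0.0"];
const plan = versionsToRetry("^1.0.2", "1.0.4", published);

console.log(plan.previous);   // [ '1.0.2', '1.0.3' ]
console.log(plan.subsequent); // [ '1.0.5' ]
```

Each returned version would then be installed in turn (for example `npm install p@1.0.3`) and the client's test script rerun to see whether the error persists.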
::::
32 Research Questions Motivation Approach section contains motivation approach research questions
::::
321 RQ1 Extent Manifesting Breaking Changes Manifest Client Packages Motivation default npm sets caret range default versioning statement automatically updates minor patch releases Hence manifesting breaking changes introduced minor patch releases inadvertently cause downtime packages downloaded hundreds thousands times per day affecting large body developers Understanding prevalence manifesting breaking changes popular ecosystems npm important help developers assess risks accepting automatic minor patch updates Although prior studies focused frequency API breaking changes 3 breaking changes occur different reasons Determining prevalence broader range breaking change types remains open research problem Approach cases resulted error installtest script determined type error client provider discovered calculated 384 packages 3230 releases percentage cases confirmed manifesting breaking change Considering providers client’s latest releases calculated percentage providers introduced manifesting breaking changes addition calculated many times number releases provider introduced least one manifesting breaking change 322 RQ2 Problems Provider Package Cause Manifesting Breaking Change Motivation Prior studies breaking changes npm ecosystem restricted APIs’ breaking changes 14 However issues provider packages introduce minor patch releases manifest breaking change support developers reason manifesting breaking changes important understand root causes Approach RQ analyzed type changes introduced provider packages bring manifesting breaking change name version provider packages manually analyzed provider’s repository find exact change caused break used following approaches find specific changes introduced providers Using diff tools used diff tools analyze introduced change two releases provider example suppose manifesting breaking change introduced release provider125 case retrieved source code previous versions eg provider124 performed diff versions manually inspect changed 
code Analyzing provider’s commits used provider’s commits analyze changes releases manifesting breaking change provider p verified repository manually analyzed commits ahead behind release tag commit introduced manifesting breaking change Analyzing changelogs Changelogs contain information relevant changes history package used changelogs understand introduced changes release client package verify whether manifesting breaking change fix described also looked issue reports pull requests explanations causes manifesting breaking changes discovering provider changes introduced breaking changes analyzed categorized grouped common issues example related issues changing object types grouped category called Object type changed Furthermore analyzed Semantic Version level introduced fixedrecovered manifesting breaking changes provider client packages verify relationship manifesting breaking changes nonmajor releases analyzed version numbering releases fixed manifesting breaking change manifesting breaking changes documented changelogs issue reports etc Furthermore analyzed depth dependency tree provider introduced manifesting breaking change since 25 npm packages least 95 transitive dependencies 2016 10 323 RQ3 Client Packages Recover Manifesting Breaking Change Motivation breaking change may impact client package implicit explicit update client recovery identified update code waiting new provider’s release performing downgradeupgrade provider’s version Breaking changes may caused either direct indirect provider since client packages depend direct providers many indirect ones 11 breaking change may cascade transitive dependencies remains unfixed Even client packages recover breaking change upgrading newer version provider package client packages manually resolve incompatibilities might exist 12 Understanding breaking changes manifest client packages help developers understand recover Approach retrieved information RQ clients’ repositories searched information error client 
packages recovered manifesting breaking change following information analyzed Commits manually checked subsequent commits client packages pushed repositories provider release introduced respective manifesting breaking change particular searched commits touched packagejson file file history checked provider downgraded upgraded replaced removed Changelogs analyzed client changelogs release notes looking mentions provider updatesdowngrades 48 clients maintained changelog release notes repositories Pull requestsissue reports searched pull requests issue reports client repository contained information manifesting breaking changes example found pull requests issue reports “Update provider” “Fix provider error” title manifesting breaking change case recovered provider’s dependency tree example second motivating example Section 2 recovered dependency tree client package introduced manifesting breaking change resulted broccoliassetrev → broccolifilter → broccoliplugin Figure 2 investigated many breaking change cases introduced direct indirect providers manifesting breaking change introduced fixedrecovered package fixedrecovered fixedrecovered also verified client packages changed provider’s versions associated documentation manifesting breaking changes related time fix
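Recovering the dependency chain between a client and the provider that introduced a break, and labelling that provider as direct or indirect, can be sketched with a breadth-first search. The graph below is a toy example loosely modelled on the chain mentioned above, not data from the study. The edge from the client to broccoli-asset-rev is an assumption, and `findProviderPath` and `classifyProvider` are illustrative names.

```javascript
// Breadth-first search for the shortest chain from client to provider.
// graph maps a package name to the names of its direct providers.
function findProviderPath(graph, client, provider) {
  const queue = [[client]];
  const seen = new Set([client]);
  while (queue.length > 0) {
    const path = queue.shift();
    const current = path[path.length - 1];
    if (current === provider) return path;
    for (const next of graph[current] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push([...path, next]);
      }
    }
  }
  return null; // provider is not reachable from the client
}

// Depth 1 means a direct provider; anything deeper is indirect.
function classifyProvider(path) {
  if (!path || path.length < 2) return "none";
  return path.length === 2 ? "direct" : "indirect";
}

// Toy dependency graph (hypothetical edges).
const graph = {
  "ember-cli-chartjs": ["broccoli-asset-rev"],
  "broccoli-asset-rev": ["broccoli-filter"],
  "broccoli-filter": ["broccoli-plugin"],
};

const path = findProviderPath(graph, "ember-cli-chartjs", "broccoli-plugin");
console.log(path.join(" -> "));
console.log(classifyProvider(path)); // indirect
```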
::::
33 Scope Limitations definition manifesting breaking changes includes cases included prior definitions breaking changes see Section 21 article intend provide direct comparison two phenomena result stated research questions indicate proportion manifest breaking changes fact breaking changes defined prior literature eg API change provider addition since provider packages rarely accompanied formal specification intended behavior impossible scale study differentiate errors manifest client package due breaking changes manifest due idiosyncratic usage provider client package Therefore results stated RQs cannot used assess whether client package could fix build simply updating newer version provider
::::
4 RESULTS section presents associated findings RQ
::::
41 RQ1 Often Manifesting Breaking Changes Occur Client Package Finding 1 117 client packages regardless releases 139 client releases impacted manifesting breaking change 384 client packages 45 117 suffered failing test manifesting breaking change least one release 3230 client releases tests executed 1276 failed errors manually analyzed 450 139 releases error raised provider packages characterizing manifesting breaking change 86 27 releases could identify package raised error Table 2 Results Releases’ Analyses Results Releases Success 1954 605 Fail Client’s errors 479 148 manifesting breaking changes 450 139 Breaking due external changes 261 81 Errors identified 86 27 Total 3230 100 detected 261 81 releases suffered particular error type call breaking due external change releases used provider relied dataresources external APIservice eg Twitter longer available impacting clients’ releases provider cannot fix error resource cases imply detecting manifest breaking changes running clients’ tests introduce false positives simply ignored manual analyses also considered cases provider package removed npm breaking due external change Table 2 shows results analyses releases Finding 2 922 providers introduced single manifesting breaking change sample 47 providers 922 51 introduced single release manifesting breaking change 4 providers introduced two releases manifesting breaking changes detected 55 unique manifesting breaking change cases introduced providers impacted multiple clients example breaking change exhibited Incompatible Providers’ Versions classification Finding 3 impacted six clients Therefore 64 manifesting breaking change cases manifested client packages Finally 1909 providers clients’ latest versions percentage providers introduced manifesting breaking change 26 51 1909 117 clients 139 releases suffered manifesting breaking changes detected failing tests due 2 providers changes 90 introduced manifesting breaking changes single release manifesting breaking 
change 42 RQ2 Issues Provider Package Caused Breaking Change Manifest Finding 3 found eight categories issues grouped manifesting breaking change eight categories depending root cause issue Table 3 presents category number occurrences number impacted client releases following describe category present example found manual analysis Feature change Manifesting breaking changes category related modifications provider features eg default value variables example happens request2170—this version removed npm introduced change remained package—when developers introduced new decision rule codetext12 shown Listing 5 12httpsgithubcomrequestrequestcommitd05b6ba Table 3 Identified Categories Manifesting Breaking Changes Category Cases Releases Feature change 25 391 101 224 Incompatible providers’ versions 15 234 64 142 Object type changed 9 141 213 473 Undefined object 5 78 28 62 Semantically wrong code 5 78 14 31 Failed provider update 2 31 24 53 Renamed function 2 31 2 04 File found 1 16 4 09 Total 64 450 Listing 5 Example manifesting breaking change categorized feature change javascript debugemitting complete selfurihref ifresponsebody undefined selfjson responsebody selfemitcomplete response responsebody Listing 5 provider request assigns empty string responsebody variable instead preserving responsebody default undefined value Incompatible providers’ versions category client breaks change indirect provider example happens packages babeleslint escope escope indirect provider babeleslint javascript visitClass key visitClass value function visitClassnode Listing 6 Incompatible providers’ versions example release escope34 introduced presented change Listing 6 change impacted package babeleslint even though escope direct provider babeleslint manifesting breaking change remained unresolved single day babeleslint received 80k downloads npm Object type changed detected nine 1406 cases provider changed type object resulting breaking change client packages javascript thissetup 
thissockets thissockets thisnsps thisconnect Buffer var socket nspaddthis function selfsocketspushsocket selfsocketssocketid socket selfnspsnspname socket Listing 7 Object type changed example 13 httpsgithubcombabelbabeleslintissues243 14 httpsgithubcomestoolsescopeissues99issuecomment178151491 Listing 7 provider socketio140 turned array object 15 simple change broke many socketio’s clients even package karma 16 browser test runner forced update code 17 publish karma01319 single day manifesting breaking change remained unresolved karma downloaded 146k times npm Undefined object category undefined object causes runtime exception breaks provider throws exception client package javascript appoptions appoptions appoptionsbabel appoptionsbabel appoptionsbabelplugins appoptionsbabelplugins Listing 8 Undefined object code example error happened provider emberclihtmlbarsinlineprecompile013 solved shown Listing 8 18 Failed provider update category provider updates provider B provider update code work new provider B detected two cases category addition explicit update one provider category specified provider B acceptall range >= time provider B published major release introduced manifesting breaking change Despite provider specifying accept range consider implicit update provider B client suffered error Semantically wrong code Manifesting breaking changes category happen provider writes semantically wrong code generating error runtime process 19 affecting client errors could caught compiletime compiled language JavaScript errors happen runtime occurred provider frontmatter020 four cases javascript const separators yaml const pattern pattern const pattern yaml Listing 9 Semantically wrong code example Listing 9 provider repeated variable name pattern declaration generated semantic error Although error easily detected fixed provider did 20 Listing 9 provider took almost 1 year fix 
frontmatter022 Meanwhile frontmatter received 366 downloads period Renamed function manifesting breaking changes category occur functions renamed analysis revealed two cases functions renamed renaming case first motivating example Section 2 describe second one javascript RedisClientprototypesendcommand function command args callback var argscopy arg prefixkeys RedisClientprototypeinternalsendcommand function command args callback var arg prefixkeys Listing 10 Renamed function code example 15 httpsgithubcomsocketiosocketiocommitb73d9be 16 httpsgithubcomsocketiosocketioissues2368 17 httpsgithubcomkarmarunnerkarmacommit3ab78d6 18 httpsgithubcomembercliemberclihtmlbarsinlineprecompilepull5commitsb3faf95 19 httpshacksmozillaorg201702acrashcourseinjustintimejitcompilers 20 httpsgithubcomjxsonfrontmattercommitf16fc01 Table 4 Manifesting Breaking Changes Semantic Version Level Levels Major 3 47 Minor 28 4375 Patch 28 4375 Prerelease 5 78 Total 64 100 provider redis2601 renamed function Listing 10 21 However function used client package fakeredis 22 broke change Client package fakeredis103 recovered error downgrading redis2600 23 5day period within manifesting breaking change fixed fakeredis received 23k downloads npm File found cases category provider removes file adds version control ignore list gitignore client tries access unique case category sample provider referenced file added ignore list Finding 4 Manifesting breaking changes often introduced patch releases shown Table 4 64 cases manifesting breaking changes analyzed 3 cases introduced major releases 28 minor releases 28 patch releases 5 prereleases Although analyzed manifesting breaking changes minor patch releases three cases manifesting breaking changes introduced major levels indirect provider transitively affected client packages—as jsdom16 case see Section 2 Prereleases precede stable release considered unstable anything may 
change stable version released24 detected breaking changes prereleases providers introduced unstable changes prereleases propagated changes stable versions example prerelease redis2601 described Section 322 whose rename function propagated stable version caused failure client packages Finding 5 Manifesting breaking change fixesrecoveries introduced clients andor providers searched identify package fixedrecovered manifesting breaking changes—client provider—and level fixedrecovered release published depicted Figure 4 Figure 4 shows client packages recover nearly half manifesting breaking changes introduced minor updates turn 769 manifesting breaking changes introduced providers minor release fixed patch release Providers fix majority manifesting breaking changes introduced patch releases 464 time typically patch release 615 Finding 6 219 manifesting breaking changes documented Although clients providers often document occurrence repair manifesting breaking change issue reports pull requests changelogs onefifth manifesting breaking changes undocumented 21httpsgithubcomNodeRedisnoderediscommit861749f 22httpsgithubcomNodeRedisnoderedisissues1030issuecomment205379483 23httpsgithubcomhdachevfakerediscommit01d1e99 24httpssemverorgspecitem9 Table 5 shows client provider packages documented manifesting breaking changes 781 manifesting breaking changes cases documentation 70 one type documentation example provider received issue report fixed manifesting breaking change documented changelog Documenting manifesting breaking changes fixes supports client recovery Section 323 Finding 7 578 manifesting breaking changes introduced indirect provider Indirect providers might also introduce manifesting breaking changes propagate client Table 6 shows depth level dependency tree provider introduced manifesting breaking change 422 manifesting breaking changes introduced direct provider client’s packagejson providers ones client directly installs perform function calls code first depth 
level dependency tree Manifesting breaking changes introduced indirect providers depth level greater 1 represent 578 cases Six cases third depth level single one fourth depth level Clients install providers directly rather come direct provider cases manifesting breaking change may totally unclear client packages since typically unaware providers direct control installation Table 7 Packages FixingRecovering Error Fixed byRecovered Provider 32 50 Client 13 203 Transitive provider 12 188 Client Transitive provider 25 391 fixedrecovered 7 109 Total 64 100 frequent issues provider packages introduced manifesting breaking changes feature changes incompatible providers object type changes Provider packages introduced manifesting breaking changes similar rates minor patch releases fixed manifesting breaking changes providers fixed patch releases Manifesting breaking changes documented 781 cases mainly issue reports Indirect providers introduced manifesting breaking changes cases 43 RQ3 Client Packages Recover Manifesting Breaking Change Finding 8 Clients transitive providers recover breaking changes 391 cases dependency tree transitive provider located provider introduced manifesting breaking change client manifested see Section 21 Table 7 shows package fixedrecovered manifesting breaking change case provider packages fixed majority manifesting breaking changes Since introduced breaking change theoretically expected behavior Client packages recovered manifesting breaking change 203 cases transitive providers recovered manifesting breaking changes 188 cases provider introduced manifesting breaking change fix transitive provider may fix solve client’s issue Since transitive providers also clients providers introduced manifesting breaking change clients clients transitive providers recovered breaking changes 391 cases observation suggests client packages occasionally work patch manifesting breaking change introduced since 391 cases clients transitive providers need take 
actions recover manifesting breaking change Finding 9 Transitive providers fix manifesting breaking changes faster packages manifesting breaking change introduced fixed either provider introduced transitive provider cases client package also recover Table 8 shows time package takes fix breaking change general manifesting breaking changes fixed 7 days provider packages Even relatively short period time many direct indirect clients affected Transitive providers fix manifesting breaking changes faster clients even providers Since manifesting breaking change exists raised client packages transitive providers break first need quick fix transitive providers usually spent 4 days fix break Meanwhile providers introduced manifesting breaking change take median 7 days introduce fix cases provider neglected introduce fix took longer client client packages took comparably lengthy 134 days mean 286 SD 429 recover manifesting breaking change According Table 7 direct providers transitive providers fixed manifesting breaking changes 788 clients slow recover However transitive providers also clients analyze time clients transitive providers spend fixrecover manifesting breaking change Clients transitive providers recovered manifesting breaking change around 82 days Finding 10 Upgrading frequent way recover manifesting breaking change Table 9 describes clients recovered breaking changes 48 cases provider version changed cases 714 client packages upgraded providers’ version analyzed cases clients transitive providers recovered manifesting breaking change changing provider’s version provider fixed error observed upgrade 12 522 cases 23 Thus half cases client transitive providers fixedrecovered manifesting breaking change provider package newer versions client using followup releases provider packages number downgrades transitive provider may explain recover manifesting breaking change faster client packages Since transitive providers also providers fix manifesting breaking change soon 
possible avoiding propagation error caused manifesting breaking change Consequently downgrade stable release provider frequent way transitive providers recover manifesting breaking change Finally provider replaced removed small proportion breaking change raised—about 72 cases combined Finding 11 recover manifesting breaking changes clients often change adopted provider version without changing range automatically accepted versions breaking change manifests clients often update provider’s version Figure 5 shows clients transitive providers updated providers’ versions verified transitive providers never set steady version provider breaking change manifests transitive providers use range provider’s version However single transitive provider changed range caret range steady one eg ˆ121 → 121 recover manifesting breaking change Nevertheless clients used caret range breaking change manifested 385 cases downgraded provider steady version majority manifesting breaking changes introduced clients transitive providers used caret range ast default range statement npm inserts packagejson provider added dependency client package half cases clients changed provider’s version another caret range accept ranges geq less commonly used less common updating Clients transitive provider 605 cases retained range type updated range type caret tilde steady kept provider updateddowngraded example client package specifies provider past120 receives breaking change p132 Whenever provider fixes code client package update example past140 change another range type tilde steady range Client packages recovered manifesting breaking changes 391 cases including clients transitive providers Providers fixed manifesting breaking changes faster client packages recovered manifesting breaking changes updating provider clients preferred update rather downgrade providers provider’s range updated downgraded breaking change around 60 cases change range type
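The range types discussed in Finding 11 (caret, tilde, steady) can be illustrated with a minimal sketch. The helper names `parse`, `compare`, and `satisfies` are hypothetical and are not npm's implementation. The sketch assumes plain `major.minor.patch` versions with a major version of at least 1 and no prerelease tags; real range resolution has more rules than shown here.

```javascript
// Hypothetical helpers for illustration only (not npm's resolver).
// Assumption: versions look like "1.2.3", major >= 1, no prerelease tags.
function parse(version) {
  const [major, minor, patch] = version.split(".").map(Number);
  return { major, minor, patch };
}

// Negative if a < b, zero if equal, positive if a > b.
function compare(a, b) {
  return a.major - b.major || a.minor - b.minor || a.patch - b.patch;
}

function satisfies(version, range) {
  const kind =
    range[0] === "^" ? "caret" : range[0] === "~" ? "tilde" : "steady";
  const base = parse(kind === "steady" ? range : range.slice(1));
  const v = parse(version);

  // Steady range: only the exact version is accepted.
  if (kind === "steady") return compare(v, base) === 0;
  // Caret and tilde never accept anything below the base version.
  if (compare(v, base) < 0) return false;
  // Both stay within the same major version.
  if (v.major !== base.major) return false;
  // Caret accepts new minor and patch releases; tilde only new patches.
  return kind === "caret" || v.minor === base.minor;
}

// The example from the text: a client on ^1.2.0 automatically receives 1.3.2.
console.log(satisfies("1.3.2", "^1.2.0")); // true
console.log(satisfies("1.3.2", "~1.2.0")); // false
console.log(satisfies("1.3.2", "1.2.0"));  // false
```

Under these assumptions, a client that keeps a caret range after recovering stays exposed to the provider's next minor or patch release, whereas switching to a steady range pins one version and stops automatic updates.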
::::
5 DISCUSSION section discusses implications findings dependency management practices Section 51 best practices clients providers follow mitigate impact caused manifesting breaking changes Section 52 also discuss manifestation breaking changes aspects Semantic Versioning npm ecosystem Section 53 51 Dependency Management managing dependencies client packages use dependency bots GitHub Snyk Dependabot receive automatic pull requests new provider’s release 27 bots continuously check new versions providers’ bugsvulnerabilities fixes open pull requests client’s repository updating packagejson including changelogs information provider’s new version Mirhosseini Parnin 16 show packages using bots update dependencies 16x faster manual verification Additionally tools JSFIX 20 helpful upgrading provider releases especially include manifesting breaking changes major releases JSFIX tool designed adapt client code new provider release offering safe way upgrade providers verified small percentage clients recovered manifesting breaking changes removing replacing provider cf Finding 10 may difficult several features resources provider package used client 2 Instead client packages tend temporarily downgrade stable provider version ease process upgradedowngrade providers avoid surprises clients search provider changelogs significant changes verified Finding 6 manifesting breaking changes documented changelogs issue reports pull requests Dependency bots also could analyze content changelogs issue reports create red flags like notifications documentation cites manifesting breaking change Finally client packages may use packagelockjson file better manage dependencies observed Finding 7 indirect providers—the ones depth 2 3 dependency tree—are responsible 578 manifesting breaking changes affect client package Using packagelockjson file client packages stay aware providers’ versions latest successful build provider upgraded due range versions new release manifests breaking change client 
side client still install providers’ versions successfully built client
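The idea of comparing the providers' versions from the latest successful build with the currently resolved ones can be sketched as follows. This is a minimal illustration, assuming lock files in the layout that uses a top-level `packages` map keyed by install path; the function name `changedProviders` and the package names in the sample data are hypothetical.

```javascript
// Hypothetical helper: lists providers whose resolved version differs
// between the lock file of the last successful build and the current one.
// Assumption: both objects have a top-level "packages" map keyed by path.
function changedProviders(lastGoodLock, currentLock) {
  const before = lastGoodLock.packages || {};
  const after = currentLock.packages || {};
  const changes = [];
  for (const [path, entry] of Object.entries(after)) {
    if (path === "") continue; // the empty key describes the client itself
    const previous = before[path];
    if (previous && previous.version !== entry.version) {
      changes.push({ path, from: previous.version, to: entry.version });
    }
  }
  return changes;
}

// Made-up sample data: a direct provider and an indirect (depth 2) provider.
const lastGood = {
  packages: {
    "": { name: "client", version: "1.0.0" },
    "node_modules/direct-provider": { version: "2.1.0" },
    "node_modules/indirect-provider": { version: "3.3.0" },
  },
};
const current = {
  packages: {
    "": { name: "client", version: "1.0.0" },
    "node_modules/direct-provider": { version: "2.1.0" },
    "node_modules/indirect-provider": { version: "3.4.0" },
  },
};

console.log(changedProviders(lastGood, current));
```

In this sample only the indirect provider changed, which is the situation Finding 7 describes: the client did not touch its own dependencies, yet a provider deeper in the tree moved to a new release.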
::::
52 Best Practices Several issues found manual classification manifesting breaking changes Section 322 could avoided use static analysis tools Errors classified Semantically Wrong Code Rename function typically captured tools client provider developers use tools dynamic language JavaScript tools help avoid issues 26 Options JavaScript include jshint jslint standard Tómasdóttir et al 26 Tómasdóttir et al 25 show developers use linters mainly prevent errors bugs mistakes Due dynamic nature JavaScript however static analysis tools cannot verify inherited objects’ properties capture errors classified Change one rule Object type change Undefined object well Rename Function functions objects’ properties Thus developers concerned creating test cases run code along functionality providers client developers find breaking changes affect code Many available frameworks mocha chai ava support tasks tests also executed integrated environments every time developer commits pushes new changes case several tools available Travis Jenkins Drone CI Codefresh Using linters continuous integration systems developers catch errors releasing new version Finally good practice npm packages keep changelog document breaking changes fixes issue reports pull requests practice continue widely adopted since currently around fifth providers cf Finding 6 would also help development automated tools eg bots dealing breaking changes Providers could create issue reports pull request templates allow clients specify consistent descriptions issues found
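The difference between errors that static checks can catch and errors that need tests can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the first snippet mimics the repeated `pattern` declaration from Listing 9, and the second mimics client code that still expects an array after a provider turned it into an object, as in Listing 7.

```javascript
// Returns false when the source has a syntax-level error.
// new Function() compiles the body without running it, so a repeated
// const declaration is reported before any code executes.
function parsesCleanly(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

const repeatedDeclaration = "const pattern = 'a'; const pattern = 'b';";
console.log(parsesCleanly(repeatedDeclaration)); // false

// Hypothetical client code written against an array-valued `sockets`.
function registerSocket(sockets, socket) {
  sockets.push(socket);
  return sockets.length;
}

// A test-style check: does the client code fail for this provider value?
function breaksClient(sockets) {
  try {
    registerSocket(sockets, { id: "s1" });
    return false;
  } catch (err) {
    return err instanceof TypeError;
  }
}

console.log(breaksClient([])); // false: array, as the client expects
console.log(breaksClient({})); // true: object has no push method
```

The first error is visible from the source text alone, which is why a linter can flag it. The second only appears when the client code runs against the changed provider value, which is why the section recommends test cases that exercise provider functionality.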
::::
53 Breaking Changes Manifestation Semantic Versioning Breaking changes often occur npm ecosystem impact client packages cf Finding 1 manifesting cases come indirect providers providers second level deeper dependency tree Findings Decan et al 10 show 2016 half client packages npm least 22 transitive dependencies indirect providers quarter least 95 transitive dependencies context clients may face challenges diagnosing manifesting breaking changes came manifesting breaking change introduced indirect provider client may know provider results show provider packages introduce manifesting breaking changes minor patch levels principle contain backwardcompatible updates according Semantic Versioning specification Semantic Versioning recommendation providers choose use 4 8 providers comply Semantic Versioning several errors might introduced observed Finding 4 manifesting breaking changes prereleases propagated stable releases cf Finding 4 One hypothesis providers might unaware correct use Semantic Versioning rules may explain propagated unstable changes stable releases Finally npm could provide badges provider packages would able explicitly show aware adhere Semantic Versioning Trockman 24 claims developers use visible signals specifically GitHub like badges indicate quality way clients could make better choice providers prefer aware Semantic Versioning
::::
6 RELATED WORK section describes related work regarding breaking changes npm ecosystems Breaking changes npm Bogart et al 5 present survey stability dependencies npm CRAN ecosystem authors interviewed seven package maintainers changes paper interviewees highlighted importance adhering Semantic Versioning avoid issues dependency updates recently authors investigated policies practices 18 ecosystems finding ecosystems share values stability compatibility differ values 4 Kraaijeveld 14 studied API breaking changes three provider packages author uses 3k client packages parsing providers’ clients’ files detect API breaking changes impact clients work identified 98 258 client releases impacted API breaking changes Mezzetti et al 15 present technique called type regression testing verifies type returned object API compares returned type another provider release authors chose 12 popular provider packages major releases applying technique patchminor releases belonging first major update verified type regression 94 minor patch releases research focused kind manifesting breaking changes analyzed client provider packages 139 releases impacted manifesting breaking changes Mujahid et al 19 focus detecting breakinducing versions thirdparty dependencies authors analyzed 290k npm packages flagged downgrade provider version possible breaking change provider versions tested using client tests authors identified 41 fails update resulted downgrade Similar authors resolved client’s providers release ran tests whenever least one provider version changed Møller et al 17 present tool uses breaking change patterns described providers fixes client code analyzed dataset 10 used npm packages searched breaking changes described changelogs compare classification Finding 3 found 153 cases breaking changes introduced major releases claim breaking changes 85 related specific package API points modules properties function changes Considering classification Finding 3 feature changes object type 
changed undefined object renamed function also classified changes package API claim 6406 manifesting breaking changes package API related Breaking changes ecosystems Brito et al 6 studied 400 providers Maven repository 116 days provider packages chosen popularity GitHub authors looked commits introduced API breaking change period Developers asked reasons breaking changes occurred article presents similar results authors claim New Feature frequent way breaking change introduced claim Feature Change main breaking change type Finding 3 Also authors similarly detected breaking changes frequently documented changelogs Finding 6 Foo et al 12 present study API breaking changes Maven PyPI RubyGems ecosystems study focuses detecting breaking changes computing diff code two releases found APIbreaking changes 26 provider packages approach suggests automatic upgrades 10 packages approach goes beyond API breaking changes found 117 client packages impacted manifesting breaking changes
::::
7 THREATS VALIDITY Internal validity breaking change detected verified type change provider package introduced collectively grouped changes categories However cases might fall one category example provider package changes type object changeimprove behavior case might fall Feature change Object type changed categorized case category represents error case since object changed feature change appropriate category would Feature change error cases categorized breaking due external change ones clients providers use—or depend on—external dataresources sites APIs changed time see Finding 1 cases represent 81 client’s releases cases could search manifesting breaking changes could execute release tests dataresource needed test longer available 8 client releases might impacted breaking changes could analyze Construct validity approach detecting breaking changes performed analysis client tests failed client used provider version breaking change client call function causes breaking change tests exercise code could detect breaking change call cases manifesting breaking changes Therefore might detected API breaking changes able detect API name changes API removal Parameter changes may detected JavaScript allows making call API number parameters25 restored working tree index respective commit tagged developer release listed tags repository used checkout respective tag However untagged releases performed checkout timestamp referenced packagejson trusted timestamp verified tags timestamp point commit 94 cases tagged repositories 25httpseloquentJavaScriptnet03functionshtmlpkzCivbonMM Lastly mention file npmshrinkwrapjson study file intended work like file packagelockjson controlling transitive dependency updates may published along package However npm strongly recommend avoiding use Also existence npmshrinkwrapjson files play major role study affect results based adopted research method include study External validity randomly selected client packages varied release numbers clients 
providers size However since analyzed npm packages hosted GitHub projects findings cannot directly generalized settings also important state representativeness also limited npm increases number packages releases daily Future work replicate study platforms ecosystems Finally since number projects sample small enough statistical power perform hypothesis tests around results involve packagelevel comparisons Conclusion validity Conclusion validity relates inability draw statistically significant conclusions due lack large enough data sample However research used qualitative approach mitigate potential conclusion threat conducting sanity check repositories client packages fewer four releases guarantees packages intended use production Section 312 Finally manifesting breaking changes claim work manually analyzed ensure legitimate breaking changes impact clients real world Section 313
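The construct validity remark that parameter changes may go undetected can be illustrated with a small sketch. The function `send` is hypothetical and stands for a provider API whose parameter list changed between releases.

```javascript
// Hypothetical provider function that declares two parameters.
function send(command, args) {
  return `${command}:${(args || []).length}`;
}

// JavaScript does not check the number of arguments at the call site:
// an extra argument is ignored and a missing one is undefined.
console.log(send("get", ["key"], () => {})); // "get:1"
console.log(send("get"));                    // "get:0"
console.log(send.length);                    // 2 declared parameters
```

Because neither call throws, a client test fails only if it asserts on behavior that depends on the changed parameter.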
::::
8 CONCLUSIONS reuse widely adopted practice package ecosystems npm support reusing packages However breaking changes negative side effect reuse Breaking changes impacts studied literature several ecosystems 3 6 18 28 papers examine breaking changes npm ecosystem client packages perspective ie executing client tests verify impact breaking changes 5 15 19 work analyzed manifesting breaking changes npm ecosystem client provider perspectives providing empirical analysis regarding breaking changes minor patch levels client’s perspective analyzed impact manifesting breaking changes found 117 clients impacted changes offer advice help clients automated tool developers discover avoid recover manifesting breaking changes Clients use dependency bots accelerate process upgrading providers clients look changelog files nondesired updating breaking changes provider’s perspective analyzed frequent causes manifesting breaking changes found common causes providers changed rulesbehaviors features stable last releases object type changed unintentionally undefined objects runtime Maintainers pay attention code review phases regarding issues Future research look correlation among package characteristics metrics breaking change occurrence REFERENCES 1 2018 year JavaScript 2018 review npm’s predictions 2019 Dec 2018 httpsblognpmjsorgpost180868064080thisyearinjavascript2018inreviewandnpmshtml 2 Hussein Alrubaye Mohamed Wiem Mkaouer 2018 Automating detection thirdparty java library migration function level Proceedings 28th Annual International Conference Computer Science Engineering CASCON’18 60–71 3 Christopher Bogart Christian Kästner James Herbsleb Ferdian Thung 2016 break API Cost negotiation community values three ecosystems Proceedings 2016 24th ACM SIGSOFT International Symposium Foundations Engineering FSE’16 109–120 httpsdoiorg10114529502902950325 4 Chris Bogart Christian Kästner James Herbsleb Ferdian Thung 2021 make breaking changes Policies practices 18 open source ecosystems 
ACM Trans Softw Eng Methodol 30 4 Article 42 July 2021 56 pages httpsdoiorg1011453447245 5 C Bogart C Kästner J Herbsleb 2015 breaks breaks ecosystem developers reason stability dependencies 2015 30th IEEEACM International Conference Automated Engineering Workshop ASEW’15 86–89 httpsdoiorg101109ASEW201521 6 Brito L Xavier Hora Valente 2018 Java developers break APIs 2018 IEEE 25th International Conference Analysis Evolution Reengineering SANER’18 Campobasso Mulise Italy 255–265 7 F R Cogo G Oliva E Hassan 2019 empirical study dependency downgrades npm ecosystem IEEE Transactions Engineering Nov 2019 1–13 8 Decan Mens 2019 package dependencies tell us semantic versioning IEEE Transactions Engineering May 2019 1226–1240 9 Alexandre Decan Tom Mens Maelick Claes 2016 topology package dependency networks comparison three programming language ecosystems Proceedings 10th European Conference Architecture Workshops ECSAW’16 Article 21 4 pages httpsdoiorg10114529934123003382 10 Decan Mens Claes 2017 empirical comparison dependency issues OSS packaging ecosystems 2017 IEEE 24th International Conference Analysis Evolution Reengineering SANER’17 2–12 11 Alexandre Decan Tom Mens Philippe Grosjean 2019 empirical comparison dependency network evolution seven packaging ecosystems Empirical Engineer 24 1 Feb 2019 381–416 httpsdoiorg101007s106640179589y 12 Darius Foo Hendy Chua Jason Yeo Ming Yi Ang Asankhaya Sharma 2018 Efficient static checking library updates Proceedings 2018 26th ACM Joint Meeting European Engineering Conference Symposium Foundations Engineering 791–796 httpsdoiorg10114532360243275535 13 Garrison Martha ClevelandInnes Marguerite Koole James Kappelman 2006 Revisiting methodological issues transcript analysis Negotiated coding reliability Internet Higher Education 9 1 2006 1–8 14 Michel Kraaijeveld 2017 Detecting Breaking Changes JavaScript APIs Master’s thesis Dept Soft Tech Delft University Technology Delft Netherlands 
httpresolvertudelftnluuid56e646dcd5c7482b832690e0de4ea419 15 Gianluca Mezzetti Anders Møller Martin Toldam Torp 2018 Type regression testing detect breaking changes Nodejs libraries Proceedings 32nd European Conference ObjectOriented Programming ECOOP’18 Leibniz International Proceedings Informatics LIPIcs 71–724 16 Mirhosseini C Parnin 2017 automated pull requests encourage developers upgrade outofdate dependencies 2017 32nd IEEEACM International Conference Automated Engineering ASE’17 84–94 17 Anders Møller Benjamin Barslev Nielsen Martin Toldam Torp 2020 Detecting locations JavaScript programs affected breaking library changes Proc ACM Program Lang 4 OOPSLA Article 187 Nov 2020 25 pages httpsdoiorg1011453428255 18 Anders Møller Martin Torp 2019 Modelbased testing breaking changes Nodejs libraries Proceedings 2019 27th ACM Joint Meeting European Engineering Conference Symposium Foundations Engineering 409–419 httpsdoiorg10114533389063338940 19 Suhail Mujahid Rabe Abdalkareem Emad Shihab Shane McIntosh 2020 Using others’ tests identify breaking updates International Conference Mining Repositories httpsdoiorg10114533795973387476 20 Benjamin Barslev Nielsen Martin Toldam Torp Anders Møller 2021 Semantic patches adaptation JavaScript programs evolving libraries Proc 43rd International Conference Engineering ICSE’21 21 Raemaekers van Deursen J Visser 2014 Semantic versioning versus breaking changes study maven repository 2014 IEEE 14th International Working Conference Source Code Analysis Manipulation 215–224 httpsdoiorg101109SCAM201430 22 Anselm Strauss Juliet Corbin 1998 Basics Qualitative Research Techniques Thousand Oaks CA Sage Publications 23 Jacob Stringer Amjed Tahir Kelly Blincoe Jens Dietrich 2020 Technical lag dependencies major package managers Proceedings 27th AsiaPacific Engineering Conference APSEC’20 228–237 httpsdoiorg101109APSEC51365202000031 24 Asher Trockman 2018 Adding sparkle social coding empirical study repository badges npm ecosystem 2018 
IEEEACM 40th International Conference Engineering Companion ICSECompanion’18 524–526 25 K F Tómasdóttir Maurício Aniche Arie Deursen 2018 adoption JavaScript linters practice case study ESLint IEEE Transactions Engineering PP Sept 2018 26 httpsdoiorg101109TSE20182871058 26 K F Tómasdóttir Aniche van Deursen 2017 JavaScript Developers Use Linters Master’s thesis Dept Soft Tech Delft University Technology Delft Netherlands 27 Mairieli Wessel Bruno Mendes De Souza Igor Steinmacher Igor Wiese Ivanilton Polato Ana Paula Chaves Marco Gerosa 2018 power bots Characterizing understanding bots OSS projects Proceedings ACM HumanComputer Interaction 2 CSCW 2018 1–19 28 Jooyong Yi Dawei Qi Shin Hwei Tan Abhik Roychoudhury 2013 Expressing checking intended changes via change contracts Proceedings 2013 International Symposium Testing Analysis ISSTA’13 1–11 httpsdoiorg10114524837602483772 29 Ahmed Zerouali Eleni Constantinou Tom Mens Gregorio Robles Jesus GonzalezBarahona 2018 empirical analysis technical lag npm package dependencies httpsdoiorg10100797833199042146 Received 19 November 2021 revised 27 October 2022 accepted 8 November 2022
::::
CorePeriphery Communication Success FreeLibre Open Source Projects Kevin Crowstontextsuperscript1✉ Ivan Shamshurintextsuperscript2 textsuperscript1 Syracuse University School Information Studies 348 Hinds Hall Syracuse NY 13244–4100 USA crowstonsyredu textsuperscript2 Syracuse University School Information Studies 337 Hinds Hall Syracuse NY 13244–4100 USA ishamshusyredu Abstract examine relationship communications core peripheral members FreeLibre Open Source success study uses data 74 projects Apache Foundation Incubator conceptualize success terms success building community assessed graduation Incubator compare successful unsuccessful projects volume communication core committer peripheral community members use inclusive pronouns indication efforts create intimacy among team members innovation paper use inclusive pronouns measured using natural language processing techniques find core peripheral members differ volume contribution use inclusive pronouns volume communication related success
::::
1 Introduction Communitybased FreeLibre Open Source FLOSS projects developed maintained teams individuals collaborating globallydistributed environments 8 health developer community critical performance projects 7 challenging sustain voluntary members long term 4 11 Socialrelational issues seen key component achieving design effectiveness 3 enhancing online group involvement collaboration 15 paper explore community interactions related community health success Specifically examine contributions made members different roles Members different levels participation FLOSS development taken different roles 5 widely accepted models roles communitybased FLOSS teams coreperiphery structure 1 3 12 example Crowston Howison 7 see communitybased FLOSS teams onionlike coreperiphery structure core category includes core developers periphery includes codevelopers active users Rullani Haeffiger 17 described periphery “cloud” members orbits around core members open source development teams Generally speaking access core roles based technical skills demonstrated development tasks developer performs 13 Core developers usually contribute code oversee design evolution requires high level technical skills 7 Peripheral members hand submit patches bug fixes codevelopers provides opportunity demonstrate skills interest provide use cases bug reports test new releases without contributing codes directly active users requires less technical skill 7 Despite difference contributions core peripheral members important success evident making direct contributions developed core members vital development hand even though contribute sporadically peripheral members provide bug reports suggestions critical expertise fundamental innovation 17 addition periphery source new core members 10 20 maintaining strong periphery important longterm success Amrit van Hillegersberg 1 examined coreperiphery movement open source projects concluded steady movement toward core beneficial shift away core communication 
among core periphery predicts success yet investigated systematically gap paper addresses
::::
2 Theory Hypotheses develop hypotheses study discuss turn dependent independent variables study dependent variable study success success FLOSS projects measured many different ways ranging code quality member satisfaction market share 6 communitybased FLOSS projects examine success building developer community critical issue chose building developer community measure success identify independent variables predict success ie success building developer community examine communication among community members starting hypothesis communication predictive success H1 Successful projects higher volume communication unsuccessful projects specifically interested members different roles contribute projects noted projects rely contributions core peripheral members therefore extend H1 consider roles Specifically hypothesize H2a Successful projects higher volume communication core members unsuccessful projects H2b Successful projects higher volume communication peripheral members unsuccessful projects Prior research coreperiphery structure FLOSS development found inequality participation core peripheral members example Luthiger Stoll 14 found core members make greater time commitment peripheral members core participants spend average 12 h per week leaders averaging 14 h bugfixers otherwise active users around 5 h per week Similarly using social network analysis Toral et al 19 found core members post majority messages act middlemen brokers among peripheral members therefore hypothesize H3 Core members contribute communication peripheral members Prior research distinction coreperiphery mostly focused codingrelated behaviour roles defined coding activities performed 3 However developers coding 3 core peripheral members need engage socialrelational behaviour addition taskoriented behaviour coding Consideration nontask activities important effective interpersonal communication plays vital role development online social interaction 16 Scialdone et al 18 Wei et al 21 analyzed group 
maintenance behaviours used members build maintain reciprocal trust cooperation everyday interaction messages eg emotional expressions politeness strategies paper examine one factor identified investigating core peripheral members use language create “intimacy among team members” thus “building solidarity teams” Specifically Scialdone et al 18 found core members two teams used inclusive pronouns ie pronouns referring team peripheral members interpreted finding meaning “peripheral members general feel comfortable expressing sense belonging within groups” therefore hypothesize H4 Core members use inclusive pronouns communication peripheral members Scialdone et al 18 noted one team studied ceased production exhibited greater gap core periphery usage inclusive pronouns situation could indicate peripheral members group feel ownership negative implications future potential core members Scialdone et al 18 noted use inclusive pronouns “consistent Bagozzi Dholakia 2’s argument importance weintention Linux user groups ie individuals think ‘us’ ‘we’ attempt act joint way” similar argument made importance core member use inclusive pronouns therefore hypothesize H5a Successful projects higher usage inclusive pronouns core members unsuccessful projects H5b Successful projects higher usage inclusive pronouns peripheral members unsuccessful projects 3 Methods 31 Setting Scialdone et al 18 Wei et al 21 studied projects noted problem making comparison across projects quite diverse address concern paper studied larger number projects 74 total operated within common framework similar stage development Specifically studied projects Apache Foundation ASF Incubator ASF umbrella organization including 60 freelibre open source FLOSS development projects ASF’s apparent success managing FLOSS projects made frequently mentioned model efforts though often without deep understanding factors behind success ASF Incubator’s purpose mentor new projects point able successfully join ASF Projects 
invited join Incubator based application support sponsor member ASF Accepted projects known Podlings receive support one mentors help guide Podlings steps necessary become fullfledged ASF incubation process several goals including fulfillment legal infrastructural requirements development relationships ASF projects main goal develop effective development communities Podlings must demonstrate order graduate Incubator Apache Incubator specifically promotes diverse participation development projects improve longterm viability community ensure requisite diversity intellectual resources time projects spend incubation varies widely little two months nearly five years indicating significant diversity efforts required Podlings become viable projects primary reason projects retired Incubator rather graduated lack community development stalls progress 32 Data Collection Processing FLOSS settings collaborative work primarily takes place means asynchronous computermediated communication email lists discussion fora 5 ASF community norms strongly support transparency broad participation accomplished via electronic communications even collocated participants expected document conversations online record ie email discussion lists therefore drew data messages developers’ mailing list Perl script used collect messages html format site httpmarkmailorg discarded messages sent Podling either graduated retired ASF Incubator many projects apparently used email list even graduation dataset collected relevant data extracted html files representing message thread sources 321 Dependent Variable Success dependent variable success building community determined whether graduated success retired success based list projects maintained Apache Incubator available Apache website dataset includes email messages 24 retired 50 graduated Podlings data set also included messages projects still incubation unknown status used analysis check measure successful community development examined number 
developers active community successful community developers considered active members projects sent email developer mailing list incubation 322 Core Vs Periphery Crowston et al 9 suggested three methods identify core peripheral members FLOSS teams relying projectreported formal roles analysis distribution contributions based Bradford’s Law Scatter coreandperiphery analysis social network analysis showed relying projectreported roles accurate Therefore study identified message sender core member sender’s name list committers website find match sender labeled noncommitter peripheral member developed matching algorithm take account variety ways names appear email message 323 Inclusive Pronouns noted examined use inclusive pronouns one way team members build sense belong group Inclusive pronouns defined reference team using inclusive pronoun see “we” “us” “our” refers group Inclusive Reference “we” “us” “our” refer another group speaker member sentences judged two criteria 1 whether language cues inclusive reference pronoun specified definition 2 cues refer current group rather another group judge second criteria may require reviewing sentence context whole conversation usage one many indicators studied Scialdone et al 18 Wei et al 21 interesting tractable analysis handle large volume messages drawn many projects applied NLP techniques suggested implemented previous research Specifically used machinelearning ML approach algorithm learns classify sentences corpus already coded data Sentences chosen unit coding instead thematic units typically used human coding sentences easily identified machine learning Training data obtained SOCQA Sociocomputational Qualitative Analysis Syracuse University httpsocqaorg 22 23 training data consists 10841 sentences drawn two Apache projects SpamAssassin Avalon Trained annotators manually coded sentence whether included inclusive pronoun per definition distribution classes training data shown Table 1 “yes” means sentence inclusive 
pronoun Note sample unbalanced “yes” 1395 129 “no” 9446 871 Total 10841 features ML used bag words experimenting unigrams bigrams trigrams Naïve Bayes MNB k Nearest Neighbors KNN Support Vector Machines SVM algorithms Python LibSVM implementation trained applied predict class sentences ie whether sentence inclusive pronoun expected NLP would problem handling first part definition second whether pronoun refers group would pose challenges 10fold crossvalidation used evaluate classifier’s performance training data Results shown Table 2 results show though three approaches gave reasonable performance SVM outperformed methods Linear SVM model therefore selected use experimented tuning SVM parameters minimal term frequency etc find settings affected accuracy used default settings Unigram Bigram Trigram MNB 086 081 075 KNN 089 089 088 SVM LinearSVC 097 097 097 random guess baseline binary classification task would give accuracy 05 majority vote rule baseline classify examples majority class provides accuracy 087 trained SVM model significantly outperforms evaluate model performance applied new data results checked trained annotator one annotators training data set Specifically used model code 200 sentences 10 sentences randomly selected 5 projects “graduated” “in incubator” “retired” “unknown” classes projects annotator coded sentences compared results Cohen kappa agreement corrected chance agreement human vs machine coding 886 higher frequently applied threshold 80 agreement words ML model performed least well second human coder would expected Examining results somewhat surprisingly found cases predicted “inclusive reference” refers another group suggesting ML managed learn second criterion Two sentences model misclassified illustrative limitations approach looks like requires work “our patterns” libpathpm looked pathpm wwwapacheorg clue actual class “no” classifier marked “yes” inclusive pronoun “our” included sentence though quotes Could also clarify download URLs 
thirdparty dependencies can’t ship actual class “yes” model marked sentence “no” due error spelling space “we” human annotator ignored error enough examples errors ML learn Despite limitations benefit able handle large volumes email makes possible slight loss reliability coding especially considering human coders also perfectly reliable
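The two evaluation figures used in this section, the majority-class baseline and the chance-corrected human-versus-machine agreement, can be illustrated with a minimal sketch. It assumes only the class counts reported for the training data (1395 "yes", 9446 "no"). The function names and the ten toy labels are hypothetical and are not the 200 sentences checked by the annotator.

```python
from collections import Counter


def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both coders pick the same class at random.
    classes = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical class throughout
    return (observed - expected) / (1 - expected)


# Class counts reported for the training data.
training_counts = {"yes": 1395, "no": 9446}
baseline = majority_baseline_accuracy(training_counts)

# Hypothetical toy labels for illustration only (not the paper's data).
human = ["yes", "no", "no", "yes", "no", "no", "no", "yes", "no", "no"]
machine = ["yes", "no", "no", "no", "no", "no", "no", "yes", "no", "no"]
kappa = cohen_kappa(human, machine)
```

With the reported counts the baseline comes out at roughly 0.87, matching the majority-vote figure quoted in the text. Because the classes are this unbalanced, raw accuracy alone flatters a classifier, which is one reason a chance-corrected measure such as kappa is the more informative check.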
::::
4 Findings section discuss turn findings study first validating measure success examining support hypothesis 41 Membership check measure success graduation Incubator compared number developers graduated retired projects active developers participated mailing list results shown Table 3 table shows graduated projects twice many developers active mailing list retired projects differences large statistical test significance seems superfluous doubters KruskalWallis test chosen data normally distributed shows statistically significant difference number developers graduated retired projects p 0001 result provides evidence validity graduation measure community health status Core Peripheral Graduated 316 194 822 1024 Retired 139 93 254 183 N 74 Standard deviations parentheses Hypothesis 1 successful projects would communication shown Table 4 hypothesis strongly supported graduated projects many times messages sent retired projects incubation process p 00001 Table 4 Mean number messages status developer role Core Peripheral Graduated 8265 8878 7306 8908 Retired 1791 1805 1652 2058 N 74 Standard deviations parentheses Hypotheses 2a 2b core peripheral members respectively would communicate successful projects unsuccessful projects differences Tables 4 5 show hypotheses supported p 00001 core p 00001 peripheral members overall message count graduated vs retired projects p 00011 p 00399 messages per developer Table 5 Mean number messages sent per developer status developer role Core Peripheral Graduated 239 191 109 119 Retired 107 200 47 92 N 74 Standard deviations parentheses Hypothesis 3 core members would communicate peripheral members Table 4 see fact total core peripheral members send volume messages graduated retired projects However fewer core members average sends many messages average shown Table 5 p 00001 Table 6 Mean number messages including inclusive pronoun sent per developer status developer role Core Periphery Graduated 22 18 6 5 Retired 12 8 4 5 N 74 Standard 
deviations parentheses Hypothesis 4 core members would use inclusive pronouns peripheral members Table 6 shows number messages sent developers included inclusive pronoun table shows core developers send messages inclusive pronouns graduated retired projects p 00001 Table 7 Mean percentage messages include inclusive pronoun per developer status developer role Core Periphery Graduated 76 34 55 22 Retired 93 5 53 32 N 74 Standard deviations parentheses control fact core developers send messages general computed percentage messages include inclusive pronoun shown Table 7 table see mean percentage messages sent core developers include inclusive pronoun higher peripheral members p 0001 Hypotheses 5a b would use inclusive pronouns core peripheral members respectively successful projects Table 6 hypothesis seems supported core members least note successful projects communication overall Examining Table 7 suggests fact slightly proportional use inclusive pronouns core members unsuccessful projects difference use peripheral members However neither difference significant using KW test meaning Hypothesis 5 supported Finally assess factors examined predictive projects success applied stepwise logistic regression predicting graduation various measures communication developed eg total number message developer role mean number percentage message inclusive pronouns first regression identified one factor predictive number core members result expected argued number core members also viewed measure community health regression without counts members identified total number mean number messages sent core members predictive mean negative coefficient R2 regression 33 combination factors provide much insight essentially proxy developer count greatest lot messages many messages per developer ie developers
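The Kruskal-Wallis comparisons reported above rank all observations together and test whether the rank sums differ between groups, which is why the test suits data that are not normally distributed. The following is a minimal sketch of the H statistic only. It assumes no tie correction and does not compute a p-value, and the function names are hypothetical; a statistics library would normally be used for the published analysis.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (assumption: no tie correction applied)."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)


# Made-up values for illustration only (not the Podling data).
h_separated = kruskal_wallis_h([1, 2, 3], [4, 5, 6])
h_identical = kruskal_wallis_h([1, 2], [1, 2])
```

For two groups, H is compared against a chi-square distribution with one degree of freedom to obtain the p-value. Completely separated groups give a large H, while identically distributed groups give a value near zero.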
::::
5 Discussion general data suggest successful projects ie successfully built community graduated incubation members correspondingly large volume communication suggesting active community expected core members contribute overall message volume seems almost evenly split core peripheral members suggesting roles play important part projects results demonstrate importance interaction shared responsibilities core peripheral members expected core members display somewhat greater ownership expressed use inclusive pronouns counter expectations use inclusive pronouns distinguish successful unsuccessful projects possible explanation result limitation data processing determined developer status core periphery based committer lists website collected time analysis process take account movement developers periphery core less frequently core periphery could successful projects active peripheral members ie using inclusive pronouns invited join core thus suppressing average peripheral members
::::
6 Conclusions work presented extended many ways future work First noted developers may change status results would accurate took account history developers became committers correctly assign status time Obtaining historical data challenging impossible Second ML NLP might improved richer feature set 24 though noted performance already good would expected additional human coder Third would interesting examine first months early signs predictive eventual outcome Fourth might similarly possible predict peripheral members become core members individual actions Fifth consider effects additional group maintenance behaviours Wei et al 21 Syracuse SOCQA success applying ML NLP techniques codes suggesting analysis feasible Sixth necessary consider limits hypothesized impacts example hypothesized communication reflects developed community could much communication creates information overload negative impact Finally paper considered communication behaviours complete model success would take account measure development activities code commits topic data available online Despite limitations research offers several advances prior work First examines much large sample projects Second uses objective measure success namely graduation ASF Incubator measure community development Finally shows viability application NLP ML techniques processing large volumes email messages incorporating analysis content messages counts network structure Acknowledgements thank SOCQA Nancy McCracken PI access coded sentences training Feifei Zhang checking coding results SOCQA partially supported grant US National Science Foundation Sociocomputational Systems SOCS program award 11–11107 References Amrit C van Hillegersberg J Exploring impact sociotechnical coreperiphery structures open source development J Inf Technol 252 216–229 2010 Bagozzi RP Dholakia UM Open source user communities study participation Linux user groups Manage Sci 527 1099–1115 2006 Barcellini F Détienne F Burkhardt JM situated approach 
roles participation open source communities HumComput Interact 293 205–255 2014 Bonaccorsi Rossi C FOSS succeed Res Policy 32 1243–1258 2003 Crowston K Wei K Howison J Wiggins FreeLibre open source development know know ACM Comput Surv 442 Article 7 2012 Crowston K Howison J Annabi H Information systems success free open source development theory measures Softw Process Improv Pract 112 123–148 2006 Crowston K Howison J Assessing health open source communities IEEE Comput 395 89–91 2006 Crowston K Li Q Wei K Eseryel UY Howison J Selforganization teams FreeLibre open source development Inf Softw Technol 496 564–575 2007 Crowston K Wei K Li Q Howison J Core periphery FreeLibre open source team communications Proceedings Hawai‘i International Conference System System HICSS39 2006 Dahlander L O’Mahony Progressing center coordinating work Organ Sci 224 961–979 2011 Fang Neufeld Understanding sustained participation open source projects J Manage Inf Syst 254 9–50 2009 Jensen C Scacchi W Role migration advancement processes OSSD projects comparative case study Proceedings 29th International Conference Engineering ICSE pp 364–374 2007 Jergensen C Sarma Wagstrom P onion patch migration open source ecosystems Proceedings 19th ACM SIGSOFT Symposium 13th European Conference Foundations Engineering pp 70–80 2011 Luthiger Stoll B Fun development Proceedings First International Conference Open Source Systems Genova Italy 11–15 July 2005 Park JR Interpersonal affective communication synchronous online discourse Libr Q 772 133–155 2007 Park JR Linguistic politeness facework computer mediated communication part 2 application theoretical framework J Soc Inf Sci Technol 5914 2199–2209 2008 Rullani F Haefliger periphery stage intraorganizational dynamics online communities creation Res Policy 424 941–953 2013 Scialdone MJ Heckman R Crowston K Group maintenance behaviours core peripheral members FreeLibre open source teams Proceedings IFIP WG 213 Working Conference Open Source Systems 
Skövde Sweden 3–6 June 2009 Toral SL MartínezTorres MR Barrero Federico Analysis virtual communities supporting OSS projects using social network analysis Inf Softw Technol 523 296–303 2010 von Krogh G Spaeth Lakhani KR Community joining specialization open source innovation case study Res Policy 327 1217–1241 2003 Wei K Crowston K Li NL Heckman R Understanding group maintenance behaviour FreeLibre opensource projects case fire gaim Inf Manage 513 297–309 2014 Yan JLS McCracken N Crowston K Design active learning system human correction content analysis Paper Presented Workshop Interactive Language Learning Visualization Interfaces 52nd Annual Meeting Association Computational Linguistics Baltimore MD June 2014 httpnlpstanfordedueventsillvi2014papersmccrackenillvi2014pdf Yan JLS McCracken N Crowston K Semiautomatic content analysis qualitative data Proceedings iConference Berlin Germany 4–7 Mar 2014 Yan JLS McCracken N Zhou Crowston K Optimizing features active machine learning complex qualitative content analysis Paper Presented Workshop Language Technologies Computational Social Science 52nd Annual Meeting Association Computational Linguistics Baltimore MD June 2014
::::
impacts lockdown open source contributions COVID19 pandemic Jin Hu a b Daning Hu b Xuan Yang c Michael Chau a a Faculty Business Economics University Hong Kong Hong Kong b Business School Southern University Science Technology Shenzhen Guangdong 518055 China c Department Informatics University Zurich 8006 Zurich Switzerland ARTICLE INFO Keywords COVID19 Lockdown Work productivity Open source Facetoface interactions ABSTRACT COVID19 pandemic instigated widespread lockdowns compelling millions transition workfromhome WFH arrangements rely heavily computermediated communications CMC collaboration study examines impacts lockdown innovationdriven work productivity focusing contributions open source OSS projects GitHub worlds largest OSS platform leveraging two lockdowns China natural experiments discover developers 2021 Xi’an lockdown increased OSS contributions 90 2020 Wuhan lockdown reduced contributions 105 subsequent survey study elucidates divergence uncovering adaptation effect wherein Xi’an developers became accustomed new norm WFH time capitalizing flexibility opportunities remote work Moreover findings across lockdowns reveal lack facetoface F2F interactions significantly impeded OSS contributions whereas increased available time home positively influenced finding especially noteworthy challenges assumption CMC effortlessly substitute F2F interactions without negatively affecting productivity examine impacts stayathome orders United States US OSS contributions find significant effects Collectively research offers valuable insights multifaceted impacts lockdown productivity shedding light individuals adapt remote work norms protracted disruptions like pandemic insights provide various stakeholders including individuals organizations policymakers vital knowledge prepare future disruptions foster sustainable resilience adeptly navigate evolving 
landscape remote work postpandemic world 1 Introduction COVID19 pandemic catalyzed global transition workfromhome WFH arrangements nations implemented lockdown measures limit human mobility curb spread virus Fang et al 2020 Sheridan et al 2020 Wang 2022 unprecedented shift remote work facilitated myriad computermediated communications CMC technologies instigated profound lasting impacts work productivity area garnered significant attention recent scholarly investigations Barber et al 2021 Cui et al 2022 Understanding impacts work productivity crucial guiding policy decisionmaking multiple levels help reshape individual approaches worklife balance redefine organizational strategies WFH arrangements inform governmental policies legislation aimed supporting remote work Moreover significant disruptions brought pandemic highlight imperative adaptability resilience levels Studying effects lockdown work productivity provide valuable insights enabling stakeholders better navigate future upheavals cultivate enduring resilience However impact lockdown work productivity especially within innovationdriven domains open source OSS development remains largely unexplored address research gap study leverages lockdowns implemented two worlds largest economies – United States US China – various stages pandemic lockdowns serve natural experiments enabling us study impacts OSS developers contributions GitHub worlds largest OSS platform GitHub 2022b Chinas ZeroCOVID strategy marked uniform strict lockdown measures across various cities different times provides ideal setting study OSS contributors’ responses lockdowns importantly allows us understand adaptation new normal WFH throughout various pandemic stages Meanwhile US prominent role OSS community extensive data availability serves optimal environment extend validate findings derived Chinese lockdowns thereby enhancing generalizability insights beyond specific context China Taken together natural experiments enable us delve 
deeper different approaches managing pandemic influence OSS contributors’ productivity main differenceindifferences analysis focused two lockdowns China initial lockdown Wuhan 2020 another one occurred Xi’an 2021 Interestingly results revealed significant positive impact 2021 lockdown OSS contributions Xi’an developers contrast negative impact observed among Wuhan developers 2020 lockdown Moreover lockdowns results indicated developers made online comments local peers experienced pronounced decline contributions delve deeper underlying mechanisms driving outcomes conducted targeted survey among developers affected two lockdowns survey findings reveal Xi’an developers reported significantly fewer interruptions marked increase flexibility making OSS contributions later lockdown 2021 Factors fear related COVID19 increased housework responsibilities significantly reduced Wuhan developers’ contributions initial 2020 lockdown became insignificant developers 2021 Xi’an lockdown findings point notable adaptation effect developers became accustomed new norms WFH imposed COVID19 pandemic time survey also found Wuhan Xi’an developers increase available time positively influenced OSS contributions Moreover survey study unveiled Wuhan Xi’an developers lack facetoface F2F interactions significantly found significantly reduce contribution levels finding corroborated another survey discovery identified strong positive correlation developers’ tendency comment GitHub propensity F2F interactions prior lockdown Coupled aforementioned analysis demonstrated pronounced negative impact contributions Wuhan Xi’an developers engaged online commenting activities local collaborators evidence leads inference developers frequently engaged F2F interactions adversely affected lockdowns terms contributions finding underscores importance F2F interactions collaborative work environments challenges assumption CMC seamlessly replace F2F interactions without adverse impact productivity Furthermore use 
analysis examine impact stayathome lockdown orders US developers’ OSS contributions empirical approach guided three key considerations First assessing generalizability findings Chinese lockdowns contexts vital impacts strict lockdown measures like China may differ effects milder restrictions adopted elsewhere Second prominence OSS development US coupled extensive data available GitHub makes apt context analysis Third heterogeneity policies regarding lockdowns across different US states offers unique opportunity comparative analysis allows nuanced understanding diverse approaches pandemic management influence OSS contributions addition comparing effects observed China US aim provide valuable insights broader implications lockdown measures OSS contributions global scale Interestingly analysis revealed significant impact US lockdowns developers’ OSS contributions posit may attributable less strict nature stayathome orders US compared lockdown measures enforced China relatively lenient restrictions US permitted essential activities work may led significant disruptions potential F2F interactions provided additional available time developers Consequently factors may exerted minimal effects OSS contributions contributions threefold First examining impact lockdowns OSS contributions study provides novel insights effects remote work productivity nuanced findings individuals adapt new norms WFH prolonged periods disruption equip various stakeholders – including individuals organizations governments – essential knowledge knowledge guide preparations similar future disruptions build sustainable resilience Second research reveals detrimental effects reduced F2F interactions challenging assumption CMC effortlessly replace F2F interactions without compromising productivity especially salient innovationdriven domains like OSS development insight enriches discussion comparative impacts CMC F2F efficacy virtual teams discussion become increasingly pertinent era reliance CMC remote 
work likely persist even beyond pandemic Airbnb 2022 Warren 2020 Third study stands adoption systematic causal analysis methods previous research impact lockdown mainly relied survey methods use analysis empirical data GitHub enables robust examination causal effects lockdowns methodological approach reinforced various robustness tests strengthens findings study also offers valuable framework leveraged future research includes exploring impact policy interventions organization strategies response similar disruptions Literature review 21 COVID19 work productivity COVID19 pandemic led unprecedented shift remote work millions mandated work home due governmentimposed lockdowns impact WFH arrangements brought lockdowns work productivity subject intensive study yielding mixed findings Several studies found lockdowninduced WFH associated declines productivity especially innovationoriented work development Ralph et al 2020 scholarly research Barber et al 2021 Walters et al 2022 Ralph et al 2020 surveyed 2225 developers across 53 countries found productivity wellbeing diminished due COVID19 primary influencing factors fear related pandemic disaster preparedness home office ergonomics Barber et al 2021 surveyed 1008 members American Finance Association 781 respondents suggesting research productivity negatively affected COVID19 due lack traditional F2F communications disseminate research obtain feedback well overwhelming health concerns Another survey study Walters et al 2022 investigated reasons behind reported decline research activity among female academics lockdowns primary reason working home female academics burdened traditional family roles typically assumed women well increasing teaching administrative workloads hand studies found productivity lockdowninduced WFH scenarios actually increased pandemic Asay 2020 reports OSS developers consistently increased work volume 2020 never truly left work Cui et al 2022 found overall 35 increase productivity 13 increase gender 
gap among social science scholars US since lockdown began suggest lockdown could result substantial time savings workrelated tasks commuting female researchers may find allocating time homerelated tasks childcare Another line research suggests lockdowns general little effect developers Forsgren 2020 reports activity GitHub developers early days COVID19 similar slightly increased compared previous year Neto et al 2021 surveyed 279 developers GitHub projects developed using Java found WFH pandemic affect task completion time code contribution quality Similar studies conducted survey developers major companies like Microsoft Ford et al 2021 Baidu Bao et al 2022 found lockdown generally little impact developers’ productivity However developers differing opinions effects lockdown suggest productivity benefited WFH fewer disturbances saved commuting time improved worklife balance Others suggested productivity suffered WFH due increased homerelated tasks decreased collaboration others interruptions family members summarize existing studies impact pandemicinduced lockdowns work productivity yielded mixed findings heavily reliant survey methods Moreover studies sufficiently explored knowledge workers developers adapt remote work settings adaptation influences productivity prolonged periods lockdown clear need systematic causal analyses large empirical datasets study impacts underlying mechanisms pandemicinduced lockdowns innovationrelated work considering effects adaptation 22 Facetoface communications computermediated communications Previous research NicCanna et al 2021 Smite et al 2023 highlighted one direct implications pandemicinduced lockdowns diminished opportunity traditional F2F interactions increased reliance CMC considered crucial realm OSS development Crowston et al 2007 O’Mahony Ferraro 2007 Crowston et al 2007 identify several settings OSS developers engage F2F meetings benefits derive interactions instance F2F meetings provide OSS developers great 
opportunities socialize build teams verify other’s identity also find certain OSS development activities best suited F2F interactions conveying important news Boden Molotch 1994 Kock 2004 suggests human beings evolved many years excel F2F interactions Moreover O’Mahony Ferraro 2007 discovered F2F interactions OSS community members could increase one’s likelihood ascending community leadership role achieved 1 building trusting reciprocal relationships 2 creating potential coalitions Butler Jaffe 2021 also suggested F2F interactions significantly influence one’s efforts community building OSS studies typically conducted empirical contexts F2F interactions CMC coexist among OSS community members making difficult disentangle effects However strict lockdown measures China presented unique opportunity examine developers’ OSS contributions setting F2F interactions entirely absent important conjecture OSS developers accustomed working productively using CMC remote asynchronous manner decades Columbro 2020 Wellman et al 1996 less likely affected absence F2F interactions COVID19 pandemic study puts conjecture test examining scenario F2F interactions largely absent due lockdowns China Moving specific context OSS general comparison F2F interactions CMC virtual teams findings remain inconclusive Townsend et al 1998 find CMC facilitate efficient connections individuals regardless geographical locations thereby significantly improving performance virtual teams Moreover team members distributed across different time zones leverage CMC coordinate effectively operate within flexible efficient 24hour cycle Lipnack Stamps 1999 Therefore Bergiel et al 2006 suggest virtual collaboration via CMC overcome constraints time distance organizational boundaries leading improvements productivity efficiency among team members hand another stream literature suggests compared F2F interactions CMC carries fewer physical emotional cues thereby limiting extent synchronicity information exchange 
Cramton Webber 2005 Daft Lengel 1986 Dennis et al 2008 negatively affect team members’ capabilities establish mutual understanding Kraut et al 1982 Sproull Kiesler 1986 Straus McGrath 1994 sense belonging awareness group activities Cramton 2001 Moreover absence F2F interactions individuals likely experience heightened conflicts Wakefield et al 2008 leading decreased team productivity satisfaction Hambrick et al 1998 Lau Murnighan 1998 Furthermore despite recent advances communication technologies videoconferencing allow users convey nonverbal information cues lack F2F interactions still negatively affect innovation relies collaborative idea generation recent study Brucks Levav 2022 discovered despite technological advancement absence F2F interactions COVID19 pandemic still negatively affected innovation authors attribute finding differences physical nature videoconferencing F2F interactions former focuses individuals display narrower cognitive focus summarize existing literature yet conclusively establish whether despite technological advancement CMC effectively replace role F2F interactions without impacting productivity collaborative work studies Crowston et al 2007 Ocker et al 1998 suggest mix CMC F2F interactions beneficial teamwork However preference remote work reliance CMC continue rise unprecedented scale even postpandemic era research aims fill gap studying whether CMC fully replace F2F interactions without negatively affecting teamwork productivity 23 Motivations open source contributions Another stream research relevant study literature motivations contributing OSS development prevailing framework field typically categorizes OSS developers’ motivations intrinsic extrinsic factors Intrinsic motivations often stem developers’ personal needs altruism joy derived contributing Davidson et al 2014 Hertel et al 2003 whereas extrinsic motivations usually related utilitybased external rewards opportunities career advancement Fang Neufeld 2009 Yang et al 2021 
Studies Hertel et al 2003 Shah 2006 found intrinsic motivations enjoyment fun significantly influence OSS developers’ contributions However COVID19induced lockdowns developers may experience fear stress related health family friends could negatively affect intrinsic motivations especially early stages pandemic However dearth OSS motivation research focuses social effects developers’ contribution motivations influenced interactions peers instance individuals’ OSS contributions encouraged attention received peers Moqri et al 2018 collaboration team members Crowston et al 2007 Daniel Stewart 2016 Xu et al 2009 von Krogh et al 2012 suggest aspects social practice like ethics virtues largely overlooked context contribution motivations aspects typically cultivated social interactions among OSS community members including F2F interactions CMC study aims enrich understanding research community policymakers major disruptions like lockdowns may limit social effects particularly reduced F2F interactions thereby influence OSS developers contribution motivations Methods section first adopt mixedmethod approach study impacts two lockdowns China OSS developers contributions treat lockdowns Wuhan Xian natural experiments GitHub developer Wuhan Xian match developer comparable regions experience lockdown measures utilize differenceindifferenceindifferences DDD analyses combined propensity score matching PSM discern impacts delve deeper mechanisms underpin changes developers OSS contributions lockdowns also administer survey GitHub developers lockdowns Section 4 report main results analysis perform series robustness tests validate findings Moreover Section 5 extend empirical approaches analysis data collected distinct context – US supplementary analysis designed investigate whether patterns observed findings Chinese lockdowns also present regions comparing effects China US aim provide valuable insights wider implications lockdown measures OSS contributions global scale 31 
Experimental settings COVID19 become one severe global pandemics recent decades Fang et al 2020 first natural experiment leverages lockdown imposed Wuhan China January 23 April 8 2020 response initial major outbreak COVID19 authorities enforced citywide lockdown Wuhan leading closure public transport nonessential businesses residents 7148 residential communities Wuhan mandated stay home leaving permitted emergencies abrupt imposition Wuhan lockdown implemented without prior warning serves exogenous shock natural experimental setting provides us opportunity examine impact Wuhan lockdown OSS contributions designate Wuhan developers treatment group choose developers Hong Kong Macau Taiwan HMT regions control group several reasons Firstly major cities mainland China swiftly followed Wuhans lead implementing strict lockdown social distancing measures HMT regions implement measures March 2020 Hong Kong authorities prohibited indoor outdoor public gatherings four people March 2020 Meanwhile although Macau authorities took adhoc measures closing casinos public parks implement citywide lockdown measures Therefore developers Wuhan strictly required stay home early stage COVID19 outbreak HMT regions could go engage F2F interactions Secondly compared developers parts world HMT developers much similar Wuhan developers belong ethnic group – Han Chinese Wikipedia 2022 share similar cultural backgrounds chosen tenweek period surrounding day Wuhan lockdown ie December 19 2019 February 27 2020 time frame analysis mainly two reasons Firstly timeframe short allowing us observe potential changes developers contributions Secondly COVID19 began spread parts world including HMT regions developers might started consciously avoid F2F meetings others prevent potential COVID19 infections even lockdown social distancing measures implemented would make HMT developers less ideal control subjects natural experiment Therefore set end time window February 27 2020 COVID19 cases HMT regions started 
increase significantly March also leverage lockdown Xian China second natural experiment strictness citys lockdown measures often corresponds severity local outbreak leading endogeneity attempting causally identify impacts lockdown measures pandemic Chinas ZeroCOVID policy provides ideal opportunity address endogeneity issue policy centered around lockdowns aims halt transmission COVID19 soon detected mass testing Chen et al 2022a Even COVID19 cases trigger fullscale citywide lockdown short period Chen et al 2022a swift lockdowns response extra small numbers new COVID19 cases minimize endogeneity policy responses Xian lockdown lasted December 23 2021 January 23 2022 strict Wuhan lockdown even far fewer initial infection cases thus minimizing endogeneity policy responses Xian lockdown public transport nonessential businesses suspended Xian residents strictly required stay home except emergencies Thus use Xian developers treatment group construct control group follow existing studies Muralidharan Prakash 2017 Wang 2022 choosing developers seven capitals provinces municipalities neighboring Xian implement lockdown measures Xian lockdown developers neighboring capitals similar Xian developers many aspects timeframe analysis covers eight weeks surrounding day Xian lockdown ie November 25 2021 January 20 2022 32 Data collection Chinese lockdowns empirical study collects uses two types data GitHub data COVID19 case data obtain historical GitHub data API GH Archive database latter archives public OSS development activities GitHub since February 2011 widely used recent OSS research Moqri et al 2018 Negoița et al 2019 first use “searchbylocation” function GitHub API extract developers least one public repository located regions chosen natural experiments experiment select developers joined GitHub chosen time window exclude developers push commit within time window procedure yields 1695 Wuhan developers 5282 HMT developers Wuhan case selected sample Xian case includes 919 
Xian developers 4274 developers seven neighboring provincial capitals municipalities Moreover obtain data COVID19 cases relevant health authorities National Health Commission China well mainstream media comprehensive data collection allows us conduct robust analysis impact Chinese lockdowns OSS contributions 33 Propensity score matching address potential endogeneity issues employ technique conjunction PSM following methodology previous studies Chen et al 2019 Foerderer 2020 PSM selects control subjects measuring distance treated subjects based pretreatment covariates method particularly effective overcoming curse dimensionality ie many covariates transforming covariate vectors single propensity score selecting control subjects closest treated ones Chen et al 2022b allows us create balanced comparable control group thereby enhancing robustness findings specifically apply oneonone nearest neighbor matching without replacement select control developer treated developer based set observable characteristics lockdown Fang Neufeld 2009 Foss et al 2021 Moqri et al 2018 Zhang Zhu 2011 characteristics include number weeks since developer joined GitHub whether developer student employee based profile whether developer reports contact information profile number OSS projects developer created number commits developer contributed GitHub number starsissuescomments developer received repositories number starsissuescomments developer sent whether developer used following core language GitHub – CCCGoJavaJavaScriptPHPPythonRubyScalaTypeScript GitHub 2022a primary programming language number developer’s collaborators contributed projects number developer’s local collaborators contributed projects lived region average age developer’s OSS projects number projects General Public License GPL created developer GPL restrictive license could serve proxy developer’s ideological level Foss et al 2021 PSM procedure yields 1608 matched pairs Wuhan treatment HMT control developers Wuhan lockdown 
919 matched pairs Xi’an treatment neighboringcity control developers Xi’an lockdown case Table 1 summarizes mean values pretreatment characteristics developers selected regions matching results ttest indicate significant differences across many observable characteristics developers lockdown areas nonlockdown areas lockdowns differences suggest direct comparison treatment control groups two natural experiments may appropriate Therefore apply aforementioned matching procedure Table 2 reports mean values characteristics matched sample ttest results Table 2 show significant differences across observable characteristics treatment matched control groups lockdowns suggests matching procedure effectively balanced observable characteristics treatment matched control groups
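The matching step described above can be sketched as follows. This is a minimal illustration of one-to-one nearest-neighbor matching without replacement, with a caliper applied to the logit of the propensity score. It assumes the propensity scores have already been estimated from the pre-treatment covariates (for example, by a logistic regression). The function names, the greedy processing order, the toy scores, and the caliper value are assumptions for illustration only, not the authors' implementation.

```python
import math


def logit(p):
    """Log-odds of a propensity score strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def match_one_to_one(treated, controls, caliper):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping a developer id to a propensity score.
    caliper: maximum allowed distance between logit propensity scores.
    Returns a dict mapping each matched treated id to its control id.
    """
    available = {cid: logit(p) for cid, p in controls.items()}
    pairs = {}
    for tid, p in treated.items():
        if not available:
            break
        t_logit = logit(p)
        # Closest remaining control on the logit scale.
        best = min(available, key=lambda cid: abs(available[cid] - t_logit))
        if abs(available[best] - t_logit) <= caliper:
            pairs[tid] = best
            del available[best]  # without replacement
    return pairs


# Hypothetical propensity scores for three treated and four control developers.
treated = {"w1": 0.60, "w2": 0.30, "w3": 0.95}
controls = {"h1": 0.58, "h2": 0.33, "h3": 0.61, "h4": 0.10}
pairs = match_one_to_one(treated, controls, caliper=0.3)
print(pairs)
```

In this toy run the third treated developer stays unmatched because no remaining control falls within the caliper, which is how a narrower caliper shrinks the matched sample while tightening covariate balance.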
::::
3.4 Empirical models
::::
Table 1. Mean values of pre-treatment characteristics before matching, with t-test differences

                                    Wuhan lockdown                         Xi'an lockdown
Characteristic             Wuhan        HMT          Difference   Xi'an        Neighboring  Difference
                           developers   developers                developers   city devs
Weeks                      174.761      224.249      −49.488      258.309      259.105      −0.796
Student                    0.305        0.162        0.143        0.279        0.224        0.054
Employee                   0.232        0.290        −0.058       0.288        0.277        0.011
Contact                    0.721        0.690        0.031        0.703        0.730        −0.027
Number of projects         21.780       26.074       −4.294       27.349       30.320       −2.970
Commits                    709.337      1489.132     −779.795     1869.550     1725.299     144.251
Stars received             126.292      77.595       48.697       139.702      260.150      −120.448
Issues received            6.959        9.129        −2.169       11.473       15.097       −3.624
Comments received          12.684       24.629       −11.945      29.706       39.904       −10.198
Stars sent                 104.883      107.217      −2.334       118.405      153.768      −35.364
Issues sent                8.740        12.421       −3.681       13.799       14.680       −0.881
Comments sent              22.530       47.056       −23.526      61.226       49.162       12.064
C                          0.041        0.041        0.000        0.039        0.036        0.003
C                          0.086        0.061        0.025        0.073        0.064        0.009
C                          0.019        0.041        −0.022       0.027        0.031        −0.003
Go                         0.026        0.024        0.002        0.065        0.070        −0.005
Java                       0.198        0.075        0.123        0.177        0.180        −0.003
JavaScript                 0.202        0.225        −0.023       0.182        0.199        −0.018
PHP                        0.021        0.032        −0.012       0.021        0.026        −0.005
Python                     0.170        0.196        −0.026       0.186        0.148        0.038
Ruby                       0.004        0.021        −0.017       0.007        0.004        0.002
Scala                      0.002        0.002        0.000        0.003        0.001        0.002
TypeScript                 0.007        0.010        −0.003       0.016        0.025        −0.008
Collaborators              367.333      899.337      −532.003     632.457      683.823      −51.366
Local collaborators        0.835        10.044       −9.208       1.342        2.193        −0.851
Average age of projects    64.415       88.233       −23.819      100.200      102.360      −2.160
Number of projects (GPL)   1.045        1.187        −0.142       1.256        1.681        −0.425

Difference is the treatment-group mean minus the control-group mean. Significance levels: p < 0.1, p < 0.05, p < 0.01.

3.4.1 Difference-in-differences model

For each natural experiment, we examine the change in OSS contributions of every developer in the matched sample using the following regression framework:

$$\text{CONTRIBUTION}_{it} = \alpha + \beta \, \text{AFTER}_t \times \text{LOCKDOWN}_i + \gamma \, \text{CV}_{it} + \mu_i + \theta_t + \epsilon_{it} \qquad (1)$$

where $i$ indexes the developer and $t$ indexes the week. The dependent variable $\text{CONTRIBUTION}_{it}$ is the weekly OSS contribution of developer $i$. We add one to the weekly number of commits the developer contributed to GitHub and take the logarithm to measure weekly OSS contributions, following previous literature (Hu et al. 2023; Moqri et al. 2018; Zhang and Zhu 2011). A commit is a change made to OSS by adding, modifying, or deleting code. $\text{AFTER}_t$ is a dummy variable that equals one if the time period is after the day of the lockdown, and zero otherwise. $\text{LOCKDOWN}_i$ is a dummy variable that equals one if the developer is in the treatment group, i.e., in the city where the lockdown was implemented, and zero otherwise.

$\text{CV}_{it}$ contains a set of control variables that might influence a developer's OSS contributions according to previous research (Fang and Neufeld 2009; Moqri et al. 2018; Zhang and Zhu 2011): the number of OSS projects created by the developer (REPO), the number of weeks since the developer joined GitHub (TENURE), the number of stars the developer received on repositories (STARR), the number of stars the developer sent (STARS), the number of issues the developer received on repositories (ISSUER), the number of issues the developer sent (ISSUES), the number of comments the developer received on repositories (COMMENTR), the number of comments the developer sent (COMMENTS), and the number of new COVID-19 cases in the developer's region (CASE).

To control for the effects of time-invariant individual characteristics of each developer, especially unobservable ones, we incorporate an individual fixed effect $\mu_i$ in the model. Moreover, as opposed to the standard two-period model, our model spans ten periods in the Wuhan lockdown case and eight periods in the Xi'an lockdown case. Consequently, we need to control for variables that remain constant across subjects but vary over periods. We therefore include a time fixed effect $\theta_t$, which comprises weekly time dummies, to control for time trends. The standalone AFTER and LOCKDOWN terms of the standard two-period model are absorbed by the time and individual fixed effects, respectively. $\epsilon_{it}$ is the error term. The coefficient $\beta$ indicates the impact of the lockdown on developers' OSS contributions: a negative coefficient would suggest that the lockdown reduces developers' OSS contributions, whereas a positive coefficient would indicate otherwise.
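The two-way fixed-effects logic of the difference-in-differences specification above can be sketched on a toy balanced panel. In a balanced panel, removing developer means and week means (the within transformation) is equivalent to including individual and time fixed effects, so the interaction coefficient can be recovered from the demeaned data. This is an illustrative sketch only: the control variables are omitted, the panel is synthetic and noise-free, and all names and numbers are assumptions rather than the authors' code or data.

```python
import math


def log_contribution(weekly_commits):
    """Dependent variable as described in the text: log(commits + 1)."""
    return math.log1p(weekly_commits)


def twoway_demean(panel):
    """Remove unit means and period means from a balanced panel.

    panel: list of rows, one row per developer, one value per week.
    """
    n_units, n_periods = len(panel), len(panel[0])
    unit_means = [sum(row) / n_periods for row in panel]
    period_means = [sum(panel[i][t] for i in range(n_units)) / n_units
                    for t in range(n_periods)]
    grand_mean = sum(unit_means) / n_units
    return [[panel[i][t] - unit_means[i] - period_means[t] + grand_mean
             for t in range(n_periods)] for i in range(n_units)]


def did_beta(y, lockdown, after):
    """Coefficient on AFTER_t x LOCKDOWN_i with individual and time fixed effects."""
    x = [[lockdown[i] * after[t] for t in range(len(after))]
         for i in range(len(lockdown))]
    y_w, x_w = twoway_demean(y), twoway_demean(x)
    num = sum(xv * yv for xr, yr in zip(x_w, y_w) for xv, yv in zip(xr, yr))
    den = sum(xv * xv for xr in x_w for xv in xr)
    return num / den


# Synthetic panel: 4 developers, 10 weeks, lockdown starts in week 5.
lockdown = [1, 1, 0, 0]                      # LOCKDOWN_i
after = [0] * 5 + [1] * 5                    # AFTER_t
mu = [1.0, 0.5, 0.8, 1.2]                    # individual fixed effects
theta = [0.02 * t for t in range(10)]        # time fixed effects
true_beta = -0.111                           # illustrative treatment effect
y = [[mu[i] + theta[t] + true_beta * lockdown[i] * after[t]
      for t in range(10)] for i in range(4)]

beta_hat = did_beta(y, lockdown, after)
print(round(beta_hat, 3))
```

Because the synthetic outcome contains only the fixed effects and the treatment term, the estimate reproduces the assumed effect. With real data, the controls in CV and robust standard errors would need a regression library.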
::::
3.4.2 Difference-in-difference-in-differences models

We next examine the impact of the absence of F2F interactions caused by the lockdowns. Because F2F interactions serve as important motivations for OSS contributions, as previous research has suggested (Crowston et al. 2007; Stam 2009), we expect that developers who regularly engaged in F2F meetings with collaborators would be more profoundly affected by a lockdown. To this end, we use a GitHub developer's engagement in online comments (i.e., GitHub-supported CMC) as a proxy for the tendency to meet OSS collaborators F2F before the lockdown. This approach is grounded in previous studies, which observed that people who engage more in CMC are also more likely to meet F2F, a correlation understood to reflect underlying social needs and preferences (Huang et al. 2022; Khalis and Mikami 2018; Suphan and Mierzejewska 2016). Furthermore, CMC has been found to cultivate social relationships and facilitate the coordination of F2F meetings (DiMaggio et al. 2001; Howard et al. 2001; Kraut et al. 2002; Suphan et al. 2012). The relationship between online and offline interaction is also supported by Brandtzæg and Nov (2011), who discovered that Facebook users prioritize CMC with close friends with whom they also interact frequently in F2F settings. In addition, our survey study (Section 4.3.4) finds that developers in the lockdowns who made more online comments to local GitHub collaborators before the lockdown were also more likely to meet them F2F, consistent with the findings of previous studies (Huang et al. 2022; Khalis and Mikami 2018; Suphan and Mierzejewska 2016).

This relationship between CMC and F2F interactions lays the groundwork for our DDD analysis. To operationalize a GitHub developer's tendency to meet local collaborators F2F, we compute the number of online comments the developer made to them on the GitHub platform before the lockdown. This metric serves as a proxy for social engagement and a preference for F2F interactions. Building on the baseline specification, we develop the following DDD specification:

$$\text{CONTRIBUTION}_{it} = \alpha + \beta_1 \, \text{AFTER}_t \times \text{LOCKDOWN}_i + \beta_2 \, \text{AFTER}_t \times \text{LOCCOMS}_i + \beta_3 \, \text{AFTER}_t \times \text{LOCKDOWN}_i \times \text{LOCCOMS}_i + \gamma \, \text{CV}_{it} + \mu_i + \theta_t + \epsilon_{it} \qquad (2)$$

where $\text{LOCCOMS}_i$ is the number of online comments developer $i$ made to GitHub collaborators in the same region before the lockdown. It is important to note that the individual fixed effect $\mu_i$ absorbs the $\text{LOCKDOWN}_i \times \text{LOCCOMS}_i$ term (Foerderer 2020). We anticipate that the coefficient $\beta_3$ will be significant and negative, indicating that developers who
engaged online interactions local collaborators adversely affected lockdown leading reduced contributions OSS projects Miller et al 2019 ensure results robust alternative explanation consider following DDD specification textCONTRIBUTION alpha beta1 textAFTERt times textLOCKDOWNi beta2 textAFTERt times textCOMSi beta3 textAFTERt times textLOCKDOWNi times textCOMSi gamma textCVi mui thetat epsilon 3 COMSi number online comments developer made GitHub collaborators including nonlocal ones lockdown alternative explanation true coefficient beta3 significant like one Eq 2 general social effects apply GitHub collaborators regardless location hand coefficient beta3 insignificant Eq 3 significant Eq 2 alternative explanation dismissed Results robustness checks Chinese lockdowns 41 Results differenceindifferences model Table 3 reports results Eqs 1–3 Columns 1 4 show results Eq 1 Wuhan Xian lockdowns respectively coefficient textAFTERt times textLOCKDOWNi Column 1 negative statistically significant 1 significance level suggesting Wuhan lockdown led reduction developers’ OSS contributions Specifically coefficient 0111 suggests Wuhan developers’ contributions decreased 105 e0111 1 five weeks following lockdown contrast coefficient textAFTERt times textLOCKDOWNi Column 4 positive significant 5 level suggesting Xian lockdown resulted increase developers’ OSS contributions coefficient 0086 suggests Xian developers’ contributions increased roughly 90 e0086 1 four weeks lockdown According findings survey study presented Section 434 contrasting results Wuhan Xian lockdowns mainly attributed adaption effect COVID19 initially emerged Wuhan unprecedented nature virus coupled rapid spread severity likely instilled high level fear uncertainty among population Therefore Wuhan developers may Dependent variable CONTRIBUTIONit Wuhan Lockdown 2020 Xian Lockdown 2021 1 2 3 4 5 6 textAFTERt times textLOCKDOWNi 0111 0108 0110 0086 0089 0085 0029 0030 0030 0043 0043 0043 textAFTERt times 
textLOCCOMSi 0003 0001 0001 0000 0000 0000 textAFTERt times textLOCKDOWNi times textCOMSi 0007 0003 0003 0000 0000 0000 0003 0000 0000 0000 0000 0000 textAFTERt times textCOMSi 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 textREPOi 0024 0024 0024 0372 0372 0372 0022 0022 0022 0028 0028 0028 textTENUREi 0030 0030 0029 0044 0047 0044 0039 0039 0039 0051 0051 0051 textSTARRi 0016 0016 0016 0003 0003 0003 0008 0008 0008 0002 0002 0002 textSTARSi 0015 0015 0015 0027 0027 0027 0006 0006 0006 0008 0008 0008 textISSUEi 0021 0020 0022 0014 0014 0014 0025 0025 0025 0038 0038 0038 textISSUESi 0076 0076 0077 0058 0058 0057 0028 0028 0028 0033 0033 0033 textCOMMENTi 0022 0022 0022 0011 0011 0011 0014 0014 0014 0011 0011 0011 textCOMMENTSi 0047 0047 0047 0053 0053 0053 0014 0014 0014 0007 0007 0008 textCASEi 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 Constant 5769 5727 5585 10529 11429 10494 6688 6690 6685 13023 13052 13041 Individual FE Yes Yes Yes Yes Yes Yes Time FE Yes Yes Yes Yes Yes Yes Observations 32160 32160 32160 14704 14704 14704 Rsquared 0048 0048 0048 0083 0083 0083 Robust standard errors brackets p 01 p 005 p 001 found challenging focus contribute OSS projects turbulent period Neto et al 2021 Ralph et al 2020 hand Xian lockdown occurred nearly two years Wuhans following dozen citylevel lockdowns time residents Xian much familiar virus associated lockdown measures experience level fear Wuhan adapted readily new lifestyles induced lockdown measures including new norm WFH adaptation coupled opportunities offered WFH increased available time flexibility may enabled Xian developers increase OSS contributions Ford et al 2021 Neto et al 2021 42 Results differenceindifferenceindifferences models Columns 2 5 Table 3 present results Eq 2 Wuhan Xian lockdowns respectively significant negative coefficient textAFTERt times textLOCKDOWNi times textLOCCOMS Column 2 suggests Wuhan developers engaged online comments local GitHub collaborators 
negatively affected Wuhan lockdown indicated survey results Section 342 developers made online comments local collaborators likely meet F2F lockdown Therefore result indicates Wuhan developers likely F2F interactions lockdown experienced pronounced reduction OSS contributions hand significant negative coefficient textAFTERt times textLOCKDOWNi times textLOCCOMS Column 5 suggests positive effect OSS contributions weaker Xian developers engaged online comments local collaborators Xian developers also likely F2F interactions reflecting similar pattern observed Wuhan lockdown findings highlight importance F2F interactions affecting OSS contributions loss interactions lockdowns significant impact contributions Columns 3 6 Table 3 report results Eq 3 Wuhan Xian lockdowns respectively coefficients textAFTERt times textLOCKDOWNi times textCOMS insignificant 10 level columns elaborated Section 342 results demonstrate developers socially engaged ie made online comments local collaborators might become concerned pandemics negative impacts others leading decrease OSS contributions Instead results Eqs 2 3 highlight social nature developers specifically loss F2F interactions lockdowns influences developers OSS contributions summary regression results suggest Wuhan lockdown led significant reduction developers OSS contributions Xian lockdown resulted increase analysis DDD regressions highlights importance F2F interactions driving developers OSS contributions GitHub absence F2F interactions brought lockdown measures appears negatively influence contributions 43 Robustness checks 431 Parallel trends key identification assumption estimation parallel trends assumption assumption posits lockdown OSS contributions treatment group control group would follow temporal trend assumption satisfied estimated effects could biased results could driven systematic differences treatment control groups rather lockdown ascertain validity analysis conduct two sets tests examine whether analysis 
satisfies assumption First plot weekly average contributions per developer made treatment group blue control group red time window surrounding Wuhan lockdown Fig 1a Xian lockdown Fig 1b measure developers weekly contributions add one weekly number commits contributed GitHub take logarithm consistent measure DDD models green vertical line figure demarcates day lockdown Fig 1a shows treatment control groups exhibit almost identical contribution trends Wuhan lockdown thereby fulfilling parallel trends assumption hand five weeks following day Wuhan lockdown contributions treatment group consistently fall control group evidenced substantial persistent gap red blue lines Fig 1b shows similar pattern Xian lockdown treatment control groups tend contribute less time lockdown thus satisfying parallel trends assumption However day Xian lockdown control group continues decreasing trend treatment group exhibits tendency increase contributions Second validate findings adopt eventstudy approach method commonly used previous literature Leslie Wilson 2020 Tanaka Okamoto 2021 approach involves fitting following equation textCONTRIBUTION alpha sum kn0 betak textWEEK itk times textLOCKDOWNi gamma CV mui thetat epsilonit n equals 5 Wuhan lockdown equals 4 Xian lockdown textWEEKit dummy variable equals one week corresponds k zero otherwise construct week k 0 sample use day lockdown separate pretreatment posttreatment periods k 1 indicates week day lockdown dropped equation reference week Intuitively betak captures difference contributions treatment control groups week relative k 1 expect two groups make similar contributions day lockdown k 0 diverge day lockdown k 0 Fig 2a Fig 2b show estimated betak Eq 4 Wuhan Xian lockdowns separately green vertical line figure represents day lockdown gray dotted lines surrounding coefficient depict 95 confidence intervals figures estimated betak k 0 nearly zero indicating pretreatment difference contribution trends treatment control groups pattern 
confirms parallel trends assumption satisfied analysis 432 falsification test ensure estimated effects artifacts seasonality conduct falsification test demonstrate effects replicated period without lockdowns involves repeating analysis time window previous years COVID19 yet emerged Cui et al 2022 Zhang Zhu 2011 Wuhan lockdown repeat analysis using data lunar year ago time window encompasses Chinese New Year holiday Xian lockdown use data two years earlier considering developers might experienced lockdowns year ago period lockdowns become common China control variable textCASEit excluded analysis since COVID19 yet broken earlier periods falsification test serves robustness check original analysis merely capturing seasonal effects would expect find significant effects previous years well However absence effects would strengthen validity main findings confirming observed changes OSS contributions indeed attributable lockdowns underlying seasonal patterns Table 4 reports results falsification test placebotreated treatment effects found insignificant Wuhan Xi’an lockdowns implies developers treated groups significantly change contributions time window previous years thus ruling seasonal effects driving factor behind observed changes OSS contributions 433 Alternative samples also replicate DDD analyses using two alternative matched samples ensure results driven specific choice caliper propensity score matching main analysis used caliper 03 Wuhan lockdown ensure statistical difference developer characteristics treatment control groups Chen et al 2019 Wang 2022 caliper defines range within logit propensity scores must fall considered valid match Cochran Rubin 1973 narrower caliper result inclusion fewer subjects also enhance balance treatment control groups thereby reducing bias estimating treatment effects Wang 2022 Wang et al 2013 validate findings following Wang 2022 employed alternative calipers 01 Wuhan lockdown 0001 Xi’an lockdown new PSM calipers yielded matched 
sample 1557 developers Wuhan lockdown matched sample 906 developers Xi’an lockdown treatment control groups Table 5 presents ttest results matching new calipers Importantly lockdowns none differences treatment control groups found significant 10 level indicating two groups remained comparable analysis matching even alternative calipers Table 6 shows regression results Eqs 1–3 based alternative matched samples coefficients textAFTERt times textLOCKDOWNi textAFTERt times textLOCKDOWNi times textLOCCOMSi textAFTERt times textLOCKDOWNi times textCOMSi found consistent main analyses Wuhan Xi’an lockdowns suggesting results driven choice caliper PSM process 434 survey study complement empirical analyses survey study conducted delve underlying mechanisms influencing factors behind changes developers’ OSS contributions lockdowns survey targeted treated developers matched sample provided email addresses GitHub encompassing 879 Wuhan developers 463 Xian developers encourage participation offered incentive 20 Chinese Yuan respondent successfully completed questionnaire questionnaire detailed Appendix designed questions answered fivepoint Likert scale Eventually received 109 responses Wuhan developers 71 responses Xian developers Another objective survey study justify important assumption underlying DDD analysis developers engaged online comments local collaborators GitHub also likely meet F2F examine relationship surveyed treated developers tendencies online commenting F2F interactions local collaborators conduct correlation test tendencies Wuhan Xian developers results detailed Table A1 Appendix findings reveal significant positive correlation coefficients tendencies online commenting F2F interactions supports assumption DDD analysis reinforcing validity empirical approach conclusions drawn Table 7 shows results linear regression analysis explores various surveyed factors explain changes OSS contributions two Chinese lockdowns dependent variable represents change 
contributions calculated difference respondents total contributions GitHub posttreatment period total contributions pretreatment period independent variables consist respondents ratings surveyed factors detailed Questions 2–6 questionnaire provided Appendix factors carefully selected inclusion questionnaire based previous research findings related work productivity COVID19induced WFH scenarios Bao et al 2022 Ford et al 2021 Miller et al 2021 Neto et al 2021 Walters et al 2022
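The percentage effects quoted with the main regression results (Section 4.1) follow from the log-transformed dependent variable: a coefficient $\beta$ on the treatment term corresponds to an approximate change of $e^{\beta} - 1$ in contributions. The short check below reproduces the two conversions, taking the Wuhan coefficient as −0.111 (the text reports a negative coefficient of magnitude 0.111) and the Xi'an coefficient as 0.086. The function name is a hypothetical helper, and because the outcome is log(commits + 1), the result is an approximation of the change in commits.

```python
import math


def pct_change_from_log_coef(beta):
    """Approximate percentage change implied by a coefficient on a log outcome."""
    return (math.exp(beta) - 1.0) * 100.0


wuhan = pct_change_from_log_coef(-0.111)   # Wuhan lockdown, Eq. 1
xian = pct_change_from_log_coef(0.086)     # Xi'an lockdown, Eq. 1
print(f"Wuhan: {wuhan:.1f}%  Xi'an: {xian:.1f}%")
```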
::::
Table 4. Falsification test results for the Chinese lockdowns
Dependent variable: CONTRIBUTION

                      Wuhan lockdown    Xi'an lockdown
                      (1)               (2)
AFTER × LOCKDOWN      0.018             −0.024
                      (0.019)           (0.024)
REPO                  0.128             0.296
                      (0.028)           (0.031)
TENURE                0.010             0.030
                      (0.026)           (0.036)
STARR                 0.001             −0.000
                      (0.001)           (0.000)
STARS                 0.045             0.030
                      (0.007)           (0.005)
ISSUER                −0.002            0.045
                      (0.038)           (0.050)
ISSUES                0.078             0.095
                      (0.040)           (0.045)
COMMENTR              0.006             −0.010
                      (0.013)           (0.014)
COMMENTS              0.079             0.078
                      (0.016)           (0.015)
Constant              −0.928            −4.253
                      (3.143)           (5.374)
Individual FE         Yes               Yes
Time FE               Yes               Yes
Observations          32,160            14,704
R-squared             0.151             0.083

Robust standard errors in brackets. Significance levels: p < 0.1, p < 0.05, p < 0.01.
::::
Table 5. T-tests for the alternative matched sample, Chinese lockdowns

                                    Wuhan lockdown                         Xi'an lockdown
Characteristic             Treatment    Control      Difference   Treatment    Control      Difference
                           group        group                     group        group
Weeks                      177.281      177.547      −0.266       257.616      258.481      −0.865
Student                    0.277        0.290        −0.013       0.277        0.281        −0.004
Employee                   0.244        0.247        −0.003       0.286        0.278        0.008
Contact                    0.705        0.713        −0.008       0.702        0.721        −0.019
Number of projects         21.423       21.780       −0.357       27.156       25.883       1.273
Commits                    663.480      499.008      164.473      1710.940     1299.883     411.057
Stars received             64.570       90.427       −25.857      140.910      126.865      14.044
Issues received            5.550        5.620        −0.071       11.545       9.189        2.357
Comments received          10.196       11.230       −1.034       29.975       27.722       2.253
Stars sent                 94.332       94.636       −0.304       119.413      110.372      9.041
Issues sent                6.821        7.455        −0.634       12.710       10.969       1.741
Comments sent              16.331       19.586       −3.255       46.818       40.392       6.426
C                          0.043        0.041        0.002        0.040        0.044        −0.004
C                          0.086        0.083        0.003        0.073        0.086        −0.013
C                          0.021        0.022        −0.001       0.028        0.032        −0.004
Go                         0.027        0.028        −0.001       0.064        0.070        −0.006
Java                       0.153        0.170        −0.017       0.180        0.174        0.006
JavaScript                 0.209        0.215        −0.006       0.184        0.180        0.004
PHP                        0.021        0.021        0.001        0.021        0.020        0.001
Python                     0.182        0.168        0.014        0.183        0.179        0.004
Ruby                       0.004        0.004        0.000        0.004        0.003        0.001
Scala                      0.002        0.001        0.001        0.001        0.000        0.001
TypeScript                 0.008        0.006        0.002        0.017        0.020        −0.003
Collaborators              184.192      167.135      17.057       561.135      527.577      33.557
Local collaborators        0.789        0.830        −0.041       1.254        1.092        0.162
Average age of projects    65.386       65.249       0.137        100.184      98.570       1.615
Number of projects (GPL)   1.002        1.140        −0.138       1.267        1.329        −0.062

Significance levels: p < 0.1, p < 0.05, p < 0.01.
::::
Table 6 Regression results alternative sample Chinese lockdowns Wuhan lockdown Xian lockdown 1 2 3 4 5 6 × LOCKDOWN −0106 −0103 −0101 0089 0092 0090 0030 0030 0030 0043 0044 0043 × LOCCOMS 0003 0001 0001 0000 × LOCKDOWN × LOCCOMS −0007 0003 −0003 0001 × COMS × COMS 0000 0000 0000 0000 REPO 0024 0024 0024 0371 0371 0371 0021 0021 0021 0028 0028 0028 TENURE −0038 −0038 −0038 0046 0050 0047 0040 0040 0040 0051 0051 0051 STARR 0014 0014 0014 0003 0003 0003 0007 0007 0007 0002 0002 0002 STARS 0015 0015 0015 0027 0027 0027 0006 0006 0007 0008 0008 0008 ISSUER −0036 −0036 −0037 0010 0010 0012 0027 0027 0027 0039 0039 0039 ISSUES 0088 0087 0089 0065 0064 0063 0031 0031 0031 0036 0035 0036 COMMENTR 0023 0023 0023 0012 0012 0012 0015 0015 0015 0011 0011 0011 COMMENTS 0047 0047 0047 0052 0052 0050 0016 0016 0016 0008 0008 0009 CASE −0000 −0000 −0000 −0000 −0000 −0000 0000 0000 0000 0000 0000 0000 Constant 7166 7125 7101 −11153 −12068 −11456 6877 6879 6874 13010 13038 13058 Observations 31140 31140 31140 14496 14496 14496 Rsquared 0048 0048 0048 0082 0082 0084 Robust standard errors brackets p 01 p 005 p 001 Columns 1 2 Table 7 present regression results based responses Wuhan Xian developers respectively results reveal fear related COVID19 pandemic housework burden significantly curtailed OSS contributions Wuhan’s initial lockdown longer impacted Xian developers 2021 hand availability uninterrupted time increased flexibility positively influenced Xian developers OSS contributions pattern observed among Wuhan counterparts 2020 findings taken together DDD regression results highlight adaptation effect Xian developers specifically posit Xian lockdown occurring nearly two years Wuhans following numerous citylevel lockdowns allowed developers adapt new norm remote work adaptation allowed Xian developers leverage flexibility opportunities provided WFH resulting increased OSS contributions contrast Wuhan developers facing novel threat COVID19 impeded fear possibly lacked capacity 
engage voluntary activities like OSS contributions Moreover results show consistent patterns Wuhan Xian developers lack F2F interactions significantly reduced OSS contributions increased available time home positively influenced findings offer valuable insights understanding individuals adapt unprecedented disruptions providing valuable guidance stakeholders preparing future challenges fostering resilience
::::
5 Results US lockdowns preceding sections conducted comprehensive examination impacts lockdowns OSS contributions within context China broaden understanding assess applicability findings beyond China section introduces results additional empirical analysis focusing lockdowns US explained Section 1 rationale focusing US stems prominent role global OSS development community well unique circumstances surrounding implementation lockdown measures ie stayathome orders COVID19 pandemic comparing observed effects China US seek determine whether similar patterns emerge across different regions comparative analysis enhances robustness findings also contributes valuable insights broader implications lockdown measures OSS development community worldwide early stages virus spread March April 2020 total 45 states District Columbia US implemented either statewide partialstate stayathome orders orders restricted residents leaving homes except essential activities obtaining food performing essential work functions contrast remaining 5 states US questioned necessity strict lockdown measures refrained issuing stayathome orders Wu et al 2020 One primary rationale behind resistance belief residents would continue leave homes shopping work rendering stayathome orders ineffective Wang 2022 alignment methodology outlined Wang 2022 study design constructs control group consisting OSS developers states refrained implementing stayathome orders form treatment group follow approach employed earlier studies Muralidharan Prakash 2017 Wang 2022 selecting developers states implemented statewide stayathome orders geographically adjacent control states selection criterion based assumption neighboring states likely share similarities control group observable unobservable characteristics refine selection first extract developers least one public repository exclusively located one state within US narrow treatment group including developers states fewer ten thousand GitHub developers ensuring 
consistency control group states meet criterion resulting control group consists developers five states – Arkansas Iowa Nebraska North Dakota South Dakota treatment group includes developers six neighboring states – Louisiana Mississippi Missouri Montana Tennessee Wisconsin Table 8 provides detailed summary start end dates stayathome orders states obtained official announcements respective state process enhances comparability treatment control groups thereby strengthening validity analysis Following approach delineated Wang 2022 focus time window spanning March 9 2020 April 20 2020 timeframe ensures developers treatment group least two weeks data implementation stayathome orders Consistent Section 32 include developers joined GitHub chosen time window pushed least one commit period selection process arrive final data sample comprising 2583 treated developers 4487 control developers Like analysis Chinese lockdowns employ combined PSM DID final data sample US lockdowns First apply oneonone nearest neighbor matching without replacement selecting control developer treated developer matching based set covariates used analysis Chinese lockdowns ensuring methodological consistency procedure obtain 2583 matched pairs treatment control groups Table 9 summarizes mean values pretreatment characteristics treatment control groups before after matching ttest results confirm no significant differences across characteristics treatment control groups after matching successful matching enhances validity subsequent analysis ensuring treatment control groups comparable terms observable characteristics thereby minimizing potential biases estimate impact stayathome orders OSS contributions using matched sample employing timevarying model
$$\text{CONTRIBUTION}_{it} = \alpha + \beta\,\text{ORDER}_{it} + \gamma\,\text{CV}_{it} + \mu_i + \theta_t + \epsilon_{it} \tag{5}$$
also estimate moderating effects comment interactions local collaborators all collaborators using two separate models
$$\text{CONTRIBUTION}_{it} = \alpha + \beta_1\,\text{ORDER}_{it} + \beta_2\,\text{ORDER}_{it} \times \text{LOCCOMS}_{it} + \gamma\,\text{CV}_{it} + \mu_i + \theta_t + \epsilon_{it} \tag{6}$$
$$\text{CONTRIBUTION}_{it} = \alpha + \beta_1\,\text{ORDER}_{it} + \beta_2\,\text{ORDER}_{it} \times \text{COMS}_{it} + \gamma\,\text{CV}_{it} + \mu_i + \theta_t + \epsilon_{it} \tag{7}$$
$i$ indexes developer $t$ indexes date $\text{ORDER}_{it}$ binary variable equals one state developer $i$ located implemented stayathome order date $t$ earlier zero otherwise definitions remaining variables consistent Eqs 1–3 Table 10 reports results estimation Eqs 5–7 results adhere parallel trends assumption remain robust considering alternative matched sample please see detailed robustness tests described Appendix B coefficient $\text{ORDER}_{it}$ insignificant across specifications indicating stayathome orders US no significant impact developers’ OSS contributions insignificance moderating effects corroborates finding findings contrast impacts observed Wuhan Xi’an lockdowns suggesting effects identified Chinese context may not generalize less strict lockdowns implemented US contrast findings China US may attributed underlying differences stringency enforcement lockdown measures two significant nations China lockdowns characterized strict restrictions
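A minimal sketch of the one-to-one nearest-neighbor matching without replacement described above, assuming propensity scores have already been estimated from the covariates. The function name, the developer ids, the greedy processing order, and the caliper expressed in raw score units are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Optional, Tuple


def match_nearest_neighbor(
    treated: Dict[str, float],
    control: Dict[str, float],
    caliper: Optional[float] = None,
) -> List[Tuple[str, str]]:
    """Greedy one-to-one nearest-neighbor matching without replacement.

    `treated` and `control` map a hypothetical developer id to a
    pre-estimated propensity score. Each control is used at most once.
    If `caliper` is given, a treated unit whose nearest remaining control
    is farther away than the caliper stays unmatched.
    """
    available = dict(control)  # controls that have not been used yet
    pairs: List[Tuple[str, str]] = []
    # Sort treated ids so the greedy result is deterministic.
    for t_id, t_score in sorted(treated.items()):
        if not available:
            break
        # Nearest remaining control; ties are broken by control id.
        c_id = min(available, key=lambda c: (abs(available[c] - t_score), c))
        if caliper is not None and abs(available[c_id] - t_score) > caliper:
            continue  # no remaining control lies within the caliper
        pairs.append((t_id, c_id))
        del available[c_id]  # "without replacement"
    return pairs


# Hypothetical scores, for illustration only.
treated_scores = {"t1": 0.30, "t2": 0.60}
control_scores = {"c1": 0.33, "c2": 0.58, "c3": 0.90}
pairs_no_caliper = match_nearest_neighbor(treated_scores, control_scores)
pairs_with_caliper = match_nearest_neighbor(treated_scores, control_scores, caliper=0.025)
```

Greedy matching depends on the order in which treated units are processed; optimal matching or a random processing order are common alternatives. A caliper trades sample size for closer matches, which is consistent with the appendix reporting fewer matched pairs for the caliper-based sample.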
::::
Table 7 explains changes OSS contributions Chinese lockdowns Dependent variable change contributions Wuhan developers Xi’an developers 1 2 Available time 0202 0725 0092 0216 Interruptions −0123 −0389 0083 0159 Flexibility 0029 0409 0066 0189 Work environment 0158 0129 0094 0186 Fear −0987 −0304 0072 0264 Lack F2F interactions −0288 −0190 0068 0089 Lack worklife boundary −0148 −0172 0138 0190 Lack selfdiscipline −0013 −0180 0065 0198 Taking care family −0034 −0105 0069 0190 Housework −0144 −0160 0523 0179 Constant 3921 −0344 0523 1751 Observations 109 71 Rsquared 0850 0614 Robust standard errors brackets p 01 p 005 p 001
::::
Table 8 Status stayathome orders state State Acronym Order start date Order end date Control group Arkansas AR statewide order Iowa IA statewide order Nebraska NE statewide order North Dakota ND statewide order South Dakota SD statewide order Treatment group Louisiana LA March 23 2020 May 15 2020 Mississippi MS April 3 2020 May 11 2020 Missouri MO April 6 2020 May 3 2020 Montana MT March 28 2020 April 26 2020 Tennessee TN March 31 2020 April 30 2020 Wisconsin WI March 25 2020 May 26 2020 required residents leave home except emergencies restrictions often rigorously enforced severely limited developers’ opportunities F2F interactions hand stayathome orders US less strict allowing residents leave homes broader range activities shopping work relatively lenient approach may allowed US developers adapt easily new circumstances leaving insignificant impact work lifestyles Consequently may mitigate negative impacts lockdown measures OSS contributions Moreover less strict nature US orders may provided available time home OSS contributions developers could still engage many usual activities outside home Conclusion lockdowns induced COVID19 pandemic catalyzed global shift towards WFH demonstrating feasibility unprecedented scale previous research explored broader implications remote work nuanced dynamics F2F CMC context work productivity remains intricate underexplored area complexity particularly salient within technologydriven domains OSS development study first leverages two lockdowns China – Wuhan 2020 Xi’an 2021 – natural experiments study causal impacts developers’ OSS contributions GitHub improve generalizability relevance findings Chinese lockdowns also extend analysis encompass impacts stayathome orders implemented across different states US early stage pandemic findings present nuanced picture impact lockdowns developers’ OSS contributions discovered Xi’an lockdown 2021 corresponded 90 increase OSS contributions Wuhan lockdown 2020 saw 105 reduction apparent 
contradiction illuminated subsequent survey study reveals differing impacts mainly attributed adaptation effect related COVID19 pandemic specifically Xi’an lockdown occurred nearly two years Wuhan’s numerous citylevel lockdowns implemented China allowed developers adapt new norm WFH optimizing flexibility opportunities provided WFH increase OSS contributions stark contrast Wuhan lockdown occurring onset pandemic virus new severe spreading rapidly created climate fear uncertainty atmosphere compounded factors increased housework responsibilities significantly impeded Wuhan developers’ ability focus OSS contributions However influential factors became insignificant 2021 Xi’an lockdown highlighting adaptability resilience individuals context remote work largescale disruptions Moreover found consistent patterns across Wuhan Xi’an developers lack F2F interactions significantly reduced OSS contributions increased available time home positively influenced addition study China employed analysis assess generalizability findings examining impact stayathome lockdown orders US developers’ OSS contributions Interestingly found significant impact US lockdowns contributions posit may due less strict nature stayathome orders US may significantly disrupted developers’ work lifestyle thereby exerting minimal effects OSS contributions contributions threefold First findings contribute valuable insights effects remote work productivity exploring individuals adapt remote work norms prolonged disruptions pandemic insights offer stakeholders including individuals organizations governments knowledge needed prepare future disruptions foster sustainable resilience Second findings shed light negative impact reduced F2F interactions thereby challenging assumption CMC seamlessly
::::
Table 9 Before matching After matching Mean Ttest Mean Ttest Treatment group Control group Difference Treatment group Control group Difference Weeks 237195 251904 14710 237195 232121 5074 Student 0177 0160 0016 0177 0184 0008 Employee 0391 0407 0016 0391 0389 0001 Contact 1000 1000 0000 1000 1000 0000 Number projects 18946 19269 0323 18946 18906 0039 Commits 2072852 2039565 33286 2072852 1813961 258891 Stars received 48550 71115 22565 48550 39534 9016 Issues received 14561 15009 0448 14561 11596 2965 Comments received 44084 47101 3017 44084 32748 11336 Stars sent 53217 53018 0199 53217 49277 3940 Issues sent 25280 26767 1487 25280 23936 1345 Comments sent 138069 163793 25723 138069 137564 0506 C 0019 0025 0006 0019 0021 0001 C 0033 0043 0011 0033 0034 0001 C 0055 0047 0008 0055 0055 0000 Go 0011 0017 0006 0011 0009 0002 Java 0101 0115 0014 0101 0102 0001 JavaScript 0211 0188 0023 0211 0216 0006 PHP 0039 0051 0012 0039 0037 0003 Python 0123 0133 0011 0123 0123 0000 Ruby 0025 0025 0000 0025 0027 0002 Scala 0002 0004 0002 0002 0002 0000 TypeScript 0018 0020 0001 0018 0015 0003 Collaborators 1225186 1203774 21412 1225186 1114715 110471 Local collaborators 1377 2207 0830 1377 1377 0000 Average age projects 89021 95269 6249 89021 87204 1817 Number projects GPL 1081 1181 0100 1081 1122 0040 p 01 p 005 p 001
Table 10 Regression results US lockdowns 1 2 3 ORDER 0000 0000 0000 0007 0007 0007 ORDER × LOCCOMS 0001 0001 ORDER × COMS 0000 0000 0000 REPO 0221 0221 0221 0120 0120 0120 TENURE 0000 0000 0000 0000 0000 0000 STARR 0012 0012 0012 0007 0007 0007 STARS 0000 0000 0000 0002 0002 0002 ISSUER 0012 0012 0012 0029 0029 0029 ISSUES 0090 0090 0090 0028 0028 0028 COMMENTR 0036 0036 0036 0009 0009 0009 COMMENTS 0076 0076 0076 0017 0017 0017 CASE 0000 0000 0000 0000 0000 0000 Constant 0580 0580 0580 0448 0448 0448 Individual FE Yes Yes Yes Time FE Yes Yes Yes Observations 222138 222138 222138 Rsquared 0049 0049 0049 Robust standard errors brackets p 005 p 001
substitute F2F interactions without 
detrimental effects productivity especially pertinent inherently digital domains OSS development study adds nuanced perspective broader discourse comparative impacts CMC vs F2F interactions virtual team performance contribution particularly important todays environment reliance CMC due shift towards WFH intensified continues shape way work collaborate even beyond pandemic era Airbnb 2022 Warren 2020 Third unlike previous research mainly relied survey methods investigate impacts lockdowns study embraced systematic causal analysis methods analysis Using empirical data GitHub rigorous approach reinforced various robustness tests complemented survey study established multifaceted research framework opens new avenues exploring impact policy interventions organizational strategies response similar disruptions thereby extending applicability relevance findings Moreover findings may help openinnovation platforms organizations depend collaborative contributions formulate WFHrelated strategies policies Airbnb 2022 Warren 2020 First stakeholders may need recognize individuals adaptation WFH vary significantly time across different contexts strategies must tailored accordingly instance many contextual factors analyzed survey accounted changes work time interruptions flexibility remote work technology conditions housework duties Second absence F2F interactions vital component collaboration requires exploration alternative methods compensate drawback instance platforms could invest advanced collaboration tools designed replicate even enhance interaction experience virtual environment facial recognition systems identify emphasize microexpression emotional cues Third positive impact increased home time highlights importance flexible work policies policies enable individuals capitalize benefits remote work without sacrificing productivity last initial negative impact fear suggests emotional support wellbeing essential remote workrelated policies strategies especially unprecedented 
disruptions like COVID19 pandemic limitations study generate directions opportunities future research instance although reassuring study leverages two citywide lockdowns China statewide stayathome orders US contrasting findings highlight complexity remote work suggest need research understand generalizability findings across different cultures industries types work Second study focuses OSS contributions measured number commits Future research needs consider measures innovationrelated work productivity code quality creativity CRediT authorship contribution statement Jin Hu Conceptualization Methodology Formal analysis Investigation Data curation Visualization Writing original draft Writing review editing Daning Hu Conceptualization Methodology Supervision administration Funding acquisition Writing original draft Writing review editing Xuan Yang Investigation Funding acquisition Writing review editing Michael Chau Supervision Writing review editing Declaration competing interest authors declare known competing financial interests personal relationships could appeared influence work reported paper Data availability Data made available request Acknowledgement authors gratefully acknowledge funding Guangdong Province Focus Research Grant Number 2019KZDZX1014 Guangdong Province Research Fund Grant Number 2019QN01X277 National Natural Science Foundation China Grant Numbers 71971106 72001099 Shenzhen Humanities Social Sciences Key Research Bases Appendix Questionnaires survey analysis questionnaire Wuhan developers includes following six questions 1 Please indicate choice following statements based experience January 23 2020 ie day Wuhan lockdown 1 Strongly disagree 2 Disagree 3 Neutral 4 Agree 5 Strongly agree 11 OnlineFrequency often made online comments GitHub developers city hereinafter referred local collaborators GitHub 12 OnlinePreference enjoyed making online comments local collaborators GitHub 13 OnlineNeed tasks required make online comments local collaborators 
GitHub 14 OfflineFrequency often interacted local collaborators offline 15 OfflinePreference enjoyed interacting local collaborators offline 16 OfflineNeed tasks required interact local collaborators offline Please answer Questions 2–5 based lockdown experience five weeks January 23 2020 compared five weeks day lockdown give time available making OSS contributions GitHub Option Code Gave much less time 1 Gave less time 2 Neutral lockdown 3 Gave time 4 Gave much time 5 interruptions making OSS contributions GitHub Option Code Much fewer interruptions 1 Fewer interruptions 2 Neutral lockdown 3 interruptions 4 Much interruptions 5 flexibility making OSS contributions GitHub Option Code Much less flexible 1 Less flexible 2 Neutral lockdown 3 flexible 4 Much flexible 5 work environment eg internet bandwidth hardware home making OSS contributions GitHub Option Code Much worse work environment 1 Worse work environment 2 Neutral lockdown 3 Better work environment 4 Much better work environment 5 would rate following factors respective impacts contributions GitHub five weeks January 23 2020 compared five weeks day 1 low impact 2 Low impact 3 Neutral 4 High impact 5 high impact Factor Code 61 Fear related COVID19 pandemic 62 Lack facetoface interactions collaborators 63 Lack worklife boundary 64 Lack selfdiscipline 65 Taking care family 66 housework questionnaire Xi’an developers Wuhan developers except following changes … December 23 2021 ie day Xi’an lockdown … Please answer Questions 2–5 based lockdown experience four weeks December 23 2021 compared four weeks day … … four weeks December 23 2021 compared four weeks day … Table A1 Correlation test results first survey question Correlation Wuhan developers 1 Xi’an developers 2 OnlineFrequency OfflineFrequency 0305 0253 OnlinePreference OfflinePreference 0423 0283 OnlineNeed OfflineNeed 0442 0540 p 01 p 005 p 001 Appendix B Robustness checks US lockdowns test parallel trends assumption US lockdowns adopt eventstudy approach 
fitting following equation
$$\text{CONTRIBUTION}_{it} = \alpha + \sum_{k=-n,\; k \neq -1}^{n} \beta_k T_{it}^{k} + \gamma\,\text{CV}_{it} + \mu_i + \theta_t + \epsilon_{it} \tag{B1}$$
$n$ equals 28 $T_{it}^{k}$ represents series dummies indicate chronological distance observation actual date state developer resides implemented stayathome orders $k = -1$ designates date immediately preceding treatment thus omitted equation serving reference date Fig B1 shows estimated coefficients $\beta_k$ Eq B1 green vertical line represents day stayathome order enacted accompanying gray dotted lines delineate 95 confidence intervals coefficient Notably estimated $\beta_k$ values $k < 0$ virtually zero indicating no significant pretreatment difference contribution trends treatment control groups Therefore analysis satisfies parallel trends assumption reinforcing validity analysis US lockdowns also perform another robustness check reestimating Eqs 5–7 using alternative matched sample achieved incorporating caliper 0.1 PSM procedure resulting matched sample includes 2568 pairs developers across treatment control groups summary test results presented Table B1 reveals no statistically significant differences treatment control groups 10 significance level outcome substantiates comparability two groups following matching process Table B2 summarizes results Eqs 5–7 derived alternative matched sample coefficients $\text{ORDER}$ $\text{ORDER} \times \text{LOCCOMS}$ $\text{ORDER} \times \text{COMS}$ found statistically insignificant outcome implies implementation stayathome orders US no significant influence developers’ OSS contributions
Table B1 Before matching After matching Treatment group Control group Difference Treatment group Control group Difference Weeks 237195 251904 14710 236515 232990 3525 Student 0177 0160 0016 0177 0181 0004 Employee 0391 0407 0016 0390 0391 0002 Contact 1000 1000 0000 1000 1000 0000 Number projects 18946 19269 0323 18651 18921 0270 Commits 2072852 2039565 33286 2058688 1822745 235943 Stars received 48550 71115 22565 48081 39739 8342 Issues received 14561 15009 0448 13848 11662 2186 Comments received 44084 47101 3017 42707 32934 9773 Stars sent 53217 53018 0199 45862 49546 3684 Issues sent 25280 26767 1487 23762 24065 0303 Comments sent 138069 163793 25723 133803 138307 4504 C 0019 0025 0006 0019 0021 0001 C 0033 0043 0011 0033 0034 0001 C 0055 0047 0008 0055 0055 0001 Go 0011 0017 0006 0011 0009 0002 Java 0101 0115 0014 0102 0103 0001 JavaScript 0211 0188 0023 0208 0217 0009 PHP 0039 0051 0012 0040 0037 0003 Python 0123 0133 0011 0123 0124 0001 Ruby 0025 0025 0000 0025 0027 0002 Scala 0002 0004 0002 0002 0002 0000 TypeScript 0018 0020 0001 0018 0015 0003 Collaborators 1225186 1203774 21412 1074696 1118728 44032 Local collaborators 1377 2207 0830 1354 1384 0030 Average age projects 89021 95269 6249 88861 87490 1370 Number projects GPL 1081 1181 0100 1075 1127 0052 p 01 p 005 p 001
Table B2 Regression results alternative sample US lockdowns Dependent variable CONTRIBUTION 1 2 3 ORDER 0000 0000 0000 0007 0007 0007 ORDER × LOCCOMS 0001 0001 0001 0001 0001 0001 ORDER × COMS 0000 0000 0000 0000 0000 0000 REPO 0225 0225 0225 0120 0120 0120 TENURE 0000 0000 0000 0000 0000 0000 STARR 0012 0012 0012 0007 0007 0007 STARS 0006 0006 0006 0003 0003 0003 ISSUER 0012 0012 0012 0029 0029 0029 ISSUES 0089 0089 0089 0028 0028 0028 COMMENTR 0036 0036 0036 0009 0009 0009 COMMENTS 0075 0075 0075 0017 0017 0017 CASE 0000 0000 0000 0000 0000 0000 Constant 0593 0592 0593 0451 0451 0451 Individual FE Yes Yes Yes Time FE Yes Yes Yes Observations 220848 220848 220848 Rsquared 0049 0049 0049 Robust standard errors brackets p 01 p 005 p 001
References Airbnb 2022 Airbnbs Design Employees Live Work Anywhere httpsnewsairbnbcomairbnbsdesigntoliveandworkanywhere Asay 2020 COVID19 Isnt Slowing Open Source—Watch Developer Burnout httpswwwtechrepubliccomarticlecovid19isntslowingopensourcewatchfordeveloperburnout Bao L Li Xia X Zhu K Li H Yang X 2022 working 
home affect developer productivity – case study Baidu COVID19 pandemic SCIENCE CHINA Inf Sci 65 1–15 Barber BM Jiang W Morse Puri Tookes H Werner IM 2021 explains differences finance research productivity pandemic J Finance 76 1655–1699 Bergel BJ Bergel EB Balsmeier PW 2006 reality virtual teams Competition Forum 4 427–432 Boden Molotch HL 1994 compulsion proximity Friedland R Boden Eds Newcomers Time Modernity University California Press Berkeley pp 257–286 Brandzert PB Nov 2011 Facebook use social capital—a longitudinal study Proceedings Fifth International AAAI Conference Weblogs Social Media Barcelona Spain pp 454–457 Brucks MS Levav J 2022 Virtual communication curbs creative idea generation Nature 605 108–112 Butler J Jaffe 2021 Challenges gratitude diary study engineers working home COVID19 pandemic 2021 IEEEACM 43rd International Conference Engineering Engineering Practice IEEE Madrid Spain pp 362–363 Chen H Wu Q 2019 Bank credit trade credit evidence natural experiments J Bank Financ 108 105616 Chen J Chen W Liu E Luo J Song ZM 2022a Economic Cost Lockdown China Evidence Citytocity Truck Flows Chen X Guo Shangguan W 2022b Estimating impact cloud computing firm performance empirical investigation listed firms Inf Manag 59 103603 Cochran WG Rubin DB 1973 Controlling bias observational studies review Sankhya Indian J Stat Ser 35 417–446 Colombo G 2020 Open Source COVID19 Open Source Come Stronger Side Pandemic httpswwwfinosorgblogopensourceandcovid19opensourcewillcomeoutstrongerontheothersideofthepandemic Cranton CD 2001 mutual knowledge problem consequences dispersed collaboration Organ Sci 12 346–371 Cranton CD Webber SS 2005 Relationships among geographic dispersion team processes effectiveness development work teams J Bus Res 58 755–765 Crowston K Howison J Masango C Ereyel UY 2007 role facetoface meetings technologysupported selforganizing distributed teams IEEE Trans Prof Commun 50 185–203 Cui R Ding H Zhu F 2022 Gender inequality research productivity 
COVID19 pandemic Manuf Serv Oper Manag 24 707–726 Daft RL Lengel RH 1986 Organizational information requirements media richness structural design Manag Sci 32 554–571 Daniel Stewart K 2016 Open source success resource access flow integration J Strateg Inf 25 159–176 Davidson J Mannan U Naik R Dua J Jensen C 2014 Older adults freeopen source diary study firsttime contributors Proceedings International Symposium Open Collaboration Association Computing Machinery New York NY United States pp 1–10 Dennis AR Fuller RM Valacich JS 2008 Media tasks communication processes theory media synchronicity MIS Q 32 575–600 DiMaggio P Hargittai E Neuman WR Robinson JP 2001 Social implications internet Annu Rev Sociol 27 307–336 Fang Neufeld 2009 Understanding sustained participation open source projects J Manag Inf Syst 25 9–50 Fang H Wang L Voss J 2010 Human mobility restrictions spread novel coronavirus 2019nCoV China J Public Econ 191 104272 Foerderer J 2020 Interfirm exchange innovation platform ecosystems Evidence Apple’s worldwide developers conference Manag Sci 66 4772–4778 Ford Storey Zimmermann Bird C Jaffe Maddila C Butler JL Houck B Nagappan N 2021 tale two cities developers working home COVID19 pandemic ACM Trans Softw Eng Methodol 31 1–37 Forsgren N 2020 Octoverse Spotlight Analysis Developer Productivity Work Cadence Collaboration Early Days COVID19 httpsgithubcomblog20200506octoversespotlightananalysisofdeveloperproductivityworkcadenceancolaborationintheearlydaysofcovid19 Foss NJ Jeppesen LB Rullani F 2021 context attention shape behaviors online communities modified garbage model Ind Corp Chang 30 1–18 GitHub 2022a GitHub Language Support httpsdocsgithubcomengetstartedlearningaboutgithubgithublanguagesupport GitHub 2022b GitHub Builds httpsgithubcomabout Hambrick DC Davison SC Snell SA Snow CC 1998 groups consist multiple nationalities towards new understanding implications Organ Stud 19 181–205 Hertel G Niederer Hermann 2003 Motivation developers open source 
projects internetbased survey contributors linux kernel Res Policy 32 1159–1177 Howard PE Rainie L Jones 2001 Days nights internet impact major technology South Africa Res Sci 45 383–404 Hu J Hu Yang X Chau 2023 firms improve performance external contributions opensource projects Proceedings 31th European Conference Information Systems ECIS Kristiansand Norway Huang L Zhong Fan W 2022 social networking sites promote life satisfaction explanation online offline social capital transformation Inf Technol People 35 703–722 Khaliq Mikami AY 2018 Talking facetoface associations online offline interactions online relationships Comput Hum Behav 89 88–97 Kock N 2004 psychobiological model towards new theory computermediated communication based Darwinian evolution Organ Sci 15 327–348 Kraut Lewis SH Swezey LW 1982 Listener responsiveness coordination conversation J Pers Soc Psychol 43 718–731 Kraut R Kiesler Boneva B Cummings J Helgeson V Crawford 2002 Internet paradox revisited J Soc Issues 58 49–74 von Krogh G Haefliger 2012 Carrots rainbows motivation social practice open source development MIS Q 36 649–674 Lau DC Murmigan JK 1998 Demographic diversity fruitfulness compositional dynamics organizational groups Acad Manag Rev 23 325–340 Leslie E Wilson R 2020 Sheltering place domestic violence evidence calls service COVID19 J Public Econ 189 104241 Lipnitzki J Stamps J 1999 Virtual teams new way work Strateg Leader 27 14–19 Miller C Widder DG Kastner C Vasilieus B 2019 people give flossing study contributor disengagement open source IFIP International Conference Open Source Systems Springer pp 116–129 Miller C Rodeghero P Storey Ford Zimmermann 2021 “How weekend” development teams working home COVID19 IEEEACM 43rd International Conference Engineering IEEE pp 624–636 Moqi Mei X Qiu L Bandypadhyay 2018 Effect “following” contributions open source communities J Manag Inf Syst 35 1188–1217 Muralidharan K Prakash N 2017 Cycling school increasing secondary school enrollment girls 
India Econ J Appl Econ 9 321–350 Negoita B Vial G Shaikh Labbe 2019 Code forking development sustainability Evidence GitHub Fortieth International Conference Information Systems Munich Germany Neto PAdMS Mannan UA de Almeida ES Nagappan N Lo Singh Kochhar P Gao C Ahmed 2021 deep dive impact COVID19 development IEEE Trans Softw Eng 48 3342–3360 NicCanna C Razzak Noll J Beecham 2021 Globally distributed development COVID19 2021 IEEEACM 40th International Workshop Engineering Research Industrial Practice IEEE pp 18–25 Virtual Conference Ocker R Fjermedal J Hiltz SR Johnson K 1998 Effects four modes group communication outcomes requirements determination J Manag Inf Syst 15 99–118 O’Mahony Ferraro F 2007 emergence governance open source community Acad Manag J 50 1079–1106 Peters P Baltes Adisaputri G Torkar R Kovalenko V Kalinowski Novielli N Yoo Devroye X Tan X Zhou Turhan B Hoda R Hata H Robles G Fard Alkadhri R 2020 Pandemic programming Emir Softw Eng 25 1–35 Shah SK 2006 Motivation governance viability hybrid forms open source development Manag Sci 52 1000–1014 Sheridan Andersen AL Hansen ET Johannesen N 2020 Social distancing laws cause small losses economic activity COVID19 pandemic Scandinavia Proc Natl Acad Sci 117 20468 Smite Moe NB Klotins E GonzalezHuerta J 2023 forced workingfromhome voluntary workingfromanywhere two revolutions telework J Syst Softw 195 111509 Sproll L Kiesler 1986 Reducing social context cues electronic mail organizational communication Manag Sci 32 1492–1512 Stam W 2009 community participation enhance performance open source projects 1287–1299 Straus SG McGrath JE 1994 medium matter interaction task type technology group performance member reactions J Appl Psychol 79 397–405 Suphan Mierzejewska BL 2016 Boundaries online offline realms social grooming affects students USA Germany Inf Commun Soc 19 1287–1305 Suphan Feuls Fieseler C 2012 Social media’s potential improving mental wellbeing unemployed ErikssonBakka K Looma Krook E Eds 
Exploring Abyss Inequalities Springer Berlin pp 10–28 Tanaka Okamoto 2021 Increase suicide following initial decline COVID19 pandemic Japan Nat Hum Behav 5 229–238 Townsend DeMarie SM Hendrickson AR 1998 Virtual teams workplace future Acad Manag Perspect 12 17–29 Wakefield RL Leidner DE Garrison G 2008 Research note—a model conflict leadership performance virtual teams Inf Syst Res 19 434–455 Walters C Mehl GG Piraino P Jansen JD Kriger 2022 impact pandemicenforced lockdown scholarly productivity women academics South Africa Res Policy 51 104403 Wang G 2022 Stay home stay safe effectiveness stayathome orders containing COVID19 pandemic Prod Oper Manag 31 2289–2305 Wang Cai H Li C Jiang Z Wang L Song J Xia J 2013 Optimal caliper width propensity score matching three treatment groups Monte Carlo study PLoS One 8 e101405 Warren 2020 Microsoft Letting Employees Work Home Permanently httpswwwthevergecom202010921508964microsoftremoteworkfromhomemicrosoft2019fbclidIwAR08H1r0lBjymHbfw4fYApVhHcdRvK5tv5z2qYTaUYe6c8Q6ynMkXzQxQ4 Wellman B Salaff J Dimitrova Garton L Gulia Haythornthwaite C 1996 Computer networks social networks collaborative work telework virtual community Annu Rev Sociol 22 213–238 Wikipedia 2022 Han Chinese httpsenwikipediaorgwikiHanChinese Wu J Smith Khurana Siemaszko C DeJesusBanos B 2020 Stayathome Orders Across Country httpswwwnbcnewscomhealthhealthnewsherearestayhomeordersacrosscountryn1168736 Xu B Jones DR Shao B 2009 Volunteers’ involvement online community based development Inf Manag 46 151–158 Yang X Li X Hu Wang HJ 2021 Differential impacts social influence initial sustained participation open source projects J Assoc Inf Sci Technol 72 1133–1147 Zhang XM Zhu F 2011 Group size incentives contribute natural experiment Chinese Wikipedia Econ Rev 101 1601–1615
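As a companion to the event-study specification in Appendix B (Eq B1), the following is a minimal sketch of how the relative-day dummies could be built for one developer-day observation. The function name, the handling of control developers with no order date (all dummies zero), and the handling of days outside the window (all dummies zero) are assumptions made for illustration; the paper does not state these details.

```python
from datetime import date
from typing import Dict, Optional


def event_time_dummies(
    obs_date: date,
    order_date: Optional[date],
    n: int = 28,
    reference: int = -1,
) -> Dict[int, int]:
    """Return {k: 0 or 1} for k in [-n, n], leaving out the reference day.

    k is the number of days between the observation and the date the
    stay-at-home order took effect in the developer's state. The day
    k == reference (the day just before treatment) is omitted so that it
    serves as the baseline.
    """
    ks = [k for k in range(-n, n + 1) if k != reference]
    if order_date is None:
        # Assumption: developers in states without an order get all zeros.
        return {k: 0 for k in ks}
    distance = (obs_date - order_date).days
    # Assumption: observations outside [-n, n] also get all zeros.
    return {k: int(k == distance) for k in ks}


# Louisiana's order start date is taken from Table 8.
two_days_after = event_time_dummies(date(2020, 3, 25), date(2020, 3, 23))
day_before = event_time_dummies(date(2020, 3, 22), date(2020, 3, 23))
never_treated = event_time_dummies(date(2020, 3, 25), None)
```

Each returned dictionary has 2n dummies (56 when n is 28). Regressing contributions on these dummies together with controls and the developer and date fixed effects would give the coefficients whose pre-treatment values are inspected for parallel trends.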
::::
Automating Dependency Updates Practice Exploratory Study GitHub Dependabot Runzhi Hao Yuxia Zhang Minghui Zhou Abstract—Dependency management bots automatically open pull requests update dependencies behalf developers Early research shows developers suspicious updates performed dependency management bots feel tired overwhelming notifications bots Despite dependency management bots becoming increasingly popular contrast motivates us investigate Dependabot currently visible bot GitHub reveal effectiveness limitations stateofart dependency management bots use exploratory data analysis developer survey evaluate effectiveness Dependabot keeping dependencies uptodate interacting developers reducing update suspicion reducing notification fatigue obtain mixed findings positive side projects reduce technical lag Dependabot adoption developers highly receptive pull requests negative side compatibility scores scarce effective reducing update suspicion developers tend configure Dependabot toward reducing number notifications 113 projects deprecated Dependabot favor alternatives survey confirms findings provides insights key missing features Dependabot Based findings derive summarize key characteristics ideal dependency management bot grouped four dimensions configurability autonomy transparency selfadaptability Index Terms—Dependency Management Engineering Bot Dependabot Mining Repositories
::::
1 INTRODUCTION update update question haunting engineers decades engineering “gurus” would argue keeping dependencies uptodate minimizes technical debt increases supply chain security ensures sustainability long term 1 Nonetheless requires substantial effort also extra responsibility developers Consequently many developers adhere practice “if ain’t broke don’t fix it” majority existing systems use outdated dependencies 2 One promising solution dilemma use bots automate dependency updates Therefore dependency management bots invented automatically open pull requests PRs update dependencies collaborative coding platform eg GitHub hope saving developer effort Recently dependency management bots increasingly visible gaining high momentum among practitioners exemplars bots including Dependabot 3 Renovate Bot 4 PyUp 5 Synk Bot 6 opened millions PRs GitHub 7 adopted variety industry teams according websites However simple idea using bot save world early work Mirhosseini Parnin 8 Greenkeeper 9 reveals 32 Greenkeeper PRs merged developers suspicious whether bot PR break code ie update suspicion feel annoyed large number bot PRs ie notification fatigue Since similar bots emerged evolved gained high popularity among visible one GitHub Dependabot 7 many improvements Section 23 However remains unknown extent bots overcome two limitations Greenkeeper identified Mirhosseini Parnin 8 2017 shed light improving dependency management bots engineering bots general present exploratory study Dependabot study answers following four research questions RQs empirically evaluate effectiveness Dependabot version update different dimensions detailed motivations Section 3 RQ1 extent Dependabot reduce technical lag adoption RQ2 actively developers respond merge pull requests opened Dependabot RQ3 effective Dependabot’s compatibility score allaying developers’ update suspicion RQ4 projects configure Dependabot automating dependency updates find many projects deprecated Dependabot favor 
alternatives ask additional RQ RQ5 projects deprecate Dependabot developers’ desired features Dependabot answer RQs sample 1823 popular actively maintained GitHub projects study subjects conduct exploratory data analysis 502752 Dependabot PRs projects use survey 131 developers triangulate findings findings provide empirical characterizations Dependabot’s effectiveness various dimensions importantly discover important limitations Dependabot stateoftheart bot overcoming update suspicion notification fatigue along missing features overcoming limitations Based findings summarize four key properties ideal dependency management bot ie configurability autonomy transparency selfadaptability roadmap engineering researchers bot designers
::::
2 Background Related Work 21 Dependency Update modern development updating dependencies important also nontrivial typical may tens thousands dependencies outdated ones induces risks 10 However update may contain breaking changes hard discover fix 11 situation inspires research understanding update practices designing metrics inventing approaches support dependency updates Bavota et al 12 find updates Apache ecosystems triggered major changes large number bug fixes may prevented API removals Kula et al 2 discover 815 4600 studied JavaMaven projects GitHub still keep outdated dependencies due lack awareness extra workload Pashchenko et al 13 find semistructured interviews developers face tradeoffs updating dependencies eg vulnerabilities breaking changes policies Researchers proposed measurements quantify “freshness” “outdatedness” dependencies applied various ecosystems Cox et al 14 propose several metrics quantify “dependency freshness” evaluate dataset industrial Java systems series studies 15 16 17 18 19 20 introduce notion technical lag metric measuring extent dependencies lagging behind latest releases investigate evolution technical lag Debian 15 npm 16 17 18 Librariesio dataset 19 Docker images 20 find technical lag tends increase time induces security risks mitigated using semantic versioning long line research engineering supporting automated update Since API breaking changes form majority update cost studies propose automated approaches match adapt evolving APIs eg 21 22 23 24 25 However Cossette Walker 26 reveal manual analysis real API adaptation tasks complex beyond capability previous automated approaches Recently research interest automated API adaptation surging works Java 27 JavaScript 28 Python 29 Android 30 etc hand practitioners often take conservative update approaches upstream developers typically use semantic versioning 31 signaling version compatibility downstream developers perform updates manually detect incompatibilities release notes 
compilation failures regression testing Unfortunately studies 32 33 34 35 reveal none work well guaranteeing update compatibility Generally providing guarantees still challenging open problem 36 22 Dependency Management Bots Perhaps noticeable automation effort among practitioners dependency management bots bots automatically create pull requests PRs update dependencies either immediately new release available security vulnerability discovered currently used version words dependency management bots solve lack awareness problem 2 automatically pushing update notifications developers Mirhosseini Parnin 8 conduct pioneering study Greenkeeper find developers update dependencies 16x frequently Greenkeeper 32 Greenkeeper PRs merged due two major limitations Update Suspicion automated update PR breaks code developers immediately become suspicious subsequent PRs reluctant merge Notification Fatigue many automated update PRs generated developers may feel annoyed notifications simply ignore update PRs Rombaut et al 37 find Greenkeeper issues inrange breaking updates induce large maintenance overhead many false alarms caused CI issues limitations Greenkeeper align well challenges revealed engineering SE bot literature Wessel et al 38 find SE bots GitHub interaction problems provide poor decisionmaking support Erlenhov et al 39 identify two major challenges “Alex” bot ie SE bots autonomously perform simple tasks design establishing trust reducing interruptionnoise Wyrich et al 7 find bot PRs lower merge rate need time interacted merged Two subsequent studies Wessel et al 40 41 qualitatively show noise central challenge SE bot design mitigated certain design strategies use “metabot” Shihab et al 42 draw picture SE bot technical socioeconomic challenges Santhanam et al 43 provide systematic mapping SE bot literature Since Mirhosseini Parnin 8 many bots emerged automating dependency updates Dependabot 3 preview release May 2017 Renovate Bot 4 first release January 2017 
Greenkeeper reaches endoflife June 2020 team merged Synk Bot 6 bots widely used according Wyrich et al 7 opened vast majority bot PRs GitHub six top seven top two occupied Dependabot 3 Dependabot Preview 44 ∼3 million PRs ∼12 million PRs respectively Erlenhov et al 45 find strict SE bot definition almost bots existing bot commit dataset 46 dependency management bots frequently adopted discarded switched even simultaneously used GitHub projects indicating fierce competition among 23 Dependabot Among different dependency management bots Dependabot 3 visible one GitHub projects 7 Dependabot Preview launched 2017 47 acquired GitHub 2019 48 August 2021 shut favor new GitHub native Dependabot 49 operating since June 2020 offers two main services Dependabot version update 50 configuration file named dependabotyml added GitHub repository Dependabot begin open PRs update dependencies latest version Developers specify exact Dependabot behavior dependabotyml eg update interval max number PRs Dependabot security update 51 Dependabot scans entire GitHub find repositories vulnerable dependencies Even dependabotyml supplied Dependabot still alerts repository owners repository owners tell Dependabot open PRs update vulnerable dependencies patched versions Figure 1 shows example Dependabot PR 52 Apart details one especially interesting Dependabot feature compatibility score badge According GitHub documentation 53 update’s compatibility score percentage CI runs passed updating specific versions dependency words score uses largescale regression testing data available GitHub CI test results estimate risk breaking changes dependency update looks like promising direction solving update suspicion problem previous studies shown test suites often unreliable detecting update incompatibilities 34 false alarms introduce significant maintenance overhead 37 However score’s effectiveness practice remains unknown notification fatigue problem Wessel et al 40 suggest SE bots offer flexible 
configurations send relevant notifications solutions principally implemented Dependabot still unclear whether specific configuration options notification strategies taken Dependabot really effective practice Alfadel et al 54 find developers receive Dependabot security PRs well 6542 PRs merged merged within day However security PRs constitute small portion Dependabot PRs 69 dataset developers perceive security updates highly relevant 13 effectiveness Dependabot version update general seems problematic SotoValero et al 55 find Dependabot opens many PRs bloated dependencies Cogo Hassan 56 provides evidence configuration Dependabot causes issues developers stated two developers GitHub issue 57 1 think we’d rather manage dependency upgrades time We’ve frequently bitten dependency upgrades causing breakages tend upgrade dependencies we’re close ready cut release 2 Also Dependabot tends pretty spammy rather annoying best knowledge comprehensive empirical investigation adoption Dependabot version update service still lacking knowledge Dependabot help formulation general design guidelines dependency management bots unveil important open challenges fulfilling guidelines
::::
3 RESEARCH QUESTIONS study goal evaluate practical effectiveness Dependabot version update service Section elaborate motivation RQ toward goal Dependabot version update service designed make developers aware new versions help keep dependencies uptodate quantitatively evaluate extent Dependabot fulfills main design purpose ie keeping dependencies uptodate reuse metrics technical lag literature 16 18 ask RQ1 extent Dependabot reduce technical lag adoption help developers keep dependencies uptodate Dependabot intervenes automatically creating update PRs new versions become available developers interact eg comment merge PRs evaluate effectiveness interaction process measuring extent developers interact smoothly Dependabot PRs forming next RQ RQ2 actively developers respond merge pull requests opened Dependabot One major limitation Greenkeeper developers tend suspicious whether dependency update introduce break changes 8 ie update suspicion hand Dependabot helps developers establish confidence update PRs using compatibility score feature Section 23 quantitatively evaluate effectiveness feature update suspicion ask RQ3 effective Dependabot’s compatibility score allaying developers’ update suspicion major limitation Greenkeeper developers tend overwhelmed large number update PRs 8 ie notification fatigue hand Dependabot provides flexible configuration options controlling amount notifications Section 23 explore developers configure reconfigure number notifications generated Dependabot study realworld Dependabot configurations ask RQ4 projects configure Dependabot automating dependency updates analysis discover nonnegligible portion projects studied corpus deprecated Dependabot migrated alternatives indepth retrospective analysis reasons behind deprecations help reveal important Dependabot limitations future improvement directions ask RQ5 projects deprecate Dependabot developers’ desired features Dependabot 4 STUDY DESIGN overview study shown Figure 2 study follows 
mixmethod study design obtain results repository data analysis triangulate developer survey Section introduce data collection survey methods specific analysis methods presented along results Section 5 41 Data Collection Selection first step need collect sample engineered maintained GitHub projects using used Dependabot version update workflow focus GitHub native Dependabot released June 1 2020 include Dependabot Preview study former provides much richer features allows us obtain latest stateoftheart results begin latest dump GHTorrent 58 released March 6 2021 largescale dataset GitHub projects widely used engineering research eg 7 34 find noticeable gap GHTorrent dataset July 2019 early January 2020 also observed Wyrich et al 7 Focusing solely GitHub native Dependabot allows us circumvent threats caused gap PRs created January 2020 select projects least 10 merged Dependabot PRs keep projects used Dependabot degree filter irrelevant lowquality unpopular projects retain nonfork projects least 10 stars inspired previous works 55 59 60 Since projects without sustained activities may perform dependency updates regular basis induce noise technical lag analysis RQ1 query GitHub APIs 61 retain projects median weekly commit least one past year exclude projects never utilized Dependabot version update clone retain projects git change history dependabotyml filtering steps end 1823 projects PR Collection use GitHub REST API 61 web scraper find Dependabot PRs February 14 2022 projects collect PR statistics CI test results timeline events leveraging distributed pool Cloudflare workers 62 web scraper empowers us bypass limitation GitHub APIs unhandy collecting CI test results PRs retrieve PR events CI test results scale PR body tell dependency PR updating current version updated version end stage obtain 540665 Dependabot PRs 711 CI test result updating 15590 dependencies 167841 version pairs next task identify security updates PRs created Dependabot However Dependabot longer 
labeling security updates due security reasons Instead Dependabot showing banner PR web page visible repository administrators default 51 Therefore choose construct mirror GitHub security advisory database 63 identify security PRs checking whether PR updates version vulnerability entry time PR creation specifically identify PR security update PR 1 dependency current version matches vulnerability GitHub security advisory database 2 updated version newer version fixes vulnerability ie vulnerability update 3 PR created vulnerability disclosure CVE Eventually identify 37313 security update PRs 69 540665 Dependabot PRs total Dataset Overview illustrated Table 1 projects dataset mostly engineered popular GitHub projects large code base active maintenance rich development history frequent Dependabot usage notice longtail distribution metrics concerning size ie number contributors lines code commit frequency expected common mining repository MSR datasets 35 64 65 441 projects dataset utilize npm package ecosystem followed Maven 123 PyPI 117 Go modules 78 Among Dependabot PRs update npm packages constitute even higher portion 649 followed PyPI 89 Go modules 43 Bundler 39 Maven 39 packages npm ecosystem generally evolve faster 66 Dependabot opened hundreds PRs projects mean 304 median 204 even thousands likely indicates high workload maintainers terms updated dependencies surprising

TABLE 1
Statistics | Mean | Median
Projects
Stars | 1423.92 | 66.00
Commits | 2837.11 | 1040.50
Contributors | 26.50 | 12.00
Lines Code thousands | 98.18 | 19.89
Commits per Week | 10.07 | 4.00
Age Adoption days | 1018.18 | 714.00
Dependabot PRs | 304.56 | 204.00
Survey respondents
Dependabot Interactions | 644.54 | 410.00
Commits | 477.00 | 331.50
Followers | 168.00 | 53.50
Years Experience GitHub | 10.37 | 10.68

TABLE 2 Survey Questions Results 131 Responses Total
5Point LikertScale Questions | Distribution | Avg
RQ1 Dependabot helps keep dependencies uptodate | 50 | 4.44
RQ2 Dependabot PRs require much work review merge | 25 | 3.94
RQ2 respond Dependabot PR fast safely merged | 25 | 4.42
RQ2 ignore Dependabot PR respond slower cannot safely merged | 25 | 3.78
RQ2 handle Dependabot PR higher priority updates vulnerable dependency | 25 | 4.19
RQ2 requires work review merge Dependabot PR updates vulnerable dependency | 25 | 2.49
RQ2 Dependabot often opens PRs handle | 25 | 2.73
RQ3 Compatibility scores often available Dependabot PRs | 50 | 2.95
RQ3 compatibility score available effective indicating whether update break code | 50 | 2.95
RQ4 Dependabot configured fit needs | 50 | 3.54
RQ4 configure Dependabot make less noisy ie update certain dependencies scan less frequently etc | 50 | 3.27
Multiple Choice Questions
RQ5 GitHub repositories still using Dependabot automating version updates | 50 | 0.89
RQ5 | 50 | Results § 55
OpenEnded Questions∗
RQ5 Regardless current availability features want bot updates dependencies opinions suggestions | 50 | Results § 55
∗ appropriate also use evidence openended question responses support results RQ1 RQ4

survey approved Ethics Committee Key Laboratory High Confidence Technology Ministry Education Peking University Grant CS20220011 candidate send personalized emails information used Dependabot avoid perceived spam try best follow common survey ethics 71 eg clearly introducing purpose survey transparent responses etc increase chance getting response contribute back opensource community offer donate 5 opensource respondents’ choice opt Therefore believe done minimal harm opensource developers contacted results get Dependabot far outweigh harm fact get several highly welcoming responses survey participants 1 keep good work 2 would like consult ping Cheers bottom half Table 1 summarizes demographics 131 survey respondents showing highly experienced Dependabot median 410 interactions open source development five 15 years experience hundreds commits many followers
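The three-condition rule for identifying security update PRs described in Section 41 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Advisory` and `UpdatePR` records, the representation of versions as integer tuples, and the half-open vulnerable range are all hypothetical simplifications of real advisory data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

Version = Tuple[int, ...]  # assumption: versions compared as integer tuples


@dataclass
class Advisory:
    """Hypothetical, simplified advisory record."""
    package: str
    vulnerable_from: Version  # first affected version (inclusive)
    patched: Version          # first version that fixes the vulnerability
    disclosed: date           # disclosure date of the vulnerability


@dataclass
class UpdatePR:
    """Hypothetical record for one Dependabot PR."""
    package: str
    current: Version
    updated: Version
    created: date


def is_security_update(pr: UpdatePR, advisories: List[Advisory]) -> bool:
    """Return True if the PR satisfies all three conditions for some advisory."""
    for adv in advisories:
        if adv.package != pr.package:
            continue
        # (1) the current version matches a known vulnerability
        affects_current = adv.vulnerable_from <= pr.current < adv.patched
        # (2) the updated version is at least the version that fixes it
        fixes = pr.updated >= adv.patched
        # (3) the PR is created on or after disclosure (treating "on" as
        #     included is an assumption of this sketch)
        after_disclosure = pr.created >= adv.disclosed
        if affects_current and fixes and after_disclosure:
            return True
    return False
```

Real advisories usually express affected versions as ranges in ecosystem-specific syntax, so a faithful implementation would need a proper version-range parser for each package ecosystem.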
::::
5 METHODS RESULTS 51 RQ1 Technical Lag 511 Repository Analysis Methods evaluate effectiveness Dependabot version updates comparing technical lag two time points day Dependabot adoption $T_0$ 90 days after adoption ie $T_0 + 90$ choose 90 days interval avoid influence deprecations 85 happen 90 days adoption Since technical lag naturally increases time 16 18 include additional time point comparison 90 days before adoption ie $T_0 - 90$ project $p$ time $t \in \{T_0 - 90, T_0, T_0 + 90\}$ denote direct dependencies $\text{deps}(p, t)$ define technical lag $p$ time $t$ $$\text{techlag}(p, t) = \frac{\sum_{d \in \text{deps}(p, t)} \max\left(0,\ t_{\text{latest}}(d) - t_{\text{adopted}}(d)\right)}{|\text{deps}(p, t)|}$$ $t_{\text{latest}}(d)$ denotes release time latest version $d$ time $t$ $t_{\text{adopted}}(d)$ denotes release time adopted version use max guard occasional case $t_{\text{latest}}(d) < t_{\text{adopted}}(d)$ eg developers may continue release 0.9.x versions release 1.0.0 technical lag definition inspired Zerouali et al 18 several adjustments First use timebased variant instead versionbased variant crossproject comparisons would intuitive using latter Second use mean value dependencies instead maximum median overall technical lag intend measure overall effectiveness Dependabot keeping dependencies uptodate eliminating outdated ones exclude projects age fewer 90 days Dependabot adoption projects deprecate Dependabot within 90 days also exclude projects migrate Dependabot Preview since may introduce bias results Since computation technical lag based dependency specification files version numbers requires nontrivial implementation work package ecosystem limit analysis JavaScriptnpm popular ecosystem dataset exclude projects eligible npm dependencies configured Dependabot $T_0 - 90$ $T_0$ $T_0 + 90$ filtering retain 613 projects answering RQ1 adopt Regression Discontinuity Design RDD framework estimate impact adopting Dependabot technical lags RDD uses level discontinuity beforeafter intervention measure effect size taking influence overall background trend consideration Given technical lag tends naturally increasing time 16 17 18 RDD appropriate statistic modeling
approach case compared hypothesis testing approaches eg oneside Wilcoxon ranksum tests Following previous SE works utilized RDD 72 73 use sharp RDD ie segmented regression analysis interrupted time series data treat projectlevel technical lag time series function compute technical lag every 15 days $T_0 - 90$ $T_0 + 90$ use ordinary least square regression fit RDD model watch presence discontinuity Dependabot adoption formalized following model $$y_i = \alpha + \beta \cdot \text{time}_i + \gamma \cdot \text{intervention}_i + \theta \cdot \text{time\_after\_intervention}_i + \sigma_i$$ $y_i$ denotes output variable ie technical lag case $\text{time}$ stands number days $T_0 - 90$ $\text{intervention}$ binarizes presence Dependabot 0 before adopting Dependabot 1 after adoption $\text{time\_after\_intervention}$ counts number days $T_0$ 0 $T_0 - 90 \leq \text{time} < T_0$
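The segmented regression model above can be fitted with ordinary least squares using only the standard library. The sketch below is a minimal illustration, not the authors' code: the function names are hypothetical, observations are indexed by days relative to $T_0$, and treating the observation at $T_0$ itself as post-adoption is an assumption.

```python
from typing import List, Sequence, Tuple


def design_row(day: int, window: int = 90) -> List[float]:
    """Design-matrix row for an observation `day` days relative to T0 (day 0)."""
    time = float(day + window)                    # days since T0 - window
    intervention = 1.0 if day >= 0 else 0.0       # assumption: T0 counts as adopted
    time_after = float(day) if day >= 0 else 0.0  # days since T0, 0 before adoption
    return [1.0, time, intervention, time_after]


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_rdd(days: Sequence[int], lags: Sequence[float],
            window: int = 90) -> Tuple[float, float, float, float]:
    """Return OLS estimates (alpha, beta, gamma, theta) via the normal equations."""
    rows = [design_row(d, window) for d in days]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, lags)) for i in range(k)]
    alpha, beta, gamma, theta = solve_linear_system(xtx, xty)
    return alpha, beta, gamma, theta
```

In this parameterization `gamma` is the level discontinuity at adoption and `theta` is the change in slope afterwards. A statistics package would additionally report standard errors and p-values, which this sketch does not compute.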
::::
512 Repository Analysis Results present technical lags delta time points Table 3 plot diagrams Figure 3 reflect different projects increasedecrease technical lag T0 90 T0 90 first surprising fact notice technical lag approximately onethird 216613 projects already decreasing T0 90 T0 even technical lag tends increase time 16 18 indicates projects already taking proactive dependency update strategy even adopting Dependabot hand half 303613 projects technical lag increases prior Dependabot adoption 94 projects keep technical lag unchanged projects mean median technical lag T0 90 7368 1627 days respectively decrease T0 4899 1396 days respectively T0 90 159 259 613 projects already achieved zero technical lag T0 T0 90 projects lower technical lag even mean 4899 days median 1396 days mean 2538 days median 362 days Among 303 projects increasing technical lag T0 90 T0 twothirds 220 see decrease adopting Dependabot among 216 projects decreasing technical lag nearly half 94 see decrease onethird 219 357 projects achieve completely zero technical lag 90 days Dependabot adoption Although still increases magnitude much smaller eg 75 quantile 175 days T0 T0 90 compared 75 quantile 1437 days T0 90 T0 Table 4 shows regression variable textintervention statistically significant negative coefficient textcoef 312137 p 0001 indicating adoption Dependabot might reduced technical lag kept dependencies uptodate sampled 613 projects straightforward look trend observed Figure 4 T0 projectlevel technical lag noticeable decrease discontinuity linerfitted technical lag beforeafter adoption texttime texttimetextafter intervention negative coefficients echoing earlier findings
::::
Table 3 Technical Lag days 613 npm Projects
Metric | Mean | Median
$\text{techlag}(p, T_0 - 90)$ | 73.68 | 16.27
$\Delta$ in between | −24.96 | 0.00
$\text{techlag}(p, T_0)$ | 48.99 | 13.96
$\Delta$ in between | −23.61 | −0.61
$\text{techlag}(p, T_0 + 90)$ | 25.38 | 3.62
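The project-level technical lag reported in Table 3 follows the definition in Section 511: the mean over direct dependencies of the time gap between the latest and the adopted release, clamped at zero. A minimal sketch is given below. The function name and the input shape, a mapping from dependency name to a pair of release times, are assumptions for illustration.

```python
from datetime import datetime
from typing import Dict, Tuple


def techlag_days(deps: Dict[str, Tuple[datetime, datetime]]) -> float:
    """Mean time-based technical lag in days.

    `deps` maps each direct dependency to (t_latest, t_adopted): the release
    time of the latest version and of the adopted version at evaluation time.
    """
    if not deps:
        raise ValueError("project has no direct dependencies")
    total = 0.0
    for t_latest, t_adopted in deps.values():
        gap = (t_latest - t_adopted).total_seconds() / 86400.0
        # max(0, .) guards the case where the adopted release is newer than
        # the "latest" one, e.g. a 0.9.x backport published after 1.0.0
        total += max(0.0, gap)
    return total / len(deps)
```

For example, one dependency lagging 30 days and another whose adopted release is newer than the latest release yield a project-level lag of 15 days.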
::::
Table 4 Estimated Coefficients Significance Levels RDD Model Fit Section 511
Feature | Coef | Std Err | t | p
Intercept | 66.5209 | 4.595 | 14.477 | 0.000 ***
$\text{intervention}$ | −31.2137 | 5.694 | −5.306 | 0.000 ***
$\text{time}$ | −0.0743 | 0.079 | −0.945 | 0.345
$\text{time\_after\_intervention}$ | −0.1011 | 0.100 | −1.008 | 0.314
*** p < 0.001
technical lag sampled projects already decrease Dependabot adoption introduction Dependabot adds decreasing trend However coefficients comparable intervention statistically significant p > 0.3
::::
513 Triangulation Survey developers agree Dependabot helpful keeping dependencies uptodate 558 responded Strongly Agree 357 Agree Table 2 noted one developer Dependabot great job keeping repositories current Dependabot serves well automated notification mechanism tells presence new versions pushes update dependencies mentioned two developers 1 Dependabot wonderful way learn majorminor updates libraries 2 Dependabot bit noisy makes aware dependencies However developers favor using Dependabot automating dependency updates use Dependabot way notification example 1 use notifications updates manually check anything broke process 2 using Dependabot tell something update update single shot plain package managers indicates trust reliability Dependabot automating updates think current design Dependabot help reduce manual workload updates example one developer states Dependency management currently much easier utilizing yarnnpm use Dependabot merely recommended updating dependencies faster solely used command line One developer suggests using Dependabot update notifications become common use case would prefer dedicated less noisy tool solely designed purpose Dependabot becomes like update notification ie I’m leveraging half capability Could something designed solely purpose Less invasive informative instead creating PR every package’s update would like see panelstyle hub collect information get better overview one place Findings RQ1
::::
90 days adopting Dependabot projects decrease technical lag average 48.99 days average 25.38 days 35.7% projects achieve zero technical lag 90 days adoption adoption Dependabot statistically significant intervention indicated RDD Developers agree effectiveness notifying updates question effectiveness automating updates
::::
52 RQ2 Developers’ Response Pull Requests
::::
521 Repository Analysis Methods Inspired prior works 7 54 use following metrics measure receptiveness ie active developers merge responsiveness ie active developers respond Dependabot PRs Merge Rate proportion merged PRs Merge Lag time takes PR merged
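The two metrics above can be computed directly from PR timestamps. The sketch below is illustrative only: the `PullRequest` record and its field names are hypothetical, not the schema of the GitHub API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Sequence


@dataclass
class PullRequest:
    """Hypothetical minimal PR record."""
    created_at: datetime
    merged_at: Optional[datetime] = None  # None if the PR was never merged


def merge_rate(prs: Sequence[PullRequest]) -> float:
    """Proportion of PRs that were merged."""
    if not prs:
        raise ValueError("no PRs given")
    return sum(pr.merged_at is not None for pr in prs) / len(prs)


def merge_lags_days(prs: Sequence[PullRequest]) -> List[float]:
    """Merge lag in days for every merged PR (unmerged PRs are skipped)."""
    return [
        (pr.merged_at - pr.created_at).total_seconds() / 86400.0
        for pr in prs
        if pr.merged_at is not None
    ]
```

Mean and median lags, as reported per PR group, can then be obtained with `statistics.mean` and `statistics.median` over the returned list.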
::::
Close Lag time takes PR closed ie not merged code base Response Lag time takes PR human interactions including observable action PR’s timeline eg adding label assigning reviewer merge rate intended measure receptiveness latter three intended measure responsiveness assume results may differ PRs different groups expect 1 developers receptive responsive security updates due higher priority eliminating security vulnerabilities 2 projects use Dependabot version update ie contain dependabotyml responsive Dependabot PRs verify expectations divide PRs three groups regular Dependabot PRs update package latest version old version not contain known security vulnerabilities secconf Security PRs update package vulnerabilities patched version opened dependabotyml file repository ie using Dependabot version update secnconf Security PRs opened without dependabotyml file repository PRs opened either before adoption after deprecation Dependabot version update examine significance intergroup metric differences unpaired MannWhitney tests Cliff’s delta $\delta$ Following Romano et al 74 consider effect size negligible $|\delta| \in [0, 0.147)$ small $|\delta| \in [0.147, 0.33)$ medium $|\delta| \in [0.33, 0.474)$ large otherwise

TABLE 5 PR Statistics Different Groups lags measured days $\bar{x}$ represents mean $\mu$ represents median PRs group
Statistics | regular | secconf | secnconf
PRs | 502752 | 13406 | 23907
Merge Rate | 70.13% | 73.71% | 76.01%
Merge Lag | $\bar{x}$ = 1.76, $\mu$ = 0.18 | $\bar{x}$ = 3.45, $\mu$ = 0.18 | $\bar{x}$ = 8.15, $\mu$ = 0.76
Close Lag | $\bar{x}$ = 8.63, $\mu$ = 3.00 | $\bar{x}$ = 14.42, $\mu$ = 5.00 | $\bar{x}$ = 26.83, $\mu$ = 5.71
Resp Lag | $\bar{x}$ = 2.27, $\mu$ = 0.17 | $\bar{x}$ = 3.74, $\mu$ = 0.17 | $\bar{x}$ = 8.59, $\mu$ = 0.51

522 Repository Analysis Results Table 5 shows PR statistics obtain group high merge rates >70% indicate projects highly receptive Dependabot PRs regardless whether securityrelated receptive security PRs merge rate 74.53% even higher 65.42% reported Dependabot Preview security updates 54 may projects welcome security updates even projects selected Alfadel et al 54 find Dependabot security PRs take longer close merge data illustrate similar story regular Dependabot PRs take median 0.18 days ≈
four hours merge median 300 days close difference statistically significant large effect size p 0001 delta 091 response lag however differ much merge lag groups confirms timeliness developers’ response towards Dependabot PRs observe human activities 360126 722 Dependabot PRs among 280276 778 take less one day respond However also indicates inconsistency fast responses slow closes glance caused inconsistency sample ten closed PRs developers’ activities closing inspect event history find 9 10 PRs closed Dependabot PR obsolete due release newer version manual upgrade similar observation Alfadel et al 54 Activities developmentrelated eg starting discussion assigning reviewers 5 PRs rest interactions Dependabot eg dependabot rebase Surprisingly security PRs require longer time merge p 0001 delta 087 close p 0001 delta 072 respond p 0001 delta 087 large effect sizes regardless whether using Dependabot version update Though Dependabot version update users process security updates quicker least merge lag response lag noticeably shorter difference significant negligible small effect sizes delta leq 023
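The effect sizes quoted above are Cliff's delta values, interpreted with the thresholds from Romano et al 74 (0.147, 0.33, 0.474). A minimal sketch of both the statistic and the labeling follows. The function names are hypothetical, and the quadratic pairwise comparison is chosen for clarity rather than for scale.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def effect_size_label(delta: float) -> str:
    """Map |delta| to the label used in the paper's interpretation scheme."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

With hundreds of thousands of PRs per group, a rank-based computation would be preferable to the all-pairs loop used here, but the result is the same.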
::::
523 Triangulation Survey general developers agree Dependabot PRs require much work review merge 341 Strongly Agree 403 Agree 140 Neutral find follow two different patterns using Dependabot One pattern rapidly merge PR tests pass manually perform update hand otherwise 652 Strongly Agree 197 Agree 91 Neutral latter case respond Dependabot PR slower let Dependabot automatically close PR manual update 364 Strongly Agree 265 Agree 205 Neutral example almost never look Dependabot PRs tests 9999 PRs merged automatically Rarely dependency changes API example manually add fixesupdates mentioned Section 513 another pattern use Dependabot PRs solely way notification always perform manual updates cases contribute much larger close lag observe Dependabot PRs terms security updates developers handle security PRs higher priority 567 Strongly Agree 163 Agree 140 Neutral think security PRs require work review merge 194 Totally Disagree 364 Disagree 264 Neutral One possible explanation slower response merge close security PRs developers consider security vulnerabilities irrelevant want Dependabot ignore security vulnerabilities development dependencies don’t actually get used production Developers mixed opinion whether Dependabot opens PRs handle 159 Strongly Agree 152 Agree 220 Neutral 205 Disagree 265 Totally Disagree Whether PR workload introduced Dependabot acceptable may depend factors eg number dependencies fast packages evolve indicated two respondents 1 performance Dependabot similar bots could depend number dependencies smaller projects handful dependencies Dependabot less noisy usually safe compared large projects lot dependencies 2 utility something like Dependabot depends heavily stack number dependencies JS much noisy Ruby example Ruby moves slowly
::::
Findings RQ2 70 Dependabot PRs merged median merge lag four hours Compared regular PRs developers less responsive time respond close merge receptive higher merge rate security PRs Developers tend rapidly merge PRs consider “safe” perform manual updates remaining PRs
::::
53 RQ3 Compatibility Score
::::
531 Repository Analysis Methods explore effectiveness compatibility scores two aspects Availability Correlation Merge Rate 1 Availability begin analysis understanding data availability compatibility scores would take effect absent PRs purpose obtain compatibility scores badges PR bodies point URLs defined per dependency version pair Dependabot computes one compatibility score dependency version pair $\langle v_1, v_2 \rangle$ show score PRs update dependency $v_1$ $v_2$ case computation fails Dependabot generates unknown compatibility score $\langle v_1, v_2 \rangle$ Since compatibility scores computed datadriven manner wonder popularity updated dependencies affects availability quick evaluation sample 20 npm dependencies one million downloads per week representatives popular dependencies Next retrieve release history dependencies querying npm registry API retaining releases came available January 1 2020 recall Dependabot PRs dataset created January 2020 Section 4 releases dependency get possible dependency version pairs Cartesian product 1629 total query compatibility scores corresponding Dependabot URLs

Fig 5 Distribution compatibility scores available CI test results version pairs axios a Compatibility Score b CI Test Results

TABLE 6 Compatibility Score PR Merge Rate
Compatibility Score | PRs | Merge Rate
unknown | 485501 | 69.96%
< 80% | 1321 | 30.20%
< 90%, ≥ 80% | 1605 | 67.48%
< 95%, ≥ 90% | 1794 | 73.19%
< 100%, ≥ 95% | 2228 | 84.43%
= 100% | 10303 | 80.30%

2 Correlation Merge Rate theory developers perceive compatibility scores reliable PRs higher compatibility scores likely get merged quantitatively evaluate compare merge rates PRs different compatibility scores Since PRs update version pair share score utilize Spearman’s $\rho$ measure correlation a compatibility score dependency version pair $\langle v_1, v_2 \rangle$ b merge rate PRs update $v_1$ $v_2$ show Section 532 compatibility scores abnormally scarce Although reached Dependabot maintainers explanations claim information confidential refuse share details compute number CI test results dependency version pair
analyze overall distribution provide possible explanations scarcity 532 Repository Analysis Results 1 Availability Compatibility scores extremely scarce 34 PRs 05 dependency version pairs compatibility score unknown Merely 018 dependency version pairs value 100 scarcity become better even among popular npm dependencies 1604 985 1629 dependency version pairs sample compatibility score unknown 10 06 compatibility score 100 15 09 compatibility score less 100 example plot compatibility score matrix axios 15 version pairs compatibility scores Figure 5a 2 Correlation Merge Rate summarize merge rates PRs different compatibility scores Table 6 observe PRs compatibility score high score indeed increases chance merged score higher 90 developers likely merge PR contrast score lower 80 developers become unlikely 3020 merge Spearman’s rho compatibility score merge rate 037 p 0001 indicating weak correlation according Prion Haerling’s interpretation 75 Figure 6 shows number dependency version pairs x CI test results observe extreme Paretolike distribution 167053 dependency version pairs dataset less 1000 50 CI test results less 100 150 CI test results case axios Figure 5b compatibility scores indeed available version pairs available CI test results hard explain scores missing even version pairs many CI test results eg update 0192 0200 know underlying implementation details 533 Triangulation Survey Developers diverging opinions whether compatibility scores available 7 Strongly Agree 248 Agree 388 Neutral 178 Disagree 116 Totally Disagree whether compatibility scores effective available 47 Strongly Agree 217 Agree 457 Neutral 194 Disagree 85 Totally Disagree answer distributions high number Neutral responses likely indicate many developers know rate two statements 76 compatibility scores scarce developers exposed feature replied one developer Compatibility scores vulnerable dependencies detection great use Dependabot lot aware exist They visible user Another developer express
concerns compatibility scores effective saying Dependabot’s compatibility score never worked several developers 6 responses survey hold belief Dependabot works well projects highquality test suite example 1 Dependabot works best high test coverage fails people it’s likely little test coverage 2 Dependabot without good test suite indeed likely noisy good tests understanding code base trivial know whether update safe update Findings RQ3 Compatibility scores scarce effective 34 PRs known compatibility score PRs one scores weak correlation rho 037 PR merge rate scarcity may dependency version pairs sufficient CI test results ie Paretolike distribution inferring update compatibility result developers think Dependabot works well projects highquality test suites 54 RQ4 Configuration 541 Repository Analysis Methods Dependabot offers tons configuration options integration workflows review write commit messages label etc research question focus options related notifications expect possible countermeasures noise notification fatigue specifically investigate following options provided Dependabot 1 scheduleinterval option mandatory specifies often Dependabot scans dependencies checks new versions opens update PRs Possible values include daily weekly monthly 2 openpullrequestslimit specifies maximum number simultaneously open Dependabot PRs allowed default value five 3 allow tells Dependabot update subset dependencies default dependencies updated 4 ignore tells Dependabot ignore subset dependencies default dependency ignored latter two options flexible may contain constraints exclusive package ecosystems eg allowing updates production manifests ignoring patch updates according semantic versioning convention 31 understand developers’ current practice configuring Dependabot parse 3921 Dependabot configurations 1588 projects dependabotyml current working tree scheduleinterval openpullrequestslimit count frequency value allow ignore parse different options group three distinctive 
strategies 1 default allowing Dependabot update dependencies default behavior 2 ignorelist configuring Dependabot ignore subset dependencies 3 allowlist configuring Dependabot update subset dependencies explore modification history Dependabot configurations observe developers use configuration countermeasure noise wild purpose find commits 1823 projects modified dependabotyml extract eight types configuration changes file diffs 1 interval Developers increase scheduleinterval 2 interval Developers decrease scheduleinterval 3 limit Developers increase openpullrequestslimit 4 limit Developers decrease openpullrequestslimit 5 allow Developers allow dependencies automatically updated Dependabot 6 allow Developers longer allow dependencies automatically updated Dependabot 7 ignore Developers configure Dependabot ignore dependencies automated update 8 ignore Developers configure Dependabot longer ignore dependencies automated update Note 235 1823 projects dependabotyml current working tree investigate RQ5 One may depend one package ecosystem eg npm PyPI separate configurations Finally analyze configuration modifications time since Dependabot adoption mainly focus bursts modification patterns bursts illustrate lag developers’ perception noise countermeasures mitigate noise 542 Repository Analysis Results current configurations Dependabot show projects configure Dependabot toward proactive update strategy 2203 562 scheduleinterval daily merely 276 704 conservative monthly 1404 358 openpullrequestslimit configurations higher default value negligible proportion 23 lower allow ignore options configurations 3396 867 adopt default strategy less 380 97 use ignorelist small proportion 50 13 use allowlist modifications tell us another story 776 4257 1823 projects dataset modified Dependabot configuration options study eg update interval contain 218 modification commits average median 100 Figure 7 illustrates proportion modification type shows projects increase scheduleinterval 
lower openpullrequestslimit frequently opposite demonstrated Figure 8 projects increase scheduleinterval time Dependabot adoption likely reduce openpullrequestslimit several months Dependabot usage scheduleinterval determines often Dependabot bothers developers large extent seeing developers 336 projects increasing 868 configurations confirm behavior countermeasure noise reallife example developers reduce frequency monthly reduce noise 77 openpullrequestslimit quantifies developers’ workload interaction also noiserelated indicated developers’ complaint Dependabot PRs quickly get hand 78 focus modifications happen 90 days Dependabot adoption find nearly twothirds 625 openpullrequestslimit changes belong limit observations indicate following phenomenon beginning adoption developers configure Dependabot interact frequently update proactively However later get overwhelmed suffer notification fatigue causes reduce interaction Dependabot even deprecate Dependabot RQ5 extreme case one developer forces Dependabot open 1 PR time reduce noise 79 Ignoring certain dependencies seems another noise countermeasure developers tend add ignored dependency often remove one Figure 7 example commit says update ignored packages so never automatically updated stop noise 80 However also observe cases developers add ignored dependencies due intentions handling breaking changes 81 preserving backward compatibility 82 allow allow observe interesting burst allow Figure 8c earlier allow dependencies later find evidence explaining trend 543 Triangulation Survey Although half respondents think Dependabot configured fit needs 256 Strongly Agree 302 Agree 78 Totally Disagree 14 Disagree peek controversy one developer says think people complain noisy I’ve seen lot don’t configure things correctly half 504 respondents configured Dependabot make less noisy roughly onethird 326 212 Strongly Agree 295 Agree 167 Neutral 205 Disagree 121 Totally Disagree possible default configurations Dependabot
work projects limited number dependencies dependencies fastevolving see Section 523 projects developers need tweak configurations multiple times find sweet spot projects However many respondents eventually find Dependabot offer options want noise reduction update grouping automerge investigate indepth RQ5 Findings RQ4 majority Dependabot configurations imply proactive update strategy observe multiple patterns noise avoidance configuration modifications increasing schedule intervals lowering maximum number open PRs ignoring certain dependencies 55 RQ5 Deprecations Desired Features 551 Repository Analysis Methods locate projects may deprecated Dependabot find projects dependabotyml current working trees resulting 235 projects identify last commit removes dependabotyml inspect commit messages identify referenced issuesPRs following GitHub convention dependabotyml removal turns due restructure stop maintenance consider false positive exclude analysis remaining 206 projects analyze reasons deprecation commit messages issuePR text ie titles bodies comments Since large proportion text commit messages issues PRs irrelevant Dependabot deprecation reasons two authors read reread text corpus retaining relevant encode reasons text discuss reaching consensus conduct independent coding measure interrater agreement corpus small 27 deprecations contain documented reasons confirmed deprecations check bot configuration files commitPR history find possible migrations consider migrated another dependency management bot automation approaches meets following criteria 1 developers specified migration target commit message issuePR text 2 dependabotyml deleted another dependency management bot eg Renovate Bot automatically deletes dependabotyml setup PR 3 adopts another dependency management bot within 30 days Dependabot deprecation obtain developers’ desired features dependency management bot ask two optional openended questions end survey Table 2 two questions answered 97 46 developers 
respectively identify recurring patterns answers two authors paper 6 years development experience familiar using Dependabot conduct open coding 83 responses generate initial set codes read reread answers familiarize gain initial understanding one author assigns text answers initial codes reflects common features dependency management bots discusses author iteratively refine codes consensus reached conduct independent coding answers using refined codes exclude answers reflect anything related RQ response may contain multiple codes use MASI distance 84 measure distance two raters’ codes Krippendorff’s alpha 85 measure interrater reliability Krippendorff’s alpha obtain 0865 satisfies recommended threshold 08 indicates high reliability 85 552 Repository Analysis Results confirm 206 235 candidates reallife Dependabot deprecations substantial considering dataset contains 1823 projects Figure 9 observe Dependabot deprecations evenly distributed time general fluctuations mostly coming organizationwide deprecations instance maximum value December 2020 caused 26 Dependabot deprecations octokit official GitHub API client implementation encode nine categories reasons 27 deprecations explicitly mentioned reasons 1 Notification Fatigue 9 Deprecations Developers recognize Dependabot’s overwhelming notifications PRs central issue experience Dependabot noted one developer “I’ve going mad dependabot alerts annoying pointless I’d rather manual upgrades use this” 86 2 Lack Grouped Update Support 7 Deprecations Dependabot convention PR updates one dependency one dependency comes unhandy two scenarios related packages tend follow similar release schedules triggers Dependabot raise PR storm updates 87 b cases dependencies must updated together avoid breakages 88 excessive notifications additional manual work quickly frustrate developers example hope better group dependency upgrades default configuration grouping happening dependencies would upgraded individually 89 b Also lot packages 
updated together Separate PRs everything isn’t fun 90 3 Package Manager Incompatibility 7 Deprecations Developers may compatibility issues introduction new package manager newer version package manager seven cases found five concern yarn v2 one concerns npm v7 specifically lockfile v3 one concerns pnpm make matters worse Dependabot may even undesirable behaviors eg messing around yarn lockfiles 91 encountered incompatibilities contributes developers’ update suspicion merging pull requests leads possible breakages dependency specification files time writing Dependabot still clear timeline supporting pnpm 92 yarn v2 93 unlucky part Dependabot users means revert 94 patch Dependabot PRs manually automatically 95 migrate alternative eg Renovate Bot 96 4 Lack Configurability 5 Deprecations Dependabot also deprecated due developers’ struggle tailor suitable configuration example appears we’re able configure Dependabot give us majorminor upgrades 97 b Dependabot would require much configuration longterm – easy forget add new package directory 98 Developers mention dependency management bots provide finegrained configuration options update scope schedule Renovate Bot load options could tweak compared Dependabot want reduce frequency 99 5 Absence AutoMerge 3 Deprecations Alfadel et al 54 illustrate automerge features tightly associated rapid PR merges However GitHub refused offer feature Dependabot 100 claiming automerge allows malicious dependencies propagate beyond supervision maintainers may render Dependabot impractical claimed developer absence automerge creates clutter possibly high maintenance load notice nonnegligible proportion 817 pull requests merged thirdparty automerge implementations eg CI workflow GitHub App Unfortunately may become dysfunctional public repositories GitHub enforced change Dependabot PR triggered workflows 101 turns last straw several Dependabot deprecations developer states dropped Dependabot latest changes enforced GitHub prevent using action 
Dependabot’s PR’s context 6 High CI Usage 3 Deprecations Maintainers 3 projects complain Dependabot’s substantial autorebasings PRs devoured CI credits words Dependabot’s CI usage killed us Dependabot waste money carbon reasons Dependabot deprecation include 7 Dependabot Bugs 2 Deprecations 8 Unsatisfying Branch Support 1 Deprecation 9 Inability Modify Custom Files 1 Deprecation deprecation Dependabot necessarily mean developers’ loss faith automating dependency updates Actually twothirds 684 141206 projects turn another bot set custom CI workflows support dependency updates Among Renovate Bot 122 popular migration target followed projen 15 npmcheckupdates 2 depfu 1
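The proportions reported in Section 5.5.2 can be re-derived from the raw counts given in the text. The sketch below is only an arithmetic check, not part of the original analysis; the variable and function names are hypothetical, and the counts are copied from the section as reported.

```python
# Arithmetic check of the shares reported in Section 5.5.2.
# All counts are taken from the text; names are illustrative only.
TOTAL_PROJECTS = 1823          # projects in the dataset
CONFIRMED_DEPRECATIONS = 206   # confirmed Dependabot deprecations
MIGRATED = 141                 # deprecating projects that moved to other automation
LISTED_TARGETS = {             # migration targets named in the text
    "Renovate Bot": 122,
    "projen": 15,
    "npm-check-updates": 2,
    "depfu": 1,
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


deprecation_rate = share(CONFIRMED_DEPRECATIONS, TOTAL_PROJECTS)
migration_rate = share(MIGRATED, CONFIRMED_DEPRECATIONS)
renovate_share = share(LISTED_TARGETS["Renovate Bot"], MIGRATED)
# Migrations not attributed to one of the four named targets.
unlisted = MIGRATED - sum(LISTED_TARGETS.values())

print(deprecation_rate, migration_rate, renovate_share, unlisted)
```

Under these counts the check yields 11.3% of projects deprecating Dependabot, 68.4% of those migrating, and 86.5% of migrations going to Renovate Bot, which matches the figures quoted in the findings. The four named targets account for 140 of the 141 migrations.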
::::
553 Triangulation Survey Among 131 surveyed developers 14 107 tell us deprecated Dependabot projects reasons provide fall within analysis frequency distribution highly similar two exceptions one deprecates Dependabot frequently breaks code one deprecates entire stalled Developers also respond survey think automated dependency management important beneficial projects limitations Dependabot causes deprecation example Dependabot could great needs fixes It’s unclear Dependabot hasn’t polished also reply us Renovate Bot provide features need eg grouped update PRs identify nine major categories developers’ desired features corresponds one code answers provided 84 respondents remaining categories discarded supported one answer thus may occasional generalizable explain category order popularity 1 Group Update PRs 29 Respondents category refers feature automatically grouping dependency updates one PR instead opening one PR update frequently mentioned developers consider feature important measure making handling bot PRs less tedious repetitive timeconsuming want bot automatically identify dependencies updated together merge one PR update many libraries eg symfony typescripteslint babel version packages single version also want bot automatically find merge “safe” updates one PR leaving “unsafe” updates single PRs careful reviewing 2 Package Manager Support 20 Respondents category refers feature supporting package managers corresponding ecosystems features bot align conventions package managerecosystem Developers expressed desire bot support Gradle Flutter Poetry Anaconda C yarn v2 Clojure Cargo CocoaPods Swift Package Manager iOS etc indicating dependency management bots well designed implemented indeed benefit wide range developers development domains Dependabot claim support many package managers mentioned still needs tailored improved eg performance update behaviors 3 open Poetry updates merge one wait 15 minutes conflicts resolved b Perhaps nodejs projects ability update
packagejson addition packagelock dependency update made explicit 3 AutoMerge 19 Respondents category refers feature automatically merging update PRs repository certain conditions satisfied mentioned Section 533 developers believe long projects highquality test suites trivial review update PR would prefer merged automatically tests pass Despite significant demand feature also seems especially controversial means offloading trust giving bot autonomy Although GitHub considers unacceptable due security risks 100 survey clearly indicates many still want even well aware risks also think responsibility risk control eg vetting new releases given capable central authority three response examples might somewhat dangerous configurable somehow automerge something basically already happens merge PRs b merging Dependabot like 60 deps day don’t know versions published hackers took repository account would great authority humans actually check changes mark secure c it’d good could mute notifications Dependabot PRs except tests failed indicating need manually resolve issues Otherwise I’d happy hear updating deps 4 Display Release Notes 8 Respondents category refers feature always showing sort release notes change logs update PRs inform developers changes update Although Dependabot sometimes provide release notes PRs Figure 1 fails 248 PRs dataset One possible reason release notes often missing inaccessible open source projects 35 also confirmed one survey respondents npm package updates feel unnecessary maintainers often don’t bother write meaningful release notes At time shouldn’t expect maintainers go dependencies’ changelogs either perhaps tool find release notes 5 Avoid Unnecessary Updates 7 Respondents category refers feature providing default behavior configuration options avoid updates developers ecosystem perceived unnecessary frequently mentioned feature ability define separate update behaviors development production runtime dependencies Many developers would avoid
automatic update development dependencies perceive updates mostly noise little gain keeping development dependencies uptodate mentioned features include ability detect avoid updates bloated dependencies provide updates dependencies real security vulnerabilities 6 Custom Update Action 5 Respondents category features refers ability define custom update behaviors using eg regular expressions update dependencies unconventional dependency files 7 Configurability 5 Respondents category refers case developers expressing dependency management bots highly configurable provide information specific configuration options want eg configuration options 8 git Support 4 Respondents category features concerns integration dependency management bots version control system case git specific mentioned features include automatic rebase merge conflict resolution squashing etc help ensure bot PRs incur additional work developers eg manipulating git branches resolving conflicts 9 Breaking Change Impact Analysis 3 Respondents feature category refers ability perform program analysis identify breaking changes impact client code eg something like list parts codebase might impacted update would useful could based combination changes listed release notes analysis package used code developers’ desired features align well reasons Dependabot deprecation indicating feature availability important driver migrations competition dependency management bots Findings RQ5 113 studied projects deprecated Dependabot due notification fatigue lack grouped update support package manager incompatibility lack configurability absence automerge etc 684 migrate ways automation among common migration target Renovate Bot 865 identify nine categories developers’ desired features align well Dependabot deprecation reasons
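The desired-feature categories above come from two raters independently assigning sets of codes to each answer, compared with MASI distance (Section 5.5.1). The sketch below illustrates that distance, assuming the standard definition (Jaccard overlap weighted by a monotonicity factor of 1, 2/3, 1/3 or 0). The function name and the example code labels are hypothetical, and the Krippendorff's alpha computation that consumes these distances is not reproduced here.

```python
from fractions import Fraction


def masi_distance(codes_a, codes_b) -> Fraction:
    """MASI distance between two sets of codes assigned to the same answer.

    Distance = 1 - Jaccard(a, b) * M, where M is 1 for identical sets,
    2/3 when one set contains the other, 1/3 when the sets merely overlap,
    and 0 when they are disjoint.
    """
    a, b = frozenset(codes_a), frozenset(codes_b)
    if not a and not b:
        # Assumption: two empty code sets are treated as full agreement.
        return Fraction(0)
    intersection = a & b
    union = a | b
    if a == b:
        monotonicity = Fraction(1)
    elif a <= b or b <= a:
        monotonicity = Fraction(2, 3)
    elif intersection:
        monotonicity = Fraction(1, 3)
    else:
        monotonicity = Fraction(0)
    jaccard = Fraction(len(intersection), len(union))
    return 1 - jaccard * monotonicity


# Hypothetical code labels for one survey answer, as seen by two raters.
rater_1 = {"group-update-prs", "auto-merge"}
rater_2 = {"group-update-prs"}
print(masi_distance(rater_1, rater_2))
```

Partial agreement is penalised less than outright disagreement. In the example, one rater's codes are a subset of the other's, so the distance is 2/3 rather than 1. This suits answers that may carry several codes at once.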
::::
6 DISCUSSION 61 State Dependency Management Bots nutshell results indicate Dependabot could effective solution keeping dependency uptodate RQ1 RQ2 often significant noise workloads RQ1 RQ4 RQ5 many could mitigated features configuration options offered Dependabot RQ5 Apart Dependabot’s compatibility score solution hardly success indicating compatibility bot update PR RQ3 March 2023 Dependabot still active development GitHub majority effort supporting ecosystems eg Docker GitHub Actions adding features reduce noise eg automatically terminate Dependabot inactive repositories according GitHub change log 102 Still plenty room improvement tackle update suspicion notification fatigue problem 8 Among dependency management bots Renovate Bot actively developed popular alternative Dependabot version update RQ5 Greenkeeper 9 deprecated PyUp 5 seems longer active development Snyk Bot 6 mainly offers securityfocused solutions March 2023 Renovate Bot provides features configuration options Dependabot finetuning notifications including update grouping automerge 103 also provides merge confidence badges information Dependabot 104 However still unclear whether features strategies taken Renovate Bot actually effective practice believe Renovate Bot could important study subject future dependency management bot studies 62 Key Characteristics Dependency Management Bot section try summarize key characteristics ideal dependency management bot based results analysis previous work believe serve general design guidelines practitioners design implement improve dependency management bots similar automation solutions Configurability Wessel et al 40 argue noise central challenge SE bot design reconfiguration main countermeasure noise case Dependabot find Dependabot also causes noise developers opening PRs developers handle RQ4 developers reconfigure multiple times reduce noise RQ4 However reconfiguration always successful due lack certain features Dependabot causing deprecations migrations RQ5
many development activities also unlikely “silver bullet” present noted one survey respondents best practice dependency management easy fast safe Therefore argue configurability ie offering highest possible configuration flexibility controlling update behavior one key characteristics dependency management bots helps bot minimize unnecessary update notifications attempts developers less interrupted Apart options already provided Dependabot study indicates following configuration options present dependency management bots 1 Grouped Updates Dependency management bots provide options group multiple updates one PR Possible options include grouping “safe” updates eg breaking CI checks updates closely related dependencies eg different components framework 2 Update Strategies Dependency management bots allow developers specify dependency update based conditions whether dependency used production severity security vulnerabilities whether dependency bloated etc 3 Version Control System Integration Dependency management bots allow developers define bot interact version control system including branch monitor manipulate branches handle merge conflicts etc Autonomy According SE bot definition Erlenhov et al 39 key characteristics “Alex” type SE bot ability autonomously handle often simple development tasks central design challenges include minimizing interruption establishing trust developers However without automerge feature Dependabot hardly autonomous lack autonomy disliked developers RQ5 extreme cases developers use Dependabot entirely notification tool bot Section 513 lack autonomy also causing high level interruption workload developers using Dependabot projects RQ5 argue autonomy ie ability perform dependency updates autonomously without human intervention certain conditions one key characteristics dependency management bots characteristic possible risks consequences dependency updates highly transparent developers know trust updates Within context GitHub believe current 
dependency management bots offer configuration option merge update PRs CI pipeline passes option turned projects wellconfigured CI pipeline thorough static analysis building testing stages developers believe pipeline effectively detect incompatibilities dependency updates Section 533 respect security concern automerge used quickly propagate malicious package across ecosystem 100 argue responsibility verifying new releases terms security given independent developers usually required time expertise RQ5 Instead package hosting platforms eg npm Maven PyPI vet new package releases quickly take malicious releases minimize impact practices also advocated literature supply chain attacks 105 Transparency Multiple previous studies SE bots kinds bots point importance transparency bot design example Erlenhov et al 39 shows developers need establish trust bot perform correct development tasks Similarly Godulla et al 106 argue transparency vital bots used corporate communications context code review bots Peng 107 find contributors expect bot transparent certain code reviewer recommended reduce update suspicion 8 dependency management bots developers also need know trust bot perform dependency updates argue transparency ie ability transparently demonstrate risks consequences dependency update one key characteristics dependency management bots However Dependabot compatibility score feature hardly success toward direction developers trust test suites Beyond compatibility scores test suites following research directions may helpful enabling transparency dependency management bots establishing trust bot users 1 Program Analysis One direction achieve leverage program analysis techniques significant research practitioner effort breaking change analysis 36 two demonstrated potential using static analysis assessing bot PR compatibility 34 108 Still given extremely large scale bot PRs 7 research engineering effort needed implement lightweight scalable approaches support popular ecosystem 
2 CI Log Analysis Another direction extend idea compatibility score sophisticated techniques learn knowledge CI checks Since CI checks scarce many version pairs RQ3 interesting explore techniques transfer knowledge version pairs matrix Figure 5a less sparse massive CI checks available Dependabot PRs would promising starting point 3 Release Note Generation Dependabot sometimes fails locating providing release note updated dependency even one maintainers often don’t bother write meaningful release notes noted one respondent situation mitigated applying approaches change summarization eg 109 release note generation eg 110 SelfAdaptability ability adapt specific environment dynamics considered one key characteristics “rational agent” artificial intelligence 111 112 Dependency management bots also considered autonomous agents working artificial environment social coding platforms eg GitHub However findings reveal Dependabot often cannot operate ways expected developers RQ5 reconfigurations common RQ4 failures eg update actions package manager incompatibility git branching lead interruption extra work developers argue selfadaptability ie ability automatically identify selfadapt sensible default configuration project’s environment one key characteristics dependency management bots GitHub projects environment include major programming languages package managers ecosystems workflows used active timezone developer preferences recent activities etc dependency management bot ability automatically generate configuration file based information recommend configuration changes environment changed eg developer responses bot PRs become slower usual implemented providing semiautomatic recommender system recommending initial configuration developers prompting bot PRs modifying configurations bot adoption
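The self-adaptability idea at the end of Section 6.2, recommending a configuration change when developer responses to bot PRs become slower than usual, could be prototyped as a simple heuristic. The sketch below is purely illustrative and is not a Dependabot feature or the authors' method. The function name, the use of median response times, and the slowdown factor of 2.0 are all assumptions. Only the three interval values (daily, weekly, monthly) come from the paper's description of the schedule interval option.

```python
from statistics import median

# Interval values as listed in the paper, ordered from most to least frequent.
INTERVALS = ["daily", "weekly", "monthly"]


def recommend_interval(current, baseline_hours, recent_hours, slowdown_factor=2.0):
    """Suggest a schedule interval given response times to bot PRs (in hours).

    Hypothetical heuristic: if the recent median response time exceeds the
    baseline median by `slowdown_factor`, recommend the next less frequent
    interval; otherwise keep the current one.
    """
    if current not in INTERVALS:
        raise ValueError(f"unknown interval: {current!r}")
    if not baseline_hours or not recent_hours:
        # Not enough evidence to recommend a change.
        return current
    slowed_down = median(recent_hours) > slowdown_factor * median(baseline_hours)
    if not slowed_down:
        return current
    next_index = min(INTERVALS.index(current) + 1, len(INTERVALS) - 1)
    return INTERVALS[next_index]


# Responses used to take about 6 hours and now take about 40 hours.
print(recommend_interval("daily", [4, 6, 8], [30, 40, 50]))
```

A real recommender would likely weigh more signals named in the text, such as package managers, workflows, and recent activity. It would also surface the suggestion in a bot PR for developers to accept or reject rather than applying it silently.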
::::
63 Comparison Previous Work Several previous studies also made similar recommendations based results Greenkeeper Dependabot 8 37 54 56 Studies Greenkeeper 8 37 show dependency management bot causes noise developers CI test results unreliable investigate effectiveness bot configurations countermeasure noise Studies Dependabot 54 56 either focuses different aspect ie security updates 54 provides specific recommendations Dependabot features 56 Compared previous studies contributions study 1 systematic investigation Dependabot version update service 2 comprehensive fourdimension framework dependency management bot design implications study also related larger literature SE bots dependency management respect two fields contribution study unique lens observation ie Dependabot results set tailored recommendations dependency management bot design carefully discussed Section 62 implications study confirm extend echo implications existing literature
::::
64 Threats Validity
::::
641 Internal Validity RQ1 provided holistic analysis impact Dependabot adoption without incorporating possible confounding factors eg types dependencies characteristics projects Consequently difficult study establish firm answer effectiveness adopting Dependabot future work needed better quantify impact among possible confounding factors Several approximations used throughout analysis RQ2 resort identify security PRs may introduce hardtoconfirm errors repository owners know whether PRs securityrelated merge rate may accurately reflect extent Dependabot updates accepted developers projects may use different ways accepting contributions mitigate threat focus projects merged least 10 Dependabot PRs intuition projects unlikely accept Dependabot PRs ways already merged many RQ3 Dependabot’s compatibility scores may change time impossible know score time PR creation RQ4 Dependabot supports ecosystem specific matchers dependency specifications eg angular consider parsing configuration files However believe noise introduced minor invalidate findings hinder reproducibility data analysis Like studies involving manual coding analysis developer discussions survey responses vulnerable author bias mitigate two authors doublecheck results validate findings commitPR histories RQ5 conduct interrater reliability analysis RQ5 dataset becomes larger Finally interpretation data RQ1 RQ5 may also biased towards judgment mitigate triangulate key findings using developer survey derive implications based analysis developers’ feedback 642 External Validity like case studies generalizing specific findings RQ dependency management bots even projects use Dependabot cautious dataset contains popular actively maintained GitHub projects many already taking proactive updating strategies Therefore findings may generalize projects smaller scale reluctant update dependencies survey responses collected convenience sampling may introduce possible yet unknown biases terms experience age gender 
development role etc generalization survey results broad developer audience cautious outcome Dependabot usage may also generalize dependency management bots due functionality user base differences RQ1 base analysis JavaScriptnpm projects may generalize ecosystems different norms policies practices 11 comparison dependency management bot usage different ecosystems could important avenue future work Despite believe implications obtain dependency management bot design general proposed framework Section 62 form roadmap dependency management bot designers methodology could applied future studies compare effectiveness different bots
::::
7 Conclusion present exploratory study Dependabot version update service using repository mining survey identify important limitations design Dependabot findings derive fourdimension framework hope help dependency management bot design inspire research work related fields Several directions future work arise study example investigating comparing dependency management bots especially Renovate Bot help verify generalizability proposed framework empirical foundation factors affecting effectiveness bot adoption also necessary interesting investigate recommendation bot configurations developers study different approaches eg program analysis machine learning release note generation help developers assess compatibility bot PRs
::::
8 Data Availability provide replication package Figshare httpsfigsharecoms78a92332e4843d64b984 package used replicate results repository mining preserve privacy survey respondents choose disclose raw data survey Acknowledgments work supported National Key RD Program China Grant 2018YFB1004201 National Natural Science Foundation China Grant 61825201 sincerely thank developers participated survey References 1 Winters Manshreck H Wright Engineering Google Lessons Learned Programming Time O’Reilly Media 2020 2 R G Kula Germán Ouni Ishio K Inoue “Do developers update library dependencies empirical study impact security advisories library migration” Empir Softw Eng vol 23 1 pp 384–417 2018 3 httpsgithubcomdependabot 4 httpsgithubcomrenovatebot 5 httpspyupio 6 httpsgithubcomsnykbot 7 Wyrich R Ghit Haller C Müller “Bots don’t mind waiting comparing interaction automatically manually created pull requests” 3rd IEEEACM International Workshop Bots Engineering BotSEICSE 2021 Madrid Spain June 4 2021 IEEE 2021 pp 6–10 8 Mirhosseini C Parnin “Can automated pull requests encourage developers upgrade outofdate dependencies” Proceedings 32nd IEEEACM International Conference Automated Engineering ASE 2017 Urbana IL USA October 30 November 03 2017 IEEE Computer Society 2017 pp 84–94 9 httpsgreenkeeperio 10 httpswwwsonatypecomresourcesstateofthesoftwaresupplychain2021 11 C Bogart C Kästner J Herbsleb F Thung “When make breaking changes Policies practices 18 open source ecosystems” ACM Trans Softw Eng Methodol vol 30 4 pp 421–4256 2021 12 G Bavota G Canfora Penta R Oliveto Panichella “How apache community upgrades dependencies evolutionary study” Empir Softw Eng vol 20 5 pp 1275–1317 2015 13 Pashchenko L Vu F Massacci “A qualitative study dependency management security implications” CCS ’20 2020 ACM SIGSAC Conference Computer Communications Security Virtual Event USA November 913 2020 ACM 2020 pp 1513–1531 14 J Cox E Bouwers C J van Eekelen J Visser “Measuring dependency freshness 
systems” 37th IEEEACM International Conference Engineering ICSE 2015 Florence Italy May 1624 2015 Volume 2 IEEE Computer Society 2015 pp 109–118 15 J GonzálezBarahona P Sherwood G Robles IzquierdoCortazar “Technical lag compilations Measuring outdated deployment is” Open Source Systems Towards Robust Practices 13th IFIP WG 213 International Conference OSS 2017 Buenos Aires Argentina May 2223 2017 Proceedings ser IFIP Advances Information Communication Technology vol 496 2017 pp 182–192 16 Zerouali Mens E Constantinou “On evolution technical lag npm package dependency network” 2018 IEEE International Conference Maintenance Evolution ICSME 2018 Madrid Spain September 2329 2018 IEEE Computer Society 2018 pp 404–414 17 Zerouali E Constantinou Mens G Robles J GonzálezBarahona “An empirical analysis technical lag npm package dependencies” New Opportunities Reuse 17th International Conference ICSR 2018 Madrid Spain May 2123 2018 Proceedings ser Lecture Notes Computer Science vol 10826 Springer 2018 pp 95–110 18 Zerouali Mens J GonzálezBarahona Decan E Constantinou G Robles “A formal framework measuring technical lag component repositories application npm” J Softw Evol Process vol 31 8 2019 19 J Stringer Tahir K Blincoe J Dietrich “Technical lag dependencies major package managers” 27th AsiaPacific Engineering Conference APSEC 2020 Singapore December 14 2020 IEEE 2020 pp 228–237 20 Zerouali Mens Decan J GonzálezBarahona G Robles “A multidimensional analysis technical lag Debianbased docker images” Empir Softw Eng vol 26 2 p 19 2021 21 K Chow Notkin “Semiautomatic update applications response library changes” 1996 International Conference Maintenance ICSM ’96 48 November 1996 Monterey CA USA Proceedings IEEE Computer Society 1996 p 359 22 J Henkel Diwan “CatchUp capturing replaying refactorings support API evolution” 27th International Conference Engineering ICSE 2005 1521 May 2005 St Louis Missouri USA ACM 2005 pp 274–283 23 Z Xing E Stroulia “APIevolution support 
DiffCatchUp” IEEE Trans Eng vol 33 12 pp 818–836 2007 24 H Nguyen Nguyen G W Jr Nguyen Kim N Nguyen “A graphbased approach API usage adaptation” Proceedings 25th Annual ACM SIGPLAN Conference ObjectOriented Programming Systems Languages Applications OOPSLA 2010 October 1721 2010 RenoTahoe Nevada USA ACM 2010 pp 302–321 25 B Dagenais P Robillard “Recommending adaptive changes framework evolution” ACM Trans Softw Eng Methodol vol 20 4 pp 191–1935 2011 26 B Cossette R J Walker “Seeking ground truth retrospective study evolution migration libraries” 20th ACM SIGSOFT Symposium Foundations Engineering FSE20 SIGSOFTFSE’12 Cary NC USA November 11 16 2012 ACM 2012 p 55 27 K Huang B Chen L Pan Wu X Peng “REPFINDER finding replacements missing APIs library update” 36th IEEEACM International Conference Automated Engineering ASE 2021 Melbourne Australia November 1519 2021 IEEE 2021 pp 266–278 28 B B Nielsen Torp Møller “Semantic patches adaptation JavaScript programs evolving libraries” 43rd IEEEACM International Conference Engineering ICSE 2021 Madrid Spain 2230 May 2021 IEEE 2021 pp 74–85 29 Haryono F Thung Lo J Lawall L Jiang “MLCatchUp Automated update deprecated machinelearning APIs Python” IEEE International Conference Maintenance Evolution ICSME 2021 Luxembourg September 27 October 1 2021 IEEE 2021 pp 584–588 30 Haryono F Thung Lo L Jiang J Lawall H J Kang L Semino C Müller “AndroEvolve Automated Android API update data flow analysis variable denormalization” Empir Softw Eng vol 27 3 p 73 2022 31 httpssemverorg 32 Mostafa R Rodriguez X Wang “Experience paper study behavioral backward incompatibilities Java libraries” Proceedings 26th ACM SIGSOFT International Symposium Testing Analysis Santa Barbara CA USA July 10 14 2017 ACM 2017 pp 215–225 33 Raemaekers van Deursen J Visser “Semantic versioning impact breaking changes Maven repository” J Syst Softw vol 129 pp 140–158 2017 34 J Hejderup G Gousios “Can trust tests automate dependency updates case study Java projects” J 
Syst Softw vol 183 p 111097 2022 35 J Wu H W Xiao K Gao Zhou “Demystifying release note issues GitHub” Proceedings 30th IEEEACM International Conference Program Comprehension ICPC 2022 Pittsburgh USA May 1617 2022 ACM 2022 36 P Lam J Dietrich J Pearce “Putting semantics semantic versioning” Proceedings 2020 ACM SIGPLAN International Symposium New Ideas New Paradigms Reflections Programming Onward 2020 Virtual November 2020 ACM 2020 pp 157–179 37 B Rombaut F R Cogo B Adams E Hassan “There’s thing free lunch Lessons learned exploring overhead introduced Greenkeeper dependency bot npm” ACM Transactions Engineering Methodology 2022 38 Wessel B de Souza Steinmacher Wiese Polato P Chaves Gerosa “The power bots Characterizing understanding bots OSS projects” Proc ACM Hum Comput Interact vol 2 CSCW pp 1821–18219 2018 39 L Erlenhov F G de Oliveira Neto P Leitner “An empirical study bots development characteristics challenges practitioner’s perspective” ESECFSE ’20 28th ACM Joint European Engineering Conference Symposium Foundations Engineering Virtual Event USA November 813 2020 ACM 2020 pp 445–455 40 Wessel Wiese Steinmacher Gerosa “Don’t disturb Challenges interacting bots open source projects” Proc ACM Hum Comput Interact vol 5 CSCW2 pp 1–21 2021 41 Wessel Abdellatif Wiese Conte E Shihab Gerosa Steinmacher “Bots pull requests good bad promising” 44th IEEEACM 44th International Conference Engineering ICSE 2022 Pittsburgh PA USA May 2527 2022 IEEE 2022 pp 274–286 42 E Shihab Wagner Gerosa Wessel J Cabot “The present future bots engineering” IEEE 2022 43 Santhanam Hecking Schreiber Wagner “Bots engineering systematic mapping study” PeerJ Comput Sci vol 8 p e866 2022 44 httpsgithubcomappsdependabotpreview 45 L Erlenhov F G de Oliveira Neto P Leitner “Dependency management bots opensource systems prevalence adoption” PeerJ Comput Sci vol 8 p e849 2022 46 Dey Mousavi E Ponce Fry B Vasilescu Filippova Mockus “Detecting characterizing bots commit code” MSR ’20 17th International 
Conference Mining Repositories Seoul Republic Korea 2930 June 2020 ACM 2020 pp 209–219 47 httpswwwindiehackerscominterviewlivingoffoursavingsandgrowingoursaasto740mo 48 httpswwwindiehackerscomproductdependabotacquiredbygithub1g7T7DN1rGEZM204shF 49 httpsgithubcombakerdependabotpreview 50 httpsdocsgithubcomencodesecuritysupplychainsecuritykeepingyourdependenciesupdatedconfigurationoptionsfordependencyupdates 51 httpsdocsgithubcomencodesecuritysupplychainsecuritymanagingvulnerabilitiesinyourprojectsdependenciesaboutalertsforvulnerabledependenciesaccesstodependabotalerts 52 Pull Request 1127 datadeskbaker 53 httpsdocsgithubcomencodesecuritysupplychainsecuritymanagingvulnerabilitiesinyourprojectsdependenciesaboutdependabotsecurityupdates 54 Alfadel E Costa E Shihab Mkhallalati “On use Dependabot security pull requests” 18th IEEEACM International Conference Mining Repositories MSR 2021 Madrid Spain May 1719 2021 IEEE 2021 pp 254–265 55 C SotoValero Durieux B Baudry “A longitudinal analysis bloated Java dependencies” ESECFSE ’21 29th ACM Joint European Engineering Conference Symposium Foundations Engineering Athens Greece August 2328 2021 ACM 2021 pp 1021–1031 56 F R Cogo E Hassan “Understanding customization dependency bots case dependabot” IEEE 2022 57 Pull Request 4317 caddyservercaddy 58 G Gousios “The GHTorrent dataset tool suite” Proceedings 10th Working Conference Mining Repositories ser MSR ’13 Piscataway NJ USA IEEE Press 2013 pp 233–236 59 N Munaiah Kroh C Cabrey Nagappan “Curating GitHub engineered projects” Empir Softw Eng vol 22 6 pp 3219–3253 2017 60 H R H Gu Zhou “A largescale empirical study Java library migrations prevalence trends rationales” ESECFSE ’21 29th ACM Joint European Engineering Conference Symposium Foundations Engineering Athens Greece August 2328 2021 ACM 2021 pp 478–490 61 httpsdocsgithubcomenrest 62 httpsworkerscloudflarecom 63 httpsgithubcomadvisories 64 Goeminne Mens “Evidence Pareto principle open source activity” Joint Proceedings 1st 
International workshop Model Driven Maintenance 5th International Workshop Quality Maintainability Citeseer 2011 pp 74–82 65 Zhang Zhou Mockus Z Jin “Companies’ participation OSS development—an empirical study OpenStack” IEEE Trans Eng vol 47 10 pp 2242–2259 2021 66 Decan Mens P Grosjean “An empirical comparison dependency network evolution seven packaging ecosystems” Empir Softw Eng vol 24 1 pp 381–416 2019 67 httpsgithubcomdependabotdependabotcoreissues4146 68 R Likert “A technique measurement attitudes” Archives Psychology 1932 69 httpstools4devorgresourceshowtochooseasamplesize 70 X Tan Zhou Z Sun “A first look good first issues GitHub” ESECFSE ’20 28th ACM Joint European Engineering Conference Symposium Foundations Engineering Virtual Event USA November 813 2020 ACM 2020 pp 398–409 71 httpswwwqualtricscomblogethicalissuesforonlinesurveys 72 Zhao Serebrenik Zhou V Filkov B Vasilescu “The impact continuous integration development practices largescale empirical study” Proceedings 32nd IEEEACM International Conference Automated Engineering ASE 2017 Urbana IL USA October 30 November 03 2017 IEEE Computer Society 2017 pp 60–71 73 N Cassee B Vasilescu Serebrenik “The silent helper impact continuous integration code reviews” 27th IEEE International Conference Analysis Evolution Reengineering SANER 2020 London Canada February 1821 2020 IEEE 2020 pp 423–434 74 J Romano J Kromrey J Coraggio J Skowronek “Appropriate statistics ordinal level data really using ttest Cohen’s evaluating group differences nse surveys” Annual Meeting Florida Association Institutional Research vol 177 2006 p 34 75 Prion K Haerling “Making sense methods measurement Spearmanrho rankedorder correlation coefficient” Clinical Simulation Nursing vol 10 p 535–536 10 2014 76 P Sturgis C Roberts P Smith “Middle alternatives revisited neithernor response acts way saying “i don’t know”” Sociological Methods Research vol 43 1 pp 15–38 2014 77 Pull Request 259 dropboxstone 78 Pull Request 3155 tuisttuist 79 
Pull Request 663 rostoolingactionrosci 80 Commit b337b5f justeathttpclientinterception 81 Pull Request 1260 asynkronprotoactordotnet 82 Commit a06b04e Azurebicep 83 H Khandkar “Open coding” University Calgary vol 23 p 2009 2009 84 R J Passonneau “Measuring agreement setvalued items MASI semantic pragmatic annotation” Proceedings Fifth International Conference Language Resources Evaluation LREC 2006 Genoa Italy May 2228 2006 European Language Resources Association ELRA 2006 pp 831–836 85 K Krippendorff Content Analysis Introduction Methodology Sage publications 2018 86 Pull Request 134 skytableskytable 87 Comment Issue 1190 dependabotdependabotcore 88 Issue 1296 dependabotdependabotcore 89 Pull Request 2635 giantswarmhappa 90 Commit 8cecf22 FateGrandAutomataFGA 91 Pull Request 1976 stoplightiospectral 92 Issue 1736 dependabotdependabotcore 93 Issue 1297 dependabotdependabotcore 94 Issue 202 nitzanogatsbysourcehashnode 95 Issue 26 replygirltc 96 Pull Request 1987 stoplightiospectral 97 Pull Request 2916 codalabcodalabworksheets 98 Pull Request 126 lyftclutch 99 Pull Request 3622 videodevhlsjs 100 Comment Issue 1973 dependabotdependabotcore 101 Issue 60 ahmadnassriactiondependabotautomerge 102 httpsgithubblogchangeloglabeldependabot 103 httpsdocsrenovatebotcom 104 httpsdocsrenovatebotcommergeconfidence 105 Zimmermann C Staicu C Tenny Pradel “Small world high risks study security threats npm ecosystem” 28th USENIX Security Symposium USENIX Security 2019 Santa Clara CA USA August 1416 2019 USENIX Association 2019 pp 995–1010 106 Godulla Bauer J Dietlmeier Lück Matzen F Vaaßen “Good bot vs bad bot Opportunities consequences using automated corporate communications” 2021 107 Z Peng X “Exploring developers work mention bot github” CCF Trans Pervasive Comput Interact vol 1 3 pp 190–203 2019 108 Foo H Chua J Yeo Ang Sharma “Efficient static checking library updates” Proceedings 2018 ACM Joint Meeting European Engineering Conference Symposium Foundations Engineering 
ESECSIGSOFT FSE 2018 Lake Buena Vista FL USA November 0409 2018 ACM 2018 pp 791–796 109 L F CortesCoy L Vásquez J Aponte Poshyvanyk “On automatically generating commit messages via summarization source code changes” 14th IEEE International Working Conference Source Code Analysis Manipulation SCAM 2014 Victoria BC Canada September 2829 2014 IEEE Computer Society 2014 pp 275–284 110 L Moreno G Bavota Penta R Oliveto Marcus G Canfora “ARENA approach automated generation release notes” IEEE Trans Eng vol 43 2 pp 106–127 2017 111 Poole Mackworth R Goebel Computational Intelligence Modern Approach Pearson Education Inc 2010 Runzhi currently undergraduate student School Electronics Engineering Computer Science EECS Peking University research mainly focuses open source sustainability supply chain contacted via rzhepkueducn Hao currently PhD student School Computer Science Peking University received BS degree Computer Science Peking University 2020 research addresses sociotechnical sustainability problems open source communities ecosystems supply chains information found personal website httpshehao98githubio reached hehao98pkueducn Yuxia Zhang currently assistant professor School Computer Science Technology Beijing Institute Technology BIT received PhD 2020 School Electronics Engineering Computer Science EECS Peking University research interests include mining repositories opensource ecosystems mainly focusing commercial participation opensource contacted yuxiazhbiteducn Minghui Zhou received BS MS PhD degrees computer science National University Defense Technology 1995 1999 2002 respectively professor School Computer Science Peking University interested digital sociology ie understanding relationships among people culture products mining repositories projects member ACM IEEE reached zhmhpkueducn
::::
Open Source Sustainability Combining Institutional Analysis SocioTechnical Networks LIKANG YIN University California Davis USA MAHASWETA CHAKRABORTI University California Davis USA YIBO YAN University California Davis USA CHARLES SCHWEIK University Massachusetts Amherst USA SETH FREY University California Davis USA VLADIMIR FILKOV University California Davis USA CCS Concepts • Humancentered computing → Empirical studies collaborative social computing Additional Key Words Phrases Institutional Design Sociotechnical Systems OSS Sustainability ACM Reference Format Likang Yin Mahasweta Chakraborti Yibo Yan Charles Schweik Seth Frey Vladimir Filkov 2022 Open Source Sustainability Combining Institutional Analysis SocioTechnical Networks Proc ACM HumComput Interact 6 CSCW2 Article 404 November 2022 23 pages httpsdoiorg1011453555129 Authors’ addresses Likang Yin lkyinucdavisedu University California Davis CA USA Mahasweta Chakraborti mchakrabortiucdavisedu University California Davis CA USA Yibo Yan ybyanucdavisedu University California Davis CA USA Charles Schweik cschweikumassedu University Massachusetts Amherst USA Seth Frey sethfreyucdavisedu University California Davis CA USA Vladimir Filkov vfilkovucdavisedu University California Davis CA USA
ABSTRACT Sustainable Open Source OSS forms much fabric digital society especially successful sustainable ones many OSS projects become sustainable resulting abandonment even risks world’s digital infrastructure Prior work looked reasons mainly two different perspectives engineering focus understanding success sustainability sociotechnical perspective OSS programmers’ daytoday activities artifacts create institutional analysis hand emphasis institutional designs eg policies rules norms structure governance Even though necessary comprehensive understanding OSS projects connection interaction two approaches barely explored paper make first effort toward understanding OSS sustainability using dualview analysis combining institutional analysis sociotechnical systems analysis particular use linguistic approaches extract institutional rules norms OSS contributors’ communications represent evolution governance systems ii construct sociotechnical networks based longitudinal collaboration records represent project’s organizational structure combined two methods applied dataset developer digital traces 253 nascent OSS projects within Apache Foundation ASF incubator find sociotechnical institutional features relate provide complementary views progress ASF’s OSS projects Refining combined analyses help provide precise understanding synchronization evolution institutional governance organizational structure
::::
1 INTRODUCTION Open Source OSS multibillion dollar industry majority modern businesses including major tech companies rely OSS without even knowing OSS contributions important manifestation computersupported collaborative work high degree technical literacy typical OSS contributors Even though popularity attracts many developers open source 80 OSS projects abandoned 37 failure collaborative work OSS received attention two perspectives engineering focus understanding success sustainability sociotechnical perspective OSS developers’ daytoday activities artifacts create management domain hand emphasis institutional designs eg policies rules norms structure governance OSS administration particular systems generate public goods address endemic social challenges creating governance institutions attracting maintaining incentivizing coordinating contributions Ostrom 32 defines institutions “… prescriptions humans use organize forms repetitive structured interactions…” Institutions guide interactions participants OSS informal established norms behavior formalized written codified rules norms formalized rules along mechanisms rule creation maintenance monitoring enforcement means collective action OSS development occur 37 tiered nested context OSS projects embedded within overarching OSS nonprofit organization methods separately shown utilitarianly describing state process however combining two perspectives barely explored paper undertake convergent approach considering one side OSS projects’ sociotechnical structure aspects institutional design goal use two perspectives synergistically identify strengthen complement also refine understanding OSS sustainability two methodological approaches Central approaches idea trajectories individual OSS projects understood convergent framework context provided similar projects already readily sustained abandoned leverage previously published dataset 47 traces representing OSS developer’s daytoday activities part Apache Foundation 
Incubator ASFI developers part projects decided undergo process incubation toward becoming part ASF benefiting services provides member projects dataset includes historical traces sustainability label graduation retirement Graduation indication successful incubation readiness nascent join ASF proper otherwise retired words importantly paper use ASFI outcomes graduation retirement measure sustainability assume graduated projects sustained longer retired ones although might always case footnote 1 key hurdles OSS projects demonstrate graduate 1 produce new releases 2 show ability attract new developers factors arguably key sustainability OSS projects utilize dataset study extent graduated retired projects differ point view sociotechnical structure institutional governance sociotechnical side construct monthly longitudinal social technical networks calculate several measures describing features networks institutional governance side implement classifier trained manual annotations institutional statements publicly accessible email communications among ASF participants compare findings sociotechnical institutional metrics projectlevel individuallevel activities Next perform exploratory data analyses deepdive case studies eventually look sociotechnical measures associate prevalence institutional statements evolutionary trajectories OSS incubation sustainability summary find effectively extract governance content email discussions form institutional statements fall 12 distinguishable topics Projects different graduation ie sustainability outcomes differ much governance discussion occurs within communities also sociotechnical structure Selfsustained projects ie graduated socially active community achieving within first 3 months incubation demonstrate active contributions documentation active communication policy guidance via institutional statements project’s sociotechnical structure temporally associated institutional communications occur depending role agent mentor 
committer contributor communicating institutional statements provide relevant context recently Yin et al 46 showed sociotechnical networks used effectively predict whether graduate retire ASF incubator work include institutional governance analysis focus closing gap studying relationship organizational structure ie sociotechnical system institutional governance peercontributed OSS projects study first attempt provide common framework simultaneous sociotechnical structure institutional analysis OSS projects order describe understand process affected gaining selfsustaining selfgoverning community eventually graduating ASF incubator hopeful refining convergent approach structural institutional analyses open new ways consider study emergent properties like sustainability
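As a rough illustration of the monthly longitudinal social networks mentioned in this introduction, the sketch below groups email reply records by calendar month and computes a few simple structural measures per month. It is only a sketch under stated assumptions: the record layout `(sent_on, sender, replied_to)`, the reply-based edge definition, and the three measures are hypothetical choices for illustration, not the schema or the metrics of the ASFI dataset.

```python
from collections import defaultdict
from datetime import date


def monthly_social_networks(emails):
    """Build one directed reply graph per calendar month.

    `emails` is an iterable of (sent_on, sender, replied_to) tuples.
    This record layout is an assumption made for the example.
    """
    networks = defaultdict(lambda: defaultdict(set))
    for sent_on, sender, replied_to in emails:
        if sender == replied_to:
            continue  # skip self-replies, they add no social tie
        month = (sent_on.year, sent_on.month)
        networks[month][sender].add(replied_to)
    return networks


def network_measures(graph):
    """Return node count, edge count, and mean out-degree of one graph."""
    nodes = set(graph)
    for targets in graph.values():
        nodes |= targets
    edges = sum(len(targets) for targets in graph.values())
    mean_out_degree = edges / len(nodes) if nodes else 0.0
    return {"nodes": len(nodes), "edges": edges, "mean_out_degree": mean_out_degree}


# Hypothetical records, invented for the example.
emails = [
    (date(2014, 12, 1), "alice", "bob"),
    (date(2014, 12, 5), "bob", "alice"),
    (date(2014, 12, 9), "carol", "alice"),
    (date(2015, 1, 2), "alice", "carol"),
    (date(2015, 1, 3), "alice", "alice"),
]
networks = monthly_social_networks(emails)
for month in sorted(networks):
    print(month, network_measures(networks[month]))
```

Keeping one graph per month makes it straightforward to line up a network measure against any other monthly series, such as a count of institutional statements, when looking for temporal associations.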
::::
2 THEORETICAL FRAMEWORK introduce theories behind two different viewpoints Institutional Analysis Development IAD SocialTechnical Systems STS well Contingency Theory serving glue institutional governance organizational structure OSS projects Footnote 1 For example could ASFI retired projects simply could adapt policies requirements set ASFI program yet continue ‘in wild’ perhaps aligned different OSS foundation 21 Institutional Theory Commons Governance OSS projects form digital commons precisely CommonsBased Peer Production CBPP 37 Legal scholar Yochai Benkler 2 introduced phrase CBPP describe situations people work collectively Internet organizational structure less hierarchical CBPP situations found variety settings eg collaborative writing open source hardware Benkler argues OSS ‘quintessential instance’ CBPP relatively long history study governance commons settings arguably led Nobel laureate Elinor Ostrom groundbreaking book Governing Commons 31 Ostrom’s Institutional Analysis Development IAD framework developed study governance institutions communities develop selfmanage natural resources Much research focuses governance sustainability natural resource settings eg water 6 marine 19 forest 16 settings key challenge natural resource commons settings individuals cannot easily excluded extracting resources pool available natural resources often little incentive contribute toward production maintenance resource – commonly referred ‘freeriders’ 29 forest fishery water settings freerider problem open access settings lead problem termed Hardin ‘Tragedy Commons’ 20 Ostrom famously pushed back Hardin’s analysis course lifetime work highlighted communities avoid tragedy hard work developing selfgoverning institutions OSS commons fundamentally different natural resources digital resources readily replicated subject degradation due overharvesting Therefore overappropriation problem potential tragedy commons OSS context Invariably answer yes lies heart idea OSS 
sustainability tragedy occurs freeriders insufficient human resources available continue develop maintain result fails achieve functionality use perhaps envisioned began becomes abandoned 36 Ostrom Hess 22 aptly describe tragedy ‘collective inaction’ Ostrom’s Nobel Prizewinning body work studying humans collectively act craft selfgoverning institutional arrangements effectively avoid tragedy natural resource settings Central effort introduction evolution Institutional Analysis Development IAD framework 32 Later IAD applied study digital knowledge commons 17 22 explicitly study selfgovernance OSS Schweik English undertook first study technical community institutional designs large number OSS projects 37 said prior work found selfgoverning OSS projects develop highly organized social technical structures 5 foundation support like ASF may additionally process organizing developers’ structured interactions second tier governance prescriptions required ASF Incubator refer individual institutional prescription Institutional Statement include rules norms define shared linguistic constraint opportunity prescribes permits advises actions outcomes actors individual corporate 10 39 Institutions understood operationally collections institutional statements create situations structured interaction collective action words configurations ISs affect way collective action organized context ASF OSS projects incubator ISs affect OSS social technical structure approaches institutional analysis becomes possible articulate relationships governance organizational technical variables example previous studies OSS often report code modularity key technical design attribute 28 30 Hissam et al 23 write ‘A wellmodularized system … allows contributors carve chunks work’ Open transparent verbal discussion OSS team members ASF officials eg mentors OSS ASF institutional design captured form institutional statements could predict effort contributors restructure project’s technical infrastructure 
modular inviting new contributors Using approaches institutional analysis extract institutional content open access email exchanges OSS contributors understand role communication governance information OSS sustainability 22 SocioTechnical System Theory SocioTechnical System STS comprises two entities 42 social system members continuously create share knowledge via various types individual interactions technical system members utilize technical hardware accomplish certain collective tasks STS theory considered combine views engineers social scientists intermediary entity sorts transfers institutional influence individuals 35 theory STS often referenced studying technical system able provide efficient reliable individual interactions 21 social subsystem becomes contingent interactions affects performance technical subsystem 15 Moreover sociotechnical system theory plays important role analyzing collective behavior OSS projects 3 OSS projects also studied network point view 12 24 GonzálezBarahona et al 18 proposed using technical networks nodes modules CVS repository edges indicate two modules share common committers study organization ASF projects sociotechnical systems organizations intervene longterm shortterm means Smith et al 40 propose two conceptual approaches ‘outside’ ‘inside’ ‘outside’ approaches represent sociotechnical managerial approach ‘Inside’ approaches reflexive role management coconstituting sociotechnical perspective Apache Foundation ASF community unique system outside influence regulations ASF board members inside governance managed selfgoverned individual Management Committees PMC 23 Contingency Theory Panaceas SelfGovernance Contingency theory notion one best way govern organization Instead decision organization must depend internal structure contingent upon external context eg stakeholder 43 risk 9 schedule 45 etc Joslin et al 25 find success associated methodologies eg processes tools methods etc adopted particular treat institutional 
statements abstraction methodologies OSS development organizational context changes time maintain consistency must adapt context accordingly Otherwise conflicts inefficiency occur 1 ie single organizational structure equally effective cases Similar arguments made field institutional analysis arguing panaceas standard blueprints guiding institutional design collective action problem 33 address conflicts caused incompatibilities project’s context previous work suggests thinking holistically Lehtonen et al 26 consider environment measurable spatiotemporal factors initiated processed adjusted finally terminated suggest factor opposite influence projects different context Joslin et al 25 consider governance part context concluding governance impact use effectiveness methodologies per contingency theory ASFI projects’ incubation developers mentors make intime decisions organizational structure contingent happening institutional rules governance vice versa
::::
3 RESEARCH QUESTIONS Reflecting previous discussion primary goal paper demonstrate evolution nascent state sustainable state studied effectively combining two different methodologies sociotechnical network analysis institutional analysis reported prior sections variety scholars utilized sociotechnical systems approach analyze collective behavior OSS projects also described institutional analysis useful understanding collective action OSS settings enable dualview sustainability first describe evaluate automated approach identifying institutional statements emails RQ1 institutional statements contained ASF Incubator email discussions effectively identify next two research questions assess utility convergent approach Institutional Analysis IAD STS frameworks case ASF incubation program two eventual outcomes either graduates ASF incubator becomes fullfledged ASFassociated retires without achieving goal context operationalize sustainable state one OSS graduates ASF incubator program rather retires ask RQ2 OSS evolution toward sustainability readily observable dual lenses institutional sociotechnical analysis temporal patterns differ Per institutional analysis theory strategies norms rules affect social technical organizations projects Governance organization per social theories must work handinhand make viable sociotechnical systems Illdesigned institutional arrangements would introduce inefficiencies system inefficiencies may amplify deviant behaviors irregular structures system influential links institutional design organizational structure fact bidirectional effect sustainable system illformed organizational structure may instigate new rules adjust improve structure improving efficiencies systems Thus hypothesize feedback governance organization observable specifically intensified governance discussion precede andor follow changes organizational structure reminder consider institutional statements indicators intensified discussions OSS selfgovernance new incubator 
requirements selfgovernance also consider sociotechnical network parameters indicators organizational structure Thus ask RQ3 periods increased Institutional Statements frequency followed changes organizational structure viceversa following section introduce methodologies approaching three research questions 4 DATA METHODS study difference projects graduate ASFI ie become sustainable paper use collection largescale data sets comprising Institutional Statements SocioTechnical variables extracted graduated retired projects Apache Foundation Incubator ASFI ASFI graduation indication nascent sufficiently sustainable join ASF proper2 otherwise retired combing Apache lists inspecting data speaking community members shown almost failures graduate sustainability failures rare occasions projects retired reasons sustainability eg good fit Apache model3 despite evidence projects generally sufficiently aware ASF model entering incubation according proposal4 sociotechnical networks collected historical trace data commits emails incubation outcomes 253 ASFI projects available archives commits emails 03292003 020120215 Among 204 projects already graduated 49 retired ASF incubator projects still incubation studied paper collected ASF incubator data ASF mailing list archives6 open access retrieved archive web page lists httpmailarchivesapacheorgmodmbox contain emails commits project’s ASF incubator entry date current URLs follow pattern projname listnameYYYYMMmbox example full URL dev mailing list Apache Accumulo Dec 2014 httpmailarchivesapacheorgmodmboxaccumulodev201412mbox mbox file contains month mailing list messages date specified URL dev stands ‘emails among developers’ Notably sites following pattern eg ‘ASFwide lists’ projectowned mailing lists list ‘incubatorapacheorg’ contains data one extract Institutional Statements combined email data set prior data set ASF policy documents given organization institutional statements characterized finite set semantic roles eg ASF Board 
Mentors contributors etc ASF interactions eg management committees requesting reports projects developers voting induct committers ASF specific contexts account representation training corpus included institutional statements ASF projectlevel email exchanges among participants also ASF policy documents supplementary set Institutional Statements included 328 policies compiled ASF policy documents eg Apache Cookbook PPMC Guide Incubator Policy etc economic analysis ASF Incubator’s policies 38 41 Preprocessing collected 1330003 emails across ASF Incubator projects 03292003 02012021 mailing lists ‘commit’ ‘dev’ ‘user’ etc find 128257 96 emails automatically generated broadcast continuous integration tools ie bots amount emails substantial carry less meaningful social institutional information list members rarely reply use regular expression rules identify eliminate corpus leaving us 1201746 emails 2ASF’s guide graduation httpsincubatorapacheorgguidesgraduationhtml 3ASF’s reason behind projects’ retirement httpsincubatorapacheorgprojectsretired 4ASF incubator projects’ proposal httpscwikiapacheorgconfluencedisplayINCUBATORProposals 5Our code data available Zenodo httpsdoiorg105281zenodo5908030 6During submission study ASF moved email archives Pony Mail system technical contribution side many projects especially ten years old used SVN utilized bot extensive mailings thus forming outliers dataset Thus eliminate commit messages automated bots eg ‘buildbot’ 253758 3654196 144 commit messages email messages issuesbug tracking bots eg ‘GitBox’ Moreover find developers contributed commits directly changinguploading massive nonsource code files eg data configuration image files Since committing noncoding files form outliers data set choose apply GitHub Linguist7 identify 731 collective programming language markup file extensions exclude noncoding commits eg creatingdeleting folders upload images etc
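The regular-expression bot filtering described in this preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' rule set: only the names 'buildbot' and 'GitBox' come from the text, while the `jenkins` and `noreply` patterns, the `from` field name, and the `is_bot` helper are hypothetical.

```python
import re

# Sender patterns treated as automated. 'buildbot' and 'GitBox' are named in
# the text; 'jenkins' and 'no-reply' are hypothetical additions for illustration.
BOT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"\bbuildbot\b", r"\bgitbox\b", r"\bjenkins\b", r"no-?reply")
]


def is_bot(sender: str) -> bool:
    """Return True if the sender string matches any bot pattern."""
    return any(pattern.search(sender) for pattern in BOT_PATTERNS)


# Toy messages; the 'from' key is an assumed field name.
emails = [
    {"from": "buildbot@apache.org", "body": "Build #12 failed"},
    {"from": "Jane Doe <jane@example.org>", "body": "Please vote on the release"},
    {"from": "GitBox <git@apache.org>", "body": "[GitHub] new pull request"},
]

# Keep only messages that do not look automated.
human_emails = [message for message in emails if not is_bot(message["from"])]
```

Matching on the sender alone keeps the rule set small. A real filter would likely also need subject-line or header rules, which the text does not specify.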
::::
42 Constructing Sociotechnical Networks Network science approaches prominent studying complex systems eg OSS projects 4 41 Since networks contain rich information elements ie nodes interactions ie edges study use sociotechnical networks anchor abstraction sociotechnical systems define projects’ sociotechnical structure using social emailbased technical codebased networks extracted emails mailing lists commits source files Similar approach Bird et al 3 form social network weighted directed graph incubation month communications developers directed edge developer B forms B replied A’s post thread emailed B directly weight edge represents communication frequency pair developers technical bipartite networks weighted bipartite graph formed similar way month include undirected edge developer source file F developer committed source file F month excluding SVN branch names weight edge represents committing frequency developer source file summary social networks weighted directed graphs form edges two developer nodes one developer replied referenced other’s email Technical networks undirected bipartite graphs developers forming one set nodes coding files forming link drawn developer contributed coding file use networkx package Python networkrelated implementation
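The monthly network construction described above can be sketched with plain edge-weight counters. This is an illustration under assumptions, not the authors' implementation: the input tuples, the function names, the choice to skip self-replies, and the edge direction (original poster to replier) are one reading of the text.

```python
from collections import Counter


def social_edges(replies):
    """Weighted directed edges for one month of email traffic.

    `replies` is an iterable of (replier, original_poster) pairs.
    An edge runs from the original poster to the replier (assumed direction),
    and its weight counts how often that pair communicated.
    """
    weights = Counter()
    for replier, poster in replies:
        if replier != poster:  # skipping self-replies is an assumption
            weights[(poster, replier)] += 1
    return weights


def technical_edges(commits):
    """Weighted bipartite edges for one month of commits.

    `commits` is an iterable of (developer, file_path) pairs; the weight
    counts how often the developer committed to that file.
    """
    weights = Counter()
    for developer, path in commits:
        weights[(developer, path)] += 1
    return weights


# Toy data for a single incubation month.
replies = [("bob", "alice"), ("bob", "alice"), ("carol", "bob"), ("alice", "alice")]
commits = [("alice", "src/a.py"), ("alice", "src/a.py"), ("bob", "src/b.py")]

social = social_edges(replies)
technical = technical_edges(commits)
```

The counters can be loaded into networkx if desired, for example with `DiGraph.add_weighted_edges_from((u, v, w) for (u, v), w in social.items())`.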
::::
43 Extracting Institutional Statements combined email exchange data set ASF policy document data finetune BERTbased 8 classifier automatic detection ISs see Sect 21 definition start handannotated small subset data ISs follows selecting random subset 313 email threads incubator lists two handcoders labeled sentences ‘IS’ ‘Not IS’ basis whether fit definition Institutional Statements resolved disagreements discussion recorded conclusions achieving peak outofsample agreement 075 080 sentence coded complete sentence fragments parenthetical mentions rules resources annotated positive resulted 6805 labeled sentences ie ‘IS’ ‘Not IS’ 273 labeled treated 328 policies ASF documents institutional statements since policy documents provide arguably formal institutional sample text compared norm email discussions Thus 601 Institutional Statements total across two coded datasets Institutional statements refer prescriptions shared constraints form norms rules strategies meant mobilize organize actors towards collective actions examples institutional statements provided Table 1 provide instances developer exchanges 7GitHub Linguist httpsgithubcomgithublinguist Table 1 Selected Examples Institutional Statements Found ASFI Email Discussions Date Institutional Statements Airflow 21 Dec 2016 … running Lab virtually restriction could however hand select people access environment also hold ultimate power remove access anyone … ODF 07 Dec 2011 Please vote releasing package Package vote open next 72 hours passes majority least three 1 ODF Toolkit PMC votes cast … Airflow 24 Feb 2017 … Next steps 1 start voting process IPMC mailinglist … might end changes stable … 2 positive voting IPMC finalisation rebrand RC Release encompass norms strategies institutional implications first example Airflow dated 12212016 involves situation certain developers find computational infrastructure provided ASF insufficient testing development requirements discuss setting alternate arrangements meet bottleneck 
Faced resource limitations one developer offers externally hosted cloud environment private resources selected excerpt quote individual establishing terms using alternate resources may offer members including access permission usage restrictions ASF projects conduct voting time time gather community consensus matters significance following example ASFI ODF dated 12072011 describes stepwise process expected followed members projectwide conduct vote decides approval release current candidate development final example Airflow 02242017 also pertains similar process developer discusses voting process implications especially terms subsequent steps need fulfilled ensure product release BERTbased Sequential Classifier natural speech emails ISs appear whole sentences parts sentences span multiple sentences also relatively sparse institutional quality dependent inherent interpretation well context Framing extraction sequential sentence classification task context selfcontained email segments instead labeling individual sentences helps take account contextual cues used sequential sentence classifier developed Cohan et al 8 leverages Bidirectional Encoder Representations Transformers BERT sequence classifier 11 classify sentences documents BERT employed generate representation sentence joint encoding neighboring sentences leveraging corresponding sentence separator token’s tuned embedding downstream applications sentence labeling extractive summarizing etc Thus classifier comprises BERT attentionbased joint encoding across sentences followed feedforward classifier predict sentence labels based separator vectors test performance classifier email extraction heldout 40 email threads 125 randomly split 313 handannotated email threads training performed combined set remaining 273 coded email threads ASF policy documents coded training respectively testing email data contained 231 respectively 42 institutional statements training testing email threads processed generate classifier 
inputs follows include neighboring context meeting length limits BERTbased text classifier email document sentences first chunked segments using sliding window 256 BERT subword wordpiece tokens resulted segments containing 6 contiguous sentences average comprising many full sentences could accommodated specified subword limit rolling window step 1 full sentence generated 3322 384 email segments training testing respectively policy documents policy sentences treated segment leading 328 additional segments training data several reasons support inclusion ASF policies augment positive training examples 1 terms semantic information institutional themes actions expected help language model learn sets apart Institutional themes regular development activities artifacts 2 ASF policies critical common pool resource management institutional operations describe roles responsibilities regulate actions often invoked email discussions8 3 institutional statements formal policies source texts inemail references drawing discuss ASF’s rules email perspective vital source text detecting statements occur email settings Hence apparently sourced formal bylaws beyond emails ASF policies indeed institutional statements relevant recurring developer conversations hence included training data finetuned classifier endtoend corresponding labels sentences segment training stage conducted batch size 16 learning rate 2e-5 6 epochs hyperparameters left defaults account class imbalance randomly oversampled training data segments least one sentence match number segments sentences 11 training predicting phase incorporate temporal information sequentiality captured segments extracting institutional statements model require exact time discussion testing prediction due variable length context preceding following sentence particular segment treat sentence email ‘positive’ classification detected least one segment performance model reported terms F1score precision recall respect positive ‘IS’ label 
detected sentences test email set Sect 51
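The segment generation and the "positive in at least one segment" rule described in this section can be sketched as below. This is a simplified illustration: whitespace token counts stand in for BERT wordpiece counts, the function names are hypothetical, and stopping once a window reaches the last sentence is an assumption the text does not confirm.

```python
def make_segments(sentences, max_tokens=256, count_tokens=None):
    """Slide over sentences one at a time, packing as many full sentences
    as fit within `max_tokens` into each segment.

    Returns a list of segments, each a list of sentence indices.
    """
    # Whitespace splitting is a stand-in for a wordpiece tokenizer.
    count_tokens = count_tokens or (lambda sentence: len(sentence.split()))
    lengths = [count_tokens(sentence) for sentence in sentences]
    segments = []
    for start in range(len(sentences)):
        end, total = start, 0
        # Always take the first sentence, then extend while the budget allows.
        while end < len(sentences) and (end == start or total + lengths[end] <= max_tokens):
            total += lengths[end]
            end += 1
        segments.append(list(range(start, end)))
        if end == len(sentences):  # window reached the end of the email
            break
    return segments


def aggregate(segments, segment_predictions, n_sentences):
    """Label a sentence 1 if any segment containing it predicted 1."""
    labels = [0] * n_sentences
    for indices, predictions in zip(segments, segment_predictions):
        for index, prediction in zip(indices, predictions):
            if prediction == 1:
                labels[index] = 1
    return labels


# Toy email with a 5-token budget instead of 256.
sentences = ["a b c", "d e", "f g h i", "j"]
segments = make_segments(sentences, max_tokens=5)

# Hypothetical per-segment classifier outputs, one label per sentence.
predictions = [[0, 1], [0], [0, 0]]
labels = aggregate(segments, predictions, len(sentences))
```

Because windows overlap, a sentence is usually scored several times with different neighbors. Taking the union of positive predictions favors recall over precision.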
::::
44 Topics Identification Institutional Statements purpose text modeling describe text given specific corpus provide numerically measurable relationships among texts eg topics identification measuring similarity etc use Latent Dirichlet Allocation LDA model get semantically meaningful topics better understand extracted institutional statements LDA unsupervised clustering approach 48 given set documents iteratively discovers relevant topics present based word distributions relative prevalence document used LDA identify prominent topic clusters occurring among institutional statements extracted email archives trained classifier see Sec 43 prior training coded email set preidentified topic labels used train LDA model use coherence score provided gensim package 44 optimize performance LDA model respect number topics higher coherence score represents better clustering performance select LDA model highest coherence score draw clusters However since LDA model automatically generate label cluster need assign label intuitively based domain knowledge ASF incubation process Naming topic cluster certainly carries risks interpretation however believe providing top keywords cluster reduces risk 8httpslistsapacheorgthreadzykybdvnk9cwx03pnrfl2br9nkcb7q3f Table 2 Summary statistics monthly sociotechnical variables counts institutional statements mentors committers contributors removal top 2 outliers numbers parentheses denote values removal inactive months ie absent emailscommits Prefix denotes features social network represents technical network Statistic Mean St Dev 25 75 snumnodes 1304 1696 1456 1504 4 7 17 22 sgraphdensity 030 030 027 022 012 014 040 040 savgclusteringcoef 022 029 023 021 0 011 039 043 sweightedmeandegree 1183 1556 1203 1281 4 743 16 1971 tgraphdensity 037 068 041 032 0 036 1 1 tnumdevnodes 118 221 159 160 0 1 2 3 tnumfilenodes 6099 11483 15394 19725 0 6 38 126 tnumfileperdev 2879 5357 8046 10423 0 4 20 545 numISmentor 1546 1599 2446 2501 0 1 20 20 
numIScommitter 934 1289 1936 2236 0 0 10 16 numIScontributor 1318 1636 2172 2442 0 2 18 21 45 Variables Interest draw institutional sociotechnical features variables basis framework’s predictions research questions sociotechnical variables pulled recent study forecasting sustainability OSS projects 46 showing high predictive power sociotechnical variables metrics aggregated monthly intervals start end incubation Longitudinal SocioTechnical Metrics network month constructed social technical networks calculate various organizational structure measures tables results prefix variable’s name indicates technical code network prefix variable’s name indicates social email network monthly social networks calculate weighted mean degree sweightedmeandegree sum nodes’ weighted degree divided number nodes average clustering coefficient savgclusteringcoef average ratio closed triangles open triangles graph density sgraphdensity technical bipartite networks month calculate number unique developer nodes tnumdevnodes number unique file nodes tnumfilenodes number files per developer tnumfileperdev graph density tgraphdensity Institutional Statements Frequency Metrics month added ISs emails month sent following three separate identifiable groups people ASF mentors numISmentor registered ASF committers numIScommitter contributors numIScontributor summarize statistics Table 2 noted earlier final group emails accounted sent bots Similar calendar entries may useful object study 46 Granger Causality Time series data allows identification relationships temporal variables go beyond association One approach Granger causality statistical test identifying quasicausality pairs temporal variables 13 Given two variables Xt Yt Granger causality test calculates pvalue Yt generated statistical model including Y’s prior values Yt1 Yt2 etc versus generated model addition Y’s prior values also includes X’s prior values Xt1 Xt2 Thus Granger causality simply compares base model involving complex model 
involving X calculates latter better fit data context Granger causality prior values called lagged values Xt1 lag 1 Xt2 lag 2 etc Granger causality test returns small enough pvalue eg 001 interpreted rejection null hypothesis thus establishing X Granger causes Granger causality test makes assumption timeseries applied stationary meaning trend seasonal effects necessary test stationarity running Granger causality use augmented DickeyFuller test 7 implemented adftest R package tseries 27 test stationarity institutional sociotechnical variables found stationary note distinction typically made scientific causality based controlled experiments Granger causality latter satisfying one precursor property multiple different properties causality Granger causality used word ‘causality’ always preceded ‘Granger’ also note test identify sign ie positive negative Granger causality simply says one exists use pgRangerTest function test Granger causality
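The model comparison behind the Granger test in Section 4.6 can be illustrated with a textbook single-series F-statistic. This is a sketch under assumptions, not the panel test the authors ran in R: the function names and synthetic data are hypothetical, and only the F-statistic is computed, not the p-value.

```python
import random


def _solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (m[r][n] - tail) / m[r][r]
    return solution


def _rss(rows, target):
    """Residual sum of squares of an ordinary least squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = _solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, target))


def granger_f(x, y, lags=2):
    """F-statistic for 'x Granger-causes y'.

    Base model: y_t on an intercept and its own lagged values.
    Full model: the base model plus the lagged values of x.
    """
    n = len(y)
    target = y[lags:]
    base = [[1.0] + [y[t - i] for i in range(1, lags + 1)] for t in range(lags, n)]
    full = [row + [x[t - i] for i in range(1, lags + 1)]
            for row, t in zip(base, range(lags, n))]
    rss_base, rss_full = _rss(base, target), _rss(full, target)
    dof = len(target) - (2 * lags + 1)
    return ((rss_base - rss_full) / lags) / (rss_full / dof)


# Synthetic series in which y follows x with a one-step delay.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * random.gauss(0, 1) for t in range(1, 200)]

f_xy = granger_f(x, y)  # x should help predict y
f_yx = granger_f(y, x)  # y should not help predict x
```

A large F-statistic means that adding X's lagged values substantially reduces the residual error. As the text notes, this indicates that a relationship exists but says nothing about its sign.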
::::
5 RESULTS section answer proposed research questions adopting dualview institutional analysis sociotechnical network perspectives first establish utility identification methodology 51 RQ1 institutional statements contained ASF Incubator discussions effectively identify content ISs Detecting Institutional Statements First focus ability BERTbased classifier identify institutional statements emails tested 857 held sentences 40 email threads test set see Sect 43 classifier achieved precision score 0667 recall score 0681 F1 score 0674 classifying Institutional Statements demonstrating able extract ISs developer email exchanges spite 51 ISs model validation overfitting sought perform stratified crossvalidation CV training data note data ideal CV study 1 limited data size 2 uneven distribution ISs across email threads 3 class imbalance nonIS sentences Eg due limited data size emails high density could find way train test split dramatically increase variance crossvalidation results ameliorate uniform stratification chunked 273 threads training data 442 subemails 20 contiguous sentences email threads mean length 22 sentences finetuned classifier endtoend corresponding labels sentences subemails subsequent input segment generation training pipeline otherwise kept unchanged obtained mean F1 score positive labeling sentences ISs 0603 high variability folds still persisting consider performance results satisfactory given small highly imbalanced data set 273 ISs 6805 sentences strong indications increasing positive examples training data set increase classifier’s performance course challenging ascertain classifier performance varies across projects due limited 9When finetuned classifier 273 training email threads ie without Institutional statements ASF policy documents F1 positive label found 20 lower Fig 1 Comparing graduated blue vs retired red projects along number Institutional Statements color online MannWhitney U test pval sufficiently small brackets suggesting significant 
differences means groups ran classifier full corpus 1201746 emails bot email removal across ASF incubator projects identified 313140 ISs emails average 0261 sentencelevel ISs per email Table 2 shows descriptive statistics sociotechnical variables number institutional statements mentors committers contributors calculated monthly intervals per find classifier’s errors also informative one set false positives participants described plans event occurring outside Apache relevant incubator kind process behavioral constraint typical ISs probably detected due semantic similarity rules guidelines make positive examples Conversely sentence ‘Send see reaction is’ missed despite appearing context contributor agreements miss likely due fact many recommendations made emails would considered institutional indicate particular individual individual rather institutional role Institutional Statements Roles Sustainability Status turn exploratory analysis demonstrate utility chosen features reasoning differences graduated retired projects Comparing graduated retired projects find significant difference number ISs example Figure 1a number sent mentors graduated projects statistically higher retired projects MannWhitney U test used testing difference means along fact graduated projects tend active socially overall compared retired projects ie email exchanges suggests mentors retired projects concerned projects’ community progressing thus email content rules guidance hand also plausible mentors engage socially less institutionally graduated projects may benefit projects numbers ISs sent committers contributors show similar patterns investigate longitudinally next section Topics Identification Institutional Statements use Latent Dirichlet Allocation LDA model study tokenlevel topics institutional statements optimizing LDA coherence score get optimal number topics 12 result enables us study words important topic present clusters top words topic Table 3 table reveals words well extracted 
institutional statements distinguished example first topic ie ‘Progress Report’ cluster words – ‘review’ ‘board’ relates ASF board ‘submit’ ‘report’ – Table 3 Topics Identified Institutional Statements ID Heuristic Topic Top Sample Words 1 Progress Report review require meeting board submit report 2 Collective Decision vote start proposal thread close day bind 3 Release release issue think fix branch policy 4 Community email send community behalf incubation talk 5 Report Review board report time meeting prepare reminder review 6 Mailing List Issues list mailing discussion question issue comment request 7 Documentation update wiki page website documentation link doc 8 Testing release source build test note artifact check 9 licensing Policy license file version copyright compliance 10 Routine Work committer help work way code 11 Mentorship podling report form mentor know sign month wish 12 Distribution work repository information file distribute commit associated important incubator rule requires projects report regular progress reports topic 7 words like ‘update’ ‘wiki’ ‘page’ ‘website’ ‘documentation’ emerge related requirements projects need address related website documentation requirements results advance institutional theory engineering domain arguably associated OSS sustainability suggest diving deeper connections socialtechnical system institutional analysis RQ1 Summary demonstrated institutional analysis methodologies capture differences graduated projects retired projects also showed effectively identify meaningful institutional statements common topics ASF incubator projects’ emails 52 RQ2 OSS evolution toward sustainability observable dual lenses institutional sociotechnical analysis temporal patterns differ section goal contrast graduated retired projects time space sociotechnical space Projects exit ASF incubator different times effect larger variance end incubation month Therefore restrict first 24 months projects 60 projects stayed within 24 months 
incubator Topic Evolution Time identifying words contribute various identified topics aggregating projects get volume measured number tokens contributing topic topic month Moreover since exist trends number subtract mean volume month separately graduated retired projects present Figure 2 xaxis number months incubation start yaxis indicates relative volume compared mean results MannWhitney U test show 10 12 topics significantly different means graduated retired projects pval 001 significant topic 9 licensing policy topic 12 distribution Additionally augmented DickeyFuller test suggests time 9 12 topics stationary ie temporal trends exist pval Fig 2 Topics Evolution graduated projects blue compared retired projects red xaxis indicates ith month incubation start yaxis represents relative volume topics MannWhitney U test found 10 12 topics significantly different means graduated retired projects pval 001 significant topic 9 licensing policy topic 12 distribution 01 except topic 2 collective decision topic 6 mailing lists topic 12 distribution testing results prompt us analyze difference projectlevel dynamics graduated retired projects observe increasing trend Topic 1 ‘Progress Report’ small seasonal effect suggesting projects learning ‘Apache Way’ actively discussing regular reporting time seasonal effect found significant Topic 5 ‘Report Review’ releases documentation testing connected number people participating regularly Retired projects average smaller graduated ones likely explanation differences Eg Figure 3f show graduated projects average source files retired projects Moreover find Topic 9 ‘license policy’ increasing trend earlier stages incubation eg months 17 makes sense shift one OSS license license required ASF important discussion projects would want address earlier contrary longitudinal pattern language related testing relatively rare beginning incubation suggests earlier stages incubation developers likely focused transition incubator perhaps less new 
code development testing hand transitions implemented fast manner testing discussions increasing rapidly incubation months 3 4 5 comparing graduated retired projects find Topic 10 ‘Routine work’ dominant topic types projects almost projects’ incubation ie remain high volume compared topics also find graduated projects tend active Topic 7 ‘Documentation’ Topic 3 ‘Project Release’ Interestingly hand mentorshiprelated ISs Topic 11 found active retired projects rather graduated projects One possible reason retired projects seek help mentors projects experiencing downturns issuing institutionwise statements Fig 3 averaged monthly ST variables graduated projects retired projects top measures bottom ST measures Shades indicate one st error away mean Month index 0 indicates incubation starting month color online Metric Evolution continue exploring evolution metrics time Looking mentors’ ISs shown Figure 3a see even beginning incubation mentors email greater number ISs projects eventually graduate compared ones eventually retire Next see number ISs mentor emails decline graduated projects retired projects month 5 suggesting ASFI mentor activity may decrease incubating projects work first steps incubation process visually identify increasing trend mentors around month 6 graduated 5 retired projects One possible reason fact mentors start helping projects experiencing difficulties downturns consistent ASF mentorship early stage incubation developers required make institutionalrelated decisions eg voting reports discussing ASF required licensing communityrelated issues kinds areas mentors come help SocioTechnical networks side shown Figure 3d first 6 months see graduated projects clear increasing trend number nodes social networks seems constant retired projects see slight decrease around month 10 month 12 types projects suggesting 10 months might good timing mentors intervenemotivate projects experiencing difficulties RQ2 Summary identify sociotechnical institutional 
signatures OSS evolution evidence differs graduated retired projects patterns even distinguished institutional heuristic topics institutional side graduated retired projects stable institutional topics first 3 months SocioTechnical network side graduated projects keep attracting community first 6 months retired projects unstable first 3 months 53 Case Study Association Institutional Governance Organizational Structure communicate concretely institutional sociotechnical dimensions interact within ASFI ecosystem showcase four diverse instances mutual interrelationship Case July 2011 HCatalog announced vote first Release Candidate RC first officially distributed version code project’s RC’s reflect whole ASF require approval foundation contributors given approval preparation first vote developers doublechecked installation process reported missing files features drove contributions code documentation eg release notes added reported missing contributors cast votes four people’s votes product approved proposal forwarded Apache Incubator leadership approval Case B December 2010 independent developer emailed Jena community share idea new feature asking proceed toward contributing query includes policy questions whether must obtain Individual Contributor License Agreement ICLA developer responds policy require ICLA type smaller contribution volunteer proposing developer guides volunteer established processes contributing code including mailing lists use submit feature patch Case C December 2016 developer Airflow community raised concerns integration testing infrastructure offered Apache citing unnecessary obstacles imposes volunteer contributors developer offers resources alternative caveats administer control access triggers discussion technical merits developer’s concerns policy discussion whether ASF permits use unofficial alternative infrastructure options Several developers conclude transition technically advisable institutionally sound community transitions 
alternative integration testing framework Case September 2015 Kalumet received proposal retired ASFI code languishing several months Contributors agreed upon retirement almost unanimously One contributor identifying features could use ASF ASFI projects suggests distributing key parts functionality active projects retirement vote ultimately followed developer effort distributing Kalumet’s assets cases illustrate institutionside policy discussion sociotechnicalside contributions interact developments artifact motivating policy discussions policy constraints steering developer effort longitudinal data institutional sociotechnical variables transition quantitative investigation relationships 54 RQ3 periods increased Institutional Statements frequency followed changes organizational structure viceversa previous RQs conducted exploratory qualitative studies extraction technology sociotechnical variable changes time section investigate temporal relationship measures institutional governance organizational structure OSS projects progress incubation trajectories predicted contingency theory hypothesis evolution developers mentors must make time Fig 4 Granger Causality Institutional Statements SocioTechnical networks bluepurple directed links indicate Granger causality STIS measures respectively green bidirectional link indicates twoway significant temporal relationship pval 001 Graduated projects seem fewer links ST variables variables suggesting unidirectional flow institutional sociotechnical changes successful projects color online decisions related organizational structure contingent ASFrequired institutional arrangements governance incubating projects change organizational structure based institutional norms rules discussed required potential new member ASF community vice versa organizational changes incite followup discussions institutional processes test RQ3 use pairwise Granger causality test lagged order 2 run test pairs institutional statements sociotechnical 
variables resulting 36 separate tests graduated projects set 36 retired ones adjust pvalues multiple hypothesis testing control false discovery rate using BenjaminiHochberg procedure 14 consider significant pval 0001 results summarized Figure 4 directed edge node X node indicates X Grangercauses ie change X precursor change Also discussed Section 46 Granger approach used complete test causality yield effect directionality although without effect size sign observe large number 31 72 total Grangercausal relationship measures institutional governance organizational structure 31 Grangercausal relationships 15 graduated set 16 retired set 8 relationships shared sets conclude significant Grangercausality changes institutional governance discussions organizational structure projects note 8 bidirectional relationships13 remaining 15 unidirectional 13Bidirectional causality indicates feedback sort Eg supply causes demand demand turn causes supply look graduated projects first Interestingly Figure 4 top shows number ISs mentors committers contributors effects technical network viceversa latter two Namely roles mentors committers contributors Grangercause changes technical networks ie developer file productivity tnumfilesperdev total number coding files changed tnumfilenodes variables Mentor additionally Grangercause changes number developers tnumdevnodes consistent ASFI expectations mentor’s emails provide advice engage people conversely drop engagement may elicit mentors’ engagement Mentors usually code presumably Grangercause appear feedback relationships technical network variables Notably absent however links mentor contributor ISs social network variables committer ISs bidirectionally Grangercause changes social network density perhaps simply indicates ISs committers induce substantial traffics social network turn gets committers discuss policy rules issues observed situations mentors likely interrupt projects projects become less active either socially 
technically 14 hand could also mentor reacting particular broader discussion among developers eg one monthly report Together tells story importance technical networks changes variable Surprisingly mentor changes consequential social network seemingly odds ASF communityfirst goals Thus may room enhance community engagement mentors viceversa RQ 3 Summary graduated retired projects inputs social network variables even though inputs technical network variables Retired projects exhibit less bidirectionality ST variables Finally interestingly among retired projects causal inputs contributor ISs social technical variables case graduated projects
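The multiple-testing step described for RQ3 (adjusting the p-values of the pairwise Granger tests with the Benjamini-Hochberg procedure, then keeping links below a significance threshold) can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the pair labels, the example p-values, and the `alpha` value are all hypothetical, and the Granger tests themselves are assumed to have been run elsewhere.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over this rank and all higher ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_links(p_by_pair, alpha):
    """Keep the (cause, effect) pairs whose adjusted p-value is below alpha."""
    pairs = list(p_by_pair)
    adjusted = benjamini_hochberg([p_by_pair[pair] for pair in pairs])
    return {pair: q for pair, q in zip(pairs, adjusted) if q < alpha}


if __name__ == "__main__":
    # Hypothetical raw p-values for four directed variable pairs.
    raw = {
        ("mentor_IS", "t_num_dev_nodes"): 0.005,
        ("committer_IS", "s_density"): 0.01,
        ("s_density", "committer_IS"): 0.03,
        ("contributor_IS", "s_density"): 0.04,
    }
    # alpha is an assumed threshold chosen only for this illustration.
    for pair, q in significant_links(raw, alpha=0.03).items():
        print(pair, round(q, 4))
```

A directed pair counts as bidirectional in this sketch when both `(X, Y)` and `(Y, X)` survive the filter; with the hypothetical values above only the two smallest raw p-values remain.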
::::
6 DISCUSSION study use individual institutional prescriptions Institutional Statements SocioTechnical ST network features reason OSS sustainability OSS projects form digital public goods like public goods eg water forest marine etc subject degradation due overharvesting eg form freeriders take advantage OSS contribute required resources development maintenance Ostrom’s work illuminated fact many communities avoid dreaded ‘Tragedy Commons’ collective action problems hard work designing implementing selfgoverning institutions context ASF nonprofit foundation incubation program encourages nascent OSS projects follow ASFguided operationallevel rules policies around selfgovernance OSS projects join ASF incubator trade freedom unlimited institutional choice exchange incubator resources increase chances enduring collective action problems characterize OSS development 36 becoming sustainable long run found ASF Incubator amount institutional statements levels sociotechnical variables associated projects graduation outcome suggesting measures institutional governance organizational structure signal information sustainability 14 An example mentor interrupting warble httpslistsapacheorgthreadx6h8pzhmfwtyy354ml1xm9sylq4y5r7l particular RQ1 MannWhitney U test shows graduated projects significantly ISs three types participants committers contributors mentors retired projects presumably indicative active intentional selfgovernance theoretical empirical work commons governance well documented getting selfgoverning institutions ‘right’ hard work takes time effort 32 consistent narrative participants graduated projects debate work harder project’s operationallevel institutional design Recent work shown ASFI graduate retired projects sufficiently different sociotechnical structures 46 graduation predicted early development 85 accuracy results RQ2 show first 3 months incubation developer nodes social networks graduated projects increase higher rate means increase 101 171 73 91 graduated 
retired projects respectively suggesting graduated projects able keep developers contributing actively recruit new members hand first 3 months also found amount Institutional Statements mentors increases graduated projects decreases retired projects 197 227 vs 226 146 graduated retired projects respectively suggesting initial help project’s mentors importance study effects ISs performed deepdive topics found topics institutionalrelevance graduated projects differ retired projects specifically find topic documentation topic 7 graduated projects prevalent retired projects hand found topics mentorship topic 11 retired projects significantly higher retired projects signaling retired projects might struggling incubation Combined fact developer nodes social technical networks together findings suggest graduated projects capacity energy attend noncoding issues like documentation retired projects However even among graduated projects still diversity institutional statements Thus predicted contingency theory well Ostrom’s theory institutional diversity 33 onesizefitsall solution successful trajectory toward sustainability likely Instead future work focus gathering larger corpora data able resolve individual smallgroup differences sustainable projects framework allowed us combine STS structures study together time RQ3 found twoway causal correlations sociotechnical variables ISs time arguably indicating OSS sociotechnical structure governance structure evolve together coupled system addition methods point way study possible interventions underperforming projects Specifically finding retired projects bidirectional links committer’s ISs three features technical networks ie tnumdevnodes tnumfileperdev tnumfilenodes suggest increase committer’s interleaved changes features sociotechnical networks design implications addition current categories mailing lists ASF incubator eg ‘commit’ ‘dev’ ‘user’ etc benefit creating separate mailing list institutionallyrelated 
discussions help committers also mentors contributors participate faster discussions timely manner could made useful using technology selfmonitoring participants could monitor project’s digital traces discussions order quickly react episodic events tools already created sociotechnical networks ASFI projects 34 could extended include ISs well tools help identify entry points targets interventions whereby underperforming projects could leaned internally externally via rules advice adjust trajectories Contributions Institutional Analysis Sociotechnical System Theory Making full circle findings also point ways theories started refined extended find Sect 54 evidence features OSS projects’ sociotechnical systems cochange together amount Institutional Statements cochange relationships sparse evidence cochange implies OSS projects’ structure governance form loosely coupled system controllability point view dynamically coupled system refines Smith et al’s mechanistic binary notion ‘inside’ ‘outside’ interventions 40 findings also suggest OSS projects adopting additional rules norms eg joining ASFI worth loss freedoms Institutional Statements Sect 52 53 54 seem serve organize project’s actions discussions predicted Siddiki et al 39 Crawford Ostrom 10 Thus findings tie potentially extend Institutional Analysis Design IAD view suggesting feedback sociotechnical system structure institutional governance analysis sufficiently direct significant considered unitary studies practically institutional statement predictor although still work progress effectively predict atomic elements selfgovernance used tool provide quantitative data applying institutional analysis design IAD generally eg OSS projects outside ASF selfgoverned systems public documents discussion forums
::::
7 THREATS VALIDITY First data hundreds projects ASF incubator projects Thus generalizing implications beyond ASF even beyond ASF Incubator projects carries potential risks example OSS projects incubator programs may mentors Expanding dataset beyond ASF incubator eg additional projects OSS incubator programs could lower risk Second consider communication channels ASF mailing lists eg inperson meetings website documentation private emails etc However ASF mandates use public mailing lists discussions policy ensures particularly low risk missing institutional sociotechnical information Annotations Institutional Statements biased individual annotators gave annotators sufficient training reference documentation lowers risk expect performance classifier increase size training set better incorporate contextual information plan distinguish types ISs future work OSS projects developers may use different emails aliases turn complicates identification distinct developers assigning insisting using unique apacheorg domain email address reduces risks Finally noted Sect 4 likely cases OSS projects retired ASF Incubator program still go become sustained time instances OSS projects entering ASFI may simply good fit ASF culture institutional requirements policies ultimately retire result paper explicitly use graduation measure sustainability given ultimate goal ASFI – create projects indeed sustainable want recognize point retired projects still could become sustainable following different path association ASF 15 Apache Way httptheapachewaycomonlist 16 ASF committer emails httpsinfraapacheorgcommitteremailhtml 8 CONCLUSION Understanding OSS projects cannot meet expectations nonprofit foundations may help others improve individual practice organizational management institutional structure importantly understanding relationship institutional design sociotechnical aspects OSS bring insights potential sustainability projects showed quantitative network science features capture 
organizational structure developers collaborate communicate artifacts create Combining two perspectives sociotechnical measures institutional analysis leverage unique affordances Apache Foundation’s OSS Incubator extend modeling OSS sustainability leveraging novel longitudinal dataset vast text log corpus extrinsic labels success failure sustainability ACKNOWLEDGEMENTS authors greatly thank reviewers constructive comments material based upon work supported National Science Foundation GCR grant 2020751 2020900 REFERENCES 1 Barclay W Interdepartmental conflict organizational buying impact organizational context Journal Marketing Research 28 2 1991 145–159 2 Benkler wealth networks Yale University Press 2008 3 Bird C Gourley Devanbu P Gertz Swaminathan Mining email social networks Proceedings 2006 international workshop Mining repositories 2006 pp 137–143 4 Bird C Nagappan N Gall H Murphy B Devanbu P Putting together Using sociotechnical networks predict failures 2009 20th International Symposium Reliability Engineering 2009 IEEE pp 109–119 5 Bird C Pattison D’Souza R Filkov V Devanbu P Latent social structure open source projects Proceedings 16th ACM SIGSOFT International Symposium Foundations engineering 2008 pp 24–35 6 Blomquist W et al Dividing waters governing groundwater Southern California ICS Press Institute Contemporary Studies 1992 7 Cheung YW Lai K Lag order critical values augmented dickey–fuller test Journal Business Economic Statistics 13 3 1995 277–280 8 Cohan Beltagy King Dalvi B Weld Pretrained language models sequential sentence classification Proceedings 2019 Conference Empirical Methods Natural Language Processing 9th International Joint Conference Natural Language Processing Hong Kong China 2019 Association Computing Machinery p 3693–3699 9 CookeDavies “real” success factors projects International journal management 20 3 2002 185–190 10 Crawford Ostrom E grammar institutions American Political Science Review 89 3 1995 582–600 11 Devlin J Chang MW 
Lee K Toutanova K Bert Pretraining deep bidirectional transformers language understanding arXiv preprint arXiv181004805 2018 12 Ducheneaut N Socialization open source community sociotechnical analysis Computer Supported Cooperative Work CSCW 14 4 2005 323–368 13 Dumitrescu EI Hurlin C Testing granger noncausality heterogeneous panels Economic modelling 29 4 2012 1450–1460 14 Ferreira J Zwinderman benjamini–hochberg method Annals Statistics 34 4 2006 1827–1849 15 Fischer G Herrmann Sociotechnical systems metadesign perspective International Journal Sociotechnology Knowledge Development IJSKD 3 1 2011 1–33 16 Fleischman F Loken B GarciaLopez G VillamayorTomas Evaluating utility commonpool resource theory understanding forest governance outcomes Indonesia 1965 2012 International Journal Commons 8 2 2014 17 Frischmann B Madison Strandburg K Governing Knowledge Commons Oxford University Press 2014 18 GonzálezBarahona J Lopez L Robles G Community structure modules apache Proceedings 4th International Workshop Open Source Engineering 2004 IET pp 44–48 19 Gruby R L Basurto X Multilevel governance large marine commons politics polycentricity palau’s protected area network Environmental science policy 33 2013 260–272 20 Hardin G tragedy commons population problem technical solution requires fundamental extension morality science 162 3859 1968 1243–1248 21 Herrmann Hoffmann Kunau G Loser KU modelling method development groupware applications sociotechnical systems Behaviour Information Technology 23 2 2004 119–135 22 Hess C Ostrom E Understanding knowledge commons theory practice JSTOR 2007 23 Hissam Weinstock C B Plakosh Asundi J Perspectives open source Tech rep Carnegie Mellon Univ Pittsburgh PA Engineering Inst 2001 24 Joblin Apel successful failed projects differ sociotechnical analysis ACM Trans Softw Eng Methodol dec 2021 25 Joslin R Müller R impact methodologies success different environments International Journal Managing Projects Business 2016 26 Lehtonen P 
Martinsuo Three ways fail management role management methodology Perspectives 28 1 2006 6–11 27 Lopez J H power adf test Economics Letters 57 1 1997 5–10 28 Narduzzo Rossi role modularity freeopen source development FreeOpen source development Igi Global 2005 pp 84–102 29 Olson logic collective action 1965 Contemporary Sociological Theory 124 2012 30 O’Reilly Lessons opensource development Communications ACM 42 4 1999 32–37 31 Ostrom E Governing commons evolution institutions collective action Cambridge university press 1990 32 Ostrom E Understanding institutional diversity Princeton university press 2009 33 Ostrom E Janssen Andereis J Going beyond panaceas Proceedings National Academy Sciences 104 39 2007 15176–15178 34 Ramchandran Yin L FilKov V Exploring apache incubator trajectories apex 2022 IEEEACM 19th International Conference Mining Repositories MSR 2022 IEEE p Accepted 35 Ropohl G Philosophy sociotechnical systems Techné Research Philosophy Technology 4 3 1999 186–194 36 Schweik C English R Tragedy foss commons investigating institutional designs freelibre open source projects First Monday 2007 37 Schweik C English R C Internet success study opensource commons MIT Press 2012 38 Sen Atkisson C Schweik C Cui bono open source incubator policies procedures benefit projects incubator Available SSRN 2021 39 Siddiki Heikkila Weible C PachecoVega R Carter Curley C Deslatte Bennett Institutional analysis institutional grammar Policy Studies Journal 2019 40 Smith Stirling Moving outside inside objectification reflexivity governance sociotechnical systems Journal Environmental Policy Planning 9 34 2007 351–373 41 Surian Tian Lo Cheng H Lim EP Predicting outcome leveraging sociotechnical network patterns 2013 17th European Conference Maintenance Reengineering 2013 IEEE pp 47–56 42 Trist E evolution sociotechnical systems conceptual framework action research program Ontario Ministry Labour 1981 43 Turner J R Müller R Communication cooperation projects owner principal 
manager agent European management journal 22 3 2004 327–336 44 Řehůřek R Sojka P et al Gensim—statistical semantics python Retrieved genism org 2011 45 Wearn Stanbury study reality management Wg morris gh hough john wiley uk 1987 e 2995 isbn 0471 95513 pp 295 International Journal Management 7 1 1989 58 46 Yin L Chen Z Xuan Q FilKov V Sustainability forecasting apache incubator projects Proceedings 29th ACM Joint Meeting European Engineering Conference Symposium Foundations Engineering New York NY USA 2021 Association Computing Machinery p 1056–1067 47 Yin L Zhang Z Xuan Q FilKov V Apache foundation incubator sustainability dataset 2021 IEEEACM 18th International Conference Mining Repositories MSR 2021 IEEE pp 595–599 48 Yu H Yang J direct lda algorithm highdimensional data—with application face recognition Pattern recognition 34 10 2001 2067–2070 Received July 2021 revised November 2021 accepted April 2022
::::
Labor Maintaining Scaling Free OpenSource Projects R STUART GEIGER∗ University California San Diego Department Communication Halicioglu Data Science Institute USA DOROTHY HOWARD University California San Diego Department Communication Feminist Labor Lab USA LILLY IRANI University California San Diego Department Communication Design Lab Feminist Labor Lab USA Free andor opensource FOSS projects play major dominant role society constituting critical digital infrastructure relied upon companies academics nonprofits activists FOSS become larger established investigate labor maintaining sustaining projects various scales report findings interviewbased study contributors maintainers working wide range FOSS projects Maintainers FOSS projects maintain code traditional engineering understanding term fixing bugs patching security vulnerabilities updating dependencies FOSS maintainers also perform complex ofteninvisible interpersonal organizational work keep projects operating active communities users contributors particularly focus labor maintaining sustaining changes projects grow scale across many dimensions understanding FOSS much maintaining communal maintaining code discuss broadly applicable considerations peer production communities sociotechnical systems broadly CCS Concepts • Social professional topics → Computer supported cooperative work Sociotechnical systems Computing profession people management • engineering → Open source model Additional Key Words Phrases open source free maintenance infrastructure labor ACM Reference Format R Stuart Geiger Dorothy Howard Lilly Irani 2021 Labor Maintaining Scaling Free OpenSource Projects Proc ACM HumComput Interact 5 CSCW1 Article 175 April 2021 28 pages httpsdoiorg1011453449249
::::
1 INTRODUCTION Free andor opensource FOSS refers broad set working processes social movements organizations formed around production distribution ∗The majority work conducted Geiger affiliated Berkeley Institute Data Science University California Berkeley Authors’ addresses R Stuart Geiger University California San Diego Department Communication Halicioglu Data Science Institute 9500 Gilman Dr La Jolla California USA 92093 Dorothy Howard University California San Diego Department Communication Feminist Labor Lab 9500 Gilman Dr La Jolla California USA 92093 Lilly Irani University California San Diego Department Communication Design Lab Feminist Labor Lab 9500 Gilman Dr La Jolla California USA 92093 work licensed Creative Commons Attribution International 40 License © 2021 Copyright held ownerauthors 2573014220214ART175 httpsdoiorg1011453449249 complex contested history going back decades movements extensively studied many disciplinary perspectives well subject substantial commentary members across many factions eg 49 102 125 projects publicly release source code rather various commercial models firms require payment use andor restrict ability users modify Practitioners often describe FOSS ‘free’ two ways free available cost called “free beer” free source code available licensed users modify called “free speech” 78 However important ask work maintaining projects fits paradigms freeness FOSS similar peer production projects require labor material resources 41 84 124 prior decades many early FOSS projects began hobbyist efforts build alternatives 
commercial proprietary tech industry Many early contributors volunteered spare time negotiated employer let spend work time FOSS 8 27 29 64 76 94 Many FOSS projects become commercial part tech sector past two decades 10 47 Today FOSS grown many projects become dominant product sector extensively relied upon commercial firms eg Linux Apache Python Many successful FOSS projects userfacing applications infrastructure relied upon companies inside outside industry operating systems programming languages libraries servers web components 2020 survey 950 enterprisesized companies across sectors reported 95 said open source important infrastructure strategy 77 would adopt open source next year 103 FOSS also relied upon government entities nonprofits activist movements free cost ability modify crucial FOSS projects become critically embedded organizations economies major shift questions “sustainability” within many projects especially began volunteers’ side projects term used call attention whether projects keep developing maintaining others rely must maintained continue useful users Nadia Eghbal’s influential report topic opens Heartbleed bug OpenSSL FOSS library used twothirds websites handle encryption leading worst security vulnerability web’s history 41 Despite critical centrality OpenSSL project’s maintainers long struggled find time money work Eghbal quotes lead maintainer’s public post “The mystery overworked volunteers missed bug mystery hasn’t happened often” 88 Eghbal’s recent work suggests small numbers individual developers often bulk work many FOSS projects somewhat transactional relationships contributors users contrast predominant narratives present FOSS composed large collaborationdriven communities 42 research question asks work maintaining projects changes FOSS projects become key dependencies others including wellresourced organizations tech sector conducted 37 qualitative interviews current former FOSS contributors maintainers focus projects began 
purelyvolunteer efforts since become widely relied upon infrastructure organizations beyond find projects scale across kinds dimensions — number users contributors maintainers kinds users contributors maintainers size complexity features codebase interdependence ecosystem — work maintaining meaning maintainer dramatically change Scale brings new tasks changes nature existing tasks example projects users providing technical support users exciting opportunity grow community around Yet largescale millions users become overwhelming flood demands requires establishing specific rules roles norms developing processes triaging user support requests particular find ostensiblytechnical work engineering takes organizational communicative even competitive aspects larger scales wellestablished theme sociotechnical nature computersupported cooperative work engineering work general However study details activities experiences maintenance work change projects grow develop become embedded within broader networks people code money institutions including corporations governments academia nonprofits FOSS projects conclude discussing “scalar labor” managing scales deferral labor scaling create consequences projects line – problem term “scalar debt” Maintenance tasks pile requiring massive amount often lessvisible work build organizational capacity keep onslaught demands Finally discuss FOSS maintainers popular projects also sometimes face additional work become hypervisible even microcelebrities contrary infrastructural maintenance typically described “behind scenes” “invisible” work
::::
2 BACKGROUND LITERATURE 21 Trajectories FOSS research study took place 20192020 era FOSS different prior decades foundational works FOSS proliferated classic accounts eg 24 suggested FOSS composed ideologicallydriven collaborative voluntary communities producing public goods intended supplant proprietary alternatives Past work documented FOSS’s connections early internet engineers makers 16 universities academic research opposition corporatefriendly copyright patent law 25 76 Academics practitioners discussed FOSS social movement splintered “open source” rising competing movement free one transformed original anticommercial values free 77 Researchers focused FOSS contributors collaborate organize FOSS projects “open development model” 53 studied “peer production” 6 communities like Wikipedia citizen science Unlike firms employees directed managers projects often rely selfdirected contributions individuals individuals working private industry voluntary FOSS contributor communities Popular accounts often marvel relatively high quality products produced ostensibly ‘anarchistic’ approach eg 123 see also critiques 114 124 127 However past work repeatedly shown made possible though lessvisible coordination articulation conflict resolution work done review assemble align others’ contributions 18 31 46 69 83 124 leadership FOSS work involves predicting becomes leader 48 59 leaders’ motivations 82 although past work discussed roles leaders often resolve conflicts mentor newcomers set rules organize tasks 4 30 74 literature review FOSS within beyond ComputerSupported Cooperative Work CSCW Germonprez et al 57 note studies CSCW HumanComputer Interaction HCI either “input” topics like developer motivations “process” topics collaboration governance also detail massive transformation FOSS past decades Early work often found contributors purely volunteers working looselyformalized quasiorganizations operating adhoc rules cite recent findings showing rise paid roles 109 corporate 
involvement 10 47 56 100 formal organizational structures 20 45 — nonprofit foundations based fundraising revenuegenerating business models — increasingly norm especially popular longstanding projects Finally Germonprez et al note FOSS projects often studied singleproject case studies — 32 also find 2012 review However discuss contemporary FOSS projects exist “complex supply chains” 57 p 9 multiple cascading interdependencies FOSS projects mesolevel ecosystems supply chains FOSS projects FOSS projects complex relationships tech industry governments academia nonprofits Ekbia Nardi 43 discuss wide dependency industrial profit volunteer undercompensated labor part set practices call “heteromation” argue global economy relies extensively digitallymanaged forms un undercompensated labor usergenerated content microtasking platforms FOSS variously called crowdsourcing 15 cognitive surplus 115 peer production “wealth networks” 6 Ekbia Nardi argue understood heteromation Computational industries benefit labor subsidized institutions support people working cheaply free ranging welfare states universities family charity issues money financial sustainability corporate relationships long studied FOSS less studied lived experiences FOSS maintainers projects become enmeshed within institutions significant access financial social cultural capital 22 Infrastructural maintenance labor Scholars long drawn attention work maintaining technologies often ignored neglected scholarship notes importance maintenance shaping forms functions technologies beyond moments invention 39 111 Scholars ComputerSupported Cooperative Work long emphasized less visible ostensibly ‘nontechnical’ labor crucial functioning computational infrastructures especially “human infrastructure” scientific cyberinfrastructure 80 work often underbudgeted unrecognized undervalued necessary make systems ‘seamless’ 107 users Eghbal draws metaphors public infrastructure road public works suggests FOSS infrastructure needs 
considered lens 41 Following Jackson’s call take maintenance repair essence technology 73 empirical studies practices become common many areas One theme tasks roles construed “maintenance” “repair” often involve responsibilities beyond technological particularly maintaining repairing social institutional relationships 37 66 71 72 110 Infrastructure maintenance work also often discussed alongside work sometimes invisibilized gendered classed assumptions nature importance 44 51 70 117 example Orr’s ethnography photocopy repair technicians showed serve customer’s primary point contact photocopy company managing relationship designated account representatives 99 Suchman’s analyses expert systems emerging Xerox PARC also found computer engineers underestimated complexity secretaries 121 However literature examined cases maintenance work becomes highly visible even constitutive leadership largescale FOSS projects often impulse always make work visible literature show making work visible come regimes surveillance micromanaging selfcensorship 14 97 120 23 many meanings “scale” interested work maintaining FOSS projects changes projects scale exactly scale FOSS common refer organizations communities platforms “small scale” “large scale” often compresses many aspects single term CSCW HCI scale often synonym number users classic work designed systems intended operate predefined range simultaneous users eg 61 anthropology Carr Lempert 122 argue people use terms like “large scale” “at scale” often intuiting kind synthetic construct combines multiple related distinct measures case like number users number user organizations kinds users interdependence ecosystem number contributors Recent work CSCW similarly identified multivalent understandings scale include Lee Paine’s “model coordinated action” 81 identify seven different dimensions along softwaremediated organizations range number participants number communities practice physicalgeographical distribution nascence routines 
planned permanence rate turnover level asynchronicity interactions findings relate specific dimensions emerged relevant FOSS maintainers well insight shifts dimensions occur independently shifts one dimension also impact depend others Studies scientific cyberinfrastructures widely demonstrate theme scaling also includes integrating interdependent projects standards 3 – often called “embeddedness” 9 40 117 — supporting wider range use cases longer periods time 75 studies shown various “tensions across scales” 105 emerge projects grow
::::
3 METHODS METHODOLOGY 31 Research methods qualitative research primarily based semistructured interviews 37 maintainers FOSS projects 20192020 Interviews lasted median 55 minutes covered range openended topics including toplevel questions interviewee’s personal history FOSS kinds work roles participation changed time governance decisionmaking funding financial sustainability motivationdemotivation burnout careers technologies platforms impact participation worklife balance common nonrandom sampling qualitative research sought strategically sample diversity across many dimensions 95 rather seek kind random uniform representative sample common survey research specifically choose recruit interview broad set maintainers varied across geography national origin age employment status sector gender made efforts sample demographic diversity reflecting structural problems gender gap among FOSS contributors 38 present challenges recruiting diverse sample originally ask interviewees demographics sent postinterview survey 85 completed gender 19 identified womenfemale 81 menmale 0 nonbinary raceethnicity allowed multiple selections 72 identified whiteCaucasian 66 exclusively 16 HispanicLatinx 13 IndianSouth Asian 6 East Southeast Asian 3 BlackAfrican 3 Interviewees born 14 different countries 5 continents US common 47 Interviewees currently reside 12 different countries 5 continents US common 56 Ages spanned 25 64 years old 53 aged 3039 years old also sampled diversity recruit maintainers different kinds FOSS projects projects whose maintainers interviewed range single developermaintainer hundreds contributors similar variance terms number users existed decades complex governance structures roles norms others relatively new projects represent range topical areas locations within technical stack including operating systems programming languages libraries development environments web frameworks servers databases packaging data analytics research computing devops electronic arts 
media focus ensuring interview pool included maintainers projects across dimensions scale follows existing work importance scale component ethnographic work CSCW infrastructures 101 104 broader globallydistributed phenomena 86 recruitment methods involved utilizing existing personal networks attending FOSS conferences events call participation shared Twitter cold emailing FOSS maintainers also conducted snowball sampling asking interviewees suggest potential interviewees us help sampling diversity utilized techniques similar “trace ethnography” 55 recruitment methods identify potential maintainers recruit based available user data social coding platforms including GitHub see 34 36 126 identified core contributors GitHub timelines recent commits release notes interviewees generally selfidentified current former maintainers terms interviewed maintainers hold various roles within wide range FOSS projects particularly focused projects become relied upon infrastructure others either began largely based volunteer labor However asked maintainers projects worked encountered projects beyond 32 Methodology interpretive approaches interpretive approach grounded symbolic interactionism focuses actors organize interactions world one another categories emerge social worlds 12 wider discourses 22 transcribed inductively analyzed interviews themes using grounded theory approach 23 119 involves multistage process coding statements iterativelygenerated themes themes identify social processes common across organizational site generalizing across specific local experiences remaining bound particularities cultures work processes examination conducted interviews many participants reflected practices practices others including broader theories political economy history FOSS 67 move research findings reflect upon participants may related information researchers based understood study research might affect reflexivity “recursive” 76 pattern FOSS communities 68 memberchecking form sharing 
transcripts findings gave participants opportunities give feedback engage interpretations relationship communities neither purely outsiders insiders funded nonprofit foundations also directly fund FOSS projects worked former current FOSS contributors maintainers regularly attended various FOSSrelated meetups events authors ethnographers embedded larger ongoing research projects area — either FOSS projects organizations rely andor contribute FOSS projects data present paper centered set 37 interviews although broader ethnographic experiences informed kinds questions asked interpreted interviewees’ responses 4 FINDINGS 41 FOSS maintainer change projects scale “Maintainer” meaning within FOSS rarely technical domains FOSS maintainers perform upkeep repair work term also usually connotes leadership role 42 also discusses leadership role often enacted access permissions project’s code repositories becoming maintainer typically involves given technical capacity make changes project’s code includes capacity approve reject proposed changes called “pull requests” GitHubstyle platforms nonmaintainer contributors well capacity moderate issue trackers anyone report bug request feature Beyond use access permissions formalize maintainer status role maintainer use term varied widely 42 also finds Like firms organizations social movements FOSS projects range widely size scale complexity popularity interdependence makes difficult unwise make overarching generalizations instead illustrate maintainership differs across different kinds projects particularly focusing labor maintaining FOSS changes projects develop grow scale projects encountered fewer contributors usually one individual singular leadership role leads majority work common term interviewees used refer individual “the maintainer” although interviewees corporate academic institutions also noted sometimes code switch use titles like “project lead” “project manager” depending environment projects larger number contributors 
multiple maintainers often shared responsibilities sometimes formal divisions labor projects leadership aspect sometimes called “core maintainer” analogous steering committee decisions made consensus voting 112 However even projects many maintainers often primary leadership role typically held original creator someone took responsibility creator departed One common term “benevolent dictator life” “BDFL” although one maintainer role interviewed described “the person tends feel ownership things go wrong” Finally projects encountered used “maintainer” describe roles instead used “core developer” signify dual upkeepleadership role 42 also discusses following sections identify kinds tasks involved FOSS maintainer positions change grows interdependencies complexity users become apparent necessary certain scales organizing events coordinating FOSS projects tasks occur scales become quite different larger scales providing support users fixing bugs developing new features intend comprehensive survey maintenance work FOSS focus relationship scale labor 42 Maintaining Users User support asked interviewees work maintaining projects user support major topic maintainer new users first user asks help raises issue sign validation success FOSS projects meant used common attitude maintainers smaller projects whether user’s issue due misunderstanding bug maintainer learn something Maintainers told us users ask help become contributors comaintainers alternatively donors patrons Yet interviewees maintained large wellknown projects many users identified user support overwhelming neverending chore particularly projects use GitHubstyle collaboration platforms One interviewee stated “user support something regularly evenings week weekends actually takes large chunk free time” maintainers user support around clock reality position Requests user support come many channels including messages sent private emails social media accounts users often seek help QA sites like StackOverflow project’s 
maintainers generally obligated present spaces although 134 GitHub allowed user web account open issue FOSS hosted site number open issues potentially numbering thousands prominently displayed project’s landing page creating reputational pressure Managing triaging issue queue often identified interviewees major task maintainers largescale projects although variation level obligation stated maintainers may obligation actually fulfill requests issue obligation respond acknowledge issue timely manner – sometimes described within 24 48 hours though others said week acceptable Eghbal suggests maintainers engage “curation” user contributor interactions intense time pressure 42 projects grew larger projects often implement rules recommendations even templates raising issues example common larger projects actively discourage using issues request assistance using properlyfunctioning parts However common tension arises users report experience bugs project’s contributors maintainers see operating properly Maintainers large central projects interviewed told us users “disrespectful” “entitled” “demanding” time attention also growing topic public discussion within FOSS several talks articles respectful maintainers 19 62 79 113 work investigating triaging resolving issues intensified maintainers reported feelings demotivation exhaustion burnout especially common maintainers larger projects often mentioned user support one emotionallyintensive aspects maintainer one interviewee discussed “I think burnout come lot different things come constant bombardment issues notifications you’re constantly reminded things you’re doing” Several interviewees noted way users interact contributors maintainers crucial kind words lessdemanding phrasing going long way maintainers 131 also finds However complicated global landscape FOSS several interviewees discussing crosscultural language barriers projects scaled interviewees described reciprocity affected felt FOSS work One interviewee stated top 
priority would user requesting support another FOSS contributor related seeking fix genuine bug affects ability two projects used together Interviewees also expressed enthusiasm supporting educational institutions educator issues part teaching class contrast maintainers interviewed expressed frustration user support work generated large tech company integrated FOSS part building selling — particularly company “given back” financial donations inkind donations labor eg developers regularly contributing FOSS projects One interviewee noted demanding freeriders limited forprofit corporations academic researchers behaved similar attitudes difference maintainers framed experiences collective work exploited labor related social communicative relationships maintainers collaborated coordinated contributed 43 Maintaining “Mainline” Code Scaling Trust FOSS maintenance repairing fixing crucially updating changing stay relevant grows users expect canonical version even number diversity contributors user needs might expand contributors scale maintainers must devise ways scale trust Version control practices central managing changes especially many contributors open development model contemporary FOSS projects typically involves contributor making copy entire codebase making whatever changes see fit submitting modified version review approval merging Traditionally maintainer decides patches accept keeps canonical version source code regularly making public releases keep rest date Smaller projects typically begin single maintainer begin get proposed contributions solo maintainers give regular contributor commit rights maintainer status let manage specific releases However founding maintainer must trust new maintainer default full technical privileges accept proposed changes one case mentioned interviews solo maintainer unable spend much time maintaining someone know asked comaintainer happily accepted However new maintainer added code silently used users’ computers mine 
cryptocurrency deposit profits account projects scale number maintainers code review processes common way producing trustworthy code Code review process one designated individuals author change approve pull request maintainer accept merge process somewhat similar academic peer review especially many cycles review revision code reviewers original author Code reviewers typically read line code specific issues contemporary social coding platforms supporting finegrained linelevel comments Code reviewers typically look bugs inefficiencies plus conformity project’s code style naming conventions approach modularity projects maintainers code review others allow wider set trusted nonmaintainers participate smaller projects code reviews might informal implicit formally specifying rules crucial aspect scaling project’s number contributors code reviewers maintainers codebase projects grow maintainers devise ways distributing work review recheck submissions mainline canonical version example Linux kernel still uses mode development creator lead maintainer Linus Torvalds accept patches official “mainline” codebase much easier much smaller since grown many thousands contributors developed cascading “chain trust” 28 top tier subsystem maintainers responsible various sections codebase making decisions patches accept subsystem maintainers delegate responsibility processes making decisions part codebase Linux kernel’s documentation describes “toplevel maintainers ask Linus ‘pull’ patches selected merging repositories Linus agrees stream patches flow repository becoming part mainline kernel amount attention Linus pays specific patches received pull operation varies clear sometimes looks quite closely general rule Linus trusts subsystem maintainers send bad patches upstream 28 Version control code reviews may seem purely technical express direction Authority merge changes often means authority set enforce specific vision necessity keeping single canonical code repository traditional 
approach model single lead maintainer ultimately final decisionmaker became widely prevalent FOSS known “benevolent dictator” “benevolent dictator life” BDFL one interview maintainer described tense environment caused transition BDFL model democratized system decisionmaking essence interpersonal relationships strained maintainers sought democratize leadership roles within contributors felt speed progress limited power BDFL veto group decisions BDFL resisted change 44 Labor Managing Maintaining Donations Labor contribute code FOSS projects donating products labor donations also generate new work maintainers discussing subject changes merge maintainers projects contributors told us perceived mismatches expectations nonmaintainers made proposed changes pull requests cases nonmaintainer added expanded code way found useful contributed code pull request feeling generously donated time effort intellectual property However maintainer’s perspective new pull requests heavy obligation time review longterm costs maintaining code indefinitely Wiggins 132 discusses similar trend citizen science “free puppies” donations commit recipient care years even value contribution uncertain interviewees maintained FOSS projects contributors mentioned importance rules requiring new code follow certain standards make code easier review maintain Maintainers interviewed projects rules procedures around merging changes told us cases contributor became increasingly frustrated maintainer asking order approve proposed changes cases contributor abandoned contribution altogether addition sometimes pull request perfectly conforms rules would take direction maintainers decided scope thus rejected “As maintainer don’t merge things also try thought leader what’s happening you’re going go certain direction — able politely say ‘We don’t want go way’ things know don’t want That’s hardest thing think situations someone earnestly trying add something don’t want shut sometimes it’s something declared weren’t 
going project” open contribution model anyone web propose new changes work maintainer FOSS typically involves substantial amount labor managing labor even emotions others interviewees emotional labor one difficult draining aspects position 45 Scaling automation continuous integration build systems testing projects grew contributors “continuous integration” CI build systems code testing ways automating code review testing CI involves automated processes build project’s codebase across multiple platforms run prewritten scripted tests check functioning Automated linting practices even check proper formatting style conventions within projects CI services directly integrated GitHubstyle platforms one many ways bots automation govern transform virtual organizations 54 107 2016 survey found 40 34000 popular FOSS projects GitHub use CI rising 70 examining 500 popular projects 63 number CI tests run staggeringly large Python programming language’s standard math library currently 134 different “unit tests” check inputs outputs square root sqrt function alone Major libraries programming languages tens thousands tests programming languages hundreds thousands tests computationally intensive point return many maintainers projects scales continuous integration build systems testing major strategy relied automate laborintensive interpersonallyintensive tasks code review particularly case rejecting code requests “The thing we’ve found computer tells it’s wrong take better human automated things catch lowhanging fruit less offense causes people computer says ‘you’ve got tab instead spaces here’ don’t mind someone tells get grumpy it” Maintainers also described CI testing strategy relied upon help manage workloads several interviews asked avoiding burnout worklife balance advice give new maintainers strategies first responses One interviewee helps manage large ecosystem projects discussed require kinds measures projects ecosystem “there packages maintain updated two plus years things 
work something breaks get email” Like automation strategies redistribute generate new forms labor Tests must continually written updated especially new features added projects require new functions features cannot added without also adding appropriate levels testing gifts create labor could mitigated come testing projects become increasingly complex however new kinds integration tests must written check various subsets codebase work together Testing also grows projects become integrated within interdependent ecosystem projects depending relying increasingly common projects test proposed changes break anything projects ecosystem Although developing maintaining CI processes laborintensive process work distributed contributors create automated tests code write rather code reviewers responsible catching bugs way help distribute maintainer labor widely scales terms codebase featureset contributor base interdependence within ecosystem Yet automation computationally expensive challenge FOSS values privilege relying free open platforms 2017 blog post testing Rust programming language 2 reported 126000 total tests run proposed changepull request across 20 different configurations taking 2 hours computing time post references resources run one possibly tests time additional pull request adds queue meaning contributors may wait days see proposed change breaks testing suite author describes longer queues lead conflict contributors need CI system approve changes move next stage code review approval way CI projects scale use either commercial cloud computing selfhosted server cluster Contributors supposed run full test suite computer submitting pull requests important test wide range configurations well common public infrastructure verifies tests actually passed GNU GCC part free movement maintains distributed “compile farm” donated servers hosted movement Commercial CI platforms growing including Microsoft’s Azure 
Pipelines runs Microsoft’s cloud computing infrastructure services venture capital funded companies CircleCI AppVeyor run cloud computing infrastructure Google Amazon Microsoft commercial CI platforms often give public FOSS projects free single CPU run single test time Microsoft Azure currently gives 10 free simultaneous tests FOSS projects charge simultaneous tests – necessity projects grow complexity commercial CI infrastructures challenge open source cultures privilege creating autonomous freely available infrastructures using freely available infrastructures anthropologist Chris Kelty called “recursive public” 76 decision go beyond free tier CI services first time FOSS takes recurring financial expense driving organizational changes work smaller projects free tiers often sufficient projects grow codebase contributor base decide whether fundraise pay simultaneous tests deal strains able easily verify proposed changes break fundraise CI resource require projects add accounting financial roles although events often first time occurs discuss later section Even noncommercial selfhosted alternative projects like GNU GCC follow also require dedicated maintenance roles soliciting donations fund selfhosted compile farm 46 Ecosystem work interdependence competition One major dimension FOSS projects scale interdependence FOSS projects take variety forms First become relied upon critical infrastructure FOSS projects especially case libraries programming languages operating systems number users typically means number developers building using FOSS chain cascading dependencies grow quite complex program may rely explicitly imported dependencies projects dependencies make chain hundreds projects long Maintainers must manage relationships projects “upstream” depend “downstream” depend modifies features break downstream projects expecting feature work consistently one common use continuous integration regularly test functionality beta release versions upstream projects relies 
many issues arise interdependent ecosystems beyond finegrained issues around compatibility new versions One maintainer shared difficulties arose depended began face internal conflict forced either pick side additional work maintain compatibility two projects Complex interdependent FOSS ecosystems often find needing coordinate highlevel tasks decisions Conferences conventions major site work interviewees even described ecosystemlevel conferences similar terms political delegations referring perceived need send representatives large blocks time dedicated open discussions topics relevant projects across ecosystem ecosystemlevel topics include softwarespecific issues would require consensus implement new features across ecosystem issues packaging release managers data types hardware support user telemetry Another major perennial often controversial ecosystemlevel topic proposed consolidation related competing projects within ecosystem often common FOSS ecosystems many projects created solve similar problem may good reasons multiple related competing projects circulate ecosystem navigating crowded ecosystem confusing frustrating users developers ecosystems common someone suggest many competing packages needs consolidation One interviewee shared case open discussion session ecosystemlevel conference maintainer one declared competing projects ecosystem “needed die” seeking gain support consolidation consensus reached raises issue highintensity communicative work representation meetings becomes solution vying keep projects alive within ecosystem secure funds project’s maintainers certainly decide keep operating ecosystem decided consolidate around another competing find fewer fewer users 47 Growing Community Evangelizing developer conference attended particular subset FOSS ecosystem observed dedicated plenary session 24 minute lightning talks almost pitches FOSS speaker developed Many projects pitched newer lessestablished secured spots 
competitive conference speakers noted Maintainers asked others use projects one explicitly imploring audience “rely upon” integrate workflows FOSS projects asked role ritual important space maintainers convince others use oftenfledgling FOSS projects many take form “rely upon me” pitch particularly quite yet fully developed interviewees called “rely upon us” pitches “evangelizing” brought frequently response questions scaling even general strategies maintaining computing general FOSS specifically evangelizing widelyused describe efforts sustain maintain projects bringing people 1 85 Maintainers repeatedly told us important fellow contributors maintainers distribute work make sustainable key rationale users rely projects presumably become invested projects’ success — particularly users also FOSS developers wellresourced organizations used relied upon programmers firms gains access potential pool skilled labor resources contributors users also make projects appear successful 5 thus worthy funding entities fund FOSS Like many startups projects often signal credibility showing logos wellknown companies universities rely projects websites tasks constitute vast domain evangelizing include developing social media accounts FOSS projects maintaining educational resources documentation building updating websites FOSS information moderating building QA sites forums giving talks meetups conferences companies schools promotional communicational educational evangelical activities done maintainers FOSS lay conditions expansion infrastructural change — maintenance work interviewees said deeply enjoyed evangelizing work others described exhausting task outside expertise also heard highlyvisible evangelizermaintainers receive personal credit group effort Matthew effect 93 accumulated status generate tensions even intend happen actively work elevate contributors maintainers 48 Building Maintaining Relationships Meetups Events common FOSS projects hold inperson meetups events often 
get new andor existing contributors together accomplish work build relationships events vary widely conferences hackathons happy hours findings align existing work also found events play critical role developing trust maintaining positive lasting social relations 90 91 104 Maintainers described inperson events essential function critical value helping maintainers build good relationships helps better understand virtual environments 26 Several longtime maintainers major projects told us stories first FOSS event claimed inspired get even involved FOSS way years online conversations Past work described gendered labor women take organize events often goes unacknowledged technical contributions valued social organizational work 92 participants also reflected upon Projects many contributors resources institutional connections often run major conferences smaller projects often meet adhoc events Smaller projects also rely ecosystemlevel conferences various related projects eg written programming language serve similar purposes organize collective events ecosystemlevel conferences often dedicated periods projects various sizes hold events projects grow number contributors many move holding satellite events major FOSS conferences conferences interviews observation events found maintainers often key organizers events smaller projects although many larger projects capacity fundraise hire dedicated event organizers Projects make connections companies universities often get inkind donations space event organizing labor Major projects many users contributors maintainers hold events like trade conventions thousands attendees high competition speaking slots companies even specialize hosting FOSS conferences behalf projects companies take portion registration fees larger projects 1000 USD Interviewees told us often easier get companies fund events anything else – testing infrastructure time labor costs accrue Many ecosystemlevel FOSS events directly sponsored companies either rely 
extensively FOSS projects business arms FOSS projects Maintainers discussed labor associated events projects scale holding events people becomes increasingly difficult costly also heard struggles maintainers face users contributors maintainers across world another form scaling often seen key metric success Inperson events global community require work skills resources including tasks like visas fundraising travel grants maintainers interviewed shared spent significant amounts personal money events organized particularly funding promises fell costs exceeded budgets expenses sometimes reimbursed donations actual labor organizing events less likely compensated recognized maintainers took issue restrictions donors would fund stipends even travel funds projects organized event even excess funds available expenses attendee travel venues catering 49 Funding finances donations Funding finances donations major topic interviews become widelydiscussed broader public conversations FOSS interviewees described funding way compensate labor already performed maintaining FOSS projects well pay contributors perform tasks done voluntarily However found fundraising fund management involve substantial amounts unanticipated specialized labor including seeking funding writing proposals budgets accounting reimbursements managing relationships funders strong parallel nonprofit sectors including academic research charities political organizations work around funding become substantial fraction work performed maintaining organization academic research however scientists learn training running lab means maintaining funding pipeline networking applications rejections 96 Maintainers trained system caught unprepared mismatch vision reality getting funding Even receive funding mere availability potential funding shift maintainers’ projects’ conceptions FOSS life work fits Funding also push projects develop formal organizational structures cases odds existing governance style ethos 491 Fundraising 
maintenance patronage business models found two general approaches maintainers took funding patronage models business models patronage maintainers solicit donations grants business models involve range strategies sell services top FOSS projects may seem obvious building business involves substantial amount work startup costs patronage models also involve massive work getting patrons maintaining good relationship Companies foundations governments individuals donate idiosyncratic processes around donations Managing patron relationship complex task projects gain patrons particularly patrons contradictory expectations interviews maintainers projects funding regularly hire multiple fulltime employees expressed common sentiment found less less work work seeking funding managing especially saw maintainers largescale projects academic settings also encountered nonacademic projects Interviewees actively sought grants patronage told us grant agencies patrons often fund novelty new features necessary upkeep repair security compatibility work Maintainers struggle producing visions novel innovation needed get funding common theme scientific cyberinfrastructure 9 public works 39 111 work around soliciting funding raw amount time energy maintainers spend trying write proposals find interested donors heavy personal burden maintainers become responsible livelihoods careers real people hired One interviewees — academic researcher whose grants fund employees work FOSS — explained getting funding create obligation continue get funding keep supporting people hired particularly heard academicaligned FOSS projects hiring graduate students postdocs work FOSS projects common 492 Money changes everything labor spending funding obtained money kind account question distribution governance arises Smaller projects grapple learning navigate nonprofit forprofit laws around hiring accounting taxes requiring bring kinds expertise projects fundraise maintainers find obligations expand decisionmaking 
include funders chosen funders informal smaller projects become explicit projects scale fundraising example Linux Foundation Platinum membership level costs 500000 USD annually corporate charter holds 80 Board Directors chosen Platinum members 50 projects receive grants traditional foundations whether private public grant proposal already specifies funds spent However many projects receive adhoc funding donors require extensive budgeted proposals especially solicit funding Patreonstyle platforms like OpenCollective GitHub Sponsors one maintainer explained getting funds easiest part “we created OpenCollective bunch companies contributed didn’t really address issue disburse funds real system figuring spend pay contributors money one pull request problem doesn’t solve existence money that’s available still mechanism policy like distributing among people project” introduction money bring social relations collaboration conflict labor trade agreements FOSS projects often try hire long time volunteer contributors matter live means navigating labor laws varied immigration statuses banking networks sanctions far restrictive kinds contributors others Funding transforms structure organization possible formations open source community kinds collaboration sustain maintenance
::::
5 DISCUSSION LABOR SCALE MAINTAINERSHIP findings speak two distinct linked issues FOSS labor scale discuss specific implications findings reflect mean multifaceted term “scale” interviews topics labor became apparent scale clearly important introduced interviewees response wide range questions many different aspects work positions Based interviews “scale” refer number people use use within large andor prestigious organizations number contributors maintainers number bug reports issues andor proposed changes made geographic distribution users contributors andor maintainers amount rules governance procedures number communication channels used amount andor rate internal external communication size complexity features code interdependence code within broader ecosystem Scale also invoked holistic feeling particularly described felt grown much fast making scale closer signifier affect 122 also describe findings advance work interprets scale multidimensional quality beyond number usersparticipants 81 methodologies use participants’ multiple understandings scale analytic resource 86 104 Table 1 Summary forms work examples change FOSS projects scale smaller scales larger scales maintainers Many maintainers various divisions labor hierarchies organizational structures Solo lead maintainer makes decisions often work overwhelming flood Work established rules teams triaging User support opportunity retain recruit new contributors Work adhoc overwhelming flood Work established rules teams triaging Managing development Governance often implicit led leadsolo maintainer accepts rejects proposed changes Governance often explicitly discussed variety formal rules structures decisionmaking Code review testing Either automated tests lightweight tests managed leadsolo maintainer Widespread use tests review proposed changes enforce rules Managing testing dedicated role Ecosystemlevel work Projects may rely moreestablished projects adapt changes made “upstream” Projects embedded 
interdependent ecosystem must coordinate ensure compatibility Evangelizing crucial task get new users contributors Leadsolo maintainer must work get speaking spots Maintainers routinely invited speak conferences prestigious organizations celebrities Meetings events Smaller events focused growing user contributor base often organized leadsolo maintainer little financial support Larger events let contributors maintainers coordinate build relationships Dedicated organizing roles financial support Funding finances Small nonexistent work uncompensated projects may receive donations small expenses Routine successful enough hire contributors maintainers accountants Debates raise spend funds 51 projects scale work increases fundamentally changes summarize Table 1 various kinds work positions labor FOSS become quite different smaller larger scales findings extend prior work investigates different modes scaling technologicallymediated organizations scientific cyberinfrastructure 3 9 75 similarities potentially due fact many FOSS projects also relied upon decentralized communities including science wellresourced user organizations grant agencies contribute development maintenance adhoc fashion Carr Lempert discuss scale merely matter existing activities amplified work fundamentally transformed scale scale deeply linked power relations 122 FOSS projects grow across many different understandings scale showed kind work involved maintaining also changes instance findings referred interviewee shared growth contributors FOSS necessitated formalization democratization leadership positions “benevolent dictator” model longer sufficient required decisions approved one person simply projects grow work done although also case New kinds work often needed existing kinds work become transformed example users providing user support someone raises issue exciting opportunity grow userbase sole maintainer likely work capacity attend individual concerns FOSS projects gain thousands even millions 
users maintainers often must implement distributed approaches like directing questions QA sites forming teams solely triage issue queue also case continuous integration CI projects grow codebase interdependence contributors tests must run exceed free allowances commercial CI offerings may initially seem purely “technical” challenge raises questions fundraising organizational roles CI example also illustrates deeply sociotechnical nature work long established concept CSCW organization studies 7 98 108 133 findings extend literature maintenance repair practices bound social relationships 37 66 71 72 99 110 120 One way understanding implication “purely technical” work FOSS requires engineering expertise forms work interpersonal organizational dimensions even often implicit 52 Scalar labor needed grow many directions Many FOSS projects small user base grow beyond single maintainercontributor 42 prior section discussed become widely relied upon infrastructure must maintained increasingly work different kinds work occurs maintainers must constantly ensure enough people available willing work needs people must also skills resources institutional knowledge organizational forms necessary work well introduce term “scalar labor” describe kinds work seek ensure capacity meet many growing needs across many dimensions may scale use “scalar” primarily adjective form scale mathematical meaning magnitude without specified direction apt metaphor FOSS particularly projects achieve one interviewee called “catastrophic success” concept scalar labor overlaps Ribes’s focus “scalar devices” 104 studies scientific cyberinfrastructure tools practices people organizations use understand manage size scope spread organization include surveys allhands meetings analyses logs digital traces “little technologies community” Like works scaling Ribes discusses many heterogeneous dimensions people tend compress single term Ribes’s article methodologically focused ethnographers study organizations 
scalar devices also found many empirical findings ethnography similar kind process different set social worlds FOSS scientific cyberinfrastructure prevalent assumption scalingup inherent good rewarded funders support projects demonstrate successful scaling Ribes also discusses far difficult manage projects seek scaleup particularly scaling becoming infrastructure entire academic discipline Ribes’ “scalar device” sociologists call sensitizing concept 11 draws attention knowing scale organization could future within know organization Scalar labor contrast draws attention work transforms changing organizational economic institutional context Labor also calls attention work recognized work costs compensated made whole Scalar labor also related Strauss’s concept “articulation work” 118 Gerson summarises “making sure various resources needed accomplish something place functioning they’re needed” 58 Bietz et al’s study scientific cyberinfrastructure 9 introduces related extension articulation work “synergizing” work creating maintaining common field quite different kinds people organizations systems articulation work Synergizing calls attention work impacted heterogeneity interdependent people organizations systems must coordinated also certainly theme findings emphasizing labor dimension FOSS scalar labor covers similar range activities synergizing draws attention work specifically impacted heterogeneity different dimensions growth interdependence one mode scaling Like articulation work synergizing scalar labor complex interdependent prime example phenomena raising funds host event evangelize new users would mentored contributors would respond bug reports fix identified issues may even become mentored maintainers example traditional engineering firm made money selling licenses services would simply hire someone directly respond bug reports fix issues FOSS projects significant fundraising capacity exactly relieve major burdens However projects encountered could recruit 
volunteers cannot charge free also struggle achieve status would help fundraise either case concept scalar labor draws attention growth sought sake rather strategy build capacity important maintenance work done Yet growth bring different kinds work even growth may needed work Also like articulation work synergizing scalar labor useful concept studying FOSS organizations produce maintain infrastructure includes interpersonal organizational financial skills often far outside scope traditional engineer’s duties — even though done improve project’s capacity traditional engineering work Yet work also often requires projectspecific knowledge trust contributors makes difficult delegate work becomes maintainers call “governance” – longstanding topic FOSS research eg 87 Yet despite focus governance governance work rarely analyzed form labor perhaps seen organizational forms decisionmaking Kelty 76 notable exception details free projects often make enact governance decisions engineering recognize engineering decisions constituting social values concept scalar labor calls attention governance form labor much exhausting uncompensated invisible burden exercise power 53 Scalar debt consequences ‘catastrophic success’ concept scalar labor leads related issue many projects become widely relied upon accumulate call “scalar debt” term introduce based concept “technical debt” 33 Technical debt refers engineering decisions initially help advance expand quickly cost must ‘paid back’ later even work initially saved Projects grow rapidly achieve one interviewee called “catastrophic success” struggle done enough scalar labor growth users led growth project’s capacity maintain new scale Paying scalar debt comes immense cost maintainers contemporary FOSS parlance often referred finding working “sustainability model” — way recruit new volunteers raise funds hire contributors maintainers work currently backlogged focus present different “sustainability” discussed scientific 
cyberinfrastructure usually refers questions whether infrastructure persist longterm often future decades 105 106 Scalar debt also related FOSS projects tend develop management governance structures adhoc basis time rather preemptively plan needed adhoc “spontaneous” 35 governance model common FOSS well peer production platforms 17 124 Researchers practitioners FOSS identified key strength 21 130 — comparisons nowdominant Agile development methodology 52 129 adhocness FOSS also described product shared ideological cultural commitments whose members may object formalization instead value autonomy distributed governance models 27 35 However adhocness also related resources labor available projects mostlyvoluntary peerproduction model FOSS projects studied began interviews maintainers told us concerns around formalization bureaucratization take substantial time energy social capital risk meaning good reasons may want create structures unless absolutely apparently necessary However one key case scalar debt encountered multiple interviews around Codes Conduct moderation mechanisms become particularly central discussions diversity inclusion FOSS 38 difficult know taking scalar debt although becomes apparent maintainers described living constant state incipient crisis overwork burnout keep projects falling far backwards Ironically projects state “putting fires” work must done get manage resources necessary help put fires whether involves recruiting mentoring new volunteer contributors raising funds hire employees success sustainability approaches guaranteed volunteers leave patrons withdraw support grants rejected business models fail profitable maintainers overwhelming ‘putting fires’ state difficult choice whether spend scarce time energy unproven strategy leverage resources versus spending time putting fires currently flaring 122 104 also discuss nonFOSS contexts projects continuously recalibrate scale currently also uncertain laborintensive task 54 Scaling becoming 
critical infrastructure wellresourced organizations issue scalar debt leads related issue becomes apparent scales becoming relied upon critical infrastructure wellresourced organizations forprofit companies universities governments previously reviewed entire sectors global economy reliant FOSS projects use term “labor” intentionally paper calls attention work part economy contributing production distribution goods services contrast earlier work FOSS often emphasized voluntary altruistic alternative nature work framing projects communities rising commercialization FOSS movement long wellstudied historical trend past two decades 10 47 77 find contemporary projects begin voluntary peerproduction efforts similarly transform scale Early many smaller voluntary FOSS projects seek relied upon especially large wellknown tech companies universities directly indirectly help support bring users may become contributors also prestige connections pittance donations — cultural social financial capital 13 Yet several maintainers interviewed described “blessing curse” relied upon organizations build products projects companies benefited labor often offer resources labor return often heard relied upon manner adds work also difficult engineers elite user organizations especially demanding entitled Maintainers must respond elite users ways addressed studies emotional labor managing emotions others 65 Many user organizations “free riders” contribute back Even corporations contribute back FOSS ways place additional demands maintainers demanding additional code review asking FOSS contributors managing relationships patrons ways becoming critical infrastructure wellresourced organizations increase transform work maintainers even brings users resources prestige Another issue arises becoming relied upon infrastructure others make maintainers feel morally responsible whatever products built using projects described becoming disillusioned burnedout specifically began imagining would used could 
afford commercial alternatives found used companies lower costs even expressed wrestle projects used part products maintainers believed unethical harmful frustrations compounded maintainers’ frustrations wealth derived companies relying upon technologies get shared maintainerscontributors evidencing inequitable form extraction activities around mentally working issues also seen form invisible work scalar labor 55 Scaling dynamics hypervisibility Scale impacts maintainers’ personal identities relationships broader publics much prior literature maintenance infrastructure sectors eg power plants transportation commercial discusses maintenance relatively lessvisible lessrecognized work 39 111 117 120 centrality maintainers leaders FOSS projects leads different set issues Much FOSS work done public view open nature FOSS work especially dominance allinclusive public code collaboration platforms like GitHub discussed sections user support proposed changes maintainers receive deluge requests users contributors visible public web public scrutiny maintainers engage communicative labor tracking management emotional labor user contributor response sum production optics successful Maintainers projects achieved massive success scale — widely reliedupon andor large contributor base — achieve kind “microcelebrity” 89 status term originally studies social media Eghbal compares FOSS maintainers content creators Youtube Instagram particularly earn quasiindependent living patronage 42 found maintainers grow microcelebrities fueled dynamics social media technology standardization maintainers hundreds thousands followers social media sites like Twitter write widelyread blog posts state FOSS flown speak major conferences companies maintainers play major role conflict resolution governance issues public platforms mailing lists particularly governance model adhoc influential leaders become substitutes reorganization decision making conflict resolution – case scalar debt Evangelizing 
reinforce trend towards hypervisible microcelebrity maintainers already famous maintainer invited give talks FOSS conferences thousands people flown give talks companies universities makes even famous Funding patronage business models benefit famous figureheads well often require single individual designated Principal Investigator grant CEO business arm microcelebrity maintainers told us actively work dynamics sending others place asked speak conferences Yet still forms invisible work hypervisible maintainers routinely receive torrents unsolicited emails private messages lavish praise harassment Much work also takes place outside public code platforms like writing grants conflict resolution Precisely microcelebrity maintainers called adjudicate disputes behind scenes findings suggest maintenance labor always invisible hypervisible highly valued Given dominant framing maintenance infrastructure invisible work 60 116 128 urge future research intersection issues
::::
6 CONCLUSION focus paper intersection labor scale context maintaining FOSS projects findings contribute understanding challenges faced people engaging variety types collaborative work build common information resources simultaneously developing organizations governance structures interviews maintainers described burned changes expected fundamentally changed projects scaled interviews rich insights deep varied commitments FOSS maintainers also emotional toll FOSS work take findings wide import discussions governance leadership sustainability sociotechnical systems including crowdsourcing citizen science scientific cyberinfrastructure crisis informatics Particularly focus labor people’s reactions changes labor help build awareness infrastructure sustainability tied longterm wellbeing maintainers individuals communities 61 Limitations Although attempted recruit diverse group participants interviews — particular attention typesize FOSS worked employment geography — findings limited number interviews conducted recruitment methods mostly studied projects relied upon others infrastructure began volunteer projects findings speak overwhelming majority FOSS projects developed used single person released publicly well entirely corporatedeveloped FOSS projects also sought capture kind longitudinal view focusing maintainers long histories involvement traditional longitudinal study would capture issues scale even depth Like interviewbased studies memories may less accurate study could complemented detailed contemporaneous methods capturing work maintainers daytoday participantobservation diary studies analyses trace data also acknowledge implicated kinds systems FOSS sustainability participants authors direct participant experience FOSS projects contributors maintainers gives us sensitivity topics also means lack analytical distance strands social science value particular fact funded study issues nonprofit foundations also direct funders FOSS projects — public knowledge 
disclosed prior interviews — may impact kinds responses received 62 Recommendations future work Contributors maintainers might better manage difficulties posed scale regularly conversations responsibilities entail much time effort work takes distribution workloads resources change changes Maintainers may benefit explicitly acknowledging scalar debt taken sometimes commonly acknowledged technical debt taken Focusing questions scalar labor brings light scale always universally good thing — even though broad pressures projects equate scale success 104 also discusses science benefits scaling success many also equitably distributed discussed around less visible gendered labor event organizing versus dynamics lead microcelebrity maintainers Finally efforts build capacity reduce burdens maintenance work compound amount work done funders donors mindful opportunity costs projects spend soliciting resources involve lightweight funding mechanisms require less upfront work part maintainers leaders Many areas paper might expanded future work Specifically interested unpacking effects corporate reliance FOSS projects maintainers’ working emotional lives Although brought value misalignment one way interpret maintainers’ reactions corporations took didn’t give back FOSS believe work done area understand political economy value misalignment effects corporate reliance maintainers’ mental health wellbeing might involve conducting additional interviews focus projects’ growth trajectories focusing projects experienced ‘catastrophic success’ gestured discussion exploring areas might contribute valuable actionable insights improve FOSS sustainability
::::
7 ACKNOWLEDGMENTS authors would like thank Alexandra Paxton Nelle Varoquaux Chris Holdgraf ongoing feedback well Linwei Lu Julio Gonzalez CSCW reviewers insights thankful cohort advisors program managers FordSloan Critical Digital Infrastructures Initiative helping us plan research appreciate time anonymous interviewees spent talking us reviewing various drafts work thankful Stacey Dorton administrative support work financially supported Ford Sloan Foundation Critical Digital Infrastructures Initiative grant G201811354 National Science Foundation grant DDRIG 1947213 well Gordon Betty Moore Foundation grant GBMF3834 Alfred P Sloan Foundation grant 20131027 MooreSloan Data Science Environments grant UCBerkeley REFERENCES 1 Morgan G Ames Daniela K Rosner Ingrid Erickson 2015 Worship faith evangelism Religion ideological lens engineering worlds Proceedings 18th ACM Conference Computer Supported Cooperative Work Social Computing ACM New York 69–81 httpsdoiorg10114526751332675282 2 Brian Anderson 2017 Rust tested httpsbrsongithubio20170710howrustistested 3 Karen Baker David Ribes Florence Millerand Geoffrey Bowker 2005 Interoperability strategies scientific cyberinfrastructure Research practice Proceedings American Society Information Science Technology 2005 httpsdoiorg101002meet14504201237 4 Flore Barcellini Françoise Détienne JeanMarie Burkhardt 2014 situated approach roles participation open source communities Human–Computer Interaction 29 3 2014 205–255 httpsdoiorg101080073700242013812409 5 Ann Barcomb 2016 Episodic volunteering open source communities Proceedings 20th International Conference Evaluation Assessment Engineering 1–3 httpsdoiorg10114529159702915972 6 Yochai Benkler 2007 Wealth Networks Social Production Transforms Markets Freedom Yale University Press 7 Richard Bentley John Hughes David Randall Tom Rodden Peter Sawyer Dan Shapiro Ian Sommerville 1992 Ethnographicallyinformed systems design air traffic control Proceedings 1992 ACM Conference 
ComputerSupported Cooperative Work 123–129 httpsdoiorg101145143457143470 8 Magnus Bergquist Jan Ljungberg 2001 power gifts Organizing social relationships open source communities Information Systems Journal 11 4 2001 305–320 httpsdoiorg101046j13652575200100111x 9 Matthew J Bietz Eric PS Baumer Charlotte P Lee 2010 Synergizing cyberinfrastructure development Computer Supported Cooperative Work CSCW 19 34 2010 245–281 httpsdoiorg101007s106060109114y 10 Benjamin J Birkinbine 2015 Conflict Commons Towards Political Economy Corporate Involvement Free Open Source Political Economy Communication 2 2 2015 httpwwwpolecomorgindexphppolecomarticleview35 Number 2 11 Herbert Blumer 1954 wrong social theory American Sociological Review 19 1 1954 3–10 12 Herbert Blumer 1969 Symbolic Interactionism Perspective Method University California Press Berkeley 13 Pierre Bourdieu 1973 Cultural reproduction social reproduction Knowledge Education Cultural Change Richard Brown Ed London Tavistock 14 Geoffrey C Bowker Susan Leigh Star 2000 Sorting Things Classification Consequences MIT Press 15 Daren C Brabham 2013 Crowdsourcing MIT Press Cambridge 16 Dale Bradley 2006 divergent anarchoutopian discourses open source movement Canadian Journal Communication 30 4 2006 17 Bruckman Forte 2008 Scaling consensus Increasing decentralization Wikipedia governance Proceedings 41st Annual Hawaii International Conference System Sciences HICSS 2008 157 18 Julia Bullard 2016 Motivating invisible contributions Framing volunteer classification design fanfiction repository Proceedings 19th International Conference Supporting Group Work Sanibel Island Florida USA 20161113 GROUP ’16 ACM 181–193 httpsdoiorg10114529572762957295 19 Brett Cannon 2017 give take open source Talk JupyterCon 2017 O’Reilly Media httpswwworeillycomradarthegiveandtakeofopensource 20 Andrea Capiluppi Martin Michlmayr 2007 cathedral bazaar empirical study lifecycle volunteer community projects Open Source Development Adoption Innovation 
Springer US 31–44 httpsdoiorg10100797803877248673 21 Eugenio Capra Chiara Francalanci Francesco Merlo 2008 empirical study relationship design quality development effort governance open source projects IEEE Transactions Engineering 34 6 2008 765–782 22 Adele E Clarke 2003 Situational analyses Grounded theory mapping postmodern turn Symbolic Interaction 26 4 2003 553–576 23 Adele E Clarke Susan Leigh Star 2008 social worlds framework theorymethods package Handbook Science Technology Studies MIT Press Cambridge 113–137 24 Gabriella Coleman 2004 political agnosticism free open source inadvertent politics contrast Anthropological Quarterly 77 3 2004 507–519 25 Gabriella Coleman 2009 Code speech Legal tinkering expertise protest among free open source developers Cultural Anthropology 24 3 2009 420–454 26 Gabriella Coleman 2010 hacker conference ritual condensation celebration lifeworld Anthropological Quarterly 2010 47–72 27 Gabriella Coleman 2012 Coding Freedom Ethics Aesthetics Hacking Princeton University Press Princeton 28 Kernel Development Community 2018 development process works Linux Kernel documentation httpswwwkernelorgdochtmlv415process2Processhtml 29 Kevin Crowston 2011 Lessons volunteering freelibre open source development future work Researching Future Information Systems Berlin Heidelberg 2011 IFIP Advances Information Communication Technology Mike Chiasson Ola Henfridsson Helena Karsten Janice DeGross Eds Springer 215–229 httpsdoiorg101007978364221364914 30 Kevin Crowston Robert Heckman Hala Annabi Chengetai Masango 2005 structurational perspective leadership FreeLibre Open Source teams Proceedings First International Conference Open Source 31 Kevin Crowston Qing Li Kangning Wei U Yeliz Eseryel James Howison 2007 Selforganization teams freelibre open source development Information Technology 49 6 2007 564–575 httpsdoiorg101016jinfsof200702004 32 Kevin Crowston Kangning Wei James Howison Andrea Wiggins 2012 FreeLibre OpenSource Development Know Know ACM 
Computing Surveys CSUR 44 2 March 2012 35 httpsdoiorg10114520891252089127 33 Ward Cunningham 1992 WyCash portfolio management system Proceedings Objectoriented Programming Systems Languages Applications Addendum Vancouver British Columbia Canada 19921201 OOPSLA ’92 Association Computing Machinery 29–30 httpsdoiorg101145157709157715 34 Laura Dabbish Colleen Stuart Jason Tsay Jim Herbsleb 2012 Social coding GitHub transparency collaboration open repository Proceedings ACM 2012 conference computer supported cooperative work ACM New York 1277–1286 35 Paul B de Laat 2007 Governance open source state art Journal Management Governance 11 2 2007 165–177 httpsdoiorg101007s1099700790229 36 Luiz Felipe Dias Igor Steinmacher Gustavo Pinto Daniel Alencar da Costa Marco Gerosa 2016 shift github impact collaboration 2016 IEEE International Conference Maintenance Evolution ICSME IEEE 473–477 37 Fernando Domínguez Rubio 2020 Ecologies Modern Imagination Art Museum University Chicago Press Chicago 38 Christina DunbarHester 2019 Hacking Diversity Politics Inclusion Open Technology Cultures Vol 21 Princeton University Press 39 David Edgerton 2011 Shock Old Technology Global History Since 1900 Oxford University Press Oxford 40 Paul N Edwards Steven J Jackson Geoffrey C Bowker Cory P Knobel 2007 Understanding infrastructure Dynamics tensions design Report NSF Workshop “History Theory Infrastructure Lessons New Scientific Cyberinfrastructures” 2007 httpsdeepbluelibumichedubitstreamhandle20274249353UnderstandingInfrastructure2007pdf 41 Nadia Eghbal 2016 Roads bridges unseen labor behind digital infrastructure Ford Foundation 42 Nadia Eghbal 2020 Working Public Making Maintenance Open Source Stripe Press 43 Hamid R Ekbia Bonnie Nardi 2017 Heteromation Stories Computing Capitalism MIT Press 44 Nathan Ensmenger 2008 Fixing things never broken maintenance heterogeneous engineering Proceedings SHOT Conference 45 Joseph Feller Patrick Finnegan Brian Fitzgerald Jeremy Hayes 2008 Peer Production 
Productization Study Socially Enabled Business Exchanges Open Source Service Networks Information Systems Research 19 4 2008 475–493 httpsdoiorg101287isre10800207 46 Anna Filippova Hichang Cho 2015 Mudslinging Manners Unpacking Conflict Free Open Source Proceedings 18th ACM Conference Computer Supported Cooperative Work Social Computing CSCW ’15 ACM 1393–1403 httpsdoiorg10114526751332675254 47 Brian Fitzgerald 2006 Transformation Open Source MIS Quarterly 30 3 2006 587–598 httpsdoiorg10230725148740 48 Lee Fleming David Waguespack 2007 Brokerage boundary spanning leadership open innovation communities Organization Science 18 2 2007 165–180 49 Karl Fogel 2005 Producing Open Source Run Successful Free O’Reilly Media 50 Linux Foundation nd Bylaws Linux Foundation httpswwwlinuxfoundationorgenbylaws 51 Sarah E Fox Kiley Sobel Daniela K Rosner 2019 Managerial Visions Stories upgrading maintaining public restroom IoT Proceedings 2019 CHI Conference Human Factors Computing Systems 1–15 52 Erich Gamma 2005 Agile open source distributed ontime Inside eclipse development process International Conference Engineering Proceedings 27th International Conference Engineering Vol 15 4–4 53 Juan Mateos Garcia W Edward Steinmueller et al 2003 open source way working new paradigm division labour development SPRU 54 R Stuart Geiger 2011 Lives Bots Wikipedia Critical Point View G Lovink N Tkacz Eds Institute Network Cultures 78–93 httpwwwstuartgeigercomlivesofbotswikipediacpovpdf 55 R Stuart Geiger David Ribes 2011 Trace ethnography Following coordination documentary practices 2011 44th Hawaii International Conference System Sciences IEEE 1–10 56 Matt Germonprez Julie E Kendall Kenneth E Kendall Lars Mathiassen Brett Young Brian Warner 2016 Theory Responsive Design Field Study Corporate Engagement Open Source Communities Information 57 Matt Germonprez Georg JP Link Kevin Lumbard Sean Goggins 2018 Eight Observations 24 Research Questions Open Source Projects Illuminating New Realities Proc 
ACM HumComput Interact 2 CSCW Article 57 2018 22 pages httpsdoiorg1011453274326 58 Elihu Gerson 2008 Reach bracket limits rationalized coordination challenges CSCW Resources CoEvolution Artifacts Springer 193–220 59 Paola Giuri Francesco Rullani Salvatore Torrisi 2008 Explaining leadership virtual teams case open source Information Economics Policy 20 4 2008 305–315 60 Stephen Graham Nigel Thrift 2007 order Understanding repair maintenance Theory Culture Society 24 3 2007 1–25 61 Kaj Grønbæk Morten Kyng Preben Mogensen 1992 CSCW challenges largescale technical projects—a case study Proceedings 1992 ACM Conference Computersupported Cooperative Work 338–345 62 Scott Hanselman 2015 Bring Kindness back Open Source httpswwwhanselmancomblogbringkindnessbacktoopensource 63 Michael Hilton Timothy Tunnell Kai Huang Darko Marinov Danny Dig 2016 Usage Costs Benefits Continuous Integration OpenSource Projects Proceedings 31st IEEEACM International Conference Automated Engineering ASE 2016 Association Computing Machinery New York NY USA 426–437 httpsdoiorg10114529702762970358 64 Eric von Hippel 2001 Innovation User Communities Learning OpenSource MIT Sloan Management Review 42 4 2001 82–82 httpsgogalecompsidopAONEswwissn15329194v21itridGALE7CA77578225sidgoogleScholarlinkaccessabs Sloan Management Review 65 Arlie Russell Hochschild 1983 Managed Heart Commercialization Human Feeling University California Press Oakland California 66 Lara Houston Steven J Jackson Daniela K Rosner Syed Ishtiaque Ahmed Meg Young Laewoo Kang 2016 Values Repair Proceedings 2016 CHI Conference Human Factors Computing Systems San Jose California USA 20160507 CHI ’16 ACM 1403–1414 httpsdoiorg10114528580362858470 67 Dorothy Howard R Stuart Geiger 2019 Ethnography Genealogy Political Economy PostMarket Era Free OpenSource Proceedings CSCW ’19 Extended Abstracts 68 Dorothy Howard Lilly Irani 2019 Ways Knowing Research Subjects Care Proceedings 2019 CHI Conference Human Factors Computing Systems 1–16 69 James 
Howison 2015 Sustaining scientific infrastructures transitioning grants peer production iConference 2015 20150315 httpswwwidealsillinoiseduhandle214273439 Accepted 20150323T215814Z Publisher iSchools 70 Lilly C Irani Six Silberman 2013 Turkopticon Interrupting worker invisibility amazon mechanical turk Proceedings SIGCHI Conference Human Factors Computing Systems 611–620 71 Lilly C Irani Six Silberman 2016 Stories tell labor Turkopticon trouble design Proceedings 2016 CHI Conference Human Factors Computing Systems San Jose California USA 20160507 CHI ’16 ACM 4573–4586 httpsdoiorg10114528580362858592 72 Steven J Jackson Syed Ishtiaque Ahmed Md Rashidujjaman Rifat 2014 Learning innovation sustainability among mobile phone repairers Dhaka Bangladesh Proceedings 2014 conference Designing interactive systems Vancouver BC Canada 20140621 DIS ’14 Association Computing Machinery 905–914 httpsdoiorg10114525985102598576 73 Steven J Jackson Alex Pompe Gabriel Krieshok 2012 Repair worlds maintenance repair ICT development rural Namibia Proceedings ACM 2012 conference Computer Supported Cooperative Work 107–116 74 C Jensen W Scacchi 2005 Collaboration Leadership Control Conflict Negotiation Netbeansorg Open Source Development Community Proceedings 38th Annual Hawaii International Conference System Sciences 200501 196b–196b httpsdoiorg101109HICSS2005147 ISSN 15301605 75 Helena Karasti Karen Baker Florence Millerand 2010 Infrastructure time Longterm matters collaborative development Computer Supported Cooperative Work CSCW 19 3 2010 377–415 httpsdoiorg101007s106060109113z 76 Christopher Kelty 2008 Two Bits Cultural Significance Free Duke University Press 77 Christopher Kelty 2013 free Journal Peer Production 2013 Issue 3 httppeerproductionnetissuesissue3freesoftwareepistemicsdebatethereisnofreesoftware 78 Mathias Klang 2005 Free open source freedom debate consequences First Monday 10 3 2005 79 Nolan Lawson 2017 feels like opensource maintainer Read Tea Leaves 
Received June 2020 revised October 2020 accepted December 2020
::::
License usage changes largescale study GitHub
Christopher Vendome 1 · Gabriele Bavota 2 · Massimiliano Di Penta 3 · Mario LinaresVásquez 1 · Daniel German 4 · Denys Poshyvanyk 1
1 College William Mary Williamsburg VA USA 2 Free University BozenBolzano BozenBolzano Italy 3 University Sannio Benevento Italy 4 University Victoria British Columbia Canada
Communicated Lin Tan Christopher Vendome cgvendome@email.wm.edu
Published online 6 June 2016 © Springer Science+Business Media New York 2016
Abstract Open source licenses determine legal point view conditions integrated redistributed reason developers adopt change license may depend various factors eg need ensuring compatibility certain thirdparty components perspective towards redistribution commercialization need protecting somebody else’s commercial usage paper reports large empirical study aimed quantitatively qualitatively investigating developers adopt change licenses Specifically first identify license changes 1731828 commits representing entire history 16221 Java projects hosted GitHub understand rationale license changes perform qualitative analysis 1160 projects written seven different programming languages namely C C++ C# Java Javascript Python Ruby—following open coding approach inspired grounded theory—on commit messages issue tracker discussions concerning licensing topics whenever possible try build traceability links discussions changes one hand results highlight different contexts license adoption changes triggered various reasons hand results also highlight lack traceability licensing changes made major concern change license system negatively impact reuse conclusion results study trigger need better tool support guiding developers choosingchanging licenses keeping track rationale license changes
Keywords licenses · Mining repositories · Empirical studies
::::
1 Introduction recent past years diffusion Free Open Source FOSS projects increasing significantly along availability forges hosting projects eg SourceForge httpsourceforgenet GitHub httpsgithubcom foundations supporting promoting development diffusion FOSS eg Apache Foundation httpswwwapacheorg GNU Foundation httpwwwgnuorg Eclipse Foundation httpwwweclipseorg availability FOSS projects precious resource developers reuse existing assets extendevolve way create new work productively reduce costs example blog post IBM httpswwwibmcomdeveloperworkscommunityblogs6e6f6d1b95c346df8a26b7efd8ee4b57entrywhybigcompaniesareembracingopensource119langen outlines reasons pushing companies reuse open source code “Yes cost factor one important factors attract small companies startup’s also big corporations days” happen context open source projects frequent commercial projects survey conducted Black Duck httpswwwblackducksoftwarecomfutureofopensource found 78% companies use open source code double 2010 93% claimed increase open source reuse 64% contribute open source development 55% indicated lack formal guidance utilizing open source code findings Black Duck demonstrate two key implications commercial reuse open source code increasing ii general lack oversight reuse occurs Nevertheless whoever interested integrating FOSS code redistributing along modifying existing FOSS projects create new work—referred “derivative work”—must aware activities regulated licenses particular specific FOSS license reused order license projects developers either add licensing statement source code files comment beginning file andor include textual file containing license statement source code root directory subdirectories Generally speaking FOSS licenses classified restrictive also referred “copyleft” “reciprocal” permissive licenses restrictive license requires developers use license distribute new incorporates licensed restrictive license ie redistribution derivative work must licensed terms meanwhile permissive licenses allow redistributors incorporate reused difference license Singh Phelps 2009 Free Foundation 2015
GPL versions classic example restrictive license Section 5 GPL30 license addresses code modification stating “You must license entire work whole License anyone comes possession copy” httpwwwgnuorglicensesgplhtml BSD licenses examples permissive licenses instance BSD 2Clause two clauses detail use redistribution modification licensed code source must contain copyright notice ii binary must produce copyright notice contain disclaimer documentation httpopensourceorglicensesBSD2Clause developers organizations decide make available open source license code one many different existing licenses choice may dictated set dependencies eg libraries uses since dependencies might specific licensing constraints reuse instance links statically GPL code must released GPL version failing fulfill constraint could create potential legal risk Also shown Di Penta et al 2010 choice licenses FOSS may massive impact success well projects using example—as happened IPFilter httpwwwopenbsdorgfaqpf—a highly restrictive license may prevent others redistributing case IPFilter caused exclusion OpenBSD distributions opposite case one MySQL connect drivers originally released GPL20 whose license modified exception Oracle httpwwwmysqlcomaboutlegallicensingfossexception allow driver’s inclusion released open source licenses would otherwise incompatible GPL eg original Apache license summary choice license—or even decision change existing license—is crucial crossroad point context evolution every FOSS order encourage developers think licensing issues early development process forges eg GitHub introduced mechanisms possibility picking license time repository created Also Web sites eg httpchoosealicensecom helping developers choose license Furthermore numerous 
research efforts aimed supporting developers classifying source code licenses Gobeille 2008 Germán et al 2010b identifying licensing incompatibilities Germán et al 2010a Even initiatives Package Data Exchange SPDX httpspdxorg aimed proposing formal model document license system However despite effort put FOSS community researchers independent companies turns developers usually clear idea exact consequences licensing code using specific license unsure example redistribute code licensed dual license among issues Vendome et al 2015b Paper Contributions paper reports results large empirical study aimed quantitatively qualitatively investigating licenses change open source projects extent possible establish traceability links licensing relateddiscussions changes First perform quantitative analysis conducted 16221 Java projects hosted GitHub conduct study first mined entire change history projects extracting license name eg GPL Apache version eg v1 v2 applicable 4665611 files involved total 1731828 commits Starting data provide quantitative evidence diffusion licenses FOSS systems ii common licensechange patterns iii traceability license changes commit messages issue tracker discussions following open coding approach inspired grounded theory Corbin Strauss 1990 qualitatively analyze sample commit messages issue tracker discussions likely related license changes qualitative analysis performed 1160 projects written seven different languages 159 C 91 C 78 C 324 Java 166 Javascript 147 Python 195 Ruby projects results analysis provide rationale developers adopt specific licenses initial licensing licensing changes study reported paper poses basis previous work aimed exploring license incompatibilities Germán et al 2010a license changes Di Penta et al 2010 license evolution Manabe et al 2010 integration patterns Germán Hassan 2009 Building upon previous work licensing analysis paper Constitutes best authors’ knowledge largest study aimed analyzing change patterns licensing 
systems earlier work limited analysis six projects Manabe et al 2010 Di Penta et al 2010 best knowledge first work aimed explaining rationale license changes means qualitative analysis commit notes issue tracker discussions achieved results suggest determining appropriate license far trivial community’s usage expectations influence developers picking license also observe licensing expectations may different based programming language Although choosing license considered important developers even early releases projects forges third partytools provide little support developers performing licensingrelated tasks eg picking license declaring license changing license restrictive one towards permissive one vice versa importantly keeping track rationale license changes example creation new repository GitHub allows user select initial license list commonly used ones offers guidance implications choice simply redirects user httpchoosealicensecom aside GitHub offers support licensing management Also lack consistency standardization mechanism used declaring license eg putting source code heading comments separate license files README files etc Moreover legal nature licenses exacerbate problem since implications grants restrictions always clear developers license present Last least currently available Configuration Management SCM technology provides support trace licensingrelated discussions decisions onto actual changes whereas traceability links useful understand impact decisions Paper Structure paper organized follows Section 2 relates work existing literature licensing analysis Section 3 describes study design details data analysis procedure Results reported discussed Section 4 Lessons learned study results summarized Section 5 Section 6 discusses threats study’s validity Finally Section 7 concludes paper outlines directions future work 2 Related Work work mainly related techniques tools automatically identifying classifying licenses artifacts ii empirical studies focusing 
different aspects license adoption evolution
21 Identifying Classifying Licenses
problem license identification firstly tackled FOSSology Gobeille 2008 aimed building repository storing FOSS projects licensing information using machine learning approach classify licenses Tuunanen et al 2009 proposed ASLA tool aimed identifying licenses FOSS systems tool shown determine licenses files 89% accuracy Germán et al 2010b proposed Ninka tool uses patternmatching based approach identifying statements characterize various licenses Given text file input Ninka outputs license name version evaluation reported authors Ninka achieved precision ∼95% detecting licenses Ninka currently considered stateoftheart tool automatic identification licenses typical license classification problem arises source code available cases source code available—ie byte code binaries available—and goal identify whether byte code produced source code certain license aim Di Penta et al 2010 combined code search textual analysis automatically determine license jar files released approach automatically infers license decompiled code relying Google Code search engine Note differently previous techniques approach Di Penta et al 2010 able identify license family eg GPL without specifying version eg 20
22 Empirical Studies Licenses Adoption Evolution
Di Penta et al 2010 investigated—on six open source projects written C C++ Java—the migration licenses course project’s lifetime study suggests licenses changed version type evolution generic patterns generalizable six analyzed FOSS projects Also Manabe et al 2010 analyzed changes licenses FreeBSD OpenBSD Eclipse ArgoUML finding different evolution patterns Germán Hassan 2009 analyzed 124 open source packages exploited several applications understand developers deal license incompatibilities Based analysis built model outlining specific licenses applicable advantages disadvantages Later Germán et al 2010a presented empirical study focused binary packages Fedora12 
Linux distribution aimed understanding licenses declared packages consistent present source code files ii detecting licensing issues derived dependencies packages able find licensing issues confirmed Fedora Germán et al 2009 analyzed presence cloned code fragments Linux Kernel two distributions BSD ie OpenBSD FreeBSD aim verify whether cloning performed accordance terms licenses Results show cases codemigrations admitted since went less restrictive licenses towards restrictive ones Wu et al 2015 investigated license inconsistencies cloned files performed empirical study Debian 75 demonstrate ways licensing become inconsistent file clones eg removal license one clone pairs previous work Vendome et al 2015a focused analysis Java projects work expand analysis include six new languages—C C C Javascript Python Ruby Also new grounded theory analysis features categorization commit messages issue discussions seven categories turn detailed total 27 subcategories addition extracting new support rationale also defined new subcategories subsequently distilled lessons new data example observed asserting license standardized consistent across languages would benefit developers consistent means documenting presenting license system within forge Vendome et al 2015b conducted survey developers contributed projects experienced changes licensing understand rationale adopting changing licensing survey results indicated facilitating commercial reuse common reason license changes Also survey highlighted general developers lack understanding legal implications open source licenses highlighting need recommenders aimed supporting choosing changing licenses share similar goals prior related work—understanding insights license usage migration—our analysis done much larger scale including quantitative analysis 16221 Java projects ii qualitative analysis upon sample commit messages issue tracker discussions 1160 projects written seven different programming languages latter allowed us perform 
indepth analysis rationale behind license usages migrations
::::
3 Design Empirical Study goal study investigate license adoption evolution FOSS projects purpose understanding overall rationale behind picking particular license changing licenses determining underlying license change patterns perspective researchers interested understanding main factors leading towards specific license adoption change context consists change history 16221 Java open source projects mined GitHub used quantitatively investigate goals study ii commit messages issue tracker discussions 1160 projects written seven different programming languages ie C C C Java JavaScript Python Ruby exploited qualitative analysis 31 Research Questions aim answering following research questions RQ1 usage different licenses projects GitHub research question examines proportions different types licenses introduced FOSS projects hosted GitHub consider GitHub relatively young forge launched April 2008 seen exponential growth number projects past years projects hosts young terms first available commit date repository created RQ₂ common licensing change patterns second research question investigates popular licensing change patterns GitHub Open Source community aim driving out—from qualitative point view—the rationale behind change patterns eg satisfying dependency constraints RQ₃ extent licensing changes documented commit messages issue tracker discussions research question investigates whether licensing changes system traced commit messages issues’ discussions RQ₄ rationale sources contain licensing changes research question investigates rationale behind particular change licenses developer’s perspective address four research questions looking licensing phenomenon two different points view namely quantitative analysis licenses projects released changes across evolution history ability match changes either commit messages issue tracker discussions ii qualitative analysis licensingrelated discussions made developers issue trackers way developers documented licensing changes 
commit messages quantitative analysis licensing changes interested analyzing license migration patterns fall following three categories
No license → Licenses – N2L reflects case developers realized need license added licensing statement files
Licenses → No license – L2N case various reasons licensing statements removed source code files example developer accidentally added wrong license license version
Licenses → Licenses – L2L general case change licensing distinct licenses
address RQ₁ RQ₂ RQ₃ perform quantitative analysis mining version history 16221 Java projects address RQ₄ perform qualitative analysis commit messages issue tracker discussion 1160 projects written seven different programming languages following subsections describe two kinds analysis detail
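The three migration categories above (N2L, L2N, L2L) can be illustrated with a minimal Python sketch. It is not the study's tooling: the function name `classify_migration` and the choice to model a file without a licensing statement as `None` are assumptions made only for this illustration, and the license labels are example strings.

```python
from typing import Optional

# Assumption for this sketch: a file with no licensing statement is modeled as None.
NO_LICENSE = None


def classify_migration(before: Optional[str], after: Optional[str]) -> Optional[str]:
    """Classify a per-file license transition into N2L, L2N, or L2L.

    Returns None when the license is unchanged, i.e. there is no migration.
    """
    if before == after:
        return None       # nothing changed between the two versions
    if before is NO_LICENSE:
        return "N2L"      # No license -> some license
    if after is NO_LICENSE:
        return "L2N"      # some license -> No license
    return "L2L"          # one license -> a distinct license


if __name__ == "__main__":
    print(classify_migration(None, "Apache20"))      # N2L
    print(classify_migration("GPL20", None))         # L2N
    print(classify_migration("GPL20", "GPL30"))      # L2L
    print(classify_migration("MIT", "MIT"))          # None
```

Under these assumptions, a change between two versions of the same license family (for example GPL20 to GPL30) counts as L2L, matching the description of L2L as a change between distinct licenses.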
::::
32 Quantitative Analysis order generate dataset used study mined version history 16221 Java projects publicly available GitHub GitHub hosts twelve million Git repositories covering many popular programming languages provides public API httpsdevelopergithubcomv3 used query mine information Also Git version control system allows local cloning entire repository facilitates comprehensive analysis changehistory thus license changes happened commit extract data quantitative analysis first identified comprehensive list projects hosted GitHub implementing script exploiting GitHub’s APIs computation comprehensive list resulted twelve million projects Since infrastructure use license extraction supports Java systems explained later filtered systems written Java obtaining list 381161 Java projects hosted GitHub cloned 381161 git repositories locally total 63 Terabytes storage space analysis randomly sampled 16221 projects due computation time aforementioned infrastructure Git repositories cloned used code analyzer developed context MARKOS European Bavota et al 2014 extract license information commitlevel granularity MARKOS code analyzer uses Ninka license classifier Germán et al 2010b identify classify licenses contained files hosted version control system 16221 projects study MARKOS code analyzer mined change log producing following information commit Commit Id identifier commit currently checked Git repository analyzed Date timestamp associated commit Author person responsible commit Commit Message message attached commit File path files committed Change File field indicate whether file involved commit Added Deleted Modified License Changed boolean value indicating whether particular file experienced change license commit respect previous version feature applies modified files case addition deletion file field set false License name version eg GPL20 license applied file computation information 16221 projects took almost 40 days resulted analysis total 1731828 developers’ 
commits involving 4665611 files Note BSD CMU licenses Ninka able correctly identify variants reporting BSD var CMU var Additionally GPL LGPL may contain “+” version number eg 30+ represents clause license granting ability use future versions license ie GPL20+ would allow utilization terms GPL30 Also values “no license” “unknown” represents case license attached file Ninka unable determine license determine whether trend proportions adopted licenses observed years used Augmented DickeyFuller ADF test Dickey Fuller 1979 1981 test widely used test stationarity time series test used reject two different null hypotheses H0 time series not significantly stationary H0 time series not significantly explosive latter used determine whether significantly increasing trend time series statistical tests considered significance level 0.05 ie rejected null hypotheses pvalues < 0.05 quantitatively analyzed collected data presenting descriptive statistics license adoption common atomic license changes found latter defined commits detected specific kind license change within least one source code textual file example given commit three files experiencing licensing change No license → Apache20 10 files GPL20 → GPL30 atomic license changes commit one No License → Apache20 change one GPL20 → GPL30 change prefer count number changes file level done previous work Di Penta et al 2010 avoid inflating analysis large commits make comparable commits performed small large projects possible coarsegrained analysis may fail capture license changes example due change licensing dependency although also case principle licensing changes reflected level appropriate end identified total 1833 projects atomic license changes dataset 16221 projects subset projects used investigate license change traceability Intuitively require presence license changes order determine well changes licensing documented either commit messages issue tracker discussion Therefore used web crawler identify among 
1833 projects using GitHub issue tracker finding total 1586 projects least one issue link licensing changes commit messagesissue reports performed string matching date matching either commit messages issue tracker discussions extracted licensing information eg license name date license committed decided rely commit messages issue discussions two sources information publicly available considered subject projects ii commit messages issue discussions likely report different level detail rationale behind specific change implemented considered case issues developers including changes related licenses 33 Qualitative Analysis qualitative analysis aims answering RQ4 based manual inspection categorization commit messages issue tracker discussions Since limitations terms project’s programming language analyze unlike quantitative analysis performed qualitative analysis commit messages issue tracker discussions set 1160 projects written seven different languages 159 C 91 C 78 C 324 Java 166 Javascript 147 Python 195 Ruby projects Note choice languages considered study random focused seven ten popular programming languages 2014 2015 Zapponi httpgithutinfo Cass httpspectrumieeeorgcomputingsoftwarethe2015toptenprogramminglanguages considered projects instead selected applying following procedure Firstly list twelve million repositories extracted written seven languages interest extracted repositories satisfying following two criteria forks main repository ii least one star ie least one user expressed appreciation repository watcher ie least one user asked receive notification changes made repository selection criteria used exclude analysis personal repositories eg website GitHub user might biased results However important note Java considered comprehensive list 381161 projects initial investigation Java projects Vendome et al 2015a observed need refinement thus adopted additional six languages observed high proportion false positive commit messages issues discussions Thus 
filtering sought improve generated taxonomy extracted change log cloned projects order analyze identify commit messages likely related licensing total 103128211 commits considered identify commit messages likely related license changes adopted caseinsensitive keywordbased filtering based critical words exploited Ninka license identification augmented license names detailed set keywords used matching reported Table 1 cases keywordfilters included bigrams composed license type version since license types eg apache produced large amount false positive discussions considered alone eg commit message talking Apache projects end keywordbased filtering allowed us identify total 746874 commit messages 742671 Java amounted approximately ∼1% overall commits Java Given high number relevant commits sampled 20% commits found language object manual inspection However set minimum threshold 100 commits per language maximum threshold 500 thresholds adopted ensure representativeness studied language keeping manual analysis effort reasonable Note sampling statistically significant 95% confidence interval ±10% better resulted total 1413 commits inspected worth noting Java projects addition 500 sampled commit messages matching keywords Table 1 also considered 224 randomly sampled commit messages commits 1833 projects identified quantitative analysis instance atomic license change interested investigating reasons behind changes Clearly possible systems written programming languages said part quantitative analysis number sampled commits programming language reported second column Table 2
Table 2
Language | commits | issue tracker discussions
C | 227 | 30
C++ | 100 | 6
C# | 139 | 12
Python | 130 | 41
Java | 724 | 273
JavaScript | 122 | 79
Ruby | 195 | 45
Overall | 1637 | 486
Concerning issue tracker discussions built Web crawler collecting information present issue trackers studied projects particular issue crawler collected title description ii text comment added iii date issue opened closed applicable order find relevant issues ie 
presenting discussions licenses used keyword search mechanism aimed matching issue title keywords related licensing previously explained commit messages applying procedure identified total 486 issue discussions potentially related licensing shown third column Table 2 collecting commit messages issue discussions order analyze categorize followed open coding process inspired Grounded Theory GT principles formulated Corbin Strauss 1990 analysis commit messages issue tracker discussions aimed finding rationale licensing changes particular aimed answering following two subquestions reasons pushing developers associate particular license causes migrate licenses release new license ie colicensing perform open coding distributed commit messages issue tracker discussions among authors two authors randomly assigned message message commit message entire issue tracker discussion round open coding authors independently created classifications messages met discuss coding identified us refined categories Note round categories defined previous rounds refined accordingly new knowledge created additional manual inspections authors’ discussions Overall open coding concerned 1413 randomly selected licensingrelated commit messages identified via keywordsbased mechanism ii 224 commit messages Java systems’ commits licensing change observed quantitative analysis iii 486 issue tracker discussions matching licensingrelated keywords output open coding procedure set categories group explaining licenses adopted changed qualitatively discuss findings analysis Section 44 presenting categories classification examples commit messages issue tracker discussions belonging various categories 34 Dataset Diversity Analysis get idea external validity dataset measured diversity metric proposed Nagappan et al 2013 dataset matching list mined projects GitHub list available projects Boa Dyer et al 2013 Given different datasets exploited context quantitative qualitative analysis discuss diversity metrics 
separately
Footnote 8 We looked target keywords issue titles found including issue descriptions search generates considerable number false positives
341 Quantitative Analysis able match name 1556 16221 projects exploited quantitative analysis names projects diversity metric dataset Nagappan et al 2013 subset used computation diversity metric obtaining score 035 indicating around 10 dataset covers third open source projects according six dimensions programming language developers age number committers number revisions number programming languages dimensional scores 045 099 100 099 096 099 respectively suggesting subset covers relevant dimensions analysis However focus Java projects limits programming language score affecting overall score Another important aspect evaluate representativeness licenses present dataset respect diffused FOSS community Open Source Initiative OSI specifies list 70 approved licenses indicating ones reported first column Table 3 commonly used FOSS specify order second column Table 3 reports top licenses extracted FLOSSmole’s SourceForge snapshot December 2009 Howison et al third column shows top licenses extracted sample GitHub projects exploited quantitative analysis licenses declared OSI commonly used also commonly found dataset BSD 2 3 fall BSD type comparison dataset SourceForge order diffusion different licenses exactly six top eight licenses SourceForge also present dataset Public Domain Academic Free License analysis together diversity metric suggests dataset exploited quantitative analysis representative Open Source systems Table 4 reports year first commit date 16221 considered projects table clearly shows exponential growth GitHub 2012 confirming already observed people GitHub community Doll httptinyurlcommuyxkru GitHub also experienced exponential growth 2013 httpsoctoversegithubcom dataset mirror fact due design choice made randomly choosing projects clone particular cloned projects January 2014 excluding projects commit history less one
year set 381161 Java projects ie projects first commit performed later January 2013 needed since context RQ2 interested observing migration patterns occurring projects’ history Thus projects short commit history likely relevant purpose study Moreover since RQ1 interested observing licenses’ usage context GitHub’s drastic expansion decided exclude 60 projects first commit 2013 analysis due severe lack representation sample despite continued growth GitHub
Table 3
OSI popular license unordered | SourceForge Dec 2009 | Github data set Quant Analys
Apache2 Lic | GNU Public Lics | GNU Public Lics
BSD 2Clause Lic | Lesser GNU Public Lics | Apache Lics
BSD 3Clause Lic | BSD Lics | Lesser GNU Public Lics
GNU Public Lics | Apache Lics | MIT Lic
Lesser GNU Public Lics | Public Domain | Eclipse Public Lic
MIT Lic | MIT Lic | Comm Dev Dist Lic
Mozilla Public Lic 2 | Academic Free Lic | Mozilla Public Lic
Comm Dev Dist Lic | Mozilla Public Lics | BSD Lics
Eclipse Public Lic | |
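The keyword-based filtering and per-language sampling described earlier in this section can be sketched as follows. This is a minimal illustration, not the study's tooling: the keyword list is hypothetical (the actual keywords are those of Table 1), the sampling rate is read as 20 percent, and the function names are invented for the example.

```python
import random

# Hypothetical keywords; the study's real list is reported in its Table 1.
# "apache 2" stands in for a license-type + version bigram, used because
# "apache" alone would match commits about Apache projects in general.
KEYWORDS = ["license", "copyright", "gpl", "apache 2"]


def is_licensing_related(message, keywords=KEYWORDS):
    """Case-insensitive keyword match on a commit message."""
    text = message.lower()
    return any(keyword in text for keyword in keywords)


def sample_size(n_matching, rate=0.20, minimum=100, maximum=500):
    """Sample size per language: rate * n, clamped to [minimum, maximum].

    The result never exceeds the number of matching commits available.
    """
    target = round(n_matching * rate)
    target = max(minimum, min(maximum, target))
    return min(target, n_matching)


def sample_commits(messages, seed=0):
    """Filter messages by keyword, then draw a random sample for inspection."""
    matching = [m for m in messages if is_licensing_related(m)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(matching, sample_size(len(matching)))
```

Under these assumptions, a language with very many matching commits is capped at 500 sampled messages and a language with few is raised to 100, which agrees with the Java and smallest per-language counts reported in Table 2.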
::::
342 Qualitative Analysis Similarly able match name 471 1160 projects names projects diversity metric dataset Nagappan et al 2013 manually investigated commit messages issue discussions qualitative analysis done quantitative analysis considered matched subset computation diversity metric obtaining score 032 indicating approximately 40 dataset covers third open source projects according six dimensions programming language developers age number committers number revisions number programming languages dimensional scores 043 099 100 099 094 10 respectively Intuitively scores directly impacted limited number projects able match However still observe relatively high diversity scores suggesting qualitative analysis representative substantial portion open source systems
::::
35 Replication Package working data set study available httpwwwcswmedusemerudataEMSE15licensing includes lists projects urls ii issues tracker commit data iii analysis scripts iv summary achieved results
::::
4 Study Results section discusses achieved results answering four research questions formulated Section 31
::::
41 RQ 1 Usage Different Licenses GitHub Figure 1 depicts percentage licenses first introduced given year refer relative license usage report first occurrence license committed file ease readability bars grouped permissive dashed bars restrictive licenses solid bars Additionally omit data prior 2002 due limited number projects created years sampled dataset see Table 4 year 2002 observed restrictive licenses permissive licenses used approximately equally slight bias towards using restrictive licenses Although LGPL21 LGPL21 variants restrictive licenses less restrictive GPL counterpart LGPL specifically aimed ameliorating licensing conflicts arose linking code nonLGPL library Instead various versions GPL license require system change license version GPL else component would legally able redistributed together source code Thus suggests bias toward using less restrictive licenses even among mostly used copyleft licenses subsequent year 2003 clear movement towards using less restrictive licenses seen wider adoption MITX11 license well Apache11 license Additionally observe LGPL still prominent CMU CPL10 GPL20 licenses declining following five years 2004–2008 Apache20 CDDL10 EPL10 GPL30 LGPL30 DWTFYW2 licenses created observation period Bavota et al found Apache ecosystem grew exponentially Bavota et al 2013 observation explains rapid diffusion Apache20 license among FOSS projects observed growth resulted Apache20 license accounting approximately 41 licensing 2008 Conversely observed decline relative usage GPL LGPL licenses two observations suggest clear shift toward permissive licenses since approximately 60 licenses attributed permissive starting 2003 small drops 2007 2009 Another interesting observation newer version GPL GPL30 GPL30 lower relative usage compared earlier version 2011 Additionally adoption rate gradual Apache20 license appears supersede Apache11 license However LGPL30 LGPL30 popularity prior versions terms adoption despite relative decline LGPL21’s usage starting
2010 manual analysis commits highlighted explicit reasons pushed developers choose LGPL license instance developer hibernatetools committing addition LGPL21 license wrote LGPL guarantees Hibernate modifications made Hibernate stay open source protecting work commit note indicates LGPL21 chosen best option balance freedom reuse guarantee remain free Conversely observed abandonment old licenses old license versions newer FOSS licenses introduced example Apache11 CPL10 become increasingly less prevalent longer used among projects cases newer license appears replace former license Apache20 offers increased protections eg protections patent litigation EPL10 CPL10 license difference IBM replaced Eclipse Foundation steward license Thus two licenses intrinsically legal perspective likely projects migrated CPL EPL would explain EPL adoption grew CPL usage shrunk Finally observed fluctuations adoption MITX11 license adoption permissive licenses grew introduction Apache20 license first declined adoption followed growth approximately original adoption Ultimately observed stabilization MITX11 usage approximately 10 starting 2007 order determine whether proportions given license exhibited stationary trend clearly increasing trend observed years performed ADFtests explained Section 32 Results reported Table 5 significant pvalues shown bold face second column indicate series stationary H0 rejected significant pvalues third column indicates series explosive ie clearly increasing trend H0e rejected results indicate Almost license exhibiting stationary trend results show significant differences zend20 license particularly popular marginal significance CMU CPL10 GPL10 Confirming discussion clearly increasing trend permissive licenses Apache20 MITX11 also new versions restrictive licenses facilitating integration licenses particular GPL30 eases compatibility Apache license well LGPL20 facilitates compatibility code integrated library also see increase DWTFYW20 discussed Section 5 
likely due cases developers clear idea license used Summary RQ 1 analyzed Java projects observed clear trend towards using permissive licenses like Apache20 MITX11 Additionally permissiveness restrictiveness license impact adoption newer license versions permissive licenses rapidly adopted Conversely restrictive licenses seem maintain greater ability survive usage compared permissive licenses become superseded Restrictive GPL30 semirestrictive LGPL20 licenses facilitate integration licenses also exhibit increasing trend Finally observed stabilization license adoption proportions particular licenses despite exponential growth GitHub code base
Table 5 results augmented DickeyFuller test determine stationary explosive trends license usage
License | Stationary trend pvalue | Explosive Trend pvalue
Apache11 | 014 | 086
Apache20 | 098 | 002
BSD | 073 | 027
CDDL v1 | 042 | 058
CMU | 005 | 095
CPL10 | 043 | 057
EPL10 | 007 | 093
DWTFYW20 | 099 | 001
MPL10 | 090 | 010
MPL11 | 032 | 068
NPL11 | 055 | 045
svnkit | 078 | 022
zend20 | 001 | 099
MITX11 | 097 | 003
GPL10 | 005 | 095
GPL20 | 067 | 033
GPL20 | 066 | 034
GPL30 | 098 | 002
GPL30 | 069 | 031
LGPL20 | 099 | 001
LGPL20 | 067 | 033
LGPL21 | 035 | 065
LGPL21 | 054 | 046
LGPL30 | 063 | 037
LGPL30 | 052 | 048
42 RQ2 Common Licensing Change Patterns analyzed commits license change occurred twofold goal analyze license change patterns understand prevalence types licensing changes affecting systems ii understand rationale behind changes Overall found 204 different atomic license change patterns analyze identified patterns highest proportion across projects ie global patterns within ie local patterns sought distinguish dominant global patterns Table 6 dominant local patterns Table 7 study one hand overall trend licensing changes hand understand specific phenomena occurring certain projects global patterns extracted identifying counting presence pattern per aggregating counts projects instance 823 projects dataset experienced least one change license → Apache20 thus final count globally pattern 823 dominant global patterns either change either license unknown license particular license change either particular license license unknown license
Table 6 Top ten global atomic license change patterns
Top Patterns Overall | Frequency
license unknown → Apache20 | 823
Apache20 → license unknown | 504
license unknown → GPL30 | 269
GPL30 → license unknown | 181
license unknown → MITX11 | 163
license unknown → GPL20 | 113
GPL20 → license unknown | 111
MITX11 → license unknown | 98
license unknown → EPL10 | 94
license unknown → LGPL21 | 91
Top Migration Patterns Licenses | Frequency
GPL30 → Apache20 | 25
GPL20 → GPL30 | 25
Apache20 → GPL30 | 24
GPL20 → LGPL21 | 22
GPL30 → GPL20 | 21
LGPL21 → Apache20 | 16
GPL20 → Apache20 | 15
Apache20 → GPL20 | 13
MPL11 → MITX11 | 11
MITX11 → Apache20 | 11
Table 6 shows top ten global patterns observe inclusion Apache20 common pattern unlicensed unknown code Clearly likely due specific programming language ie Java exploited sample projects quantitatively analyzed Table 6 also shows common global migrations focusing attention changes happened different licenses observe migration towards permissive Apache20 dominant change among top ten atomic license changes global license migrations interesting observation license upgrade downgrade GPL20 GPL30 GPL30 considered Free Foundation compatible license Apache20 license footnote 9 Due large usage Apache license Java projects pattern quite expected However migration GPL30 → GPL20 interesting since still allows redistributed GPL30 also allows usage GPL20 less restrictive well
Footnote 9 httpgplv3fsforgwikiindexphpCompatible licenses
Regarding local patterns Table 7 frequencies computed first identifying frequent ie dominant pattern counting number times specific pattern frequent across whole dataset instance GPL10 → GPL30 pattern frequent 36 projects dataset Table 7 summarizes common local migrations migrations appear toward less restrictive license license version low frequency atomic license change local patterns indicates migrating licenses nontrivial also
introduce problems respect reuse example observed single GPL10 code changed LGPL20 total nine times LGPL less restrictive GPL code used library Thus parts system GPL developer must comply restrictive possibly incompatible constraints considered atomic license changes among file repository needed since analyzed projects lack specific file eg licensetxt declaring license extract declared license considered file top level directory named license copying copyright readme focusing projects including files extracted 24 different change patterns Table 8 illustrates top eight licensing changes particular licenses ie excluded license unknown license table declared licenses considered top eight since tie five patterns next group change patterns observe change Apache20 → MITX11 prevalent license change pattern colicense MITX11 Apache20 second prevalent one Interestingly pattern dominant filelevel analysis although Grounded Theory analysis provided us support pattern MITX11 license used allow commercial reuse still maintaining open source nature
Table 7
Pattern | Frequency
GPL20 → GPL30 | 36
GPL20 → LGPL30 | 15
LGPL30 Apache20 → Apache20 | 12
GPL30 Apache20 → Apache20 | 12
GPL20 → LGPL21 | 10
GPL10 → LGPL20 | 9
GPL20 → GPL30 | 9
GPL30 → Apache20 | 8
GPL30 → GPL20 | 8
GPL30 → LGPL30 | 8
Table 8
Pattern | Frequency
Apache20 → MITX11 | 12
Apache20 → MITX11 Apache20 | 8
GPL20 → GPL30 | 7
MITX11 → Apache20 | 6
GPL30 → Apache20 | 6
MITX11 Apache20 → Apache20 | 5
Apache20 → GPL30 | 5
GPL30 → MITX11 | 3
pattern GPL20 → GPL30 Top3 Table 8 expected since tied prevalent among global atomic license changes Similarly patterns MITX → Apache20 GPL30 → Apache20 Apache20 → GPL30 also among top eight global changes Another
notable observation license changes frequently happening toward permissive licenses Excluding five changes Apache20 → GPL30 remaining changes top eight either licensing change restrictive copyleft license permissive license licensing change two different permissive licenses Summary RQ 2 key insight analysis atomic license change patterns observed studied Java projects licenses tend migrate toward less restrictive licenses
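The two ways of counting atomic license change patterns in RQ2 (global: projects in which a pattern occurs at least once; local: projects in which a pattern is the most frequent one) can be sketched as below. This is an illustrative reading of the description, not the authors' scripts. The input structure, the function names, and the choice to credit every pattern tied for first place within a project are assumptions.

```python
from collections import Counter


def global_pattern_counts(changes_by_project):
    """For each pattern, count the projects where it occurs at least once."""
    counts = Counter()
    for changes in changes_by_project.values():
        counts.update(set(changes))  # presence per project, not occurrences
    return counts


def local_pattern_counts(changes_by_project):
    """For each pattern, count the projects where it is the dominant pattern.

    Assumption: when several patterns tie for most frequent in a project,
    each of them is counted once for that project.
    """
    counts = Counter()
    for changes in changes_by_project.values():
        if not changes:
            continue
        per_project = Counter(changes)
        top = max(per_project.values())
        for pattern, occurrences in per_project.items():
            if occurrences == top:
                counts[pattern] += 1
    return counts


# Hypothetical data: each project maps to its list of (old, new) license changes.
example = {
    "project-a": [("GPL20", "GPL30"), ("GPL20", "GPL30"), ("unknown", "Apache20")],
    "project-b": [("unknown", "Apache20")],
    "project-c": [],
}
global_counts = global_pattern_counts(example)
local_counts = local_pattern_counts(example)
```

Counting presence per project rather than raw occurrences keeps a single project with many changed files from dominating the global ranking.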
::::
43 RQ 3 Extent Licensing Changes Documented Commit Notes Issue Tracker Discussions Table 9 reports results identification traceability links licensing changes commit messagesissue tracker discussions found clear lack traceability license changes commit message history issue tracker data sources
Table 9
Data source | Linking query | Links
Commit Messages | Commits keyword “license” | 70746
Commit Messages | Commits containing new license name | 519
Commit Messages | Commits containing new license name keyword “license” | 399
Issue Tracker Comment Matching | Comments closed issues containing keyword “license” | 0
Issue Tracker Comment Matching | Comments closed issues containing new license | 0
Issue Tracker Comment Matching | Comments closed issues containing new license keyword “license” | 0
Issue Tracker Comment Matching | Comments open issues containing keyword “license” | 68
Issue Tracker Comment Matching | Comments open issues containing new license | 712
Issue Tracker Comment Matching | Comments open issues containing new license keyword “license” | 16
Issue Tracker Datebased Matching | Closed comments opened license change closed license change | 197
Issue Tracker Datebased Matching | Open comments open license change | 2241
Issue Tracker Datebased Matching | Comments closed issues open license change closed license change keyword “license” | 0
Issue Tracker Datebased Matching | Comments open issues open license change keyword “license” | 0
Issue Commit Matching | Comments closed issues containing keyword “Fixed issuenum” | 66025
Issue Commit Matching | Comments open issues containing keyword “Fixed issuenum” | 3407
Issue Commit Matching | Comments closed issues containing commit hash license change occurs | 0
Issue Commit Matching | Comments open issues containing commit hash license change occurs | 1
first extracted instances ie commit messages issue tracker discussion comments discussions keyword “license” appears license name mentioned eg “Apache” former case identifying potential commits issues related licensing latter attempts capture related specific types licenses using first approach retrieved 70746 commits 68 issues looking license names identified 519 commits 712 issues However numbers inflated false positives eg “Apache” relate license relate one Apache Foundation’s libraries reason looked commit messages issue discussions containing word “license” well name license
resulted drop linked commit messages 399 zero issue discussions results highlight license changes rarely documented developers commit messages issues also investigated whether relevant commits issues could linked together linked commit messages issues former explicitly mentions fixing particular issue eg “Fixed 7” would denote issue 7 fixed observed technique resulted large number pairs issues commits thus observation lack license traceability simply artifact poor traceability projects investigate linking extracted commit hashes license change occurred attempted find hashes issue tracker’s comments Since issue tracker comments contain abbreviated hash truncated hashes appropriately prior linking results indicated one match open issue zero matches closed issues Finally attempted link changes issues matching date ranges issues commit date license change issue open prior change issue closed closing date must change However find matches datebased approach Summary RQ 3 analyzed Java projects issue tracker discussions commit messages yielded minimal traceability license changes suggesting analysis licensing requires finegrained approaches analyzing source code
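The three linking heuristics used for RQ3 (issue references such as "Fixed #7" in commit messages, abbreviated commit hashes in issue comments, and date-range matching) can be sketched as follows. This is a hedged illustration only: the regular expression, the 7-character abbreviation length, and all function names are assumptions made for the example, since the text does not give the exact matching rules.

```python
import re
from datetime import date

# Assumed pattern for messages such as "Fixed #7"; the study's exact rule is not given.
FIX_REFERENCE = re.compile(r"\bfix(?:e[sd])?\s+#(\d+)", re.IGNORECASE)


def issues_fixed_by(commit_message):
    """Return the issue numbers a commit message claims to fix."""
    return [int(number) for number in FIX_REFERENCE.findall(commit_message)]


def comment_mentions_commit(comment, full_hash, abbrev_len=7):
    """True if the comment contains the commit hash truncated to abbrev_len.

    abbrev_len=7 is an assumed abbreviation length, not taken from the study.
    """
    return full_hash[:abbrev_len].lower() in comment.lower()


def within_issue_window(change_date, opened, closed=None):
    """Date-based match: the issue was opened before the license change and,
    if it has been closed, the closing date falls after the change."""
    if opened >= change_date:
        return False
    return closed is None or closed > change_date
```

For example, `issues_fixed_by("Fixed #7")` yields the single issue number 7, and an open issue passes the date check whenever it was opened before the license change.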
::::
44 RQ 4 Rationale Sources Contain Licensing Changes section firstly present taxonomy resulted open coding commit messages issue tracker discussions explained Section 3 analysis performed 1637 commit messages 486 issue tracker discussions 1160 projects written seven programming languages aims modeling rationale license adoption changes Secondly present findings looking commits introduce atomic license changes analyzed Java projects
::::
441 Analyzing Commit Messages Issue Discussions Table 10 reports categories obtained open coding process total grouped commit messages issue tracker discussions 28 categories organized seven groups described detail rest section Additionally 430 commits 161 issue discussions identified means pattern matching potentially related licensing classified false positives mainly due wide range matching keywords used filtering see Section 3 identify many commitsissues possible Finally 16 commits two issue discussions Table 10 Categories defined open coding Issue tracker discussion comments Commit notes Category C C C Java Javascript Python Ruby Overall C C C C C C C C Generic license additions Choosing license 1 0 0 0 0 6 0 2 0 1 0 11 0 License added 1 22 3 19 0 15 25 75 22 34 9 34 1 33 59 232 License change License change 2 14 1 8 1 5 3 14 4 9 2 6 2 18 15 74 License upgrade 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 4 License rollback 0 0 0 0 0 0 1 0 0 0 0 2 0 0 0 3 Removed licensing 0 3 0 3 0 4 0 6 1 8 0 2 0 3 1 29 Changes copyright Copyright added 0 6 0 3 0 2 0 0 0 2 0 2 0 0 0 15 Copyright update 2 24 0 7 1 6 5 89 2 7 2 4 1 8 13 138 License fixes Link broken 7 0 2 0 0 0 1 0 16 0 1 0 19 0 46 0 License mismatch 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 Fix licensing 4 2 0 1 0 2 1 3 2 0 0 1 2 1 9 10 License file modification 0 11 0 8 0 14 0 0 1 11 1 7 1 29 3 80 Missing licensing 1 1 0 0 0 3 2 0 7 0 12 0 4 1 26 5 License compliance Compliance discussion 1 9 0 5 1 1 0 1 0 3 0 1 0 0 2 20 Derivative work inconsistency 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 Add compatible library 0 1 0 0 0 0 3 0 0 0 0 2 0 0 3 3 Removed thirdparty code 3 13 1 8 0 1 0 1 0 2 0 4 0 3 4 32 License compatibility 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 6 Reuse 1 1 1 0 0 17 0 1 0 10 0 1 0 0 21 1 Dep license added 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 Dep license issue 2 0 0 0 0 0 1 0 1 0 0 0 0 0 0 4 ClarificationsDiscussions License clarification 2 0 2 1 1 0 19 0 2 1 4 0 2 0 32 2 Terms clarification 0 0 0 0 0 0 5 0 2 0 0 0 0 0 7 0 Verify licensing 0 
0 0 0 0 0 1 0 0 0 0 1 0 0 0 2 License agreement 0 0 0 0 0 0 2 0 2 0 0 0 0 0 0 4 Request license Licensing request 1 0 0 0 0 0 0 0 4 0 0 0 6 0 11 0 License output end user 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 related licensing possible based available information perform clear categorization Thus excluded study following discuss examples related various groups categories Generic License Additions group categories concerns cases license added file component present well discussions related choosing license added One typical example commit message related first introduction license repository mentioned “Added license page TARDIS” httpsgithubcomtardissntardiscommit07b2a072d89d45c386d5f988f04435d76464750e commit messages falling category even precise reporting exact license committed repository eg “Add MIT license Rename README include rst file extension” httpsgithubcomSchevoschevorecipedbcommitb73bef14adeb7c87c002a908384253c8f686c625 Finally commit messages automatically generated GitHub’s licensing feature present eg “Created LICENSEmd” commit messages show addition license provide rationale behind specific choice found sometimes discussions carried developers issue trackers establish license would released example one issue discussions analyzed titled “Add LICENSE file” httpsgithubcomroseduwebworkshopsissues1 webworkshops issue opener explained need deciding license adopt ii involve projects’ contributors decision “A license needs chosen repo contributors need agree chosen license list contributors enclosed below” Doubts indecision license adopt also evident several issue discussions manually analyzed “What license use BSD GNU GPL APACHE” httpsgithubcomkovmarci86d3armoryissues5 Interestingly one developer submitted issue InTeX entitled “Dual license LGPL EPL” httpsgithubcommtrintexissues1 related adding new license balance code reuse system avoiding “contagious” licensing term “contagious” used original developer system developer commented “Your package licensed GPL I’m 
lawyer far understand intention GPL LaTeX documents compiled InTeX package made available GPL think want users publish changes code dual license LGPL EPL would ensure changes code published along binary publication b code used GPL nonGPL projects See JGraphT’s relicensing background” response demonstrates potential lack understanding regarding license implications compiled LaTeX proposes duallicensing solution However original developer also indicates lack legal background willing offer duallicense based understanding stating Thank interest lawyer either intentions want changes source code InTeX made available others benefit want “contagious” copyright documents compiled InTeX However I’ve always thought InTeX precompiler given GPL FAQ answer think licensing compiler’s source code GPL limit affect copyright documents used process Unless prove wrong close issue Thus developer responds providing understanding GPL referencing response GNU regarding compiled Emacs However developer indicate openness adding new license GPL would fact applied generated LaTeX documents example particularly interesting since shows original developer’s rationale picking GPL well difficulty developers respect licensing License Change group categories concerns cases licensing statement changed one license towards different one ii license upgraded towards new version eg GPL20 GPL30 iii cases license rollback ie license erroneously changed rollback previous license needed ensure legal compliance iv cases various reasons developers removed previously added license commit messages briefly document performed change eg “Switched BSDstyle license” “Switch GPL” others partially report rationale behind change “The NetBSD Foundation granted permission remove clause 3 4 software” commit message explains permission granted license change NetBSD Foundation However committer explain reason removal two clauses commits instead detailed providing full picture happened terms licensing “Relicensed CZMQ MPLv2 
fixed source file headers removed COPYINGCOPYINGLESSER GPLv3 LPGv3 exceptions added LICENSE MPLv2 text removed ztree class cannot relicensed reintroduced foreign code wrapped CZMQ code” commit message CZMQ httpsgithubcomzeromqczmqcommiteabe063c2588cde0af90e5ae951a2798b7c5f7e4 informative reporting former license ie GPL30 LGPL30 new license ie MPL20 changes applied repository ensure compliance new licensing terms eg removal ztree class license change demonstrates move towards permissive license shown prevalent study Java projects 10httpwwwgnuorglicensesoldlicensesgpl20faqhtmlCanIUseGPLToolsForNF also found commit messages reporting rationale behind specific license changes following commit nimble httpsgithubcombradleybeddoesnimblecommite1e273ff18730d2f8e0d7c2af1951970e676c8d1 “Change License AGPL 30 Apache 20 prior first public release Several factors influenced decision largest community building making things easy possible folks get started don’t however believe Open Source Free continue investigate best way commercialize Restrictive copyleft licenses aren’t however answer” developers want enable external developers reuse system also interested commercializing product developers acknowledge copyleft licenses meet needs License Rollback observed PostGIS reverted back licensing custom license httpsgithubcompostgispostgiscommit4eb4127299382c971ea579c8596cc41cb1c089bc commit offer rationale since simply states “Restore original license terms” analysis commit emerged author relicensed system GPL earlier subsequently reverted back licensing custom license However clear rollback due misappropriation GPL incompatibility system factors Additionally found commit messages illustrating license removals necessarily indicate licensing system removed instance “Removing license declared elsewhere” httpsgithubcomrosroscommcommite451639226e9fe4eebc997962435cc454687567c “Remove extra LICENSE files One repository one license need put box either” 
httpsgithubcomopenatvenigma2commitb4dfdf09842b3dcacb2a6215fc040f7ebbbb3c03 “Remove licenses unused libraries” httpsgithubcomttopcuantocommita1e58f2c93de40ab304c494e05853957c549fd44 cases system contains redundant superfluous license files removed observation highlights strictly analyzing license changes happened history system could wrongly suggest system migrated toward closedsource third commit message instead indicates licenses removed due unused code required licenses cases adopting unnecessary licenses due thirdparty libraries longer needed carefully managed since may discourage developers reuse especially unnecessary licenses restrictive Changes Copyright group categories includes commitsissues related simple changesadditions applied copyright statement like copyright year authors Changes list author names occur indicate names people provided substantial contribution therefore claiming ownership Previous work indicated often additions occur correspondence large changes performed contributors whose names mentioned yet copyright statement Penta Germán Changes copyright years also previously investigated often added allow claiming right source code modified given year Di Penta et al 2010 License Fixes group categories related changes license mainly due various kinds mistakes formatting issues well cases licensing statement accidentally missing note different cases license addition license originally intended absent example group observed cases issues discussing license mismatch developers found conflicting headers conflicts declared license license headers former case developer posted issue gtksourcecompletion ’s issue tracker httpsgithubcomchuchiperrimangtksourcecompletionissues1 “The license states LGPL3 copyright headers source files say otherwise missing intentional license I’ve included licensecheck output below” Subsequently issue poster listed files system GPL LGPL copyright Additionally indicated cases Free Foundation address incorrect well observed 
similar situation another developer opened issue “LICENSE file doesn’t match license header svgeezyjs” httpsgithubcombenhowdle89svgeezyissues20 svgeezy’s issue tracker stated LICENSE file specifies MIT license header svgeezyjs says it’s released WTFPL correct license second case observe declared license source header consistent However issue resolved time writing paper cannot report resolution feedback offered original developers system interesting cases ones related fix missing licenses Often developers made aware missing licenses via issue tracker projects’ users reporting issue Sometimes complete may unlicensed leading discussions like one titled “GNU LGPL license missing” rcswitchpi httpsgithubcomr10rrcswitchpiissues17 license source code published heavily based wiringpi rcswitch rcswitch GNU Lesser GPL wiringpi GNU Lesser GPL GNU Lesser GPL added httpwwwgnuorglicenseslgplhtml Based project’s characteristics ie foundations previously existing projects developer recommends addition missing LGPL license commits issues falling License File Modification category related changes applied license file type name example developers may change license file default LICENSEmd file generated GitHub txt rtf Additionally developers change file name often make meaningful illustrated commit message Haml httpsgithubcomhamlhamlcommit537497464612f1f5126a526e13e661698c86fd91 “Renamed LICENSE MITLICENSE don’t open file find license released Also wrapped 80 characters I’m picky edited” quote edited language typical changes concern renaming COPYRIGHT file LICENSE move license file project’s root directory cases indicate changes towards different license general change license semantics way license presented License Compliance group categories probably interesting analyze concerns categories related discussions changes license compliance Specifically generic compliance discussions cases derivative work’s legal inconsistency spotted discussed ii compatible library added replace another 
incompatible library licensing point view iii thirdparty code completely removed legallycompliant alternative possible iv cases discussion related license compatibility context reuse v cases added dependency existing dependency conflicts current license interesting example issue discussion entitled “Using OpenSSL violates GPL licence” SteamPP httpsgithubcomseishunSteamPPissues1 Surprisingly developer initially commented gnutls libnss terrible documentation don’t consider priority issue anyway would like submit pull request guest Despite initial reaction OpenSSL library replaced Crypto within week order meet licensing requirements Examples thirdparty libraries removed due licensing issues also prevalent commit messages eg “Remove elle1 editor due incompatible license” httpsgithubcombooster23minixwallcommit342171fa9e9d769ce4aa48525142a569b34962f7 incompatibility case due elle ’s clause explicitly reporting “NOT sold made part licensed products” Additionally saw commit wkhtmltopdfqtbatch files removed due recommendation project’s legal staff “Remove files instructed Legal department” httpsgithubcomalexkoltunwkhtmltopdfqtbatchcommit9b142a07a7576afa15ba458e97935aac5921ef8d shows license compliance may always straightforward developers may need rely legal council order determine whether licensing terms met also observed changes system’s licensing aimed satisfying compliance thirdparty code gubg httpsgithubcomgfannesgubgdeprecatedcommit4d291ef433f0596dbd09d5733b25d27b3a921cf4 Changed license LGPL able use msgpack implementation GET nv Similarly found issue tracker discussions conflicting licenses compatibility licenses thirdparty libraries Interestingly issue opened noncontributor androidsensorium httpsgithubcomfmetzgerandroidsensoriumissues11 stating Google Play Services GMS proprietary hence compatible GNU LGPL jar inside Android library referred projectproperties FDroidorg publishes o3gm package cant publish without removing library Thus license incompatibility created 
potential license violation also prevented noncontributor cataloging system among projects hosted FDroid httpsfdroidorg wellknown forge open source Android apps Additionally observed issues related reuse one contributor suggests dual license allow greater reuse applications contributor pythonhpilo httpsgithubcomseveaspythonhpiloissues85 stated Due incompatibility GPLv3 Apache 20 hard use pythonhpilo instance OpenStack would therefore helpful code could also released permissive license like instance Apache 20 OpenStack licensed contributors subsequently utilized thread vote ultimately agreed upon dual license example indicate consideration reuse also demonstrates licensing decisions determined copyright holders single developer also important note GPL30 Apache20 considered incompatible Free Foundation Conversely also observed interesting discussion issue posted patchelf httpsgithubcomNixOSpatchelfissues37 asked “Is possible change GPL LGPL would help using software” developer posting question developing system licensed BSD license GPL would compatible contributor refused change licensing stating “GPL would compatible” Moreover one contributors explained changing licensing nontrivial responding wouldn’t easy change license given contains code several contributors would need approve change response highlights importance contributors approve license change However reaching agreement among contributors might far trivial due personal biases developers could respect licensing Vendome et al 2015b also observed case related derivative work license differed original system’s licensing category Derivative Work Inconsistency developer created issue “Origin License Issue” tablib httpsgithubcomkennethreitztablibissues114 offered support first noted tablib MITlicensed several potential provenance license issues Oo XLS XLSX formats tablib embeds collected potential issues best byzantine httpsbitbucketorgericgazoniopenpyxl reported derived PHPExcel LGPLlicensed 
httpsgithubcomPHPOfficePHPExcel openpyxl LGPL MITlicensed really derived possible issue license may original derivative issue poster lists various components used licensing point incompatibility issues particular related derivative code system utilizes ClarificationsDiscussions group categories contains issues related clarifying project’s licensing terms implications licensing agreement contributors made Contributor License Agreement CLA License Clarification actual license typically occurred system contain license file ie declared license example one project’s user created issue “Please add LICENSE file” Mozilla’s 123done httpsgithubcommozilla123doneissues139 stating repo public it’s easy find I’m allowed use share code Could add LICENSE file make easier users understand you’d like used Similarly another pyelection issue “What license code released under” httpsgithubcomalexpyelectionissues1 comments poster Thus observe developers use issue tracker mean understand licensing request explicit licensing file Another surprising issue discussion related understanding terms license issue posted neunode’s issue tracker httpsgithubcomsnakajimaneunodeissues5 external developer looking reuse code asked impressed you’ve done neuNode interested using offline mapping applications However work company 1M revenue license terms say MIT companies less 1M revenue approach I’ve seen Please could clarify license terms company larger We’re trying make decisions direction moment quick response would appreciated possible Interestingly license terms set conditions based money value company looking reuse code case external developer’s company exceeds threshold original developer indicates intended benefit developer community whole specifically students individuals original developer gave two options large check without maintenance support ii detail descriptions product compelling argument giving free license reuse system acknowledgment description Thus original developer interested 
financial gain though could reasonably convinced right price rather wants support open source community receive credit work identified category License Agreement scenario arises external developer submits code contribution contributors require developer complete Contributor License Agreement CLA avoid licensingcopyright disputes observed discussion related updating textual information project’s CLA respect country designations httpgithubcomadobebracketsissues8337 Similarly previous Java study Vendome et al 2015a developer submitted patch could merged system developer filled CLA httpsgithubcomFasterXMLjacksonmodulejsonSchemaissues35 CLA makes explicit author contribution granting recipient right reuse distribute contribution Brock 2010 Thus prevents contributed code becoming ground potential lawsuit Request License group contains issue discussions developer asks license license file similar reuse differs since developers necessarily state want reuse system since possible want contribute well Thus generic requests developer attribute license system without explaining reason request example found issue titled “No license included repository” jquerybrowserify httpsgithubcomjmarsjquerybrowserifyissues20 poster commented Would consider adding license repository It’s currently missing one according TOS posting license means retain rights source code nobody else may reproduce distribute create derivative works work might intend Even intend publish source code public repository GitHub accepted Terms Service allow GitHub users rights Specifically allow others view fork repository want share work others strongly encourage include open source license don’t intend putting license that’s fine want use open source license please I’d happy forkPR let know license want put MITBSDApacheetc comment demonstrates licensing also impacts derivative work prevent developers contributing system important distinction since findings prior work Vendome et al 2015a b Sojer Henkel 2010 
demonstrate licensing could impediment reuse impediment contribute towards projectsystem License Output End User category describes unique case issue posted regarding output license end user issue stated “This output could read monitoring tools example automatically warn expiration although Phusion also emails expiration warnings desired upfront time warning configurable like that” httpgithubcomphusionpassengerissues1482 Unlike previous categories issue relates end user licensing contributor system suggests inclusion feature aid monitoring license expiration Interestingly category shows developers also consider licensing impact “client” using system aspect understanding impact licensing “client” end user also unexplored prior studies 442 Analysing Commits Implementing Atomic License Changes Java Systems analysis specifically targeted commit messages licensing change occurred could understand rationale behind change apply keyword commit messages since knew commits related changes licensing reading commits also included atomic license change pattern observed particular commit add context observed new support existing categories results reported Table 11 refer new support commit messages indicating new rationale existing categories
Table 11 Categories defined open coding commit messages license change occurred
Category                            Commits
Generic license additions
  Choosing license                  0
  License added                     63
License change
  License change                    9
  License upgrade                   1
  License rollback                  1
  License removal                   19
Changes copyright
  Copyright added                   0
  Copyright update                  1
License fixes
  Link broken                       0
  License mismatch                  0
  Fix missing licensing             9
  License file modification         0
  Missing licensing                 1
License compliance
  Compliance discussion             0
  Derivative work inconsistency     0
  Add compatible library            0
  Removed thirdparty code           1
  License compatibility             0
  Reuse                             0
  Dep license added                 0
  Dep license issue                 0
ClarificationsDiscussions
  License clarification             0
  Terms clarification               0
  Verify licensing                  0
  License agreement                 0
Request license
  Licensing request                 0
License output end user
  Output licensing                  0
License Change group categories observed general messages indicating license change occurred cases explicitly stating new license following commit messages “Rewrite get LGPL code” “Changed license Apache v2” two commit messages offer rationale least indicate new license attributed system developer inspecting change history would able accurately understand particular license change Since observed many instances license → license prevalence License Added expected However License Added commit messages resembled License Change messages since often include clear rationale ie part License Added category level detail similar License Change category example developer asserted Apache20 license headers source files across commit message simply stated “Enforce license” case License Removal observed licenses removed due code clean files deletion dependencies removal example observed removal GPL20 license following commit message “No smoketestclientlib” indicates removal previously exploited library Additionally licenses removed developers cleaned Fix Missing Licensing related license addition occurred author intended license file forgot either initial commit commit introducing licensing example one commit message stated “Added missing Apache License header” indicates available source code may inaccurately seem unlicensed Additionally License Upgrade refers license change version license modified recent particular case observed change GPL20 GPL30 commit message stated “Change copyright header refer version 3 GNU General Public License point readers COPYING3 file FSF’s license web page” commit message describes version change supply rationale Instead message log changes important observation second round analysis ambiguity commit messages example observed commit classified Copyright Update stating “Updated copyright info” However commit corresponded change licensing GPL20 LGPL21 case illustrates lack detail 
offered developers commit messages illustrates update significant adding header changing copyright year Since sampled commits Java projects infeasible sample larger representative number commit messages Thus augmenting second round considering commits atomic license change occurred benefited taxonomy targeting relevant commits better However able sample statistically representative sample sizes work due prefiltering projects results corroborate representativeness since observed categories Another important observation appears support supposition traceability analysis developers remove licensing related issues issue tracker found links removed period time crawling data analysis categorized Link Broken amounted 45 overall issues also possible cases represent developers utilize external bug tracking systems well Summary RQ 4 open coding analysis based grounded theory indicated lack documentation eg prevalence false positives poor quality documentation respect licensing issue tracker discussion commits notes formally categorized available rationale also found rationale may incomplete ambiguously describe underlying change eg “Updated copyright info” representing change different licenses Finally observed issue trackers also served conduits authors external developers discuss licensing
::::
5 Lessons Implications analysis commit messages issue tracker discussions highlighted information offered respect licensing choicechange often quite limited developer interested reusing code would forced check source code component understand exact licensing ask clarification using issue tracker example Additionally reason behind change usually well documented detail particularly important system uses externalthirdparty libraries since license may change addition removal libraries important observation open coding analysis also stresses need better licensing traceability aid explaining license grantsrestrictions found several instances issue tracker used ask clarifications regarding licensing external developers sought reuse code example observed developers interpret implications licensing differently generates misunderstandings terms reuse suggests code reuse problematic developers due licensing Therefore study demonstrates need clear explicit licensing information projects hosted forge Similarly observed external developers would request license since projects appeared unlicensed however subset requests due licensing attributed different manner external developers expected eg part gemspec file Ruby projects LICENSE file also observed developers adding license files parent directories opposed headers source code well appending license name license file eg LICENSE would renamed LICENSEMIT way declaring license particularly used GitHub system asks developers choose license created creates LICENSE file project’s root directory observations indicate lack standardization licensing expressed among projects language projects across different languages suggests developers need standardized mechanism declare license Thirdparty tools forges could support developers maintaining standardized documentation automatically Another important observation type difficulty developers licensing thirdparty code ways achieve compliance observe issue discussions commit messages libraries 
removed due incompatible licensing terms Conversely libraries also chosen due particular license source code feature important open source developers aim wide adoption systems choice licensing may directly impact adoption suggests choice licensing directly impact adoption libraries Therefore foresee librarycode recommenders based open source code base license aware consideration applies example approaches recommending code examples libraries sensing developers’ context Cubranic et al 2005 Holmes Murphy 2005 Ponzanelli et al 2013 2014 words one hand project’s license relevant part context hand code search engines eg Grechanik et al 2010 McMillan et al 2012a b c 2011 2013 consider target code license constraint search lack traceability licensing changes also important researchers investigating licensing GitHub cannot generalize features suggest commit message analysis may largely incomplete respect details licensingrelated changes made commit One way achieve developers take advantage summarization tools ARENA Moreno et al 2014 ChangeScribe CortésCoy et al 2014 LinaresVásquez et al 2015 ARENA analyzes documents licensing changes release level ChangeScribe automatically generates commit messages however using ChangeScribe would require extending analyze licensing changes commit level Another option forges tools general verify every file contains license every properly documents license feature could optional summary would greatly improve traceability license changes rationale assert consistency among repositories Also would beneficial developers using another informed licensing change occurs example developer could mark specific projects dependents receive automated notifications particular changes occur would beneficial licensing since change license dependency could result license incompatibilities open coding commit messages issue tracker discussions also suggests commercial usage code concern open source community Currently MITX license Apache license seem 
prominent licenses purpose Indeed also quantitative analysis Java projects showed trend towards use permissive licenses lack license important consideration open source development since suggests code may fact closed source copyrighted original author observed issues discussions related lack licensing since hindered reuse Indeed sometimes developers initiate open source without attributing license either lack deep knowledge importance licensing possibility disallowing certain types reuse code Vendome et al 2015b also limited support task choosing suitable license Existing tool support Choose License helps users choosing license tool completely contextinsensitive respect constraints imposed better contextsensitive tool support provided Markos Bavota et al 2014 mainly provides list compatible licenses given component httpchoosealicensecom 6 Threats Validity Threats construct validity concern relationship theory observation relate possible measurement imprecision extracting data used study mining Git repositories relied GitHub API git command line utility tools active development community supporting Additionally GitHub API primary interface extract information cannot exclude imprecision due implementation API terms license classification rely Ninka stateoftheart approach shown 95 precision Germán et al 2010b however always capable identifying license 15 time study concerns open coding performed context RQ4 identified stratified sampling sample commit messages issue tracker discussions large enough ensure error ±10 confidence level 95 sample identified starting candidate commit messages discussions identified means pattern matching using keywords Table 1 Although aimed build comprehensive set licensingrelated keywords possible missed licensingrelated discussions matching keywords Threats internal validity related confounding factors internal study could affected results atomic licensing changes reduced threat size confounding factor representing presences particular 
change commit license change typically handled given instance frequency using commitlevel analysis prevent number files inflating results inappropriately suggest large numbers changes occurred analyze changes across projects took binary approach analyzing presence pattern Therefore particular would dominate results due size limit subjectiveness open coding classifications always performed two authors every case discording classification discussed explained Section 33 Threats external validity represent ability generalize observations study quantitative study based analysis 16K Java projects makes us confident findings good generalizability concerns Java systems cannot extended systems written programming languages qualitative study performed instead commit messages issue discussions extracted systems written seven different languages However generalizability qualitative results limited seven considered languages supported relatively low number considered systems ie 1160 due manual effort required identification rationale behind licensing decisions well limited number potential repositories licenserelated commit messages issue discussions GitHub’s exponential growth popularity public forge indicates represents large portion open source community exponential growth relative youth projects seen impacting data two characteristics represent growth open source development discounted Additionally GitHub contains large number repositories may necessarily comprehensive set open source projects even Java projects However large number projects dataset relatively high diversity metrics values shown Section 34 gives us enough confidence obtained findings evaluation projects across open source repositories programming languages quantitative part would necessary validate observations general context also important note observations consider open source projects Since need extract licenses source code consider closed source projects cannot assert results would representative 
closed source projects
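
The threats discussion above states that the open-coding samples were sized for an error of ±10 at a confidence level of 95. As a rough illustration of what that implies, the sketch below computes a sample size with Cochran's formula and a finite population correction. The source does not say which formula was used, and the study sampled per stratum, so this is only an approximation under stated assumptions: the function name `sample_size`, the default proportion p = 0.5, and the example population sizes are all hypothetical.

```python
import math


def sample_size(population: int, margin: float = 0.10,
                z: float = 1.96, p: float = 0.5) -> int:
    """Approximate sample size for estimating a proportion.

    Assumptions (not taken from the source): simple random sampling,
    z = 1.96 for a 95% confidence level, and the worst-case p = 0.5.
    """
    # Cochran's formula for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction shrinks the sample for small populations.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical candidate-set sizes, for illustration only.
    for candidates in (50, 1000, 10 ** 9):
        print(candidates, sample_size(candidates))
```

Under these assumptions the required sample never exceeds 97 items, however large the candidate set is, which is consistent with a ±10 margin keeping manual coding tractable.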
::::
7 Conclusions paper reported empirical study aimed analyzing quantitative qualitative point view adoption change licenses open source projects hosted GitHub study consists quantitative part studied license usage licensing changes set 16221 Java projects hosted GitHub ii qualitative analysis analyzed commit messages issue tracker discussions 1160 projects hosted GitHub developed using seven popular programming languages ie C C C Java Javascript Python Ruby quantitative analysis Java projects aimed providing overview kinds licenses used time different projects ii analyzing licensing changes iii identifying traceability links licensing changes licensingrelated discussions Results indicated – New license versions quickly adopted developers Additionally new license versions restrictive licenses eg GPL30 vs GPL20 favored longer survival earlier versions unlike earlier version permissive licenses seem disappear – Licensing changes predominantly toward permissive licenses ease kind derivative work redistribution eg within commercial products – clear lack traceability discussions related license changes qualitative analysis based open coding procedure inspired grounded theory Corbin Strauss 1990 aimed categorizing licensingrelated discussions commits results indicate – Developers post questions issue tracker ascertain project’s license andor implications license suggesting licensing difficult – lack standardization consistency licensing attributed system within programming language across different programming languages causes misunderstandings confusion external developers looking reuse system – Developers general supply detailed rationale document changes commit messages issue tracker discussions – License compatibility impact adoption removal thirdparty library due issues license compliance work mainly exploratory nature aimed empirically investigating license usage licensing changes quantitative qualitative points view Nevertheless different possible uses one make 
results paper results indicate developers frequently deal licensingrelated issues highlighting need developing semiautomatic recommendation systems aimed supporting license compliance verification management Additionally tools compatible integrated within forge support licensing documentation change notification education ie picking appropriate license compatibility would benefit developers attempting reuse code working direction one aware possible factors could influence usage specific licenses factors motivating licensing changes paper provides solid empirical results analysis factors real developers Future work area aim extending study performing larger quantitative qualitative analysis projects ii performing deeper investigation rationale licensing changes example performing analysis dependencies projects relating analysis changes performed Last least discussed Section 5 would useful incorporate licensing analysis existing recommender systems recommenders could rely local project’s context also exploit rationale previous licensing changes produce recommendations Acknowledgments work supported part NSF CAREER CCF1253837 grant Massimiliano Di Penta partially supported Markos funded European Commission Contract Number FP7317743 opinions findings conclusions expressed herein authors’ necessarily reflect sponsors References 123done issue 139 httpsgithubcommozilla123doneissues139 androidsensorium issue 11 httpsgithubcomfmetzgerandroidsensoriumissues11 Bavota G Canfora G Di Penta Oliveto R Panichella 2013 evolution interdependencies ecosystem case apache280–289 Bavota G Ciemniewska Chulani De Nigro Di Penta Galletti Galoppini R Gordon TF Kedziora P Lener Torelli F Pratola R Pukacki J Rebahi Villalonga SG 2014 market open source intelligent virtual open source marketplace 2014 evolution week IEEE conference maintenance reengineering reverse engineering CSMRWCRE 2014 Antwerp Belgium February 36 2014 pp 399–402 brackets issue 8337 httpgithubcomadobebracketsissues8337 
Brock 2010 harmony inbound transfer rights FOSS projects Intl Free Open Source Law Review 22139–150 Cass 2015 top ten programming languages httpspectrumieeeorgcomputingsoftwarethe2015toptenprogramminglanguages Corbin J Strauss 1990 Grounded theory research procedures canons evaluative criteria Qual Sociol 1313–21 CortésCoy LF LinaresVásquez Aponte J Poshyvanyk 2014 automatically generating commit messages via summarization source code changes 2014 IEEE 14th international working conference source code analysis manipulation SCAM IEEE pp 275–284 Cuanto commit httpsgithubcomttopcuantocommita1e58f2c93de40ab304c494e05853957c549fd44 Cubranic Murphy GC Singer J Booth K 2005 Hipikat memory development IEEE Trans Softw Eng 316446–465 Czmq commit httpsgithubcomzeromqczmqcommiteabe063c2588cde0af90e5ae951a2798b7c5f7e4 d3armory issue 5 httpsgithubcomkovmarci86d3armoryissues5 Di Penta Germán DM Antoniol G 2010 Identifying licensing jar archives using codesearch approach Proceedings 7th international working conference mining repositories MSR 2010 Colocated ICSE Cape Town South Africa May 2–3 2010 Proceedings pp 151–160 Di Penta Germán DM Guéhéneuc Antoniol G 2010 exploratory study evolution licensing Proceedings 32nd ACMIEEE international conference engineering Volume 1 ICSE 2010 Cape Town South Africa 1–8 May 2010 pp 145–154 Dickey DA Fuller WA 1979 Distributions estimators autoregressive time series unit root J Stat Assoc 74427–431 Dickey DA Fuller WA 1981 Likelihood ratio statistics autoregressive time series unit root Econometrica 4941057–1072 Doll B octoverse 2012 httptinyurlcommuyxkru Last accessed 20150115 Dyer R Nguyen HA Rajan H Nguyen TN 2013 Boa language infrastructure analyzing ultralargescale repositories 35th international conference engineering ICSE ’13 San Francisco CA USA May 18–26 2013 pp 422–431 enigma2 commit httpsgithubcomopenatvenigma2commitb4dfdf09842b3dcacb2a6215fc040f7ebbbb3c03 Free Foundation 2015 Categories free nonfree 
httpswwwgnuorgphilosophycategorieshtml Last accessed 20150115 FDroid httpsfdroidorg Last accessed 20150115 Germán DM Hassan AE 2009 License integration patterns addressing license mismatches componentbased development 31st international conference engineering ICSE 2009 May 1624 2009 Vancouver Canada Proceedings pp 188–198 Germán DM Di Penta Guéhéneuc siblings G Antoniol 2009 Code technical legal implications copying code applications Proceedings 6th international working conference mining repositories MSR 2009 Colocated ICSE Vancouver BC Canada May 1617 2009 Proceedings pp 81–90 Germán DM Di Penta Davies J 2010a Understanding auditing licensing open source distributions 18th IEEE international conference program comprehension ICPC 2010 Braga Minho Portugal June 30July 2 2010 pp 84–93 Germán DM Manabe Inoue K 2010b sentencematching method automatic license identification source code files ASE 2010 25th IEEEACM international conference automated engineering Antwerp Belgium September 20–24 2010 pp 437–446 GitHub API httpsdevelopergithubcomv3 Last accessed 20150115 GNU General Public License 2015 httpwwwgnuorglicensesgplhtml Last accessed 20150115 gtksourcecompletion issue 1 httpsgithubcomchuchiperrimangtksourcecompletionissues1 Gobeille R 2008 FOSSology Proceedings 2008 international working conference mining repositories MSR 2008 Colocated ICSE Leipzig Germany May 10–11 2008 Proceedings pp 47–50 Grechanik Fu C Xie Q McMillan C Poshyvanyk Cumby C 2010 search engine finding highly relevant applications Proceedings 32Nd ACMIEEE international conference engineering Volume 1 ICSE ’10 New York NY USA ACM pp 475–484 gubg commit httpsgithubcomgfannesgubgdeprecatedcommit4d291ef433f0596dbd09d5733b25d27b3a921cf4 Holmes R Murphy GC 2005 Using structural context recommend source code examples 27th international conference engineering ICSE 2005 15–21 May 2005 St Louis Missouri USA pp 117–125 Howison J Conklin Crowston K FLOSSmole collaborative repository FLOSS research data 
analyses IJITWE’06 117–26 Haml commit httpsgithubcomhamlhamlcommit537497464612f1f5126a526e13e661698c86fd91 Intex issue 1 httpsgithubcommtrintexissues1 jacksonmodulejsonschema issue 35 httpsgithubcomFasterXMLjacksonmodulejsonSchemaissues35 jquerybrowserify issue 20 httpsgithubcomjmarsjquerybrowserifyissues20 LinaresVásquez CortésCoy LF Aponte J Poshyvanyk 2015 ChangeScribe tool automatically generating commit messages 37th IEEEACM international conference engineering ICSE’15 formal research tool demonstration page appear Manabe Hayase Inoue K 2010 Evolutional analysis licenses FOSS Proceedings joint ERCIM workshop evolution EVOL international workshop principles evolution IWPSE Antwerp Belgium September 20–21 2010 pp 83–87 ACM McMillan C Grechanik Poshyvanyk Xie Q Fu C 2011 Portfolio finding relevant functions usage Proceedings 33rd international conference engineering ICSE ’11 New York NY USA ACM McMillan C Grechanik Poshyvanyk 2012a Detecting similar applications pp 364–374 McMillan C Grechanik Poshyvanyk Fu C Xie Q 2012b Exemplar source code search engine finding highly relevant applications IEEE Trans Softw Eng 3851069–1087 McMillan C Hariri N Poshyvanyk ClelandHuang J Mobasher B 2012c Recommending source code use rapid prototypes Proceedings 34th international conference engineering ICSE 12 Piscataway NJ USA IEEE Press pp 848–858 Mcmillan C Poshyvanyk Grechanik Xie Q Fu C 2013 Portfolio searching relevant functions usages millions lines code ACM Trans Softw Eng Methodol 224371–3730 minixwall commit httpsgithubcombooster23minixwallcommit342171fa9e9d769ce4aa48525142a569b34962f7 Moreno L Bavota G Di Penta Oliveto R Marcus Canfora G 2014 Automatic generation release notes Proceedings 22nd ACM SIGSOFT international symposium foundations engineering FSE22 Hong Kong China November 16–22 2014 pp 484–495 Nagappan Zimmermann Bird C 2013 Diversity engineering research Joint meeting European engineering conference ACM SIGSOFT symposium foundations engineering ESECFSE13 
Saint Petersburg Russian Federation August 18–26 2013 pp 466–476 neunode issue 5 httpsgithubcomsnakajimaneunodeissues5 Nimble commit httpsgithubcombradleybeddoesnimblecommite1e273ff18730d2f8e0d7c2af1951970e676c8d1 Oracle MySQL FOSS License Exception httpwwwmysqlcomaboutlegallicensingfossexception Last accessed 20150115 Passenger issue 1482 httpgithubcomphusionpassengerissues1482 patchelf issue 37 httpsgithubcomNixOSpatchelfissues37 Penta MD Germán DM 2009 source code contributors change 16th working conference reverse engineering WCRE 2009 13–16 October 2009 Lille France pp 11–20 PF OpenBSD Packet Filter httpwwwopenbsdorgfaqpf Last accessed 20150115 Ponzanelli L Bacchelli Lanza 2013 Leveraging crowd knowledge comprehension development 17th european conference maintenance reengineering CSMR 2013 Genova Italy March 5–8 2013 pp 57–66 Ponzanelli L Bavota G Di Penta Oliveto R Lanza 2014 Mining stackoverflow turn IDE selfconfident programming prompter 11th working conference mining repositories MSR 2014 Proceedings May 31 June 1 Hyderabad India pp 102–111 Postgis commit httpsgithubcompostgispostgiscommit4eb4127299382c971ea579c8596cc41cb1c089bc pyelection issue 1 httpsgithubcomalexpyelectionissues1 pythonhpilo issue 85 httpsgithubcomseveaspythonhpiloissues85 rcswitchpi issue 17 httpsgithubcomr10rrcswitchpiissues17 Roscomm commit httpsgithubcomrosroscommcommite451639226e9fe4eebc997962435cc454687567c schevorecipedb commit httpsgithubcomSchevoschevorecipedbcommitb73bef14adeb7c87c002a908384253c8f686c625 Singh P Phelps C 2009 Networks social influence choice among competing innovations Insights open source licenses Inf Syst Res 243539–560 Sojer Henkel J 2010 Code reuse open source development Quantitative evidence drivers impediments J Assoc Inf Syst 1112868–901 Package Data Exchange SPDX httpspdxorg Last accessed 20150115 State Octoverse 2012 httpsoctoversegithubcom Last accessed 20150115 Steampp issue 1 httpsgithubcomseishunSteamPPissues1 svgeezy issue 20 
httpsgithubcombenhowdle89svgeezyissues20 tablib issue 114 httpsgithubcomkennethreitztablibissues114 Tardis commit httpsgithubcomtardissntardiscommit07b2a072d89d45c386d5f988f04435d76464750e BSD 2Clause License httpopensourceorglicensesBSD2Clause Last accessed 20150115 Tuunanen Koskinen J Kärkkäinen 2009 Automated license analysis Softw Autom Eng 1634455–490 Vendome C LinaresVásquez Bavota G Di Penta Germán DM Poshyvanyk 2015a License usage changes largescale study Java projects GitHub 23rd IEEE international conference program comprehension ICPC 2015 Florence Italy May 18–19 2015 IEEE Vendome C LinaresVásquez Bavota G Di Penta German DM Poshyvanyk 2015b developers adopt change licenses 31st IEEE international conference maintenance evolution ICSME 2015 Bremen Germany September 29 October 1 2015 pages 31–40 IEEE Christopher Vendome fourth year PhD student College William Mary member SEMERU Research Group advised Dr Denys Poshyvanyk received BS Computer Science Emory University 2012 received MS Computer Science College William Mary 2014 main research areas maintenance evolution mining repositories provenance licensing member IEEE ACM Gabriele Bavota assistant professor Free University BolzanoBozen received cum laude Laurea Computer Science University Salerno Italy 2009 defending thesis Traceability Management advised Prof Andrea De Lucia received PhD Computer Science University Salerno 2013 Form January 2013 October 2014 research fellow Department Engineering University Sannio research interests include maintenance evolution refactoring systems mining repositories empirical engineering information retrieval Massimiliano Di Penta associate professor University Sannio Italy since December 2011 assistant professor University since December 2004 research interests include maintenance evolution mining repositories empirical engineering searchbased engineering servicecentric engineering currently involved principal investigator University Sannio European code search 
licensing issues MARKOS wwwmarkosprojecteu Mario LinaresVásquez PhD candidate College William Mary advised Dr Denys Poshyvanyk cofounder liminal ltda received BS Systems Engineering Universidad Nacional de Colombia 2005 MS Systems Engineering Computing Universidad Nacional de Colombia 2009 research interests include evolution maintenance architecture mining repositories application data mining machine learning techniques support engineering tasks mobile development member IEEE ACM Daniel German Associate Professor University Victoria Victoria Canada received PhD degree Computer Science University Waterloo Canada research interests engineering particular evolution open source intellectual property Denys Poshyvanyk Associate Professor College William Mary Virginia received PhD degree Computer Science Wayne State University 2008 also obtained MS degrees Computer Science National University KyivMohyla Academy Ukraine Wayne State University 2003 2006 respectively research interests engineering maintenance evolution program comprehension reverse engineering repository mining source code analysis metrics member IEEE ACM
::::
Maintaining interoperability open source case study Apache PDFBox Simon Butler Jonas Gamalielsson Björn Lundell Christoffer Brax Anders Mattsson Tomas Gustavsson Jonas Feist Erik Lönroth University Skövde Skövde Sweden Combitech AB Linköping Sweden Husqvarna AB Huskvarna Sweden PrimeKey Solutions AB Stockholm Sweden RedBridge AB Stockholm Sweden Scania AB Södertälje Sweden ABSTRACT interoperability commonly achieved implementation standards communication protocols data representation formats Standards documents often complex difficult interpret may contain errors inconsistencies lead differing interpretations implementations inhibit interoperability case study two years activity Apache PDFBox examine daytoday decisions made concerning implementation PDF specifications standards community open source OSS Thematic analysis used identify semantic themes describing context observed decisions concerning interoperability Fundamental decision types identified including emulation behaviour dominant implementations extent implement PDF standards Many factors influencing decisions related sustainability influences result decisions made external actors including developers dependencies PDFBox article contributes fine grained perspective decisionmaking interoperability contributors community OSS study identifies decisions made support continuing technical relevance factors motivate constrain activity © 2019 Authors Published Elsevier Inc open access article CC license httpcreativecommonsorglicensesby40 Introduction Many projects seek implement one standards support interoperability example interconnected systems implement standardised communications protocols open systems interconnect stack web standards 
including hypertext transfer protocol HTTP secure sockets layer SSL support information exchange commercial activities Wilson 1998 Treese 1999 Ko et al 2011 businesses civil society — governments national local level legal system — move away paper documents Lundell 2011 Rossi et al 2008 rely increasingly digitised systems implementation communication protocols document standards becomes ever crucial Rossi et al 2008 Wilson et al 2017 Lehtonen et al 2018 Standards written humans despite care taken creation imperfect vague ambiguous open interpretation implemented Allman 2011 Egyedi 2007 Furthermore practice evolves implementations often seen de facto reference standard diverge published standard case JPEG image format Richter Clark 2018 Indeed practice example HTML CSS JavaScript repeatedly deviate standards sometimes intention locking users specific products W3C 2019a Bouvier 1995 Phillips 1998 consequence web content becomes challenging implement access Phillips 1998 archive Kelly et al 2014 interoperability relies standards different implementations given standard interpretations standard may fully interoperable Egyedi 2007 Consequently developers implementations become involved discourse find common understanding standard supports interoperability illustrated Allman 2011 Lehtonen et al 2018 Watteyne et al 2016 means interoperability achieved varies Internet Engineering Task Force IETF IETF 2019a example uses process often summarised “Rough consensus running code” Davies Hoffmann 2004 requires interoperability independent implementations achieved early standardisation process Wilson 1998 increasing proportion implements communication data standards particularly nondifferentiating developed collaboration companies working community open source OSS projects Lundell et al 2017 Butler et al 2019 community OSS mean OSS projects managed foundations collectively organised Riehle 2011 many developers directed companies organisations collaborate create high quality 
Fitzgerald 2006 Examples process include OSS projects umbrella Eclipse Internet Things Working Group Eclipse IoT Working Group 2019 LibreOffice Document Foundation 2019 many cases domains OSS proprietary solutions available standard need interoperate remain relevant products literature documents process standardisation technical challenges implementing standards compliant little research focuses participants OSS projects decide implement standard revise implementation correct improve behaviour explicate challenges facing community OSS projects developing standards compliant daytoday decisions made contributors study investigates following research question community OSS maintain interoperability address research question case study Gerring 2017 Walsham 2006 two years contributions Apache PDFBox1 OSS PDFBox governed Apache Foundation ASF ASF 2019a develops maintains mature Black Duck 2019 Java library tools create process Portable Document Format PDF documents Lehmkuhler 2010 PDFBox used OSS projects Apache Tika 2019 CEF Digital 2019 Khudairi 2017 component proprietary products services PDFBox described Section 32 Developed 1990s PDF widelyused file format distributing documents created processed read many different applications multiple platforms Versions PDF defined number specifications standards documents including formal ISO standards implementers need follow ensure interoperability evidence PDF standards challenging implement Bogk Schöpl 2014 Endignoux et al 2016 quality PDF documents varies Lehtonen et al 2018 Lindlar et al 2017 dominance Adobe’s products creates user expectations need met developers PDF Gamalielsson Lundell 2013 Endignoux et al 2016 Amiouny 2017 2016 following section provide background description PDF also review related academic literature Section 3 details reasons purposeful sampling Patton 2015 PDFBox case study subject also identify data sources investigated case study give account application thematic analysis Braun Clarke 2006 
identify semantic themes types decisions concerning interoperability PDFBox made contributors factors influencing decisions analysis data identified four fundamental types decision made concerning interoperability PDFBox related compliance published PDF specifications standards types decision technical circumstances made described Section 4 also provide account factors identified influence decisions including resources knowledge influence external actors developers PDF creators documents discuss challenges faced PDFBox Section 5 including technical challenges faced developers PDF potential solutions Thereafter consider behaviour contributors PDFBox sustains long term Lastly present conclusions Section 6 identify contributions made study Background related work 21 Standards development interoperability development standards information communications technologies undertaken companies organisations using range approaches eg whether technology implemented standard developed working practices standards body involved One perspective standards two different types origin standards specified standards bodies eg ISO ITU others arise extensive widespread use particular technology regardless whether developed one company collaboratively Treece 1999 Another perspective standards either requirementled implementationled Phipps 2019 Phipps director sometime President Open Source Initiative argues primary use requirementled model standardisation used create market example development 5G Nikolich et al 2017 contrast implementationled standards developed support innovation data format adopted wider audience creating company standardisation necessary support interoperability third view provided Lundell Gamalielsson 2017 identify standards developed implemented forms basis standardisation process including PDF development standards parallel latter process identified increasing importance telecommunications industry Wright Druta 2014 examples found standardisation process internet 
protocols managed IETF IETF 2019a IETF emphasises interoperability early stage protocol development rather technical perfection Bradner 1996 Wilson 1998 Bradner 1999 process developing interoperability low powered devices IoT domain described Ko et al 2011 record development internet protocol IP 6LoWPAN provide interoperable communications stacks two IoT operating systems ContikiOS TinyOS interoperable implementations used determine whether solutions achieved practicable types IoT devices expected use Ko et al 2011 approach interoperability development implementations standards particularly communication protocols OSS projects Companies participating Eclipse IoT Working Group 2019 example collaborate sometimes competitors OSS projects develop implementations open communications standards used IoT domain support products Butler et al 2019 Examples include implementation Open Mobile Alliance’s OMA 2019 lightweight machine machine LWM2M protocol Leshan Eclipse Foundation 2019b Wakaama Eclipse Foundation 2019c constrained application protocol CoAP Shelby et al 2014 Californium Eclipse Foundation 2019a Additionally collaborative OSS serves identify document cogent misinterpretations misunderstandings standard Butler et al 2019 22 PDF standards interoperability Adobe Systems developed PDF platformindependent interchange format documents preserve presentation independently application operating system 1993 first PDF specification made freely available number revisions specification published since see Table 1 versions specification published ISO standards eg ISO 3200012008 ISO 3200022017 including specialised subsets PDF format print industry eg ISO 159292002 ISO 1593012001 engineering applications eg ISO 2451712008 PDF documents vary size complexity single page tickets receipts order summaries academic papers large documents Government reports books Consequently PDF documents may short lifespans significantly longer life business legal records particularly organisations 
move away paper Many different packages exist create display edit process PDF files significant problem longterm use PDF many documents outlive used create Gamalielsson Lundell 2013 require standards compliant faithfully reproduce documents available arbitrary point future PDF therefore work isolation must interoperate extent implementations need able process documents created regardless long ago create documents implementations read Furthermore requirement documents readable many years future particularly case documents contracts official documentation issued governmental agencies requirements theoretical exercise practical requirements already pose problems organisations businesses example dataset examined article evidence contractors Government Netherlands created many thousands official academic transcripts PDF documents comply PDF specifications best problematic process see mailing list thread Users1 Table 5 p 8 PDF complex file format used create documents rich variety content including text images internal document links indexes fillable forms digital signatures version PDF standard cites normative references — standards — form part standard described “ indispensable application document” ISO 3200012008 ISO 2008 normative references include standards fonts image formats character encodings addition several normatively referenced standards include normative references example among normative references ISO 3200022017 part 1 early revision JPEG 2000 ISO standard ISOIEC 1544412004 turn 13 normative references including 10 IEC ISO ISOIEC standards specifications standards also define declarative programming language describes PDF documents well expected behaviours capabilities programs create process PDF documents size complexity PDF specifications ISO standards pose daunting challenge developers implementing recently published ISO 3200022017 standard example consists 984 pages 90 normative references ISO 2017 challenges complicate development works PDF files 
key challenge common perception Adobe Reader family applications de facto reference implementations PDF specifications standards performance implementations compared Amiouny 2016 Lehtonen et al 2018 Another source difficulty Robustness Principle Allman 2011 otherwise known Postel’s Law applied Adobe’s Reader products stated Postel context communication protocols “ conservative liberal accept others” Postel 1981 practice PDF reading processing implements repair mechanisms allow malformed files read within limitations limitations however documented behaviour Adobe’s products 23 Related work key aspect interoperability agreement documentation data formats communication protocols specifications standards many practical challenges standardisation process number approaches tried Ahlgren et al 2016 argue open standardisation processes needed support interoperability IoT domain example found development implementations 6TiSCH communications protocol lowpower devices Watteyne et al 2016 Watteyne et al describe iterative process interoperability testing implementations lessons learnt testing inform iterations standardisation process Another example standardisation QUIC protocol Originally implemented Google QUIC use 6 years standard developed IETF committee IETF 2019b 2019c Piraux et al 2018 Table 1 Selected PDF versions ISO standards Version ISO Standard Year Comment PDF v10 First published PDF specification PDF v14 Improved encryption added XML metadata predefined CMaps PDF v15 Added JPEG 2000 images improved encryption PDFA1 ISO 1900512005 2005 archive format standalone PDF documents based PDF v14 PDF v17 ISO 3200012008 2008 Extended range support encryption PDFA2 ISO 1900522011 2011 archive format standalone PDF documents based ISO 3200012008 PDFA3 ISO 1900532012 2012 extension PDFA2 support file embedding PDF v20 ISO 3200022017 2017 Revision ISO 3200012008 Piraux et al 2018 evaluated interoperability fifteen implementations QUIC finding shortcomings tests developed 
Piraux et al since incorporated test suites implementations tested Piraux et al 2018 Standardisation processes take long time consequently may seen inhibitor innovation De Coninck et al 2019 example cite slowness QUIC standardisation process motivation proposed plugin mechanism extend QUIC proposed implemented investigated flexible approach applications communicating QUIC negotiate extensions QUIC use connection setup De Coninck et al 2019 Standards also longlived require review revision response developments practice technology Joint Photographic Expert Group JPEG initiated number standardisation efforts update 25 year old JPEG standards image files including JPEG XT JPEG 2019 Richter Clark 2018 identify JPEG implementations differ standard difficulties applying JPEG conformance testing protocol published ISO 1091852013 ISO 2013 current implementations Richter et al identify two key issues Firstly evolution body practice building standard 25 years since made available motivates standardisation review Secondly parts current standard used practice may longer need part revised standard Richter Clark 2018 standardisation HTML CSS web technologies followed different path Standards HTML CSS developed World Wide Web Consortium W3C W3C 2019b since 1990s W3C 2019a initially auspices IETF Bouvier 1995 browser wars Bouvier 1995 companies would add functionality browsers extend standard encourage web site developers create content specifically innovative features found one browser process developing websites support variations HTML became onerous developers practitioners campaigned Microsoft Netscape adhere W3C standards Phillips 1998 WaSP 2019 Previous research development PDF two OSS projects found developers adopted specific strategies support interoperability Gamalielsson Lundell 2013 Specifically developers would exceed specification mimic dominant implementation complied implementation addition study illuminated difficulties developers interpreting PDF standard One 
issue identified lack detail parts specification made implementation imprecise unreliable Another concern expressed complexity specification inhibited implementation Gamalielsson Lundell 2013 Indeed analyses PDF perspective creating parsers found task challenging Bogk Schöpl 2014 Endignoux et al 2016 part investigation PDF Endignoux et al 2016 identify ambiguities file structures used discover bugs number PDF readers Bogk Schöpl 2014 describe experience trying create formally verified parser PDF advise creators future file format definitions ensure format “ complete unambiguous doesn’t allow unparseable constructions” Bogk Schöpl 2014 practice complexity PDF specifications lead significant security vulnerabilities implementations Mladenov et al 2018a 2018b PDFA standards see Table 1 used document preservation area concern management documents comply PDFA standards Lehtonen et al 2018 identify complexity problems faced handling documents explore mechanisms documents might repaired “wellformed valid PDFA files” team behind development veraPDF PDFA validator identify difficulties interpreting PDFA standard Wilson et al 2017 able create validation tests representing clear understanding standards Additionally Wilson et al 2017 record need limit scope validation tests implemented veraPDF scale task particularly validation normative references JPEG 2000 Lindlar et al 2017 record development test set PDF documents test conformance PDF files structural syntactic requirements ISO 3200012008 authors argue test set used examine basic wellformedness requirements helpful digital preservation simplifies detection specific problems precursor application document repair techniques Lindlar et al 2017 summary previous research shows necessity standardisation interoperability details approaches standardisation Research also identified practice deviate standards case PDF practical difficulties developing challenges creating mechanisms evaluate standards compliance challenges 
implementing standards also recorded However lack research examines nature daytoday practical decisionmaking developers implementing standard Research approach undertake case study Gerring 2017 Walsham 2006 single purposefullysampled Patton 2015 community OSS focuses challenges contributors face creating maintaining interoperable collaborate resolve problems 31 Case selection Apache PDFBox selected relevant subject case study several reasons Firstly PDFBox value users must able interoperate reads writes PDF documents must implement sufficient PDF specifications standards perceived viable solution Secondly PDF specifications standards complex documented challenging implement additional requirement implementations need process wide variety conforming nonconforming documents emulate functionality dominant implementation Thirdly though produced OSS likely used business setting PDFBox ASF independent direct company control Consequently contributors PDFBox obliged rely cooperation others community achieve goals Fourthly PDFBox actively develops maintains responds reports issues releases revisions frequently scope investigation publicly documented work contributing nine releases Apache PDFBox release v203 September 2016 release v2012 October 2018 period investigated specifically chosen include publication ISO 3200022017 standard also known PDF v20 August 2017 32 Case description Apache PDFBox develops Java library command line tools create process PDF files library relatively low level used create process PDF documents conforming different versions PDF specifications ISO standards see Table 1 examples development since 2002 ASF governed since 2008 PDFBox maintained small group core developers active community contributors PDFBox dependency ASF projects including Apache Tika Apache Tika 2019 OSS projects including European Union funded Digital Signature Services CEF Digital 2019 PDFBox used parse documents one version veraPDF validator veraPDF 2019 well used proprietary 
products services PDFBox also part suite used journalists extract information PDF files amongst documents collectively known Panama Papers Khudairi 2017 ICJ 2019 time study recent major revision PDFBox v200 released March 2016 maintenance releases generally made approximately every two three months since addition maintains older version v18 bugs fixed releases made less often overwhelming majority bug fixes 18x series backported 20x series also working towards major revision v30
::::
33 Data collection core data case study consists online archives activity PDFBox Using PDFBox website Apache PDFBox 2019 identified communication channels available making contributions resources available users contributors see Table 2 Three public communication channels used make contributions Jira issue tracker developers users mailing lists addition commits mailing list reports commits made PDFBox source code repository messages generated version control system readonly mirror PDFBox source code also provided GitHub Mailing list archives identified downloaded ASF mail archives ASF 2019b GrimoireLab Perceval component Bitergia 2019 used parse Mbox format files convert JSON format files JSON files processed using Python scripts reconstruct email threads write threads emacs orgmode files analysis orgmode plain text format emacs supports text folding annotation Jira issue tracker tickets retrieved JSON format using Jira REST API Atlassian 2019 JSON records ticket aggregated processed Python scripts create orgmode files containing problem description comments ticket
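The thread reconstruction step described above can be sketched as follows. This is a minimal illustration, not the authors' scripts: it assumes each parsed message is a dict with the hypothetical keys `id`, `in_reply_to` and `subject`, follows the reply reference in the email header (the `In-Reply-To` field) to find the first message of a thread, and falls back to the normalised subject line when that reference is missing.

```python
from collections import defaultdict


def normalise_subject(subject):
    """Strip leading reply/forward prefixes so replies share one key."""
    s = subject.strip()
    while True:
        lowered = s.lower()
        for prefix in ("re:", "fwd:", "fw:"):
            if lowered.startswith(prefix):
                s = s[len(prefix):].strip()
                break
        else:
            return s.lower()


def build_threads(messages):
    """Group messages (in archive order) into threads keyed by root id.

    Each message is a dict with the hypothetical keys 'id',
    'in_reply_to' (None when the header is absent) and 'subject'.
    """
    by_id = {m["id"]: m for m in messages}

    def find_root(msg_id):
        # Walk the reply chain upwards; stop at a missing parent or a cycle.
        seen = set()
        current = msg_id
        while True:
            parent = by_id[current].get("in_reply_to")
            if parent is None or parent not in by_id or parent in seen:
                return current
            seen.add(current)
            current = parent

    threads = defaultdict(list)
    root_for_subject = {}
    for m in messages:
        root = find_root(m["id"])
        key = normalise_subject(by_id[root]["subject"])
        # Fallback: a reply that lost its reference header is merged into
        # the first thread seen with the same normalised subject.
        root = root_for_subject.setdefault(key, root)
        threads[root].append(m["id"])
    return dict(threads)
```

The subject fallback is a deliberate simplification: it can merge unrelated conversations that reuse a subject line, so a manual check of merged threads would still be needed.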
::::
34 Data analysis data gathered PDFBox analysed using thematic analysis framework Braun Clarke 2006 Initially first author worked systematically collected data identify email threads issue tracker tickets address topic interoperability regard mailing list threads issue tracker tickets cover wide range topics including administration well help requests potential bug reports Key factors considered included reference capabilities PDFBox comparison PDF processing mention PDF specification standard normative references font image formats phase email threads reconstructed parts conversations subject line recorded archives separate threads3 set candidate email threads issue tracker tickets examined detail identify discussions decisions made concerning implementation functionality related PDF specifications standards normative references PDFBox Mailing list threads issue tracker tickets clear decision articulated ignored analytical purposes discussions judged insufficient information given decisions made clearly understood conversations recorded mailing list threads issue tracker tickets contain technical opinions judgements domain experts including core developers often contain explicit reference PDF specifications standards specific reference standard conversation topic discussion used determine relevance comparison conversations topic explicitly linked PDF standards contributors end process 111 mailing list threads 394 issue tracker tickets identified analysis Coding also used stage annotate discussions particularly decisions made help identify nature problems addressed relationship problems PDF standards PDF outcome decisionmaking process corpus 505 mailing list issue tracker discussions analysed depth first author identify candidate semantic themes describe types decision made identify candidate thematic factors influencing decisions made coding previous phase supported grouping decision types development semantic themes Additional coding undertaken stage used identify 
factors influencing decisions develop set candidate thematic factors subsequent phase authors discussed candidate decision types factors alongside illustrative discussions taken corpus set four semantic themes seven thematic factors agreed consistency larger body evidence reviewed first author
::::
4 Findings section describes semantic themes identified thematic analysis categorise decisions made contributors PDFBox regarding maintenance interoperability decision type illustrated examples Thereafter provide account main factors motivate constrain outcomes types decision made
3 email header contains reference message replies Sometimes reference omitted replying mailing list message
Table 3 Types development decisions related PDF specifications standards Apache PDFBox
Decision Type: Description
Improve match de facto reference implementation: decision taken context improving correcting PDFBox match de facto reference implementation
Degrade match de facto reference implementation: decision taken context degrading compliance PDFBox PDF specification standard behaviour matches Adobe implementation
Improve match standard: decision taken context improving correcting behaviour PDFBox meet PDF specification standard
Scope implementation: decision taken extent PDFBox implementation
Table 4 Apache PDFBox JIRA issue tracker tickets referenced Section 41
Decision type: Issue tracker ticket
Improve match de facto reference implementation: PDFBOX3513 PDFBOX3589 PDFBOX3654 PDFBOX3687 PDFBOX3738 PDFBOX3745 PDFBOX3752 PDFBOX3781 PDFBOX3789 PDFBOX3874 PDFBOX3875 PDFBOX3913 PDFBOX3946 PDFBOX3958
Degrade match de facto reference implementation: PDFBOX3929 PDFBOX3983
Improve Match Standard: PDFBOX3914 PDFBOX3920 PDFBOX3992 PDFBOX4276 PDFBOX3293 PDFBOX4045 PDFBOX4189
41 Decision types identified four major types decision related implementation PDF specification standards PDFBox see Table 3 described illustrative examples also provide descriptions thematic factors identified combination influence decisions made 411 Improve match de facto reference implementation Much work PDFBox contributors focused trying match behaviour Adobe’s PDF PDFBox core developers many contributors treat Adobe PDF readers de facto reference implementations PDF specifications standards eg PDFBOX3738 PDFBOX3745 – 
PDFBox JIRA issue tracker tickets referred Section 41 listed Table 4 use maxim PDFBox able process document Adobe PDF readers one core developer explains “There PDF spec real world PDFs real world PDFs correct regards spec Acrobat PDFBox many libraries try best provide workarounds typically try match Acrobat ” PDFBOX3687 ISO 3200022017 standard ISO 2017 pp 1819 identifies two classifications PDF processing PDF readers PDF writers Accordingly developers trying match Adobe implementations face two major challenges first able process input Adobe second create output similar quality produced Adobe also two types output PDF document created given document rendered screen print “try match Acrobat” PDFBOX3687 documents created PDFBox insofar possible match output Adobe rendered consistently expectation PDFBox created using also render documents similar quality Adobe implementations eg PDFBOX3589 PDFBOX3752 convention reads PDF files apply Robustness Principle Allman 2011 Postel 1981 documents compliant PDF specifications standards processed rendered insofar possible eg PDFBOX3789 Exactly incorrect malformed content parsed working document specified PDF specifications standards exemplar developers behaviour Adobe Readers well behaviour PDF PDF documents consist four parts header body cross reference table trailer header consists string “%PDF-” version number followed second line minimum four bytes value 128 greater tool trying determine file contains treat binary data text trailer consists string “%%EOF” separate line immediately preceded number one line representing offset crossreference table string “startxref” line see Fig 1 PDF parser reads first line file searches “%%EOF” marker works backwards find crossreference table using offset preceding line read trailer confirms number objects referenced table object reference root object document tree parser able read objects PDF file crossreference table missing damaged PDF parsers may according ISO 3200012008 standard ISO 2008 p 
650 try reconstruct table searching objects file see Fig 2 practice Adobe appears apply Principle Robustness widely wide range problems example fonts also tolerated parser
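The parsing steps described above (check the header, search for the end-of-file marker, work backwards to the offset that follows `startxref`, and fall back to scanning for objects when the table cannot be found) can be sketched as below. This is a simplified illustration and not PDFBox's implementation: the function names are hypothetical, only classic cross-reference tables are considered (not cross-reference streams or linearised files), and the header and end-of-file markers are taken to be `%PDF-` and `%%EOF` as they appear in PDF files.

```python
import re


def find_xref_offset(data: bytes, tail_size: int = 1024):
    """Return the offset recorded after 'startxref', or None if not found."""
    if not data.startswith(b"%PDF-"):
        return None
    tail = data[-tail_size:]
    eof = tail.rfind(b"%%EOF")  # use the last marker in the file
    if eof == -1:
        return None
    # The line before the marker holds the offset; the line before that
    # holds the keyword 'startxref'.
    match = re.search(rb"startxref\s+(\d+)\s*$", tail[:eof])
    return int(match.group(1)) if match else None


def scan_objects(data: bytes):
    """Lenient fallback: rebuild an object index by searching the file.

    Returns {(object number, generation): byte offset} for every line
    that starts an indirect object ('N G obj').
    """
    pattern = re.compile(rb"^(\d+)\s+(\d+)\s+obj\b", re.MULTILINE)
    return {
        (int(m.group(1)), int(m.group(2))): m.start()
        for m in pattern.finditer(data)
    }


def locate_objects(data: bytes):
    """Use the recorded offset when it points at 'xref', else scan."""
    offset = find_xref_offset(data)
    if offset is not None and data[offset:offset + 4] == b"xref":
        return "table", offset
    return "scan", scan_objects(data)
```

A strict reader could stop when `find_xref_offset` returns `None`; the scanning fallback illustrates the more tolerant behaviour the text attributes to readers that apply the Robustness Principle.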
::::
4 PDFBox JIRA issue tracker tickets referenced URLs form https://issues.apache.org/jira/browse/PDFBOX-NNNN NNNN four digit number ticket example PDFBOX3738 URL https://issues.apache.org/jira/browse/PDFBOX-3738
::::
5 also ‘linearised’ PDF files intended network transmission trailer crossreference tables precede body
6 repair mechanism sometimes Adobe applications offer opportunity user save newly opened document
work required resolve issues nature varies scope Sometimes source code revision relatively trivial simple change make parser lenient document author’s intention clear example PDFBOX3874 small change made font parser accept field names font metadata capitalised differently specification Similarly PDFBOX3513 PDFBox core developers identify error ISO 3200012008 standard underlying cause observed problem PDFBox One column table specifies two types name dictionary value encoding dictionary Type 3 fonts ISO 2008 p 259 next column table clearly specifies field must dictionary contributor encountered document proposes revision parser accommodate error PDFBOX3513 One core developer comments “we’ve never encountered file problem you’ve presented” Another core developer points guidance specification treat Type 3 font encoding dictionary Instead improvising fallback encoding core developers argue may case ignore font specified document cannot reliably used parser revised given rarity problem Adobe PDF sometimes exceed specifications standards PDFBOX3654 example file found renders many applications PDFBox problem font encoded hexadecimal format standard unequivocal subject “Although encrypted portion standard Type 1 font may binary ASCII hexadecimal format PDF supports binary format” ISO 2017 p 351 source code revised support font encoding core developer processing issue observes “So font incorrectly stored obviously Adobe supports too” PDFBOX3654 cases Adobe extends specifications standards implementation additional functionality reflects wider practice Often documentation additional functionality implementation implementers discover change differences behaviour reported example report PDFBOX3913 shows Adobe PDFjs process render Japanese URI PDFBox ISO 3200022017 standard 
specifies targets URIs links encoded UTF8 applications URI encoded UTF16 necessary represent Japanese characters used domain names exceeds standard Revisions made PDFBox documented PDFBOX3913 PDFBOX3946 PDFBOX3958 support UTF16 URIs implement functionality Adobe PDFjs PDFBox contributors also find instances documents created rendered expected Adobe’s cases typically difference model documents created PDFBox model Adobe expects cases great deal work required understand Adobe readers interpret PDF document PDFBOX3738 work undertaken understand output digitally signed files interpreted Adobe reader products acquired knowledge applied PDFBox create documents read rendered digital signature displayed PDF developers also identify related problem documented PDFBOX3781 affects documents forms digital signatures Merging PDF files difficult problem implementers solve PDFBOX3875 records challenges faced merging two documents internal bookmarks structured using slightly different representations document model merged document bookmarks work expected initial assessment one core developers cause within PDFBox source code “…probably bug kind fixed quickly ” One approach used core developers evaluate best solve problem merge documents using applications including Adobe examine document created following merge Work started try create viable solution emulating document resulting merging files using Adobe problems encountered work completed 412 Degrade match de facto reference implementation noted already developers PDF including PDFBox developers tend view Adobe PDF implementations gold standard However Adobe’s developers always implement PDF specifications standards way others might occasions implement solutions seen incorrect Consequently developers PDF need determine might degrade adherence PDF specifications standards match Adobe’s implementations PDFBOX3929 begins discussion PDFBox users mailing list user observes PDF documents created PDFBox floating point numbers used field 
widget border 7 PDFjs widely used open source PDF reader implemented JavaScript see httpsmozillagithubiopdfjs widths rendered Adobe XI Adobe DC without border Users2 Users3 Table 5 borders annotation types unaffected width borders drawn around annotations form fields defined PDF documents two ways border array holding three four values cases border style dictionary associative array includes value width border points cases value specify width defined number PDF specifications standards define two numeric types integer objects real objects ISO 32000 standards say “ term number refers object whose type may integer real” ISO 2008 p 14 ISO 2017 p 24 ISO 3200022017 example explicit fields required hold integer values uses term number numeric fields versions ISO 32000 standard define border array using following sentence “The array consists three numbers defining horizontal corner radius vertical corner radius border width default user space units” ISO 2008 p 384 ISO 2017 p 465 Accordingly interpretation standards used PDFBox agrees standard border width specified floating point number However Adobe reader expects integer ignores noninteger values 30 treating value zero Consequently PDFBox implementation revised annotations documents created PDFBox rendered borders Adobe DC bug report also made Adobe support saying standard interpreted incorrectly closely related issue found thread users mailing list Users4 developer reports Adobe reader implementations behave unexpected way time concern border drawn around URI action annotation link border defined standard described Adobe reader implementations interpret values 1 2 3 meaning thin medium thick border respectively PDFBox API documentation updated describe Adobe reader implementations interpret border width value contributor reports PDFBOX3983 Acrobat Reader fails display outlines borders miter limit set value zero less miter limit indicates junctions lines drawn ISO 3200012008 standard states Parameters numeric values 
current colour line width miter limit shall forced valid range necessary ISO 2008 p124 statement revised ISO 3200022017 replacement “forced” “clipped” ISO 2017 p 157 Accordingly one interpretation might compliant PDF reader would able display document correctly regardless value miter limit recorded would automatically correct value However Adobe implementations appear correct value user reporting problem supplies patch miter limit documents created PDFBox contain miter limit values positive simple fix allows Adobe display document OpenPDFtoHTML another OSS also encountered problem takes similar action6 413 Improve match standard PDFBox implementation also revised meet requirements PDF standards normative references independently need match performance Adobe products use multibyte representations characters Unicode character encodings UTF16 require careful processing PDF parsers single byte values misinterpreted single byte value 0x20 represents space character fonts encoded one byte multibyte character encodings byte 0x20 may part character treated single byte Two kinds operator used PDF documents position text one used multibyte font encodings single byte values form part multibyte characters misinterpreted patch contributed PDFBOX3992 PDFBox fully supports operator used justify multibyte encoded text comply ISO 3200012008 standard PDFA group standards define archive format PDF demands standards high compliance requires great deal attention detail document preparation general PDFA standards constrain types content present compliant files sometimes make precise demands quality embedded resources veraPDF develops freely available validator PDFA files PDFBox also implements ‘preflight’ functionality validate documents requirements PDFA1b ISO 1900512005 standard examples implementation revised match performance veraPDF validator differences found example bug preflight validator found PDFBOX4276 functionality corrected incorrect output detected veraPDF would PDFBOX3920 
user reports font subsets created PDFBox include data required PDFA2 standard ISO 1900522011 PDFBox source code modified output meets standard number revisions PDF specifications standards mean occasionally found PDFBox implement particular feature capture data PDF document contributor reports problem PDFBox field ignored parsing leads content rendered supposed hidden user provides patch PDFBOX3914 6 httpsgithubcomdanfickleopenhtmltopdfissues135 forms basis update source code field imported document rendered correctly 414 Scope implementation core developers also make decisions scope implemented PDFBox question functionality forms scope PDFBox implementation arises bug reports feature requests multiple dimensions PDFBox intended comprehensive solution creating processing rendering PDF documents charter mission statement says “The Apache PDFBox library open source Java tool working PDF documents allows creation new PDF documents manipulation existing documents ability extract content documents Apache PDFBox also includes several commandline utilities Apache PDFBox published Apache License v20” Apache PDFBox 2019 PDFBox relies external libraries provide functionality especially area image processing need PDFBox reimplement wheel particularly technically demanding domains difficulty image processing provision within core Java libraries incomplete varies Java versions functionality JPEG 2000 codec longer maintained difficult OSS implementers adopt licence used potential patent issues discussed Section 426 Java provision image processing changing Java v9 functionality gradually returned core libraries However JPEG 2000 codec remains outside main Java libraries PDFBox core developers often recommend use Twelve Monkeys plugin9 image processing particular processes CMYK images PDFBox areas work outside current scope PDFBox including implementation rendering complex scripts provision developers contributed code nonEuropean languages expertise example Users5 cases layout 
languages sufficiently close Latin scripts need additional provision fonts correct shown PDFBOX3293 However many languages including Arabic Indian subcontinent need implement code position glyphs using GSUB GPOS tables PDFBOX4189 user provides lot functionality support GSUB tables Bengali complexity task clear discussions reviewing accepting source code Decisions also made cause observations whether observed result problem PDFBox issue lies PDFBox decisions made resolving problem Sometimes erroneous observation results user reports difference assessments Adobe preflight PDFBox concerning document’s compliance PDFA1b standard PDFBOX4045 Adobe XI identifies inconsistencies glyph widths one font document investigation core developers determine error PDFBox Adobe X agrees document compliant Given inconsistent assessments made Adobe X XI inspection font show issue reported Adobe XI PDFBox core developers conclude problem implementation preflight particular version Adobe XI used
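The interoperability adjustments described in Sections 4.1.1 and 4.1.2 can be illustrated with a minimal Python sketch. This is not PDFBox code: the function names are hypothetical, and the specific policies (rounding border widths to whole numbers, replacing a non-positive miter limit with the PDF initial value of 10.0, treating a leading FE FF byte order mark as the sign of a UTF-16BE URI) are assumptions chosen only to show the kind of decision involved, namely emitting or accepting values that the dominant reader handles, even where the standard would allow or require something else.

```python
import codecs

# Initial miter limit of the PDF graphics state; used here as an assumed fallback.
PDF_DEFAULT_MITER_LIMIT = 10.0


def decode_link_uri(raw: bytes) -> str:
    """Decode a URI action target, tolerating UTF-16BE marked by a BOM.

    Assumption: a leading FE FF marks UTF-16BE, as it does for PDF text
    strings; anything else is treated as UTF-8.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return raw.decode("utf-8")


def adobe_safe_border_width(width: float) -> int:
    """Return an integer border width (hypothetical policy).

    The text reports that Adobe readers ignore non-integer widths, so a
    writer can round, while keeping any positive width at least 1 so the
    border stays visible.
    """
    if width <= 0:
        return 0
    return max(1, round(width))


def adobe_safe_miter_limit(limit: float) -> float:
    """Replace a non-positive miter limit with the default (hypothetical policy)."""
    if limit <= 0:
        return PDF_DEFAULT_MITER_LIMIT
    return limit


if __name__ == "__main__":
    print(decode_link_uri(b"https://example.com/"))
    print(adobe_safe_border_width(0.5), adobe_safe_border_width(2.4))
    print(adobe_safe_miter_limit(0.0))
```

The two writer-side helpers change values that the standard already permits, which is why the text describes such revisions as degrading compliance to match the de facto reference implementation; the reader-side decoder instead accepts input that exceeds the standard.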
::::
9 httpsgithubcomharaldkTwelveMonkeys 42 Factors influencing decisionmaking Common decision types observed set considerations factors influence outcome decisionmaking process see Table 6 Factor Description Workforce availability contributors work Maintenance Risk maintenance burden feature implementation Expertise collective expertise contributors Sustainable Solution longterm viability technical solution Capability ability make relevant meaningful changes given context Intellectual Property Rights Matters pertaining copyright patents licensing Java Interoperability consequences interoperability revisions Java 421 Workforce Companies choose use PDFBox appropriate needs contribute improvement work developers noted core developers PDFBox number emphasise paid work PDFBox “The volunteer effort always looking interested people help us improve PDFBox multitude ways help us depending skills” Apache PDFBox 2019 limited time available Targett 2019 PDFBox core developers concentrate efforts Khudairi 2019 areas work priority unless developers community able contribute example given previously work solution document merging problem PDFBOX387510 halts may explained limited workforce focused achievable tasks illustrated core developers’ comment another task “I hoped implement given current commitments unlikely I’m able short term I’m trying concentrate resolving AcroForms related stuff spare time moment1” PDFBOX3550 Another example influence available workforce decision making found PDFBOX3875 developer working company wants problem resolved problem challenging take time understand resolve developer reporting problem given three choices adopt use another OSS application implicitly buy licence Adobe professional contribute fix either directly commissioning developers work 422 Maintenance risk notion maintenance risk related factors expertise workforce Core developers sometimes express imply concern unwilling accept solution example PDFBOX3962 user proposes solution repairs 
unicode mappings one PDF document 10 Issue tracker tickets referenced Section 42 given Table 7 rendered core developers identify solution resolves special case work would required develop viable solution Java 9 libraries Another concern articulated requests support complex scripts core developers skills maintain functionality lengthy discussion issue found PDFBOX3550 core developers identify central challenges creating solution main concern cases providing additional functionality cannot maintained challenge maintain either terms effort required necessary expertise risk utility perhaps viability 423 Expertise implementation PDF requires expertise wide range areas addition PDF Limitations available expertise help determine work done contributors One implication already noted reluctance maintain source code areas limited expertise amongst core developers Another areas functionality cannot developed example user asks compressing CMYK JPEG images PDFBOX3844 core developer responds saying “There JPEG compression CMYK BufferedImage objects box ie Java ImageIO doesn’t support don’t skills I’ll close “won’t fix” time” PDFBOX3844 alternative suggested PDFBOX3844 investigate Twelve Monkeys builds Java ImageIO functionality also great deal expertise within PDFBox community enable implementation solutions PDFBOX4095 one contributor provides proposed solution challenging problem work evaluating proposed change isn’t going well another contributor suggests simple revision resolves problems Similarly complex image rendering problem solved help advice contributor PDFBOX4267 another contributor implements code process YCbCr CMYK JPEG images PDFBOX4024 Expertise alone however sufficient provide solution problem cases discussion PDFBOX4189 shows considerable expertise within user community core developers fonts render complex scripts Key factors prevented work done previously shortage available workforce also lack expertise target language would provide sufficient understanding 
distinguish good bad solutions “Many complex scripts Arabic require shaping engines require deep knowledge languages order follow rules OpenType tables” PDFBOX3550 424 Sustainable solution often implementation choices made resolving problem better longterm solution viable shortterm fix workaround PDFBOX3300 concerns reported way font subset created prior embedding document specific solution proposed provides way resolving problem Another developer identifies optimal solution resolve problems CMap 11 parser sustainable solution patch provide specific workaround case developers able create generic solution better addresses font standards thereby PDF standards provides longerlived solution 425 Capability key factor decisions concerns whether able correct problem causing observed behaviour examples given Section 412 PDFBox implementation degraded meeting standard match behaviour Adobe’s illustrate one aspect capability factor cases ‘incorrect’ implementation could revised revision PDFBox could ensure documents created would rendered expected Adobe’s implementations cases bugs found external libraries infrastructure impact PDFBox Often workaround found alternative library recommended example PDFBOX3641 describes situation PDFBox uses core Java library way triggers bug Java implementation code PDFBox revised prevent bug triggered Java bug also reported 12 426 Intellectual property rights PDF documents include technologies artifacts use constrained copyright patents licences addition PDFBox implemented Java lifetime moved closed source largely open source variants eg OpenJDK derivatives like Amazon Corretto entirely open source implementation JPEG 2000 codec included extensions Java libraries Sun Microsystems’ process make Java open source codec along image codecs released separate library known ImageIO licence used implementation JPEG 2000 codec Open Initiative OSI approved open source licence consider licence used incompatible OSS licences 
GPL v3 Apache Licence v20 13 addition concerns amongst OSS developers potential patent claims related JPEG 2000 though concerns diminishing passage time image codecs ImageIO library reincorporated Java libraries OpenJDK since v9 JPEG 2000 codec Consequently JPEG 2000 support PDFBox required users relies jaiimageio 14 implementation codec 11 CMap table font file maps character encodings glyphs represent 12 httpsbugsopenjdkjavanetbrowseJDK8175984 13 example opinion expressed httpsgithubcomjaiimageiojaiimageiojpeg2000 14 httpsgithubcomjaiimageiojaiimageiojpeg2000 longer maintained user reports using OpenJPEG implementation JPEG 2000 PDFBOX4320 However OpenJPEG implemented C used native code may suitable deployment contexts development replacement OSS JPEG 2000 codec inhibited resources including expertise finance required implement large complex standard JPEG 2000 defined ISOIEC 15444 consists 14 parts see Lundell et al 2018 ISO 1900512005 standard ISO 2005 p 11 archival PDF documents mandates embedding fonts including standard 14 fonts PDF specifications require 14 fonts present systems render documents eg ISO 3200012008 ISO 2008 p 256 substitute fonts files document contains resources required render requirement stated “Only fonts legally embeddable file unlimited universal rendering shall used” ISO 2005 p 10 requirement problematic many fonts licences permit redistribution matter discussed PDFBOX3618 legality embedded fonts responsibility document creator PDFA1 PDFA2 standards include note clarifies need legal use font clearly verifiably stated “This part ISO 19005 precludes embedding font programs whose legality depends upon special agreement copyright holder allowance places unacceptable burdens archive verify existence validity longevity claims” ISO 2005 p 11 ISO 2011 p 15 427 Java interoperability addition set problems concerning interoperability Java influence 
solutions implemented PDFBox related PDF standards Java used provide support image processing required standards example found PDFBOX3549 Java versions differing capability process ICC colour spaces versions bugs affect handling ICC colour spaces period PDFBox activity investigated three new major versions Java released many revisions made version also evidence mailing lists Jira tickets users still using Java 5 already obsolete start period investigated 43 Summary analysis two years activity PDFBox related implementation PDF specifications standards identified four decision types related development seven factors influence decisions four decision types related adapting emulate behaviour Adobe’s PDF implementing PDF standards scope PDFBox implementation seven factors act combination facilitate constrain development activity especially interplay expertise workforce Analysis Much work PDFBox contributors consists trying match implementation Adobe PDF reader reasons matching Adobe implementations mostly clear yet trying emulate Adobe’s clearly challenging solutions including validators might reduce extent challenges risks challenging create 51 challenges developing PDF parsers PDF specifications standards specify PDF may try reconstruct files cross reference table incorrect omitted practice Principle Robustness applied Adobe’s PDF PDF files wellformed often rendered developers PDF applications obliged follow Adobe’s lead developers nonAdobe PDF implement parsers behaved similarly Adobe’s products would quickly become irrelevant PDF users often believe documents read rendered Adobe must meet standard Amiouny 2016 Lehtonen et al 2018 extent PDF applications libraries expected tolerate errors documents documented Adobe’s creates number challenges developers PDF Firstly nonAdobe developers left timeconsuming puzzle trying match Adobe implementations Indeed puzzle includes element chance differences performance discovered PDF document including triggering problem processed 
Secondly clearly security concerns approach Parsing arguably one challenging engineering tasks case PDF core specifications standards extensive complex include large number normative references component file media types need parsed either PDF implementation dependencies PDFBox subject Common Vulnerabilities Exposures CVE notices related parser implementation For example CVE20188036 CVE2018117979 PDF implementations core developers therefore making decisions security part around viability trying match behaviour Adobe’s practitioners argue small revision made ISO 3200022017 standard concerning structure file precisely defines relationship header end file marker largely put end need apply Principle Robustness PDF parsing Amiouny 2017 However though changes standard important may ease burden developers share optimism changes apply structure documents claim PDF v20 compliant course remain circulation documents created 25 years PDF usage well documents continue created compliant earlier specifications standards Principle Robustness applied tolerate nonconformance normative standards PDF fonts images well minor PDF implementation errors Given history malformed PDF files challenges standards compliance fact document claims PDF v20 complies structural requirements ISO 3200022017 guarantee either document components comply standard Consequently need tolerant parsing remains One improvement might creation reference implementations validation tools practices adopted development open standards example IoT domain noted Section 23 eg Watteyne et al 2016 Validation tools fonts could help ensure font creators build font files contain sufficient accurate information use font file implementers font parsers means evaluate validation tools PDF documents reference implementation PDF would help developers PDF create interoperable applications less effort possibly reduce security risks arising need parse malformed documents However practice PDF validators difficult expensive 
implement veraPDF veraPDF 2019 PDFA validator example created European Union funded PDFTools validator proprietary licenced software 19 problem remains also solutions validators forward looking address challenge processing noncompliant PDF files created last 25 years still need read though case introducing validators reference implementations help ensure PDF files created future pose fewer problems developers Lundell Gamalielsson 2018 Furthermore tools validators provide reference point try improve quality existing documents exemplified work Lehtonen et al 2018 applications PDF file preservation 52 Practice vs standard challenges PDFBox contributors arise development practice particularly Adobe moves away standards PDFBOX3913 records discovery Adobe’s PDF PDFjs exceed ISO 3200012008 standard implementing UTF16 encoding destination URIs links bug report dates August 2017 contemporary publication ISO 3200022017 specifies use UTF8 encoding ISO 2017 p 515 Given use UTF16 encoded URIs part HTML 5 since 2011 20 outwardly reasonable Adobe others follow practice However remains open question UTF16 encoding URIs part ISO 3200022017 standard issue found PDFBox Jira issue tickets grey area standard document presented PDF specifications standards apply quality document manner parts document rendered example character spacing However standard specify might render document examples given illustrate degradation compliance standard match Adobe’s implementation particular interest ISO 3200012008 ISO 3200022017 standards clear value border width represented compliant PDF document PDFBox core developers identified representation values border widths within document comply PDF specifications standards valid noninteger values accepted Adobe However presentation screen Adobe border widths defined document interpretation values document one may need followed slavishly 53 sustainability PDFBox core developers generally act improve functionality However times actions appear 
constrained longterm interests decisions example around support complex scripts graphics processing ready explanations core developers always necessary skills time implement required solutions also activities may clear decision stated core developers contributors complete tasks run time higher priority tasks attend may inferred developers acting longterm interests create works maintained concern contributors overreach collective abilities capacity develop maintain good quality risk may cease viable parallels drawn decisionmaking core developers reflect capacity make maintain specific changes decisions made within business maintain going concern Implicit idea PDFBox remains marketable ie sufficiently compliant PDF specifications standards useful many users therefore continue attract users contributors without need take risks making unsustainable changes recognised observed process selfregulation precisely company group companies driving development PDFBox making strategic decisions dedicated managers making strategic decisions Instead appear sensible levelheaded strategic decisions might made business made small small collective individuals company developers collaborating development maintenance PDFBox 54 Limitations case study reported article describes analyses activity practitioners collaborating OSS community develop create process PDF documents acknowledge limitations transferability findings arise nature study However conjecture findings may representative challenges faced decision types made OSS projects perhaps businesses implementing standardsbased interoperable particular dominant implementation contributes discourse meaning interoperability factors informing decisions made relate technical resource concerns appear relevant businesses organisations Conclusions study reports findings investigation practical decisions concerning interoperability made two year period contributors community open source Apache PDFBox PDFBox develops maintains used create 
process documents conform multiple PDF specifications published ISO standards Four types decision made contributors maintain interoperability PDFBox identified thematic analysis Decisions interoperability concern compliance PDF specifications ISO standards match mimic behaviour de facto reference implementation unrelated standards conflict conjunction contributors also make decisions scope PDFBox implementation Contributors PDFBox able deliver high quality careful times conservative decisionmaking process allows often agile response discovery problems project’s changes dominant proprietary implementation time decisions made informed factors including resource technical considerations contribute towards longer term viability created 19 PDFTools 3Heights Validator httpswwwpdftoolscompdf20enproductspdfconvertervalidationpdfvalidator 20 httpswwww3orgTR2011WDhtml520110525urlshtml summary study makes following contributions existing body knowledge area rich detailed account types decisions made within community OSS maintain interoperability account technical nontechnical factors motivate constrain development activity support sustainability study provides rich illustration analysis challenges faced contributors community OSS implement maintain interoperable standardsbased study shown contributors PDFBox able meet challenges arising demands technical specifications standards performance de facto reference implementation study also finds awareness resources available able maintain interoperable continuing technical relevance topic future research understand extent challenges decisiontypes identified factors influencing decisions representative faced organisations — businesses OSS projects — developing standardsbased implementations Declaration competing interest None Acknowledgements research financially supported Swedish Knowledge Foundation KKstiftelsen participating partner organisations LIMIT authors grateful stimulating collaboration support colleagues partner 
organisations References Ahlgren B Hidell Ngi ECH 2016 Internet things smart cities interoperability open data IEEE Internet Comput 20 52–56 doi101109MIC2016124 Allman E 2011 robustness principle reconsidered Commun ACM 54 40–45 doi10114519785421978557 Amiouny 2016 Buggy PDF Files Try Fix Amyuni Technologies Inc httpblogamyunicomp1627 Accessed 20190515 Amiouny 2017 PDF 20 Future PDF Takeways PDF Days Europe 2017 Amyuni Technologies Inc httpblogamyunicomp1702 Accessed 20190514 Apache PDFBox 2019 Apache PDFBox Java PDF Library Apache Foundation httpspdfboxapacheorg Accessed 20190917 Apache Tika 2019 Apache Tika — Content Analysis Toolkit Apache Foundation httpstikaapacheorg Accessed 20190605 ASF 2019 Apache Foundation Apache Foundation httpwwwapacheorg Accessed 20190605 ASF 2019 Apache Foundation Public Mailing List Archives Apache Foundation httpmailarchivesapacheorg Accessed 20190605 Atlassian 2019 Jira REST APIs Atlassian httpsdeveloperatlassiancomjiradevnetjiraapisjirarestapis Accessed 20190415 Bitergia 2019 GrimoireLab Bitergia httpschaossgithubiogrimoirelab Accessed 20190803 Black Duck 2019 Apache PDFBox Black Duck Inc httpswwwopenhubnetppdfbox Accessed 20190308 Bogk Schöpl 2014 pitfalls protocol design attempting write formally verified PDF parser 2014 IEEE Security Privacy Workshops pp 198–203 doi101109SPW201436 Bouvier DJ 1995 Versions standards HTML SIGAPP Appl Comput Rev 3 9–15 doi101145238228238232 Bradner 1996 internet standards process — revision 3 Internet Engineering Task Force httpswwwrfceditororgrfcrfc2206html Accessed 20190919 Bradner 1999 internet engineering task force DiBona C Ockman Stone Eds OpenSources Voices Open Source Revolution OReilly Associates pp 28–30 Braun V Clarke V 2006 Using thematic analysis psychology Qual Res Psychol 3 77–101 doi1011771478088706063003 Butler Gamalielsson J Lundell B Brax C Sjöberg J Mattsson Gustavsson Feist J Lönnroth E 2019 company contributions community OSS projects IEEE Trans Softw Eng early access 
doi101109TSE20192910305 1–11 CEF Digital 2019 Start Using Digital Signature Services DSS CEF Digital httpseceuropaeucefdigitalwikipagesviewpageactionpageId77177034 Accessed 20190429 Davies EB Hoffmann J 2004 IETF Problem Resolution Process Internet Engineering Task Force httpswwwrfceditororgrfcrfc3844html Accessed 20190919 De Coninck Q Michel F Piraux Rochet F GivenWilson Legay Perret P Ronaventure 2019 Pluginizing QUIC Proceedings ACM Special Interest Group Data Communication ACM New York NY USA pp 59–74 doi10114533413023342078 Eclipse Foundation 2019 Californium Cf CoAP framework Eclipse Foundation httpswwweclipseorgcf Accessed 20191003 Eclipse Foundation 2019 Eclipse Leshan Eclipse Foundation httpswwweclipseorgleshan Accessed 20191003 Eclipse Foundation 2019 Eclipse Wakaama Eclipse Foundation httpswwweclipseorgwakaama Accessed 20191003 Eclipse IoT Working Group 2019 Open Source IoT Eclipse IoT Working Group httpswwweclipseorgiot Accessed 20180829 Egidi TM 2007 Standardcompliant incompatible Comput Standards Interfaces 29 605–613 doi101016jcsi200704020 Endignonx G Levillain Migeon JY 2016 Caradoc pragmatic approach PDF parsing validation 2016 IEEE Security Privacy Workshops SPW pp 125–139 doi101109SPW201639 Fitzgerald B 2006 transformation open source Manage Inf Syst Q 30 587–598 Gamalielsson J Lundell B 2013 Experiences implementing PDF open source Challenges opportunities standardisation processes Proceedings 8th International Conference Standardization Innovation Information Technology SITI 2013 pp 1–11 doi101109SITI20136774572 Gerring J 2017 Case Study Research Principles Practices second ed Cambridge University Press Cambridge UK ICJ 2019 Panama Papers Exposing Rogue Offshore Finance Industry httpswwwicjorginvestigationspanamapapers Accessed 20190529 IETF 2019 Internet Engineering Task Force Internet Engineering Task Force httpswwwietforg Accessed 20190927 IETF 2019 QUIC quic — Internet Engineering Task Force httpsdatatracker ietforgwgquicabout Accessed 
Simon Butler received PhD Open University 2016 researcher Systems Research Group University Skövde Sweden research interests include engineering open source program comprehension development tools practices maintenance Jonas Gamalielsson received PhD Heriot Watt University 2009 senior lecturer University Skövde member Systems Research Group conducted research related free open source number projects research reported publications variety international journals conferences Professor Björn Lundell received PhD University Exeter 2001 leads Systems Research Group University Skövde Professor Lundell’s research contributes theory practice systems domain area open source open standards related development use procurement systems research
addresses sociotechnical challenges concerning systems focuses lockin interoperability longevity systems Professor Lundell active international national research projects contributed guidelines policies national EU levels Christoffer Brax received MSc degree University Skövde 2000 PhD Örebro University 2011 consultant Combitech AB working systems engineering requirements management systems design architecture security Christoffer 18 years experience systems engineer Anders Mattsson received MSc degree Chalmers University Technology Sweden 1989 PhD engineering University Limerick Ireland 2012 almost 30 years experience engineering currently RD manager Information Products owner development process Husqvarna AB Anders particularly interested strengthening engineering practices organizations Special interests include architecture modeldriven development context embedded realtime systems Tomas Gustavsson received MSc degree Electrical Computer Engineering KTH Royal Institute Technology Stockholm 1994 cofounder current CTO PrimeKey Solutions AB Tomas researching implementing public key infrastructure PKI systems 24 years founder developer open source enterprise PKI EJBCA contributor numerous open source projects member board Open Source Sweden goal enhance Internet corporate security introducing cost effective efficient PKI Jonas Feist received MSc degree Computer Science Institute Technology Linköping University 1988 senior executive cofounder RedBridge AB computer consultancy business Stockholm Erik Lönroth holds MSc Computer Science Technical Responsible high performance computing area Scania AB leading technical development four generations super computing initiatives Scania supporting subsystems Erik frequently lectures development super computer environments industry open source governance HPC related topics
::::
empirical study downstream workarounds crossproject bugs Hui Ding Wanwangying Lin Chen Yuming Zhou Baowen Xu State Key Laboratory Novel Technology Nanjing University China dinghui85gmailcom wwymasmailnjueducn lchen zhouyuming bwxunjueducn Abstract—GitHub fostered complicated enormous ecosystems projects depend coevolve error upstream may affect downstream projects interdependencies forming crossproject bugs Though upstream developers fix bugs side proposing workaround ie temporary solution downstream common practice downstream developers study empirically investigated characteristics downstream workarounds scientific Python ecosystem Combining statistical comparisons manual inspection following three main findings First general workarounds corresponding upstream fixes significantly different code size code structure Second three kinds crossproject bugs downstream developers usually work around Last four types common patterns identified investigated workarounds findings study lead better understanding crossproject bugs practices developers ecosystems Keywords—GitHub ecosystems crossproject bugs workarounds practices INTRODUCTION Benefiting social coding capabilities GitHub development GitHub evolved beyond single sociotechnical ecosystems 1 Projects rely infrastructure functional components provided projects forming complex interproject dependencies way bugs upstream projects may affect downstream projects dependencies phenomenon confirmed et al 2 study investigated crossproject correlated bugs ie causally related bugs reported different projects scientific Python ecosystem GitHub focusing developers coordinate triage fix kind bugs context crossproject bugs doubt upstream bug roots provide radical cure However affected downstream projects usually offer workaround ie temporary solution locally bypass upstream error et al posted questionnaire asked downstream developers usually deal crossproject bugs result indicated 893 respondents chose propose temporary workaround 
proven common practice 2 Workarounds important two folded 2 First used avoid longlasting impact upstream bug workaround must implemented upstream team willing able fix bug quickly allows downstream temporarily suppress upstream bug Second adding workaround upstream bug enables downstream support buggy upstream version without affecting end users many users may still use old version upstream downstream developers cannot rely fix next upstream release Therefore downstream developers work around bugs regardless whether already fixed upstream Despite wide use importance workarounds crossproject bugs little work paid attention issue Studying workaround help understand fixing process crossproject bugs also coordination projects ecosystem Therefore conduct study investigate characteristics downstream workarounds context crossproject bugs base study scientific Python ecosystem GitHub crossproject bug refer patch injected buggy upstream upstream fix temporary solution provided affected downstream downstream workaround make investigation workarounds three aspects First compare code size design workarounds corresponding upstream fixes Second inspect whether crossproject bugs worked around downstream projects something common Third investigate whether practitioners developed workarounds common ways main contributions study follows First extract 60 downstream workarounds scientific Python ecosystem Second identify three kinds crossproject bugs downstream developers usually work around Third summarize four common workaround patterns Last provide several design requirements workaround supporting tools rest paper organized follows Section II describes related work Section III presents research methodology Section IV shows empirical results propose discussions findings Section V II RELATED WORK Crossproject Bugs development ecosystems crossproject bugs appear attract attention increasing number researchers existing studies showed crossproject bugs brought many troubles ecosystem 
developers Decan et al 3 reported developers R ecosystems felt pain upstream packages broke Adams et al 4 indicated core activity integration open source distributions synchronizing newer upstream version avoid crossproject bugs developers pay great attention synchronizing process Bavota et al 5 found upstream upgrade would strong effects downstream projects general dependencies study showed large amount downstream code modified upstream changed downstream depended upstream framework general services case upstream bugs would leave wide impact downstream projects researches focused coordination developers different projects fixing crossproject bugs Villarroel et al 6 leveraged reviews App users help developers realize downstream demand classified prioritized downstream reviews upstream developers able catch important bugs quickly et al 2 studied developers fixed crossproject correlated bugs scientific Python ecosystem Combining manual inspection results online survey revealed developers especially downstream side tracked root cause crossproject bugs dealt eliminate bad effects study bases extends work focus specific common practice downstream developers facing crossproject bugs ie proposing workaround B Blocking Bugs Another special type bugs blocking bugs extent similar crossproject bugs Blocking bugs prevent bugs projects fixes often happens dependency relationship among components environment developers cannot fix bugs modules fixing depend modules unresolved bugs Due severe impact researchers turned eyes blocking bugs Garcia Shihab 7 found took two three times longer fix blocking bugs nonblocked bugs employed decision trees predict whether bug blocking bug extracted 14 kinds features construct predictor evaluated features influential indicate blocking bugs Later Xia et al 8 proposed novel method named ELBlocker identify blocking bugs class imbalance phenomenon taken account ELBlocker utilized features combined multiple classifiers learn appropriate imbalance
decision boundary ELBlocker outperformed method 7 147 Fmeasure Unlike blocking bugs prevent fixing bugs dependent modules crossproject bugs occur upstream projects affect normal operation downstream projects affected downstream modulesprojects developers attempt take action released blockingcrossproject bugs components paper investigate downstream practices facing crossproject bugs C Design Bug Fixes Fixing bugs important activity maintenance Developers devote substantial efforts design bug fixes reflect developers’ expertise experience Various studies investigated nature design bug fixes Zhong Su 9 extracted analyzed 9000 realworld bug fixes six Java projects obtained 15 findings could gain insights automatic program repair Pan et al 10 explored underlying bug fix patterns identified 27 bug fix patterns amenable automatic detection Park et al 11 analyzed bugs fixed understand characteristics incomplete patches revealed predicting supplementary patch difficult problem Jiang et al 12 conducted study characteristics Linux kernel patches could explain patch acceptance reviewingintegration time Misirli et al 13 proposed measure study impact fixinducing changes found lines code added number developers worked change number prior modifications files modified change best indicators highimpact fixinducing changes Echeverria et al 14 evaluated developers’ performance fixing bugs propagating fixes products industrial Product Line According different characteristics bug fixes researches developed various automatic tools support bug repair Goues et al 15 16 used genetic programming repair bugs C programs evaluated fraction bugs could repaired automatically generated large indicative benchmark set systematic evaluations Mechtaev et al 17 presented semanticsbased repair method applicable largescale realworld Gu et al 18 considered bad fix problem implemented prototype automatically detects bad fixes Java programs fixing bugs developers may different options design bug fix Leszak
et al 19 pointed defects fixed correcting real errorcausing component rather workaround injected another location online material gives clear description workaround 20 “A workaround far less elegant solution problem Typically workaround viewed something designed panacea cureall rather crude solution immediate problem temporary fix workaround well suitable permanent fix implemented management personnel” MurphyHill et al 21 studied developer might choose workaround instead fix real location summarized six factors risk management interface breakage consistency user behavior cause understanding social factors studies also paid attention phenomenon workarounds Ko et al 22 found bug known workaround developers often focused severe bugs Berglund 23 indicated bugs could worked around workarounds relevant early stages bug fixing process Different existing studies investigated design fixes withinproject bugs study concentrates characteristics downstream workarounds context crossproject bugs III RESEARCH METHODOLOGY section first introduce collected data study present research questions Finally describe research methods used investigate questions Data Source crossproject bugs investigation collected et al 2 data available online1 dataset contains 271 pairs crossproject bugs gathered scientific Python ecosystem GitHub Every pair includes upstream issue reported rootcause downstream issue reported affected Specifically crossproject bugs involve 204 projects including seven core libraries ecosystem IPython2 NumPy3 SciPy4 Matplotlib5 Pandas6 Scikitlearn7 Astropy8 Since study focuses workarounds interested crossproject bugs downstream developers provided workaround order extract data needed manually read bug reports downstream side 271 pairs bugs downstream developers willing propose workaround likely leave related information issue reports example developer IPython suffering bug Setuptools commented “I’ll open Issue setuptools deal figure best workaround IPython be” 
ipythonipython8804 Two authors paper carried task found 60 pairs crossproject bugs investigate study 60 pairs bugs concentrated downstream workarounds corresponding upstream fixes Usually upstream issue link bugfix commits repaired Also downstream issue worked around commits including workaround would indicated manually inspecting issue reports two authors linked every pair closed crossproject bugs commits containing fixworkaround Note nine crossproject bugs fixed upstream projects Therefore total collected 60 downstream workarounds 51 upstream fixes B Research Questions aim study investigate characteristics downstream workarounds context crossproject bugs particular attempt answer following three research questions RQ1 differences downstream workarounds corresponding upstream fixes Compared upstream fix workaround injected different serves different purpose Therefore design workaround different fix compared two aspects code size code structure RQ2 crossproject bugs downstream developers work around common features stated crossproject bugs workarounds features 60 bugs workarounds common RQ2 sought find answer RQ3 workarounds common patterns RQ3 attempted find whether downstream developers worked around upstream bugs common ways C Research Methods 1 Quantitative analysis methods RQ1 Wilcoxon signedrank test Cliff’s δ served compare code size upstream fixes downstream workarounds Wilcoxon signedrank test nonparametric statistical hypothesis test used compare whether two matched groups data identical 24 paired sample study sizes concerning number modified files number changed lines code downstream workarounds upstream fixes set null hypothesis H0 alternative hypothesis H1 follows H0 number modified files number changed lines code downstream workarounds upstream fixes H1 number modified files number changed lines code downstream workarounds significantly different upstream fixes assessed test results significance level 005 pvalue obtained Wilcoxon signedrank test lower 
0.05 sizes workarounds fixes considered significantly different Together median values sizes able decide whether size workaround smaller size corresponding fix Furthermore used Cliff’s delta effect size measure magnitude difference sizes workarounds fixes Cliff’s delta provides simple way quantifying practical difference two groups 25 kinds effect sizes Cliff’s delta direct simple variety nonparametric one 26 convention magnitude difference considered either trivial |delta| < 0.147 small 0.147 to 0.33 moderate 0.33 to 0.474 large |delta| > 0.474 27
Footnotes 1 httpsgithubcomnjuapICSE2017 2 httpipythonorg httpsgithubcomipythonipython 3 httpwwwnumpyorg httpsgithubcomnumpynumpy 4 httpwwwscipyorgscipylib httpsgithubcomscipyscipy 5 httpmatplotliborg httpsgithubcommatplotlibmatplotlib 6 httppandaspydataorg httpsgithubcompydatapandas 7 httpscikitlearnorg httpsgithubcomscikitlearnscikitlearn 8 httpwwwastropyorg httpsgithubcomastropyastropy
2 Qualitative analysis RQ2 RQ3 part RQ1 performed qualitative analysis investigate questions Two authors manually inspected issue reports code fixesworkarounds crossproject bugs two authors first individually completed task following procedure criteria reviewed issue reports code carefully executed existing test cases provided developers keep track traces observe inputoutput procedure wrote necessary information bug information bug type root cause bug impact participants bug context related methods test cases traces inputoutput workaround fix strategies also wrote findings individual investigation came together discuss findings draw conclusions IV RESEARCH RESULTS RQ1 Differences Fixes Workarounds order compare upstream fixes downstream workarounds first statistically compared sizes terms number modified files number modified lines code inspected code structure fixes workarounds see whether different Among 60 pairs crossproject bugs nine not fixed upstream projects Therefore could not compare workarounds upstream fixes RQ1 investigated remaining 51 pairs crossproject bugs 1
Statistical comparison size TABLE I shows minimum maximum average values well 25th 50th 75th percentiles workaroundfix size facilitate visual comparison also use boxplots illustrate size distributions Fig 1 clear number modified files number modified lines code workarounds smaller fixes also adopted Wilcoxon signedrank test Cliff’s delta effect size statistically compare workarounds fixes results shown TABLE II pvalues less 0.05 indicate number modified files number modified lines code significantly different workarounds fixes values Cliff’s delta mean difference number changed files small difference number modified lines code large

TABLE I SIZES UPSTREAM FIXES DOWNSTREAM WORKAROUNDS
                     Min   Max   Avg   25th   50th   75th
Files  Fixes           1     8     3      2      2      4
Files  Workarounds     1     6     2      1      2      3
SLOC   Fixes           1   829    93     19     36    105
SLOC   Workarounds     1   662    61     10     26     45

TABLE II
          Files    SLOC
Pvalue    0.019    0.014
delta     0.232    0.771

Combining boxplots results statistical tests conclude size workaround significantly smaller size corresponding upstream fix 2 Inspection code statistically comparing size downstream workarounds corresponding upstream fixes looked code make investigation general eight 51 crossproject bugs upstream fix corresponding downstream workaround designed manner developers sides similar idea modify projects facing bug example using Astropy normalizer led TypeError Sunpy playing mapcube peek animation sunpysunpy1532 caused bug ImageNormalize class Astropy include call inherited method autoscaleNone astropyastropy4117 address problem Sunpy Astropy used explicit call autoscaleNone Fig 2 shows downstream workaround upstream fix bug Additionally worth noting fix workaround proposed developer Another example shown astropyastropy3052 caused numpynumpy5251 downstream workaround copy upstream fix crossproject bug remaining 43 51 crossproject bugs downstream developers worked around different way upstream developers fix bugs seems accord intuition Whether withinproject crossproject bugs workaround shortterm solution injected place
true rootcause location crossproject bugs workaround placed downstream upstream buggy method called ultimate fix repair buggy method Intuitively two kinds modification usually different confirmed observations Section IVC discuss workaround patterns detail B RQ2 Common Bug Features manually inspecting issues reports 60 crossproject bugs found bugs something common totally identified three kinds common features Fortynine investigated bugs could classified remaining 11 bugs distinct characteristics cannot put category 1 Emerging cases crossproject bug reported downstream encountered emerging case upstream method cover Thirtynine 60 crossproject bugs could classified kind specifically divided 39 bugs two subcategories First original upstream method could process certain types forms data example astropyastropy3052 reported method NumPy use suitable format Unicode data numpynumpy5251 Astropyastropy4658 caused npmedian NumPy could handle masked arrays numpynumpy7330 LucadexpyTSA18 worked around upstream bug Pandas could read csv files column separator comma pandasdevpandas2733 Second upstream method might consider processing edge cases example method utilitiesautowrapufuncify Sympy failed length symbol list larger 31 sympysympy9593 failure resulted error method frompyfunc NumPy check number arguments numpynumpy5672 2 Wrong outputs Sometimes upstream methods might produce wrong results specific inputs could break downstream projects Six studied upstream bugs caused wrong outputs wrong outputs partly caused incorrect design functionality Blazeodo331 caused wrong output datetime64 series Pandas method return NAT instead NaN empty series pandasdevpandas11245 NumPy nplog1pinf returned NaN return Inf numpynumpy4225 led undesired result Nengo nengonengo260 unexpected outputs upstream methods introduced carelessly incompatible changes upstream developers fixed another bug developed new feature example method combinefirst new version Pandas performed unwanted conversion dates integers pandasdevpandas3593 made modules Clair unusable eikewelkclair43

Fig 2 comparison code downstream workaround corresponding upstream fix

a downstream workaround
  def updatefig(i, im, annotate, ani_data, removes):
      ...
      im.set_array(ani_data[i].data)
      im.set_cmap(self.maps[i].plot_settings['cmap'])
-     im.set_norm(self.maps[i].plot_settings['norm'])
+     norm = deepcopy(self.maps[i].plot_settings['norm'])
+     # The following explicit call is for bugged versions of Astropy's ImageNormalize
+     norm.autoscale_None(ani_data[i].data)
+     im.set_norm(norm)

b upstream fix
  def __call__(self, values, clip=None):
      ...
      values = np.array(values, copy=True, dtype=float)
+     # Set default values for vmin and vmax if not specified
+     self.autoscale_None(values)
      # Normalize based on vmin and vmax
      np.subtract(values, self.vmin, out=values)

3 Python 3 incompatibility upstream methods could perform correctly Python 3 could work perfectly Python 2 running downstream projects Python 3 original upstream method resulted bug example method loadtxt NumPy failed complex data Python 3 numpynumpy5655 affected downstream msmtools markovmodelmsmtools18 Totally four 60 crossproject bugs due Python 3 incompatibility C RQ3 Workaround Patterns investigating characteristics crossproject bugs workarounds summarized common patterns studied workarounds Generally found four workaround patterns covering workarounds 37 crossproject bugs 1 Pattern 1 Using different method upstream method downstream used bug simple way replace buggy one similar method Example Obspy developer experienced segmentation faults certain systems constructing NumPy array obspyobspy536 investigation bug caused error nparray numpynumpy3175 downstream developers worked around crossproject bug using npfrombuffer instead nparray Fig 3 shows downstream workaround Ten 60 workarounds designed adopt another method could provide functionality However replacements provided original upstream projects example npfrombuffer nparray comes NumPy phenomenon implies two things First libraries may tend develop multiple methods overlapping
capabilities Second downstream projects willing change dependencies reasonable since adding new dependency means effort laid downstream understand release cycle new upstream coordinate main challenge proposing kind workaround lies two aspects first find replacement method preferably designed identical upstream least stable Second parameters carefully modified fit new method since may require different kind parameter compared buggy method challenge also indicates automatic tool recommend similar APIs adapt parameters useful developers work around crossproject bug 2 Pattern 2 Conditionally using original method stated IVB crossproject bugs caused one uncovered cases upstream methods Therefore intuitive way work around bug use method cases result failure Example Scipyscipy3596 recorded bug scipysignalfftconvolve work well multithreaded environments digging issue developers found scipysignalfftconvolve made use numpyfftrfftn irfftn noncomplex inputs NumPy’s FFT routines actually not thread safe Though later numpynumpy4655 fixed bug NumPy SciPy developers still thought work around side support older NumPy version fix Fig 4 shows downstream workaround pre19 NumPy noncomplex inputs SciPy calls numpyfftrfftn irfftn one thread time thread safe cases use FFT method instead However though workaround helped users get trouble seemed little complex developer proposed easiest workaround would convert noncomplex inputs complex inputs adding 0j processed SciPy’s FFT routine instead buggy NumPy’s RFFT method idea disapproved developers NumPy’s RFFT method significantly faster better use method whenever possible another SciPy developer commented “Whatever fix done SciPy side would nice didn’t prevent someone new enough fixed NumPy using newer RFFT method multithreaded” Fig 3 downstream
workaround injected Obspy Fifteen 60 workarounds designed restrict use buggy upstream method covered cases two key points proposing workaround kind First developers determine conditions original used upstream method would fail ie uncovered cases Usually developers could find answer process diagnosing bug important decide deal failed cases inspecting 11 workarounds find developers either made used another method raised error exception eg sympysympy9593 3 Pattern 3 Adapting inputs use original method avoid failure caused uncovered cases developers may also choose convert inputs processable form correctly handled buggy upstream method Example Pyhrfpyhrf146 reported test failure seemed come scipymiscfromimage trying open 1bit images SciPy method would produce segmentation fault order avoid failure Pyhrf developers decided first convert 1bit image 8bit image could dealt SciPy method Fig 5 shows downstream workaround Nine 60 studied workarounds conform pattern Though seems direct way convert uncovered case covered case order use original upstream routine method always feasible 4 Pattern 4 Converting outputs original method work around buggy upstream methods produce wrong outputs certain inputs downstream developers possibly choose convert wrong results desired ones Example method combinefirst Pandas falsely converted dates integers pandasdevpandas3593 bypass bug downstream Clair explicitly called pdtodatetime convert timerelated data integers dates eikewelkclair43 Fig 6 shows downstream workaround Apart example two downstream projects worked around crossproject bugs way V DISCUSSION section discuss findings downstream workarounds Workaround Generation et al proposed workaround common practice downstream developers used cope crossproject bugs 2 Workarounds play significant role since bypass bad impact bugs waiting upstream fixes well shield end user affected even use buggy upstream version 2 Therefore suffering crossproject bug great use downstream developers could 
propose workaround timely Section IV summarized 60 crossproject bugs workaround three main categories largest number bugs new cases upstream method could process temporarily handle problem downstream developers may adopt another method similar functionality instead limit use buggy method within cases handle convert emerging case form buggy method deal facing crossproject bugs produce wrong results certain inputs downstream developers may continue use original method explicitly transform outputs correct form Summarizing bug types common workaround patterns help developers efficiently develop suitable workaround time also guide design automatic workaround generation tools discussion Section IVC tool supposed following tasks First search alternative methods functionality buggy method Second extract conditions upstream methods correctly work Third adapt input data suitable forms upstream methods able process opinions preferred workaround follow three principles whether generated hand tool First workaround could suppress bypass upstream bug make downstream run normally Second workaround supposed make code changes possible et al indicated workarounds would removed afterwards 2 Therefore workaround preferred designed way affect modules make easy deprecate Third workaround supposed use efficient methods order reduce performance B Workaround Recommendation ecosystem central projects used multiple projects example scientific Python ecosystem NumPy basic tool nearly projects within ecosystem depend Therefore error popular like NumPy may break one downstream projects may need work around crossproject bug waiting upstream fix circumstance downstream could benefit another responsive sibling proposed workaround bug Daskdask297 shows example Dask affected NumPy bug numpynumpy3484 developer found another Scikitlearn suffering bug digging code Scikitlearn indicated Dask could learn Scikitlearn commented “Possible solution would add function python 3 compatibility scikitlearn 
httpsgithubcomscikitlearnscikitlearnblobmastersklearnutilsfixespyL8” Dask copied solution Scikitlearn code workaround bug existing workaround sibling reduces workload developers suffering bug However find suitable workaround another seems nontrivial task First developers find projects also affected crossproject bug get know affected projects deal bug Last select appropriate workaround projects adapt Therefore workaround recommendation tool automates process could useful tool designed least three functionalities First predict projects may influenced bug learnt workaround Second check code changes extract downstream workarounds Last compare context affected modules different projects rank workarounds developers facing several technical challenges develop tool deserves study C Workaround Removing stated downstream workaround temporary solution injected downstream projects cope crossproject bug Unlike corresponding upstream fix ultimate permanent solution workaround may modified discarded later 2 indeed find cases shows developers intend remove change workarounds future Materialsinnovationpymks132 reported Pymks broke due bug Scikitlearn scikitlearnscikitlearn3984 downstream developer added key word argument size short term solution current dimension requirement buggy method Scikitlearn wrote commit “Sklearn developers already removed dimension requirement development version code version released keyword argument removed” pandasdevpandas9276 Pandas developer proposed workaround NumPy bug numpynumpy 5562 comment would reconsider decision upstream fixed bug Sympysympy9593 included workaround another NumPy bug numpynumpy5672 developer left comment code “maxargs set numpy compiletime constant NPYMAXARGS future version numpy modifies removes restriction variable changed removed” example see downstream developers could decide exact time modify remove workarounds time depends responsible upstream projects accomplish certain tasks eg releasing new version modify specific 
variables Consequently downstream developers need track progress concerning upstream projects order maintain workarounds accordingly absolutely adds burden downstream maintainers confirmed respondents survey posed et al 2 order reduce maintenance burden downstream developers automatic workaround modification removing tool desirable tool supposed detect occurrence upstream event may influence workaround give notification developers Another key function tool semiautomatically remove workarounds workarounds could deprecated Additionally time remove workarounds also worth studying workaround landmark case coordination upstream downstream projects fixing process crossproject bugs study lifecycle workaround help understand developers sides collaborate fix crossproject bugs developers different projects cooperate within ecosystem VI THREATS VALIDITY section discuss threats validity study first threat concerns accuracy identification workarounds fixes Kim et al pointed needed high quality bugfix information reduce superficial conclusions many bugfixes polluted 28 order identify workarounds fixes two authors individually reviewed issue reports manually related commits indicated reports crosschecked other’s results maximize accuracy data investigation second threat concerns unknown effect deviation variables statistical tests size workaroundfix normal distribution mitigate threats conclusions supported proper statistical tests chose Wilcoxon signedrank test Cliff’s delta effect size nonparametric tests require assumption underlying data distribution third threat concerns researchers’ preconceptions two authors conducted manual analysis followed procedure criteria collecting studied dataset identifying comparing fixes workarounds well summarizing bug features workaround patterns However general difficult completely eliminate influence researchers’ preconceptions order minimize personal bias discuss results especially unclear cases together last threat concerns generalization 
empirical results conducted study scientific Python ecosystem However crossproject bugs downstream workarounds occur within specific ecosystem cannot assume results generalize beyond specific environment conducted validation ecosystems desirable VII CONCLUSION FUTURE WORK previous work proposing workaround shown common practice downstream developers bypass impact crossproject bug study studied characteristics downstream workarounds First manually identified 60 crossproject bugs workaround 271 crossproject bugs scientific Python ecosystem data empirically compared workaround corresponding upstream fix summarized bug features workaround patterns main findings study follows general size workaround significantly smaller corresponding fix fix workaround usually different code structures crossproject bugs downstream developers worked around usually caused emerging case upstream method cannot process wrong output certain inputs Python 3 incompatibility Four patterns workarounds identified using another method similar functionality restricting buggy method range process converting inputs processable form correcting outputs using buggy method findings study also indicate needs possibility developing tools supporting workaround generation recommendation maintenance removal future work continue develop supporting tools well investigate lifecycle workarounds kinds ecosystems ACKNOWLEDGMENT work supported National Natural Science Foundation China 61472175 61472178 91418202 National Natural Science Foundation Jiangsu Province BK20130014 REFERENCES 1 E Kalliamvakou G Gousios K Blincoe L Singer German Damian indepth study promises perils mining GitHub Empirical Engineering pp 1–37 2015 2 W L Chen X Zhang Zhou B Xu developers fix crossproject correlated bugs case study GitHub scientific Python ecosystem Proceedings 39th International Conference Engineering 2017 p Accepted 3 Decan Mens Claes P Grosjean GitHub meets CRAN analysis interrepository package dependency problems 
Proceedings International Conference Analysis Evolution Reengineering 2016 pp 493–504 4 B Adams R Kavanagh E Hassan German empirical study integration activities distributions open source Empirical Engineering vol 21 3 pp 960–1001 Jun 2016 5 G Bavota G Canfora Di Penta R Oliveto Panichella Apache community upgrades dependencies evolutionary study Empirical Engineering vol 20 5 pp 1275–1317 Oct 2015 6 L Villarroel G Bavota B Russo R Oliveto Di Penta Release planning mobile apps based user reviews Proceedings 38th International Conference Engineering 2016 pp 14–24 7 H Valdivia Garcia E Shihab Characterizing predicting blocking bugs open source projects Proceedings 11th Working Conference Mining Repositories 2014 pp 72–81 8 X Xia Lo E Shihab X Wang X Yang ELBlocker Predicting blocking bugs ensemble imbalance learning Information Technology vol 61 pp 93–106 May 2015 9 H Zhong Z Su empirical study real bug fixes Proceedings 37th International Conference Engineering 2015 vol 1 pp 913–923 10 K Pan Kim E J Whitehead Toward understanding bug fix patterns Empirical Engineering vol 14 3 pp 286–315 Jun 2009 11 J Park Kim DH Bae empirical study supplementary patches open source projects Empirical Engineering vol 22 1 pp 436–473 May 2016 12 Jiang B Adams German patch make fast case study Linux kernel Proceedings 10th Working Conference Mining Repositories 2013 pp 101–110 13 Misirli E Shihab Kamei Studying high impact fixinducing changes Empirical Engineering vol 21 2 pp 605–641 Apr 2016 14 J Echeverria F Perez Abellanas J Panach C Cetina Pastor Evaluating bugfixing Product Lines industrial case study Proceedings 10th ACMIEEE International Symposium Empirical Engineering Measurement 2016 pp 1–6 15 C Le Goues Nguyen Forrest W Weimer GenProg generic method automatic repair IEEE Transactions Engineering vol 38 1 pp 54–72 Jan 2012 16 C Le Goues DeweyVogt Forrest W Weimer systematic study automated program repair fixing 55 105 bugs 8 Proceedings 34th International Conference 
Engineering 2012 pp 3–13 17 Mechtaev J Yi Roychoudhury Angelix scalable multiline program patch synthesis via symbolic analysis Proceedings 38th International Conference Engineering 2016 pp 691–701 18 Z Gu E Barr J Hamilton Z Su bug really fixed Proceedings 32nd ACMIEEE International Conference Engineering 2010 vol 1 p 55 19 Leszak E Perry Stoll case study root cause defect analysis Proceedings 22nd international conference engineering 2000 pp 428–437 20 Workaround Management Knowledge Online Available httpsprojectmanagementknowledgecomdefinitionswworkaround Accessed 08Apr2017 21 E MurphyHill Zimmermann C Bird N Nagappan design bug fixes Proceedings 35th International Conference Engineering 2013 pp 332–341 22 J Ko R DeLine G Venolia Information needs collocated development Teams Proceedings 29th International Conference Engineering 2007 pp 344–353 23 E Berglund Communicating bugs global bug knowledge distribution Information Technology vol 47 11 pp 709–719 2005 24 J Gibbons Wolfe Nonparametric Statistical Inference 2003 25 E Freeman G G Moisen comparison performance threshold criteria binary classification terms predicted prevalence kappa Ecological Modelling vol 217 1–2 pp 48–58 2008 26 G MacBeth E Razumiejczyk R Ledsema Cliff’s Delta calculator nonparametric effect size program two groups observations Universitas Psychologica vol 10 2 pp 545–555 2012 27 Yang Zhou H Lu L Chen Z Chen B Xu slicebased cohesion metrics actually useful effortaware postrelease faultproneness prediction empirical study IEEE Transactions Engineering vol 41 4 pp 331–357 2015 28 Kim H Zhang R Wu L Gong Dealing noise defect prediction Proceedings 33rd International Conference Engineering 2011 pp 481–490
::::
Towards Monetary Incentive Crowd Collaboration Case Study Github’s Sponsor Mechanism Xunhui Zhang Tao Wang Yue Yu Qiubing Zeng Zhixing Li Huaimin Wang zhangxunhuitaowang2005yuyuelizhixing15nudteducnqiubingzenggmailcomwhmw163com National University Defense Technology Changsha Hunan China ABSTRACT many forms financial support currently available still many complaints inadequate financing maintainers May 2019 GitHub world’s active social coding platform launched Sponsor mechanism step toward deeply integrating open source development financial support paper collects data 8028 maintainers 13555 sponsors 22515 sponsorships conducts comprehensive analysis explore relationship Sponsor mechanism developers along four dimensions using combination qualitative quantitative analysis examining developers participate mechanism affects developer activity obtains sponsorships mechanism flaws developers encountered process using find longtail effect act sponsorship maintainers’ expectations remaining unmet sponsorship shortterm slightly positive impact development activity sustainable sponsors participate mechanism mainly means thanking developers OSS use practice social status developers primary influence number sponsorships find Sponsor mechanism open source donations certain shortcomings need improvements attract participants CCS CONCEPTS • Computer systems organization → Embedded systems Redundancy Robotics • Networks → Network reliability KEYWORDS sponsor donation GitHub open source financial support 1 INTRODUCTION Open source development brought prosperity ecosystems characteristics distributed coordination free participation convenient sharing led emergence myriad open source projects largescale participation developers continuous development highquality projects However expansion scales also brought challenges maintenance continuously rapidly increasing feature requests bug fix reports 37 increasing pull request review workload 69 Although many continuous integration CI 
tools continuous deployment CD tools help reduce workload managers complicated highpressure maintenance work still subjects stress 66 Past studies shown current open source work still spontaneously performed volunteers 22 engage open source work hobby improve personal reputations learn new technologies intrinsic benefits motivate volunteers make open source contributions 21 However many core managers maintainers would like secure funding others open source work aforementioned challenges thereby alleviating related mental pressure financial burdens 5 57 67 present many ways open source sphere obtains financial support crowdfunding Kickstarter donations OpenCollective issue rewards BountySource IssueHunt 49 However mainly web portals serving open source contributors active social coding communities separation development activities financial support brings problems First difficult sponsors find active developers open source projects open source community Second open source contributors need spend considerable effort maintaining financial support platform May 2019 GitHub world’s popular hosting platform launched Sponsor mechanism characterized deep integration financial support social coding platform Sponsor mechanism supports sponsorship organizations projects targets mainly individual contributors GitHub community Therefore unlike past related studies 52 53 explore donation mechanism open source sphere perspective individual developers context paper aims explore donation open source sphere using Sponsor mechanism example conducted empirical study based mixed methods answered following research questions RQ1 individuals participate Sponsor mechanism feedback GitHub developers summarized eight reasons participation among sponsored developers six reasons participation among sponsors six reasons participating mechanism among individuals main reason participants used Sponsor mechanism relationship open source OSS usage main reason participating developers need 
sponsorship driven participate open source development nonmonetary character findings help optimize Sponsor mechanism attract participants satisfying different motivations contributors RQ2 effective sponsorship motivating developer OSS activity find quantitative analysis sponsor mechanism provided shortterm subtle boost contributors’ activities According results qualitative analysis developers agree sponsorship provide motivation satisfied available amounts contrast sponsors satisfied current mechanism findings shed light application Sponsor mechanism open source sphere problems surrounding work helps rationalize mechanism promote greater participation open source contributions among developers RQ3 likely receive sponsorship questionnaire results show making useful OSS contributions active critical factors obtaining sponsorship However according quantitative data analysis results factor affects sponsorship developer’s social status community findings provide actionable suggestions developers seeking sponsorships conflicting results also illuminate problems OSS donations RQ4 shortcomings Sponsor mechanism research reveals problems mechanism include usage deficiencies object orientation supported functions personalization Many developers complain donations apply open source ecosystems relevant mechanism needed promote healthy sustainable development ecosystem contributions paper follows best knowledge first indepth study comprehensively analyzes GitHub Sponsor mechanism quantitatively qualitatively analyze Sponsor mechanism along four dimensions including developers’ motivation participate mechanism’s effectiveness characteristics developers obtain sponsorships mechanism’s shortcomings provide actionable suggestions help developers participating Sponsor mechanism obtain sponsorship feasible advice improving mechanism’s effectiveness remainder paper organized follows Section 2 presents related work Section 3 describes background GitHub Sponsor mechanism Section 4 
presents study design paper Section 5 describe results research question discuss findings Section 6 describe threats Section 7 Finally Section 8 conclude paper describe future work
::::
2 RELATED WORK Open Innovation Science OIS concept unifies two domains open collaborative practices science ie open science OS open innovation OI 6 OS three pillars accessibility transparency inclusivity among inclusivity eg citizen science directly related knowledge production process OI various forms collaborative practice exist including crowdsourcing OSS development etc Regarding open initiatives motivation incentives participation always focus continuous research 4 70 Although different views relationship citizen science crowdsourcing OSS development follow relationships described present related work participation motivation monetary incentives three parts separately
::::
21 Citizen science traditional citizen science motivation participants varies greatly depending age 2 gender 48 educational background 46 level involvement 63 many cases monetary nonmonetary incentives positive effect participation 9 However Wiseman et al found nonmonetary incentives alone better online HCI projects promote highquality data participants 71 Knowles 38 also confirmed although monetary incentives enhanced participation undermined sustained participation volunteering initiatives specific projects eg conservation species monetary incentives even opposite effect 55 participants act sensors collect data volunteer idling computer brainpower classify large data sets citizen science projects 71 motivation participate primarily intrinsic 15 43 However motivation participate varies different projects imposition monetary incentives different effects Unlike traditional citizen science OSS development open innovation activity requiring deep involvement great deal experience motivation incentives participation may vary considerably
::::
22 Crowdsourcing Acting type online activity participants receive satisfaction given kind need economic social recognition selfesteem development individual skills 16 Hossain 34 classified motivators extrinsic intrinsic motivators extrinsic motivators include financial motivators eg cash social motivators eg peer recognition organizational motivators eg career development Intrinsic motivators directly related participants’ satisfaction task eg enjoyment fun Considering related incentives Liang et al 45 highlighted intrinsic extrinsic incentives could increase effort participation however extrinsic incentives weaken impact intrinsic motivation comparing paid unpaid tasks Mao et al 47 concluded monetary incentives make task processing speed faster quality reduced Based Feyisetan et al 18 improved paid microtasks engaging including sociality features game elements MTurk typical popular crowdsourcing platform based financial incentives gamification participants recruited paid rated participation microtasks ensure speed quality time 10 Unlike MTurk contribution Wikipedia incentivized monetary rewards Content contribution driven reciprocity selfdevelopment community participation relies altruism selfbelonging etc 73 seen related works many situations crowdsourcing different forms motivation incentive However unlike OSS development traditional crowdsourcing tasks mostly microtasks relatively simple require less time Moreover clear distinction roles ie core developers external contributors OSS contributors Contribution types include code contribution code review repository maintenance management etc 23 Open source development Successful OSS initiatives effectively change method development 30 39 improve development efficiency 31 60 ensure quality effective management 1 58 Many projects emerged along increasing number users participating development OSS community 28 context many companies involved contributing open source projects 32 However limited control influence 
daytoday OSS work decision processes 35 OSS still relies voluntary participation crowd labor 17 Many studies focused analyzing individuals’ motivations incentives participating OSS projects 14 20 33 42 59 72 Von Krogh et al 68 classified contributors’ motivations three categories namely intrinsic motivation eg ideology fun internalized extrinsic motivation eg reputation use extrinsic motivation eg career pay Among developers volunteer contribute open source projects motivation mainly intrinsic internalized extrinsic motivation 68 fulltime jobs spend spare time making open source contributions 21 However Hars et al 3 found paid promote continuous contribution developers types motivation Currently many ways obtain financial support open source initiatives eg donations bounties Many studies focused characteristics impact effectiveness form financial support example regarding bounties Zhou et al 77 studied relation issue resolution bounty usage found adding bounties would increase likelihood issue resolution Acting way recruiting developers setting bounties attracts developers want make money open source contributions facilitate completion complex tasks However unlike bounty donation way passively obtaining financial support Regarding open source donation Krishnamurthy et al 40 studied donation OSS platform found relation donation level platform association length relational commitment donation OSS Nakasai et al 50 51 analyzed incentives individual donors found benefits donors release could promote donations contrast bugs negatively affect number donations However focused eclipse projects Overney et al 53 studied impact donations broader perspective open source projects GitHub corresponds NPM packages explicitly mentions way donation READMEmd files found small fraction mainly active projects asked donations number received donations mainly associated age donations requested eventually used engineering activities However slight influence donation activities Although 
Overney et al thorough analysis projectlevel donation lacks analysis donation towards open source developers Also think adding qualitative analysis users’ perspective confirm quantitative findings help understand pros cons system design use
::::
3 BACKGROUND 31 Terminology help reader understand rest article introduce key terms related Sponsor mechanism Sponsor entity provides donations others Maintainer entity sponsored developers set Sponsor profile Nonmaintainer entity set Sponsors Sponsorship donation relationship sponsor maintainer AccountSetUpTime time maintainers set Sponsor profile accounts FirstSponsorTime time maintainers receive first sponsorship 32 Introduction Sponsor mechanism Currently GitHub workflow key elements sponsorship shown Figure 1 sponsorship constructed maintainer’s sponsor page clicking “select” button specific amount sponsor page preset maintainer setting Sponsor profile related GitHub account mainly consists following elements • Personal description maintainers free add text modify time main content cover basic personal information information need sponsored ways donation etc • Preset goal maintainers allowed set number sponsors sponsorships want get Sponsor mechanism add related descriptions goal • Featured projects part lists related projects maintainer currently works popularity • Preset tiers description part contains tiers set maintainer Sponsors choose tier pay according amount related description • Payment choices sponsors choose monthly onetime customized payment choosing way construct sponsorship sponsors get sponsor badge receive updates sponsored maintainer future 33 Preliminary analysis conduct statistical analysis use trends Sponsor mechanism Figure 2 shows number developers set Sponsor account number sponsorships changes time see number developers set account increased sharply around October 2019 new things inspire people’s interest times growth rate shows downward trend Meanwhile absolute number participants mechanism increased steadily although growth rate shows slight upward trend Compared GitHub shown strong increase user base 74 Sponsor mechanism attracted much attention context formulate RQ1 individuals participate Sponsor mechanism According manual 
observation GitHub developers’ sponsorship pages find developers spend time open source work sponsored others examples trend Tim Condon 64 Super Diana 61 short consider Sponsor mechanism may affect developers’ open source activities context ask RQ2 effective sponsorship motivating developer OSS activity successful cases individuals receiving support GitHub Sponsor mechanism eg Caleb Porzio sponsored 1314 sponsors 7 August 2021 However Sponsor participants successful many received sponsorships According Figure 3 141 maintainers sponsored least people receive sponsorships despite setting Sponsor account Among sponsors 763 sponsor others one time Based statistical analysis results consider developer characteristics lead sponsorships vein ask RQ3 likely receive sponsorships Currently many ways obtain financial support open source initiatives eg donations bounties different types financial support advantages disadvantages 49 falls participants especially participated multiple financial support mechanisms judge reasonableness effectiveness better understand users’ perceptions Sponsor mechanism thus enrich improve propose RQ4 shortcomings Sponsor mechanism
::::
4 STUDY OVERVIEW 41 Overall research methodology overall framework paper shown Figure 4 research methodology consisting two main parts data collection research methods 411 Data collection data collected using GitHub API goal find different kinds GitHub users maintainers sponsors nonmaintainers gather related basic information activities focus distinguish different kinds users acquisition relevant basic information details activities described subsequent section see Section 42 introduce research method detail acquired different types users following steps 1 used RESTful API 27 obtain users queried maintainers using field hasSponsorsListing GraphQL API 26 obtained 60732250 users deleted accounts among 7992 users individual maintainers 2 used field sponsorshipsAsMaintainer GraphQL API 26 look sponsorships maintainers received corresponding sponsors 3 Using list sponsors queried step 2 used field sponsorshipsAsSponsor GraphQL API 26 query related maintainers step supplement information maintainers set Sponsor profiles identified query process step 1 4 repeated steps 2 3 new maintainers sponsors appeared steps obtained 20579 users among 8028 maintainers 13555 sponsors 1004 users maintainers sponsoring others time also get 22315 times sponsorships users except maintainers marked nonmaintainers 412 Research methods answer research questions used combination quantitative qualitative analysis Regarding RQ1 RQ4 questions since difficult capture everyone’s reasons participation nonparticipation summarize shortcomings mechanism based platform information asked relevant people complete questionnaire RQ2 RQ3 questions collected maintainerrelated data quantitatively analyzed impact sponsorship behavior maintainer open source activity explored correlation factors amount sponsorship basis conducted qualitative analysis using questionnaire combination quantitative qualitative analysis led conclusions Next describe research method detail 42 Detailed introduction research methods 421 
Questionnaire Since three types interaction user Sponsor mechanism namely interactions sponsor maintainer nonmaintainer see Section 31 designed three different online surveys 75 surveys sponsors maintainers relate expectations satisfaction Sponsor mechanism survey nonmaintainers relates reason setting Sponsor feature account surveys start introduction research background purpose two types questions survey Demographic questions designed obtain participants’ information including role experience OSS development predefined answers inspired prior research 44 Main questions designed gather users’ views Sponsor mechanism Among main questions three kinds Openended questions aimed gathering answers Rating scale questions soliciting users’ satisfaction agreement levels Multiplechoice questions “Other” text field options aimed gathering largescale user feedback providing additional answers provide final openended question allow participants talk freely Sponsor mechanism discussed questions engineering researchers ensure items well designed study clear enough participants answer Finally used SurveyMonkey 62 deploy online surveys two rounds survey 1 pilot stage aimed gathering answers openended questions limited number participants 2 fullscale stage aimed gathering votes answer larger population statistics two stages seen Table 1 Participant recruitment recruit participants two rounds three different surveys took following steps Table 1 Statistics twostage survey Stage Statistic items Maintainers Sponsors Nonmaintainers Pilot # selected participants 400 400 400 # successful invitations 394 388 390 # response 45 (11.4%) 24 (6.2%) 9 (2.3%) Date collection June 8 2021 June 15 2021 Fullscale # selected participants 6104 6359 7500 # successful invitations 5951 6224 7343 # response 467 (7.8%) 396 (6.4%) 202 (2.8%) Date collection June 29 2021 July 13 2021
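The data-collection steps in Section 4.1.1 (steps 2 to 4) alternate between looking up the sponsors of known maintainers and the maintainers of known sponsors until no new account appears. A minimal sketch of that fixed-point expansion follows. It assumes the two lookups are supplied as plain callables; `sponsors_of` and `maintainers_sponsored_by` are hypothetical stand-ins for queries on the `sponsorshipsAsMaintainer` and `sponsorshipsAsSponsor` fields and are not real GitHub client functions.

```python
from collections import deque
from typing import Callable, Iterable, Set, Tuple


def expand_sponsor_graph(
    seed_maintainers: Iterable[str],
    sponsors_of: Callable[[str], Iterable[str]],               # hypothetical lookup (step 2)
    maintainers_sponsored_by: Callable[[str], Iterable[str]],  # hypothetical lookup (step 3)
) -> Tuple[Set[str], Set[str], Set[Tuple[str, str]]]:
    """Repeat steps 2 and 3 until no new maintainer or sponsor appears (step 4)."""
    maintainers: Set[str] = set(seed_maintainers)
    sponsors: Set[str] = set()
    sponsorships: Set[Tuple[str, str]] = set()  # (sponsor, maintainer) pairs

    maintainer_queue = deque(maintainers)
    sponsor_queue: deque = deque()

    while maintainer_queue or sponsor_queue:
        # Step 2: for each unvisited maintainer, record who sponsors them.
        while maintainer_queue:
            maintainer = maintainer_queue.popleft()
            for sponsor in sponsors_of(maintainer):
                sponsorships.add((sponsor, maintainer))
                if sponsor not in sponsors:
                    sponsors.add(sponsor)
                    sponsor_queue.append(sponsor)
        # Step 3: for each unvisited sponsor, record whom they sponsor.
        while sponsor_queue:
            sponsor = sponsor_queue.popleft()
            for maintainer in maintainers_sponsored_by(sponsor):
                sponsorships.add((sponsor, maintainer))
                if maintainer not in maintainers:
                    maintainers.add(maintainer)
                    maintainer_queue.append(maintainer)

    return maintainers, sponsors, sponsorships
```

Accounts that end up in both returned sets correspond to users who are maintainers and sponsor others at the same time; seed accounts that appear in neither role would be the nonmaintainers. Each account is queried at most once per role, so the loop terminates on any finite set of users.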
::::
means number eg response implies number responses 1 three types users maintainers sponsors nonmaintainers filtered whose email name information could openly accessed users might want receive questionnaires 2 three types users filtered active last month since May 3 2021 might focused open source work GitHub recent days step used GitHub API obtain users’ recent activity including top repositories contributed last month last update time field “updatedAt” GitHub 26 3 nonmaintainers selected users may eligible set Sponsor profile based location information list countries regions included GitHub Sponsor mechanism 25 4 completing three steps randomly selected 400 unique individuals type without overlap participants pilot stage 5 fullscale stage selected maintainers 6104 sponsors 6359 participants nonmaintainers due low response rate pilot stage filtered users according total number stars projects owned developers collected 23 June 2021 selected least ten stars assumed developers popular projects likely interested Sponsor mechanism use GitHub often randomly selected 7500 participants Response analysis selecting participants published questionnaire online sent web address participants via email email invitation contained basic information questionnaire publisher reason release number questions estimated time required fill questionnaire Based participants’ feedback pilot stage designed questionnaires fullscale stage removed 1 question maintainers 1 question sponsors 2 questions nonmaintainers due answers repetitive content relation answers questions extracted essential information responses turned open questions multiplechoice questions 3 maintainers 3 sponsors 1 nonmaintainers open coding card sorting method 78 first second fifth authors together avoid disturbing participants extended time collect responses stage relative pilot stage send second email reminder time different types participants dedicate different amounts attention Sponsor mechanism response rate varies 
greatly Nonmaintainers participate Sponsor mechanism may care want reply email analyzing multiplechoice questions first calculated voting rate preset option manually included textual response “Other” option preset taxonomy possible via closed coding method 78 new topic emerged integrated existing taxonomy analyzing last open question “Do anything else tell us Sponsor mechanism” extracted essential information textual response qualitative analysis facilitate analysis use MCx SCx OCx represent textual response questionnaire maintainers sponsors nonmaintainers respectively x indicates serial number comment first two questions questionnaire collected participants’ demographic information including status experience open source development fullscale stage results shown Table 2 over 70% participants category over three years OSS development experience 10% sponsors no OSS development experience indicates many sponsors sponsor others solely support OSS development maintenance Table 2 Demographic information participants fullscale stage (values in %) Questions Answers M S NM Q1 would best describe Developer working industry 62.3 80.0 65.5 Full time independent developer 16.6 10.0 8.0 Student 11.6 6.9 6.5 Academic researcher 3.7 3.6 16.0 Q2 many years OSS development experience Never 1.1 10.2 3.0 <1 year 2.2 4.6 6.5 1–3 years 10.1 14.5 12.6 3–5 years 21.9 22.6 23.1 5–10 years 33.6 26.9 27.1 >10 years 31.2 21.3 27.6 M maintainer S sponsor NM nonmaintainer 422 analysis aim analysis determine treat sponsorship intervention influences potential trends maintainers’ activities development discussion activities longterm perspective Therefore following guidelines previous studies 53 65 76 used method settings analysis shown Interventions set accountSetUpTime firstSponsorTime see Section 31 separate interventions assumed maintainers may increase activity accountSetUpTime attract others’ attention future sponsorship motivated increase open source contributions firstSponsorTime Responses set number commits development activity number discussions discussion
activity responses indicate different kinds activities GitHub Unstable period Similar previous studies 53 65 76 set 15 days interventions unstable period intervention periods retain enough analyzable data selected maintainers least six months activity interventions addition unstable period Therefore maintainer least $15 \times 2 + 6 \times 2 \times 30 = 390$ days activity GitHub Time window month intervention periods time window unstable period also time window Therefore $6 \times 2 + 1 = 13$ time windows independent variables follows Basic items intervention Binary variable indicating intervention time Continuous variable indicating time month start observation time window value range 0 12 time intervention Continuous variable indicating many months passed intervention before intervention time intervention $= 0$ otherwise time intervention $=$ time $- 6$ Developer characteristics number stars Continuous variable measured total number stars maintainerowned repositories start time window company Binary variable indicating whether company information exists data collection time goal Binary variable indicating whether maintainer sets goal sponsorship data collection time another way Binary variable indicating whether maintainer sets methods receiving donations data collection time hireable Binary variable indicating whether maintainer declares hireable status data collection time Developer activities number commits Continuous variable measured number commits start time window number discussions Continuous variable measured number discussions start time window built mixed effect linear regression model analysis maintainer identifier random effect measured factors fixed effects major advantage mixed effect model eliminate correlated observations within subject 19 time windows maintainer tend similar trend used lmer function lmerTest package R 41 fit models maintainer’s commit discussion activities better model performance transformed continuous variables make approximately normal comparable scale
logtransformation plus 0.5 standardization mean 0 standard deviation 1 56 reduce multicollinearity problem excluded factors variance inflation factor VIF values $\geq 5$ using vif function car package R 11 report coefficients related p values obtained way also report explained variance factor interpreted effect size relative total variance explained factors fitness models report marginal R2m conditional R2c Rsquared values using rsquaredGLMM function MuMIn package R 7 Together analysis visually present responses change time show activity change intuitively statistical analysis Since unstable period analysis analyze period separately using Wilcoxon paired test method presented following section 423 Wilcoxon paired test analysis unstable period ignored However Sponsor mechanism involves small amount money may influence maintainer behavior short term assume maintainers may great fluctuations OSS activity unstable period used paired nonparametric test method called Wilcoxon paired test 8 twosided tests alternative=greater alternative=less 12 see whether intervention increases decreases maintainer’s activity considered three kinds interventions including accountSetUpTime firstSponsorTime sponsorship used Cliff’s delta $\delta$ measure effect size 29 $|\delta| < 0.147$ indicating negligible effect size $0.147 \leq |\delta| < 0.33$ indicating small effect size $0.33 \leq |\delta| < 0.474$ indicating medium effect size $|\delta| \geq 0.474$ indicating large effect size 424 Hurdle regression analysis critical idea hurdle regression create dataset maintainer characteristics amount sponsorship established Therefore collected different characteristics maintainer heuristically including basic information social characteristics Sponsor mechanism characteristics developer activities characteristics amount sponsorship used number times maintainer sponsored Next present detailed descriptions collected variables Developer basic information user age Continuous variable measured time interval month since creation user account
GitHub community data collection time company Binary variable indicating whether maintainer introduces personal work situation detail email Binary variable indicating whether maintainer publicly provides contact information location Binary variable indicating whether maintainer discloses geographical location information hireable Binary variable indicating whether maintainer indicates availability hire Social characteristics followers Continuous variable measured number followers followings Continuous variable indicating many users maintainer follows Sponsor mechanism characteristics min tier Continuous variable measured minimum number dollars set maintainer donations max tier Continuous variable indicating maximum donation goal Binary variable indicating whether maintainer sets goal sponsorship • another way Binary variable indicating whether maintainer introduces modes receiving donations identified donation modes finding links funding platforms description sponsorship page platforms shown Table 9 compiled according collection Overney et al 53 supported external links GitHub 24 • introduction richness Continuous variable measured length introduction personal sponsorship page • user age sponsor account Continuous variable indicating time interval month see time influences amount sponsorship Activity characteristics • number commits Continuous variable measured total number commits GitHub accountSetUpTime data collection time • number discussions Continuous variable measured number comments including issue comments pull request comments commit comments accountSetUpTime data collection time characteristics • sum star number Continuous variable measured total number stars repositories created maintainer • sum fork number Continuous variable indicating number forks • sum watch number Continuous variable indicating number watchers • sum top repository star number Continuous variable measured total number stars top repositories maintainer contributed four months data 
collection 23 • number dependents Continuous variable measured number repositories rely watchers among projects owned maintainer building hurdle regression models removed maintainers less 3 months activity accountSetUpTime reduce impact time sponsorship reasoned sponsors need time find maintainers donate reduce zeroinflation response variance used hurdle regression 36 splitting sample two parts maintainers received donations others examine factors influence whether maintainer receives donations maintainers least 1 sponsorship examine amount received donations influenced aforementioned characteristics reduction multicollinearity problem report results use methods see Section 422
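The effect-size thresholds given for Cliff's delta in Section 4.2.3 can be illustrated with a short sketch. This is not the study's own code (the text does not say how the statistic was computed); the function names `cliffs_delta` and `effect_size_label` and the commit counts below are hypothetical, and only the four cut-offs (0.147, 0.33, 0.474) come from the text.

```python
from itertools import product


def cliffs_delta(after, before):
    """Cliff's delta: (pairs with after > before minus pairs with after < before) / all pairs."""
    if not after or not before:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for a, b in product(after, before) if a > b)
    less = sum(1 for a, b in product(after, before) if a < b)
    return (greater - less) / (len(after) * len(before))


def effect_size_label(delta):
    """Map |delta| to the magnitude labels quoted in the text."""
    magnitude = abs(delta)
    if magnitude < 0.147:
        return "negligible"
    if magnitude < 0.33:
        return "small"
    if magnitude < 0.474:
        return "medium"
    return "large"


# Hypothetical commit counts in the unstable period, before and after an intervention.
commits_before = [3, 5, 2, 8, 4]
commits_after = [6, 7, 2, 9, 10]
delta = cliffs_delta(commits_after, commits_before)
print(round(delta, 3), effect_size_label(delta))
```

The all-pairs comparison is quadratic in the sample sizes, which is fine for an illustration; a rank-based formulation would be the usual choice for thousands of maintainers.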
::::
5 RESULTS
::::
51 RQ1 individuals participate Sponsor mechanism research question questionnaire dedicated item three types participants ie Q3 maintainers sponsors nonmaintainers Table shows motivations reasons elaborated different types developers fullscale stage percentage votes option
::::
511 Related motivations results find motivations maintainers sponsors related use relationship RM1 RS1 indicate usage related projects leads sponsorship 649 maintainers 858 sponsors cite factor one motivation participating Sponsor mechanism consensus puts first place groups’ motivation lists People think users give back contributors various ways among Sponsor mechanism serves “nice way say thanks” MC23 “allow people easily fund projects” MC20 perspective sponsors developers grateful OSS use hope express gratitude eg “show support OSS heavily rely daily work Without OSS could built career data science” SC3 Promotion continuous OSS contributions RM2 RS2 reflect participants’ uniform motivation engage OSS contributions 631 784 maintainers sponsors respectively cite factor motivation factor thus ranks 2nd among enumerated reasons participation open source developers want devote open source projects need solve problem daily costs open source maintenance costs eg “I believe open source goodforhumanity idea need get paid live decent life” MC37 Therefore emergence Sponsor mechanism may help solve problems certain extent invest time open source projects eg “I really hoping get sponsorship could spend time focusing developing open source projects” MC11 sponsors also hope inspire contributors continue make outstanding contributions eg “motivate awesome work” SC5 Recognition OSS work RM4 RS3 indicate sponsors’ recognition maintainers total 399 maintainers 49 sponsors cite factor motivation participation motivation ranks 4th 3rd two groups respectively people sponsorship manifestation greater recognition sponsors income Support specific features RM7 RS5 188 maintainers 94 sponsors hope Sponsor mechanism help set agenda issue resolution priorities although many people think OSS related money eg “If money given others involved would feel pressed implement whatever want like industry projects want FLOSS completely independent corporate requests” OC5
::::
512 Motivation across different user types addition motivations mentioned related sponsor maintainer relationship motivations reasons related kinds users Maintainers 60 participants chose RM3 13 chose RM8 option 4 participants mentioned hope sponsorship cover infrastructure costs Moreover 289 participants even chose RM5 fun indicates different people different Table 3 Reasons participating participating Sponsor mechanism Reasonmaintainers Votes Reasonsponsors Votes Reasonnonmaintainers Votes RM1 allows users projects express thanksappreciation 649 RS1 benefit developer’s projects 858 RO1 need sponsored 393 RM2 Sponsorship motivate future 631 RS2 encourage developer continue contribution 784 RO2 contribute OSS money 383 RM3 Side income OSS contribution 606 RS3 show recognition developer’s work 695 RO3 work worth sponsored 284 RM4 reflect community recognition work 399 RS4 I’m interested developer’s projects 490 RO4 Never heard 264 RM5 fun 289 RS5 motivate developer work harder specific feature 94 RO5 It’s cumbersome 85 RM6 deserve rewarded past 218 188 RS6 know developer 89 RO6 available region 20 OSS contribution 104 RM7 able prioritize requirements sponsors eg fixing bugs 131 RM8 It’s way make living 19 main reason cited participation obtain express appreciation use open source projects recognize maintainer’s OSS contribution turn support may promote better contributions Maintainers seeking make money tend obtain extra income rather full livelihood sponsorship nonmaintainers addition personal reasons mixing open source projects money another critical consideration preventing participating 52 RQ2 effective sponsorship motivating developer OSS activity used following methods research question statistical analysis visualization analysis unstable period analysis based Wilcoxon paired test method qualitative analysis based questionnaire survey also explored two kinds interventions namely accountSetUpTime firstSponsorTime 521 Visualization Figures 58 present change 
activities time see figures commit discussion activities remain stable intervention However unstable period developers tend active usual response phenomenon analyzed persistent transient effects interventions using method Wilcoxon paired test method respectively 522 analysis Table 4 shows results analysis results show factor strongest correlation OSS activity associated historical activity ie number commits Commit Model number discussions Discussion Model four models associated historical activity explains 80 total variance impact funding sources find variance explained factor exceed 11 four models Therefore somewhat clear existence funding sources Sponsor mechanism influence exploration association mechanism open source activity number commits find accountSetUpTime firstSponsorTime slight growth trend intervention intervention show negative growth trend $\beta_{t} + \beta_{t\ \text{intervention}} < 0$ Additionally find intervention negatively correlated number commits $\beta_{\text{intervention}} < 0$ number discussions find results similar commit activity intervention Sponsor mechanism changes original slowly increasing dynamics reduces discussion activity Specifically intervention effect accountSetUpTime slightly negative effect firstSponsorTime regard results surprising setup Sponsor mechanism first sponsorship contribute maintainer’s commit activity discussion activity growth contrast slight inhibitory effect illuminate situation followed questionnaire explore maintainers’ subjective satisfaction Sponsorship mechanism motivating effect see Section 524 523 Wilcoxon paired test analysis Table 5 shows results Wilcoxon paired test Cliff’s delta number commits maintainer sets Sponsor account sponsored first time receives new sponsorship number commits intervention significantly higher number discussions find significant changes around three kinds interventions result indicates sponsor behavior leads shortterm increase commit activity discussion however sponsorship lead shortterm changes
contrast analysis Wilcoxon paired test analyzes changes activity unstable period demonstrating Sponsorship mechanism give shortterm boost development activity 524 Questionnaire survey explore effectiveness Sponsorship mechanism conducted independent research maintainers sponsors uncover subjective judgments efficacy mechanism response goal asked maintainers Q4 “How satisfied income sponsors” sponsors Q4 “As sponsor extent sponsorship meet expectations” Meanwhile asked maintainers directly internal perceptions effectiveness sponsorship incentives Q5 “To extent sponsorship motivate you” results shown Figure 9 sponsors find 537 think sponsorship meets expectations fully great deal 141 report expectations hardly met met maintainers find 504 consider sponsorship motivates fully great deal 225 think bring motivating effect However terms amount sponsorship find 207 maintainers either satisfied satisfied Table 4 Results analysis Commit Model Dependent variable scalelognumber commits 05 Discussion Model Dependent variable scalelognumber discussions 05 accountSetUpTime firstSponsorTime accountSetUpTime firstSponsorTime Coeffs Err Chisq Coeffs Err Chisq Coeffs Err Chisq Intercept 010 001 001 001 001 001 001 001 scalelognumber commits 05 059 001 519072 058 002 118538 scalelognumber discussions 05 002 001 345 003 002 229 scalelognumber stars 05 006 001 5523 007 001 2271 goal TRUE 006 001 1743 007 003 597 way TRUE 016 005 822 014 009 236 company TRUE 089 001 3856 011 003 1560 hireable TRUE 000 001 002 001 003 022 time 002 000 9611 003 000 6122 intervention TRUE 002 001 566 009 002 2554 time intervention 004 000 24592 005 000 9738 Number Observations 75516 20148 75516 20148 R2 R2 064 064 066 065 p 0001 p 001 p 005 01 Figure 9 Results 5point Likert scale questions income sponsorship 301 dissatisfied dissatisfied amount think main reason difference sponsors’ main motivation participate display gratitude inspire others etc giving funds Therefore sponsors satisfied behavior 
maintainers although half think sponsorship stimulating find approximately 20 satisfied amount sponsorship received shows open source sponsorship positive effect developers fact amount monetary rewards received sponsorship relatively small unlikely meet expectations maintainers terms shortterm effects Sponsor mechanism makes slightly positive contribution development activity significant impact discussion activity However impact sustained One possible reason actual amount support meet maintainers’ expectations makes difficult maintainers rely sponsorship income keep investing open source contributions 53 RQ3 likely receive sponsorships research question tried identify important factors influencing amount sponsorship provide advice maintainers analyzed verified results combination quantitative qualitative analysis qualitative analysis analyzed maintainers sponsors explored consistency perceptions sponsorship 531 Hurdle regression overall perspective see Table 6 hurdle regression models fit well R2 34 R2 39 respectively Even though 7465 maintainers 3 months activity setting Sponsor profile 2750 368 receive least one sponsorship Moreover 6 receive sponsorships 10 times 25 maintainers receive 100 sponsorships Therefore although many people want obtain sponsorship small number people succeed consider whether maintainer receives sponsorships columns 2 3 Table 6 followers factor representing social status substantial positive effect explaining 458 total variance However factor followings negatively correlated likelihood receiving sponsorship effect size 31 likely compared followings followers better represents centrality maintainers community maintainers large followings tend learn others community Discussion activity positively correlated likelihood sponsorship number discussions effect size 227 relatively speaking commit activity explains 03 variance possible explanation sponsored developers tend focus issues pull requests submitted sponsors give back attract attention 
others Commit activity common among GitHub developers many developers may focus issues sponsor tiers min tier negatively correlated likelihood sponsorship acquisition effect size 123 However max tier positively correlated explains 5 variance tiers sizable effects opposite directions influence likely many sponsors tend donate little money setting high min tier may cause abstain sponsorship However maintainers want obtain sponsorships cannot undervalue Trying increase max tier increase possibility sponsored Another thing maintainers note importance introduction text setting Sponsor account maintainers introduce greater length likely become sponsored effect size 51 factors negligible effects explained variances less 5 consider amount sponsorship received maintainers columns 4 5 Table 6 social status maintainers also positively correlated response followers effect size 653 time followings oppositely correlates response effect size 107 factor number discussions explains 96 total variance min tier variable becomes nonsignificant unlike receive sponsorship model possible explanation result setting min tier longterm solution securing sponsorship Developers need focused status daily activities community factors negligible effects 532 Questionnaire asked questions related maintainers Q6 “In way think obtain sponsorships” sponsors Q5 “What kind developer prefer sponsor” separately Table 7 presents results maintainers results reveal maintainers’ perspective producing useful projects tools WM1 WM4 seen likely draw sponsorships participating projects WM5 WM6 WM7 WM8 WM9 One possible reason Sponsor mechanism credit funds individual accounts sponsorship button homepage also needs configured owner sponsors want donate Sponsor mechanism eg reporting “I prefer sponsor projects specific developer” SC167 may end sponsoring project’s owner 545 maintainers think working hard obtain sponsorships WM2 However maintainers said sponsorship simply matter popularity eg “Purely popularity 
basically OSS Creators YouTube earn ton money” MC292 “I think mostly function celebrity operates rules” MC262 probably 541 maintainers chose WM3 1 option chosen 856 sponsored participants Moreover 205 chose least 5 options shows fact options offered feasible promoting sponsorships among maintainers relevant participants indicated “Donations don’t work” MC284 “It doesn’t matter people take it’s free” MC281 responses suggest reasons prevent people obtaining sponsorships would meet expectations limited individual participation characteristics platform mechanism design rather act sponsorship may suitable open source sphere Indeed 10 participants selected WM11 indicated way obtain sponsorship sponsors vast majority 851 chose WS1 suggests sponsors support developers involved open source projects sponsors use corresponds topranked way obtaining sponsorship WM1 selected maintainers suggesting best way obtain sponsorship opinion maintainers sponsors create projects people use Similarly half participants wanted sponsor projects personal interest WS2 developers made significant contributions WS3 find 311 sponsors chose sponsor independent developers WS5 However sponsors said independent developer enough development maintenance good open source projects tools needed eg “Independent developers nice tools” SC30 sponsors consider act sponsorship form charity—few people reported simply person rewarded hardship WS7 received many rewards WS6 Likewise sponsors want reward another developer simply know one another 154 chose WS8 eg “It usually library using know developer person” SC168
::::
Table 6 Result factors influencing sponsorship
Dependent variable receive sponsorship
Factor                                                Coefs      Err    Chisq
Intercept                                             −0.53∗∗∗   0.09   1.80∗   0.07
scale(log(user age + 0.5))                            −0.10∗     0.03   8.62∗∗
company TRUE                                          −0.26∗∗∗   0.06   18.08∗∗∗
email TRUE                                            −0.03      0.06   0.31
location TRUE                                         −0.11      0.09   1.41
hireable TRUE                                         −0.19∗∗    0.06   9.70∗∗
scale(log(followings + 0.5))                          0.96∗∗∗    0.04   545.36∗∗∗
scale(log(min tier + 0.5))                            −0.19∗∗∗   0.03   37.39∗∗∗
scale(log(max tier + 0.5))                            −0.42∗∗∗   0.04   146.89∗∗∗
goal TRUE                                             0.23∗∗∗    0.03   59.82∗∗∗
way TRUE                                              0.18∗      0.06   8.32∗
scale(log(user age sponsor account + 0.5))            0.28       0.22   1.54
scale(log(number commits + 0.5))                      0.02       0.03   0.40
scale(log(number discussions + 0.5))                  0.08       0.04   3.42
scale(log(sum star number + 0.5))                     0.73∗∗∗    0.05   270.29∗∗∗
scale(log(sum top repository star number + 0.5))      −0.10∗∗    0.04   7.48∗∗
scale(log(introduction richness + 0.5))               −0.13∗∗    0.04   9.55∗∗
scale(log(number dependents + 0.5))                   0.25∗∗∗    0.03   60.84∗∗∗
Number Observations                                   7465       2790
delta R²                                              0.34       0.39
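The two preprocessing steps behind the hurdle models (log transformation plus 0.5 followed by standardization, and splitting the sample into a received-or-not part and a positive-count part) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the helper names `scale_log` and `hurdle_split`, the `sponsorships` field, and the sample records are hypothetical, and the sample standard deviation is used on the assumption that the standardization matches R's `scale`.

```python
import math
from statistics import mean, stdev


def scale_log(values, offset=0.5):
    """Log-transform with an offset, then standardize to mean 0 and sd 1."""
    logged = [math.log(v + offset) for v in values]
    centre = mean(logged)
    spread = stdev(logged)  # sample sd (n - 1); assumes values are not all equal
    return [(x - centre) / spread for x in logged]


def hurdle_split(records, count_key="sponsorships"):
    """Split records for the two hurdle stages.

    Returns a 0/1 indicator for every record (stage 1: received any
    sponsorship?) and the subset with a positive count (stage 2: how many?).
    """
    received = [1 if r[count_key] > 0 else 0 for r in records]
    positives = [r for r in records if r[count_key] > 0]
    return received, positives


# Hypothetical maintainers; field names are illustrative only.
maintainers = [
    {"followers": 12, "sponsorships": 0},
    {"followers": 340, "sponsorships": 3},
    {"followers": 5, "sponsorships": 0},
    {"followers": 9800, "sponsorships": 12},
]
followers_scaled = scale_log([m["followers"] for m in maintainers])
received, positives = hurdle_split(maintainers)
print(received, len(positives))
```

The indicator list would feed a binary model over all maintainers, and the positive subset a count model; fitting either model is outside this sketch.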
::::
Table 7 Ways obtaining sponsorship
Way (maintainers)  Votes
WM1   Producing useful projects  62.6%
WM2   Staying active contributing community  54.5%
WM3   Advertising work community  54.1%
WM4   Producing valuable code  38.5%
WM5   Getting involved popular projects  29.1%
WM6   Getting involved projects adopted companies  25.5%
WM7   Getting involved longterm projects  21.6%
WM8   Getting involved less maintained yet important projects  19.1%
WM9   Getting involved projects led companies  8.8%
WM10  Providing localized content  7.4%
WM11  3.6%
Who (sponsors)  Votes
WS1   Developers whose projects benefit  85.1%
WS2   Developers whose projects I’m interested  60.3%
WS3   Developers make important contributions  50.9%
WS4   Developers active community  42.0%
WS5   Independent developers  31.1%
WS6   Developers haven’t received much sponsorship  24.1%
WS7   Developers hardship  18.7%
WS8   Developers know  15.4%
WS9   1.0%

maintainers sponsors think sponsorship builds relationships forged using OSS Active meaningful participation open source contributions also help maintainers gain attention However quantitative analysis reveals social popularity maintainer community decisive factor obtaining sponsorships 54 RQ4 shortcomings Sponsor mechanism research question investigated mechanism shortcomings found participants using Sponsor mechanism asked question “What shortcomings Sponsor mechanism” maintainers Q7 sponsors Q6 separately Table 8 presents results Among maintainers 131 thought Sponsor mechanism perfect SM6 could meet personal needs well among sponsors 331 thought mechanism perfect SS2 indicates satisfaction different types mechanism participants especially maintainers varies greatly current Sponsor mechanism meet maintainers’ needs well shortcomings include following main aspects resolved GitHub research process Discoverability maintainers results reveal 513 maintainers found difficult discovered sponsors SM1 however based feedback sponsors 196 found difficult determine sponsor SS3 larger share 401 found difficult assess urgently needed sponsorship SS1
Interactivity participants results find among maintainers 294 thought current Sponsor mechanism cannot support good direct communication sponsors SM2 among sponsors 118 wanted communication support SS5 thought burden developers interrupting normal development process “I don’t want burden developers asking communicate sponsors sponsor stringfree” SC195 Payments Many people including maintainers sponsors highlighted existing payment problems Sponsor mechanism including limited payment options 251 maintainers – SM3 limited sponsorship tiers inconvenient tax payments 193 maintainers – SM5 limited payment providers shortcomings eg limited payment options may resolved GitHub research process User distinction total 207 SM4 maintainers 105 SS6 sponsors mentioned distinction sponsors others development activities Geographical restrictions SM7 SS4 see 11 maintainers 132 sponsors thought support regions limits popularity participation 27 July 2021 37 regions supported leaving many people unable participate mechanism RO6 sponsors unable sponsor many people want eg “Not organizations want support joined GitHub sponsors” SC192 Lack contribution indicators Five participants noted lack valid OSS contribution indicators OSS contributions limited commits pull requests involved current sponsor hardly knows played significant role development eg “It easy measure OSS contribution Sometimes filing issues times documentation PRs” MC350 Moreover contributions small patches large projects difficult others find thus unlikely gain sponsorships eg “In case hardpressed get anything work making little addition massive piece software” MC379 Among sponsors want sponsor individual maintainers eg “I prefer sponsor projects specific developer” SC167 OSS donations Sponsor mechanism act donation GitHub sponsorship primarily users organizations created GitHub account find results 16 participants thought donation mechanism suitable current open source sphere Many reasons cited evaluation People take 
open source projects granted one wants pay eg “People still like pay software” MC355 Companies use open source initiatives gain revenue want give back open source eg “Most companies don’t fund open source dependencies” MC354 Donations passive income without regular income developers little motivation work fulltime open source projects eg “Donation makes far less revenue charging things” OC78 solve problems mentioned offer following actionable suggestions taking account participant feedback Discoverability maintainers Add “Sponsor” buttons relevant people release webpage “Recognition sponsors release repository would something think of” SC217 Add support integrated development environments IDEs allowing developers discover package dependencies quickly jump sponsor pages developing IDEs “Better discoverability integration developer tooling” SC65 Provide straightforward way show personal OSS contributions eg “Promote efforts like dashboard” MC126 Interactivity among participants Allow maintainers configure whether wish communicate directly sponsors interaction set different groups different sponsors similar Patreon’s integration solution Discord 54 eg “Lack integration payment tiers like Discord integration Patreon” MC337 Allow maintainers configure thankyou emails sent automatically receive sponsorship eg “Some kind thankyou setup send notes etc” MC109 Allow sponsors upload statements disclose expenses related sponsorship proceeds “Distribution money especially FOSS free open source projects” MC88 Payments Provide clear income expense statements sponsor maintainer automatically Integrate many payment providers possible basis meeting tax requirements User distinctions Let maintainers decide configurable form personal settings whether want treat sponsors differently nonsponsors addition option show distinctions add configuration options development activities show whether distinguish sponsors different sponsorship amounts eg “Developers allowed set permission levels 
based sponsorship Eg comment make requests you’re sponsor developer directly opts you’ve made contributions things like would really positively change culture GitHub collaboration” SC212 Geographical restrictions Provide support regions Lack contribution indicators Set multidimensional indicator contributions ensure rational allocation sponsorship funds OSS donations Future research synthesize feedback types open source participants reconsider improve sponsorship mechanism design appropriate form open source financial support
::::
Table 8 Shortcomings Sponsor mechanism
Shortcoming (maintainers)  Votes
M1  It’s hard others discover sponsorship  51.3%
M2  can’t interact sponsors GitHub eg expressing appreciation  29.4%
M3  Lack wide range payment options eg onetime/yearly/quarterly payment  25.1%
M4  GitHub distinctly mark sponsors eg cannot easily tell whether issue submitter sponsor  20.7%
M5  pay taxes  19.3%
M6  None It’s perfect  13.1%
M7  supported many regions  11.0%
M8  can’t declare dealt received money  10.1%
M9  9.4%
Shortcoming (sponsors)  Votes
S1  cannot assess urgently developer needs sponsored  40.1%
S2  None It’s perfect  33.1%
S3  It’s hard find developer sponsor  19.6%
S4  supported many regions  13.2%
S5  can’t interact developer sponsored GitHub  11.8%
S6  I’m distinctly marked projects whose maintainers sponsored eg submit issue  10.5%
S7  8.1%
research process GitHub fixed shortcomings eg onetime payment method

shortcomings Sponsor mechanism relate three main aspects Usage deficiencies difficulty participants finding lack good interaction support lack promotion lack adequate payment billing support etc Object orientation supported functions despite support organizations projects main targeting individuals sponsors need better support corporate sponsorship maintainers need better support multicontributor projects Personalization need configurability Sponsor mechanism reflect variation participant types motivations
::::
6 DISCUSSION study integrated sponsorship mechanism world’s popular open source platform GitHub found participation mechanism shown rapid growth participation open source projects Meanwhile longtail effect regarding number sponsorships obtained maintainers ie maintainers obtain many sponsorships even Compared work Overney et al 53 research brings us one step closer understanding incentive effect sponsorship individual developers collecting feedback participants open source donations taking GitHub Sponsor example Although article considers Sponsor mechanism lacks overall consideration comparative analysis open source sponsorship platforms However think article still provides guidance helping improve mechanism exploring essence open source donation paper explored four aspects Sponsor mechanism main findings insights follows individuals participate Sponsor mechanism open source contributors endorse open source donation nonparticipants participants Like motivations participation traditional citizen science 15 43 informationsharing crowdsourcing systems like Wikipedia 73 developers primarily intrinsically motivated participate open source contributions 21 However open source development activities complex require significant maintenance many contributors looking financial support 5 57 67 Among groups support use generally relationships built use specific serve backbone sponsorship behavior fact many users want reflect difference sponsors nonsponsors development activities way change method open source collaboration participation open source donation change might pleasant could lead open source sphere becoming money driven think making format personalized configurable may meet needs people without changing nature open source sphere necessary system designers consider regional support make Sponsor mechanism accessible better people want participate improving user experience eg better access bill tax effective sponsorship motivating developer OSS activity study donations 
projects Overney et al 53 found donation improve engineering activity study also found sponsorship shortterm positive stimulating effect maintainers’ development activity However impact last even slight negative effect long term possible reason result maintainers receive sufficient sponsorship Sponsor mechanism motivated contribute continuously may reflect characteristics open source donations maintainer passively receives sponsorship sponsor compulsion act sponsorship occur Thus situations may arise similar one questionnaire participants created heavily used tools received sponsorships compared horizontally results maintainers outcome may negative effect dealing blow maintainers reducing enthusiasm making open source contributions system designers important consider design conjunctive mechanisms adding ranking list according number received given sponsorships annual report locations Therefore sponsorship mechanism become continuous driving force enhancing impact sponsorship developer activities likely receive sponsorships Participants’ subjective perceptions conflict actual phenomenon Participants believe creating useful open source projects lead sponsorships However find significant factor influencing amount sponsorship social status inconsistent finding illustrates participants want express gratitude receive appreciation others usage relationship However case develop sufficiently useful tools receive substantive sponsorship Given feedback participants questionnaire situation likely cause maintainers complain lack publicity fact work leads sponsorships time developers make minor contributions popular projects outstanding contributions niche projects may ignored mechanism Comparing projectoriented donation eg open collective patreon 53 Although Sponsor mechanism targeted developers allows external contributors actively involved popular projects get donations However found results sponsors prefer projectoriented donation ie core developers owners popular used 
projects likely receive sponsorship Since money donated projects spent travelfood 53 think needed consider percentage contributors’ contributions achieve greater equity think open source developers want get sponsorship essential increase one’s community visibility advertising help oneself get attention building open source projects people use shortcomings Sponsor mechanism defects Sponsor mechanism manifested three main aspects usage defects objectoriented support mechanisms personalization setting problems time many developers believe sponsorship behavior suitable open source ecosystem free nature OSS leads unwillingness pay finding shows addition problems mechanism donations perfectly adapted open source ecosystem passivity uncertainty instability inherent donations make difficult maintainers rely continue make open source contributions long time time lack reasonable evaluations contributions funding allocation makes difficult sponsors determine sponsor much bounty approach “getting paid more” recognized people donation approach get paid immediately work precise goals 77 balance advantages bounty avoid regarding money guide open source development may goal future monetary incentive system design specific system design recommendations see Section 54 Overall Sponsor mechanism good attempt essential step toward achieving reasonable effective open source financial support mechanism still needs improvement meet needs developers
::::
7 THREATS VALIDITY questionnaire detection carelessly invalid responses 13 First number questions small time required answer short overlap questions feasible judge validity responses simply results Secondly set attention check items shorten user participation time However since users need click questionnaire jump SurveyMonkey site respond receiving email think ensured validity responses received extent conducting second round questionnaire survey avoid disturbing participants excessively sent send second third reminder emails time people set Sponsor account may care mechanism result response rate low analysis data collected different factors time window However due lack availability timestamps GitHub API factors measured values time data collection eg company change frequently hurdle regression factors included models several aspects related sponsorship developers However factors may influence whether developer obtain sponsorship much funding received Moreover number sponsorships accurately indicate amount money developer receives donations exist different tiers sponsors withdraw monthly sponsorship time However access data actual donations received developer Developers may obtain donations platforms maintain related projects consider funding total activities developers platforms paper explored effectiveness Sponsor mechanism individual users Sponsor mechanism also used organizational accounts avoid analysis confounded impact users processed data accordingly Therefore results apply GitHub’s organizational accounts According statistics 92 users set sponsors individual users
::::
8 CONCLUSION FUTURE WORK paper took GitHub’s Sponsor mechanism case study used mixed qualitative quantitative analysis method investigate four dimensions mechanism Regarding developers participate Sponsor mechanism found mainly related use OSS Regarding mechanism’s effectiveness found Sponsor system shortterm effect development activities long term slight decrease studied obtains sponsorships found social status maintainer community correlates strongly outcome followers sponsorships developer acquires Regarding drawbacks mechanism found addition shortcomings use participants felt Sponsor mechanism better attract support corporate sponsors people thought open source donation method needed improved attract developers participate Overall explored correlation donation behavior developers open source communities using GitHub Sponsor mechanism future work explore following aspects 1 advantages disadvantages different open source donation platforms effectiveness incentives open source activities 2 different types open source financial support reasonableness effectiveness mode ACKNOWLEDGMENTS work supported China National Grand RD Plan Grant No2020AAA0103504 Thanks GitHub users response questionnaire REFERENCES 1 Mark Aberdour 2007 Achieving quality opensource IEEE 24 1 2007 58–64 2 Bethany Alender 2016 Understanding volunteer motivations participate citizen science projects deeper look water quality monitoring Journal Science Communication 15 3 2016 A04 3 Shaosong Ou Alexander Hars 2002 Working free Motivations participating opensource projects International journal electronic commerce 6 3 2002 25–59 4 Maria J Antikainen Heli K Vaataja 2010 Rewarding open innovation communities—how motivate members International Journal Entrepreneurship Innovation Management 11 4 2010 440–456 5 Dryden Ash 2013 ethics unpaid labor OSS community httpswwwashedrydencomblogtheethicsofunpaidlaborandtheosscommunity Online accessed June 8 2021 6 Susanne Beck Carsten Bergenholtz Marcel Bogers 
TiareMaria Brasseur Marie Louise Conradsen Dáilétt Di Marco Andreas P Distel Leonard Dobusch Daniel Dörler Agnes Effert et al 2020 Open Innovation Science research field collaborative conceptualisation approach Industry Innovation 2020 1–50 7 Kenneth P Burnham David R Anderson 2002 Model Selection Multimodel Inference Practical InformationTheoretic Approach 2nd ed Springer 8 G Canfora L Cerulo Cimitile MD Penta 2014 changes affect entropy empirical study Empirical Engineering 19 1 2014 1–38 9 Francesco Cappa Jeffrey Laut Maurizio Porfiri Luca Giustiniano 2018 Bring aboard rewarding participation technologymediated citizen science projects Computers Human Behavior 89 2018 246–257 10 Krista Casler Lydia Buckel Elizabeth Hackett 2013 Separate equal comparison participants data gathered via Amazon’s MTurk social media facetoface behavioral testing Computers human behavior 29 6 2013 2156–2160 11 Jacob Cohen Patricia Cohen Stephen G West Leona Aiken 2013 Applied multiple regressioncorrelation analysis behavioral sciences Routledge 12 SciPy community 2008 API Reference scipystatswilcoxon httpsdocsscipyorgdocscipyreferencegeneratedscipystatswilcoxonhtml Online accessed July 31 2021 13 Paul G Curran 2016 Methods detection carelessly invalid responses survey data Journal Experimental Social Psychology 66 2016 4–19 14 Paul David Joseph Shapiro 2008 Communitybased production opensource know developers participate Information Economics Policy 20 4 2008 364–398 15 Margret C Domroese Elizabeth Johnson 2017 watch bees Motivations citizen science volunteers Great Pollinator Biological Conservation 208 2017 40–47 16 Enrique EstellésArolas Fernando GonzálezLadrónde Guervara 2012 Towards integrated crowdsourcing definition Journal Information science 38 2 2012 189–200 17 Yulin Fang Derrick Neufeld 2009 Understanding sustained participation open source projects Journal Management Information Systems 25 4 2009 9–50 18 Oluwaseyi Feyisetan Elena Simperl Max Van Kleek Nigel Shadbolt 2015 
Improving paid microtasks gamification adaptive furtherance incentives Proceedings 24th international conference world wide web 333–343 19 Andrzej Gałecki Tomasz Burzykowski 2013 Linear mixedeffects model Linear MixedEffects Models Using R Springer 245–273 20 Rishab Aiyer Ghosh 2005 Understanding free developers Findings FLOSS study Perspectives free open source 28 2005 23–47 21 GitHub 2016 Getting Paid Open Source Work httpsopensourceguidegettingpaid Online accessed June 8 2021 22 GitHub 2017 Open Source Survey httpsopensourceurveyorg2017 Online accessed June 8 2021 23 GitHub 2021 personal dashboard httpsdocsgithubcomengithubsettingupandmanagingyourgithubuseraccountmanaginguseraccountsettingsaboutyourpersonaldashboardfindingyourtoprepositoriesandteams Online accessed May 24 2021 24 GitHub 2021 Displaying sponsor button repository httpsdocsgithubcomengithubadministeringarepositorymanagingrepositorysettingsdisplayingasponsorbuttoninyourrepository Online accessed May 22 2021 25 GitHub 2021 Invest powers world httpsgithubcomsponsors Online accessed July 30 2021 26 GitHub 2021 Reference GraphQL User API httpsdocsgithubcomengraphqlreferenceobjectsuser Online accessed July 30 2021 27 GitHub 2021 Reference RESTful List users API httpsdocsgithubcomenrestreferenceuserslistusers Online accessed August 1 2021 28 GitHub 2021 2020 State OCTOVERSE httpsoctoversegithubcom Online accessed February 4 2021 29 R J Grissom J J Kim 2007 Effect Sizes Research Broad Practical Approach Effect sizes research broad practical approach 30 Carl Gutwin Reagan Penner Kevin Schneider 2004 Group awareness distributed development Proceedings 2004 ACM conference Computer supported cooperative work ACM Chicago Illinois USA 72–81 31 Stefan Haefliger Georg Von Krogh Sebastian Spaeth 2008 Code reuse open source Management science 54 1 2008 180–193 32 Cynthia Harvey 2017 35 Top Open Source Companies httpswwwdatamationcomopensource35topopensourcecompanies Online accessed February 5 2021 33 Andrea 
Hemetsberger 2002 Fostering cooperation Internet Social exchange processes innovative virtual consumer communities ACR North American Advances 29 2002 354–356 34 Mokter Hossain 2012 Users’ motivation participate online crowdsourcing platforms 2012 International Conference Innovation Management Technology Research IEEE 310–315 35 Javier Luis Cánovas Izquierdo Jordi Cabot 2018 role foundations open source projects Proceedings 40th International Conference Engineering Engineering Society ACM Gothenburg Sweden 3–12 36 Jackman C Kleiber Zeileis 2008 Regression Models Count Data R Journal Statistical 27 8 2008 1–25 37 Jayanta Kanwal Pratibha Mahgul 2012 Bug Prioritization Facilitate Bug Report Triage Journal Computer Science Technology 27 2012 397–412 38 Bran Knowles 2013 Cybersustainability towards sustainable digital future Lancaster University United Kingdom 39 Bruce Kogut Anca Meitus 2001 Opensource development distributed innovation Oxford review economic policy 17 2 2001 248–264 40 Sandeep Krishnamurthy Arvind K Tripathi 2009 Monetary donations open source platform Research Policy 38 2 2009 404–414 41 Alexandra Kuznetsova Per B Brockhoff Bune H B Christensen 2017 InterTest Package Tests Linear Mixed Effects Models Journal Statistical 82 13 2017 1–26 httpsdoiorg1018637jssv082i13 42 Karim Lakhani Robert W 2005 Hackers Understanding Motivation Effort FreeOpen Source Projects MIT Press Cambridge 43 Lincoln R Larson Caren B Cooper Sara Futch Devyani Singh Nathan J Shipley Kathy Dale Geoffrey LeBaron John Takekawa 2020 diverse motivations citizen scientists conservation emphasis grow volunteer participation progresses Biological Conservation 242 2020 108428 44 Huigang Li Yue Yu Tao Wang Gang Yin Shanhan Li Huaimin Wang 2021 Still Working Empirical Study Pull Request Abandonment IEEE Transactions Engineering 2021 1–1 httpsdoiorg101109TSE20213053403 45 Debra J Mesch Patrick Rooney Kathryn Steinberg Brian Denton 2006 effects race gender marital status giving volunteering 
Indiana Nonprofit Voluntary Sector Quarterly 35 4 2006 565–587 46 Nadia 2015 handy guide financial support open source httpsgithubcomnayalialemonadestandblobmasterREADMEmd Online accessed June 8 2021 47 Keitaro Nakasai Hideaki Hata Kenichi Matsumoto 2018 donation badges appealing case study developer responses eclipse bug reports IEEE 36 3 2018 22–27 48 Keitaro Nakasai Hideaki Hata Saya Onoue Kenichi Matsumoto 2017 Analysis donations eclipse 8th International Workshop Empirical Engineering Practice IWESEP IEEE Tokyo Japan 18–22 49 Cassandra Overney 2020 Hanging Thread Empirical Study Donations Open Source Proceedings ACMIEEE 42nd International Conference Engineering Companion Proceedings Seoul South Korea ICSE ’20 Association Computing Machinery New York NY USA 131–133 httpsdoiorg10114533778123382170 50 Cassandra Overney Jens Meinicke Christian Kästner Bogdan Vasilescu 2020 Get Rich Empirical Study Donations Open Source Proceedings ACMIEEE 42nd International Conference Engineering Seoul South Korea ICSE ’20 Association Computing Machinery New York NY USA 1209–1221 httpsdoiorg10114533778113380410 51 Patrícia Tiago Maria João Gouveia César Capinha Margarida SantosReis Henrique Pereira 2017 influence motivational factors frequency participation citizen science activities Nature Conservation 18 2017 61 52 Cassandra Overney 2020 Become sponsor Super Diana httpsgithubcomsponsorsalphacentauri2 Online accessed May 26 2021 53 SurveyMonkey 1999 httpswwwsurveymonkeycom Online accessed May 26 2021 54 Andrew Schofield Grahame Cooper 2006 Participation Free Open Source Communities Empirical Study Community Members’ Perceptions Open Source Systems Ernesto Damiani Brian Fitzgerald WaiChi Scacchi Marco Scotto Giancarlo Succi Eds Springer US Boston 221–231 55 Manuel Sojer Joachim Henkel 2010 Code reuse open source development Quantitative evidence drivers impediments Journal Association Information Systems 11 12 2010 2 56 Diana Super 2020 Become sponsor Super Diana 
httpsgithubcomsponsors0xTim Online accessed May 26 2021 57 Asher Trockman Shurui Zhou Christian Kästner Bogdan Vasilescu 2018 Adding Sparkle Social Coding Empirical Study Repository Badges Npm Ecosystem Proceedings 40th International Conference Engineering Gothenburg Sweden ICSE ’18 Association Computing Machinery New York NY USA 511–522 httpsdoiorg10114531801553180209 58 Lian Tung 2020 Redis database creator Sanfilippo I’m stepping opensource httpswwwzdnetcomarticleredisdatabasecreatorsanfilippowhyimsteppingdownfromtheopensourceproject Online accessed June 8 2021 59 Steven J VaughanNichols 2021 Hard work poor pay stresses opensource maintainers httpswwwzdnetcomarticlehardworkandpoorpaystressesoutopensourcemaintainers Online accessed Jun 8 2021 60 Georg Von Krogh Stefan Haefliger Sebastian Spaeth Martin W Wallin 2012 Carrots Rammbocks Motivation Social Practice Open Source Development MIS Q 36 2 Jun 2012 649–676 61 Jing Wang Patrick C Shih John Carroll 2015 Revisiting Linus’s law Benefits challenges open source peer review International Journal HumanComputer Studies 77 2015 52–65 httpsdoiorg101016jijhcs201501005 62 John Willinsky 2005 unacknowledged convergence open source open access open science First Monday 10 8 Aug 2005 httpsdoiorg105210fmv10i81265 63 Sarah Wiseman Anna L Cox Sandy JJ Gould Duncan P Brumby 2017 Exploring effects nonmonetary reimbursement participants HCI research Human Computation 2017 64 Bo Xu Donald R Jones Bingxia Shao 2009 Volunteers’ involvement online community based development Information Management 46 3 2009 151–158 httpsdoiorg101016jim200812005 65 Bo Xu Dahui Li 2015 empirical study motivations content contribution community participation Wikipedia Information management 52 3 2015 275–286 66 Yue Yu Gang Yin Huaimin Wang Tao Wang 2014 Exploring Patterns Social Behavior GitHub Proceedings 1st International Workshop CrowdBased Development Methods Technologies Hong Kong China CrowdSoft 2014 Association Computing Machinery New York NY USA 
31–36 httpsdoiorg10114526665392666571 67 Xunzhao Zhang Tao Wang Yue Yu Quheng Zeng Zhiying Li Huaimin Wang 2012 Questionnaire design GitHub Sponsor mechanism 2022 httpsdoiorg105281ZENODO5715824 68 Yangyang Zhao Alexander Serebrenik Yuming Zhou Vladimir Filkov Bogdan Vasilescu 2017 impact continuous integration development practices largescale empirical study 2017 32nd IEEEACM PLATFORMS BESIDES SPONSOR MECHANISM Table 9 platforms obtaining OSS financial support Name URL Bountysource httpswwwbountysourcecom Flattr httpsflattrcom IssueHunt httpsissuehuntio Kickstarter httpswwwkickstartercom Liberapay httpsliberapaycom Gittip httpsgratipaycom Gratipay httpsgratipaycom OpenCollective httpsopencollectivecom Otechie httpsotechiecom Patreon httpswwwpatreoncom PayPal httpswwwpaypalcom Tidelift httpstideliftcom Tip4Commit httpstip4commitcom LFX Mentorship formerly CommunityBridge httpslfxlinuxfoundationorgtoolsmentorship Kofi httpskoficom
::::
Usage Costs Benefits Continuous Integration OpenSource Projects Michael Hilton Oregon State University USA hiltonmeecsoregonstateedu Timothy Tunnell University Illinois USA tunnell2illinoisedu Kai Huang University Illinois USA khuang29illinoisedu Darko Marinov University Illinois USA marinovillinoisedu Danny Dig Oregon State University USA digdeecsoregonstateedu ABSTRACT Continuous integration CI systems automate compilation building testing Despite CI rising big success story automated engineering received almost attention research community example widely CI used practice costs benefits associated CI Without answering questions developers tool builders researchers make decisions based folklore instead data paper use three complementary methods study usage CI opensource projects understand CI systems developers use analyzed 34544 opensource projects GitHub understand developers use CI analyzed 1529291 builds commonly used CI system understand projects use use CI surveyed 442 developers data answered several key questions related usage costs benefits CI Among results show evidence supports claim CI helps projects release often CI widely adopted popular projects well finding overall percentage projects using CI continues grow making important timely focus research CI CCS Concepts • engineering → Agile development testing debugging Keywords continuous integration mining repositories INTRODUCTION Continuous Integration CI emerging one biggest success stories automated engineering CI systems automate compilation building testing deployment example automation reported 22 help Flickr deploy production 10 times per day Others 40 claim adopting CI agile planning process product group HP reduced development costs 78 success stories led CI growing interest popularity Travis CI 17 popular CI service reports 300000 projects using Travis State Agile industry survey 48 3880 participants found 50 respondents use CI State DevOps report 49 finds CI one indicators high performing 
organizations Google Trends 11 shows steady increase interest CI searches “Continuous Integration” increased 350 last decade Despite growth CI published research paper related CI usage 53 preliminary study conducted 246 projects compares several quality metrics projects use use CI However study present detailed information projects use CI fact despite folkloric evidence use CI systematic study CI systems lack basic knowledge extent opensource projects adopting CI also answers many important questions related CI costs CI CI deliver promised benefits releasing often helping make changes eg merge pull requests faster developers maximize usage CI Despite widespread popularity CI little quantitative evidence benefits lack knowledge lead poor decision making missed opportunities Developers choose use CI missing benefits CI Developers choose use CI might using fullest potential Without knowledge CI used tool builders misallocating resources instead data automation improvements needed users studying CI researchers blind spot prevents providing solutions hard problems practitioners face paper use three complementary methods study usage CI opensource projects understand extent CI adopted developers CI systems developers use analyzed 34544 opensource projects GitHub understand developers use CI analyzed 1529291 builds Travis CI commonly used CI service GitHub projects Section 41 understand projects use use CI surveyed 442 developers data answer several research questions grouped three themes Theme 1 Usage CI RQ1 percentage opensource projects use CI RQ2 breakdown usage different CI services RQ3 certain types projects use CI others RQ4 opensource projects adopt CI RQ5 developers plan continuing use CI found CI widely used number projects adopting CI growing also found popular projects likely use CI Theme 2 Costs CI RQ6 opensource projects choose use CI RQ7 often projects evolve CI configuration RQ8 common reasons projects evolve CI configuration RQ9 long CI builds take average 
found common reason developers using CI lack familiarity CI also found average makes 12 changes CI configuration file many changes automated Theme 3 Benefits CI RQ10 opensource projects choose use CI RQ11 projects CI release often RQ12 projects use CI accept pull requests RQ13 pull requests CI builds get accepted faster terms calendar time RQ14 CI builds fail less master nonmaster branches first surveyed developers perceived benefits CI empirically evaluated claims found projects use CI release twice often use CI also found projects CI accept pull requests faster projects without CI paper makes following contributions Research Questions designed 14 novel research questions first provide indepth answers questions usage costs benefits CI Data Analysis collected analyzed CI usage data 34544 opensource projects analyzed indepth CI data subset 620 projects 1529291 builds 1503092 commits 653404 pull requests Moreover surveyed 442 opensource developers chose use use CI Implications provide practical implications findings perspective three audiences researchers developers tool builders Researchers pay attention CI passing fad developers list several situations CI provides value Moreover discovered several opportunities automation helpful tool builders details data sets results available httpcopeeecsoregonstateeduCISurvey
::::
2 OVERVIEW CI
::::
21 History Definition CI idea Continuous Integration CI first introduced 1991 Grady Booch 26 context objectoriented design “At regular intervals process continuous integration yields executable releases grow functionality every release” idea adopted one core practices Extreme Programming XP 23 However idea began gain acceptance blog post Martin Fowler 37 2000 motivating idea CI often integrate better key making possible according Fowler automation Automating build process include retrieving sources compiling linking running automated tests system give “yes” “no” indicator whether build successful automated build process triggered either manually automatically actions developers checking new code version control ideas implemented Fowler CruiseControl 9 first CI system released 2001 Today 40 different CI systems wellknown ones include Jenkins 12 previously called Hudson Travis CI 17 Microsoft Team Foundation Server TFS 15 Early CI systems usually ran locally still widely done Jenkins TFS However CI service become popular eg Travis CI available service even Jenkins offered service via CloudBees platform 6
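The automated build described above (retrieve sources, compile and link, run automated tests, report "yes" or "no") can be sketched as follows. This is a minimal illustration, not the behavior of CruiseControl, Jenkins, or Travis CI: the step commands are placeholders and the names `EXAMPLE_STEPS` and `run_build` are hypothetical.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

# Hypothetical pipeline: each step is (label, command). The commands are
# placeholders; a real project would fetch sources, compile, link, and run
# its own test suite here.
Step = Tuple[str, Sequence[str]]

EXAMPLE_STEPS: List[Step] = [
    ("retrieve sources", [sys.executable, "-c", "print('sources retrieved')"]),
    ("compile and link", [sys.executable, "-c", "print('build done')"]),
    ("run automated tests", [sys.executable, "-c", "import sys; sys.exit(0)"]),
]


def run_build(steps: Sequence[Step]) -> bool:
    """Run steps in order; return True ("yes") only if every step exits with 0."""
    for label, command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Stop at the first failing step, as a failed build would.
            print(f"build failed at step: {label}")
            return False
    return True


if __name__ == "__main__":
    print("yes" if run_build(EXAMPLE_STEPS) else "no")
```

A CI service would trigger `run_build` either manually or automatically, for example whenever a developer checks new code into version control.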
::::
22 Example Usage CI present example CI comes data pull request using found httpsgithubcomRestKitRestKitpull2370 developer named “AdlaiHoller” created pull request 2370 named “Avoid Flushing InMemory Managed Object Cache Accessing” work around issue titled “Duplicate objects created inserting relationship mapping using RKInMemoryManagedObjectCache” RestKit 13 developer made two commits created pull request triggered Travis CI build build failed failing unit tests RestKit member “segiddins” commented pull request asked AdlaiHoller look test failures AdlaiHoller committed two new changes pull request commits triggered new CI build first build failed second successful CI build passed RestKit team member commented “seems fine” merged pull request
::::
3 METHODOLOGY understand extent CI used CI systems developers use analyzed 34544 opensource projects GitHub breadth corpus understand developers use CI analyzed 1529291 builds popular CI system depth corpus understand projects use use CI surveyed 442 developers 31 Breadth Corpus breadth corpus large number projects information CI services uses use breadth corpus answer broad questions usage CI opensource projects collected data corpus primarily via GitHub API first sorted GitHub projects popularity using star rating whereby users mark “star” projects like hence accumulate stars started inspection top list first manually looking top 50 projects collected publicly available information projects use CI used learned manual inspection write script programmatically classify CI service uses four CI services able readily identify manually later script sorted order usage Travis CI 17 CircleCI 5 AppVeyor 2 Werker 18 services provide public API’s queried determine using service Moreover wanted ensure collected complete data possible examined data hand found several projects using CloudBees 6 CI service powered Jenkins CI However given list GitHub projects reliable way programmatically identify GitHub API projects use CloudBees contrast Travis CI uses organization names GitHub making easy check correspondence Travis CI GitHub projects contacted CloudBees sent us list opensource projects CloudBees build set wrote script parse list inspect build information search corresponding GitHub repository repositories build CloudBees used data identify projects breadth corpus use CloudBees yielded 1018 unique GitHub repositoriesprojects check whether projects refer CloudBees searched case insensitive “CloudBees” README files projects found 256 contain “CloudBees” words contacted CloudBees directly using information available GitHub would missed large number projects use CloudBees Overall breadth corpus consists 34544 projects collected following information name owner CI systems uses 
popularity measured number stars primary programming language determined GitHub 32 Depth Corpus depth corpus fewer projects collect information publicly available subset projects collected additional data gain deeper understanding usage costs benefits CI Analyzing breadth corpus discussed Section 41 learned Travis CI far commonly used CI service among opensource projects Therefore targeted projects using Travis CI depth corpus First collected top 1000 projects GitHub ordered popularity 1000 projects identified 620 projects use Travis CI 37 use AppVeyor 166 use CircleCI 3 use Werker used Travis CI API1 collect entire build history depth corpus total 1529291 builds Using GHTorrent 39 collected full history pull requests total 653404 pull requests Additionally cloned every corpus access entire commit history source code 33 Survey Even collecting diverse breadth depth corpora still left questions could answer online data alone questions developers chose use use CI designed survey help us answer number “why” questions well provide us another data source better understand CI usage deployed survey sending email addresses publicly listed belonging organizations top 1000 GitHub projects rated popularity total sent 4508 emails survey consisted two flows three questions first question flows asked participant used CI Depending answer gave question second question asked reasons use use CI questions multiplechoice multipleselection questions users asked select reasons agreed populate choices collected common reasons using using CI mentioned websites 17 blogs 3819 Stack Overflow 14 Optionally survey participants could also write reasons already list third question asked participant plans using CI future projects incentivize participation raffled 50 USD gift card among survey respondents 442 98 response rate participants responded survey responses 407 921 indicated use CI 35 79 indicated use CI RESULTS section present results research questions Section 41 presents results usage CI 
Section 42 discusses costs CI Finally Section 43 presents benefits CI Rather presenting implications research question draw several research questions triangulate implications present Section 5 41 Usage CI determine extent CI used study percentage projects actively use CI also ask developers plan use CI future Furthermore study whether popularity programming language correlate usage CI RQ1 percentage opensource projects use CI found 40 projects breadth corpus use CI Table 1 shows breakdown usage Thus CI indeed used widely warrants investigation 1We grateful Travis CI developers promptly resolving bug report submitted prior resolving bug report one could query full build history projects Additionally know scripts find CI usage eg projects run privately hosted CI systems discussed Section 62 reliably detect use public CI services API makes possible query CI service based knowing GitHub organization name Therefore results present lower bound total number projects use CI Table 2 CI usage Service top row shows percent CI projects using service second row shows total number projects service Percents add 100 due projects using multiple CI services Usage CI Service Travis CircleCI AppVeyor CloudBees Werker 901 191 35 16 04 12528 2657 484 223 59 RQ2 breakdown usage different CI services Next investigate CI services widely used breadth corpus Table 2 shows Travis CI far widely used CI service result feel confident analysis focus projects use Travis CI CI service analyzing projects gives representative results usage CI services opensource projects also found projects use one CI service breadth corpus projects use CI 14 use one CI think interesting result deserves future attention RQ3 certain types projects use CI others better understand projects use CI look characteristics projects likely use CI CI usage popularity want determine whether popular projects likely use CI intuition CI leads better outcomes would expect see higher usage CI among popular projects alternatively 
projects using CI get better thus popular Figure 1 shows popular projects measured number stars also likely use CI Kendall’s tau p 000001 group projects breadth corpus 64 even groups ordered number stars calculate percent projects group using CI group around 540 projects popular starred group 70 projects use CI projects become less popular percentage projects using CI declines 23 Observation Popular projects likely use CI CI usage language examine CI usage programming language certain languages projects written primarily languages use CI others Table 3 shows projects sorted percentage projects use CI language breadth corpus data shows fact certain languages use CI others Notice usage CI perfectly correlate number projects using language measured number projects using language rank percentage Kendall’s tau p 068 words languages use CI popular languages like Ruby emerging languages like Scala Similarly among projects use CI less notice popular languages ObjectiveC Java well less popular languages VimL However observe many languages highest CI usage also dynamicallytyped languages eg Ruby PHP CoffeeScript Clojure Python JavaScript One possible explanation may absence static type system catch errors early languages use CI provide extra safety Observation observe wide range projects use CI popularity language correlate probability uses CI RQ4 opensource projects adopt CI next study projects began adopt CI Figure 2 shows number projects using CI time answer question depth corpus breadth corpus date first build use determine CI introduced Notice collecting data Travis CI founded 2011 10 Figure 2 shows CI experienced steady growth last 5 years also analyze age developers first introduced CI found median time around 1 year Based data conjecture many developers introduce CI early project’s Table 3 CI usage programming language language columns tabulate number projects corpus predominantly use language many projects use CI percentage projects use CI Language Total Projects 
Using CI Percent CI Scala 329 221 6717 Ruby 2721 1758 6461 Go 1159 702 6057 PHP 1806 982 5437 CoffeeScript 343 176 5131 Clojure 323 152 4706 Python 3113 1438 4619 Emacs Lisp 150 67 4467 JavaScript 8495 3692 4346 1710 714 4175 C 1233 483 3917 Swift 723 273 3776 Java 3371 1188 3524 C 1321 440 3331 C 652 188 2883 Perl 140 38 2714 Shell 709 185 2609 HTML 948 241 2542 CSS 937 194 2070 ObjectiveC 2745 561 2044 VimL 314 59 1879 development lifetime always seen something provides large amount value initial development Observation median time CI adoption one year RQ5 developers plan continuing use CI CI passing “fad” developers lose interest lasting practice time tell true answer get sense future could hold asked developers survey plan use CI next asked likely use CI next using 5point Likert scale ranging definitely use definitely use Figure 3 shows developers feel strongly using CI next top two options ‘Definitely’ ‘Most Likely’ account 94 survey respondents average answers 454 seems like pretty resounding endorsement continued use CI decided dig little deeper Even among respondents currently using CI 53 said would ‘Definitely’ ‘Most Likely’ use CI next Observation CI widely used practice nowadays predict future CI adoption rates increase even
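The "Percent CI" column of Table 3 and the rank comparison used in RQ3 can be reproduced in outline as follows. This is a sketch under stated assumptions: it uses only the Table 3 rows whose language name is unambiguous in this text, it computes a plain Kendall tau (tau-a, no tie correction) without a p-value, and the helper names are hypothetical. The resulting tau is therefore not expected to match the value reported for the full breadth corpus.

```python
from itertools import combinations

# (language, total projects, projects using CI), taken from Table 3.
# Rows with a missing or repeated language name in the table are omitted.
TABLE_3_ROWS = [
    ("Scala", 329, 221), ("Ruby", 2721, 1758), ("Go", 1159, 702),
    ("PHP", 1806, 982), ("CoffeeScript", 343, 176), ("Clojure", 323, 152),
    ("Python", 3113, 1438), ("Emacs Lisp", 150, 67),
    ("JavaScript", 8495, 3692), ("Swift", 723, 273), ("Java", 3371, 1188),
    ("Perl", 140, 38), ("Shell", 709, 185), ("HTML", 948, 241),
    ("CSS", 937, 194), ("ObjectiveC", 2745, 561), ("VimL", 314, 59),
]


def percent_ci(total: int, using_ci: int) -> float:
    """Percentage of a language's projects that use CI."""
    return 100.0 * using_ci / total


def kendall_tau_a(xs, ys) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs


totals = [total for _, total, _ in TABLE_3_ROWS]
percents = [percent_ci(total, used) for _, total, used in TABLE_3_ROWS]
tau = kendall_tau_a(totals, percents)

if __name__ == "__main__":
    for (name, _, _), pct in zip(TABLE_3_ROWS, percents):
        print(f"{name}: {pct:.2f}")
    print(f"tau between project count and percent CI: {tau:.3f}")
```

A tau near zero on the full data would be consistent with the observation that language popularity does not track CI usage.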
::::
42 Costs CI better understand costs CI analyze survey asked developers believe CI costly worth using data depth corpus estimate cost developers writing maintaining configuration CI service Specifically measure often developers make changes configuration files study make changes configuration files also analyze cost terms time run CI builds Note time builds take return result could unproductive time developers know proceed without knowing result RQ6 opensource projects choose use CI One way evaluate costs CI ask developers use CI survey asked respondents whether chose use use CI indicated asked tell us use CI Table 4 shows percentage respondents selected particular reasons using CI mentioned built list possible reasons collecting information various popular internet sources Interestingly primary cost respondents identified technical cost instead reason using CI “The developers familiar enough CI” know developers familiar enough CI tools eg Travis CI unfamiliar work take add CI including perhaps fully automating build completely answer question research needed second selected reason automated tests speaks real cost CI Table 4 Reasons developers gave using CI Reason Percent developers familiar enough CI 4700 doesn’t automated tests 4412 doesn’t commit often enough CI worth 3529 doesn’t currently use CI would like future 2647 CI systems high maintenance costs eg time effort etc 2059 CI takes long set 1765 CI doesn’t bring value already enough testing 588 Figure 4 Number changes CI configs median number changes 12 much value comes automated tests projects find developing good automated test suites substantial cost Even cases developers automated tests questioned use CI particular regression testing general one respondent P74 even said “In 4 years tests yet catch single bug” Observation main reason opensource projects choose use CI developers familiar enough CI RQ7 often projects evolve CI configuration ask question identify often developers evolve CI configurations 
“writeonceandforgetit” situation something evolves constantly Travis CI service configured via YAML 20 file named travisyml project’s root directory YAML humanfriendly data serialization standard determine often changed configuration analyzed history every travisyml file counted many times changed calculate number changes commits depth corpus Figure 4 shows number changescommits travisyml file life observe median number changes project’s CI configuration 12 times one projects changed CI configuration 266 times leads us conclude many projects setup CI minimal involvement 25 projects 5 less changes CI configuration projects find changing CI setup quite often Observation projects change configurations relatively often worthwhile study changes Table 5 Reasons CI config changes Config Area Total Edits Percentage Build Matrix 9718 1470 Install 8549 1293 Build Script 8328 1259 Build Language Config 7222 1092 Build Env 6900 1043 Build Script 6387 966 Install 4357 659 Whitespace 3226 488 Build platform Config 3058 462 Notifications 2069 313 Comments 2004 303 Git Configuration 1275 193 Deploy Targets 1079 163 Build Success 1025 155 Build Script 602 091 Deploy 133 020 Deploy 79 012 Custom Scripting 40 006 Build Failure 39 006 Install 14 002 Install 10 002 Mysql 5 001 Build Success 3 000 Allow Failures 2 000 RQ8 common reasons projects evolve CI configuration better understand changes CI configuration files analyzed changes made travisyml files depth corpus YAML structured language parse file determine part configuration changed Table 5 shows distribution changes common changes build matrix Travis specifies combination runtime environment exclusionsinclusions example build matrix Ruby could specify runtimes rvm 22 rvm 19 jruby build environment rails2 rails3 exclusionsinclusions eg exclude jruby rails2 combinations built except excluded example would 5 different builds common changes included dependent libraries install building travisyml calls install changes build script 
Also many changes due version changes dependencies RQ9 long CI builds take average Another cost using CI time build application run tests cost represents cost energy2 computing power run builds also developers may wait see build passes merge changes longer build times means wasted developer time average build time 500 seconds compute average build times first remove canceled incomplete manually stopped build results consider time errored failed passed completed builds Errored builds occur build begins eg dependency cannot downloaded failed builds build completed successfully eg unit test fail understand data look outcome independently Interestingly find passing builds run faster either errored failed builds difference errored failed significant Wilcoxon p 00001 difference passed errored Wilcoxon p 00001 difference passed failed Wilcoxon p 00001 find result surprising intuition passing builds take longer error state encountered early process abort return earlier Perhaps case many faster running pass builds generating meaningful result run However investigation needed determine exact reasons 2This cost underestimated personal correspondence Google manager charge CI system TAP reveals TAP costs millions dollars computation counting cost developers maintain use TAP 43 Benefits CI first summarize commonly touted benefits CI reported survey participants analyze empirically whether benefits quantifiable depth corpus Thus confirm refute previously held beliefs benefits CI RQ10 opensource projects choose use CI found CI widely used opensource projects RQ1 CI widely used among popular projects GitHub RQ3 want understand developers choose use CI However uses CI cannot determined code repository Thus answer question using survey data Table 6 shows percentage respondents selected particular reasons using CI mentioned build list reasons collecting information various popular internet sources two popular reasons “CI makes us less worried breaking builds” “CI helps us catch bugs 
earlier” One respondent P371 added “Acts like watchdog may run tests careful merges CI ” Martin Fowler 7 quoted saying “Continuous Integration doesn’t get rid bugs make dramatically easier find remove” However survey projects felt CI actually helped debugging process RQ11 projects CI release often One common claims CI helps projects release often eg CloudBees motto “Deliver Faster” 6 50 respondents survey claimed reason use CI analyze data see indeed find evidence would support claim found projects use CI indeed release often either 1 projects used CI 2 projects use CI order compare across projects periods calculated release rate number releases per month Projects use CI average 54 releases per month projects use CI average 24 releases per month double release rate difference statistically significant Wilcoxon p 000001 identify effect CI also compared projects use CI release rate first CI build found projects eventually added CI used release rate 34 releases per month well 54 rate release CI difference statistically significant Wilcoxon p 000001 RQ12 projects use CI accept pull requests uses CI service Travis CI CI server builds pull request annotates pull request GitHub visual cue green check mark red ‘X’ shows whether pull request able build successfully CI server intuition extra information help developers better decide whether merge pull request code determine extra information indeed makes difference compared pull request acceptance rates pull Table 6 Reasons using CI reported survey participants Reason Percent CI makes us less worried breaking builds 8771 CI helps us catch bugs earlier 7961 CI allows running tests cloud freeing personal machines 5455 CI helps us deploy often 5332 CI makes integration easier 5307 CI runs tests realworld staging environment 4600 CI lets us spend less time debugging 3366 Table 7 Release rate projects Uses Travis Versions Released per Month Yes 54 24 Table 8 Comparison pull requests merged pull requests CI information CI Usage 
Pull Requests Merged Using CI 23 Using CI 28 requests CI information pull requests depth corpus Note projects exclude branches repository run CI server uses CI branch guarantee every pull request contains CI build status information Table 8 shows results question found pull requests without CI information 5pp likely merged pull requests CI information intuition result 5pp pull requests problems identified CI merging pull requests developers avoid breaking build difference statistically significant Fisher’s Exact Test p 000001 also fits survey result developers say using CI makes less worried breaking build One respondent P219 added CI “Prevents contributors releasing breaking builds” merging potential problem pull requests developers avoid breaking builds Observation CI build status help developers avoid breaking build merging problematic pull requests projects RQ13 pull requests CI builds get accepted faster terms calendar time pull request submitted code merged pull request accepted sooner pull request accepted sooner code merged previous question saw projects using CI accept fewer ie reject ignore pull requests projects using CI question consider accepted pull requests ask whether difference time takes projects accept pull requests without CI One reason developers gave using CI makes integration easier One respondent P183 added “To confident merging PRs” integration easier translate pull requests integrated faster Figure 6 shows distributions time accept pull requests without CI compute results select depth corpus pull requests accepted without build information CI server mean time CI 81 hours median 52 hours Similarly mean time without CI 140 hours median 68 hours Comparing median time accept pull requests find median pull request merged 16 hours faster pull requests without CI information difference statistically significant Wilcoxon p 00000001 Observation CI build status make integrating pull requests faster using CI median pull request accepted 16 hours 
sooner Table 9 Percentage builds succeed pull request target Pull Request Target Percent Passed Builds Master 7203 6536 RQ14 CI builds fail less master nonmaster branches popular reason participants gave using CI helps avoid breaking build Thus analyze claim depth corpus data show difference way developers use CI master branch vs branches difference many builds fail master vs branches Perhaps developers take care writing pull request master another branch Table 9 shows percentage builds pass pull requests master branch compared branches found pull requests indeed likely pass master Observation CI builds master branch pass often branches IMPLICATIONS offer practical implications findings researchers developers tool builders Researchers RQ1 RQ3 RQ4 RQ5 CI “fad” stay CI widely used projects adopting yet received much attention research community time researchers study use improve eg automate tasks setting CI believe researchers contribute many improvements CI process understand current stateofthepractice CI RQ2 Similarly GitHub become main gateway researchers study believe Travis CI become main gateway researchers study CI Travis offers wealth CI data accessible via public API Therefore researchers maximize impact studying single system RQ7 RQ8 found evidence frequent evolution CI configuration files similar evolution found Makefiles 21 researchers focus providing support safe automation changes configuration files eg via safe refactoring tools RQ6 Table 4 common reason developers use CI unfamiliarity CI tremendous opportunity providing educational resources call upon university educators enrich engineering curriculum cover basic concepts tooling CI Developers RQ3 Table 3 data shows CI widely embraced projects use dynamically typed languages eg 64 2721 Ruby projects use CI compared 20 2745 ObjectiveC projects use CI mitigate lack static type system developers use dynamically typed languages use CI run tests help catch errors early RQ13 analysis depth corpus shows 
presence CI makes easier accept contributions opensource projects also indicated several survey respondents eg “CI gives external contributors confidence breaking project” P310 Considering research 43 reports lack diversity opensource projects attracting new contributors desirable Thus projects aim diversify pool contributors consider using CI RQ7 RQ9 average times single CI build fairly short CI configurations maintainable appears benefits CI outweigh costs Thus developers use CI projects RQ3 RQ11 RQ12 RQ14 use CI correlates positive outcomes CI adopted successful projects GitHub developers consider CI best practice use widely possible Tool Builders RQ6 CI helps catching bugs locating CI build logs often bury important error message among hundreds lines raw output Thus tool builders want improve CI focus new ways integrate faultlocalization techniques CI RQ1 RQ7 RQ8 Despite wide adoption many projects yet use CI Tool builders could parse build files 56 generate configuration files necessary CI automating process tool builders lower entry barrier developers unfamiliar CI THREATS VALIDITY 61 Construct asking right questions interested assessing usage CI opensource projects focused questions think questions high potential provide unique insight value different stakeholders developers tool builders researchers 62 Internal something inherent collect analyze CI usage data could skew accuracy results CI server configured continue run turned could result projects configuring CI server taking account results continue development However think unlikely Travis CI GitHub close integration would difficult ignore presence CI visual cues throughout GitHub using CI CI services run way cannot detected information publicly available GitHub repository means could missed projects However would mean underestimating extent CI adopted Despite 98 response rate survey still 90 targeted population respond control responded survey may suffer selfselection bias think likely 92 survey 
participants reported using CI much higher percentage projects observed using CI data order mitigate made survey short provided raffle incentive participate get responses possible 63 External results generalizable general CI usage analyzed large number opensource repositories cannot guarantee results proprietary closedsource fact consider likely closedsource projects would unwilling send source internet CI service intuition would much likely use local CI solution work done investigate usage CI closedsource projects focused Travis CI could CI services used differently showed RQ2 Travis CI overwhelming favorite CI service use focusing think results representative Additionally selected projects GitHub Perhaps opensource projects custom hosting also would likely custom CI solutions work needed determine results generalize RELATED WORK group related work three different areas CI usage ii CI technology iii related technology CI Usage closest work Vasilescu et al 53 present two main findings find projects use CI effective merging requests core members projects use CI find significantly bugs However paper explicitly states preliminary study 246 GitHub projects treats CI usage simply boolean value contrast paper examines 34544 projects 1529291 builds 442 survey responses provide detailed answers 14 research questions CI usage costs benefits tech report Beller et al 25 performs analysis CI builds GitHub specifically focusing Java Ruby languages answer several research questions tests including “How many tests executed per build” “How often tests fail” “Does integration different environments lead different test results” questions however serve comprehensively support refute productivity claims CI Two papers 4446 analyzed couple case studies CI usage two case studies total unlike paper analyzes broad diverse corpus Leppänen et al 45 interviewed developers 15 companies perceived benefits CI found one perceived benefits frequent releases One participants said CI reduced release 
time six months two weeks results confirm projects use CI release twice fast projects use CI Beller et al 24 find developers report testing three times often actually test overreporting shows CI needed ensure tests actually run confirms one respondents P287 said “It forces contributors run tests might otherwise do” Kochhar et al 42 found larger Java opensource projects lower test coverage rates also suggesting CI beneficial CI technology researchers proposed approaches improve CI servers servers communicate dependency information 31 generating tests CI 30 selecting tests based code churn 41 Also researchers 27 found integrating build information various sources help developers survey found developers think CI helps locate bugs problem also pointed others 28 One features CI systems report build status clear everyone Downs et al 32 developed hardware based system devices shaped like rabbits light different colors depending build status devices keep developers informed status build Related Technology foundational technology CI build systems ways researchers tried improve performance incremental building 35 well optimizing dependency retrieval 29 Performing actions continuously also bring extra value researchers proposed several activities continuous test generation 54 continuous testing continuously running regression tests background 50 continuous compliance 36 continuous data testing 47 CONCLUSIONS CI rising big success story automated engineering paper study usage growth future prospects CI using data three complementary sources 34544 opensource projects GitHub ii 1529291 builds commonly used CI system iii 442 survey respondents Using rich data investigated 14 research questions results show good reasons rise CI Compared projects use CI projects use CI release twice often ii accept pull requests faster iii developers less worried breaking build Therefore come surprise 70 popular projects GitHub heavily use CI trends discover point expected growth CI future CI even 
greater influence today hope paper provides call action research community engage important field automated engineering ACKNOWLEDGMENTS thank CloudBees sharing us list opensource projects using CloudBees Travis fixing bug API enable us collect relevant build history Amin Alipour Denis Bogdanas Mihai Codoban Alex Gyori Kory Kraft Nicholas Lu Shane McKee Nicholas Nelson Semih Okur August Shi Sruti Srinivasa Ragavan anonymous reviewers valuable comments suggestions earlier version paper work partially funded NSF CCF1421503 CCF1439957 CCF1553741 grants 10 REFERENCES 1 7 reasons using continuous integration httpsaboutgitlabcom201502037reasonswhyyoushouldbeusingci Accessed 20160424 2 AppVeyor httpswwwappveyorcom Accessed 20160426 3 benefits continuous integration httpsblogcodeshipcombenefitsofcontinuousintegration Accessed 20160424 4 Build cloud httpgoogleengtoolsblogspotcom201108buildincloudhowbuildsystemworkshtml Accessed 20160424 5 CircleCI httpscirclecicom Accessed 20160426 6 CloudBees httpcloudbeescom Accessed 20160426 7 Continuous integration httpswwwthoughtworkscomcontinuousintegration Accessed 20160424 8 Continuous integration dead httpwwwyegor256com20141008continuousintegrationisdeadhtml Accessed 20160424 9 CruiseControl httpcruisecontrolsourceforgenet Accessed 20160421 10 CrunchBase httpswwwcrunchbasecomorganizationtraviscientity Accessed 20160424 11 Google Search Trends httpswwwgooglecomtrends Accessed 20160424 12 Jenkins httpsjenkinsio Accessed 20160421 13 Restkit httpsgithubcomRestKitRestKit Accessed 20160429 14 Stackoverflow httpstackoverflowcomquestions214695whataresomeargumentsagainstusingcontinuousintegration Accessed 20160424 15 Team Foundation Server httpswwwvisualstudiocomenusproductstfsoverviewvsaspx Accessed 20160421 16 Tools engineers httpresearchmicrosoftcomenusprojectstse Accessed 20160424 17 Travis CI httpstravisciorg Accessed 20160421 18 Werker httpwerckercom Accessed 20160426 19 don’t use continuous integration 
httpsbloginfedacuksapm20140214whydontweusecontinuousintegration Accessed 20160424 20 Yaml Yaml ain’t markup language httpyamlorg Accessed 20160424 21 J AlKofahi H V Nguyen Nguyen Nguyen N Nguyen Detecting semantic changes Makefile build code ICSM 2012 22 J Allspaw P Hammond 10 deploys per day Dev ops cooperation Flickr httpswwwyoutubecomwatchvLdOe18KhtT4 Accessed 20160421 23 K Beck Embracing change Extreme Programming Computer 321070–77 1999 24 Beller G Gousios Zaidman much developers test ICSE 2015 25 Beller G Gousios Zaidman Oops tests broke build analysis travis ci builds github Technical report PeerJ Preprints 2016 26 G Booch Object Oriented Design Applications BenjaminCummings Publishing Co Inc 1991 27 Brandtner E Giger H C Gall Supporting continuous integration mashingup quality information CSMRWCRE 2014 28 Brandtner C Müller P Leitner H C Gall SQAProfiles Rulebased activity profiles continuous integration environments SANER 2015 29 Celik Knaust Milicevic Gligoric Build system lazy retrieval Java projects FSE 2016 30 J C de Campos Arcuri G Fraser R F L de Abreu Continuous test generation Enhancing continuous integration automated test generation ASE 2014 31 Dössinger R Mordinyi Biffl Communicating continuous integration servers increasing effectiveness automated testing ASE 2012 32 J Downs B Plimmer J G Hosking Ambient awareness build status collocated teams ICSE 2012 33 Elbaum G Rothermel J Penix Techniques improving regression testing continuous integration development environments FSE 2014 34 J Engblom Virtual near end Using virtual platforms continuous integration DAC 2015 35 Erdweg Lichter Weiel sound optimal incremental build system dynamic dependencies OOPSLA 2015 36 B Fitzgerald K J Stol R O’Sullivan O’Brien Scaling agile methods regulated environments industry case study ICSE 2013 37 Fowler Continuous Integration httpmartinfowlercomarticlesoriginalContinuousIntegrationhtml Accessed 20160421 38 Gligoric L Eloussi Marinov Practical regression test 
selection dynamic file dependencies ISSTA 2016 39 G Gousios GHTorrent dataset tool suite MSR 2013 40 J Humble Evidence case studies httpcontinuousdeliverycomevidencecasestudies Accessed 20160429 41 E Knauss Staron W Meding Söder Nilsson Castell Supporting continuous integration codechurn based test selection RCoSE 2015 42 P Kochhar F Thung Lo J L Lawall empirical study adequacy testing open source projects APSEC 2014 43 V Kuechler C Gilbertson C Jensen Gender differences early free open source joining process IFIP 2012 44 E Laukkonen Paasivaara Arvonen Stakeholder perceptions adoption continuous integration case study AGILE 2015 45 Leppänen Mäkinen Pagels V P Eloranta J Itkonen V Mäntylä Männistä highways country roads continuous deployment IEEE 2015 46 Miller hundred days continuous integration AGILE 2008 47 K Muşlu Brun Meliou Data debugging continuous testing FSE 2013 48 V One 10th annual state Agile development survey httpsversiononecompdfVersionOne10thAnnualStateofAgileReportpdf 2016 49 Puppet DevOps Research Assessments DORA 2016 state DevOps Report httpspuppetcomsystemfiles201606201620State20of20DevOps20Reportpdf 2016 50 Saff Ernst Continuous testing Eclipse ICSE 2005 51 Testing speed scale Google Jun 2011 httpgoogleengtoolsblogspotcom201106testingatspeedandscaleofgooglehtml 52 Tools continuous integration Google scale October 2011 httpwwwyoutubecomwatchvb52aXZ2yi08 53 B Vasilescu Yu H Wang P Devanbu V Filkov Quality productivity outcomes relating continuous integration GitHub FSE 2015 54 Z Xu B Cohen W Motycka G Rothermel Continuous test suite augmentation product lines SPLC 2013 55 Yoo Harman Regression testing minimization selection prioritization survey STVR 22267–120 2012 56 Zhou J AlKofahi N Nguyen C Kästner Nadi Extracting configuration knowledge build files symbolic analysis RELENG 2015
::::
Managing Episodic Volunteers FreeLibreOpen Source Communities Ann Barcomb KlaasJan Stol Brian Fitzgerald Dirk Riehle Abstract—We draw concept episodic volunteering EV general volunteering literature identify practices managing EV freelibreopen source FLOSS communities Infrequent ongoing participation widespread practices community managers using manage EV concerns EV previously documented conducted policy Delphi study involving 24 FLOSS community managers 22 different communities panel identified 16 concerns related managing EV FLOSS ranked prevalence also describe 65 practices managing EV FLOSS Almost threequarters practices used least three community managers report practices using systematic presentation includes context relationships practices concerns address findings provide coherent framework help FLOSS community managers better manage episodic contributors Index Terms—Best practices community management episodic volunteering free open source
::::
1 INTRODUCTION FreeLibreOpen Source FLOSS research traditionally divided contributors core periphery core describes minority top developers contribute 80 percent code periphery describes developers 1 2 3 focus volume contributions assumes homogenized periphery without distinction within group definition distinction exclusive focus code contributions ignoring many types contributions made FLOSS projects better understand periphery FLOSS communities several researchers begun differentiate participants within periphery based frequency duration participation 4 5 6 7 earlier work drawn upon concept episodic volunteering EV volunteering literature describe subset peripheral contributors whose contributions shortterm infrequent 8 9 contrast habitual contributors whose contributions “continuous successive” 10 also reconsidered definition contribution expanding code contribution type activity within FLOSS community 6 using alternative lens FLOSS communities found evidence wide range contributions episodic volunteers made 6 Based qualitative survey 13 FLOSS communities developed detailed understanding perspectives episodic volunteers community managers Based established initial set recommendations engage episodic volunteers key concern context episodic volunteering whether volunteers return make contributions Drawing general volunteering literature evaluated theoretical model helps explain retention episodic volunteers article extend line research EV FLOSS communities Episodic contributors represent class participants make wide range valuable contributions FLOSS projects 6 nature participating behavior incidental continuous particular interest understand episodic contributors “retained” context refers returning contribute rather converting habitual contributors Retention appealing returning contributors require less assistance newcomers 11 retention one key factors FLOSS sustainability 12 13 14 15 16 However evidence general volunteering literature suggests many 
organizations clear strategies place effectively manage episodic contributors 11 17 Organizations may also face internal resistance implementing changes episodic contributors may negatively perceived costing resources deliver contributions 18 Despite challenges EV increasingly important topic volunteer management due increase preference kind work 8 19 20 21 22 Adapting changing volunteering context necessary sustainability nonprofit organizations 22 FLOSS long observed many contributors episodic instance case bug reporting 2 6 23 24 25 Furthermore number benefits attributed peripheral contributors—increased identification legal issues copyright infringement highquality bug fixes example 14 26 Hence given increased recognition importance episodic volunteers contributions imperative study manage episodic volunteers FLOSS communities major change FLOSS communities last decade increase firms’ involvement open source development although volunteers remain important participants 27 28 29 Many companies different sectors use developed external FLOSS projects 30 consequently many firms employ developers contribute specific open source projects identify critical business Paid development negate need understand episodic participation Even companydominated FLOSS communities external developers still contribute significant proportion commits 31 Additionally perspective community paid developers employed external firms cannot directed employees 32 33 Although differences paid contributors participants 28 paid contributors’ participation sometimes episodic perspective community research considers episodic participation community perspective consequently adopt broadest definition volunteering encompass anyone engaging FLOSS contributions directly sponsored FLOSS community 6 broad definition allows us identify practices actually used communities without concern whether contributors paid sponsored firm paid contributors affect community managers’ concerns practices explicitly noted 
findings FLOSS research challenged reliance studying forms participation readily observed data mining notably code contributions bug reports mailing lists 34 35 Exclusion noncode contributors limits applicability research larger FLOSS communities depend code contributions also wide range activities planning advocacy mentoring event organization 35 36 37 38 unpaid paid contributors participate range activities within FLOSS communities 39 Despite extensive research community practices eg 3 two studies focused specifically episodic participation neither focused identifying extensive list practices 6 40 fact specific practices proposed peripheral subgroups namely newcomers 41 42 suggests FLOSS communities may using different practices adapting existing practices different ends order manage episodic contributors Hence study following objectives 1 Identify concerns community managers episodic volunteers 2 Identify practices community managers using envisage using address concerns episodic volunteers address objectives conducted Delphi study structured communication technique involving panel experts drew experience FLOSS community managers identify concerns community managers EV practices use—or consider using—to manage EV preliminary suggestions practices could combined article makes following contributions toward understanding management EV FLOSS prioritized list 16 EV community manager concerns extensive collection practices might used manage EV 74 percent used least three community managers includes connections concerns previously identified well relationships practices Workflows proposed community managers demonstrate practices combined remainder article organized follows Section 2 reviews previous work investigated open source communities volunteers particular role episodic contributors Section 3 presents Delphi research approach adopted including discussion participant selection data collection data analysis procedures Section 4 presents findings study presenting 
set practices concerns Section 5 concludes discussing findings limitations study outlook future work
::::
2 RELATED WORK section reviews prior work peripheral contributors episodic volunteering FLOSS communities 21 Peripheral Contributors FLOSS Communities One earliest conceptions structure FLOSS communities socalled Onion model 1 43 Onion model depicts increasing numbers decreasing engagement moving innermost core outermost passive users core contains prolific developers often described people create 80 percent code 2 Beyond core periphery contribute fewer lines code Although much earlier research focused core eg 2 24 significant understanding importance periphery motivations peripheral participants Peripheral contributors provide range benefits Bringing new knowledge 26 44 45 46 Raising awareness 46 47 48 Providing new potential core contributors 26 45 49 50 51 Proposing new features 44 52 Contributing new code 26 44 45 53 Finding reporting bugs 54 Ensuring members’ behavior abides community norms 26 FLOSS developer motivations extensively studied Motives usually characterized intrinsic motives inherent job altruism enjoyment internalized extrinsic motives reputation reciprocity extrinsic motives career salary 55 Peripheral contributors tend set motivations core developers 37 extrinsic motives less likely continue participate 45 56 particular peripheral contributors likely seek opportunities afford greater recognition stakeholders chance gain reputation 45 Extrinsic motives desire build reputation gain recognition widespread among peripheral developers core developers 45 Recent work begun study periphery closely identify distinguish different types contributors One dimension often used distinguish frequency participation Groups distinguished frequency participation newcomers 41 57 58 59 59 60 61 62 people attempt become contributors 63 onetime contributors 5 40 56 earlier work linked general episodic volunteering literature periphery 6 disaggregation periphery frequency contribution could also viewed extension rather departure Onion model outer layers—active users 
passive users—are already defined actions irrespective contributions others Active users engage instance supplying bug reports passive users use Disentangling homogenized periphery subcategories distinguished frequency participation refines Onion model allows identification distinct attributes different groups within periphery Onion model different layers describe people contribute whereas FLOSS projects include many ways get involved 35 36 37 Carillo Bernard 64 described codecentricity limitation “By stereotyping FOSS projects communities developers loosely collaborating FOSSlicensed via online platform disregard massive amount information captured platforms also neglect myriad noncode related tasks roles without could is” Emphasis code contributions within FLOSS communities may devalue types contributions may specifically disadvantage women 65 studies found women’s participation FLOSS remains low code noncode activities including leadership 66 67 68 Nafus’s 65 participant observation study FLOSS contributors found “men monopolize code authorship simultaneously delegitimize kinds social ties necessary build mechanisms women’s inclusion” Research also demonstrated barriers entry newcomers gendered 60 69 gender may influence retention among episodic contributors 7 code contributors represent entire community terms diversity work may additionally demographically unrepresentative argue importance including noncode contributions study emphasis makes EV concept originates general volunteering literature rather engineering literature appropriate lens study places particular emphasis one type contribution
::::
22 Episodic Volunteering Episodic volunteering term general volunteering literature describing shortterm infrequent participation Although particular engagement may limited duration retention episodic contributors possible context EV retention mean conversion habitual participation repeated engagement organization systematic review EV literature Hyde et al 70 identified retention key topic need research Retention remains compelling subject returning volunteers require less training 11 retention one measure stability FLOSS 13 14 15 general volunteering literature retention episodic contributors largely focused explaining factors lead retention satisfaction previous volunteering experience intention return availability 10 71 72 FLOSS domain Steinmacher et al 73 found higher quality email responses encouraged retention among newcomers Meanwhile Labuschagne Holmes 57 critically examined Mozilla’s onboarding programs found may result longterm contributors despite fact mentored newcomers consider program valuable study evaluating five potential EV retention factors found satisfaction community commitment social norms correlate intention remain 7 Another important problem general volunteering organizations incorporate EV 17 Although EV sometimes viewed disruptive widespread reality requires organizations reconsider strategies 18 19 45 74 Volunteer agencies adjust expectations episodic contributors offering flexibility commitment reducing training requirements increasing social element service recognizing volunteers 75 Volunteer coordinators also identify tasks suitable episodic contributors may include oneoff contributions events ongoing nonspecialized work 11 Evaluation suitable tasks done systematically applying ‘volunteer scenario’ approach categorizes volunteer assets volunteer availability potential assignments 76 single work collected comprehensive set practices managing EV FLOSS previous studies proposed practices managing FLOSS contributors Previously identified 
20 potential practices EV management evaluating existing FLOSS practices light factors associated retention episodic contributors prior general volunteering recommendations 6 Meanwhile Steinmacher et al 41 identified nine practices communities onboarding new contributors corresponding recommendations new contributors consider practices newcomers relevant study EV community managers cannot distinguish future episodic volunteer future habitual volunteer 72 make first contribution study updates line work drawing expertise community managers time first study 6 found limited evidence community managers managing EV approach increases scope number practices identified First examine practices already used manage EV well practices experts think might appropriate distinguish speculation observed practice Second look volunteer process onboarding retention excluding recruitment
::::
3 Study Design section outline Delphi research method elaborate participant selection data collection analysis methods 31 Research Method research concerned understanding current practices managing episodic contributors also proposes practices may helpful managing EV Delphi method developed way finding collected opinions group experts works assumption multiple experts better able arrive accurate solutions problems Anonymity participants used prevent participants high status reputation disproportionate influence 77 78 79 Delphi approach suitable complex problems 80 solutions yet exist may best explored subjective judgments informed group experts 77 81 common engineering research Delphi method previously used study complex topics tailoring agile methods 82 adoption tools FLOSS developers 83 Delphi studies typically comprise several rounds data collection—as participants exposed new information every round may develop new insights iteration exposure others’ ideas Delphi method also conducted asynchronously particular importance context given geographic distribution open source experts traditional Delphi method focuses achieving consensus evolved variant known policy Delphi emerged policy Delphi study appropriate purpose study establish consensus identify main arguments positions 77 decided policy Delphi study rather traditional Delphi study would appropriate context recognized communities may different goals managing EV could driven community size cultural context types contribution considered wanted articulate constraints order provide context practices rather assume one approach would effective communities activities within communities However also interested generalizing common practices concerns used collation different rounds data collection achieve consensus opinions codify results research form collection practices appendix 84 found Computer Society Digital Library httpdoiieeecomputersocietyorg101109TSE20202985093 ensures fruits research work used practitioners 
key goal research EV management includes phases volunteer management process explicitly excluded recruitment practices consideration study many specific episodic volunteering focus necessary limit scope study otherwise could overwhelm participants diffuse focus Although onboarding another area expect overlap habitual episodic management decided retain part process order compare results recent study summarizing onboarding practices newcomers 41 32 Participant Selection Participant selection key aspect successful Delphi study 85 Participants must selected care chosen simply basis availability 86 sought select panel 20 25 participants ensure sufficient diversity even participants would stop participating study within recommended range 15–30 participants 87 Potential participants identified one three ways First approached us directly following presentations practitioner conferences Second identified people among contacts people recommended us contacts two groups approached subset met selection requirements describe Third evaluated gaps coverage sent cold emails people identified online searches selection participants based enthusiasm participation connection us also degree diversity along three selection dimensions discussed well expectation participants would able provide relevant input Additionally although gender knowledge directly linked community management awareness gender affect FLOSS participation experiences 60 inspired us deliberately recruit female participants total onethird participants female Table 1 summarizes participants community participation different rounds study gain full benefit multiple perspectives participants Delphi study diverse rather homogeneous 88 identified three dimensions relevant study along expected differences opinion arise size community contribution type country discuss detail 321 Size Community previous study investigating current state EV FLOSS discovered tasks considered appropriate episodic contributors vary community size 6 
example smaller communities translation adhoc task wellsuited EV Larger communities complicated rules translating full cognizance rules requires habitual participation Organization size also factor commonly considered studies identifying best practices example case study best practices volunteer organizations Carvalho Sampaio 89 considered size volunteer organizations terms numbers beneficiaries paid employees volunteers many different ways operationalize community size—number users number developers size core—and size continuous categorical categorize communities size instead sought include number communities different size communities represented panel experts handful contributors justified extremely small communities tend concerned developing volunteer management process workflow communities represented shown Table 1 total 22 communities represented four communities Debian Ubuntu KDE OpenStack represented twice Detailed descriptions community provided appendix 84 available online supplemental material
::::
322 Contributor Activities Much FLOSS research codecentric large communities people work number activities translation maintaining web services 35 earlier study EV FLOSS found episodic contributors engage activities areas considered suitable others depending community 6 expect perspective community managers might influenced activities engage used classification system introduced Rozas 38 describe Drupal community contains comprehensive categorization FLOSS activities
::::
323 Country FLOSS communities international although North American European countries disproportionately overrepresented 90 Geographic boundaries eliminated cultural barriers may remain example 2002 Nakakoji et al 1 explained Japanese programmers reluctant directly communicate GNU GCC core developers saw superior programmers wanted keep “respectful distance” One difficulty identifying cultural diversity increasing globalization led intercultural identities identification country birth also residence 91 92 therefore considered country origin well residence participants represented 23 countries spanning populated continents Argentina Australia Brazil Cyprus Czech Republic France Germany Hungary India Ireland Italy Japan Kenya Peru Romania Singapore Spain South Korea Tunisia Uganda Ukraine United Kingdom United States appendix provides details participants’ countries residence origin 84 available online supplemental material
::::
33 Data Collection Analysis Data collection initiated January 2018 concluded October 2018 study comprised three rounds shown Fig 1 first round participants asked think concerns EV might address participants engaged community management precondition participating study participants experience close six categories average involved multiple types contributions Table 2 shows paraphrased list contribution types along count many participants engaged activity appendix provides detailed list participant’s contribution types 84 available online supplemental material
::::
Table 1 Study Participants Community Study Participation ID Community Rounds participated CM1 Anonymous ✓ CM2 Apache RDO ✓ CM3 ChakraLinux ✓ CM4 CHAOSS ✓ CM5 Debian ✓ CM6 Drupal ✓ CM7 Fedora ✓ CM8 Fedora ✓ CM9 Joomla ✓ CM10 KDE NextCloud ✓ CM11 KDE Kubuntu ✓ CM12 Linux Mint Debian ✓ CM13 Mozilla ✓ CM14 Mozilla ✓ CM15 OpenChain ✓ CM16 OpenStack Debian ✓ CM17 OpenStack ✓ CM18 OSGeoLive ✓ CM19 Perl ✓ CM20 PostgreSQL ✓ CM21 Python ✓ CM22 Ubuntu ✓ CM23 Ubuntu ✓ CM24 Women Code ✓
::::
Table 2 Number Participants Engaged Contribution Type Based 38 Name Description Source code Write code review code report bugs 14 Documentation Write report issues 14 Translation Translate review translation 9 Design User experience design visual design style guide creation 6 Support Participate support fora create cookbooks 11 Evangelizing Blog posts speaking unrelated events marketing 19 Mentoring Creation training materials mentoring contributors 15 Community management Participation working local groups conflict resolution governance 24 Events Organization events speaking events 18 Economic Make donations seek sponsors 12 concerns purpose round generate broad overview concerns problems affecting communities Collating round involved identifying unique concerns name description creating list unique practices name description associated concerns second round sought refine understanding concerns practices concerns entailed collecting information prevalence ranking concerns practices elicited relationships practices specifically precedingsubsequent complementary relationships possible workflows collation round focused elaborate descriptions practices reported ranking concerns Workflows also shown third round involved refining information gathered practices Participants asked verify used proposed practice asked specify relationships context limitations earlier analyses missed collation consisted extended description practices round questions posted participants given several weeks respond end period reminders sent participants yet responded response time extended responses received analyzed lead author using QDAcity tool qualitative data analysis Contextual codes representing dimensions interest community name participant’s contribution types participant’s country applied first Next lead author performed theoretical thematic analysis based theme round 93 Round II collation presented authors participants collection practices also known handbook 94 collation sent 
participants round form member checking 95 Additionally Round III participants supplied list practices attributed giving opportunity challenge interpretation Participants given one week suggest modifications collation sent revised document first two rounds received minor requests changes final round received acknowledgements receipt Responses round anonymized sent respondents confirm modifications obscure message Analysis conducted original responses anonymized responses used provide quotations collations Quotations attributed individual study participants means assigned twoletter code participant able identify contributions could also build impression study participants individuals without knowing identities
::::
4 Results section presents results study Section 41 discusses concerns associated managing episodic contributors Section 42 focuses practices managing episodic contributors Section 43 extends relationships practices workflows 41 Concerns Episodic Volunteering identified set concerns community managers EV Broadly community managers number concerns knowledge transmission community episodic participants suitability episodic contributors tasks effectively community processes support EV episodic contributors included community identified sixteen concerns community managers identified regarding episodic volunteering communities Table 3 specifies sixteen concerns category frequently observed many participants ranked concerns top three pressing concerns Space limitations preclude us discussing concerns illustrate common concerns detail complete set concerns described appendix 84 available online supplemental material Concern 2C Episodic contributor lacks awareness opportunities contribute deemed important observed 20 community managers ranked pressing concern eight study participants One community manager expressed urgency follows “Keeping volunteers interested openly sharing opportunities contribute technical nontechnical given priority” —CM Concern 2C Episodic contributor lacks awareness opportunities contribute Communicating opportunities get involved way reaches episodic contributors concern communities especially people aware tasks could done episodically enjoy outreach activities TABLE 3 Concerns Category Number Community Managers Observing Concern Number Times Ranked Important Concern Second Important Concern Third Important Concern Concern Obs 1 2 3 Knowledge exchange 1C Episodic contributor lacks knowledge developments absences 10 1 1 1 2C Episodic contributor lacks awareness opportunities contribute 20 8 1 4 3C Community lacks knowledge availability episodic contributors 15 2 1 2 4C Episodic contributor lacks understanding vision 11 1 2 1 5C Episodic contributor 
community mismatched expectations 13 1 1 1 Suitability episodic contributors work 6C Episodic contributor quality work insufficient 9 2 0 0 7C Episodic contributor’s timeliness completion work poor 14 1 1 1 8C Community’s cost supervision exceeds benefit episodic contribution 8 1 1 1 Community processes support EV 9C Community cannot retain episodic contributors sporadic requirements 8 0 1 2 10C Community difficulty identifying appropriate tasks episodic contributors 15 1 4 2 11C Community lacks episodic strategy 14 2 6 1 12C Community insufficiently supports episodic contributors 4 0 0 0 Marginalization episodic contributors 13C Community restricts episodic contributors leadership roles 12 1 1 1 14C Community excludes episodic contributors discussions decisions 10 2 0 3 15C Community gives episodic contributors reduced access opportunities rewards 5 0 0 0 16C Community lacks appreciation recognition episodic contributors 9 0 1 1 key characteristic episodic volunteers contribute irregularly nature participation tends short duration lack daytoday engagement means episodic volunteers may simply aware opportunities contribute Fifteen community managers observed 3C Community lacks knowledge availability episodic contributors two considered primary concern One community manager described issue inperson events conferences “This lack knowledge big problem working online communities grow exponentially working live event may call volunteers may end shorthanded three things once” —CM23 concern directly links one defining characteristics sets episodic volunteers apart habitual volunteers scenario outlined quote clearly identifies key issue episodic volunteers namely availability tends much restricted fact episodes activity volunteers may quite removed happening community daytoday basis Concern 7C Episodic contributor’s timeliness completion work poor mentioned 14 community managers one ranking biggest concern CM24 summarized concern “The main problem using kind help sometimes 
don’t know whether person started task able finish finish decent quality” —CM24 Concern 7C Episodic contributor’s timeliness completion work poor Episodic contributors may less investment ensuring work completed timely manner completed especially problematic work important others relying situation event may unavoidable put responsibility episodic participants concern alludes asymmetry information possessed community managers episodic contributors concerning contributors’ intentions contributors generally aware progress extent dedication task information often conveyed community managers community managers becomes difficult rely work completed completed sufficient standard episodic contributor problem pronounced community manager may unable form expectation quality future work based previous experience contributor’s work CM6 explained 10C Community difficulty identifying appropriate tasks episodic contributors concern Fifteen community managers experience issue one thought important concern “You need know context background task effective get lost problem prepare information usually requires time task normally person knowledge one ends people lot work possible contributors without knowledge help” —CM6 Concern 10C Community difficulty identifying appropriate tasks episodic contributors Community managers find difficult identify maintain list suitable tasks timeconsuming describe tasks picked episodic contributors recommended episodic contributors given standalone tasks accomplished without deep Conf Code Name Description Community Governance ✓ G1 Manage delivery triangle Adjust scope quality features schedule releases cannot completed schedule desired level quality expected features ✓ G2 Use longer delivery cycles Make release cycles longer order give episodic contributors opportunity contribute without intense time pressure People multiple responsibilities able participate ✓ G3 Host inperson meetings Host inperson meetings creative organizational work involving 
multiple volunteers frequency meetings may vary could yearly quarterly monthly even frequent ✓ G4 Make decisions public Ensure decisions made process public open suggestions contributors Even decision ultimately made authoritative body transparency process make participants feel part ✓ G5 Create community definition quality Create community definition quality episodic contributors know quality expected ✓ G6 Craft community vision Craft inclusive community vision code conduct clear vision statement helps people determine want participate community ✓ G7 Define measuring success Define successful engagement episodic contributors looks like Describe measure impact G8 Centralize budgeting sponsorships Centralize processing sponsorships reimbursements claims processed manner processing timely G9 Use external provider sponsorships Hire external service provider serve intermediary providing sponsorships G10 Make leadership diverse Try diverse board coordination group review processes ensure welcoming accessible G11 Seek sponsorship Look stable sponsor ensure continuity events Community Preparation ✓ P1 Identify appropriate tasks Episodic participants easily join tasks available Identify types tasks suited episodic contributors ✓ P2 Define oneoff tasks Create standalone oneoff tasks ✓ P3 Crowdsource identifying appropriate tasks Engage experienced contributors shortterm initiative identify outstanding issues could handled episodic contributors Encourage continue identify new tasks backlog addressed ✓ P4 Document general working practices Document community’s working practices placing particular emphasis areas likely relevant new episodic contributors contributions appreciated ✓ P5 Detail complete task summarize tasks detail steps need taken consider providing time estimate task ✓ P6 List current areas activity Prioritize tasks tag entry level appropriate Group similar tasks together ✓ P7 Hold open progress meetings Hold regular open meetings previous work summarized new 
tasks assigned ✓ P8 Create working groups narrow focus Create specialized working groups people identify ✓ P9 Create written records activity Maintain summary instance form newsletter describes key discussions resolutions took place given period Alternately rely written communications mailing lists chats provide meeting minutes ✓ P10 Keep communication channels active Ensure communication channels online offline monitored queries directed appropriate people Make sure people receive responses ✓ P11 Send ambassadors small events Send ambassadors attend smaller events enable personal interactions potential participants ✓ P12 Respond submissions Respond every submission timely manner ✓ P13 social media team Recruit people enjoy social media specifically task communicating potential episodic contributors ✓ P14 Set expiration dates Set distinct deadlines initiatives ✓ P15 Create continual points entry Create ongoing ways people join contribute rather providing specific times times process people join Conf Code Name Description P16 Share success stories Share stories outstanding longserving community members challenges faced benefits received P17 Provide templates presentations Create one standard slide decks contributors use without modification P18 Write modular Ensure modular P19 Educate sponsoring organizations Educate sponsoring organizations participation open source projects including topics necessity maintenance open model production P20 Offer consistent development environment Document workflow architecture module use container build order allow people easily build local system Decide upon one recommended way set development environment focus documentation Onboarding Contributors Conf Code Name Description O1 Learn experience preferences time constraints participants Ask new infrequent contributors expectations availability preferences experience O2 Screen potential contributors Screen potential contributors determine good match role may include availability 
appropriate time able commit certain amount time O3 Guide people junior jobs Guide people junior jobs know start O4 Give choice tasks Give participants choice task small number offered O5 Manage task assignments application Use application wiki bug tracking system handle assignment process O6 Explain need maintenance Educate contributors happens contribution included Explain benefits remain available maintain contribution O7 Offer guided introductory events events offer walkthrough tutorials getting started contributor culminating hackathon working specific beginner problem Working contributors Conf Code Name Description W1 key contributor responsible every important make sure one key contributor responsible managing responding inquiries W2 Issue reminders Send reminder deadline approaches persistent following deliverables W3 Give permission quit task Give people permission skip period task without recrimination W4 Encourage people quit Encourage people longer wish fulfill role complete tasks step W5 Automate checking quality work Utilize advances continuous integrationcontinuous delivery automate routine evaluation W6 Set expectations Set expectations deliverables communication even minimal W7 Reject contributions insufficient quality Decline contributions inappropriate sufficient quality W8 Mentor quality Provide mentoring contributions rejected due insufficient quality might include access tools help people meet quality requirements Ensure contributors always reach mentors get speed W9 Require documentation part submission Require people sufficiently document submissions accepted W10 Encourage learners mentor Engage episodic contributors leading episodic contributors Let review episodic contributions mentor episodic contributors W11 Explain context contribution Understanding larger context requires time episodic contributors able willing give W12 Sever ties Publicly sever group’s connection individual explain reasoning W13 Automate process assistance Consider 
automation help people work early processes chat bot stepbystep interactive site Contributor Retention Conf Code Name Description R1 Publicize release schedule Publish development release schedule notify contributors upcoming milestones allow plan engagement R2 Encourage social connections Encourage people work together small group accomplish task might also include groups within company use TABLE 4 Continued Conf Code Name Description R3 Follow contributors Keep touch contributors even sending email R4 Instill sense community Help people understand cooperative values underlie free open source best done leading example ✓ R5 Acknowledge contributions someone responsible recognizing returning episodic contributors person could thank episodic contributors returning alternately explicitly welcome new contributors ✓ R6 Reward participation Offer tangible reward participation organizer’s dinner swag Alternatively offer recommendation letters certificates online recommendations ✓ R7 Recognize everyone Make use systems badges recognize variety different contributions people make conclusion cycle thank identify contributors ✓ R8 Praise publicly Praise volunteers publicly ✓ R9 Provide evaluations promotion path Provide assessment opportunities episodic contributors Examples assessment skill exploration personal evaluation Examples opportunities travel employment consideration succession planning skill building R10 Promote episodic contributors Give sustained episodic participants access rotating leadership positions depend experience rather continuous contributions ✓ R11 Announce milestones celebrate meeting goals Announce milestones met celebrate success ✓ R12 Listen suggestions Allow anyone participates propose want implement even decisions ultimately made steering committee concepts don’t fit primary goals allow people create unofficial initiatives provided don’t damage
✓ R13 Incorporate unofficial successes Invite creators unofficial initiatives incorporate main successful high quality Alternatively standalone recognize successes within ✓ R14 Rotate focus areas schedule Rotate different focus areas consistent schedule recent years many FLOSS communities sought create strategies particular aims retaining newcomers recognizing noncode contributions Managing episodic contributors also benefits recognition problem identification desired outcome evaluation practices might used achieve goal previous study community managers didn’t report making use practices managing EV 6 study shows FLOSS communities adopting adapting practices managing EV fact concern manage EV effectively remains high concern demonstrates need study collects codifies experience multiple community managers create larger body knowledge 42 Practices Managing Episodic Volunteering organized identified practices number categories based “lifecycle” episodic contributors’ engagement practice community address categories sequentially move iterate use practices parallel However organizing practices categories help communicate FLOSS community managers practice aimed ameliorating one concerns described previous section total identified 65 practices study across five categories Table 4 provides complete list practices along brief description practice 65 practices 48 confirmed indicated checkmark use least three community managers specific purpose managing EV remaining 17 practices proposed panel experts EV management used zero one two community managers Table 4 contains brief description practice full description practice detailed following subsections include exemplars full descriptions one confirmed practice category previously described literature see Table 5 full descriptions practices found appendix 84 available online supplemental 
material full description practice includes context may limit generalizability practice list concerns involved solution optionally include challenges may arise implementing solution list community managers participating study used practice list community managers suggested used practice Additionally practice include list related practices part practices meant used isolation combined related practices Section 43 provides examples practices combined Relationships practices take following forms shown least one exemplar practices chosen demonstrate GeneralSpecific describes relationship specific practice restricted specialized practice compared general practice demonstrated R9 Provide evaluations promotion path general practice O2 Screen potential contributors specific practice Alternative describes two practices address concerns largely incompatible solutions example relationship shown P8 Create working groups narrow focus PrecedingSucceeding relationship practices best applied sequential order example relationship found G5 Create community definition quality shows preceding succeeding practices Complementary describes situation practices work well combined practices W10 Encourage learners mentor demonstrates relationship
::::
421 Community Governance category Community Governance contains practices address broad questions community operates practices affect potential episodic contributor’s first impressions kind community One example practices category G5 Create community definition quality CM24 stated able make extensive use episodic contributors community began “documenting standards quality” Another community manager CM16 explained new contributors episodic contributors typically expected know considers “quality work” “we never really explain way that’s easy learn ends barrier entry” Practice G5 Create community definition quality Context Episodic contributors necessarily know level quality expected community large mature enough lack common perspective causes problems contributors cannot expected tacitly acquire knowledge Concerns 4C Episodic contributor lacks understanding vision 6C Episodic contributor quality work insufficient 7C Episodic contributor’s timeliness completion work poor 11C Community lacks episodic strategy Solution Create community definition quality episodic contributors know quality expected become significantly easier follow many subsequent practices quality defined within community Related practices P4 Document general working practices COMPLEMENTARY practice G6 Craft community vision possible PRECEDING step P10 Keep communication channels active possible PRECEDING step P13 social media team possible PRECEDING step G7 Define measuring success possible SUCCEEDING step P5 Detail complete task possible SUCCEEDING step P6 List current areas activity possible SUCCEEDING step W5 Automate checking quality work possible SUCCEEDING step W6 Set expectations possible SUCCEEDING step W7 Reject contributions insufficient quality possible SUCCEEDING step W8 Mentor quality possible SUCCEEDING step Challenges difficult retroactively apply definition quality existing participants agreement Used CM15 CM13 CM14 CM18 
CM24 Proposed CM16 CM19
::::
422 Community Preparation category Community Preparation contains practices associated preparing community engage episodic contributors Identifying appropriate tasks lowering barriers entry part group CM4 explained reasoning behind practice P8 Create working groups narrow focus prepare community accepting episodic contributors “By focusing working group topic people identify hope episodic contributors easier time identifying useful place contribute” —CM4
::::
423 Onboarding Contributors category Onboarding Contributors contains practices applied new episodic contributor joins community O2 Screen potential contributors part collection practices incorporating episodic contributors community manager explained screening beneficial Practice P8 Create working groups narrow focus Context complex participants easily comprehend entirety possible readily identify standalone tasks Concerns 2C Episodic contributor lacks awareness opportunities contribute Solution Create specialized working groups people identify narrow focus defined outcomes episodic contributors able find tasks readily Related practices P6 List current areas activity possible ALTERNATIVE step P18 Write modular possible ALTERNATIVE step P18 Write modular COMPLEMENTARY practice P18 Write modular possible PRECEDING step O1 Learn experience preferences time constraints participants possible PRECEDING step Challenges Contributions within working groups need reported back larger group Used CM2 CM3 CM4 CM5 CM6 CM16 “The first criteria contribution availabilitycommitment participants donate time specifically mentioned time frame help reviewers community leaders estimate impact contributions” —CM14
::::
424 Working Contributors category Working contributors contains practices applied period episodic contributor working assignment practices ensure episodic contributors’ contributions used community study participant expressed interest applying practice W10 Encourage learners mentor working contributors “It possible people reviewing episodic contributions different group active developers reviews episodic contributions don’t eat away time available larger contributions almost think like mentorship pool reviewers might even episodic contributors learned enough spend part limited time reviewing episodic contributions others” —CM16 Practice O2 Screen potential contributors Context order contributor properly perform role certain minimum commitment required repeated problems people insufficiently committing roles Concerns 3C Community lacks knowledge availability episodic contributors 4C Episodic contributor lacks understanding vision 5C Episodic contributor community mismatched expectations 10C Community difficulty identifying appropriate tasks episodic contributors Solution Screen potential contributors determine good match role may include availability appropriate time able commit certain amount time less likely commitment met Related practices O1 Learn experience preferences time constraints participants GENERAL practice Challenges people prevented pursuing role forms contribution prevent participating altogether Assessing potential contributors requires effort Used CM3 CM8 CM10 CM13 CM14 Another community manager explained process also benefit mentor “Encouraging someone answer questions IRC example communicates think grasp concepts” —CM2 425 Contributor Retention category Contributor Retention contains practices encourage contributors return CM13 explained R9 Provide evaluations promotion path useful retention practice “It also important provide episodic volunteers metric achievement community time dedicated tasks completed grow basic volunteers representatives 
mentors influential leaders even employees motivating results retention” —CM13 Another community manager described additional benefit community “Skills exploration skill building sessions prove helpful main goal would know skills episodic volunteers skills develop contribute projects long term short term” —CM14 Practice W10 Encourage learners mentor Context Highly active contributors limited time mentor episodic contributors Concerns 2C Episodic contributor lacks awareness opportunities contribute 4C Episodic contributor lacks understanding vision 8C Community’s cost supervision exceeds benefit episodic contribution 11C Community lacks episodic strategy Solution Engage episodic contributors leading episodic contributors Let review episodic contributions mentor episodic contributors Episodic contributors likely understand concerns limitations episodic contributors Using returning episodic contributors lead episodic contributors lets core contributors focus areas recognizes competency returning episodic contributors Related practices P16 Share success stories COMPLEMENTARY practice W1 key contributor responsible COMPLEMENTARY practice W8 Mentor quality COMPLEMENTARY practice R2 Encourage social connections COMPLEMENTARY practice Used CM2 CM5 CM12 CM13 Proposed CM11 CM16 Practice R9 Provide evaluations promotion path Context Episodic contributors unable develop contributors sustained episodic participation absences affect completion duties Concerns 15C Community gives episodic contributors reduced access opportunities rewards Solution Provide assessment opportunities episodic contributors Examples assessment skill exploration personal evaluation Examples opportunities travel employment consideration succession planning skill building Sustained episodic participants encouraged continue contributing beneficial community Related practices R10 Promote episodic contributors SPECIFIC practice Used CM13 CM14 CM22 Proposed CM1 43 Workflows Many practices limited effectiveness 
implemented alone instance would impossible implement O3 Guide people junior jobs without first implementing P1 Identify appropriate tasks would also ineffective initiate P1 without planning advertise However wide range practices tuned specific contexts single correct way community manager combine practices achieve particular goal asked participants might combine practices workflow order address important concern response question seen examples community managers approached task illustrative practitioners wish understand leverage extensive list practices resulted study beyond scope article identify specific workflows practices could applied community—largely due fact communities beginning address EV—the links related practices within practice description provide guidance community managers envisioned combining practices workflow consists number practices implemented sequentially simultaneously together form one possible solution specific concern workflow diagrams provided appendix 84 available online supplemental material Fig 2 depicts example workflow proposed CM6 address concern 11C Community lacks episodic strategy diagram shows practices P1 Identify appropriate tasks W1 key contributor responsible COMPLEMENTARY practices directly connected PRECEDE practice P10 Keep communication channels active P13 social media team also SUCCEEDS P1 W1 Another workflow shown Fig 3 devised CM19 depicts alternative approach addressing concern shows individual way community managers might join practices address concern based experience idiosyncratic understanding communities
::::
5 DISCUSSION CONCLUSION 51 Discussion 511 Diversity Practices study sought identify concerns community managers episodic volunteers identify practices using—or envisage using—to address concerns conducted policy Delphi study community managers looked study participants engaged different communities different countries representing communities different sizes order identify relationship responses based dimensions responses coded community name countries involved activities community manager experience Observed variations practices based upon dimensions identified described Context field full description practices Community size important factor episodic contributors informed developments Smaller communities favored less formal approach P7 Hold open progress meetings larger communities recommended O5 Manage task assignments application Mature communities concerned governance automation practices G5 Create community definition quality W5 Automate checking quality work O5 Manage task assignments application W13 Automate process assistance Country associated one difference Specifically reimbursement solutions G8 Centralize budgeting sponsorships G9 Use external provider sponsorships frequently mentioned less developed countries regardless location However important note context practices participants need sponsorship situation arise country FLOSS communities rather consistent concerns practices around world unable observe cultural differences Future work might revisit earlier studies suggested culture factor FLOSS participation determine still holds true Contribution type produced greatest amount diversity practices particular event organization supplied number practices primarily applicable context development another area stood influencing practices example G3 Host inperson meetings primarily eventplanning practice P18 Write modular clearly specific development Practices specific one type work within FLOSS community course less likely confirmed general practices 
applicable multiple types contributions may reason practices P20 Offer consistent development environment P17 Provide templates presentations confirmed Future research could focus confirming practices specific aspects FLOSS work determining prevalence use Gender directly included study design although participants could introduce gender context problem solution considered relevant One participant mention gender general statement noting women responsive recruitment “in experience women active volunteering find community responsive clearly see difference managing genderrelated communities regular communities clearly represent state industry” —CM24 FLOSS literature suggests responsive communities welcoming participants 73 96 aligns participant’s subsequent statement “Making community friendly women means making friendly everyone kind person everyone would feel included involved It’s easy see succeeding women literally half population” —CM24 ways increasing female participation include appreciation diverse teams tracking female participation improved mentoring 59 67 Workflows show another aspect variation less easy quantify work community manager “peoplecentric versatile” 97 implicit tacit knowledge communities undoubtedly plays role determining construction workflow Future research could try elicit factors go decisions
::::
512 Comparison Previous Studies identified 65 practices note list practices may exhaustive compared findings earlier study onboarding guidelines based interviews community managers diaries newcomers literature 41 Although study focused newcomers expected find overlap episodic contributors often identified retrospect 72 join also compared results earlier study potential practices managing EV proposed based interviews community managers EV literature 6 Table 5 includes complete list practices proposed two previous studies addition overlapping subset practices study total nine practices appeared studies found study Two practices identified onboarding study 41 eight earlier EV study 6 one practice found studies study difference explained variable levels granularity instance Consider timebased releases could seen specific implementation R1 Publicize release schedule different research approaches also explain difference previous EV study provided suggestions based EV literature recommendations Evaluate assets availability assignments may widelyknown systematically applied FLOSS communities Still practices may considered mainstream participants need mention Good documentation end study identified 52 practices described previous studies addition 13 previously described see Table 5 emphasis identifying practices explains many new practices relevant EV found Many practices familiar FLOSS domain community managers adapting existing practices EV context
::::
52 Limitations Study Delphi method qualitative method traditional criteria used quantitative studies internal validity external validity reliability appropriate due epistemological differences Instead qualitative research best evaluated alternative set criteria naturalistic inquiries proposed Guba 95 Guba’s criteria credibility transferability dependability confirmability Credibility Credibility concerns plausible true findings confidence result strengthened fact practices identified iteratively ten month period meant many opportunities participants reflect information presented amend design Delphi study involves member checking theory development phase Preliminary results also shared community manager involved study additional form member checking Transferability Guba recommends purposive sampling means ensuring transferability results 95 identified three dimensions literature suggested might affect results created diverse Delphi study panel able observe situations dimensions limited applicability practices also able identify broadly applicable practices able differentiate novel suggestions practices already use Dependability Dependability strengthened maintaining audit trail maintained anonymized well original copies responses including feedback collation retained copy collation state appeared round well feedback received collation supplemental documents developed creating collation also retained repository Confirmability multiple opportunities study participants correct researcher bias multiple phases Delphi study allow participants respond developing theory form member checking addition reflected understanding participants personalized report practices understood tried advocated requested corrections
::::
53 Conclusion identification 65 practices 52 previously described context managing EV FLOSS demonstrates many community managers actively thinking incorporate EV study confirms 74 percent practices identified actively used contrast earlier qualitative survey state EV FLOSS communities found community managers aware EV taking specific steps manage 6 Given nascent state literature EV FLOSS communities study fills significant gap also described relationships practices gave examples practices combined form workflow findings study readily adopted FLOSS community managers identified 16 concerns community managers EV communities identified frequently observed participants concerns ranked expert panel members study ranked list provides roadmap future research provides clues researchers practitioners might direct energy Concerns linked practices addressing opening possibility future studies investigating effectiveness different approaches collection practices 84 created extensive guide managing EV FLOSS readily understood researchers practitioners draws upon experiences seasoned community managers number different communities geographic regions areas expertise best knowledge study first gathered practices managing episodic contributors FLOSS communities Given increasing attention episodic contributors phenomenon within open source literature believe study provides timely foundation future work area
::::
ACKNOWLEDGMENTS authors would like thank community mentors contributed significant time participate study R Bowen N Bowers AI Chiuta Coughlan El Achêche B “bex” Exelbierd L Kisuuki N Kolokotronis G Lelarge G Link Park Pkpacheco Pinheiro Randal J Rey C Shorter H Tabunshchyk L Vancsa H Woo Zacchiroli V Zimmerman participants preferred remain anonymous Additionally would like thank reviewers constructive feedback Finally B Segletes provided helpful formatting advice work supported part Science Foundation Ireland grants 13RC2094 15SIRG3293
::::
REFERENCES 1 K Nakakoji Yamamoto Nishinaka K Kishida Ye “Evolution patterns opensource systems communities” Proc Int Workshop Princ Softw Evol 2002 pp 76–85 2 Mockus R Fielding J Herbsleb “Two case studies open source development Apache Mozilla” ACM Trans Softw Eng Methodology vol 11 3 pp 309–346 2002 3 K Crowston H Annabi J Howison C Masango “Effective work practices engineering Freelibre open source development” Proc Workshop Interdisciplinary Softw Eng Res 2004 pp 18–26 4 G Pinto Steinmacher Gerosa “More common think indepth study casual contributors” Proc 23rd Int Conf Softw Anal Evol Reengineering 2016 vol 1 pp 112–123 5 Lee J C Carver “Are onetime contributors different comparison core periphery developers FLOSS repositories” Proc Int Symp Empir Softw Eng Mes 2017 pp 1–10 6 Barcomb Kaufmann Riehle KJ Stol B Fitzgerald “Uncovering periphery qualitative survey episodic volunteering freelibre open source communities” IEEE Trans Softw Eng 2018 Online Available httpdxdoiorg101109TSE20182872713 7 Barcomb KJ Stol Riehle B Fitzgerald “Why episodic volunteers stay FLOSS communities” Proc Int Conf Softw Eng 2019 pp 948–959 Online Available httpscorauciehandle104687248 8 N Macduff “Societal changes rise episodic volunteer” Emerg Areas Volunteering vol 1 2 pp 49–61 2005 9 F Tang N MorrowHowell E Choi “Why older adult volunteers stop volunteering” Ageing Soc vol 30 5 pp 859–878 2010 10 Harrison “Volunteer motivation attendance decisions Competitive theory testing multiple samples homeless shelter” J Appl Psychol vol 80 3 pp 371–385 1995 11 R Cnaan F Handy “Towards understanding episodic volunteering” Vrijwillige Inzet Onderzocht vol 2 1 pp 29–35 2005 12 L Bao X Xia Lo G C Murphy “A large scale study longtime contributor prediction GitHub projects” IEEE Trans Softw Eng published doi 101109TSE20192918536 13 J Gamalielsson B Lundell “Sustainability open source communities beyond fork Libreoffice evolved” J Syst Softw vol 89 pp 128–145 2014 14 Foucault Palyart X Blanc G C 
Murphy JR Falleri “Impact developer turnover quality opensource software” Proc 10th Joint Meeting Found Softw Eng 2015 pp 829–841 15 IzquierdoCortazar G Robles F Ortega J GonzálezBarahona “Using archaeology measure knowledge loss projects due developer turnover” Proc 42nd Hawaii Int Conf Syst Sci 2009 pp 1–10 16 Zhou Mockus “Who stay FLOSS community Modeling participant’s initial behavior” IEEE Trans Softw Eng vol 41 1 pp 82–99 Jan 2015 17 Hager “Toward emergent strategy volunteer administration” Int J Volunt Adm vol 29 3 pp 13–22 2013 18 N Macduff “Episodic volunteers Reality future” Voluntary Action Leadership vol Spring pp 15–17 1990 19 K Culp III Nolan “Trends impacting volunteer administrators next ten years” J Volunt Adm vol 19 1 pp 10–19 2000 20 L Hustinx F Lammertyn “Collective reflexive styles volunteering sociological modernization perspective” Voluntas Int J Voluntary Nonprofit Organizations vol 14 2 pp 167–187 2003 21 K Smith K Holmes HaskiLeventhal R Cnaan F Handy J L Brudney “Motivations benefits student volunteering Comparing regular occasional nonvolunteers five countries” J Nonprofit Soc Econ Res vol 1 1 2010 Art 65 22 R Cnaan H Daniel Heist H Storti “Episodic volunteering religious megaevent” Nonprofit Manage Leadership vol 1 1 pp 1–14 2017 23 Koch G Schneider “Effort cooperation coordination open source GNOME” Inf Syst J vol 12 1 pp 27–42 2002 24 DinhTrong J Bieman “The FreeBSD replication case study open source development” IEEE Trans Softw Eng vol 31 6 pp 481–494 Jun 2005 25 J J Davies H V K Nussbaum German “Perspectives bugs Debian bug tracking system” Proc 7th Work Conf Mining Softw Repositories 2010 pp 86–89 26 F Rullani Haefliger “The periphery stage intraorganizational dynamics online communities creation” Res Policy vol 42 4 pp 941–953 2013 27 Riehle P Riemer C Kolassa Schmidt “Paid vs volunteer work open source” Proc 47th Hawaii Int Conf Syst Sci 2014 pp 3286–3295 28 G Pinto L F Dias Steinmacher “Who gets patch accepted first Comparing 
contributions employees volunteers” Proc 11th IEEEACM Int Workshop Cooperative Hum Aspects Softw Eng 2018 pp 110–113 29 Capiluppi KJ Stol C Boldyreff “Exploring role community stakeholders open source evolution” Proc IFIP Int Conf Open Source Syst 2012 pp 178–200 30 B Lundell et al “Addressing lockin interoperability longterm maintenance challenges open source companies strategically use open source” Proc IFIP Int Conf Open Source Syst 2017 pp 80–88 31 L F Dias Steinmacher G Pinto “Who drives companyowned OSS projects Employees volunteers” Proc V Work Softw Vis Evol Maintenance 2017 Art 10 32 L Dahlander G Magnusson “Relationships open source companies communities Observations Nordic firms” Res Policy vol 34 4 pp 481–493 2005 33 P J Ägerfalk B Fitzgerald “Outsourcing unknown workforce Exploring opensourcing global sourcing strategy” MIS Quart vol 32 2 pp 385–409 2008 34 G Von Krogh Spaeth “The open source phenomenon Characteristics promote research” J Strategic Inf Syst vol 16 3 pp 236–253 2007 35 K Carillo Huff B Chawner “What makes good contributor Understanding contributor behavior within large freeopen source projects—A socialization perspective” J Strategic Inf Syst vol 26 4 pp 322–359 2017 36 C Jensen C Boldyreff “Role migration advancement processes OSSD projects comparative case study” Proc 29th Int Conf Softw Eng 2007 pp 364–374 37 Fang Neufeld “Understanding sustained participation open source projects” J Manage Inf Syst vol 25 4 pp 9–50 2009 38 Rozas “Selforganisation commonsbased peer production Drupal ‘The drop always moving’” PhD dissertation University Surrey Guildford UK 2017 Online Available httpsdavidrozasccphd 39 Osterloh Rota “Open source development—just another case collective invention” Res Policy vol 36 2 pp 157–171 2007 40 R Pham L Singer K Schneider “Building test suites social coding sites leveraging driveby commits” Proc Int Conf Softw Eng 2013 pp 1209–1212 41 Steinmacher C Treude Gerosa “Let Guidelines successful onboarding newcomers 
open source projects” IEEE Softw vol 36 4 pp 41–49 JulAug 2019 42 Sholler Steinmacher Ford Averick Hoye G Wilson “Ten simple rules helping newcomers become contributors open source projects” PLoS Comput Biol vol 15 9 2019 Art e1007296 43 K Crowston J Howison “The social structure free open source development” First Monday vol 10 2 2005 44 K R Lakhani “The core periphery distributed selforganizing innovation systems” PhD dissertation Massachusetts Institute Technology Cambridge 2006 45 R Krishnamurthy V Jacob Radhakrishnan K Dogan “Peripheral developer participation open source projects empirical analysis” ACM Trans Manage Inf Syst vol 6 4 pp 14–45 2016 46 P Setia B Rajagopalan V Sambamurthy R Calantone “How peripheral developers contribute open source development” Inf Syst Res vol 23 1 pp 144–163 2012 47 J Wang “Survival factors free open source projects multistage perspective” Eur Manage J vol 30 4 pp 352–371 2012 48 B Vasilescu Serebrenik Goeminne Mens “On variation specialisation workload case study Gnome ecosystem community” Empir Softw Eng vol 19 4 pp 585–1008 2014 49 G Von Krogh Spaeth K R Lakhani “Community joining specialization open source innovation case study” Res Policy vol 32 7 pp 1217–1241 2003 50 L Dahlander O’Mahony “Progressing center Coordinating work” Organization Sci vol 22 4 pp 961–979 2011 51 C Amrit J van Hillegersberg “Exploring impact sociotechnical coreperiphery structures open source development” J Inf Technol vol 25 2 pp 216–229 2010 52 K Neuling Hannemann R Klamma Jarke “A longitudinal study communityoriented open source development” Proc Int Conf Adv Inf Syst Eng 2016 pp 509–523 53 Capiluppi Michlmayr “From cathedral bazaar empirical study lifecycle volunteer community projects” Proc IFIP Int Conf Open Source Syst 2007 pp 31–44 54 H Masmoudi den Besten C de Loupy JM Dalle “Peeling onion” Proc IFIP Int Conf Open Source Syst 2009 pp 284–297 55 G Von Krogh Haefliger Spaeth W Wallin “Carrots rainbows Motivation social practice open source 
development” MIS Quart vol 36 2 pp 649–676 2012 56 Lee J C Carver Bosu “Understanding impressions motivations barriers one time code contributors FLOSS projects survey” Proc 39th Int Conf Softw Eng 2017 pp 187–197 57 Labuschagne R Holmes “Do onboarding programs work” Proc 12th Work Conf Mining Softw Repositories 2015 pp 381–385 58 Steinmacher G Silva Gerosa F Redmiles “A systematic literature review barriers faced newcomers open source projects” Inf Softw Technol vol 59 pp 67–85 2015 59 Balalí Steinmacher U Annamalai Sarma Gerosa “Newcomers’ barriers analysis mentors’ newcomers’ barriers OSS projects” Comput Supported Cooperative Work vol 27 pp 679–714 2018 60 C Mendez et al “Open source barriers entry revisited sociotechnical perspective” Proc Int Conf Softw Eng 2018 pp 1004–1015 61 Bayati “Understanding newcomers success open source community” Proc 40th Int Conf Softw Eng Companion Proc 2018 pp 224–225 62 Steinmacher Gerosa U Conte F Redmiles “Overcoming social barriers contributing open source projects” Comput Supported Cooperative Work vol 28 12 pp 247–290 2019 63 Steinmacher G Pinto Wiese Gerosa “Almost study quasicontributors opensource projects” Proc 40th Int Conf Softw Eng Companion Proc 2018 pp 985–1000 64 Nafus “‘Patches don’t gender’ open open source projects” New Media Soc vol 14 4 pp 256–266 2012 65 K Carillo JG Bernard “How many hawks hide umbrella examination lay conceptions conceal contexts freeopen source software” Proc Int Conf Inf Syst 2015 Online Available httpsdblporgrecconficisCarilloB15 66 Nafus “‘Patches don’t gender’ open open source projects” New Media Soc vol 14 4 pp 669–683 2012 67 Bosu K Z Sultana “Diversity inclusion open source OSS projects stand” Proc ACMIEEE Int Symp Empir Softw Eng Mes 2019 pp 1–11 68 Izquierdo N Huesman Serebrenik G Robles “OpenStack gender diversity report” IEEE Softw vol 36 1 pp 28–33 JanFeb 2019 68 Storey Zagalsky F F Filho L Singer German “How social communication channels shape challenge participatory culture 
development” IEEE Trans Softw Eng vol 43 2 pp 185–204 Feb 2017 69 Burnett Peters C Hill N Elarief “Finding genderinclusiveness issues GenderMag field investigation” Proc CHI Conf Hum Factors Comput Syst 2016 pp 2586–2598 70 K Hyde J Dunn P Scuffham K Chambers “A systematic review episodic volunteering public health contexts” BMC Public Health vol 14 1 pp 992–1008 2014 71 K Hyde J Dunn C Bax K Chambers “Episodic volunteering retention integrated theoretical approach” Nurse Educ Voluntary Sector Quart vol 45 1 pp 45–63 2016 72 L Bryen K Madden “Bounceback episodic volunteers makes episodic volunteers return” Queensland University Technology Brisbane Australia Rep CPNS32 2006 73 Steinmacher Wiese P Chaves Gerosa “Why newcomers abandon open source projects” Proc 6th Int Workshop Cooperative Hum Aspects Softw Eng 2013 pp 25–32 74 R Safrit V Merrill “Management implications contemporary trends volunteerism United States Canada” J Volunt Adm vol 20 2 pp 12–23 2002 75 Nunn “Building bridge episodic volunteerism social capital” Fletcher World Aff vol 24 pp 115–127 2000 76 L C P Meijs J L Brudney “Winning volunteer scenarios soul new machine” Int J Volunt Adm vol 24 6 pp 789–799 2007 77 Turoff “The design policy Delphi” Technological Forecasting Soc Change vol 2 2 pp 149–171 1970 78 N Dalkey Helmer “An experimental application Delphi method use experts” Manage Sci vol 9 3 pp 458–467 1963 79 W Weaver “The Delphi forecasting method” Phi Delta Kappan vol 52 5 pp 267–271 1971 80 H Linstone Turoff Eds Delphi Method Techniques Applications vol 18 Boston USA AddisonWesley Publishing Company 2002 81 L E Miller “Determining couldshould Delphi technique application” 2006 82 K Conboy B Fitzgerald “Method developer characteristics effective agile method tailoring study XP expert opinion” ACM Trans Softw Eng Methodol vol 20 1 2010 Art 2 83 F Krafft KJ Stol B Fitzgerald “How freeopen source developers pick tools Delphi study Debian project” Proc 38th Int Conf Softw Eng Companion 2016 pp 
232–241 84 Barcomb KJ Stol B Fitzgerald Riehle “Appendix Managing episodic contributors free libre open source communities” IEEE Trans Softw Eng published doi 101109TSE20202985093 85 C Okoli Pawlowski “The Delphi method research tool example design considerations applications” Inf Manage vol 42 1 pp 15–29 2004 86 K Q Hill J Fowles “The methodological worth Delphi forecasting technique” Technological Forecasting Soc Change vol 7 2 pp 179–192 1975 87 R Loo “The Delphi method powerful tool strategic management” Policing Int J Police Strategies Manage vol 25 4 pp 762–769 2002 88 L Delbecq H van de Ven H Gustafson Group Techniques Program Planning Guide Nominal Group Delphi Processes Glenview IL USA Scott Foresman Company 1975 89 Carvalho Sampaio “Volunteer management beyond prescribed best practice case study Portuguese nonprofits” Personnel Rev vol 46 2 pp 410–428 2017 90 Takhteyev Hilts “Investigating geography open source GitHub” University Toronto Toronto Canada 2010 Online Available httpwwwtakhteyevorgpapersTakhteyevHilts2010pdf 91 J C Crotts W Litvin “Crosscultural research researchers better served knowing respondents’ country birth residence citizenship” J Travel Res vol 42 2 pp 186–190 2003 92 Kim “Intercultural personhood Globalization way being” Int J Intercultural Relations vol 32 4 pp 359–368 2008 93 V Braun V Clarke “Using thematic analysis psychology” Qualitative Res Psychol vol 3 2 pp 77–101 2006 94 Riehle N Harutyunyan Barcomb “Pattern discovery validation using scientific research methods” FriedrichAlexander Universität ErlangenNürnberg Erlangen Germany Tech Rep CS202001 Mar 2020 Online Available httpsdirkriehlecomwpcontentuploads202003csfautr202001pdf 95 E G Guba “Criteria assessing trustworthiness naturalistic inquiries” Educ Technol Res Develop vol 29 2 pp 75–91 1981 96 V Singh W Brandon “Open source community inclusion initiatives support women participation” Proc IFIP Int Conf Open Source Syst 2019 pp 68–79 97 H Mäenpää Munezero F Fagerholm 
Mikkonen “The many hats broken binoculars State practice developer community management” Proc 13th Int Symp Open Collaboration 2017 Art 1 Ann Barcomb received PhD degree University Limerick Limerick Ireland member Open Source Research Group FriedrichAlexander University ErlangenNürnberg Erlangen Germany Lero–the Irish Research Centre Throughout career active freelibreopen source particular Perl community information please visit annbarcomborg KlaasJan Stol lecturer School Computer Science Information Technology University College Cork Cork Ireland SFI principal investigator funded investigator Lero—the Irish Research Centre research interests include research methodology contemporary development approaches information please visit kstoluccie Brian Fitzgerald director Lero—the Irish Research Centre holds endowed chair Frederick Krehbiel II chair Innovation Business Technology University Limerick Limerick Ireland research interests include open source inner source crowdsourcing agile methods information please visit bfleroie Dirk Riehle received PhD degree computer science ETH Zürich Zürich Switzerland professor computer science FriedrichAlexander University Erlangen Germany led Open Source Research Group SAP Labs Silicon Valley founded Open Symposium OpenSym lead architect first UML virtual machine blogs information please visit httpdirkriehlecom reached dirkriehleorg
::::
Developers Adopt Change Licenses Christopher Vendome 1 Mario LinaresVásquez 1 Gabriele Bavota 2 Massimiliano Di Penta 3 Daniel German 4 Denys Poshyvanyk 1 1 The College William Mary VA USA — 2 Free University Bolzano Italy — 3 University Sannio Italy — 4 University Victoria BC Canada Abstract—Software licenses legally govern way developers use modify redistribute particular system previous studies either investigated licensing mining repositories studied licensing FOSS reuse aim understanding rationale behind developers’ decisions choosing changing licensing surveying open source developers paper analyze developers consider licensing reasons developers pick license factors influence licensing changes Additionally explore licensingrelated problems developers experienced expectations licensing support forges eg GitHub investigation involves one hand analysis commit history 16221 Java open source projects identify commits licenses added changed hand consisted survey—in 138 developers informed involvement licensingrelated decisions 52 provided deeper insights rationale behind actions undertaken results indicate developers adopt licenses early project’s development change licensing period development also found developers inherent biases respect licensing Additionally reuse—whether noncontributor commercial purposes—is dominant reason developers change licenses systems Finally discuss potential areas research could ameliorate difficulties developers facing regard licensing issues systems Index Terms—Software Licenses Mining Repositories Empirical Studies INTRODUCTION licenses legal mechanism used determine system copied modified redistributed licenses allow third party utilize code long adhere conditions license particular open source licenses comply Open Source Definition 4 Specifically goal licenses facilitate copying modifying 
distributing long set ten conditions met free redistribution availability source code open source creators must choose open source license However large number open source licenses use today range highly restrictive General Public License—GPL—family licenses ones restrictions MIT license choice license determine given open source reused especially true libraries expected integrated distributed uses Furthermore choice license might also affected dependencies used eg uses library GPL requires GPL also uses library MIT license license including commercial point creators open source must choose license 1 expresses developers’ philosophy 2 meets deployment goals 3 consistent licenses components reused However choosing license easy process Developers necessarily clear idea exact consequences licensing licensing code specific license instance developers ask questions Question Answer QA websites looking advice redistribute code licensed dual license among issues eg question 2758409 Stack Overflow 19 question 139663 StackExchange site programmers 28 Also problem license incompatibility components trivial see 15 detailed description problem evolution system license might change previous work 30 empirically showed—for hosted GitHub—that license changes common phenomena Stemming results previously captured analyzing licensing changes repositories 30 goal work understand changes licensing happen Specifically paper reports results survey 138 developers aim understanding developers consider adding license ii choose specific license projects iii factors influencing license changes 138 participants respondents set 2398 invitees ie 575 invitees identified developers sampling 16221 Java projects GitHub subsetting 1833 projects license changed time 138 developers 52 developers offered insights aforementioned questions remaining developers reinforced licensing decisions necessarily made contributors subset copyright holders main findings study following 1 Developers frequently license 
code early main rationale delaying licensing usually wait first release 2 Developers strong intrinsic beliefs affect choice licenses Also open source foundations Apache Foundation Free Foundation Eclipse Foundation exert powerful influence choice license 3 observed change licenses system predominantly influenced need facilitate reuse mostly commercial systems 4 Developers experience difficulties understanding licensing terms dealing incompatible licenses II RELATED WORK work mainly related automatic identification classification licensing artifacts ii empirical studies investigating license adoption license evolution iii qualitative studies licensing Table presents prior work licensing reporting main purpose study corresponding dataset used Identifying Classifying Licensing Automatic identification licensing widely explored best knowledge FOSSology 17 first one aimed solving problem license identification extracting licensing information projects using machine learning classification Another representative ASLA tool Tuunanen et al 29 showed 89 accuracy respect classifying licenses files FOSS systems current stateoftheart automated tool license identification Ninka proposed German et al 16 Ninka relies patternmatching order identify licensing statements return license name version eg Apache20 evaluation Ninka indicated precision 95 Since always distributed source code traditional approaches license identification based parsing licensing statements always applicable bytecode binaries inherently contain licensing information ameliorate problem Di Penta et al 9 proposed approach uses code search textual analysis automatically identify licensing jars approach automatically queried Google Code Search extracting information decompiled code Additionally German et al investigated ability identify FOSS licensing conjunction proprietary licensing analyzing 523930 archives 12 paper rely Ninka 16 license identification since current stateoftheart technique However work aim 
improve upon license identification classification rather understand rationale behind licensing decisions B Empirical Studies Licenses Adoption Evolution Di Penta et al 10 investigated license migration evolution maintenance six FOSS projects authors unable find generalizable pattern among projects results suggested version type license modified systems’ life cycles German et al 15 investigated way developers handle license incompatibilities analyzing 124 FOSS packages investigation constructed model outlines advantages disadvantages certain licenses well applicability Additionally German et al 13 conducted empirical study understand extent package licensing source code files consistent ii evaluate presence licensing issues due dependencies among packages authors investigated 3874 packages Fedora12 Linux distribution confirmed subset licensing issues developers Fedora Manabe et al 21 analyzed FreeBSD OpenBSD Eclipse ArgoUML order identify changes licensing authors found four projects exhibited different patterns changes licensing German et al analyzed fragments cloned code Linux Kernel OpenBSD FreeBSD 14 investigated extent terms licenses adhered cloning code fragments Similarly Wu et al 31 found cloned files potential inconsistent terms licenses eg one license paper describes types inconsistencies illustrates problem difficulty resolve empirical study Debian 75 related empirical study work previous work 30 analyzed license usage license changes 16221 projects sought extract rationale commit messages issue tracker discussions results indicated lack documentation licensing sources sharing motivation work novel investigates developers choose license change licensing opposed extent changes occur presents rationale survey conducted actual developers projects Study Purpose Dataset German et al Investigate presence license incompatibilities 3874 packages Di Penta et al Investigate license evolution system’s maintenance evolution 6 systems German et al Investigate way 
developers address incompatible licensing 124 systems German et al Investigate licensing copied code fragments Linux two BSD distributions 3 systems Manabe et al Investigate license change patterns within FOSS systems 4 systems Singh et al Investigate reasons adoption particular FOSS license 5307 projects Sojer et al Investigate reuse legal implication Internet code 686 developers Sojer et al Investigate FOSS code reuse 869 developers Vendome et al Investigate license usage changes FOSS systems rationale revision history issue tracker 16221 systems dataset instead relying rationale issue tracker discussions commit messages C Qualitative Studies Licensing Singh Phelps 25 studied reasons behind adoption specific license FOSS results suggest choice mainly driven social factors—the adoption license new based licenses adopted socially close existing projects eg projects ecosystem work considered license adoption social networking perspective see “licensor” may influenced toward particular licenses based social proximity work investigate latent social connections developers projects contributed Instead directly surveyed developers understand reasoning adopting particular license Sojer et al conducted survey 869 developers regarding reuse open source code legal implications resulting code 26 One key finding industry academic institutions prioritize knowledge regarding licensing reuse authors compared selfassessment questionnaire licensing found discrepancy perceived knowledge actual understanding licensing Additionally Sojer et al conducted survey 686 practitioners regarding reuse FOSS code found licensing FOSS code second largest impedance reuse 27 authors point possible reasons observation study specifically aims understand reasons choosing changing licenses well types problems practitioners face due licensing III DESIGN STUDY goal study investigate developers consider licensing issues reasons developers pick change licensing FOSS projects context consists projects ie 
change history 16221 Java FOSS projects mined GitHub subjects ie 138 practitioners contributing subset mined projects Research Questions aim answering following research questions RQ1 developers first assert licensing research question first examines developers commit license least one file FOSS projects hosted GitHub ie goes licensing least one license complement analysis questions developers understand actual rationale behind empirical observations RQ2 developers change licensing research question relies similar analysis previous question specifically investigates licensing changes ie change license license B RQ3 problems developers face licensing support expect forge question aims understanding problems developers experience licensing better support Additionally interested understanding expectation developers may support incorporated forges order answer research questions consider two perspectives evidence collected analyzing projects’ change history ii evidence collected surveying developers perspectives explained following B Analysis Projects’ Change History investigate developers pick change licensing mined entire commit history 16221 public Java projects GitHub first queried GitHub using public API 2 generate information publicly available projects extracted comprehensive list 381161 Java projects mining information twelve million projects locally cloned Java repositories consumes total 63 Tb storage space randomly sampled 16221 projects due computation time underlying infrastructure analyze licensing file revisions commitlevel granularity 1731828 commits spanned 4665611 files Table II reports statistics size attributes analyzed dataset overall number different licenses considered study relied upon MARKOS code analyzer 7 extract licensing throughout project’s revision history code analyzer incorporates Ninka license classifier 16 order identify licensing statements classify license family version applicable file code analyzer mined change log 16221 projects 
extracted commit hash date author file commit message change file Addition Modification Deletion license change Boolean value license name version reported list multiple licenses detected data extraction step 16221 projects took almost 40 days total 1731828 commits spanning 4665611 files analyzed case BSD CMU licenses reported variant either case since Ninka unable identify particular version case GPL LGPL possible license exception allows developers pick future versions license annotate license “” eg GPL20 signifies terms GPL30 also used identify licensing changes followed procedure exploited previous work 30 particular identify commit ci responsible introducing license code file F ci Ninka identify license F ci license F retrieved ie License rightarrow License transition F Instead consider ci licensing change license type andor version detected Ninka F ci different one detected ci ie License rightarrow License transitions C Analysis Developers’ Survey investigate reasons developers addchange licenses systems surveyed developers made licensing changes systems contributed find potential developers survey utilized results quantitative analysis 16221 projects analyzed found 1833 projects experienced either delayed initial license addition ie License rightarrow License transition happened first commit licensing change ie License rightarrow License change history included scenarios understand rationale behind RQ1 RQ2 required change licensing projects used version control history extract set contributors 1833 projects licensing changes identified total 2398 valid developers email address targeted potential participants study valid refer filtering contributor email addresses matching following two patterns — “userlocahost ” “usernone ”— since pointed clearly invalid domains also removed developers Android framework since always licensed Apache license 2398 developers invited via email fillin online survey hosted Qualtrics 5 survey answers anonymous email invitation 
included link survey ii description specific licensing additionchanges observed project’s history contacted developers offered insights regarding changes directly responding email total emailed 2398 individuals received 138 responses survey 15 followup emails developers volunteered additional information Overall response rate 575 developers contacted survey consisted seven questions Q1Q7 Q7 optional 12 participants answered Tables III IV list survey questions responses developers Q1 Q2 dichotomous questions questions used ensure respondents involved determining project’s licensing respondent answer “yes” Q2 survey ended participant 138 participants 62 responded “no” Q2 ineligible remaining questions Q3Q7 Questions Q3 Q6 multiple choice questions included “Other” option respondents chose “Other” could elaborate using openended field Question Q7 optional openended chose make optional developers may agree forge responsible features supporting licensing 138 respondents 76 developers eligible entire survey Q1Q7 per response Q2 52 individuals completed survey Since questions Q3Q7 also included openended responses relied formal groundedtheory 8 coding openended responses Three authors read responses categorized response represented developer’s rationale categories three authors analyzed merged second round obtain final taxonomy categories Tables Section IV present final results groundedtheory process IV RESULTS section discusses achieved results answering three research questions formulated Section IIIA licenses added FOSS projects Fig 1 shows distribution number commits licenses introduced projects within dataset eg license introduced tenth commit represented number 10 present raw commits log scale due outliers large commit histories least 25 first quartile projects licensed first commit Fig 1 median also two commits third quartile five commits observation indicates FOSS projects licensed early change history 75 projects license fifth commit Assuming might always case 
observed history corresponds entire history result suggests licensing important developers interesting note mean commit number adding license 21 maximum value 8623 commits two values indicators long tail small number projects consider licensing late change history Summary RQ1 History Results observed developers consider licensing early change histories FOSS projects projects assert license larger number commits 75 dataset license asserted within first five commits Thus data suggests projects adopt licenses among first commit activities B licenses added FOSS projects Table III reports responses Question 3 Q3 survey tried ascertain rationale behind initial licensing 308 developers indicated community influences initial licensing One explanation high prevalence response certain FOSS communities stipulate enforce particular license must used example Apache Foundation requires projects code contributed projects licensed Apache20 license Instead Free Foundation promotes use GPL LGPL family licenses 192 developers chose license goal making reusable commercial applications responses also indicate bias toward permissive licenses facilitate usage restrictive licenses discourage usage since require system licensed terms finding provides partial explanation trend toward permissive licenses observed previous work 30 results survey also show licensingrelated decisions impacted inherent developer bias 154 developers supplied answers categorized moralethicalbeliefs example category response one developer indicating “I always use GPL30 philosophical reasons” Similarly different developer echoed comment stating “I always licence GPL moral reasons” Satisfying dependency constraint ie need use license based license dependencies relevant reason 96 77 picking explicit option 19 “Other” response categorized dependency constraint result important since little work done analyze licensing across dependencies problem also poses challenges identifying necessary dependencies well licenses 
dependencies automated build frameworks like Maven 6 Gradle 3 attempt ameliorate difficulty listing dependencies file drives building process eg Object Model file Maven However licensing required field files remaining answers question described situations license inherited initial founders persisted time Also companies policies specifically dictate licensing convention latter case respondent indicated “company policy Apache20” company name omitted privacy also interesting see nobody choose license based requests outsiders Lastly identified category licensing changes related license adoption changes 77 developers respond question licensing changes indicated license missing added later commit case added License Addition category Q4 Table III developers noted “Setting license forgotten first place” “Accidentally didn’t include explicit licence initial commit” cases also important since create inconsistencies within system mislead noncontributors unlicensed licensed incompatible terms result reinforces developers view early license adoption important lack license may mistake Summary RQ1 Survey Results initial licensing predominantly influenced community developer contributing Subsequently commercial reuse common factor may reinforce prevalence permissive license usage reuse consideration noncontributors seem impact initial licensing choice also found inclusion particular dependency impact initial licensing C licenses changed FOSS projects Fig 2 shows distribution licenses changed projects within dataset ie license→Some License previous section present raw commit number changes occurred log scale due outliers large commit histories Interestingly minimum value second commit ie license changed right addition first commit generally 25 license changes occur first 100 commits median value 559 commits mean 3993 commits third quartile 2086 commits quite smaller mean suggests long tail license changes occurring late projects’ change histories maximum commit number license 
change commit 56746 Numbers extreme would cause larger mean value compared median Overall data suggests certain projects change licenses early change history however license changes much prevalent later commits Summary RQ2 History Results observed developers change licensing later change history FOSS projects projects change licensing early first quartile 100 commits third quartile 2086 commits demonstrating substantial development occurred changing licensing licenses changed FOSS projects Table III shows responses Question 4 Q4 survey investigated rationale behind license changes Allowing reuse commercial common reason behind licensing changes 327 option also second prevalent choosing initial license 192 developers Combining two results clear current license heavily affected need reused commercially previously stated result qualitatively supports observation previous work 30 observed projects tend migrate toward lessrestrictive licenses 77 developers changed licensing due community influence response significant factor initial choice licensing emphasizes impact community assert One developer commented “community influence contributing Apache’s projects” Similarly two developers commented influence Eclipse Foundation exercised license changes projects Interestingly one developer reported “I wanted use common one OSS Java projects” response suggests particular license may pick momentum spread particular language Interestingly observed 77 developers willing change licensing due requests noncontributors fact response prevalent changing licensing choosing initial license may influenced outsiders waiting stable mature inquiring particular licensing also observed change licenses dependency using new dependency prompted developers change licenses 58 developers cases observation demonstrates difficulty impact dependency respect licensing also suggests could inconsistencies licensing system dependencies Moralethicalbeliefs also reason 58 developers Interestingly observed 
beliefs developers beliefs philanthropist funding project’s development one developer acknowledged “I simply wanted pick ‘free’ license chose Apache without much consideration” another developer indicated “Philanthropic funders encouraged us move GPL3 well internal reflection came understand GPL3 better” former example notable developer’s concern impact Apache license particular primary motivator free license ie FOSS license latter indicates individuals funding projects influence licensing developers coerced change GPL30 still influenced beliefs individuals funding system’s change history Summary RQ2 Survey Results developers seem change licensing support reuse commercial systems community influence still impacts changing licensing appears less significant factor respect license adoption Based survey results reasons behind changing licensing diverse evenly distributed among topics observed selection initial license E problems developers face licensing support expect forge Table IV shows results Questions 57 Q5Q7 investigate problems developers experience licensing expected licensing support forge Q5 investigated problems related licensing developers experienced 23 52 developers 442 explicitly mentioned “No problem” “Other” field recognized problems main reason inability others use due license 173 Since developers consider problem suggests developers interested allowing broad access work However may constrained due desired protections eg patent protect Apache20 GPL30 external factors like licensing dependencies external since developers cannot change licenses Additionally developers indicated choosing correct license difficult 135 litigious nature licenses lead misinterpretations developers example Apache Foundation states webpage “The Apache Foundation still trying determine version Apache License compatible GPL” 1 Additionally 58 developers indicated experienced misunderstandings respect license compatibility make matters worse 96 developers experienced 
compatibility problems dependencies Therefore developers faced difficulty determining appropriate license also misunderstood compatibility among licenses experienced incompatibility project’s licensing desired dependency’s licensing Developers also experienced difficulties users misinterpreting understanding terms license One developer stated “Users readunderstand license even though simple one” result poses two possible problems — either users ie developers looking reuse code ignore actual licensing text struggle interpret even easier licenses former would demonstrate bigger problem users take licensing seriously latter demonstrates difficulty understanding licensing extensive litigious licenses Reinforcing second scenario another developer noted problem “Just usual challenges talking potential commercial partners understand GPL all” phrasing comment usual challenges suggests developer repeated experience partners unable understand licensing necessarily isolated case rather potentially widespread experience shared developers Regarding support provided forge case GitHub investigated impact feature added help document license project—see Q6 Table IV feature added response criticism practitioners 24 365 developers access feature time created interesting result half 519 developers influenced availability tool Additionally “Other” responses indicated feature would impact choice 38 single developer specifically chose license leading combined 58 developers unaffected feature Thus data suggests GitHub feature affectinfluence developers licensing hosted GitHub Finally received 11 responses optional question Q7 concerning whether forges provide features assist licensing Since GitHub criticized practitioners 24 lack licensing consideration question seeks understand features practitioners expect forge end 10 11 participants answered “None” 10 developers one explained third party tool handle license compatibility analysis respondent indicated ideal tool would utilize various 
forges build frameworks dependency graph license compatibility stating following “This job 3rd party tool IMO since neither github forge open source deps 3rd party tool ideally would know github bitbucket etc poms pom license fields etc form comprehensive depgraph license compat view given node” Another developer noted “None perspective really isn’t hard put copyright licence notices source files” comment interesting since conflicts results Q4 developers indicated licenses sometimes missing incorrect license used developer wishing support forge indicated desire license compatibility checker license selection wizard developer commented desire two particular features stating following “1 License compatibility checker verify license license included support gems libraries includes alert user potential conflicts could also used use case want adopt piece add existing QuestionAnswer Q1 involved changes occurring parts system underwent license changes 138 543 Yes 75 543 63 457 Q2 involved determining license change license files 138 537 Yes 76 537 62 463 Q3 determinepick initial license files 52 77 Dependency constraint 4 77 Community influence eg contributing Apache projects 16 308 Requests noncontributors reuse code 0 0 Interest reuse commercial purposes 10 192 please specify 22 423 — Closedsource 1 19 — Companypolicy 2 38 — Dependencyconstraint 1 19 — Inheritlicense 3 58 — Moralethicalbelief 8 154 — ProjectSpecific 2 38 — Socialtrend 2 38 — None 3 58 Q4 motivated caused change license 52 58 License dependencies changed 3 58 Using new library imposing specific licensing constraints 3 58 Allow reuse commercial 17 327 Requests noncontributors reuse code 4 77 please specify 25 481 — Changetolicensetext 2 38 — Communityinfluence 4 77 — Fixincorrectlicenses 1 19 — Improveclarity 1 19 — Missinglicense License Adoption 4 77 — MoralEthicalbelief 3 58 — Morepermissivelicense 1 19 — Newlicenseversion 2 38 — PersonalPreferenceProjectspecific 1 19 — Privatetopublicproject 1 19 — 
PromoteReuse 1 19 — Unclear 1 19 — None 3 58 QuestionAnswer Q5 problems experienced due license selection terms code reuse 52 96 license compatible desired dependencies 5 96 Others unable use unless relicensed 9 173 dependency changed licenses ad longer compatible 1 19 misunderstanding compatibility licensing terms two licenses 3 58 Choosing correct license difficultconfusing 7 135 please specify 27 529 — Codeunavailability 1 19 — LackofundersandingbyUsers 2 38 — UniqueNewLicense 1 19 — problems 23 442 Q6 GitHub’s mechanism licensing impact decision licensing 52 58 Yes caused license 3 58 already planned licensing 27 519 want license creation 1 19 mechanism yet available created 19 365 please specify 2 38 — impact 2 38 Q7 kind support would expect forgeGitHub help managing licenses licensing compatibility issues 11 909 None 10 909 License Checker License Selection Wizard 1 91 compatible 2 License selection wizard begin wizard ask series questions want allow commercial use require mods licensed original etc suggest license project” one developer wanted support forge single developer’s comments seem address many problems difficulty respect licensing found evidence Q6 survey Summary RQ3 Survey Results although 442 developers surveyed indicated experienced problems licensing remaining respondents provided diverse set answers primarily related license incompatibility difficulty understanding licensing Lastly survey indicated GitHub’s mechanism encourage aid licensing necessary unavailable surveyed developers also found developers expect support forge one indicate desire thirdparty tool However one developer express interest forge’s support comments aligned results regarding problems developers actually faced V LESSONS IMPLICATIONS Intrinsic beliefs developers first important observation participants bias toward FOSS licensing ethical perspective 52 respondents indicated Q6 planned licensing prior creation 6 respondents Q6 influenced license due GitHub’s licensing 
feature ie combo list license names Similarly “Other” responses regarding reason project’s initial licensing Q3 indicated sense obligation example one developer said “It moral ethical choice” Delayed licensing Developers necessarily decide open source beginning delay empirically observed early license adoption general one developer wrote email waited choose license “this didn’t license day 1 added first release” Similarly one developer responded survey licensing changed due “change private public project” observation suggests licensing still important developers may considered relevant reaches certain level maturity Thus need tools add verify licensing information system given point time Community organizational influence results indicate communities particular FOSS foundations Apache Eclipse Free foundations exert powerful influence choice license developers 31 participants responded initial licensing done following community’s specific licensing guidelines Improving developing top existing foundation mostly requires using license aligning foundation’s philosophy License misunderstanding survey stresses need aid explaining licenses implications use 20 respondents highlighted licensing confusing andor hard understand Q5 135 respondents indicated developers—both authors users—find licensing confusing difficult Q5 6 developers also noted misunderstandings license compatibility Additionally one “Other” respondent stated “Users readunderstand license even though simple one” suggests developers experienced misunderstanding whether users Reuse commercial distribution results regarding licensing changes indicated commercial usage code concern open source community found practitioners used permissive licenses facilitate commercial distributions cases change permissive license purpose Dependency influence system must choose dependencies avoid conflicts due incompatibilities system’s licenses depending components’ licenses Similarly others choose use particular system based 
license Thus change license system potential creating chain reaction use might need change license drop dependency system changing license potential pool reusable components change accordingly—it might need drop different dependency might able add dependency previously incompatible license Forge’s support respondents expect licensing support forge likely individuals benefit licensing support forge looking reuse supported results indicate licenses dependencies important consideration since might impact ability reuse dependency require change licenses uses Thus complianceoriented features may aid developers ensure legally reuse Finally results demonstrate external factors like community license prevalence licenses dependencies important impact licensing feature provided forge support domain suggested licensing could benefit practitioners Since developers indicated licensing difficult informative feature could help practitioners determine appropriate licensing instance current licensing support feature provided GitHub feature particularly informative developers Basically provides link choosealicensecom provide guidance developer Also cover issues related compatibilities Moreover applications within domain may utilizing dependencies require similar grants redistribution reuse better support developers forge could include domain analysis feature detect similar applications 22 suggest developermaintainer license used similar systems criteria considered community dependencies VI Threats Validity Threats construct validity relate relationship theory observation mainly due imprecision extracting licensing results developer survey order identify licenses relied Ninka 16 empirically evaluated indicating precision 95 able identify license 85 time study showing precision order classify free responses conducted formal Grounded Theory analysis twoauthor agreement particular responses read categorized three authors agreement two considered necessary Another threat concerns fact 
possibly GitHub could mirrored fraction projects’ change history hence possible first commits GitHub may correspond first commits projects’ history Finally response rate study 575 response rate often achieved survey studies 18 ie 10 However explicitly targeting original developers usually challenging many may active email addresses invalid even impossible contact longer using email addresses collected Threats internal validity relate internal confounding factors would bias results study analyzing license introduction licensing changes considered commit observed phenomena instance ensure introduce duplicates excluded developers projects Android framework since always Apache licensed Therefore bias selecting developers address lack coverage original options survey added free form option “Other” question addition presented full survey developers indicated involved licensing decisions Another possible threat internal validity concerns fact possibly 138 respondents decided participate survey greater interest licensing problems others However results shown Section IV suggest case eg respondents comprise people directly involved licensing necessarily experience licensing problems Threats external validity relate ability generalize results study assert observations representative FOSS community randomly sampled projects GitHub Java projects Thus languages forges may demonstrate different behavior well developers projects may different beliefs However GitHub popular forge large number public repositories larger evaluation multiple forges projects languages necessary understand licenses adopted changed general case Additionally surveyed actual developers projects claim rationale complete conclusions represent explicit feedback opposed inferred understanding Therefore rationale definitive subset claim results apply context closed source systems since required source code identify licensing Finally limit threat external validity examined diversity data set using metrics 
proposed Nagappan et al 23 understand diversity matched projects dataset projects mined Boa 11 finding 1556 names matched two datasets used 1556 projects calculate diversity score across six dimensions results 045 programming language 099 developers 100 age 099 number committers 096 number revisions 099 number program languages suggesting dataset diverse excluding programming language score impacted selecting Java projects Overall score 035 suggests cover third FOSS projects 95 dataset VII Conclusions investigated reasons developers adopt change licenses evolution FOSS Java projects GitHub aim conducted survey developers contributed changes projects included licensing changes observed developers typically adopt license within first commits suggesting developers consider licensing important task Similarly observe licensing changes appear nonnegligible period development visible observed history explored reasons initial licensing license changes problems experienced developers respect licensing observed developers view licensing important yet nontrivial feature projects License implications compatibility always clear lead changes Additionally external factors influencing projects’ licensing community purpose usage ie commercial systems use thirdparty libraries developers strongly indicate expectation licensing support forge evident thirdparty tools features within forge would aid developers helping deal licensing decisions changes Acknowledgements would like thank open source developers took time participate survey Specifically would like acknowledge developers provided indepth answers responded followup questions work supported part NSF CAREER CCF1253837 grant Massimiliano Di Penta partially supported Markos funded European Commission Contract Number FP7317743 opinions findings conclusions expressed herein authors’ necessarily reflect sponsors REFERENCES 1 Apache License Version 20 current httpswwwapacheorglicenses Last accessed 20150323 2 GitHub API 
httpsdevelopergithubcomv3 Last accessed 20150115 3 Gradle httpsgradleorg 4 Open Source Definition httpopensourceorgosd 5 Qualtrics httpwwwqualtricscom 6 Apache Apache maven httpsmavenapacheorg 7 G Bavota Ciemniewska Chulani De Nigro Di Penta Galletti R Galoppini F Gordon P Kedziora Lener F Torelli R Pratola J Pukacki Rebahi G Villalonga market open source intelligent virtual open source marketplace 2014 Evolution Week IEEE Conference Maintenance Reengineering Reverse Engineering CSMRWCRE 2014 Antwerp Belgium February 36 2014 pages 399–402 2014 8 J Corbin Strauss Grounded theory research Procedures canons evaluative criteria Qualitative Sociology 1313–21 1990 9 Di Penta Germán G Antoniol Identifying licensing jar archives using codesearch approach Proceedings 7th International Working Conference Mining Repositories MSR 2010 Colocated ICSE Cape Town South Africa May 23 2010 Proceedings pages 151–160 2010 10 Di Penta Germán Guéhéneuc G Antoniol exploratory study evolution licensing Proceedings 32nd ACMIEEE International Conference Engineering Volume 1 ICSE 2010 Cape Town South Africa 18 May 2010 pages 145–154 2010 11 R Dyer H Nguyen H Rajan N Nguyen Boa language infrastructure analyzing ultralargescale repositories 35th International Conference Engineering ICSE ’13 San Francisco CA USA May 1826 2013 pages 422–431 2013 12 Germán Di Penta method open source license compliance java applications IEEE 29358–63 2012 13 Germán Di Penta J Davies Understanding auditing licensing open source distributions 18th IEEE International Conference Program Comprehension ICPC 2010 Braga Minho Portugal June 30July 2 2010 pages 84–93 2010 14 Germán Di Penta Guéhéneuc G Antoniol Code siblings Technical legal implications copying code applications Proceedings 6th International Working Conference Mining Repositories MSR 2009 Colocated ICSE Vancouver BC Canada May 1617 2009 Proceedings pages 81–90 2009 15 Germán E Hassan License integration patterns Addressing license mismatches componentbased 
development 31st International Conference Engineering ICSE 2009 May 1624 2009 Vancouver Canada Proceedings pages 188–198 2009 16 Germán Manabe K Inoue sentencematching method automatic license identification source code files ASE 2010 25th IEEEACM International Conference Automated Engineering Antwerp Belgium September 2024 2010 pages 437–446 2010 17 R Gobeille FOSSology Proceedings 2008 International Working Conference Mining Repositories MSR 2008 Colocated ICSE Leipzig Germany May 1011 2008 Proceedings pages 47–50 2008 18 R Groves Survey Methodology 2nd edition Wiley 2009 19 J Hartsock jquery jquery ui dual licensed plugins dual licensing closed httpstackoverflowcomquestions2758409jqueryjqueryuiandduallicensedpluginsduallicensing Last accessed 20150215 20 Manabe Hayase K Inoue Evolutional analysis licenses FOSS Proceedings Joint ERCIM Workshop Evolution EVOL International Workshop Principles Evolution IWPSE Antwerp Belgium September 2021 2010 pages 83–87 ACM 2010 21 Manabe Hayase K Inoue Evolutional analysis licenses FOSS Proceedings Joint ERCIM Workshop Evolution EVOL International Workshop Principles Evolution IWPSE Antwerp Belgium September 2021 2010 pages 83–87 2010 22 C McMillan Grechanik Poshyvanyk Detecting similar applications Proceedings 34th International Conference Engineering ICSE ’12 pages 364–374 Piscataway NJ USA 2012 IEEE Press 23 Nagappan Zimmermann C Bird Diversity engineering research Joint Meeting European Engineering Conference ACM SIGSOFT Symposium Foundations Engineering ESECFSE’13 Saint Petersburg Russian Federation August 1826 2013 pages 466–476 2013 24 Phipps Github needs take open source seriously httpwwwinfoworldcomdopensourcesoftwaregithubneedstakeopensourceseriously208046 25 P Singh C Phelps Networks social influence choice among competing innovations Insights open source licenses Information Systems Research 243539–560 2009 26 Sojer Alexy Kleinknecht J Henkel Understanding drivers unethical programming behavior inappropriate reuse 
internetaccessible code J Management Information Systems 313287–325 2014 27 Sojer J Henkel Code reuse open source development Quantitative evidence drivers impediments Journal Association Information Systems 1112868–901 2010 28 J Confusion dual license mitgpl javascript use website httpprogrammersstackexchangecomquestions139663confusionaboutduallicensemitgpljavascriptforuseonmywebsite Last accessed 20150215 29 Tuunanen J Koskinen Kärkkäinen Automated license analysis Autom Softw Eng 1634455–490 2009 30 C Vendome LinaresVásquez G Bavota Di Penta Germán Poshyvanyk License usage changes largescale study Java projects GitHub 23rd IEEE International Conference Program Comprehension ICPC 2015 Florence Italy May 1819 2015 IEEE 2015 31 Wu Manabe Kanda Germán K Inoue method detect license inconsistencies largescale open source projects 12th Working Conference Mining Repositories MSR 2015 Florence Italy May 1617 2015 IEEE 2015
::::
Sustainability Open Source communities beyond fork LibreOffice evolved Jonas Gamalielsson Björn Lundell University Skövde PO Box 408 SE541 28 Skövde Sweden ARTICLE INFO Article history Received 19 October 2012 Received revised form 7 November 2013 Accepted 8 November 2013 Available online 21 November 2013 Keywords Open Source Fork Community evolution ABSTRACT Many organisations dependent upon longterm sustainable systems associated communities paper consider longterm sustainability Open Source communities Open Source projects involving fork currently lack studies literature address specific Open Source communities affected fork report study aiming investigate developer community around LibreOffice fork OpenOfficeorg analysis also covers OpenOfficeorg related Apache OpenOffice results strongly suggest longterm sustainable LibreOffice community signs stagnation LibreOffice 33 months fork analysis provides details developer communities LibreOffice Apache OpenOffice projects specifically concerning evolved OpenOfficeorg community respect activity developer commitment retention committers time present results analysis first hand experiences contributors LibreOffice community Findings analysis show Open Source communities outlive Open Source projects LibreOffice perceived community supportive diversified independent study contributes new insights concerning challenges related longterm sustainability Open Source communities © 2013 Authors Published Elsevier Inc Open access CC license Introduction Many organisations requirements longterm sustainable systems associated digital assets Open Source OSS identified strategy implementing longterm sustainable systems Blondelle et al 2012a Lundell et al 2011 Müller 2008 OSS sustainability communities fundamental longterm success study consider longterm sustainability communities OSS projects involving fork overarching goal establish rich insights concerning LibreOffice associated communities evolved LibreOffice associated 
communities evolved specifically report commitment LibreOffice retention committers insights experiences participants LibreOffice community Overall study revealed several key findings First LibreOffice forked OpenOfficeorg shows sign longterm decline Second LibreOffice attracted longterm active committers OpenOfficeorg Third analysis shows Open Source communities outlive Open Source projects Fourth LibreOffice perceived community supportive diversified independent issue forking OSS projects ongoing issue debate amongst practitioners researchers claimed “Indeed cardinal sin OSS forking whereby divided two streams evolving product different direction strong community norm acts developer turnover projects” Agerfalk Fitzgerald 2008 claimed forks successful Ven Mannaert 2008 Therefore perhaps surprising see claims “there must strong reason developers consider switching competing project” Wheeler 2007 However also argued “forking capability serving invisible hand sustainability helps open source projects survive extreme events commercial acquisitions well ensures users developers necessary tools enable change rather decay” Nyman et al 2012 Similarly Brian Behlendorf cofounder Apache Foundation states “right fork means don’t tolerance dictators don’t deal people make bad technical decisions – put future hands find group people agree create new around it” Severance 2012 Another argument code forking positively impact governance sustainability OSS projects levels community business ecosystem Nyman Lindman 2013 clearly need increased knowledge OSS communities affected fork two specific objectives first objective characterise community evolution time LibreOffice related OpenOfficeorg Apache OpenOffice projects second objective report insights experiences participants community branched LibreOffice order explain evolved fork base OpenOfficeorg paper makes four novel contributions First establish characterisation LibreOffice related OpenOfficeorg Apache OpenOffice projects 
respect history governance activity Second present findings regarding developer commitment projects different governance regimes Third present findings regarding retention committers projects different governance regimes Fourth report rich insights experiences participants LibreOffice view characterise community way working addition demonstrate approaches involving metrics analysing longterm sustainability communities without forks OSS projects illustrate use different OSS projects five reasons motivate study LibreOffice Firstly LibreOffice one OSS projects active community 10 years including development OpenOfficeorg significant commercial interest Secondly tensions within OpenOfficeorg finally led creation Document Foundation LibreOffice Byfield 2010 Documentfoundation 2013a Thirdly reached certain quality adopted professional use variety private public sector organisations Lundell 2011 Lundell Gamalielsson 2011 Therefore community likely attract certain level attention organisations individuals Fourthly previous studies base OpenOfficeorg Ven et al 2007 recent studies LibreOffice Gamalielsson Lundell 2011 show widespread deployment many organisations number countries turn imposes significant challenges geographically distributed user community Fifthly previous results Gamalielsson Lundell 2011 2012 anecdotal evidence official spokesperson LibreOffice Nouws 2011 suggest significant activity LibreOffice community motivates indepth investigation LibreOffice evolved Hence need extend previous studies LibreOffice include investigation LibreOffice forked OpenOfficeorg also alternative branches Apache OpenOffice investigation OpenOfficeorg interesting since widely deployed natural source recruitment LibreOffice Similarly Apache OpenOffice also interesting investigate since succeeded OpenOfficeorg Oracle abandoned investigation Apache OpenOffice enables comprehensive study community dynamics since OpenOfficeorg potential source recruitment Apache OpenOffice well rest 
paper position exploration sustainability OSS communities broader context previous research OSS communities Section 2 clarify research approach Section 3 report results Sections 4 5 Thereafter analyse results Section 6 followed discussion conclusions Section 7 sustainable Open Source communities Many companies need preserve systems associated digital assets 30 years Lundell et al 2011 industrial sectors avionics even 70 years Blondelle et al 2012b Robert 2006 usage scenarios “there problems commercial vendor adopted proprietary leaves market” increased risks longterm availability digital assets Lundell et al 2011 Similarly organisations public sector many systems digital assets need maintained several decades causes organisations vary concerning different types lockin inability provide longterm maintenance critical systems digital assets Lundell 2011 reason sustainability communities identified essential longterm sustainability OSS many different aspects OSS affect community sustainability Good management practice includes consider different incentives contributing OSS communities turn may affect future sustainability communities Bonaccorsi Rossi 2006 Previous research shown number different kinds motivations individuals firms impact decision concerning participation OSS projects motivations sometimes categorised economic social technological types incentives Bonaccorsi Rossi 2006 Earlier research also suggests effective structure governance basis healthy sustainable OSS communities de Laat 2007 particular aspects clear leadership congruence terms goals good team spirit fundamental importance Moreover community manager OSS plays key role achieving effective structure governance Michlmayr 2009 licensing OSS may affect community claimed “fair licensing contributions adds strong sense confidence security community” Bacon 2009 also claimed choice OSS license type “can positively negatively influence growth community” Engelfriet 2010 successfully master art establishing 
longterm sustainable OSS community huge challenge organisations “times every community repetition housekeeping conflict play role otherwise enjoyable merrygoround community begins see bureaucracy repetition useful enjoyable contributions something wrong” Bacon 2009 fork often consequence inadequate OSS governance claimed forks “are generally started number developers agree general direction heading” Ven Mannaert 2008 particular conflicts within communities arise due inadequate working processes lack congruence concerning goals unclear ways inadequate leadership different views considered OSS fork claimed order considered fork Robles GonzalezBarahona 2012 1 new name 2 branch original OSS 3 infrastructure separated infrastructure original eg web site mailing listsforums SCM Configuration Management system 4 new developer community disjoint community original 5 different structure governance also related concepts similar OSS forking Robles GonzalezBarahona 2012 cloning involves design system mimics another system branching source code duplicated within SCM creating parallel threads development derivation involves creation new system based existing system compatible existing system modding existing enhanced typically enthusiasts providing patches extensions existing different possible outcomes fork attempt Four different categories identified Wheeler 2007 1 forked dies eg libcglibc 2 forked remerges original eg gccegcs 3 original dies eg XFree86Xorg 4 successful branching original forked succeeds typically separate communities possible fifth outcome original forked dies Robles GonzalezBarahona 2012 Governance fundamental importance sustainability evolution OSS associated communities Three different phases governance identified de Laat 2007 1 “spontaneous” governance 2 internal governance 3 governance towards outside parties first phase governance concerns situation community including volunteer potentially commercial actors selfdirecting without formal explicit control 
coordination Given licensing framework control coordination emerge stem degree contribution individual members High performing members community may become informal leaders second phase often adopted larger projects existed longer time involves formal explicit control coordination order support effective governance Different tools used including modularisation assignment roles contributors delegation decision making training indoctrination formalised infrastructure support contributors leadership style autocracydemocracy third phase governance became necessary due increased external interest OSS projects national international organisations private public sector increased institutionalisation OSS led increased risk litigation due patent infringements solution initiatives taken create legal shells around OSS projects protect lawsuits One way implementing establishing nonprofit foundations Linux Foundation Mozilla Foundation governance OSS projects context OSS projects shown “little research conducted social processes related conflict management team maintenance” Crowston et al 2012 several open questions related “How team maintenance created sustained time” Crowston et al 2012 study also motivated fact lack research presenting rich insights large widely deployed OSS projects particular need increased knowledge related community involvement projects involving fork also note different seemingly conflicting views amongst practitioners concerning effect fork involved projects associated communities motivates study remainder section position study respect earlier research studies focusing forks OSS context However none studies focus community involvement time investigate specific OSS projects indepth One studies focused motivations forking SourceForgenet hosted OSS projects Nyman Mikkonen 2011 Another study surveyed large number OSS forks specific focus temporal evolution forks reasons forking outcomes forks Robles GonzalezBarahona 2012 similar limited study focused 
motivations impact fork mechanism OSS projects Visser 2012 Another study focus code maintenance issues forked projects BSD family operating systems Ray Kim 2012 studies evolution OSS projects time studies always community focus always targeted specific projects Examples include study total growth rate OSS projects Deshpande Riehle 2008 work evolution social interactions large number projects SourceForgenet time Madey et al 2004 Another example study survival analysis OSS projects involving application different metrics based duration thousands projects FLOSSMETRICS database Samoladas et al 2010 also studies focus evolution time specific OSS projects consider community aspect example study Linux kernel based Lehman’s laws evolution involved application code oriented metrics time Israeli Feitelson 2010 similar approach used case study evolution Eclipse Mens et al 2008 growth FreeBSD Linux studied compared earlier results code evolution Izurieta Bieman 2006 Another study topic evolution proposes model Linux kernel life cycle Feitelson 2012 somewhat different strand research involves development application different kinds statistical measures estimation prediction survivability Raja Tretter 2012 Wang 2012 success Crowston et al 2003 2006 Lee et al 2009 Midha Palvia 2012 Sen et al 2012 Subramaniam et al 2009 Wiggins et al 2009 Wiggins Crowston 2010 attractiveness Santos et al 2013 OSS projects measures may consider factors related Wang 2012 developer characteristics eg user developer effort service quality leadership adherence OSS ideology characteristics eg license terms targeted users modularity quality community attributes eg organisational sponsorship financial support trust social network ties However forks usually explicitly addressed research focus overall survivability success OSS projects rather focusing behaviour communities associated projects research typically use large selection projects different OSS forges statistical validation measures whereas study 
provides indepth analysis interrelated OSS projects employing quantitative qualitative approach studies focus evolution communities specific OSS projects address effects fork example case studies conducted Debian involving quantitative investigations evolution maintainership volunteer contributions time Robles et al 2005 Michlmayr et al 2007 Another study involved investigation developer community interaction time Apache web server Gnome KDE using social network analysis LopezFernandez et al 2006 similar study involved projects Evolution Mono MartinezRomo et al 2008 Case studies Nagios Gamalielsson et al 2010 TopCased Papyrus projects Gamalielsson et al 2011 addressed community sustainability evolution time special focus organisational influence research partially focusing community evolution early case studies large wellknown OSS projects including Linux kernel Moon Sproul 2000 Gnome German 2003 Apache web server Mockus et al 2002 Mozilla Mockus et al 2002 FreeBSD DinhTrong Bieman 2005 earlier reported indepth studies three projects LibreOffice OpenOfficeorg Apache OpenOffice focus evolution OSS communities time except earlier studies LibreOffice Gamalielsson Lundell 2011 2012 study process participation OSS communities Shibuya Tamai 2009 compare communities Writer tool OpenOfficeorg MySQL server MySQL GTK GNOME done using different kinds documentation quantitative data bug tracking systems source code repositories However limited study partially covers OpenOfficeorg another study also community focus open user experience design perspective rather community evolution perspective Bach Carroll 2010 studies OpenOfficeorg without community focus One study focused code evolution Rossi et al 2009 Specifically study explored relation code activities bug fixing activities release dates five projects including OpenOfficeorg another study maintenance process OpenOfficeorg analysed using defect management version management systems Koponen et al 2006 also studies focusing 
issues related migration adoption deployment OpenOfficeorg Huysmans et al 2008 Rossi et al 2006 Ven et al 2010 Seydel 2009 Research approach address first objective characterise community evolution time LibreOffice related OpenOfficeorg Apache OpenOffice projects undertook analysis LibreOffice related OpenOfficeorg Apache OpenOffice projects done review documented information quantitative analysis repository data order investigate sustainability OSS communities included analysis different phases different governance regimes OpenOfficeorg encompassed time period governance Sun Microsystems Oracle rest paper refer three projects OO OpenOfficeorg LO LibreOffice AOO Apache OpenOffice OO governance Sun Microsystems hereafter referred SOO OO governance Oracle hereafter referred OOO contextualise insights LibreOffice undertook analysis data number different sources First established characterisation three projects LO OO AOO undertaking analysis history governance projects release history commits SCM contributing committers time Second investigate developer commitment projects used different metrics consider extent committers involved contributed different projects different governance regimes Third investigate retention committers projects different governance regimes used different metrics consider recruitment committers time retirement committers time distribution commits committers contributing different combinations projects temporal commitment patterns projects committers quantitative analysis adopt extend approaches earlier studies Gamalielsson et al 2011 Gamalielsson Lundell 2011 2012 done order analyse contributions terms committed SCM artefacts OSS projects time SCM data collected official repositories LO AOO OO website recommended AOO website keeps legacy source code data LO collected LO website1 Git subrepositories “core” “binfilter” “dictionaries” “translations” “help” used analysis choice subrepositories done personal dialogue key LO contributors OO data 
collected archive website2 Mercurial repository used analysis Data AOO collected AOO website3 SVN repository used analysis Data 31 May 2013 used LO AOO data end OO April 2011 used Logs projects extracted repositories thereafter analysed using custom made scripts semiautomated approach involving manual inspection used associate commit id aliases actual committer address second objective report insights experiences participants community branched LibreOffice order explain evolved fork base OpenOfficeorg undertook case study LO order investigate experiences participants view gain insights effects fork led establishment LO order analyse insights experiences concerning participation LO two researchers conducted interviews active participants LO community goal specifically identify incentives motivations creation LO strategy identifying potential interviewees include key informants key roles interviewees long experience addition also sought include interviewees less experience joined fork strategy include additional perspectives Interviewees selected basis actively involved LO Data collection based results facetoface interviews conducted English Interviews recorded transcribed vetted interviewee Questions prepared advance shown interviewee conduction interview interview conducted informal setting allowed interviewee extensively elaborate issues covered interview total 12 interviews conducted ranging time 8 43 min resulting 67 pages transcribed vetted interview data4 process interviewee allowed elaborate clarify responses Analysis transcribed interview data took place extended timeperiod allow time reflection Individual analysis supplemented group sessions researchers discussed reflected interpretations researcher coding interview data conducted manner follows Glaser’s ideas open coding Lings Lundell 2005 unit coding sentences paragraphs within interview notes focus constant comparison indicator indicator indicator emerging concepts categories Lings Lundell 2005 goal 
analysis develop refine abstract concepts grounded data field interpreted via collected data transcriptions coding process resulted set categories presented subsection Section 5 paper 1 httpwwwlibreofficeorgdevelopers2 accessed 18 June 2013 2 httphgservicesopenofficeorgDEV300 accessed 18 June 2013 3 httpincubatorapacheorgopenofficeorgsourcehtml accessed 18 June 2013 4 interviews conducted February 2012 4 Community evolution time section report results related first objective Table 1 presents main results observations concerning community evolution time reported following sections 41 Characterisation projects section present overarching characterisation three projects provide historical overview describe governance report activity 411 Organisations overview projects OO established OSS 13 October 2000 Openoffice 2004 initial release 1 October 2001 first stable version v10 released 30 April 2002 Openoffice 2002 Initial development begun within StarDivision German based company acquired Sun Microsystems mid1999 Crn 1999 establishing OO development provision code base closed source OO governed community council comprised OO community members also created charter establishment council Openoffice 2013 Sun contributor agreement needed signed developers wishing contribute whereby contributions jointly owned developer Sun corporation Oracle corporation acquired Sun thereby also OO 27 January 2010 Oracle 2010 Oracle also used contributor agreement almost identical Sun contributor agreement needed signed developers wishing contribute Oracle stopped support commercial OpenOfficeorg 15 April 2011 Marketwire 2011a LO LGPLlicensed Open Source office productivity tool creation editing digital artefacts Open Document Format ODF native file format Document Foundation TDF established 28 September 2010 Linuxuser 2010 German jurisdiction first beta release LO provided date Pclosmag 2011 TDF mission facilitate evolution LO fork OO since date establishing TDF Documentfoundation 2013a TDF 
independent meritocratic selfgoverning notforprofit foundation evolved OO community formally established members OO community September 2010 supported large number small larger organisations steering committee currently consisting eight members excluding six deputy members also four founding members four official spokespersons TDF open individuals willing contribute activities also agree core values foundation Organisational participation also encouraged example supporting individuals financially work contribute community TDF commits give “everyone access office productivity tools free charge enable participate full citizens 21st century” Documentfoundation 2013b TDF supports preservation mother tongues encouraging translation documentation promotion TDF facilitated office productivity tools languages individual contributors Moreover TDF commits allow users create maintain digital artefacts open document formats based open standards addition TDF openly seeks voluntary financial contributions donations via web site individuals organisations want support evolution LO TDF Besides strong support volunteer contributors LO also receiving support commercial companies including RedHat Novell Canonical Documentfoundation 2013c Oracle donated OO Apache Foundation ASF 1 June 2011 Marketwire 2011b thereafter established incubating ASF 13 June 2011 undergoing proposal voting process Apache 2013a new connection given name Apache OpenOffice AOO licensed APL v2 comprises six office productivity applications first stable release AOO v34 provided 8 May 2012 Openoffice 2012 Apache OpenOffice became toplevel Apache 17 October 2012 Apache 2013a ASF established 1 June 1999 US Jurisdiction Apache 1999 mission ASF establish projects delivering freely available enterprisegrade products interest large user communities Apache 2013b Apart AOO ASF maintains wellknown projects HTTP Server Struts Subversion Tomcat Like TDF ASF independent meritocratic selfgoverning notforprofit organisation 
governed community members collaborate within ASF projects since 1999 ASF board directors annually elected members ASF manages internal organisational affairs foundation according ASF bylaws board consists nine individuals turn appoints set officers whose task take care daily operation foundation decision making individual ASF projects regarding content direction delegated board directors called management committees committees govern one several communities Individuals unaffiliated working companies willing capable contributing ASF projects welcome participate ASF accepts donations sponsorship program individuals organisations willing contribute financially also note IBM active supporter contributor AOO IBM 2011 Finally note long establishment AOO researchers indicated leadership control OO Sun governance “is remarkably similar Apache” Conlon 2007 Fig 1 summarises evolution projects OO LO AOO time includes selected major events related Moreover illustrates OO black upper bar LO dark grey middle bar AOO light grey lower bar interrelated overlap time 412 activity version history OO LO AOO shown Table 2 observed continuous flow new OO releases 10 years 25 January 2011 Document Foundation TDF announced first stable version LO constitutes fork OO Documentfoundation 2013a TDF thereafter regularly provided new releases LO first stable version AOO announced 8 May 2012 replaced discontinued OO developer activity OO LO AOO presented Fig 2 shows number commits month September 2000 May 2013 note activity OO varies distinct peaks connection OO 20 September 2005 OO 24 March 2008 releases also observed activity level decreased dramatically around August 2008 release OO version 30 contributing reason significant drop activity may major changes terms features implemented version 3 subsequent activity focused bug fixing also observe activity LO AOO varies time peaks less distinct observed OO Fig 3 illustrates number active committers month projects observed large number committers 
active early OO activity decreases considerably shortly release first stable version OO version 10 May 2002 number committers increases higher level release OO 31 May 2009 note discord number monthly commits committers OO interval January 2003 January 2009 relatively monthly committers contributing large number monthly commits may explained fact number first second level releases interval often cooccur elevated level commits committers often provide majority commits OSS projects see Section 42 details concerning commitment projects LO noted committer participation peaks significantly October 2010 subsequent months connection fork OO LO participation also peaks connection release version 40 February 2013 also observed rise committer participation AOO September 2012
::::
42 Commitment projects section report commitment projects terms SCM contributions Fig 4 provides overview commitment projects figure illustrates number committers contributed seven possible mutually exclusive combinations three projects area combination reflects number committers colour combination represents average number commits per committer projects combination Totally 795 unique code contributors active least one three projects sum committers areas main observation Fig 4 67 contributors committed OO LO provided overwhelming majority commits 4339 commits per committer committers constitute backbone developer communities OO LO 8 contributors three projects provided substantial amount commits 1329 commits per committer Contributors combinations limited impact respect number commits 127 commits per committer less Table 3 provides detailed picture commitment separate projects combinations illustrated Fig 4 table shows proportion committers contributed seven possible combinations three projects table also shows brackets number commits committers different combinations contribute different projects observed 67 contributors committed OO LO provided majority commits OO 92 LO 564 Table 2 Version history OpenOfficeorg OO LibreOffice LO Apache OpenOffice AOO OO LO AOO Date YYYYMMDD OO initial 20011001 OO 10 20020430 OO 11 20030902 OO 20 20051020 OO 21 20061212 OO 22 20070328 OO 23 20070917 OO 24 20080327 OO 30 20081013 OO 31 20090507 OO 32 20100211 LO 33 B1 20100928 LO 33 20110125 LO 34 20110412 LO 35 20110603 LO 36 20120214 LO 37 20120508 LO 38 20120812 LO 40 20130214 AOO 34 20130723 AOO 40 20130725 Fig 2 Number monthly commits OpenOfficeorg black LibreOffice dark grey Apache OpenOffice light grey projects Fig 3 Number monthly committers OpenOfficeorg black LibreOffice dark grey Apache OpenOffice light grey projects Table 3 Proportion commits committers contributing different combinations projects number commits brackets combination LO prop AOO prop OO prop LO 372 
23846 – – AOO – 264 939 – OO – – 61 16867 LO AOO 03 170 321 1140 – LO OO 564 36152 – 920 254745 AOO OO – 02 8 01 1 LO AOO OO 61 3914 413 1466 19 5121 committers participating OO provided 61 commits contrast situation committers contributed either LO AOO two cases contributions constitute 372 264 commits respectively note 17 committers contributed LO AOO OO contributed significantly AOO 321 little LO 03 may also considered surprising one AOO committers participated OO AOO LO perhaps also unexpected committers contributing three projects behind 413 commits AOO earlier mentioned commits contributed projects different governance regimes different lengths SOO 112 months OOO 16 months LO 33 months AOO 24 months 209 committers OO 197 committers active Sun governance OO contributed 267011 commits 81 committers contributed 9723 commits Oracle governance OO Fig 5 illustrates proportion commits function proportion committers SOO solid black trace OOO dashed black trace LO dark grey trace AOO light grey trace example noted SOO LO 10 committers 19 64 respectively contribute 905 241645 888 56905 commits proportion committers OOO AOO 8 4 committers respectively contribute 416 4045 541 1922 commits respectively Hence SOO LO relatively small proportion committers contribute majority commits whereas larger proportion committers OOO AOO contribute majority commits also mentioned large proportion committers contribute commits 5 commits less made 213 SOO committers 123 OOO committers 543 LO committers 349 AOO committers Table 4 based data illustrated Fig 5 shows proportion commits different proportions committers SOO OOO LO AOO Similarly Table 5 shows proportion commits top N committers projects different values N example 5 active LO committers contribute 78 LO commits observed proportion commits LO Table 5 significantly smaller compared proportion commits LO Table 4 due fact many committers 645 LO top 5 committers therefore much fewer top 5 committers AOO way around top 5 committers 
much greater proportion committers top 5 therefore proportion commits AOO greater Table 5
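The contrast drawn above between Table 4 (a fixed proportion of committers) and Table 5 (a fixed number of committers) can be illustrated with a minimal sketch. The function names and the commit counts below are hypothetical and chosen only for illustration; they are not the study's data or tooling. The sketch only shows why a large project can have a higher "top 5 percent" share than "top 5" share, while a small project shows the opposite.

```python
from math import ceil


def share_top_n(commit_counts, n):
    """Fraction of all commits made by the n most active committers."""
    ranked = sorted(commit_counts, reverse=True)
    total = sum(ranked)
    return sum(ranked[:n]) / total if total else 0.0


def share_top_percent(commit_counts, percent):
    """Fraction of all commits made by the top `percent` percent of committers."""
    # Round up so that at least one committer is always included.
    n = max(1, ceil(len(commit_counts) * percent / 100))
    return share_top_n(commit_counts, n)


# Hypothetical commit counts per committer (illustrative only, not study data).
large_project = [400, 300, 200, 100, 80, 60, 50, 40, 30, 20] + [1] * 190  # 200 committers
small_project = [50, 30, 10, 5, 3] + [1] * 15                             # 20 committers

for name, counts in [("large", large_project), ("small", small_project)]:
    print(name,
          "top 5 committers:", round(share_top_n(counts, 5), 3),
          "top 5 percent:", round(share_top_percent(counts, 5), 3))
```

In the large hypothetical project the top 5 percent covers ten committers, so its share exceeds that of the top five; in the small one the top 5 percent is a single committer, so the relation reverses.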
::::
43 Retention committers section report retention committers different projects Fig 6 shows recruitment committers retirement committers current number active committers projects month Recruitment represented accumulated number committers made first commit solid black trace Retirement represented accumulated number committers made last commit dashed black trace current number active committers represented difference number recruited retired committers grey trace observed LO far highest recruitment rate approximately 20 new committers month average time LO suffers high retirement rate perhaps surprising since earlier mentioned half LO committers provided 5 commits less However cannot observe long term trend towards decreased number active committers roughly 100 150 currently active committers since start LO SOO high recruitment rate first two years considerably lower recruitment rate rest except last months approximately 75 currently active committers average first two years SOO stabilised around 50 currently active committers second half Noticeable OOO recruitment slow except first months retirement rate OOO comparably high especially later part led dramatic drop currently active committers 10th month onwards AOO Table 4 Proportion commits different proportions committers SOO OOO LO AOO Prop committers SOO OOO LO AOO Top 5 86 27 78 33 Top 15 93 52 93 69 Top 20 95 62 95 80 Table 5 Proportion commits different numbers committers SOO OOO LO AOO Number committers SOO OOO LO AOO Top 5 79 31 33 60 Top 15 89 59 58 93 Top 20 91 69 66 96 positive trend terms number active committers first 16 months due high recruitment rate low retirement rate However AOO lately experienced stagnation recruitment increasing rate retirement resulted halving number active committers AOO second year acknowledge total number months differ projects SOO 112 months OOO 16 months LO 33 months AOO 24 months distribution commits among committers explored following order better explain commitment 
different projects committer level Fig 7 provides details regarding distribution commits LO dark grey bar colour OO black bar colour 67 committers contributing LO OO Committers sorted sum commits two projects descending order stated earlier connection Table 2 black area represents 92 commits OO dark grey area represents 564 commits LO However LO commits comprise 124 commits Fig 7 OO commits comprise 876 level individual committers observed one projects often hugely dominates example top committer Fig 7 contributes 89931 commits OO two commits LO fact top six committers contribute 04 commits LO Similarly Fig 8 provides details regarding distribution commits 17 committers contributing LO dark grey bar colour AOO light grey bar colour light grey area represents 321 commits AOO dark grey area represents 01 commits LO Given proportions surprising contribution different projects unbalanced LO commits comprise 13 commits Fig 8 AOO commits comprise 87 unbalance clearly visible level individual committers Fig 8 example committers 3 4 8 contribute small proportion commits LO committer 10 contributes larger proportion commits LO Fig 9 provides details regarding distribution commits LO dark grey bar colour AOO light grey bar colour OO black bar colour 8 committers contributing three projects black area represents 19 commits OO light grey area represents 413 commits AOO dark grey area represents 61 commits LO Like Figs 7 8 contribution different projects somewhat unbalanced AOO commits comprise 14 commits Fig 9 LO OO commits comprise 373 488 respectively example unbalance individual committers top committer contributes 2261 commits LO 69 AOO One aspect contribute unbalance Figs 7–9 fact projects different life spans accumulated different total amounts commits example 77 times commits OO compared AOO Table 6 Major commitment patterns committers contributed LO Pattern ID Commitment pattern Commits Committers LP1 33642 525 58 90 LP2 23846 372 553 857 LP3 3052 48 2 03 LP4 2385 37 5 
08 Tables 6 7 illustrate major temporal commitment patterns projects OO black colour LO dark grey colour AOO light grey colour committers contributed LO Table 6 AOO Table 7 total 13 commitment patterns identified 645 LO committers four significant patterns LP1 LP4 shown Table 6 four patterns account 982 LO commits 958 LO committers Similarly Table 7 shows four significant patterns AP1 AP4 total 10 identified patterns 43 AOO committers four patterns account 934 AOO commits 744 AOO committers committer assigned one distinct pattern comparing dates first latest commit projects committer active example committer assigned LP1 commitment OO LO sequential committer contributed AOO means LP1 latest commit OO precedes first commit LO Another example LP4 involvement OO overlaps involvement LO Hence LP4 latest commit OO first commit LO committer active AOO connection commitment pattern Tables 6 7 show number proportion commits committers tables sorted number commits assigned specific pattern descending order Table 6 evident pattern accounting largest amount LO commits 525 LP1 committers contributed OO LO sequence AOO also commitment patterns committers involved OO LO LP4 two patterns among four significant together account 39 commits second significant pattern terms commits 372 LP2 committers contributed LO pattern applies clear majority 857 committers patterns LP1 LP2 clearly dominating together involve 897 commits 947 committers also pointed committers involved OO involvement LO LP1 contribute greater proportion commits compared contributed LO LP2 Table 7 observed pattern accounting largest amount AOO commits 292 AP1 committers contributed LO within period contributed AOO second significant pattern terms commits 264 AP2 committers contributed AOO comparing LO patterns find diversified set commitment patterns account significant amounts commits AOO note significant proportion AOO commits 413 stem committers previous cases current experience OO LO AP3 AP4 another pattern 
shown Table 7 sum concerning recruitment LO 553 645 committers LO constituting 857 active OO AOO therefore directly recruited LO 75 645 committers LO also contributed OO 75 committers 66 committers contributed OO started contribute LO thereafter contributed OO therefore claimed recruited OO LO 66 committers influential together provided majority LO commits 587 remaining 9 75 committers active LO OO parallel 25 645 committers LO also contributed AOO committers contributed 22 LO commits AOO 17 43 committers constituting 395 active OO LO therefore directly recruited AOO 8 43 committers AOO also contributed OO started contribute AOO contributed LO AOO involvement therefore claimed recruited OO LO 8 committers together contribute significant amount 413 AOO commits also note 17 committers contributed AOO LO mostly contributed two projects parallel contributed considerable amount AOO commits 321 Insights experiences LibreOffice community section reports results related second objective Table 8 shows main themes investigation associated main results observations concerning insights experiences LibreOffice community interviewees active participants LO several expressed active start interviewees include participants active formation TDF several central roles related LO even though interviewees also include contributors less experience participation would appropriate characterise sample interviewees dominated experts thereby consider conduction research interviews dominated elite interviews Six broad categories emerged coding analysis interview transcriptions presented separate section subheading aimed characterise category 51 Considerations creation LibreOffice time members OO community started perceive frustration discontent due number circumstances OO Concerns amongst community members include perceptions vendor dominance copyright assignment lack influence lack fun bureaucracy example expressed community member “I started OpenOffice fun beginning year able see behind 
didn’t like saw” Similarly another community member expressed view “it stopped fun stopped Open Source Oracle” different respondent particularly emphasised bureaucracy OO inhibitor contributing “In past tried get involved OpenOffice submitting patches hell job bureaucracy that’s didn’t follow quit it” Overall essence circumstances seems originate lack trust idea starting new branch OO evolved amongst community members course events brought many thoughts among members community illustrated comment raised one person involved creation LO “When whole story Oracle started look bit fishy meet people start talking start thinking start planning” clear number issues considered taking action illustrated another person involved “Before started lot discussions Shall start start start people get involved soon possible bit later whatever” different issues considered time take action expressed different respondent “we founded LibreOffice got people together agree know got initial structure set up” choice copyleft license7 mentioned important prerequisite several contributors LO Hence seems consensus amongst contributors LO permissive licenses used expressed one respondent “to licensing key copyleft license weak copyleft license pretty much mandatory interested otherwise know it’s gonna go pretty soon writing proprietary software” 7 OSS licenses often broadly categorised either copyleft licenses eg GPL strong copyleft license LGPL weak copyleft license permissive licenses eg BSD MIT main difference two license categories copyleft licenses ensure derivative work remains open source whereas permissive licenses Brock 2013 importance avoiding permissive licensing emphasised another respondent “the permissive license would lose half volunteer developers real volunteers fun don’t want give away work corporation” respondent also acknowledged contributing companies understand act accordance fundamental values Open Source movement contributors accept “They easily give away work companies 
like Suse Redhat Canonical contribute transparent way behave project” one respondent pointed apart upsetting community switching copyleft permissive license would require time consuming IPclearance process process would require rewriting code license potentially stall actual development new features essence interviewees involved process establishing LO seem considered establishment LO independent foundation TDF use weak copyleft license inevitable action take given perceived dissatisfaction amongst community members OO 52 Perception LibreOffice Immediate reaction requested seeking respondents associated LO rather probing description definition occasions caused respondents hesitate replying Perhaps surprisingly contributors extensive experience hesitant responding question one even commented “It’s hard question factual question cannot use mind” Overall contributors gave variety ideological emotional responses “freedom” “something believe in” “It’s project” “a group friends” put one contributor “LibreOffice contributed shape also lot emotional participation” Similarly respondents expressed “It deep meaning guess done lot work there” “It’s working yeah” concept also triggered number expressions excitement illustrated following comments “Exciting fun hack on” “It’s positive hear people talk LibreOffice” “It’s cool it’s home it’s something exciting” respondents also associated concept personal commitment example expressed one interviewee “It’s group friends people work would say” addition concept also gave rise number rational associations expressions relate quality system “The best office suite world” “LibreOffice interesting exciting huge amount work good people work work manage do” Yet others relate development model used “Community developed office suite” whereas others related developed system “Open Source office package” Finally respondents seemed flattered probed association concerning concept LO responding jokingly “I recognise name” 53 Participation LibreOffice 
extent contributions participants LO related professional activities vary amongst respondents note contributions stem volunteer paidfor activities responses revealed contributors employed several different organisations including selfemployed specialists Several respondents expressed working LO part professional activities illustrated following responses “I working LibreOffice professional activities” “I paid working LibreOffice” “It’s full time job” respondents also expressed incentives participation motivated technical need professional activities illustrated one respondent “I wanted use replace Microsoft Access day job” several contributors significant congruence professional activities contributions volunteers example one respondents expressed “there huge overlap professional activities” another “it professional activity … it’s job parts job … stuff free time well” also expressing working LO symbiosis professional job even though directly part “it related harmony basically” Yet others expressed incentives participations motivated business opportunities “I small company country X kind services support LibreOffice old OpenOffice makes logic contribute it’s logic combination” also contributors participating primarily volunteer activities words one respondent “we use LibreOffice company work mostly activities LibreOffice mostly hobby” Amongst respondents also identified professional volunteer activities seem merge “For it’s like hobby turned occupation it’s hard draw line privately employee mostly matches interest company would personally do” 54 Motivations contributing LibreOffice Several interviewees found difficult single specific issues motivate contribute example put one contributor “That’s hard question isn’t … think everyone mixed bundle sorts motivations” Another respondent expressed “There many answers It’s kind hard” Respondents expressed number different types motivations contributing LO Several comments emotional nature “because fun rewarding” “it’s fun 
contribute contribute gets ahead it’s even fun” “I want something seems useful people significant think it’s joy relationship working people seeing good things happen” emotional comments emphasised motivations contributing future “in future stays fun community stays nice place yeah it’s … continue” Closely related emotions respondents also amplified social rewards social recognition enablers motivation contribute example respondents expressed “Cleaning ugly things socially rewarding” also “positive feedback drives me” Similarly also ideological motivations expressed amongst respondents “I believe free think proper alternative proprietary software” “I care freedom” also intellectual motivations seem drive contributors example one respondent motivates participation LO argument good office package “one biggest tasks doesn’t already good solution Open Source” Similarly another respondent considered establishment high quality LO “a professional challenge money smarter competitors” respondents longterm commitment LO participation led desire see succeed stated one respondent “I’ve invested plenty time branch really really personal desire see succeed” Others expressed motivation improving wayofworking LO follows “It may readily visible still need add structure processes think want continue that” Visionary goal driven motivations future LO also expressed follows “it’s fun convinced it’s right thing think it’s right right time right people right mind sets” Similarly words another respondent “I think change lot running differently pushing borders thinking outside box there” motivations also seem stem frustration concerning perceived lack influence old OO commented one respondent “I active OpenOfficeorg past lots things loved product lot things made feel frustrated influence things picked development side really motivated work LibreOffice make better see improve compared old OpenOffice strong motivation” Finally amongst respondents observe strong commitment expressed one 
respondent “it’s fun it’s something like it’s first free contribute It’s something good chunks life now” Similarly strong commitment motivation participation also related stark emotions “It’s purely love” 55 Future outlook LibreOffice overwhelming impression responses contributors perceive positive future LO Several respondents gave number emotional expressions observe expectation diverse developer community amongst respondents future example stated one respondent “I believe stay diversified able embrace individuals companies well” Respondents raised budget issue stressed “need strengthen project” comments concern way working organise work illustrated one respondent “we still need consolidate organisation still need increase number members” Several respondents envisaged bright future LO illustrated following comments “Hereto attributes successful believe execute plan releases recently time based schedule always deliver time” “Whatever happens continue way another shape another think code base it’s many users disappear It’s stay” “I think bright future grows takes time” one respondent also expressed view relation existing proprietary alternative follows “We going grow going take market follower called Microsoft behind us” However also predicting somewhat modest future LO example view one respondent “we keep running think” number comments also revealed evolution LO seemed exceeded expectations respondents illustrated following comments “while young surprised diverse healthy is” “I think we’re well major breakthrough milestone finally got German authorities prove idea foundation we’re past quite important milestone Yeah positive future” “I think yet aware possible beginning realise much bigger thing get” importance community role LO amplified number respondents enabler future success Several comments signalled strong identity members LO community illustrated one respondent stressed importance community values follows “This community company don’t titles” commented 
consequently need business cards working within community However respondent suggested actually need business card certain situations community members need communicate external organisations importance vibrant community also stressed one respondent follows “the rich diverse compelling make ecosystem stronger is” Similarly another comment stressed importance successful governance community follows “governance key governance end discipline necessary discipline go level making others scared come inside moment still little bit scared trying make less scared” 56 Lessons learnt participation LibreOffice responses evident contributors perceive participation LO positive rewarding number different ways observed variety different lessons learnt participants number comments touched upon excitement opportunities open collaboration positive inclusive atmosphere seems promote learning Several respondents elaborated experiences participation LO community attached number positive characteristics community example commented one respondent long experience participation community “it’s true fun diverse vibrant community started OpenOffice fun beginning year able see behind didn’t like saw” Similarly another respondent stressed possibility impact providing value individuals organisations society broadly “the thrilling thing LibreOffice really makes difference see people using appreciating it” another respondent stressed opportunity open collaboration important lesson participation illustrated following comment “I think really shows cooperation open way profitable makes sense think valuable lesson” Similarly another respondent perceived benefits open collaboration follows “It’s things like name practitioner conference meeting people collaborating different people different mentalities tolerating others other’s ideas yeah even completely different approaches expectations get something big yeah” Another respondent stressed inherent nature sharing experiences collaborating community 
involving providing gaining valuable lessons follows “I think I’ve end really got much given term human experiences incredible” Several respondents stressed importance welcoming environment LO particular emphasis skills development example expressed one respondent “I think it’s good writing skills coding skills” Similarly several respondents stressed welcoming nature established practice mentoring new contributors something highly appreciated illustrated following comment “I pleased much welcoming environment new developers participate us pleased lot people quickly become senior advisers right feel free mentor people bootstrap new developers situation repeat process others done make valuable respected developers commit access” indicated another respondent mentoring process seems founded individual’s ability careful consideration LO acknowledging appreciating contributions contributors “I think it’s exceptionally welcoming nature LibreOffice community speed recognised contributions skills abilities It’s like every know LibreOffice happens fast” Finally another lesson learnt expressed one respondent clearly stressed perception feeling rewarded contributing LO “The important experience weeks actually switched upstream preparation going public seeing matter minutes IRC channels created filled people started download use actually build LibreOffice tireless moments spent IRC trying fix possible breakages magic moment see things actually moving ahead It’s emotional” Analysis 61 Analysis community evolution time results make number observations related results activity Firstly regular frequent releases stable versions LO including former development OO time period ten years examples well known OSS projects release histories extending many years Apache web server8 Linux kernel9 frequent releases since 1995 1991 respectively note LO AOO projects governed foundation10 ie third phase governance according categorisation proposed de Laat 2007 Secondly substantial activity LO 
including former development OO ten years Despite variation stable releases findings suggest longterm trend towards sustainable community observed signs lasting decline community activity comparison stable community activity many years aforementioned Apache web server Linux kernel projects Based results concerning commitment projects find large proportion influential committers LO involved long periods time fork OO indicates developer community strong commitment LO branch strong commitment contributors long time periods observed earlier study Debian observed maintainers “tend commit long periods time” “the mean life volunteers probably larger many companies would clear impact maintenance software” Michlmayr et al 2007 results show relatively small proportion 5 active LO committers contribute majority commits 78 five active committers contribute 33 commits LO comparison relatively small proportion 5 active AOO committers contribute smaller proportion commits 33 five active committers contribute 60 commits AOO acknowledging analysis AOO based significantly shorter time window LO note projects communities committers larger “the vast majority mature OSS programs” Krishnamurthy 2002 Results concerning commitment support findings previous research show OSS projects “the bulk activity especially new features quite highly centralised” Crowston et al 2012 Results retention committers show SOO LO successful recruiting retaining committers time compared OOO AOO Results also show sign long term decline LO terms number currently active committers results concerning contributions LO AOO projects show new developers ie contributed OO provide limited contributions LO representing 03 LO commits significant amount 321 AOO commits considering longterm contributors ie contributed three projects still limited contributions except AOO representing 19 commits OO 61 commits LO 413 commits AOO two dominating commitment patterns committers contributed LO committers commit LO committers done 
contributions OO starting contribute LO together involving 947 LO committers 897 LO commits comparison two dominating commitment patterns committers contributed AOO committers contributed LO within period contributed AOO together comprising 599 AOO committers 556 AOO commits Moreover clear majority 857 LO committers directly recruited LO whereas less half 395 AOO committers directly recruited AOO uncommon developers simultaneously involved one Lundell et al 2010 However results show limited number contributors simultaneously active LO AOO projects 8 httphttpdapacheorg 9 httpwwwkernelorg 10 Apache Foundation httpwwwapacheorg Linux Foundation httpwwwlinuxfoundationorg 62 Analysis insights experiences LibreOffice community Results study indicate systematic approach LO mentoring new contributors adopted systematic approach supportive work practices providing guidance new contributors example done via mentoring provision “LibreOffice Easy Hacks” specifically aimed inexperienced contributors Efforts made seem go beyond established practice many OSS projects important promote organisational learning ease introduction new contributors work practices recognised previous research Lundell et al 2010 results also show LO participants seem keen encourage acknowledge contributions new participants community results clearly show use weak copyleft license seen appropriate LO number reasons One reason perceived risk source code continue provided according core principles freedom choice Open Source license referred adhering “keepopen” license Engelfriet 2010 acknowledging number factors affecting attractiveness seems evident choice “keepopen” license considered appropriate amongst new contributors managed attract significant number new contributors additional indication preference “keepopen” license amongst contributors LO also contributing OO stem results interviews turn reinforces observation see majority contributors OO decided continue contributing one projects AOO LO chosen LO 
branch effect fork part OO community evolved new form founding members LO community stem OO community time new LO managed attract significant number new contributors managed governed TDF contrast approach taken AOO adopted already established structure governance work practices ASF complex interrelationship community company values impacts opportunities longterm maintenance support OSS projects number respondents express besides involvement LO community also affiliated various commercial organisations respondents also symbiosis different involvements results respondents strongly support several motivational factors individual participation OSS projects identified earlier research Bonaccorsi Rossi 2006 particular social motivations fun contribute sense belonging community important LO contributors Another social motivation observed LO community opportunity provide alternative proprietary solutions note technological motivations learning opportunity getting contributions feedback community also present amongst LO contributors respondents also active small companies see business opportunities participating LO community Hence study confirms earlier studies concerning individual motivations participation OSS projects 63 Implications study revealed number insights concerning governance community evolution longterm contributors active several governance regimes 10 years several changes concerning way working different communities Contributors starting OO governance Sun followed Oracle later active AOO experienced different corporate governance regimes followed adoption Apache way working transition governance existing ASF involved significant change participants terms changed governance changing conditions contributors due adoption institutionalised practices change weak copyleft license permissive license hand contributors starting OO later active LO also experienced different corporate governance regimes Sun Oracle followed adoption new way working implied establishment 
tailor made foundation TDF legal framework maintenance LO contributors continued use weak copyleft license way results show contributors shaped TDF view support preferred way working LO noted choice weak copyleft license base establishing LO possible without prior IPR clearance possible despite fact copyright code base base controlled different organisation Oracle corporation circumstances allowed LO able immediately continue development code base However establishing AOO need IPR clearance connection transferring copyright code base ASF change new Open Source license transfer ASF involved significant efforts resulted significant time window AOO start first release AOO analysis three specific projects investigated LO OO AOO shown significant development experiences – terms contributors contributions – maintained transferred OO two independent projects LO AOO importance establishing strong sense OSS community context large global OSS projects closely related importance establishing sense teamness global development projects Lings et al 2007 Open Source proprietary licensed projects need managing collaboration involving developers different sociocultural backgrounds However key difference Open Source based collaboration large community based projects large interorganisational collaborations using proprietary global contexts lies possibility successfully fork OSS establish new separate governance importance face face meetings recognised contexts interorganisational collaboration field global engineering Lings et al 2007 large globally distributed OSS projects analysed study analysis study note importance establishing common vision OSS community relates experiences context global engineering concerning importance gaining “executive support sites” globally distributed development Paasivaara 2011 7 Discussion conclusions 71 Discussion transition formation LibreOffice community seems successful However acknowledge short time period fork 33 months early indications 
successful LibreOffice community transition OpenOfficeorg need confirmed analysis longer time period later stage comparison wellknown fork significant uptake longterm sustainable community OpenBSD httpwwwopenbsdorg forked NetBSD 1995 still active developer community Gmane 2013 considering Open Source products longterm maintenance scenarios potential adoption critical understand engage communities related Open Source base analysed OpenOfficeorg governance structure established OpenOfficeorg community governed community council Openoffice 2013 Similarly investigated branch fork LibreOffice also established governance structure referred Document Foundation Documentfoundation 2013a Despite explicitly documented governance structures participants may decide fork happened Document Foundation established LibreOffice fork OpenOfficeorg 28 September 2010 results suggest fork may actually successful note observation indicates LibreOffice may exception norm since previous research claims “few successful forks past” Ven Mannaert 2008 results remains seen extent LibreOffice Apache OpenOffice projects may successfully evolve projects associated communities way sustainable longterm far seems LibreOffice successful terms growing associated communities results suggest choice Open Source license significantly impacts conditions attracting contributions Open Source projects Amongst contributors LibreOffice clear preference contributing Open Source use weak copyleft license base use keepopen license LibreOffice may significantly impact willingness contribute Open Source possess copyright may amongst volunteer company affiliated developers results show strong indications congruence professional roles contributions LibreOffice community community members acknowledge LibreOffice established openly available external contributions longer time period Apache OpenOffice partly explained later start Apache OpenOffice since state void 15 April 2011 Oracle abandoned OpenOfficeorg 13 June 2011 Apache 
OpenOffice established Apache Foundation note first commits Apache OpenOffice repository contributed August 2011 Therefore perhaps surprising number contributors OpenOfficeorg became involved LibreOffice since active OpenOfficeorg contribute several months However noted August 2011 first commits contributed Apache OpenOffice became openly available committers continued contribute LibreOffice situation analysed paper inherent complexity involves three projects complex interactions influences relationships respect code community dynamics Therefore study challenges previously established categorisations fork outcomes also concept fork defined since foundation categorisations definitions often consider relationship two projects often referred base forked Robles GonzalezBarahona 2012 Wheeler 2007 study shown individual contributors related OSS developer communities contribute several projects period time including base forked analysis sustainability Open Source communities evolution two independent Open Source projects fork shows potential successful branching specific emphasis investigate insights experiences community members established outcome fork find longterm community members seem manage establishing new tailormade foundation governance way appealing old new contributors situations one analysed study onetoone correspondence Open Source Open Source community Consequently assessing sustainability communities important recognise individual contributors involved multiple projects Therefore assessment must take account community involvement goes beyond single Irrespective relationships projects perceived transition base two new projects results analysis three interrelated projects associated transitions OpenOfficeorg go beyond previously established categorisations fork outcomes results thereby provide valuable insights extending existing body knowledge concerning forks 72 Conclusions study presents findings first comprehensive analysis Open Source projects involving 
fork study reveals number important findings related longterm sustainability Open Source communities Related characterisation community evolution time three interrelated Open Source projects study presents several important findings First LibreOffice shows sign longterm decline details circumstances fork successful Second majority contributors OpenOfficeorg continued one succeeding projects chose continue contributing LibreOffice LibreOffice attracted longterm active committers OpenOfficeorg thereby demonstrated successful transfer evolution knowhow work practices achieved beyond individual Open Source projects Third OpenOfficeorg governance Sun LibreOffice successful recruiting retaining committers time compared OpenOfficeorg governance Oracle Apache OpenOffice suggests effective governance work practices appreciated community members fundamental longterm sustainability Fourth minority LibreOffice committers recruited OpenOfficeorg contributed clear majority LibreOffice commits hand vast majority LibreOffice committers directly recruited commits minority conclude apart community efforts making easier contribute Open Source also important address challenges related longterm retention contributors study makes novel contribution revealing important insights experiences members LibreOffice community provides explanations LibreOffice evolved clear preference use copyleft license amongst contributors LibreOffice amongst volunteers affiliated companies use license LibreOffice perceived prerequisite entry amongst many volunteer contributors affiliated companies suggests Open Source license preferred amongst contributors Open Source projects strong community identity study shows important values amongst contributors stakeholders congruent effects particular Open Source license used Results study elaborate tension community details circumstances community members need vary order avoid ineffective collaboration climate Open Source study reveals important 
motivations joining contributing LibreOffice time including perceived welcoming atmosphere community sense supportive effective work practices appreciation independence control developed solutions members community strong identity appraisal community diversity Thereby study detailed importance nurturing Open Source communities order establish longterm sustainable Open Source projects contributor perspective study shows Open Source communities outlive Open Source projects particular projects associated devoted communities strong conviction future directions projects communities find strong indications forking used one effective strategy overcoming perceived obstacles current way working order improve situation findings analysis LibreOffice related OpenOfficeorg Apache OpenOffice projects contribute new insights concerning challenges related longterm sustainability Open Source communities systems long lifecycles success Open Source manages recruit retain new contributors community critical long term sustainability Hence good practice respect governance Open Source projects perceived community members fundamental challenge establishing sustainable communities References Ågerfalk P Fitzgerald B 2008 Outsourcing unknown workforce exploring open sourcing global sourcing strategy MIS Quarterly 32 2 385–410 Apache 1999 Apache Foundation Board Directors Meeting Minutes httpwwwapacheorgfoundationrecordsminutes1999boardminutes19990601txt accessed June 2013 Apache 2013a Apache OpenOffice httpopenofficeapacheorg accessed June 2013 Apache 2013b Apache Foundation – Foundation httpwwwapacheorgfoundation accessed June 2013 Bach P Carroll J 2010 Characterizing dynamics open user experience design cases Firefox OpenOfficeorg JAIS 11 special issue 902–925 Bacon J 2009 Art Community O’Reilly Media Sebastopol Blondelle G Arberet P Rossignol Lundell B Labeze P Berrendonner R Gauffret P Faudot R Langlois B Maioncello L Moro P Rodriguez J Puerta Peña JM Bonafous E Mueller R 2012a Polarsys 
towards longterm availability engineering tools embedded systems Proceedings Sixth European Conference Embedded Real Time Systems ERTS 2012 Toulouse France 1–2 February Blondelle G Langlois B Gauffret P 2012b Polarsys addresses Long Term Support develops ecosystem Eclipse tools Critical Embedded Systems EclipseCon US 2012 Reston Virginia 26–28 March httpwwweclipseconorg2012sessionshowpolarsysaddresseslongtermsupportanddevelopsecosystemeclipsetoolscriticalembe Bonaccorsi Rossi C 2006 Comparing motivations individual programmers teams take part open source movement community business Knowledge Technology Policy 18 4 60–64 Brock 2013 Understanding commercial agreements open source projects Coughlan Ed Thoughts Open Innovation – Essays Open Innovation Leading Thinkers Field OpenForum Europe Ltd OpenForum Academy Brussels Byfield B 2010 Cold War OpenOfficeorg LibreOffice Linux Magazine httpwwwlinuxmagazinecomOnlineBlogsOfftheBeatBruceByfieldsBlogTheColdWarBetweenOpenOfficeorgandLibreOffice accessed June 2013 Conlon MP 2007 examination initiation organization participation leadership control success Open Source development projects Information Systems Education Journal 5 38 1–13 Crn 1999 Sun Microsystems Buys Star Division httpwwwcrncomnewschannelprograms18804525sunmicrosystemsbuysstardivisionhtm accessed June 2013 Crowston K Annabi H Howison J 2003 Defining Open Source success Proceedings International Conference Information Systems ICIS 2003 Seattle WA USA 14–17 December pp 327–340 Crowston K Howison J Annabi H 2006 Information systems success free Open Source development theory measures Process Improvement Practice 11 2 123–148 Crowston K Kangning W Howison J Wiggins 2012 FreeLibre opensource development know know ACM Computing Surveys 44 2 article 7 de Laat P 2007 Governance open source state art Journal Management Information Governance 11 2 165–177 Deshpande Riehle 2008 total growth Open Source Russo B et al Eds Open Source Development Communities Quality IFIP 
Advances Information Communication Technology vol 275 Springer New York pp 197–209 DinhTrong TT Bieman JM 2005 FreeBSD replication case study open source development IEEE Transactions Engineering 31 6 481–494 Documentfoundation 2013a Document Foundation httpwwwdocumentfoundationorg accessed June 2013 Documentfoundation 2013b Document Foundation Manifesto httpwwwdocumentfoundationorgpdfmanifestopdf accessed June 2013 Documentfoundation 2013c Document Foundation – Supporters httpwwwdocumentfoundationorgsupporters accessed June 2013 Engelfriet 2010 Choosing Open Source license IEEE 27 1 48–49 Feitelson DG 2012 Perpetual development model Linux kernel life cycle Journal Systems 85 4 859–875 Gamalielsson J Lundell B 2011 Open Source communities longterm maintenance digital assets offered ODF OOXML Hammouda L Lundell B Eds Proceedings SOS 2011 Towards Sustainable Open Source Tampere University Technology Tampere pp 19–24 ISBN 9789521524110 ISSN 1737836X Gamalielsson J Lundell B 2012 Longterm sustainability Open Source communities beyond fork case study LibreOffice Hammouda L et al Eds Open Source Systems LongTerm Sustainability IFIP Advances Information Communication Technology vol 378 Springer Heidelberg pp 29–47 Gamalielsson J Lundell B Lings B 2010 Nagios community extended quantitative analysis Ågerfalk P et al Eds Open Source New Horizons IFIP Advances Information Communication Technology vol 319 Springer Berlin pp 85–96 Gamalielsson J Lundell B Mattsson 2011 Open Source model driven development case study Hissam Ed Open Source Systems Grounding Research IFIP Advances Information Communication Technology vol 365 Springer Heidelberg pp 348–367 German 2003 GNOME case study open source global development Journal Process Improvement Practice 8 4 201–215 Gmane 2013 Information gmaneosopenbsdcvs httpdirgmaneorggmaneosopenbsdcvs accessed June 2013 Huysmans P Ven K Verelst J 2008 Reasons nonadoption OpenOfficeorg dataintensive administration First Monday 13 10 IBM 2011 IBM 
Contribute New Proposed OpenOfficeorg httpwww03ibmcompressusenpressrelease34638wss accessed June 2013 Israeli Feitelson DG 2010 Linux kernel case study evolution Journal Systems 83 3 485–501 Izurieta C Bieman J 2006 evolution FreeBSD Linux Proceedings 5th ACMIEEE International Symposium Empirical Engineering ISESE’06 September 21–22 Rio de Janeiro Brazil Koponen Lintula H Hotti V 2006 Defects reports Open Source maintenance process – OpenOfficeorg case study Proceedings Engineering Applications SEApp’06 Dallas TX USA 13–15 November Krishnamurthy 2002 Cave community empirical examination 100 mature Open Source projects First Monday 7 6 Lee SYT Kim HW Gupta 2009 Measuring open source success Omega 37 2 426–438 Lings B Lundell B 2005 adaptation Grounded Theory procedures insights evolution 2G method Information Technology People 18 3 196–211 Lings B Lundell B Ågerfalk PJ Fitzgerald B 2007 reference model successful distributed Development Systems Proceedings Second International Conference Global Engineering ICGSE 2007 IEEE Computer Society pp 130–139 Linususer 2010 OpenOfficeorg Community Announces Document Foundation httpwwwopenofficeorgpressreleaseannouncesthedocumentfoundation accessed June 2013 LopezFernandez L Robles G GonzalezBarahona JM Herraiz 2006 Applying social network analysis techniques communitydriven Libre projects International Journal Information Technology Web Engineering 1 3 27–48 Lundell B 2011 eGovernance public sector ICTprocurement shaping practice Sweden European Journal ePractice 12 6 httpwwwepracticeeufilesEuropean20Journal20of20ePractice20Volume2012266pdf Lundell B Gamalielsson J 2011 Towards Sustainable Swedish eGovernment Practice Observations unlocking digital assets Proceedings IFIP 11th Government Conference 2011 EGOV 2011 Delft Netherlands 28 August–2 September 2011 Lundell B Lings B Lindqvist E 2010 Open Source Swedish companies Information Systems Journal 20 6 519–535 Lundell B Lings B Syberfeldt 2011 Practitioner perceptions Open 
Source embedded systems area Journal Systems 84 9 1540–1549 Madey G Freeh V Tynan R 2004 Modeling FOSS community quantitative investigation Koch Ed FreeOpen Source Development Idea Group Publishing Hershey pp 203–221 Marketwire 2011a Oracle Announces Intention Move OpenOfficeorg CommunityBased httpwwwmarketwirecompressreleaseoracleannouncesitsintentiontomoveopenofficeorgtoacommunitybasedprojectnasdaqorcl1503027htm accessed June 2013 Marketwire 2011b Oracle Contribute Apache httpwwwmarketwirecompressreleasestatementsonopenofficeorgcontributiontoapachenasdaqorcl1521400htm accessed June 2013 MeyersRomero J Robles G OrtuñoPérez GonzalezBarahona JM 2008 Using social network analysis techniques study collaboration FLOSS community company Russo B et al Eds Open Source Development Communities Quality IFIP Advances Information Communication Technology vol 275 Springer New York pp 171–186 Mens FernándezRamil J Degrandsart 2008 evolution Eclipse Proceedings 24th IEEE International Conference Maintenance ICSM 2008 pp 386–395 Michlmayr 2009 Community management Open Source projects European Journal Informatics Professional X 3 22–26 Michlmayr Robles G GonzalezBarahona JM 2007 Volunteers large Libre projects quantitative analysis Sowe SK et al Eds Emerging Free Open Source Practices IGI Publishing Hershey pp 1–24 Midha V Palvia P 2012 Factors affecting success Open Source Journal Systems 85 4 895–905 Mockus Fielding RT Herbsleb JD 2002 Two case studies Open Source development Apache Mozilla ACM Transactions Engineering Methodology 11 3 309–346 Moon YJ Sproull L 2000 Essence distributed work case Linux kernel First Monday 5 12 1–7 Müller R 2008 Open Source – Value Creation Consumption Open Expo Zürich 24–25 September Nouws L 2011 LibreOffice – first year looking forward Presented ODF Plugfest Gouda Netherlands 20111118 httpplugfest Nyman L Mikkonen Lindman J Fougère 2012 Perspectives code forking sustainability Open Source Hammouda L et al Eds Open Source Systems LongTerm 
Sustainability IFIP Advances Information Communication Technology vol 378 Springer Heidelberg pp 274–279 Openoffice 2002 OpenOfficeorg Community Announces OpenOfficeorg 10 Free Office Productivity httpwwwopenofficeorgaboutusoooereleasehtml accessed June 2013 Openoffice 2004 OpenOfficeorg Four httpwwwopenofficeorgaboutusbirthday4html accessed June 2013 Openoffice 2012 Apache OpenOffice Announces Apache OpenOffice™ 34 httpwwwopenofficeorgnewsaoo34html accessed June 2013 Openoffice 2013 Community Council httpwikiservicesopenofficeorgwikiCommunityCouncil accessed June 2013 Oracle 2010 Oracle Completes Acquisition Sun httpwwworaclecomuscorporatepress044428 accessed June 2013 Paasivaara 2011 Coaching global development projects Proceedings 30th International Conference Global Engineering ICGSE 2011 IEEE Computer Society pp 84–93 Pclomsag 2011 Free Last LibreOffice 33 Released httppclomsagcomhtmlIssues201103page14html accessed June 2013 Raja U Tretter MJ 2012 Defining evaluating measure Open Source survivability IEEE Transactions Engineering 38 1 163–174 Ray B Kim 2012 case study crosssystem porting forked projects Proceedings 20th ACM SIGSOFT International Symposium Foundations Engineering 11–16 November 2012 Cary NC Robert 2006 Onboard development – opensource way ISTARTEMIS Workshop Helsinki 22 November Robles G GonzalezBarahona JM 2012 comprehensive study forks dates reasons outcomes Hammouda L et al Eds Open Source Systems LongTerm Sustainability IFIP Advances Information Communication Technology vol 378 Springer Heidelberg pp 1–14 Robles G GonzalezBarahona JM Michlmayr 2005 Evolution volunteer participation Libre projects evidence Debian Proceedings First International Conference Open Source Systems OSS 2005 pp 100–107 Rossi B Scotto Sillitti Succi G 2006 empirical study migration Open Source public administration International Journal Information Technology Web Engineering IJITWE 1 3 64–80 Rossi B Russo B Succi G 2009 Analysis Open Source development evolution 
iterations means burst detection techniques Boldyreff C et al Eds Open Source Ecosystems Diverse Communities Interacting IFIP Advances Information Communication Technology vol 299 Springer Berlin pp 83–93 Samoladas Stamos Angelos L 2010 Survival analysis duration open source projects Information Technology 52 9 902–922 Santos C Kuk G Kon F Pearson J 2013 attraction contributors free Open Source projects Journal Strategic Information Systems 22 1 45–69 Sen R Singh SS Borle 2012 Open Source success measures analysis Decision Support Systems 52 2 364–372 Severance C 2012 Apache Foundation Brian Behlendorf Computer 45 1 1–6 Seydel J 2009 OpenOfficeorg ready prime time Proceedings Southwest Decision Sciences Institute Conference SWDSI 25–28 Ed Shibuya B Tamai 2009 Understanding process participating open source communities Proceedings 2009 ICSE Workshop Emerging Trends FreeLibreOpen Source Research Development IEEE Computer Society Washington DC USA pp 1–4 Subramaniam C Sen R Nelson ML 2009 Determinants Open Source success longitudinal study Decision Support Systems 46 2 576–585 Ven K Mannaert H 2008 Challenges strategies use Open Source Software Independent Vendors Information Technology 50 9–10 991–1002 Ven K Huysmans P Verelst J 2007 adoption open source desktop large public administration Proceedings 13th Americas Conference Information Systems AMCIS 2007 9–12 August Keystone CO Ven K Van Kerckhoven G Verelst J 2010 adoption open source desktop qualitative study Belgian organizations International Journal ITBusiness Alignment Governance IJITBAG 1 4 1–17 Viseur R 2012 Forks impacts motivations free open source projects International Journal Advanced Computer Science Applications IJACSA 3 2 117–122 Wang J 2012 Survival factors Free Open Source projects multistage perspective European Management Journal 30 4 352–371 Wheeler DA 2007 Open Source SoftwareFree OSSFS FLOSS OSS important Wheeler DA Ed Open Source New Horizons IFIP Advances Information Communication 
Technology vol 319 Springer Berlin pp 294–307 Wiggins Howison J Crowston K 2009 Heartbeat measuring active user base potential user interest FLOSS projects Boldyreff C et al Eds Open Source Ecosystems Diverse Communities Interacting IFIP Advances Information Communication Technology vol 299 Springer Berlin pp 94–104 Jonas Gamalielsson researcher University Skövde’s Informatics Research Centre conducted research open source open standards several projects involved Open Source Action OSA 2008–2010 Nordic NordForsk OSS Researchers Network 2009–2012 ITEA2project OPEES Open Platform Engineering Embedded Systems participating ORIOS Open Source based Reference implementations Open Standards also involved Fifth Eighth International Conference Open Source Systems OSS 2009 OSS 2012 Björn Lundell senior researcher University Skövde’s Informatics Research Centre researching Open Source phenomenon several years participated number research projects different leading roles including colead work package EU FP6 CALIBRE 2004–2006 manager Swedish National Research OSS 2005–2008 currently leader ORIOS 2012–2015 founding member IFIP WG 213 Open Source program cochair Eighth International Conference Open Source Systems OSS 2012
::::
Code Reuse Stack Overflow Popular Open Source Java Projects Adriaan Lotter Department Information Science University Otago Dunedin New Zealand adriaanlotterotagoacnz Sherlock Licorish Department Information Science University Otago Dunedin New Zealand sherlocklicorishotagoacnz Sarah Meldrum Department Information Science University Otago Dunedin New Zealand sarahmeldrumoutlookcom Bastin Tony Roy Savarimuthu Department Information Science University Otago Dunedin New Zealand tonysavarimuthuotagoacnz Abstract—Solutions provided Question Answer QA websites Stack Overflow regularly used Open Source OSS However many developers unaware Stack Overflow OSS governed licenses Hence developers reusing code Stack Overflow OSS projects may violate licensing agreements attributions correct Additionally code migrates one OSS Stack Overflow another OSS complex licensing issues likely exist forms reuse also implications future maintenance particularly developers poor understanding copied code paper investigates code reuse two platforms ie Stack Overflow OSS aim providing insights issue study mined 151946 Java code snippets Stack Overflow 16617 Java files 12 top weekly listed projects SourceForge GitHub 39616 Java files top 20 popular Java projects SourceForge analyses aimed finding number clones indicating reuse within Stack Overflow posts b Stack Overflow popular Java OSS projects c projects Outcomes reveal 33 code reuse within Stack Overflow 18 Stack Overflow code reused recent popular Java projects 23 projects established Reuse across projects much higher accounting much 772 outcomes implication strategies aimed introducing strict quality assurance measures ensure appropriateness code reuse licensing requirements awareness Keywords—Code reuse Stack Overflow Java projects OSS QA Quality INTRODUCTION Quality plays fundamental role success 30 Thus quality standards developed provide guidance developers covering requirements producing high quality defectfree 30 31 ISO9126 quality 
model example stated quality requirements cover efficiency functionality reliability usability reusability maintainability 9 standards also subject previous academic studies eg Singh et al 22 quality underlying motivator instilling good development practices creating developers particularly conscious employing code reuse external sources eg open source OS portals 29 may impact efficiency functionality reliability usability maintainability code reuse allows previously tested qualityassured code implemented system reusing code untrusted sources may lead system harm 16 implications code reuse could particularly significant maintainability poor knowledge reused code time development likely create challenges future corrective perfective actions discussed Roy et al 40 understanding levels reuse cloning could valuable developers terms assisting issues related plagiarism evolution debugging code compaction security Furthermore Kashima et al 36 noted several OSS licenses require outcomes derived original solutions published license demands developers aware legal implications licenses OSS code posted portals Stack Overflow published Additionally businesses also need aware reuse occurring within outsourced development 20 conditions may face future legal challenges Code reuse formally defined “the use existing knowledge construct new software” 15 prevalent many including produced toptier development companies Google 34 Beyond industry leaders code reuse found exceptionally common Mobile Apps products consisting entirely reused elements 13 high level reuse seen practice developers stems benefits provides terms easily adding enhancing system features 25 accessibility readily available solutions coding problems highly attractive novice experienced programmers 25 fact study Sojer et al 21 responses 869 developers confirmed consider ad hoc reuse code internet important work Similarly Heinemann et al 18 also found 90 OS projects analyzed contained reused code reiterating point code 
reuse found extensively many systems ease attractiveness code reuse particularly aided readily accessible code fragments QA websites Stack Overflow Stack Overflow popular QA website allows members public post development related questions andor answers answers often containing code fragments 1 httpwwwstackoverflowcom Recent evidence shows majority questions asked Stack Overflow usually receive one answers 6 forum often substitute official programming languages’ tutorials guides 24 implications maintainability licensing reusing Stack Overflow code fragments interest us potential effects reusing code portal could effort future changes correct use license avoid future legal issues aim paper thus investigate levels code reuse within Stack Overflow Stack Overflow OSS projects focus Java programming language given popularity 2 need understand reuse beyond Python Yang et al 8 strong body knowledge around scale developers’ reuse practices team leaders may begin introduce stricter quality assurance measures ensure appropriateness reused code fragments thus answer five research questions portfolio work Firstly explore extent Java code reuse within Stack Overflow understand community operates ecosystem provision selfsupport RQ1 Related question next explore extent code reuse answers published question Stack Overflow understand degree innovation lack thereof prevalent platform RQ2 Answers two questions particularly useful engineering community withinsource code migration likely increase risk incorrect author attribution due copies existence b increasing number ‘steps’ piece code could taken origin found could turn lead unsuspecting license violations implementing code snippets OSS third research question extent code reuse Stack Overflow current popular Open Source Java Projects helps us understand recent code reuse trends RQ3 Related research question examine extent code reuse Stack Overflow alltime popular Open Source Java projects understand practitioners’ behavior code 
reuse time RQ4 Additionally answer differences nature reuse found different contexts terms scale size provide deeper evidence nature ranges code reuse Stack Overflow OSS projects RQ5 Beyond understanding extent code reuse clones existing OSS Stack Overflow important understand practitioners’ attitude towards practice changed time investigation led latter three questions provide initial evidence extent code reuse projects developed recently existed longer remaining sections paper organized follows provide study background Section 2 next provide research setting Section 3 providing results Section 4 discuss findings implications Section 5 prior considering threats study Section 6 Finally provide concluding remarks point future research Section 7 II BACKGROUND practitioners would benefit developing maintainable systems free code license violations thus code reuse given serious consideration development topics ie maintenance license investigated various extents importance widely noted literature Firstly maintainability system highly significant stakeholders especially considering leadtimes costs 9 Maintainability refers likelihood performing improvements given period said become difficult prevalence code reuse 32 Kamiya et al 32 established code reuse could introduce multiple points failure code fragments ‘buggy’ fact noted approximately half changes made code clone groups inconsistent 15 issue code reuse maintainability becomes complex reused code sourced external sources eg Stack Overflow due potential code incompatibility issues suboptimal solutions often tied lack developer understanding Also code fragments provided Stack Overflow largely written accompanying textual explanation immediate use fact many developers online sources Stack Overflow utility faced issues require knowledge possess brings question likely understanding code turn brings question software’s quality Furthermore security complications may arise evidence shown Stack Overflow portal includes 
insecure code 10 example catastrophic code reuse could illustrated Bi 11 author shows piece Stack Overflow code used NissanConnect EV mobile app accidentally displayed piece text reading “App explanation spirit stack overflow coders helping coders” example illustrates code reused Stack Overflow similar portals always examined thoroughly Although example illustrates nonthreatening issue many similar cases could introduce security functionalityrelated problems inspected properly Thus important investigate understand extent code reuse occurring systems online code resources Stack Overflow Recently several research studies conducted topic code reuse Stack Overflow instance Abdalkareem et al 25 investigated code reused Stack Overflow Mobile Apps found 13 Apps sampled constructed Stack Overflow posts also discovered midaged older Apps contained Stack Overflow code introduced later lifetime et al 19 also investigated Android Apps found 62 399 155 Apps contained exact code clones 62 Apps 60 potential license violations terms Stack Overflow discovered 1226 posts contained code found 68 Apps Furthermore 126 snippets involved code migration 12 cases migration involved Apps published different licenses Yang et al 8 noted terms Python projects 1 code blocks token form exist GitHub Stack Overflow 80 similarity threshold 11 code blocks GitHub similar Stack Overflow 2 Stack Overflow code blocks similar GitHub terms attribution ensuring conformance license requirements Baltes et al 27 found 73 popular repositories GitHub contained reference Stack Overflow context Java projects minimum two thirds containing copied code contain reference Stack Overflow Additionally 32 surveyed developers aware attribution requirements Stack Overflow could result complicated legal issues developers fact study licensing violations also subject previous research 4 23 21 noted license violations occur frequently OS projects 4 well QA websites Stack Overflow community inquired issue 3 stated German et al 
7 illegal code fragments one system implemented another licenses incompatible developers required cautious work aware legal consequences involved code reuse internet sources Although license violations direct implications quality pose potential legal problems could result removal court costs Additionally development perspective licensing issues could result costs resolve complications implement system changes fix reputation damage Stack Overflow covered CC BYSA 30 Creative Commons AttributionShareAlike 30 license 2 developers right transform build upon content Stack Overflow However new using Stack Overflow code must distributed license original Furthermore credit must given specific answer Stack Overflow link must provided license developer specify introduced changes Noticeably code reuse Stack Overflow shown exist various OSS projects varying amounts reuse levels reused code however often acknowledged lack attribution results license violations many projects 3 25 additional research required validate extend current literature pursue line work study answering five research questions RQ1RQ5 stated earlier III RESEARCH SETTING Data Collection Processing address research questions posed three sets data extracted including Stack Overflow code snippets two sets OSS projects’ source code purpose study dataset required contain Java files collect necessary data utilized Stack Overflow data dump SourceForge GitHub key motivator selecting sources popularity programming community open access data projects selected SourceForge GitHub based popularity weekly time resulting projects selected widely used contributed towards believe effects code reuse would significant projects less popular ones Stack Overflow Java Snippets Java ‘snippets’ Stack Overflow extracted using data explorer function create first dataset Answer posts selected based least one “ ” tag filtered language Java answers selected accepted answers kept premise snippets trusted thus reused final filter answers 
2014 2017 selected ensure relevancy resulted 117526 answers answers separated individual code snippets based within “ … ” tags resulted 404799 individual code snippets snippets one line code selected Ultimately 151954 code snippets extracted saved Java files 151946 analyzed since eight returned errors processed Top Weekly OSS Projects second dataset files extracted projects greatest weekly popularity specific week sourcing starting December 18 2017 extracted top 10 weekly Java projects SourceForge GitHub resulted preliminary sample 20 projects line previous research done Heinemann et al 18 Open Source Java projects projects investigated containing least one Java file selected Ultimately 12 suitable projects finally selected analysis contained total 16617 Java files Five files returned errors processing reported Table III Time Popular OSS Projects final dataset covered projects highest alltime popularity SourceForge top 20 projects selected 16 appropriate analysis ie contained least one Java source file extract projects GitHub round given richness projects extracted SourceForge projects filtered popularity well containing Java code However four projects included subset contain Java files leaving 39616 files final list projects summaries found Table IV 39558 Java files used analyses processing B Tools Techniques answer research questions appropriate clone reuse detection tool required conducted review several tools including NiCad 14 SourcererCC 12 CCFinderX 32 selected CCFinderX given performance popularity among researchers 5 25 32 tokenbased techniques clone detection computationally efficient alternative methods high recall rate able detect hidden clones 5 discussed Kamiya et al 32 works employing lexical analyzer create token sequences applies rulebased transformations sequences based specific programming language lexical analyzer used transform sequences characters sequences tokens wordlike entities 33 entities identifiers keywords numbers literals operators 
separators comments 33 1 matching clones computed using suffixtree algorithm “in clone information represented tree sharing nodes leading identical subsequences clone detection performed searching leading nodes tree” 33 utilizing CCFinderX analyses several parameters configured followed previous recommendations used CCFinderX default settings 25 minimum clone length representing absolute count tokens set default value 50 code blocks considered contain least 50 tokens Additionally minimum unique token set value configured default 12 Hence code blocks considered contains least 12 unique tokens addition absolute minimum count 50 tokens shaper level also set default 2 shaper restricts code blocks considered candidate clone outer block ‘’ splits token sequence final two parameters ‘Pmatch application’ ‘Prescreening application’ Pmatch application parameter default ticked denotes variables function names replaced special characters Prescreening application default ticked wanted retain clone instances Prescreen ticked filter outcomes visually many code clones output CCFinderX includes file metrics clone metrics file metrics provide filelevel insights data whereas clone metrics provide information regarding clone sets One set exists unique group clones clone set contain minimum 2 code blocks Additionally able identify number files containing clones clonesets present different files data refer Figure 1 example order determine extent code reuse occurring within files files projectsdatasets Radius metric RAD CCFinderX utilized performing analysis clonesets selected based specific RAD values turn used select individual files involved RAD metric defined Kamiya et al 32 gives indication maximum distance common directory files involved cloneset clones found within file Radius 0 clones found two files directory Radius 1 C Measures Answering RQs answer first four research questions posed RQ1RQ4 five analyses performed analyses involve calculating following metrics Firstly number 
files containing least one clone computed Secondly using previous measure got measure percentage files containing clones allows us compare results similar studies Yang et al 8 Thirdly summing population variable pop clonesets identified total number clones present files Fourthly total number clonesets reveals unique clones Fifthly among clone sets identified clones involved one file answer RQ1 extent Java code reuse within Stack Overflow Stack Overflow files stored directory CCFinderX executed Radius 1 used identify betweenfile clonesets Answering second research question RQ2 extent code reuse answers published question Stack Overflow required Stack Overflow files stored separate directories based questions posted Radius 1 would indicate clones exist answers question Radius 2 would indicate clones exist answers separate questions Radius 2 however imply intraquestion clones ie clones question exist simply implies clone also found questions hide intraquestion clones manual inspection performed clonesets Radius 2 identify intraquestion clones hidden maximum Radius value Figures 2 3 demonstrate situation cases Radius 2 however one Figure 2 intraquestion clone code piece denoted ‘A’ found question Question 1 answer third RQ3 extent code reuse Stack Overflow current popular Open Source Java Projects fourth RQ4 extent code reuse Stack Overflow alltime popular Open Source Java projects questions project’s files extracted saved directory Furthermore Stack Overflow files saved two directories away allowed us identify clonesets clones found Stack Overflow projects using Radius value 2 primary measurements required answer research questions includes total number files containing least one clone total number clones present files number unique clones RQ5 differences nature reuse found different contexts terms scale size answered follow statistical analyses involving outcomes
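The Radius rule described above can be sketched as follows. This is a minimal illustration, assuming RAD is the largest number of path steps from any file in a clone set up to the deepest directory shared by all files in that set. The function name `radius` and the example paths are hypothetical and are not part of CCFinderX.

```python
from pathlib import PurePosixPath


def radius(paths):
    """Return an illustrative Radius (RAD) value for one clone set.

    Assumption: RAD is the maximum distance, in path components, from a
    file to the deepest directory common to every file in the clone set.
    """
    unique = {PurePosixPath(p) for p in paths}
    if len(unique) <= 1:
        # All clones sit in a single file.
        return 0

    # Count how many leading directory components every file shares.
    parent_parts = [p.parent.parts for p in unique]
    common = 0
    for level in zip(*parent_parts):
        if len(set(level)) != 1:
            break
        common += 1

    # Distance from each file up to the shared directory; keep the largest.
    return max(len(p.parts) - common for p in unique)


# Hypothetical layout: one directory per Stack Overflow question.
print(radius(["so/q1/a1.java"]))                   # 0: clones within one file
print(radius(["so/q1/a1.java", "so/q1/a2.java"]))  # 1: answers to the same question
print(radius(["so/q1/a1.java", "so/q2/a3.java"]))  # 2: answers to different questions
```

Under this layout a clone set with Radius 1 stays inside one question, while Radius 2 means at least one clone lies in another question, which is why Radius 2 sets still need a manual check for intraquestion clones.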
::::
D Reliability Checks ensure results obtained analyses reliable conducted manual investigation 60 clone pairs detected CCFinderX Initially author AL first author performed checks discussed author SAL second author triangulated outcomes provided confirmation Within sample 60 clone pairs 20 randomly obtained Stack Overflow analysis Section IV 20 Section IV C – 20 Section IV – selected clonepair determined extent two pieces code similar nature code also recorded ie class method piece code within method detected clone extent clones similar rated either ‘Exact’ ‘High’ ‘Medium’ rated ‘Exact’ code question would identical copies including identifiers structure functionality rated ‘High’ primary difference two pieces code would identifiers Finally ranked ‘Medium’ considered still similar structure although identifiers minor pieces data structures minor pieces functionality may different results analyses given Tables I II Table I reflects number clone pairs considered similar given extent Table II displays nature code elements detected sample
::::
TABLE I MANUAL CHECK DETECTED CLONE SIMILARITY
Similarity   Stack Overflow   AllTime Popular   Current Popular   Total
Exact        10               6                 3                 19
High         10               12                13                35
Medium       0                2                 4                 6
::::
TABLE II CODE CLONES ELEMENTS
Nature Code Element   Stack Overflow   AllTime Popular   Current Popular   Total
Class                 5                0                 1                 6
Method                5                6                 8                 19
Part Method           10               14                11                35

results show highly plausible pieces code could copied directly least adapted fit question refer Table I details Furthermore Table II shows majority clones code found within methods Thus appears developer copy piece code Stack Overflow likely code would provide additional functionality method
::::
IV RESULTS
::::
A Java Code Reuse within Stack Overflow RQ1 analysis Stack Overflow files revealed overall 5041 files 151946 contained least one clone reused Thus 3.3% Stack Overflow Java code snippets duplicate found elsewhere Stack Overflow Furthermore observed within 5041 files total 8786 clones present indicating contained multiple clones terms clone sets 3530 unique code snippets observed clones However focusing clones found least two files number reduced 2338 result able determine potentially 2338 unique license violations existing within Stack Overflow files extracted refer Section II Stack Overflow licensing requirements cumulatively appear 5863 places additional 1192 ie 3530 minus 2338 unique clones found within files present potential license violations contained within answers author
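The RQ1 figures can be cross-checked with a few lines of arithmetic. The sketch below only recomputes values from the counts reported above; the variable names and the helper `percent` are illustrative assumptions, not part of the study's tooling.

```python
def percent(part, total):
    """Share of `part` in `total`, as a percentage rounded to one decimal."""
    return round(100 * part / total, 1)


# Counts reported for the Stack Overflow dataset (RQ1).
files_total = 151_946        # snippets analyzed
files_with_clone = 5_041     # snippets containing at least one clone
clone_sets = 3_530           # unique cloned code blocks
cross_file_sets = 2_338      # clone sets that span at least two files

share_with_clone = percent(files_with_clone, files_total)
within_file_only = clone_sets - cross_file_sets

print(share_with_clone)   # 3.3
print(within_file_only)   # 1192
```

The first value is the duplication rate for Stack Overflow snippets, and the second is the number of clone sets confined to a single file, which the text treats as posing no licensing concern because both copies come from the same answer.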
::::
B Java Code Reuse Answers Stack Overflow RQ2 investigate code reuse within Stack Overflow also looked amount reuse occurring within answers given questions analyses reveal 151946 Stack Overflow files 2666 contained clones found question equates 1.8% total files implies amount snippets least one clone code duplication published question Within 2666 files total 3559 clones found indicating answers contained multiple clones 3559 clones discovered number unique clones found 1763 Additionally 2666 Stack Overflow files containing clones able identify present answers responses 1207 unique questions 46082 total Hence 2.6% Java related questions Stack Overflow expected contain two answers code
::::
C Code Reuse Stack Overflow Current Popular Projects RQ3 Stack Overflow Reuse Analysis analysis Stack Overflow top weekly OSS projects revealed 12763 files 168558 five files removed CCFinderX’s due errors contained least one clone Based result observed 76 files consideration contain least one clone 12763 files total 5447 Stack Overflow files 151946 files 7316 top weekly OSS files 16612 files indicates introducing files 406 additional Stack Overflow files contain clones refer Section IV implies 406 Stack Overflow files contain code found anywhere else Stack Overflow clones solely Stack Overflow least one Additionally files clones account 44 total files proportion much greater Stack Overflow files 33 primarily believed result size files average token size 617 compared much smaller 48 Stack Overflow files performed probing data observing 12763 files containing least one clone 21893 clone sets existed words 21893 unique code snippets least one clone smaller number clone sets contained clones found Stack Overflow top weekly OSS files figure 223 indicating 223 unique code snippets found Stack Overflow files clones cumulatively appear 1627 files 10 168558 appearing average 73 files total 223 unique code snippets appear 1995 times b InterProject Reuse Analysis 12763 files containing clones total 75959 clones discovered within files analyzed independently found 7287 ie 57 files contained clones among giving average 29791 clones per depicted Table III Additionally investigating clonesets observe 212 clones least two projects appearing 1995 times probing also revealed 29 files 7316 contained clones found Stack Overflow files words 29 clones found one one fashion one Stack Overflow likely migrated directly Stack Overflow since evidence originating internal direction migration however known although independent situations reliability checks show attributions thus licensing issues could arise Code Reuse Stack Overflow AllTime Popular Projects RQ4 Stack Overflow Reuse Analysis 
analysis Stack Overflow time popular Java projects revealed overall 24537 files 191504 58 files removed CCFinderX’s due errors contained least one clone Based result observe approximately 128 files question contain least one clone However 5554 Stack Overflow files contained clone 513 Stack Overflow files considered hand 18983 files 39558 files contained least one clone approximately 48 total files noted average length file 652 tokens Furthermore 24537 files containing least one clone 51282 clone sets existed words 51282 unique code snippets least one clone smaller number clone sets contain clones found Stack Overflow projects figure 450 indicating 450 unique code snippets found Stack Overflow files clones cumulatively appear 4334 times 23 191504 64 files average b InterProject Reuse Analysis Within 24537 files total 245750 clones discovered Additionally analyzed independently found 18935 files contained clones among ie 772 giving average 91869 clones per depicted Table IV Additionally investigating clonesets found 726 clones found least two projects appearing 6377 times noticed 48 files 18983 contained clones found Stack Overflow files 48 files found directly one Stack Overflow highly likely migrated directly Stack Overflow Number Java files Average number tokensfile Number clones Number files cloness Awesome Java 57 220 18 15 Leetcode 1327 4984 1996 486 Dubbo 5576 9367 13823 2869 ElasticSearch 1018 1205 320 189 Java Design 3966 673 13674 2327 Patterns 239 7134 2906 140 Apache OpenOffice 17 2523 11 9 Proxeye 3799 2777 2356 1037 Qmui Android 164 6732 230 71 Sap NetWeaver 252 2866 187 89 Server Adapter 1612 38368 35749 7216 Eclipse Sefin Total 13813 4064 29791 6097 E Contextual Differences Scale Size Reuse RQ5 addition findings results displayed Table V Figure 4 show sizes clones found within various contexts different primary interest larger mean sizes clones within Stack Overflow refer boxplots Figure 4A B larger sizes suggest likelihood clones detected true 
positives ie indeed evidence reuse entire snippets copied Additionally median upper quartile top weekly Java projects clone sizes greater four contexts files included displayed Figure 4 graphs C E F seen greater median upper quartile value indicates newer projects constructed greater extent reused elements Table V average maximum sizes clones found within various contexts presented Interestingly clones terms maximum sizes smaller two analyses looking Stack Overflow OSS projects together 277 324 respectively see code clones found Stack Overflow OSS projects 324 tokens length However looking interproject clones notice maximum values much higher biggest clone consisting 1369 tokens suggests code reuse projects involves copying larger pieces code including entire components contrast Stack Overflow code usually provides smaller code snippets answers specific coding questions evidence may linked reality test statistically significant differences six groups measures refer Table V terms clone sizes KruskalWallis test performed test selected nonparametric nature ie assume data follows Normal distribution require sample sizes equivalent 28
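The Kruskal-Wallis comparison mentioned above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function name `kruskal_wallis_h` and the toy clone sizes are hypothetical, the code computes only the H statistic (with the usual tie correction), and it assumes the pooled values are not all identical. It is not the procedure or data used in the study.

```python
from itertools import chain


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction.

    Groups may have different sizes; no normality assumption is made,
    because only the ranks of the pooled observations are used.
    """
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ties = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += ties ** 3 - ties
        i = j

    # H = 12 / (N (N + 1)) * sum(R_k^2 / n_k) - 3 (N + 1)
    rank_sum_term = sum(
        sum(rank_of[x] for x in group) ** 2 / len(group) for group in groups
    )
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)

    # Divide by the tie correction factor (equals 1 when there are no ties).
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction


# Toy clone sizes in tokens for three hypothetical contexts of unequal size.
context_a = [66, 71, 90, 120, 85]
context_b = [57, 60, 52, 58]
context_c = [58, 61, 75, 64, 59, 140]
print(kruskal_wallis_h(context_a, context_b, context_c))
```

With k groups, H is compared against a chi-squared distribution with k - 1 degrees of freedom, so the six data groups in Table V give 5 degrees of freedom.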
::::
TABLE IV SUMMARY ALLTIME POPULAR JAVA PROJECTS INTERPROJECT
Project                                Number Java files   Average number tokensfile   Number clones   Number files clones
Angry IP Scanner                       219                 397                         102             48
Catacombae                             91                  7586                        223             33
Cyclops Group                          2609                1519                        2545            1291
Eclipse Checkstyle Plugin              1708                319                         3115            782
Freemind                               529                 772                         495             192
Hibernate                              2392                2856                        2148            627
Hitachi Vantara Pentaho                24494               6732                        112415          12008
Libjpegturbo                           12                  20613                       44              7
OpenCV                                 148                 10039                       508             94
Sap NetWeaver Server Adapter Eclipse   239                 7134                        2921            144
Sweet Home 3D                          233                 24083                       1476            142
TurboVNC                               245                 8865                        495             114
Vuze – Azureus                         3639                750                         5784            1461
Weka                                   42                  15051                       66              21
Xtreme Download Manager                155                 8064                        468             71
Total                                  39558               145922                      146990          18983
Averagemean                            24724               912                         91869           11864
::::
TABLE V CLONE SIZE STATISTICS
Data Group                       Median   Mean   Max    Mean Rank
A Stack Overflow                 66       857    938    148697
B Stack Overflow IntraAnswers    69       872    938    154804
C Stack Overflow Top Weekly      57       679    277    110143
D Top Weekly                     60       843    774    134787
E Stack Overflow Top AllTime     58       712    324    116464
F Top AllTime                    58       692    1369   113921

result reveals statistically significant outcome significance level 0.05 providing evidence outcomes different H5 1409 p 001 Given finding examined distributions A B D Table V others C E F post hoc KruskalWallis tests Outcomes confirm significantly bigger clones p 0.05 Stack Overflow Stack Overflow IntraAnswers Top Weekly projects compared distributions alongside results Table V boxplots Figure 4 provide preliminary evidence nature clones terms sizes different different data sets thus plan analyses investigate differences exist
::::
V DISCUSSION IMPLICATIONS Discussion Quality important element development projects particular quality freely available key consideration users However migration code OSS projects online QA platforms complicates assessments Stack Overflow platform instance often acts medium code migrates many projects quality code many projects influenced factors beyond control programmers Furthermore OSS projects often published specific licenses adds additional level complexity terms understanding availability reuse fact users code published QA platforms often lack required understanding code direct implications quality management code reused projects order investigate extent code reuse situations focused Java code Stack Overflow popular OSS projects revisit outcomes answer five research questions RQ1RQ5 RQ1 extent Java code reuse within Stack Overflow results indicate within Stack Overflow approximately 33 Java code sampled least one clone elsewhere website Additionally found 2338 unique license violations could present within answers evidence duplicates Python code also revealed 33 duplication 8 noted however Python code examined Yang et al’s 8 study processed remove effects white space comments increase performance clone detection tools lead better comparisons end outcomes best conservative Java code reuse could actually higher 33 Stack Overflow results study along Yang et al 8 indicate code reuse prevalent Stack Overflow Java Python contexts near identical results obtained two studies suggest users developers Stack Overflow platform expect 3 code Stack Overflow duplicated considering parameter settings code blocks considered candidate clones emphasized clones significant size least 50 tokens Unlike many small snippets found Stack Overflow clones meet specified requirements set analysis likely code blocks clones coincidence rather reused Hence developers need cautious reusing larger code blocks Stack Overflow prepared rigorously evaluate code usage addition instances reuse 
demand proper attribution community aware Stack Overflow knowledge recycled believe tool could utility terms aiding developers wanting evaluate appropriateness code reuse also detecting exactly code originated help correct attribution RQ2 extent code reuse answers published question Stack Overflow observed 18 Java snippets ie code answers least one clone within answers provided question evidence also revealed 26 questions sampled contain least one clone pair answers Furthermore 1763 potential unique license violations sample data insights provided response RQ1 outcome implication developers using Stack Overflow code terms need aware rate code duplication within Stack Overflow overall duplication rate 33 notice significant proportion duplication refers clones answers different questions result developers may give attribution original authors Furthermore cases code blocks migrated external sources duplicates within Stack Overflow may make difficult find original sources Without complete knowledge origin reused code developers may publish OSS different licenses result license violations fact given conservative settings used analyses anticipate reuse rate smaller code snippets may much higher duplicated code identified Stack Overflow process identifying appropriate solution code may expedited since users able avoid duplicated answers repeated duplicate answers may also result convoluted pages could lead slower problem solving developers RQ3 extent code reuse Stack Overflow current popular Open Source Java Projects evidence showed Stack Overflow top weekly Java projects approximately 223 unique code snippets appeared sets files Stack Overflow files snippets appeared total 1627 files evidence shows overall 10 files contain one Stack Overflow clones However noted percentage files containing clones higher compared percentage Stack Overflow files contained code outcome suggests current popular open source Java projects tend use code copied Stack Overflow fact within 
projects discovered approximately 57 files contained clone clones found either within single projects study Koschke et al 26 discovered approximately 72 lines code Open Source Java projects exact clones findings indicate high levels code reuse duplication within Open Source Java projects findings suggest opportunity exists developers reduce intraproject reuse could result less maintainability issues Furthermore developers also consider code reuse occurring projects become acquainted licensing requirements refer Section II RQ4 extent code reuse Stack Overflow alltime popular Open Source Java projects compared Stack Overflow Java code alltime popular OSS projects SourceForge observed 450 unique code fragments evident datasets appear 4334 files total evidence shows approximately 23 files sampled contained least one clone one unique clone every 545 files fact proportion files containing clones quite high approximately 772 containing clones excluding Stack Overflow files Considering outcomes previous work 26 72 code reuse found believe code reuse high popular Open Source Java projects Interestingly percentage files containing clones higher alltime popular projects compared newer top weekly projects thus likely code copied projects could originally came different source hence creating nested code reuse situation Furthermore developers systems may potentially benefit reducing amount reused code thus improving maintainability projects RQ5 differences nature reuse found different contexts terms scale size results show differences sizes clones found across datasets evidence shows reuse done Stack Overflow snippets copied also observed current popular Java projects greater extent reused elements projects believe newer projects may constructed commonly whole elements projects ie mean clone length greater ‘Top AllTime’ group Table V possibly due availability elements perhaps developers willing reuse recent times Similar outcomes reported Android Mobile Apps 25 tends dominate 
recent application development environments Evidence indicates developers’ behaviors potentially changing seeing incorporate larger pieces copied code work effects negative positive resulting copying code amplified projects situations copied code well explained respective sections websites Stack Overflow could lead better quality since functionality well understood tested documented developers However larger pieces code copied pasted without sufficient accompanying documentation eg comments likely question contain code understood developers thus bringing questions functionality reliability debuggability overall quality results also show great degree code duplication alltime popular OSS projects fact scale size reuse generally higher OSS projects evidence understandable given Stack Overflow generally known shorter code snippets aimed answering specific questions Code duplication projects possibly driven use common third party libraries could also intentional duplication similar functionalities fact Stack Overflow snippets also copied suggests reuse may part practitioners’ culture Thus implications making sure correct license used developers aware strengths weaknesses code copied Furthermore backdrop need community develop high quality maintainable secured code developers carefully evaluate code reused Implications investigation shown code clones exist across Javabased projects Stack Overflow clones duplicates within system unavoidable since many elements often rely functionalities However cases many code clones exist possible developers may experience negative sideeffects Firstly important understand high levels code cloning negative effects quality terms inconsistencies code Studies found around half projects investigated clones contained inconsistencies ie clones changed inconsistently many unintentional 15 37 Furthermore works also found 323 code clones represented fault Thus important developers aware levels code clones exist within end believe tracking clones 
could improve overall quality notion tracking clones thus aware shown improve debugging 38 39 Another implication findings relates probable licensing violations Copying code projects websites Stack Overflow without adhering licensing requirements may result complicated legal issues thus developers take caution VI THREATS VALIDITY analyses conducted CCFinderX uses tokenbased approach identify clones technique limitations including lower precision rate compared alternative techniques primarily Abstract Syntax Tree AST techniques 5 Additionally CCFinderX preset parameter settings analyses parameters given specific values used filter texts order identify candidate clones detection clones based code meeting set requirements given CCFinderX possibly leading clones missed particularly important considering worked Stack Overflow data average file token size 48 Thus assume smaller snippets Stack Overflow reused Open Source projects detected thus results could conservative fact reliability checks show many clones smaller sizes refer Table II However code chunks get smaller ability trace back original source becomes challenging Smaller code fragments may also labelled clones accidentally said contextual analyses performed reliability evaluation ascertained code duplicated attributions evidence thus confirms potential future maintenance quality issues possible licensing complications Additionally introduce time element determine direction reuse cannot make conclusive statements regarding temporal copying code Stack Overflow OSS projects terms direction copying ie code copied Stack Overflow OSS projects OSS Stack Overflow Lastly sample projects may representative projects largescale study may produce generalizable insights total number projects SourceForge containing Java code alone 40000 GitHub 35 million available Javabased repositories Thus larger study may help validate results obtained study However initial study completed reflects findings highlyused projects making code 
reuse important element consider VII CONCLUSION FUTURE RESEARCH imperative engineering community develop deliver high quality Improper code reuse practice may create barriers delivery highquality however particularly terms maintainability confirming legal requirements code reuse popular practice engineering community QA forums Stack Overflow fueling practice pertinent understand practice could affect future maintenance correct use license avoid legal issues Towards goal investigated levels code reuse within Stack Overflow Stack Overflow popular OSS projects findings indicated clones reuse exist examined contexts within Stack Overflow Stack Overflow OSS OSS numerous cases code duplication detected setting Outcomes work show projects highly likely contain code copied sources external code Additionally findings similar research conducted mobile apps Python projects levels code reuse studies indicate Java developers need made aware licensing issues problems could arise adhoc copying particular quality assurance activities projects comprehensive could place greater emphasis code reused platforms Stack Overflow stands agreement 40 discussed benefits code clone analysis provide analysis also believe due increased amount external code integrated projects even greater need exists utilizing clone analysis licensing knowledge correct attribution improved code fragments implemented external sources less likely cause licensing violations interproject analyses showed top weekly Java projects greater average token size compared alltime popular Java projects analyze phenomenon timebased comparison code reuse OSS projects could beneficial identifying changes reuse behavior time preliminary results appears newer projects larger pieces reused code could indicate interproject reuse whole components occurring work completed replicated larger sample projects order validate results assess scale reuse generally Additionally research may look beyond scope OSS projects contrast findings 
closed source projects research may also expanded provide insights direction migration clones et al 19 published results code migration Android mobile apps Inoue et al 17 developed tool tracking code open source repositories however dedicated work required investigate direction code migration Stack Overflow portals OSS projects REFERENCES 1 V Aho Lam R Sethi J Ullman Compilers principles techniques tools Harlow Essex Pearson 2014 2 Anon Creative Commons License Deed Available httpscreativecommonsorglicensesbysa30 Feb 2018 3 Anon worry copyright issues code posted Stack Overflow Available httpmetastackexchangecomquestions12527doihavetoworryaboutcopyrightissuesforcodepostedonstackoverflow Feb 2018 4 Mathur H Choudhary P Vashist W Thies Thilagam “An Empirical Study License Violations Open Source Projects” presented 35th Annual IEEE Engineering Workshop DOIhttpdxdoiorg101109sew201224 2012 5 C K Roy J R Cordy R Koschke “Comparison evaluation code clone detection techniques tools qualitative approach” Science Computer Programming vol 74 pp 470–495 2009 6 J Cordeiro B Antunes P Gomes “Contextbased recommendation support problem solving sof Development” Proceedings 3rd IntWorkshop RSSE 2012 7 German Di Penta YG Guenecheu G Antoniol “Code siblings Technical legal implications copying code applications” Proc 6th Working Conference Mining Repositories MSR 2017 DOIhttpdxdoiorg101109msr201713 8 Yang P Martins V Saini C Lopes “Stack Overflow Github Snippets There” Proc 14th International Conference Mining Repositories MSR 2017 DOIhttpdxdoiorg101109msr201713 9 E Johansson Wesslen L Bratthall Host “The importance quality requirements platform developmenta survey” Proc 34th Annual Hawaii International Conference System Sciences 2001 10 Felix Fischer et al “Stack Overflow Considered Harmful Impact CopyPaste Android Application Security” IEEE Symposium Security Privacy SP 2017 11 F Bi “Nissan app developer busted copying code Stack Overflow” May 2016 Available 
httpswwwthevergecom2016511195308dontgetbustedcopyingcodefromstackoverflow 12 H Sajjani V Saini J Svaženjko C K Roy C V Lopes “SourceRecCC” Proc 58th International Conference Engineering – ICDE 16 2016 13 J Mojica B Adams Nagappan Dienst Berger Hassan “A LargeScale Empirical Study Reuse Mobile Apps” IEEE vol 31 2 pp 78–86 2014 DOIhttpdxdoiorg101109ms2013142 14 J R Cordy C K Roy “The NiCad Clone Detector” Presented IEEE 19th International Conference Program Comprehension 2011 15 J Krinke “A Study Consistent Inconsistent Changes Code Clones” Presented 14th Working Conference Reverse Engineering WCRE 2007 16 J C Knight F Dunn “Software quality domaindriven certification” Ann Softw Eng vol 5 pp 293–315 1998 17 K Inoue Sasaki P Xia Manabe “Where code come go — Integrated code history tracker open source systems” Proc 34th International Conference Engineering 2012 18 L Heinemann F Deissenboeck Gleirscher B Hummel Irlbeck “On Extent Nature Reuse Open Source Java Projects” Lecture Notes Computer Science Top Productivity Reuse pp 207–222 2011 19 L Mlouki F Khomh G Antoniol “Stack Overflow code laundering platform” Proc IEEE 24th SANER 2017 20 Sojer J Henkel “Code Reuse Open Source Development Quantitative Evidence Drivers Impediments” Journal Association Information Systems vol 11 pp 868–901 2010 21 Sojer J Henkel “License risks ad hoc reuse code internet” Communications ACM vol 54 pp 74 2011 22 Singh Mittal Kumar “Survey Impact Metrics Quality” International Journal Advanced Computer Science Applications vol 3 2012 23 Mlouki F Khomh G Antoniol “On Detection Licenses Violations Android Ecosystem” Proc IEEE 23rd International Conference Analysis Evolution Reengineering SANER 2016 24 P L Bacchelli Lanza “Leveraging crowd knowledge comprehension development” CSMR IEEE Computer Society 2013 p 57–66 25 R Abdalkareem E Shihab J Rilling “On code reuse StackOverflow exploratory study Android apps” Information Technology vol 88 pp 148–158 2017 26 R Koschke Bazrafshan “SoftwareClone 
Rates OpenSource Programs Written C C” Proc IEEE 23rd SANER 2016 27 Baltes R Kiefer Diehl “Attribution Required Stack Overflow Code Snippets GitHub Projects” Proc IEEEACM 39th International Conference Engineering Companion ICSEC 2017 28 Sawilowsky G Fahoome KruskalWallis Test Basic Wiley StatsRef Statistics Reference Online 2014 29 Haefliger G Von Krogh Speth “Code Reuse Open Source Software” Management Science vol 54 pp 180193 2008 30 H Kan Metrics Models Quality Engineering 2nd ed AddisonWesley Longman Publishing Co Inc Boston USA 2002 31 V Suma TR Gopalakrishnan nair “Effective Defect Prevention Approach process Achieving Better Quality levels” World Academy Science Engineering Technology vol 42 pp 258262 2008 32 Kamiya Kusumoto K Inoue “CCFinder multilingual tokenbased code clone detection system large scale source code” IEEE TSE vol 28 pp 654–670 2002 33 Ægidius Mogensen “Lexical Analysis” Introduction Compiler Design Undergraduate Topics Computer Science pp 1–37 2011 34 V Bauer J Eckhardt B Hauptmann Klimek “An exploratory study reuse google” Proc 1st International Workshop Engineering Research Industrial Practices SERIPs 2014 35 Wb Frakes K Kang “Software reuse research status future” IEEE Transactions Engineering vol 31 pp 529–536 2005 DOIhttpdxdoiorg101109tse200585 36 Kashima Hayase N Yoshida Manabe K Inoue “An Investigation Impact Licenses Copyandpaste Reuse among OSS Projects” Proc 18th Working Conference Reverse Engineering 2011 37 Elmar Juergens Florian Deissenboeck Benjamin Hummel Stefan Wagner “Do code clones matter” Proc IEEE 31st International Conference Engineering 2009 38 Z Li Lu Myagmar Zhou “CPMiner Finding copypaste related bugs largescale code” IEEE Trans Softw Eng vol 32 pp 176–192 2006 39 L Jiang Z Su E Chiu “Contextbased detection clonerelated bugs” Proc ESECFSE ACM 2007 40 C K Roy J Cordy R Koschke “Comparison evaluation code clone detection techniques tools qualitative approach” Science Computer Programming vol 74 7 pp 470495 2009
::::
Reuse maintenance practices among divergent forks three ecosystems John Businge · Moses Openja · Sarah Nadi · Thorsten Berger Accepted 25 October 2021 Published online 4 March 2022 © Authors 2022 Abstract rise social coding platforms rely distributed version control systems reuse also rise Many developers leverage reuse creating variants forking account different customer needs markets environments Forked variants form socalled family share common code base maintained parallel different developers families easily arise within ecosystems large collections interdependent components maintained communities collaborating contributors However little known existence characteristics families within ecosystems especially maintenance practices Improving empirical understanding families help build better tools maintaining evolving families empirically explore maintenance practices forkbased families within ecosystems opensource focus three largest ecosystems existence today Android NET JavaScript identify analyze families maintained together exist official distribution platform Google Play nuget npm well GitHub allowing us analyze reuse practices depth mine identify 38 families 526 families 8837 families ecosystems Android NET JavaScript study characteristics codepropagation practices provide scripts analyzing code integration within families Interestingly results show little code integration across studied families three ecosystems studied families also show techniques direct integration using git outside GitHub commonly used GitHub pull requests Overall hope raise awareness existence families within larger ecosystems calling research better tools support effectively maintain evolve Keywords Cloneandown · Change propagation · Variant synchronisation · Empirical study · Variant developers · Version control systems · Pull requests · Cherrypicking changes · Rebasing changes · Squashing changes · product lines · 
Variants Communicated Federica Sarro John Businge johnxu21gmailcom 1 Introduction increased popularity socialcoding platforms GitHub made forking powerful mechanism easily clone repositories creating new developer may fork mainline repository new forked repository often transforming governance latter new developer preserving full revision history establishing traceability information forking allows isolated development independent evolution repositories traceability allows comparing revision histories instance determine whether one repository ahead ie contains changes yet integrated also allows easier commit propagation across repositories Many studies forking exist often focusing reasons outcomes Nyman et al 2012 Robles GonzálezBarahona 2012 Viseur 2012 Nyman Lindman 2013 Nyman Mikkonen 2011 Zhou et al 2018 Zhou et al 2019 2020 community dynamics influenced forking Gamalielsson Lundell 2014 community typically distinguishes two kinds forks Zhou et al 2020 social forks created isolated development goal contributing back mainline divergent forks created splitting new development branch often steer development another direction without intending contribute back leveraging mainline defines adheres standards Sung et al 2020 Divergent forks relevant supporting largescale reuse—the focus paper Studies divergent forks usually rely general heuristics identify many forks possible without systematically verifying indeed divergent forks Additionally studying code propagation techniques existing studies consider intricacies git identify possible types code propagation eg offline git rebasing without using GitHub focus pull requests address first challenge identifying divergent forks use insight particular ecosystems systematic way publishing “members” ecosystem example Android apps published Google Play store Similarly Eclipse plugins distributed Eclipse marketplace advantage ecosystems member unique ID 
identifies Thus given opensource GitHub repository fork verify whether fork actually independent version original mainline core criteria divergent fork checking mainline fork listed separate entries corresponding distribution platform address second challenge considering git intricacies design technique identifies majority code propagation techniques Git GitHub leveraging commit meta data Inspired notion families aka program families Parnas 1976 Czarnecki 2005 Dubinsky et al 2013 Apel et al 2013 Krueger Berger 2020b Stanculescu et al 2015 Berger et al 2020—portfolios managed similar systems application domain—we use term family family short refer mainline repository corresponding divergent forks refer family member variant present largescale empirical study reuse maintenance practices via code propagation among families ecosystems take considerations account study three largescale ecosystems different technological spaces Android JavaScript NET Android one largest successful ecosystem substantial reuse Mojica et al 2014 Li et al 2016 Sattler et al 2018 Berger et al 2014 JavaScript ecosystem distributes packages npm far largest package manager 182M package distributions (seen Librariesio June 2021) NET ecosystem package management system nuget moderately large 261K packages (seen Librariesio June 2021) three selected ecosystems vary nature apps versus packages programming languages Java JavaScript C sizes terms distribution platforms study addresses two main research questions RQ1 characteristics families ecosystems investigate general characteristics families variants including number variants per family divergence application domains developer ownership variant popularities within families also determine frequencies variant maintenance looking releases numbers allows putting studied maintenance coevolution practices context RQ2 families maintained coevolved ecosystems determine management practices investigate code propagated mainline divergent 
forks family example pull requests used main propagation technique code propagated mainline forks propagation direction study code propagation mechanisms used well kinds changes propagated best knowledge work first provide largescale indepth study codepropagation practices divergent forks Understanding codepropagation strategies exercised developers help building better tool support customization code reuse analyze pairs mainline fork open source projects whose package releases available package distribution platforms three ecosystems Android comprising 38 families NET comprising 526 families JavaScript comprising 8837 families results show majority 82 forks study owned developers different within family distinction ownership gives us confidence studying real divergent forks Interestingly though find little code propagation across mainline–fork pairs three ecosystems studied used code propagation technique git mergerebase used 33 Android mainlinefork pairs 11 JavaScript pairs 18 NET pairs find cherry picking less frequently used 9 09 25 Android JavaScript NET pairs using respectively Among three pull request integration mechanisms studied merge rebase squash used pull request integration mechanism merge option direction fork → mainline 24 7 11 pairs Android JavaScript NET use strategy find integrating commits using squashed rebased pull requests rare three ecosystems Overall find code propagation occurs seems fork developers perform propagation directly git outside GitHub’s builtin pull request mechanism observation implies simply relying pull requests understand code propagation practices divergent forks enough summary work makes following contributions propose leveraging main distribution platforms three ecosystems precisely identify divergent forks devise technique identifying families ecosystems using data GitHub respective distribution platform contrast previous studies code propagation strategies either focused pull requests directly comparing commit 
IDs first study code propagation considering pull requests options squash rebase well git rebased cherrypicked commits analyze prevalence code propagation within families well types propagation strategies used synthesize implications results code reuse tools provide online appendix 2020 containing datasets intermediate results scripts trace code propagation mainlinefork pair earlier version work appeared conference paper Businge et al 2018 focused analyzing code propagation commit level within Android ecosystem also provided preliminary insights reasons different app variants exist article extends conference paper follows First extend analysis two ecosystems moderate large scale Second substantially improve identification code integration methods focusing solely pull requests direct comparison commit IDs Instead first consider types code propagation techniques including rebasing squashing cherrypicking commits Third contribute toolchain analyzing code propagation mainline–fork pair iv provide discussion implications results Parts RQ1 JavaScript ecosystem previously presented workshop paper Businge et al 2020 article additional contributions RQ1 JavaScript ecosystem following First refine JavaScript dataset ensuring mainlinefork pairs exist GitHub npm package manager end eliminate total 2456 mainlinefork pairs either mainline fork deleted GitHub package releases still existed npm package manager Second provide detailed description dataset collected provide full refined dataset replication package Third create additional dataset new families NET ecosystem Fourth addition new characteristic variant ownership well illustrative graph comparisons discuss characteristics mainline–fork pairs across three ecosystems
::::
2 Background Code Propagation Strategies discuss mechanisms offered GitHub similar socialcoding platforms propagate code among different repositories describe characteristics mechanisms kind metadata generate automated identification technique potentially rely mainline forked repository obligation synchronize changes developers commonly propagate code changes eg new features bug fixes among repositories via commit integration Jiang et al 2017 Openja et al 2020 tracing propagation however metadata provided GitHub always reliable instance Kalliamvakou et al 2014 Kononenko et al 2018 found large number pull requests appearing merged actually merged authors find uncommon destination repositories resolve pull requests outside GitHub Table 1 Changes commit metadata code propagation different kinds code propagation GitHub Git facilities Metadata changed Pull Requests Git Commands Merge Squash Rebase Cherrypick Merge Rebase Commit ID Yes Yes Yes Author Name Yes Author Date Yes Committer Name YesNo YesNo YesNo Committer Date Yes Yes Yes Commit Message Yes File details Yes metadata change change metadata work considers commit integration GitHub commit integration directly using git outside GitHub following describe code propagation using GitHub git facilities Table 1 provides details relationship commits across forked repositories based respective code propagation technique used collect information table read official references Vandehey 201923 online resources4 well created toy repositories mimic various integration scenarios order verify information use insights creating code propagation traceability technique described Section 33 21 Propagation GitHub Facilities pull request head ref reference source repository branch developer wants pull commits refer source branch pull request also base ref reference destination repository pulled commits integrated refer destination branch clarity source destination branches may belong repository different repositories studying code 
propagation family mainly interested pull requests one source repository family another destination repository family pull request submitted GitHub developer use user interface integrate commits pull request destination branch using one three options merge pull request commits ii rebase pull request commits iii squash pull request commits Merge pull request commits default developer chooses option commit history destination branch retained exactly seen Table 1 metadata integrated commits source branch remain 2httpswwwatlassiancomgittutorialsmergingvsrebasing 3httpshelpgithubcomengithubcollaboratingwithissuesandpullrequestsaboutpullrequestmerges 4httpscloudfourcomthinkssquashingyourpullrequests unchanged destination branch However new merge commit created destination branch “tie together” histories branches GitHub 2020 Rebase merge pull request commits integrator selects Rebase merge option pull request GitHub commits source branch replayed onto destination branch integrated without merge commit Table 1 see using integration technique commit metadata source destination preserves author name author date commit message alters commit ID committer name committer date committer name becomes name developer destination repository rebased merged pull request Note developer submitted pull request coincidentally developer integrates eg developer works repositories committer name remain GitHub 2020 Squash merge pull request commits integrator selects Squash merge option pull request GitHub pull request’s commits squashed single commit Instead seeing contributor’s commits source branch commits squashed one commit included commit history destination branch Apart file details commit meta data changes committer name changes unless similar original committer developer merging pull request GitHub 2020
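The metadata behaviour described above for the three pull request options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Commit` fields and the function `consistent_pr_options` are hypothetical names chosen for this example, the comparison is simplified to one source commit against one destination commit, and the rules encode only what the prose above states. It is not the authors' tooling.

```python
from dataclasses import dataclass

FIELDS = ("sha", "author_name", "author_date",
          "committer_name", "committer_date", "message", "files")


@dataclass(frozen=True)
class Commit:
    """Hypothetical record of the commit metadata discussed in the text."""
    sha: str
    author_name: str
    author_date: str
    committer_name: str
    committer_date: str
    message: str
    files: frozenset


def changed_fields(src, dst):
    """Return the names of the metadata fields that differ between two commits."""
    return {name for name in FIELDS if getattr(src, name) != getattr(dst, name)}


def consistent_pr_options(src, dst):
    """List the pull request options consistent with the observed metadata changes."""
    changed = changed_fields(src, dst)
    if not changed:
        # Merge option: the integrated commit is identical in both branches.
        return ["merge"]
    options = []
    # Rebase option: commit ID and committer date change, committer name may
    # change, and author name, author date, message and files are preserved.
    if {"sha", "committer_date"} <= changed <= {"sha", "committer_date", "committer_name"}:
        options.append("rebase")
    # Squash option (assumption from the prose): everything except the file
    # details changes; the committer name may stay the same.
    required = {"sha", "author_name", "author_date", "committer_date", "message"}
    if "files" not in changed and required <= changed:
        options.append("squash")
    return options
```

An empty result means that none of the three descriptions fits the pair, for example because the two commits are unrelated.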
::::
22 Propagation Git Facilities Cherry Pick Merge Rebase Commits developer may also rely GitHub user interface instead choose integrate commits source branch destination branch outside GitHub using one git integration commands integrator first locally fetch commits source branch example mainline contains commits wish integrate branch perform integration locally using one four options outlined git merge ii git rebase iii git cherrypick iv Git commands rewrite commit history afterwards push changes corresponding GitHub repository5 Git cherrypick commits Cherry picking act picking commit one branch integrating another branch Commit cherry picking example useful mainline developer creates commit patch preexisting bug fork developer cares bug patch changes mainline cherry pick single commit integrate fork shown Table 1 author name author date commit message file details cherry picked commit remain destination branch commit ID committer name committer date however change Note committer name may remain integrator developer performed original commit source branch Git merge commits Like pull request merge git merge also preserves commit metadata creates extraneous new merge commit destination branch ties together histories branches Git rebase commits Rebasing act moving commits current location following older commit new head newest commit branch Chacon 5 httpswwwatlassiancomgittutorialsmergingvsrebasing Straub 2014b Git rebase deviates slightly rebasing pull requests GitHub change committer information better understand git rebase let us explain illustration based experiments carried lefthand side Fig 1 mainline repository fork repository repository made updates code commits C3 C4 mainline commits F1 F2 fork fork developer observes new updates mainline interesting decides integrate using rebasing rebasing commit history look right side Fig 1 Notice IDs order integrated commits C3 C4 fork branch unchanged However IDs commits F1 F2 change F1’ F2’ case Git rebase like fork 
developer saying “Hey know started branch last week people made changes meantime don’t want deal changes coming mine maybe conflicting pretend made changes today” Vandehey 2019 Git commands rewrite commit history Git number tools rewrite commit history including changing commit messages commit order splitting commits Chacon Straub 2014a commands include git commit amend git rebase HEADN git squash etc commands significantly change history meta data commits integrator uses commands destination repository straightforward way match integrated commits across two repositories Chacon Straub 2014a
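The rebase illustration above (C3 and C4 keep their IDs while F1 and F2 become F1’ and F2’) can be mimicked with a toy model. The sketch below assumes, purely for illustration, that a commit ID is a hash of the parent ID and the commit content; real Git hashes cover more data (tree, author, committer, message). The helper names `commit_id`, `build_history` and `rebase` are hypothetical.

```python
import hashlib


def commit_id(parent_id, content):
    """Toy commit ID: a short hash of the parent ID plus the commit content."""
    return hashlib.sha1(f"{parent_id}:{content}".encode()).hexdigest()[:8]


def build_history(base_id, contents):
    """Chain commits on top of base_id; return a list of (commit_id, content)."""
    history = []
    parent = base_id
    for content in contents:
        parent = commit_id(parent, content)
        history.append((parent, content))
    return history


def rebase(fork_only, onto):
    """Replay the fork-only commits on top of the last commit of `onto`."""
    new_base = onto[-1][0]
    replayed = build_history(new_base, [content for _, content in fork_only])
    return onto + replayed


# Both repositories share the base commit "C2".
mainline = build_history("C2", ["C3", "C4"])   # new mainline commits
fork_only = build_history("C2", ["F1", "F2"])  # commits made in the fork
rebased_fork = rebase(fork_only, mainline)     # C3, C4, F1', F2'
```

In this model the mainline commits appear in the rebased fork with unchanged IDs, while the replayed fork commits keep their content but receive new IDs because their parent changed.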
::::
3 Methodology goal improve empirical understanding maintenance practices specifically code propagation families identify analyze families using data GitHub distribution platforms three ecosystems
::::
31 Identifying Families Given different nature studied ecosystems terms information distribution platform stores information accessed employ different techniques identify Android families versus JavaScript NET families Figure 2 shows overview process extract families Android ecosystem GitHub Google Play families NET JavaScript extracted Librariesio6 311 Identifying Android Families interested identifying families real Android apps evidently used end users Taking GitHub repositories Android apps account would also include toy apps course assignments end identify source repositories apps also exist Google Play mainly match GitHub repositories Google Play apps via unique identifier—the package name contained app manifest file AndroidManifestxml manifest files also declare app’s components necessary permissions required hardware Android version Android app family must unique package name excludes forked repositories package name modified specifically identify Android families using relatively conservative filtering approach follows Using GitHub’s REST API v3 identify 79338 mainline repositories matching following criteria 1 fork 2 repository contains word “Android” namedescriptionreadme 3 forked least twice 4 created 01072019 mined 14122019 used date 01072019 obtain repositories history 5 AndroidManifestxml file 6 description readmemd file 7 number forks geq 2 reduce chance finding student assignments Munaiah et al 2017 ensure collecting realworld apps check identified mainline repositories exist Google Play repository’s AndroidManifestxml file extract app’s package name check existence Google Play total 6httpslibrariesio find 7423 mainline repositories representing actual Google Play app Businge et al 2017 filter duplicate mainline repositories containing AndroidManifestxml files package name duplicates easily arise app’s source code copied without forking Since package names unique Google Play one duplicate repositories actually correspond Google Play app manually 
select one repository duplicates considering repository popularity number forks stars GitHub repository app descriptions GitHub Google Play well developer name GitHub Google Play cases Google Play app description conveniently linked GitHub repository result step discard 1232 repositories left 6191 mainline repositories ensure study repositories enough development history filter mainlines fewer six commits lifetime according median number commits GitHub projects found prior work Kalliamvakou et al 2014 leaves us 4337 mainline repositories filter mainline repositories without active forks commit forking date probably abandoned leaves us 1166 mainline repositories total 12025 active forks altogether remove forks package name mainline forks remain given mainline also remove mainline forks different package names corresponding mainline check existence fork’s package name Google Play order ensure fork also real different Android app leaves us 69 app families comprising 95 forks Finally manual inspection filter forked repositories whose app package name points Google Play app correct app analysis based observation sometimes fork developers copy code including AndroidManifestxml another app without changing package name practice results forked app’s package name pointing app exists Google Play one hosted GitHub repository inspect Readmemd unique commit messages GitHub repository respective Google Play description page Eliminating mismatched apps leaves total 38 app families comprising 54 forked apps—our final dataset answer research questions 312 Identifying JavaScript NET Families family JavaScript NET ecosystems comprises packages libraries applications written respective language Similar Android ecosystem consider packages exist sourcecode repositories GitHub ecosystem’s main distribution channels npm nuget metadata package release package managers npm nuget similar package managers package’s metadata include source repository package GitHub GitLab BitBucket number 
dependent projectspackages number dependencies number package releases package contributors Fortunately data 37 package managers different ecosystems found one central location Librariesio platform periodically collects data different package managers addition metadata specific package given package manager Librariesio also extends package metadata information GitHub example stores Forkboolean field indicates whether corresponding repository package fork field Forkboolean help us identify forked repositories published packages Note different Android ecosystem explicit traceability exist first mine repositories GitHub filter published Google Play contrast NET JavaScript mine families directly Librariesio extract families latest Librariesio data dump release 160 released January 12 2020 metamodel data Librariesio data dump found online7 extract NET JavaScript families Librariesio following steps Using package’s field Platform filter packages distributed nuget npm package managers Next use field Forkboolean identify repositories forks use field Fork Source Name Owner identify fork repository name well parent repository mainline extract fork repositories map published packages nuget npm Next merge sets packages Step 1 Step 2 identify packages make mainlinefork pairs ie fork repository corresponding mainline set Step 2 packages present set Step 1 Using GitHub API verify indeed mainline parent divergent fork still existing GitHub eliminate wrong pairs eg deleted GitHub NET ecosystem identify total 526 families total 590 mainline–fork pairs JavaScript ecosystem identify total 8837 families total 10357 mainline–fork pairs Similar Android families family NET JavaScript contains least one mainline one variant forks 32 Identifying Family Characteristics RQ1 describe identify characteristics identified families variants ie mainlines forks three ecosystems define calculate various metrics follows Note given different nature ecosystems type information available metrics specific 
ecosystems example FamilySize metric calculate variants three ecosystems hand given difference nature Android variants JavaScriptNET packages need calculate variant popularity differently across ecosystems downloads reviews versus dependents dependencies following discuss goal metric calculate Overall look metrics fall general characteristics variants variant maintenance activity variant ownership variant popularity repositories Android ecosystem extract metrics GitHub Google Play store repositories NET JavaScript ecosystems extract metrics GitHub Librariesio Table 3 Section 4 summarizes metrics provides values 7httpslibrariesiodata 321 General Characteristics Family Size record number variants metric FamilySize Table 3 families three ecosystems Note family FamilySize 2 one mainline one fork family FamilySize 3 one mainline two forks Variant Package Dependencies ecosystems provide huge bazaar reused explicit package dependencies Decan et al 2019 Since divergent fork inherits functionality mainline may also continuously synchronize mainline acquire new changes one would expect number package dependencies mainline fork would However would interesting see cases context example fork dependencies could mean fork implementing new features mainline extract number dependencies Librariesio Android extracted dependencies apps Gradle files GitHub Android variant categories Using variant’s metadata available Google Play also determine variant category eg Business Finance Productivity extract description also record whether variants listed category Google Play helps us understand nature variants family 322 Identifying Maintenance Activities JavaScript NET repository many releases shows actively maintained since release indicates either bug fixes new features introduced end interested seeing relationship mainline fork terms number package releases package distribution platforms collect number package releases variants NET JavaScript ecosystems Librariesio metrics related variant 
maintenance activity PackageReleasesMLV mainline variants PackageReleasesFV fork variants Unfortunately package manager variants Android ecosystem Google Play store keep history applications therefore cannot extract variant releases alternative collect variant releases Android ecosystem collect repositories using GitHub API Unfortunately found using GitHub API collect list releases repository returns zeros repositories even repository releases example see Android divergent fork imaeses k98 releases However access fork using GitHub API list releases9 see returns empty list end decided collect package releases variants Android ecosystem 323 Identifying Variant Ownership Characteristics would like identify whether mainline fork variant common owners interesting study since determine whether variant fork started owners mainlines started different developers mainline define 8httpsgithubcomimaesesk9releases 9httpsapigithubcomreposimaesesk9releases owner repository contributor access rights integrating changes repository ie repository committer explained Section 2 based different kinds commit integration techniques might difficult identify original repository given commit especially cases mainline many forks end identify repository committer owner one merged least one pull request since certain contributors access rights repository integrate changes consider mainline fork variant common owners exists least one common owner criteria mainline fork variant least one developer bot merged pull request repositories means ownership criteria relies variant merging least one pull request Since variant pairs Android ecosystem would reduce small dataset variant pairs end apply described method variants NET JavaScript ecosystems moderately large large dataset variant pairs use different criteria identify owners Android variants explain later Since variants published Google Play variant owner identify 89 590 mainline–fork pairs NET ecosystem mainline fork variant merged PR real 
developer JavaScript ecosystem identify 89 10357 mainline–fork pairs mainline fork variant merged PR real developer variant pairs Android ecosystem employ another method identify ownership covers dataset mine ownership Google Play store Google Play store variant attribute developer id dev id name developercompany owner uploads variant updates marketplace 324 Identifying Variant Popularity want understand popularity variants studying terms whether widely used respective ecosystems extract popularity metrics distribution platform studied ecosystems use different popularity measure variants Android ecosystem NET JavaScript Android variants variants Android ecosystem define two popularity metrics number downloads Google Play DownloadsMLV DownloadsFV mainline divergent fork respectively also define two popularity metrics number reviews Google Play ReviewsMLV ReviewsFV mainline divergent fork respectively JavaScript NET variants variants two ecosystems record number packages JavaScript NET depend mainline fork variants DependentPackagesMLV DependentPackagesFV respectively also record number projects GitHub depend mainline variant DependentProjectsMLV DependentProjectsFV respectively variant’s dependent packages projects extracted Librariesio package dependents good way measuring popularity since give indication packages projects interested functionality provided variant 33 Identifying Code Propagation RQ2 Answering RQ2 requires determining whether code propagated among variants family identify code propagation rely categorizing commits history mainline forks based possible types code propagation discussed Section 2 Figure 3 illustrates relationship variants family Specifically demonstrate relationship commits mainline variant family divergent forks identify two broad categories commits 1 common commits exist mainline variant forked variant represent either starting commits existed forking date propagated commits 2 unique commits exist one variant mainline variant fork 
variant pair family first identify common commits identify unique commits follows 331 Identifying Common Commits ensure correctly categorize commits perform following steps exact order commit categorized one step need analyze following steps consider default repository branch mastermain branch mainline forks Inherited commits fork date point time fork variant created point commits fork mainline refer InheritedCommits Fig 3 InheritedCommits purple commits 1 2 3 extract commits either variants collect commits since first commit history fork date PullRequest commits first collect merged pull requests repository identify pull requests whose source destination branches belong analyzed repository pair GitHub API ownerrepopullspullnumber provides information given pull request One identify source destination branches using pull request objects headrepofullname baserepofullname returned json response respectively Based source destination information always identify direction pull request fork → mainline mainline → fork shown Fig 3 pull request collect pull request commits prcommits using GitHub API ownerrepopullspullnumbercommits Regardless pull request gets integrated commit information source repository always identical prcommits Thus always identify pull request commits source repository comparing IDs commits prcommits history source repository tricky part identifying integrated commits destination repository Based information discussed Section 2 summarized Table 1 identify pull request commits destination repository follows Merged pull request commits Based Table 1 commit IDs pull request commits integrated using default merge option change Thus identify commits simply compare IDs prcommits commit history destination repository Rebased pull request commits Recall Table 1 integrated commits rebased pull request different commit IDs destination branch Thus identify rebased commits destination branch comparing remaining unchanged commit metadata author name author date 
commit message file details Squashed pull request commits part squashed pull request’s metadata GitHub records ID squashed commit destination branch merge_commit_sha attribute (footnote 10: https://developer.github.com/v3/pulls/) Using ID identify exact squashed commit destination repository extra verification also compare changed files commits pull request changed files identified squashed commit Git merged commits identifying commits related pull requests analyze remaining unmatched commits identify might propagated directly Git commands Recall Section 2 includes merged rebased cherrypicked commits Git cherrypicked commits locate cherrypicked commits source destination commit histories comparing following commit metadata commit ID author name author date commit message filenames file changes also identify source destination branches cherry picked commits looking committer dates matched commits mark commit earlier committer date source branch later date destination branch Git merged Git rebased commits point already identified integrated pull request commits well cherry picked commits Thus remaining commits ID histories variants must propagated git merge git rebase shown Table 1 Fig 1 commits integrated git rebase exactly ID meta data source destination branch Similarly commits integrated git merge also exact information differentiate gitmerged gitrebased commits finding merge commits two parents marking commits merge commit common ancestor commits integrated git merge differentiation important purposes interested marking types commits propagated commits Thus purposes identify commits integrated via Git rebase Git merge differentiate Similar pull requests types commits may pulled branches However unlike pull requests possible identify variant propagated commit originated nature distributed versioncontrol systems commits multiple repositories central record identifying commits’ origin Since common commits pulled mainline pushed fork repository result fork trying keep sync new
changes mainline make assumption commits identify integrated git merge git rebase pulled mainline variant pushed fork variant
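The matching steps of Section 331 can be sketched roughly as follows. This is only an illustrative sketch, not the authors' scripts: the `Commit` record and the helper names `pr_direction`, `classify_pr_commits`, and `cherry_pick_source` are hypothetical, file-level comparison is omitted, and timestamps are assumed to be ISO-8601 strings in one time zone so that string comparison orders them correctly. The `head.repo.full_name` and `base.repo.full_name` fields are the pull request fields named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    author_name: str
    author_date: str          # ISO-8601 timestamp (assumed same time zone)
    message: str
    committer_date: str = ""  # only needed for cherry-pick direction


def pr_direction(pr, mainline, fork):
    """Infer the PR direction from head.repo.full_name and base.repo.full_name."""
    head = (pr.get("head", {}).get("repo") or {}).get("full_name")
    base = (pr.get("base", {}).get("repo") or {}).get("full_name")
    if head == fork and base == mainline:
        return "fork->mainline"
    if head == mainline and base == fork:
        return "mainline->fork"
    return None  # PR does not connect the analyzed pair


def _metadata(commit):
    # Metadata that stays unchanged when a commit is rebased.
    return (commit.author_name, commit.author_date, commit.message)


def classify_pr_commits(pr_commits, dest_history):
    """Label each PR commit by how it appears in the destination history."""
    dest_shas = {c.sha for c in dest_history}
    dest_meta = {_metadata(c) for c in dest_history}
    labels = {}
    for commit in pr_commits:
        if commit.sha in dest_shas:
            labels[commit.sha] = "merged"      # ID preserved by default merge
        elif _metadata(commit) in dest_meta:
            labels[commit.sha] = "rebased"     # new ID, same metadata
        else:
            labels[commit.sha] = "unmatched"   # left for later steps
    return labels


def cherry_pick_source(a, b):
    """Of two matched commits, the earlier committer date marks the source."""
    return a if a.committer_date <= b.committer_date else b


pr_commits = [
    Commit("a1", "Ann", "2020-01-01T00:00:00Z", "fix crash"),
    Commit("b2", "Bob", "2020-01-02T00:00:00Z", "add feature"),
    Commit("c3", "Cy", "2020-01-03T00:00:00Z", "update docs"),
]
dest_history = [
    Commit("a1", "Ann", "2020-01-01T00:00:00Z", "fix crash"),
    Commit("zz", "Bob", "2020-01-02T00:00:00Z", "add feature"),
]
labels = classify_pr_commits(pr_commits, dest_history)
```

Checking the commit ID first and falling back to metadata mirrors the ordering requirement in the text: a commit categorized in an earlier step is not analyzed again in a later one.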
::::
332 Identifying Unique Commits identify unique commits mainline fork use compare GitHub API (footnote 11) compare GitHub API compares mainline branch fork branch one items return diverged commits comprise number commits given branch say mainline branch ahead branch fork branch well number commits branch behind commits mainline branch ahead fork branch unique commits mainline commits mainline behind fork unique commits fork
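A minimal sketch of how the compare response could be read, assuming the fork branch is passed as the base and the mainline branch as the head, so that `ahead_by` counts commits unique to the mainline and `behind_by` counts commits unique to the fork. The helper names `compare_url` and `unique_commit_counts` and the repository names in the example are hypothetical; the sketch only builds the request URL and parses an already-decoded JSON response, it does not call the API.

```python
def compare_url(fork_repo, fork_branch, mainline_owner, mainline_branch):
    """Build a compare URL with the fork as base and the mainline as head.

    Cross-repository comparisons use the 'owner:branch' form for the head.
    """
    return (
        f"https://api.github.com/repos/{fork_repo}/compare/"
        f"{fork_branch}...{mainline_owner}:{mainline_branch}"
    )


def unique_commit_counts(compare_json):
    """Map the compare response onto the two unique-commit metrics."""
    return {
        "unique_mlv": compare_json["ahead_by"],   # mainline ahead of fork
        "unique_fv": compare_json["behind_by"],   # mainline behind fork
    }


url = compare_url("forkowner/app", "master", "mainowner", "master")
counts = unique_commit_counts({"ahead_by": 3, "behind_by": 5})
```

Swapping base and head would swap the meaning of the two counters, so the direction should be fixed once and used consistently for every pair.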
::::
333 Verifying Commit Categorization Methods verify methods identifying common commits different commit propagation techniques discussed Section 331 two phases first test scripts six toy projects created intentionally include least one example commit propagation technique verify commits correctly categorized Second manually analyze results scripts sample six real mainline–fork pairs part data collection ecosystem provide details online appendix earlier version work conference paper Businge et al 2018 noticed integrated pull requests mainline variant forks rare end testing scripts addition variant forks limited number integrated commits also use social forks lots integrated commits mainline counterparts section discuss following 3 pairs show Table 2
Footnote 11: https://docs.github.com/en/rest/reference/repos#compare-two-commits
Table 2 (first two mainline–fork pairs table source fork destination mainline last mainline–fork pair source mainline destination fork)
Technique              PRs    Commits
Android: dashevo dashwallet, sambarboza dashwallet
PR   Merged              3         13
PR   Squashed           43        194
PR   Rebased             2          6
PR   Unclassified       26        167
Git  Mergerebase         -        405
Git  Cherrypick          -          0
Total                   74        785
NET: flagbug YoutubeExtractor, Kimmax SYMMExtractor
PR   Merged              2          2
PR   Squashed            0          0
PR   Rebased             0          0
PR   Unclassified        0          0
Git  Mergerebase         -          3
Git  Cherrypick          -          1
Total                    2          6
JavaScript: TerriaJS terriajs, bioretics rer3dterriajs
PR   Merged              9        101
PR   Squashed            0          0
PR   Rebased             0          0
PR   Unclassified        0          0
Git  Mergerebase         -       1825
Git  Cherrypick          -         10
Total                    9       1936
dashevo dashwallet sambarboza dashwallet repository sambarboza dashwallet social fork mainline dashevo dashwallet total 445 PRs scripts identifies 74 445 pull requests integrated fork repository sambarboza dashwallet mainline repository dashevo dashwallet show details 74 PRs Table 2 technique identified 3 74 PRs integrated using PR merge option together total 13 commits 43 74 PRs integrated using PR squash option total 194 commits 2 74 PRs used PR rebase option total 6 commits integration option 26 PRs unclassified total 167 identified total 405 commits
integrated using git mergerebase integration option commit integrated using git cherrypick option flagbug YoutubeExtractor Kimmax SYMMExtractor repository Kimmax SYMMExtractor variant fork mainline flagbug YoutubeExtractor total 32 pull requests scripts identifies 2 32 PRs integrated fork repository Kimmax SYMMExtractor mainline repository flagbug YoutubeExtractor see details Table 2 two PRs integrated using merge PR option total two commits integrated also identified total three commits integrated using git mergerebase integration option 1 commit integrated using git cherrypick option TerriaJS terriajs bioretics rer3dterriajs repository bioretics rer3dterriajs variant fork fork bioretics rer3dterriajs total 10 pull requests scripts identifies 9 10 pull requests integrated mainline TerriaJS terriajs fork bioretics rer3dterriajs 9 PRs total 101 commits commits integrated using PR squash PR rebase options total 1825 integrated using option git mergerebase integration option 10 commits integrated using git cherrypick option Given results scripts select identified code propagation techniques manually verify analyzed mainline–fork pair randomly sample pull request identified pull request integration technique returned scripts manually analyze sampled pull requests commits including commit metadata verify correctness identified propagation technique sampled pull requests also randomly select two commits manually analyze make sure correctly classified example pair getodk collect lognaturel collect lognaturel collect social fork script reveals commits pull requests numbered 3531 3462 3434 integrated using merging squashing rebasing respectively manually verify pull requests fact integrated using techniques looking commit metadata Similarly pair dashevo dashwallet sambarboza dashwallet sambarboza dashwallet social fork verify commits pull requests number 421 333 114 integrated using merging squashing rebasing respectively also look results returned integration outside GitHub
git mergerebase git cherrypick example results indicate pair FredJulFlym EtuldanspaRSS EtuldanspaRSS variant fork commits integrated using pull requests 34 five commits integrated using git mergerebase git cherrypicking respectively manually verify five latter commits confirm correctness pair dashevo dashwallet sambarboza dashwallet Table 2 shows pull requests scripts able classify part manual verification find GitHub API indicates integrated destination repository since merge date null deeper investigation discover unclassified pull request commits integrated different branch master branch example pull requests 514 512 fork sambarbozadashwallet integrated branch evonetdevelop mainline repository also observed pull requests integration build test failure Travis CI explains commits missing history master branch scripts could classify integrated commits One would wonder threat construct validity since consider commit integration branches default mainmaster example scenario presented unclassified pull requests integrated development branch “staging” missing main branch since failed integration build test 167 integrated staging branch master branch using integration techniques completely rewrite commit history ie PR mergesquashrebase git mergerebasecherrypick script would always identify commits integrated mainline fork using git mergerebase option script minimizes threat validity unclassified pull requests manually verified data toy projects real projects gives us confidence scripts correctly identify commits integrated different integration mechanisms mainline–fork pair repository 334 Fork Variability Percentage quantify much fork differs mainline define metric variability percentage follows
\[ \text{VariabilityPercentage} = \frac{\text{unique}_{FV}}{\text{unique}_{FV} + \text{CommonCommits}} \times 100 \qquad (1) \]
\[ \text{CommonCommits} = \text{Pull Request commits} + \text{Git commits} + \text{InheritedCommits} \]
shown Fig 3 VariabilityPercentage measures percentage unique commits fork compared commits fork lower percentage
means changes fork either starting commits ie fork make many changes fork date merged commits propagated fromto mainline cases indicate functionality fork much different/variable mainline hand higher VariabilityPercentage indicates specific customizations fork
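As a worked illustration of the metric, the sketch below computes the variability percentage as the fork's unique commits divided by all commits in the fork (unique plus common), times 100, where common commits are the sum of pull request commits, Git commits, and inherited commits. The function name and the example counts are hypothetical and are not taken from the dataset.

```python
def variability_percentage(unique_fv, pr_commits, git_commits, inherited_commits):
    """Percentage of the fork's commits that are unique to the fork."""
    common_commits = pr_commits + git_commits + inherited_commits
    total = unique_fv + common_commits
    if total == 0:
        raise ValueError("fork has no commits")
    return 100.0 * unique_fv / total


# Hypothetical fork: 16 unique commits on top of 144 inherited commits,
# with nothing propagated after the fork date.
example = variability_percentage(unique_fv=16, pr_commits=0,
                                 git_commits=0, inherited_commits=144)
```

In this example the fork scores 10: one commit in ten exists only in the fork, and the rest are shared with the mainline.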
::::
4 Variant Family Characteristics RQ1 present characteristics identified families within ecosystems Table 3 shows metrics defined values 41 General Variant Characteristics Variant Family FamilySize Figure 4 shows number variants ie family size variant families three ecosystems studied see distributions family sizes three ecosystems rightskewed families two members Specifically 28 73 38 families 7731 87 8837 families 475 90 526 families two variants three distributions also show larger families Metric Mean Min Median Max Description FamilySize Android apps 24 2 2 7 Number variants Android family NET apps 21 2 2 7 Number variants NET family JavaScript apps 22 2 2 16 Number variants JavaScript family App Dependencies NET JavaScript PackageDependenciesMLV 404 0 26 140 Number mainline variant packages dependencies Android 23 0 1 49 Number mainline variant packages dependencies NET 118 0 7 267 Number mainline variant packages dependencies JavaScript PackageDependenciesFV 22 0 22 81 Number fork variant packages dependencies Android 20 0 1 25 Number fork variant packages dependencies NET 98 0 6 605 Number fork variant packages dependencies JavaScript App Popularity Android DownloadsMLV 2211K 1 50K 100M Number downloads mainline variant Google Play DownloadsFV 5479K 5 1K 100K Number downloads fork variant Google Play ReviewsMLV 27K 0 547 631K Number reviews mainline variant Google Play ReviewsFV 28K 0 45 161K Number reviews fork variant Google Play App Popularity NET JavaScript DependentPackagesMLV 106 0 0 27K Number packages depend mainline app NET 80 0 2 26K Number packages depend mainline app JavaScript DependedntPackagesFV 04 0 0 19 Number NET packages depend fork app NET 17 0 0 2K Number JavaScript packages depend fork app JavaScript DependentProjectsMLV 133 0 0 33K Number NET projects depend mainline app GitHub 140 0 0 83K Number JavaScript projects depend mainline app GitHub DependentProjectsFV 05 0 0 82 Number NET projects depend fork app GitHub 2 0 0 5K Number 
JavaScript projects depend fork app GitHub Table 3 continued Metric Mean Min Median Max Description App Maintenance NET JavaScript PackageReleasesMLV 146 1 2 188 Number mainline variant packages dependencies NET 15 1 8 1117 Number mainline variant packages dependencies JavaScript PackageReleasesFV 36 1 2 54 Number fork variant packages dependencies NET 4 1 2 341 Number fork variant packages dependencies JavaScript MLV mainline variant FV forked variant Fig 4 Distribution family sizes number variants family three ecosystems variant family contains one mainline variant least one fork variants presented data corresponds 38 families 8837 families 526 families Note yaxes Figs 4b c presented logarithmic scales axes figures also presented different scales visibility purposes rather seldom three ecosystems largest family sizes observe part JavaScript ecosystem identifying variant families different ecosystems observe although Android considered one largest known ecosystems Mojica et al 2014 Li et al 2016 Sattler et al 2018 identifying variant families rather difficult compared packaging ecosystems JavaScript NET studied Android ecosystem compulsory record source repository Android variant Google Play end went lengthy process described Section 311 applying number heuristics GitHub repositories identify families Variant Package Dependencies Fig 5 present two scatter plots showing graph mainline dependencies versus fork dependencies Figures 5a c show scatter plots number fork variant package dependencies yaxis versus number mainline variant package dependencies xaxis Android NET JavaScript variants respectively point scatter plots represents number package dependencies given fork variant yaxis number package dependencies counterpart mainline variant xaxis scatter plots surprising number package dependencies fork corresponding Fig 5 Scatter plots mainline fork variant dependencies packages ecosystems datasets mainline–fork variants 54 mainline–fork pairs Android 590 
mainline–fork pairs NET 10357 mainline–fork pairs JavaScript Note graphs presented different scales visibility purposes mainline correlated confirms fork variants inherit original dependencies mainline However also observe points scatter plots one variant dependencies means variant packages dependencies functionality included counterpart variant Although observation prominent mainline variant since see many points diagonal lines two graphs forks keep sync mainline interesting also fork variants dependencies Followup studies could investigate new functionalities related used dependencies introduced variants Android variant categories Figure 6 shows distribution variants different categories Google Play see 12 54 forks 22 listed different category mainline suggests variants serve different purposes However majority pairs include variants category
::::
42 Variant Maintenance Activity JavaScript NET Figure 7 shows release distributions mainline fork variants JavaScript NET ecosystems point xaxis represents pair sort pairs number mainline package releases Figure 7a shows majority mainline variants multiple releases Specifically 5888 8835 67 mainline variants geq 5 package releases JavaScript package manager fork variants fewer still multiple releases Specifically 2389 10357 mainline variants 23 geq 5 package releases JavaScript package manager Interestingly plot also observe number forks releases mainlines Looking Fig 7b NET variants observe similar distribution like JavaScript Fig 6 Relationship variant categories listed Google Play variant Android Mainline–Fork Pairs mainline–fork pairs share category Different mainline–fork pairs share different category variants Fig 7a results interesting since indicate developers forked variants usually make oneoff package distribution continuously distributing new releases packages emphasizing indeed variant forks Observation 1–RQ1 Families fact exist three ecosystems collected 38526 8837 different families mainlines forks multiple releases number releases significantly higher forks Still indicates latter usually oneshot releases even mainlines
::::
43 Variant Ownership Characteristics Figure 8 shows percentage common owners mainline–fork variant pairs three studied ecosystems Android variants analysis based data collected 54 mainline–fork variant pairs However NET JavaScript variants analysed subset NET JavaScript mainline–variant pairs respectively due criteria set identify variant ownership Section 32 Fig 8 see relatively percentages common Yes common developers across three ecosystems Overall results imply majority forked variants started maintained developers different maintaining mainline counterparts Observation 2–RQ1 majority mainline–fork variant pairs three ecosystems investigated owned different developers 91 Android variants 95 JavaScript variants 92 NET variants implies majority forked variants datasets started maintained developers different maintaining mainline counterparts Fig 8 Variant owners mainline–fork variant pair three ecosystem Yes mainline–fork variant pair common developers mainline–fork variant pair common developers datasets mainline–fork variant pairs 54 Android 985 JavaScript 89 NET ecosystems Note graphs presented different scales visibility purposes 44 Variant Popularity Characteristics Figure 9 shows variant popularity variants three packaging ecosystems Android JavaScript NET Android variants Figure 9a shows variant downloads distribution mainline fork variants point xaxis represents pair sort pairs number mainline downloads observe majority mainline variants quite popular 27 38 mainline variants 71 geq 10K downloads fork variant popularity terms downloads observe 10 54 fork variants 19 geq 10K downloads believe natural mainline variants popular fork counterparts since assume Fig 9 Distributions mainline fork variant variants’ popularity metrics variants three ecosystems Android JavaScript NET datasets 54 mainline–fork pairs Android 10357 mainline–fork pairs JavaScript 590 mainline–fork pairs NET ecosystems released first Google Play Figure 9b shows variant reviews 
distribution mainline fork variants point xaxis represents 12Note Google Play keep release history variants possible obtain first listing date variant pair sort pairs number mainline reviews observe similar distribution number reviews like observed number downloads surprising since previous studies found downloads reviews correlated Businge et al 2019 Overall variant popularity observe gives us confidence data set consists real variants JavaScript NET variants Figs 9c–f present popularity graphs variants two ecosystems NET JavaScript Figure 9c shows dependent packages distributions mainline fork variants point xaxis represents pair sort pairs number mainline dependent packages observe majority mainline variants quite popular 6157 10357 mainline variants 59 least two dependent packages fork variants observe 1624 10357 mainline variants 16 least two dependent packages Figure 9d shows dependent projects distributions mainline fork variants variants JavaScript ecosystem point xaxis represents pair sort pairs number mainline dependent also observe similar distribution number dependent projects observed number dependent packages remaining two graphs Figs 9e f show data NET ecosystem show similar trends observed JavaScript Comparing popularity ecosystems observe mainline variants popular fork variant counterparts surprising since forks clones mainline However Fig 9 three ecosystems interesting observe fork variants popular mainline counterparts followup study would interesting investigate possible explanations variants popular mainline counterparts Comparing popularity variants JavaScript NET ecosystems observe average variants JavaScript ecosystem popular variants NET ecosystem also observe fork variants NET ecosystem less popular fewer dependent packagesprojects variants JavaScript ecosystem followup study would also interesting investigate variants JavaScript families popular variants NET families also fork variant variants JavaScript families popular fork variant 
variants NET families Tables 4 5 present examples showing variant popularity three ecosystems variant maintenance activities NET JavaScript Table 5 columns mainline fork use package names variants since repository names GitHub long tables present two interesting examples variant pairs randomly picked 1 abandoned mainlines first variant pair ecosystems fork variant popular mainline mainline fork mainline downloads fork downloads mainline reviews fork review TobyRich TailorToys 10K 100K 106 1034 appsmartplaneandroid apppowerupandroid opendatakit kobotoolbox 1000K 100K 3049 1527 collect collect Table 5 Example mainline–fork pairs NET JavaScript ecosystems showing statistics popularity maintenance activities mainline fork dependent packages mainline dependent packages fork package releases mainline package releases fork package releases NET FlurlSigned FlurlHttpSigned 3 10 6 10 Ninject PortableNinject 638 19 75 14 JS selenium seleniumserver 97 2046 2 51 gulpistanbul gulpbabelistanbul 5867 11 24 14 JS JavaScript compared last release dates variants ecosystems observed mainlines seem abandoned fork variant continued evolve reason fork variants popular Table 5 also see fork variants releases mainlines 2 Coevolution second pair ecosystems present another interesting case coevolution mainline fork variant continuously maintained popular cases would interesting coevolution variants technical social aspects Technical example investigating variants complementary competing Social learn variant communities Observation 3–RQ1 Although mainline variants popular surprising quite number fork variants also popular also observe fork variants popular mainline counterparts tells us forks studying indeed variant forks used community developers cases NET JavaScript variants Android variants downloaded installed user phones pointed interesting research directions investigated followup studies
::::
5 Code Propagation Families RQ2 far analyzed characteristics families across three ecosystems results RQ1 give us confidence fork variants data set indeed variant forks RQ2 present results variants family coevolve Specifically interested code propagation practices understand variants evolve separately propagate code forking date present results code propagation family variants terms propagated commits differentiating propagation mechanisms explained Sections 2 33 Recall commit types determine various code propagation strategies eg pull requests versus direct integration git Tables 6 7 8 9 show metrics use RQ measure types propagated commits ecosystems Android JavaScript NET applicable specify direction propagated code ie mainline→fork fork→mainline Recall Section 331 differentiate git merge git rebase commits assume integrated git merge git rebase commits direction mainline→fork Tables 7 8 show one metric gitPullMLVFV represent two commit integration types Tables 6–9 show summary descriptive statistics metrics use investigate code Metric Mean Min Median Max Description Android variants mergedPRsMLVFV 031 0 0 15 Number merged PR mainline fork variant mergedPRsFVMLV 009 0 0 4 Number merged PR given fork mainline variant prMergedCommitsMLVFV 833 0 0 427 Number merged PR commits mainline fork variant prMergedCommitsFVMLV 057 0 0 28 Number merged PR commits fork mainline variant prSquashedMLVFV 0 0 0 0 Number squashed PR mainline fork variant prSquashedFVMLV 0 0 0 0 Number squashed PR given fork mainline variant prRebasedMLVFV 0 0 0 0 Number rebased PR mainline fork variant prRebasedFVMLV 0 0 0 0 Number rebased PR given fork mainline variant NET variants mergedPRsMLVFV 0 0 0 3 Number merged PR mainline fork variant mergedPRsFVMLV 02 0 0 13 Number merged PR given fork mainline variant prMergedCommitsMLVFV 02 0 0 30 Number merged PR commits mainline fork variant prMergedCommitsFVMLV 12 0 0 207 Number merged PR commits fork mainline variant prSquashedMLVFV 0 0 0 0 Number 
squashed PR mainline fork variant prSquashedFVMLV 0 0 0 5 Number squashed PR given fork mainline variant prSquashedCommitsFVMLV 01 0 0 14 Number squashed PR commits fork mainline variant prRebasedMLVFV 0 0 0 0 Number rebased PR mainline fork variant prRebasedFVMLV 0 0 0 0 Number rebased PR given fork mainline variant Table 6 continued Metric Mean Min Median Max Description JavaScript variants mergedPRs MLVFV 0 0 0 26 Number merged PR mainline fork variant mergedPRs FVMLV 04 0 0 4 Number merged PR given fork mainline variant prMergedCommits MLVFV 01 0 0 399 Number merged PR commits mainline fork variant prMergedCommits FVMLV 057 0 0 28 Number merged PR commits fork mainline variant prSquashed MLVFV 0 0 0 2 Number squashed PR mainline fork variant prSquashed FVMLV 0 0 0 21 Number squashed PR given fork mainline variant prSquashedCommits MLVFV 04 0 0 52 Number squashed PR commits mainline fork variant prSquashedCommits FVMLV 0 0 0 109 Number squashed PR commits fork mainline variant prRebased MLVFV 0 0 0 2 Number rebased PR mainline fork variant prRebased FVMLV 0 0 0 3 Number rebased PR given fork mainline variant prRebasedCommits MLVFV 04 0 0 4 Number rebased PR commits mainline fork variant prRebasedCommits FVMLV 0 0 0 25 Number rebased PR commits fork mainline variant propagation commit level three ecosystems Android JavaScript NET 51 Pull Request Propagation Commit Integration Inside GitHub present results pull request integration techniques merge rebase squash well unclassified PRs mainline–fork pairs three ecosystems Android JavaScript NET Table 6 results summary statistics Table 7 present details summary statistics also present distributions integration directions Fig 10 Figure 10 shows box plots showing distributions different PR integration techniques example variants Android ecosystem distribution PR integration directions mainlines → fork fork → mainline shown Fig 10a one pull request direction integration pull requests integrated using PR merge option PR 
integrated using PR integration options see boxplots majority mainline–fork variant pairs zero PRs integrated either direction implies pairs integrate PRs Table 7 Number mainline–fork pairs pull requests involved code propagation dataset 54 mainline–fork pairs 10357 mainline–fork pairs 590 mainline–fork pairs ecosystems Android JavaScript NET respectively Mainline→ Fork Fork→ mainline Pairs PRs Commits Pairs PRs Commits Android variants PR Merged 1 1 5 1 2 427 Rebased 0 0 0 0 0 0 Squashed 0 0 0 0 0 0 Unclassified 0 0 0 0 0 0 Git Cherrypick 5 na 250 4 na 136 gitPullMLVFV 18 na 13198 na na na NET variants PR Merged 9 13 96 67 139 721 Rebased 0 0 0 0 0 0 Squashed 0 0 0 13 21 72 Unclassified 0 0 0 3 3 9 Git Cherrypick 15 na 99 16 na 138 gitPullMLVFV 106 na 5601 na na na JavaScript variants PR Merged 99 162 1862 724 1394 4523 Rebased 1 1 4 11 13 67 Squashed 5 6 72 132 250 1048 Unclassified 7 10 33 23 32 134 Git Cherrypick 95 na 275 91 na 251 gitPullMLVFV 1180 na 40001 na na na example Android apps first row direction mainline→ fork 1 fork variant merged 1 PR mainline containing 5 commits direction fork→ mainline 1 mainline merged 2 PRs containing 427 commits Table 7 shows details summary statistics distributions example top section Table 7 Android variants first row observe 1 54 mainline–fork variant pairs integrated 1 PR total 5 commits using merge pull request option direction mainline→ fork row direction fork→ mainline observe 1 mainline–fork pair integrated 2 PRs total 427 commits using merge pull request option direction fork→ mainline see Android variants 1 54 19 mainline–fork pairs integrated commits using merge pull request option observe less similar trends mainline–fork variants pairs two ecosystems JavaScript mainline–fork variant pairs observe 99 10357 mainline—fork variant pairs 1 Table 8 Git based outside GitHub code propagation practices commit level 54 mainline–fork pairs 10357 mainline–fork pairs 590 mainline–fork pairs Android JavaScript NET ecosystems 
respectively Metric Mean Min Median Max Description Android variants gitCherrypickedMLVFV 46 0 0 168 Number git cherrypicked commits mainline fork variant gitCherrypickedFVMLV 25 0 0 75 Number git cherrypicked commits fork mainline variant gitPullMLVFV 244 0 0 6567 Number git mergedrebased commits mainline fork variant NET variants gitCherrypickedMLVFV 15 0 0 42 Number git cherrypicked commits mainline fork variant gitCherrypickedFVMLV 04 0 0 148 Number git cherrypicked commits fork mainline variant gitPullMLVFV 95 0 0 2317 Number git mergedrebased commits mainline fork variant JavaScript variants gitCherrypickedMLVFV 46 0 0 168 Number git cherrypicked commits mainline fork variant gitCherrypickedFVMLV 0 0 0 70 Number git cherrypicked commits fork mainline variant gitPullMLVFV 37 0 0 6035 Number git mergedrebased commits mainline fork variant integrating commits using merge pull request option direction mainline→fork 724 10357 mainline–fork pairs 7 direction fork→mainline observe mainline–fork variant pairs JavaScript packaging ecosystem integrating commits using pull request squashrebase options either integration directions mainline–fork variant pairs NET ecosystem observe 9 590 mainline–fork pairs 15 67 590 mainline–fork pairs 113 integrating commits using merge pull request option direction mainline→fork fork→mainline respectively observe commits integrated using rebased pull request option either integration direction commits integrated using squash pull request option observed integration direction fork→mainline accounting 13 590 mainline–fork pairs 2 observe mainline–fork variant pairs integrating commits direction fork→mainline opposed mainline→fork irrespective PR integration option used Android variants observed 1 pair either direction 19 JavaScript variants 867 10357 mainline–fork pairs 84 direction fork→mainline 105 10357 mainline–fork pairs 14 Regarding pull request integration options see merge pull request option clearly frequently used integration 
directions three ecosystems three packaging ecosystems squash rebase options rarely used However comparing two PR options squash rebase observe squash PR option used often Table 9 Unique commits variability percentage 54 mainline–fork pairs 10357 mainline–fork pairs 590 mainline–fork pairs Android JavaScript NET ecosystems respectively Metric Mean Min Median Max Description Android variants unique MLV 1122 0 228 18961 Number unique commits mainline variant given mainline–fork pair unique FV 983 1 16 1646 Number unique commits fork variant given mainline–fork pair InheritedCommits 1884 10 755 29110 Number common commits given fork mainline variant VariabilityPercentage 15 0 27 938 Percentage unique commits according 1 NET variants unique MLV 1022 0 3 10789 Number unique commits mainline variant given mainline–fork pair unique FV 162 0 5 605 Number unique commits fork variant given mainline–fork pair InheritedCommits 2245 0 421 20538 Number common commits given fork mainline variant VariabilityPercentage 20 0 11 99 Percentage unique commits according 1 JavaScript variants unique MLV 335 0 3 10223 Number unique commits mainline variant given mainline–fork pair unique FV 128 0 5 1229 Number unique commits fork variant given mainline–fork pair InheritedCommits 1115 14 32 66861 Number common commits given fork mainline variant VariabilityPercentage 223 0 14 99 Percentage unique commits according 1 Observation 1–RQ2 Code propagation using PRs rarely used mainline–fork variant pairs three ecosystems studied Unsurprisingly observed PRs direction fork → mainline direction mainline → fork However although low numbers observed PRs direction mainline → fork also observed three ecosystems used integration option far merge PR option squash rebase PR option less frequently used mainline–fork variant pairs three ecosystems although squash PR option used rebase PR option low numbers could attributed fact fork variants created submit PRs diverge away mainline solve different problem 
followup study involving user study could investigate motivation behind fork variant creation limited collaboration mainline fork variants 52 Git Propagation Commit Integration Outside GitHub section present results commit integration outside GitHub relating git cherrypick git mergerebase gitPullMLVFV summary statistics two commit integration techniques presented Table 8 Table 7 detailed results corresponding summary statistics Table 8 presented first present results git cherrypick follow results git mergerebase git cherrypick commit integration Like stated Section 33 commits cherrypicked mainline two directions mainline→fork fork→mainline two metrics gitCherrypickedMLVFV gitCherrypickedFVMLV Table 8 corresponding two commit integration directions mainline→fork fork→mainline respectively three ecosystems Fig 11 present boxplot distributions corresponding results Table 8 see distributions show outliers meaning pairs cherrypicked commits detailed statistics Table 7 reveal results example upper part Table 7 presenting Android variants see 5 54 mainline–fork pairs 9 integrated total 250 commits direction mainline→fork direction fork→mainline 4 54 mainline–fork pairs 74 integrating total 136 commits Like results pull request integration presented earlier also clearly see commit integration using git cherrypick rarely used mainline–fork variant pairs three ecosystems studied Unlike pull request integration developer sync upstream downstream new changes git cherrypick developer search specific commits integrate requires first look pool new changes identify ones interest cherrypick mainline fork variant diverged solving different problems finding interesting commits new changes might laborious hypothesize could one reasons numbers commits observed mainline–fork variant pairs three ecosystems follow study confirm refute hypothesis would add value study git mergerebase commit integration Table 8 see metric gitPullMLVFV representing git mergerebase commit integration 
direction mainline→fork three ecosystems see medians metric three ecosystems zeros Figure 11 shows three boxplots showing distributions gitPullMLVFV metric mainline–fork variant pairs three ecosystems boxplots also observe medians zeros Table 7 present detailed statistics metric gitPullMLVFV Android mainline–fork variant pairs observe 18 54 mainline–fork pairs 33 total 13198 commits integrated direction mainline→fork NET mainline–fork variant pairs observe 106 590 mainline–fork pairs 18 total 5601 commits integrated direction mainline→fork finally JavaScript mainline–fork variant pairs observe 1180 10357 mainline–fork pairs 11 total 40001 commits integrated direction mainline→fork see although git mergerebase still rarely used mainline–fork variants three ecosystems used two options pull requests git cherrypick conclude git mergerebase used code integration mechanism variants variant families speculate lack integration mainline–fork variant pairs could result variants diverging solve different problems solved mainline counterparts Observation 2–RQ2 Like integration technique using PRs also observe git mergerebase git cherrypick integration techniques also less frequently used variants three ecosystems However observe integration using git mergerebase commonly used integration mechanism mainline–fork variants three ecosystems occurs integration direction mainline→fork general followup study investigate variants share code would reveal reasons low numbers integration
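The git merge/rebase shares quoted above can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: the counts are copied from the Table 7 figures reported in this section, while the names GIT_PULL_MLV_FV, share_percent, and summarize are hypothetical and not part of the study's toolchain.

```python
# Counts quoted in the text for the gitPullMLVFV metric (direction mainline -> fork):
# (pairs that integrated at least one commit, total pairs, total commits integrated)
GIT_PULL_MLV_FV = {
    "Android": (18, 54, 13198),
    "NET": (106, 590, 5601),
    "JavaScript": (1180, 10357, 40001),
}


def share_percent(pairs_integrating: int, total_pairs: int) -> float:
    """Percentage of mainline-fork pairs that integrated at least one commit."""
    return 100.0 * pairs_integrating / total_pairs


def summarize(counts: dict) -> dict:
    """Round each ecosystem's share to a whole percent, as the text reports it."""
    return {
        ecosystem: round(share_percent(pairs, total))
        for ecosystem, (pairs, total, _commits) in counts.items()
    }


if __name__ == "__main__":
    for ecosystem, pct in summarize(GIT_PULL_MLV_FV).items():
        print(f"{ecosystem}: {pct}% of pairs used git merge/rebase")
```

The rounded shares agree with the 33, 18, and 11 percent figures stated for Android, NET, and JavaScript.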
::::
521 Fork Variability Percentage section presents results variability percentage metric VariabilityPercentage fork variants three ecosystems Table 6 present summary statistics metrics used calculate VariabilityPercentage 1 Figure 12 presents distributions metric VariabilityPercentage fork variants three ecosystems see medians 27 11 14 variants three ecosystems Android NET JavaScript respectively high value metric VariabilityPercentage implies fork differs mainline counterpart fork variants Android ecosystem observe quite number forks 35 54 35 high VariabilityPercentage ≥ 10 fork variants NET ecosystem also observe majority forks 281590 53 high VariabilityPercentage 10 Lastly fork variants JavaScript ecosystem also observe quite majority forks 607610357 58 relatively high VariabilityPercentage 10 Distribution fork variability percentage— VariabilityPercentage variants three ecosystems datasets 54 fork variants 10357 fork variants 590 fork variants ecosystems Android JavaScript NET respectively Observation 3–RQ2 majority fork variants three ecosystems Android JavaScript NET highly differ mainline counterparts ie higher numbers unique commits findings forks variants differing mainlines could used support earlier finding relating limited commit integration mainline–fork variant pairs three ecosystems 53 Summary presented results code propagation practices among mainline–fork variant pairs three ecosystems Android NET JavaScript Overall studied mainline–fork variant pairs three ecosystems observe infrequent code propagation regardless type propagation mechanism direction used code propagation technique git mergerebase used 33 Android mainlinefork pairs 11 JavaScript pairs 18 NET pairs integration using pull requests developers often integrate code direction fork → mainline compared direction mainline → fork mainline–fork variants code integration direction mainline → fork often done using merge pull request option git mergerebase outside GitHub Moreover squash rebase 
pull request options less frequently used mainline–fork variant pairs although squash PR option used rebase pull request option Finally comparing fork variability percentage observed high percentage difference fork variants mainline counterparts indicated higher number unique commits results consistent across variants three ecosystems ie Android JavaScript NET studied findings potentially indicate fork variants created intention diverging away mainline solve different problem ie intention sync way original mainline Future studies could investigate motivation behind fork variants’ creation limited collaboration mainline fork variants
::::
6 Discussion Implications observations two research questions several implications future research coevolution families respective tool support Implications Identifying Variant Forks opposed previous studies relied heuristics applied GitHub repositories identify Variant forks study ensure members variant family represent different variants marketplace Google Play JavaScript NET Relying heuristics applied GitHub repositories find variant forks may false positives ie fork classified variant fork yet social fork method identifying divergent forks reused researchers interested studying variant families ecosystems including operatingsystem packages eg Debian packages Berger et al 2014 ecosystems established programming languages fact popular programming language today JavaScript Java PHP NET Python many package managers available host hundreds thousands packages details package managers found Librariesio platform used identify extract details variant families JavaScript NET ecosystem Librariesio references packages 37 package managers one obtain families different ecosystems Implications Forking Studies Observation 1–RQ2 Observation 2–RQ2 suggest studied divergent forks direct integration using git outside GitHub commonly used GitHub pull requests implies simply relying pull requests understand code propagation practices divergent forks enough Furthermore seems integration using git rebase common per Observation 2–RQ2 Rebasing complicates git history empirical studies consider rebasing may report skewed biased inaccurate observations Paixão Maia 2019 Thus addition looking beyond pull requests studying code propagation studies must also consider rebased commits paper contribute reusable tooling identifying rebased commits Implications Integration Support Tools Regardless integration technique used findings based variants three ecosystems studied suggest code propagation rarely happens fork mainline datasets observe 35 54 mainline–fork pairs 21 590 mainline–fork pairs 
115 10357 mainline–fork pairs integrated commits using least one commit integration techniques three ecosystems Android NET JavaScript respectively lack integration may problematic since fork variants may rely correct functionality existing code mainline means bugs exist mainline also exist forks unless bug fixes propagated one variant However current integration techniques Lillack et al 2019 Krueger Berger 2020a Krueger et al 2020 necessarily facilitate finding bug fixes example code integration using pull requests git merge rebase may best integrating changes variant forks since involve syncing upstream downstream changes missing current branch Alternatively cherry picking probably suitable bug fixes since developer choose exact commits want integrate However GitHub’s current setup make easy identify commits cherrypick without digging branch’s history identify relevant changes since last code integration result difficulty finding commits cherrypick developers may end fixing bugs would result duplicated effort wasted time check possible duplication effort occurs data set looked unique commits variants indeed found developers independently update files shared variants example mainline–fork variant pair k9mail k9 imaeses k9 shared file ImapStorejava path srccomfsckk9mailstoreImapStorejava touched 15 different developers 142 commits mainline variant fork variant touched one developer 9 different commits possible developers could fixing similar bugs existing shared artifacts Moreover study Jang et al 2012 reports parallel maintenance cloned code bug found one clone exist clones thus needs fixed multiple times Furthermore result different developers changing shared files possible developers integrate mainline fork code “fear merge conflict” relation conjecture several studies reported merging diverged code repositories laborious result merge conflicts Stanciulescu et al 2015 Brun et al 2011 de Souza et al 2003 Perry et al 2001 Sousa et al 2018 Mahmood et 
al 2020 Silva et al 2020 end would interesting future research interview developers forks forks determine whether lack support cherry picking bug fixes specific functionality indeed contribute lack code propagation case developing patch recommendation tool inform developers possible interesting changes soon introduced one variant recommend variants family help save developers’ efforts recent work Ren et al 2018 focused providing mainline facilities explore nonintegrated changes forks find opportunities reuse one step towards direction work opens opportunities applying tools since mentioned respect identifying divergent forks provide technique identifying forks combining information GitHub ecosystem’s main delivery platform well mention various ecosystems similar strategy adopted Finally limited sharing changes give rise quality issues specifically investigate propagation test cases might propagated well Developing techniques propagating test cases within families could significantly enhance quality variants within families potential testcase propagation recently pointed preliminary study Mukelabai et al 2021 Implications Future Research work first perform largescale empirical study practices used manage families within ecosystems results give rise following open research questions could addressed follow studies understand evolution families two variants family results RQ1 showed quite number families FamilySize two variants ie mainline two fork variants However study concentrated practices used manage mainlinefork pairs example look forkfork pairs given family looking holistic evolution families two variants would interesting extend study families study evolution family Variant dependencies RQ1 observed variant pairs three ecosystems one mainline fork variant pair dependencies implies variant dependencies implements new functionality relating extra dependencies missing counterpart would interesting investigate whatwhy new functionality missing counterpart variant 
Another interesting research relating dependencies would investigate variants family updated code depend new releases common dependencies variants family still dependent old releases dependencies Updating code implement new release dependency may involve fixing incompatibilities especially new release dependency involves breaking change avoid effort duplication tool could developed could help transplanting patches related incompatibility fixes variants family yet migrated code new APIbreaking change release common dependency Limited sharing changes unique commits RQ2 observed limited sharing changes unique commits mainline–fork variant pairs three ecosystems hypothesized one possible reasons could variants diverging solve different problems also stated fork variants could created support new technology serve different community target different content support frozen feature mainline Fork variants created reasons likely little share mainline variants would interesting carry study involving mixed methods quantitative user studies verify hypothesis Impediments coevolving variants families Like study Robles GonzálezBarahona 2012 dataset also observed mainline–fork variant pairs continue coexist others one variants pair abandoned continues evolve Followup study conducted investigate impediments coevolving variants Inspirations leveraged studies coevolution Eclipse platform thirdparty plugins Businge et al 2012a 2013 2010 2012b 2015 Businge et al 2019 Kawuma et al 2016
::::
7 Related Work discuss related work variant forking ii code propagation forked projects well discuss iii general studies forking 71 Variant Forking understand variants variant families RQ2 explored reasons forks created existing studies variant forks done preGitHub days SourceForge advent social coding environments Nyman et al 2012 Robles GonzálezBarahona 2012 Viseur 2012 Nyman Lindman 2013 Laurent 2008 Nyman Mikkonen 2011 studies reported controversial perceptions around variant forks preGitHub days Chua 2017 Dixion 2009 Ernst et al 2010 Nyman Mikkonen 2011 Nyman 2014 Raymond 2001 However Zhou et al 2020 recently report perceptions changed advent GitHub PreGitHub days variant forks frequently considered risky projects since could fragment community lead confusion developers users Jiang et al 2017 state although forking controversial traditional open source OSS community encouraged builtin feature GitHub authors report developers carry social forking submit pull requests fix bugs add new features keep copies Zhou et al 2020 also report variant forks start social forks Robles GonzálezBarahona 2012 comprehensively study carefully filtered list 220 potential forks different projects referenced Wikipedia authors assume fork significant reference appears English Wikipedia found technical reasons discontinuation original common reasons creating variant forks accounting 273 20 respectively recently Zhou et al 2020 interviewed 18 developers variant forks GitHub understand reasons forking modern social coding environments explicitly support forking authors report motivations observed align prior studies works studied forks type limited specific technological space eg web applications mobile apps paper different focuses Android apps triangulating data GitHub Google Play study realworld apps Specifically study variant reuse practices RQ2 different studies Zhou et al 2020 Robles GonzálezBarahona 2012 investigate additional phenomena code propagation RQ3 Another difference 
current study study Zhou et al 2020 heuristics two studies employ determine variant forks Zhou et al 2020 classify forks GitHub variant forks using following heuristics contain phrase “fork of” description ii received least three external pull requests iii least 100 unique commits iv least one year development v changed name work use external validation fork listed Google Play different package name use description verify app indeed variant mainline 72 Code Propagation Practices studies investigated code integration given repository forks Stanciulescu et al 2015 studied forking GitHub using case study Marlin open source firmware 3D printers authors observed many forked variants share changes mainline However work differentiate social variant forks Thus know whether observed prevalent code propagation simply due fact social forks created main goal contributing back original Zhou et al 2019 current paper interested variant forks Recently Zhou et al 2020 observed 16 15306 studied variant forks ever synchronized merged changes mainline repository However based discussed threats validity seems authors relied common commit IDs identify shared commits explained Section 2 several integration techniques result propagated commits different commit IDs Thus relying commit ID may result missing shared commits mitigate problem work identifies integrated commits preserve commit ID well may integrated using techniques change commit ID Another study code propagation practices work Kononenko et al 2018 authors considered three types commit integration GitHub merge cherrypick merge commit squashing comparison study study commit squashing look techniques authors consider like GitHub rebase squash pull requests well git merge rebase Code propagation practices necessarily context forks example German et al 2016 investigated Linux uses Git authors stated code changes variant track proliferation code repositories developers modify “rebase” filter “cherrypick” history changes streamline 
integration repositories developers end authors presented method continuousMining crawls known git repositories multiple times day record analyze changesets authors state continuousMining yields complete git history also catches phenomena variant study rebasing cherrypicking continuously capture “live” history able capture rebased cherrypicked commits context forked projects relying commit meta data thorough investigation meta data changes depending propagation strategy 73 Studies Forking Gamalielsson Lundell 2014 studied long term sustainability Open Source communities Open Source projects involving fork authors study based LibreOffice fork OpenOffice wanted understand Open Source communities affected forking authors undertook analysis LibreOffice related OpenOffice Apache OpenOffice projects reviewing documented information quantitative analysis repository data well first hand experiences contributors LibreOffice community results strongly suggested longterm sustainable LibreOffice community signs stagnation LibreOffice 33 months fork also reported good practice respect governance Open Source projects perceived community members fundamental challenge establishing sustainable communities Nyman Nyman 2014 interviewed developers understand views forking findings interviews differentiate good forks revive abandoned programs ii experiment customize existing programs iii minimize tyranny resolve disputes allowing involved parties develop versions program vs bad forks create confusion among users ii add extra work among developers including duplication efforts increased work attempting maintain compatibility
::::
8 Threats Validity Internal Validity identify four issues could threaten internal validity results 1 Section 31 heuristics used app family data identification Steps 2 6 resulted mismatch mapping forks GitHub Google Play mitigated threat carrying manual analysis Section 31–Step 7 discarded mismatched apps steps carried Android variant’s data collection manual errors could affect results 2 Although observe cases developer changed message cherrypicked commits acknowledge algorithm able identify cases instead algorithm identify unique commits respective variants 3 also acknowledge tool chain may miss commits integrated using one integration technique example Section 333 presented unclassified merged pull requests listed GitHub API merged yet merged master branch discovered pull requests integrated different branch mainline failed build integration tests end integrating commits fork → mainline “best practice” developers may wish first integrate commits different branch say staging branch perform integration test later integrate master However following “best practice” explained developer first integrates development branch using one commit integration technique Thereafter developer may wish integrate commits master using different technique changes original integrator’s metadata example cherrypicking case toolchain miss commits 4 Section 22 also stated scripts able identify integrated commits integrator uses git commands rewrite commit history However like stated Section 333 believe practice rewriting contributions community likely rare experienced developers since rewriting changes commit authorship 5 Step 6 Section 31 eliminated Android mainlines least one fork different package name Google Play store means eliminate fork variants created different markets Google play However unlike Google play one use app’s package name unique ID Google play markets anzhi apkmirror appsapk implement strategy means cannot easily identify correct app given GitHub repository Therefore 
intentionally focus Android apps distributed Google play store limits number Android families able identify Construct Validity calculation variability percentage fork variants treats commits way irrespective number files touched example commit touched 100 files treated one touched file may misleading measure provides indication unique development activity External Validity analyzed 54 Android mainline–fork variant pairs exists millions android applications Google Play Android markets means results might representative Android applications However also analyze mainline–fork variant pairs two ecosystems also show similar results behavior
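The limitation described in threat 2 can be illustrated with a small sketch of metadata-based commit matching. This is not the study's toolchain: it assumes, for illustration only, that an integrated commit whose ID changed (for example through cherry-picking or rebasing) still carries the original author, author date, and message. The names Commit, metadata_key, and classify, and the sample commits, are hypothetical.

```python
from typing import NamedTuple


class Commit(NamedTuple):
    sha: str
    author: str
    author_date: str
    message: str


def metadata_key(commit: Commit) -> tuple:
    """Metadata assumed to survive an integration that rewrites the commit ID."""
    return (commit.author, commit.author_date, commit.message.strip())


def classify(mainline: list, fork: list) -> dict:
    """Split fork commits into shared (same ID), shared (rewritten ID), and unique."""
    mainline_shas = {c.sha for c in mainline}
    mainline_keys = {metadata_key(c) for c in mainline}
    result = {"same_id": [], "rewritten_id": [], "unique": []}
    for commit in fork:
        if commit.sha in mainline_shas:
            result["same_id"].append(commit.sha)
        elif metadata_key(commit) in mainline_keys:
            result["rewritten_id"].append(commit.sha)
        else:
            result["unique"].append(commit.sha)
    return result


# Hypothetical sample data: c9 copies b2 unchanged, e7 copies a1 with an edited message.
mainline = [
    Commit("a1", "Ann", "2020-01-01T10:00:00", "Fix IMAP reconnect"),
    Commit("b2", "Bob", "2020-01-02T09:00:00", "Add OAuth login"),
]
fork = [
    Commit("a1", "Ann", "2020-01-01T10:00:00", "Fix IMAP reconnect"),
    Commit("c9", "Bob", "2020-01-02T09:00:00", "Add OAuth login"),
    Commit("e7", "Ann", "2020-01-01T10:00:00", "Fix IMAP reconnect (fork wording)"),
]
result = classify(mainline, fork)
```

Under these assumptions, a cherry-picked commit whose message was edited falls into the unique bucket, which mirrors the misclassification acknowledged in threat 2.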
::::
9 Conclusion presented largescale exploratory study reuse maintenance practices via code propagation variant forks mainline counterparts ecosystems subject ecosystems cover different technological spaces Android JavaScript NET part study designed systematic method identify real variant forks well identified analyzed families variants maintained together exist official package distribution platforms Google play nuget npm well GitHub allowing us analyze reuse practices depth variants given ecosystem mined sources information—from GitHub package distribution site—to study characteristics including variations codepropagation practices Android ecosystem identified 38 families total 54 mainline–fork pairs NET ecosystem 526 families 590 mainline–fork pairs JavaScript ecosystem 8837 JavaScript families 10357 mainline–fork pairs provide toolchain analyzing code integration mainlinefork variant pair Regardless integration technique used findings suggest code integration rarely happens fork mainline study Android ecosystem observed 19 54 35 integrated commits using least one commit integration techniques discussed NET ecosystem observed total 126 590 mainline–fork pairs 21 integrated commits using least one commit integration techniques JavaScript ecosystem observe total 1189 10357 mainline–fork pairs 115 integrated commits using least one commit integration techniques Overall analyzed variant forks GitHub two main reasons 1 many previous studies focused social forks 2 studies variant forks conducted preGitHub days SourceForge future would interesting investigate middle ground variant forks social forks example one could investigate practices observed variant forks different social forks Acknowledgements thank Serge Demeyer comments earlier drafts work John Businge’s work supported FWOVlaanderen FRSFNRS via EOS 30446992 SECOASSIST Thorsten Berger’s work supported Swedish research council Wallenberg Academy Sarah Nadi’s research undertaken part thanks funding Canada Research 
Chairs Program Open Access article licensed Creative Commons Attribution 40 International License References Online appendix 2020 httpsgithubcomjohnxu21emse2020 GitHub 2020 pull request merges httpshelpgithubcomengithubcollaboratingwithissuesandpullrequestsaboutpullrequestmerges Apel Batory Kastner C Saake G 2013 Featureoriented product lines Springer Berlin Berger Pfeiffer R Tartler R Dienst Czarnecki K Wasowski 2014 Variability mechanisms ecosystems Inf Softw Technol 56111520–1535 Berger Steghöfer JP Ziadi Robin J Martinez J 2020 state adoption challenges systematic variability management industry Empir Softw Eng 251755–1797 Brun Holmes R Ernst MD Notkin 2011 Proactive detection collaboration conflicts Proceedings 19th ACM SIGSOFT symposium 13th European conference foundations engineering ESECFSE ’11 Association Computing Machinery New York pp 168–178 httpsdoiorg10114520251132025139 Businge J Decan Zerouali Mens Demeyer 2020 empirical investigation forks variants npm package distribution Papadakis Cordy eds Proceedings 19th BelgiumNetherlands evolution workshop BENEVOL 2020 Luxembourg December 34 2020 CEUR Workshop Proceedings vol 2912 CEURWSorg httpceurwsorgVol2912paper1pdf Businge J Kawuma Bainomugisha E Khomh F Nabaasa E 2017 Code authorship faultproneness opensource android applications empirical study Proceedings 13th international conference predictive models data analytics engineering PROMISE ACM New York pp 33–42 
httpsdoiorg10114531270053127009 Businge J Kawuma Openja Bainomugisha E Serebrenik 2019 stable eclipse application framework internal interfaces 2019 IEEE 26th international conference analysis evolution reengineering SANER pp 117–127 httpsdoiorg101109SANER20198668018 Businge J Openja Kavaler Bainomugisha E Khomh F Filkov V 2019 Studying android app popularity crosslinking github google play store SANER Businge J Openja Nadi Bainomugisha E Berger 2018 Clonebased variability management android ecosystem 2018 IEEE international conference maintenance evolution ICSME 2018 Madrid Spain September 2329 2018 pp 625–634 Businge J Serebrenik van den Brand 2012 Compatibility prediction eclipse thirdparty plugins new eclipse releases 12th IEEE international working conference source code analysis manipulation SCAM 2012 Riva del Garda Italy September 2324 2012 pp 164–173 Businge J Serebrenik van den Brand 2012 Survival eclipse thirdparty plugins 28th IEEE international conference maintenance ICSM 2012 Trento Italy September 2328 2012 pp 368–377 httpsdoiorg101109ICSM20126405295 Businge J Serebrenik van den Brand 2013 Analyzing eclipse API usage Putting developer loop 17th European conference maintenance reengineering CSMR 2013 Genova Italy March 58 2013 pp 37–46 Businge J Serebrenik van den Brand MGJ 2010 empirical study evolution Eclipse thirdparty plugins EVOLIWPSE’10 ACM pp 63–72 Businge J Serebrenik van den Brand MGJ 2015 Eclipse API usage good bad Softw Qual J 231107–141 httpsdoiorg101007s1121901392213 Chacon Straub B 2014 git tools rewriting history httpsgitscmcombookenv2GitToolsRewritingHistory Chacon Straub B 2014 Pro Git Apress Chua BB 2017 survey paper open source forking motivation reasons challenges Alias RA Ling PS Bahri Finnegan P Sia CL eds 21st Pacific Asia conference information systems PACIS 2017 Langkawi Malaysia July 1620 2017 p 75 Czarnecki KBanâtre JP Fradet P Giavitto JL Michel eds 2005 Overview generative development Springer Berlin Decan Mens Grosjean P 
2019 empirical comparison dependency network evolution seven packaging ecosystems Empir Softw Eng 241381–416 httpsdoiorg101007s106640179589y Dixion J 2009 Different kinds open source forks – salad dinner fish httpsjamesdixonwordpresscom20090513differentkindsofopensourceforkssaladdinnerandfish Dubinsky Rubin J Berger Duszynski Becker Czarnecki K 2013 exploratory study cloning industrial product lines CSMR Ernst NA Easterbrook SM Mylopoulos J 2010 Code forking opensource requirements perspective arXiv10042889 Gamalielsson J Lundell B 2014 Sustainability open source communities beyond fork libreoffice evolved J Syst Softw 89128–145 httpsdoiorg101016jjss2013111077 httpwwwsciencedirectcomsciencearticlepiiS0164121213002744 German DM Adams B Hassan AE 2016 Continuously mining distributed version control systems empirical study linux uses git Empir Softw Eng 211260–299 Jang J Agrawal Brumley 2012 Redebug Finding unpatched code clones entire OS distributions IEEE symposium security privacy SP 2012 2123 May 2012 San Francisco California USA IEEE Computer Society pp 48–62 httpsdoiorg101109SP201213 Jiang J Lo J Xia X Kochhar PS Zhang L 2017 developers fork github Empir Softw Eng 221547–578 httpsdoiorg101007s1066401694366 Kalliamvakou E Gousios G Blincoe K Singer L German DM Damian 2014 promises perils mining github MSR Kawuma Businge J Bainomugisha E 2016 find stable alternatives unstable eclipse interfaces 2016 IEEE 24th international conference program comprehension ICPC pp 1–10 httpsdoiorg101109ICPC20167503716 Kononenko Rose Baysal Godfrey Theisen de Water B 2018 Studying pull request merges case study shopify’s active merchant Proceedings 40th international conference engineering engineering practice ICSESEIP ’18 Association Computing Machinery New York pp 124–133 httpsdoiorg10114531835193183542 Krueger J Berger 2020 Activities costs reengineering cloned variants integrated platform 14th international working conference variability modelling softwareintensive systems VaMoS 
Krueger J Berger 2020 empirical analysis costs clone platformoriented reuse 28th ACM SIGSOFT international symposium foundations engineering FSE Krueger J Mahmood W Berger 2020 Promotepl roundtrip engineering process model adopting evolving product lines 24th ACM international systems product line conference SPLC Laurent 2008 Understanding open source free licensing O’Reilly Media Newton Li L Martinez J Ziadi Bissyandé TF Klein J Traon YL 2016 Mining families android applications extractive spl adoption SPLC Lillack Stanciulescu Hedman W Berger Wasowski 2019 Intentionbased integration variants 41st international conference engineering ICSE Mahmood W Chagama Berger Hebig R 2020 Causes merge conflicts case study elasticsearch 14th international working conference variability modelling softwareintensive systems VaMoS Mojica IJ Adams B Nagappan Dienst Berger Hassan AE 2014 large scale empirical study reuse mobile apps IEEE Softw 31278–86 Mukelabai Berger Borba P 2021 Semiautomated testcase propagation fork ecosystems 43rd international conference engineering new ideas emerging results track ICSENIER Munaiah N Kroh Cabrey C Nagappan 2017 Curating GitHub engineered projects Empir Softw Eng 2263219–3253 Nyman L 2014 Hackers forking Proceedings international symposium open collaboration pp 1–10 Nyman L Lindman J 2013 Code forking governance sustainability open source Technol Innov Manag Rev 37–12 Nyman L Mikkonen 2011 fork fork Fork motivations sourceforge projects Open source systems grounding research pp 259–268 Nyman L Mikkonen Lindman J Fougère 2012 Perspectives code forking sustainability open source Open source systems longterm sustainability pp 274–279 Openja Adams B Khomh F 2020 Analysis modern release engineering topics – largescale study using stackoverflow – 2020 IEEE international conference maintenance evolution ICSME pp 104–114 httpsdoiorg101109ICSME46990202000020 Paixão Maia P 2019 Rebasing code review considered harmful largescale empirical investigation 
2019 19th international working conference source code analysis manipulation SCAM pp 45–55 Parnas DL 1976 design development program families IEEE Trans Softw Eng 211–9 httpsdoiorg101109TSE1976233797 Perry DE Siy HP Votta LG 2001 Parallel changes largescale development observational case study ACM Trans Softw Eng Methodol 103308–337 httpsdoiorg101145383876383878 Raymond ES 2001 Cathedral Bazaar Musings linux open source accidental revolutionary Newton O’Reilly Media Inc Ren L Zhou Kästner C 2018 Poster forks insight providing overview github forks 2018 IEEEACM 40th international conference engineering companion ICSECompanion pp 179–180 Robles G GonzálezBarahona JM 2012 comprehensive study forks dates reasons outcomes Open source systems longterm sustainability pp 1–14 Sattler F von Rhein Berger Johansson NS Hardø MM Apel 2018 Lifting interapp dataflow analysis large app sets Autom Softw Eng 25315–346 Silva LD Borba P Mahmood W Berger Moisakis J 2020 Detecting semantic conflicts via automated behavior change detection 36th IEEE international conference maintenance evolution ICSME Sousa Dillig Lahiri SK 2018 Verified threeway program merge Proc ACM Program Lang 2OOPSLA httpsdoiorg1011453276535 de Souza CRB Redmiles Dourish P 2003 Breaking code moving private public work collaborative development Proceedings 2003 international ACM SIGGROUP conference supporting group work GROUP ’03 Association Computing Machinery New York pp 105–114 httpsdoiorg101145958160958177 Stanciulescu Schulze Wasowski 2015 Forked integrated variants opensource firmware IEEE international conference maintenance evolution ICSME ICSME ’15 Sung C Lahiri SK Kaufman Choudhury P Wang C 2020 Towards understanding fixing upstream merge induced conflicts divergent forks industrial case study Proceedings ACMIEEE 42nd international conference engineering engineering practice ICSESEIP ’20 Association Computing Machinery New York pp 172–181 httpsdoiorg10114533778133381362 Vandehey 2019 Rebase merge 
httpscloudfourcomthinkssquashingyourpullrequests Viseur R 2012 Forks impacts motivations free open source projects Int J Adv Comput Sci Appl IJACSA 32 Zhou Stănciulescu C Leßenich Xiong Wasowski Kästner C 2018 Identifying features forks Proceedings 40th international conference engineering pp 105–116 Zhou Vasilescu B Kästner C 2019 fork study inefficient efficient forking practices social coding Proceedings 2019 27th ACM joint meeting european engineering conference symposium foundations engineering pp 350–361 Zhou Vasilescu B Kästner C 2020 forking changed last 20 years study hard forks github Proceedings 42nd international conference engineering Accepted John Businge Postdoctoral fellow LORE lab University Antwerp Belgium received PhD Eindhoven University Technology Netherlands 2013 receiving PhD lecturer Mbarara University Science Technology Uganda six months 2016 Fulbright research scholar University California Davis USA research focuses mining repositories clone detection program analysis variability management empirical engineering Moses Openja PhD student member SWAT Lab Polytechnique Montreal Canada received bachelor’s degree 2017 Mbarara University Science Technology Uganda masters degree 2021 Polytechnique Montreal Canada research area includes quality machine learning applications empirical Engineering maintenance evolution ecosystem release engineering Sarah Nadi Assistant Professor Department Computing Science University Alberta Tier II Canada Research Chair Reuse obtained Master’s 2010 PhD 2014 degrees University Waterloo Canada joining University Alberta 2016 spent approximately two years postdoctoral researcher Technische Universität Darmstadt Germany Sarah’s research focuses providing intelligent support maintenance reuse including creating recommender systems guide developers correctly securely reusing individual functionality 
external libraries Thorsten Berger Professor Computer Science Ruhr University Bochum Germany receiving PhD degree University Leipzig Germany 2013 Postdoctoral Fellow University Waterloo Canada University Copenhagen Denmark Associate Professor jointly Chalmers University Technology University Gothenburg Sweden received competitive grants Swedish Research Council Wallenberg Autonomous Systems Program Vinnova Sweden EU ITEA European Union fellow Wallenberg Academy one highest recognitions researchers Sweden received two bestpaper awards one influential paper award service recognized distinguished reviewer awards tierone conferences ASE 2018 ICSE 2020 research focuses modeldriven engineering program analysis empirical engineering
Affiliations John Businge 1,2 · Moses Openja 3 · Sarah Nadi 4 · Thorsten Berger 5,6
Moses Openja openjamosesopmgmailcom
Sarah Nadi nadiualbertaca
Thorsten Berger thorstenbergerrubde
1 Mbarara University Science Technology Mbarara Uganda
2 University Antwerp Antwerp Belgium
3 SWAT Lab École Polytechnique de Montréal Montréal Canada
4 University Alberta Edmonton Canada
5 Ruhr University Bochum Bochum Germany
6 Chalmers University Gothenburg Gothenburg Sweden
::::
Expect Code Review Bots GitHub Survey OSS Maintainers Mairieli Wessel mairieliimeuspbr University São Paulo Alexander Serebrenik aserebreniktuenl Eindhoven University Technology Igor Wiese igorutfpredubr Universidade Tecnológica Federal Paraná Igor Steinmacher igorsteinmachernauedu Northern Arizona University Marco Gerosa marcogerosanauedu Northern Arizona University ABSTRACT bots used Open Source OSS projects streamline code review process Interfacing developers automated services code review bots report continuous integration failures code quality checks code coverage However impact bots maintenance tasks still neglected paper study maintainers experience code review bots surveyed 127 maintainers asked expectations perception changes incurred code review bots findings reveal frequent expectations include enhancing feedback bots provide developers reducing maintenance burden developers enforcing code coverage maintainers report bots satisfied expectations also perceived unexpected effects communication noise newcomers’ dropout Based results provide series implications bot developers well insights future research CCS CONCEPTS • Humancentered computing → Open source • engineering → creation management KEYWORDS bots pullbased model open source code review ACM Reference Format Mairieli Wessel Alexander Serebrenik Igor Wiese Igor Steinmacher Marco Gerosa 2020 Expect Code Review Bots GitHub Survey OSS Maintainers 34th Brazilian Symposium Engineering SBES ’20 October 21–23 2020 Natal Brazil ACM New York NY USA 6 pages httpsdoiorg10114534223923422459 1 INTRODUCTION Code review quality assurance practice 8 common Open Source OSS projects 3 Since open source development involves community geographically dispersed developers 23 projects often hosted social coding platforms GitHub 7 receive external contributions repositories shared fork modified pull requests pullbased development model maintainers spend nonnegligible time inspecting code changes engaging discussion 
contributors understand improve modifications integrating codebase 15 33 Open source communities use bots assist streamline code review process 9 29 short bots applications integrate human tasks serving interfaces connect developers tools 26 providing additional value human users 12 Accomplishing tasks previously performed solely human developers interacting communication channels human counterparts bots become new voices code review conversation 17 According Wessel et al 29 code review bots differ bots guiding contributors provide necessary information maintainers review pull requests GitHub bots responsible leaving comments pull requests reporting continuous integration failures code quality checks code coverage theory automation provided bots save maintainers effort time 25 lead focus higher priority aspects code review 2 Nevertheless adoption code review bot similar technological adoption bring unexpected consequences Since according Mulder et al 18 many effects directly caused new technology changes human behavior provokes important assess discuss effects new technology case effect bots maintainers often neglected paper aim understand open source maintainers integrate code review bots pull request workflow perceive changes bots induce short answer following research questions RQ1 motivates maintainers adopt code review bots RQ2 maintainers perceive changes code review bots introduce process achieve goal conducted survey 127 maintainers OSS projects hosted GitHub adopted code review bots investigate maintainers’ perceptions whether activity indicators change bot adoption number pull requests received merged nonmerged number comments time close pull requests Analyzing survey results found maintainers predominantly motivated reducing effort tedious tasks allow focus interesting ones enhancing feedback communicated developers Regarding changes introduced bot noted less manual effort required adoption highquality code enforced pull request review sped However four 
maintainers also reported unexpected aspects bot adoption including communication noise time spent tests newcomers’ dropout bots impersonating maintainers stressed contributors contributions twofold set maintainers’ motivations using bot assist code review process ii discussion maintainers see impact bot introduction support contributions may help maintainers anticipate bots’ effects guide bot developers consider implications new bots design findings preliminary suggest research hypotheses impact code review bots code review process open source projects followup studies support refute
::::
2 BACKGROUND RELATED WORK bots designed assist technical social aspects development activities 13 including communication decisionmaking 25 Basically bots act conduit developers tools 25 Wessel et al shown bot adoption indeed widespread OSS projects hosted GitHub 29 GitHub bots developed integrated pull request workflow perform variety tasks beyond code review support 31 tasks include repairing bugs 17 27 28 refactoring code 32 recommending tools 4 detecting duplicated development 20 updating dependencies 16 fixing static analysis violations 5 Despite increasing popularity understanding effects bots major challenge Storey Zagalsky 25 Paikari van der Hoek 19 highlight potential negative impact task automation bot technology still neglected bots often used avoid interruptions developers’ work may lead less obvious distractions 25 Additionally Liu et al 14 claim bots may negative impacts user experience open source contributors since needs preferences maintainers contributors previous studies provide recommendations evaluate bots’ capabilities performance 1 4 draw attention impact bot adoption development engineers perceive bots’ effects Wessel et al 29 investigated usage impact bots support contributors maintainers pull requests identifying bots popular GitHub repositories authors classified bots 13 categories according tasks perform third frequently used bots code review bots Wessel et al 30 also employed regression discontinuity design OSS projects revealing bot adoption increases number monthly merged pull requests decreases monthly nonmerged pull requests decreases communication among developers Prior work also investigated impact continuous integration CI code review tools GitHub projects 6 11 34 Zhao et al 34 Cassee et al 6 investigated impact Travis CI tool’s introduction development practices Kavaler et al 11 turned impact linters dependency managers coverage reporter tools work extends literature providing understanding code review bots adopted effects 
adoption focusing perceptions open source maintainers
::::
3 STUDY METHODOLOGY conducted survey obtain insights open source maintainers perceive impact using code review bots pull requests effects bots activities 31 Survey Design first identified OSS projects hosted GitHub point adopted least one code review bot 29 find projects queried GHTorrent dataset 10 searching projects received comments pull requests code review bots identified Wessel et al 29 determined bot introduced based date bot’s first comment Afterwards contacted maintainers merged one pull request bot adoption avoid duplicate invitations kept first record maintainers appeared one initial target population comprised 1960 maintainers projects adopted code review bots made email addresses publicly available via GitHub API increase survey participation followed best practices described Smith et al 21 sending personalized invitations allowing participants remain anonymous survey set online questionnaire sent September 18 2019 received answers 3 months sent reminder October 2019 Participation voluntary estimated time complete survey 10 minutes received answers 127 maintainers delivery 26 messages failed survey response rate approx 655 consistent studies engineering 22 maintainers’ survey three main questions made publicly available1 summary asked maintainers expectations perception changes caused adoption code review bot Regarding changes process level asked maintainers activity indicators studied Wessel et al 29 number opened merged nonmerged pull requests number comments time close pull requests 32 Data analysis used card sorting approach 35 qualitatively analyze answers openended questions Q1 Q3 Two researchers conducted card sorting two steps first step researcher analyzed answers cards independently applied codes answer sorting meaningful groups step followed discussion meeting reaching consensus code names categorization item end process answers sorted highlevel groups second step researchers analyzed categories aiming refine classification grouprelated codes significant higherlevel categories themes used open card sorting meaning predefined codes groups codes emerged evolved analysis process addition quantitatively analyzed closedended question Q2 understand developers’ perceptions impact bots pull requests
1 httpszenodoorgrecord3992379Xz1iSlKg3E
Table 1 Reasons adoption code review bots
Reasons                                answers (%)
Enhance feedback developers            31 (24.4)
Reduce maintainers effort              30 (23.6)
Enforce high code coverage             22 (17.3)
Automate routine tasks                 20 (15.7)
Ensure highquality standards           20 (15.7)
Detect change effects                  7 (5.5)
Curiosity                              5 (3.9)
Improve interpersonal communication    5 (3.9)
Lack available tools                   5 (3.9)
Outside contributor’s suggestion       2 (1.6)
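The percentages in Table 1 can be checked against the 127 survey answers reported above. The following is a minimal sketch, not part of the original study: the dictionary, the constant, and the function name are hypothetical, the category labels are copied as they appear in the table, and the one-decimal rounding is an assumption chosen to match the table's precision.

```python
# Hypothetical names throughout; counts copied from Table 1.
RESPONDENTS = 127  # number of maintainers who answered the survey

reason_counts = {
    "Enhance feedback developers": 31,
    "Reduce maintainers effort": 30,
    "Enforce high code coverage": 22,
    "Automate routine tasks": 20,
    "Ensure highquality standards": 20,
    "Detect change effects": 7,
    "Curiosity": 5,
    "Improve interpersonal communication": 5,
    "Lack available tools": 5,
    "Outside contributor's suggestion": 2,
}


def share(count, total=RESPONDENTS):
    """Percentage of respondents, rounded to one decimal place (assumed precision)."""
    return round(100 * count / total, 1)


# Map every reason to its share of the 127 respondents.
shares = {reason: share(count) for reason, count in reason_counts.items()}

for reason, pct in shares.items():
    print(f"{reason}: {reason_counts[reason]} ({pct}%)")
```

The counts add up to more than 127, which would be consistent with a single answer being coded into more than one category; that reading is an inference from the numbers, not something the text states.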
::::
4 RESULTS section report main findings 41 Maintainers’ Motivations Adopt Code Review Bot asked maintainers made decide start using bots support code review activities Four participants 315 report reason answers grouped 10 categories seen Table 1 maintainers’ perspective recurrent motivation relates enhancing feedback developers 31 mentions category includes cases respondents’ desired see code review metrics additional information “in pretty automated fashion” “without go another tool” Several respondents recognized value bot feedback reviewers contributors “bots write useful information comments analyze without switching context” addition respondents pointed importance “giving uniform feedback contributors” “letting contributors see affect code” Another two respondents mentioned kind feedback might also increase contributors’ public accountability giving reviewers “confidence author cares testing” quality code contribution Another recurrent reason regards reducing maintainers’ effort 30 mentions Several maintainers motivated necessity save time reduce effort code review process said reducing maintainers’ effort trivial tasks finding syntax errors checking code style coverage requirements allows “spend time important parts” Moreover feedback provided code review bot helps maintainers avoid “repeating comments pull request” 22 mentions enforcing high code coverage code review process third common reason general respondents mentioned code review bots adopted help detect prevent reduction code coverage also mentioned bots “ensure good coverage allow changes code base high confidence continue function expected” since “don’t want drop significantly coverage” Respondents 20 also reported another related reason ensure highquality standards Respondents said using code review bots “automating repetitive tasks ensures get done increasing code quality” “reduces risk bugs missed reviewers” Several maintainers 20 also motivated automating routine tasks previously manually 
performed Respondents mentioned desire automate routine tasks order structure process code review “make process repeatable” routine tasks include tracking coverage “automatically uploading code coverage results 3rdparty service” Others provided generic answers briefly mentioning “automation” Maintainers also motivated curiosity test new technological tool suggestion outside contributor five cases respondents motivated improving interpersonal communication since “an automatic answer bot isn’t taken personally” “it friendly way ensure quality” Moreover code review bot “improves interpersonal communication pull requests thus may reduce chance pull request abandoned author” Answer RQ1 Maintainers reported 10 reasons using code review bots found several maintainers motivated enhancing feedback developers 244 reducing efforts 236 enforcing high code coverage 173 42 Maintainers’ Perceptions Bots Effects also asked maintainers perspective potential changes projects code review bot introduced answers followed 5point Likert scale neutral ranging “Strongly disagree” “Strongly agree” Figure 1 observe respondents agree expected impact bot adoption pull requests considering five studied activities indicators number pull requests received merged nonmerged number comments time close pull requests respondents claimed relation number pull requests presence bot stated amount opened pull requests “depends bugs features software” However one respondent claimed could lead increase number pull requests “a better experience everyone involved might eventually lead repeat contributors” Regarding merged nonmerged pull requests maintainers claimed trends typically “human factors” unrelated bot adoption One maintainer believed ability filter contributions reduce code quality also reduces merge rates pull requests Respondents 36 perceived increase number comments made pull requests bot adoption One respondent claimed increase occurs contributions drastically reduce coverage stimulate exchange 
comments maintainers contributors Another maintainer explained number comments increased maintainers “contributors started discussing best test something” Maintainers believe 41 code review bot helped decrease timetoclose pull requests One respondent agree statement left comment telling us code review bot actually increased time merge pull requests due need additional time write tests obtain stable code Another maintainer commented bot increases time merge contributions though “it perceived bad thing” also openly asked maintainers changes introduced adoption code review bots maintenance process Twentythree participants 181 report change responses grouped 13 categories seen Table 2 recurrent reported change adoption code review bots requires less manual labor maintainers 33 mentions general respondents mentioned maintenance process easier fewer manual tasks perform “need spend less time it” maintainers also suggest bots could help reduce number human resources necessary complete task makes “it easier reducing number review comments general feedback manual quality assurance required successful merge” Nevertheless maintainers also aware implications “automation like always prone nonfatal error” Several maintainers 20 noticed changes quality contributions received reporting bot helps enforce highquality code one example respondent mentioned “the introduction bots increased quality code seen maintainers initial review since contributors got timely minutes feedback parts failed basic quality standards missing tests missing documentation incorrect style broken functionality” Another 6 respondents also realized positive effects quality code review process “translate efficient code review robust codebase long term” Since one common reasons adopt code review bot enforce code coverage unsurprisingly 16 respondents mentioned increase code coverage adoption respondents reported bots help “encourage add tests” “the coverage good enough” One respondent stated importance awareness 
code coverage “the effects visible contributors generally resolve decreased coverage pull request” Additionally one respondent claimed bot feedback also “spurred pull requests increase coverage” Another bot adoption effect reviewing pull requests became faster reported 16 maintainers Three respondents mentioned faster reviews lead faster merging respondent stated highquality pull requests quickly identified since “the human review step always started baseline level quality” thus merged faster addition another maintainer reinforced efficiency process “some bots well merge pull requests immediately opening it” addition 7 maintainers also reported quality code review process improved categories although less recurrent called attention negative effects reportedly caused bot adoption One respondent said bots intimidate newcomers since newcomers close pull requests bot comment Another believes newcomer receiving assessment “you let coverage go down” instead “thanks contribution” “can little daunting” Respondents also mentioned adoption testing started require time development bot’s comments introduced noise Another respondent said bot impersonate human developers due bots’ strict rules stressed contributors Answer RQ2 Among positive changes incurred code review bots maintainers reported less manual labor required bot adoption 259 bots enforced highquality code 157 negative effects include communication noise time spent tests newcomers’ dropout bots impersonating maintainers
::::
5 DISCUSSION IMPLICATIONS Adding code review bot represent desire better communicate developers helping contributors maintainers effective achieving improved interpersonal communication already discussed Storey Zagalsky 25 fact results reveal predominant reason using code review bot improve feedback communicated developers Moreover maintainers also interested automating code review tasks reduce maintenance burden enforce high code coverage maintainers’ perceptions bots impact maintenance line reported motivations Indeed maintainers started spend less effort trivial tasks allowing focus important aspects code review Furthermore code review bots guide contributors toward detecting change effects maintainers triage pull requests 29 ensuring highquality standards faster code review Bots’ feedback provides immediate clear sense contributors need contribution reviewed Maintainers also noted contributors’ confidence increased code review bot provided situational awareness 25 indicating standards language issues coverage contributors one hand adopting bot save maintainers’ costs time effort code review activities hand study also reports four unexpected negative effects adopting bot assist code review process effects include communication noise time spent tests newcomers’ dropout bots impersonating maintainers Although less recurrent effects nonnegligible OSS community Previous work Wessel et al 29 already mentioned support newcomer onboarding terms challenges feature maintainers desire survey maintainers claim easier newcomers submit highquality pull request intervention bots However another maintainer pointed newcomers casual contributors receive feedback bot lead rework discussions ultimately dropping contributing study suggests practical implications practitioners well insights suggestions researchers Awareness bot effects Indeed maintenance activities changed following adoption code review bots change directly affect contributors’ maintainers’ work Hence understanding 
code review bot adoption affects important practitioners mainly avoid unexpected even undesired effects Awareness unexpected bot effects lead maintainers take countermeasures andor decide whether use code review bot Improving bots’ design Anyone wants develop bot support code review process needs consider impact bot may technical social contexts Based results bot improvements envisioned example order prevent bots introducing communication noise bot developers know extent bot interrupt human 14 24 Improving newcomers support aforementioned previous literature bots already mentioned lack support newcomers 29 reasonable expect newcomers receive friendly feedback higher engagement level thus sustain participation Hence future research help bot designers providing guidelines insights support new contributors
::::
6 THREATS VALIDITY Since leverage qualitative research methods categorize openended questions asked survey may introduced categorization bias mitigate bias conducted process pairs carefully discussed categorization among authors Regarding survey order presented questions respondents may influenced way answered addition cannot guarantee maintainers correctly understood sentences 4 5 tried order questions based natural sequence actions help respondents understand questions’ context
::::
7 FINAL CONSIDERATIONS work conducted preliminary investigation maintainers’ perceptions effects adopting bots support code review process pull requests frequently mentioned motivations using bots including automating repetitive tasks improving tools’ feedback developers reducing maintenance effort RQ1 Moreover maintainers cite several benefits bots decreasing time close pull requests reducing workload laborious repetitive tasks However maintainers also stated negative effects including introduction noise RQ2 Based preliminary findings future research focus better supporting understanding bots’ influences social interactions context OSS projects Moreover future work investigate effects adopting bot expansion analysis types bots activity indicators social coding platforms ACKNOWLEDGMENTS thank participants study volunteered support research work partially supported Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil CAPES – Finance Code 001 CNPq grant 14122220182 National Science Foundation grants 1815503 1900903 REFERENCES 1 Ahmad Abdellatif Emad Shihab 2020 MSBot Using Bots Answer Questions Repositories Empirical Engineering EMSE 25 2020 1834–1863 httpsdoiorg101007s10664019097885 2 Alberto Bacchelli Christian Bird 2013 Expectations outcomes challenges modern code review 2013 35th International Conference Engineering ICSE IEEE 712–721 3 Olga Baysal Oleksii Kononenko Reid Holmes Michael W Godfrey 2016 Investigating technical nontechnical factors influencing modern code review Empirical Engineering 21 3 2016 932–959 4 Chris Brown Chris Parnin 2019 Sorry Bother Designing Bots Effective Recommendations Proceedings 1st International Workshop Bots Engineering Montreal Quebec Canada BotSE ’19 IEEE Press Piscataway NJ USA 54–58 httpsdoiorg101109BotSE201900021 5 Carvalho W Luz Marcílio R Bonifácio G Pinto E Dias Canedo 2020 C3PR Bot Fixing Static Analysis Violations via Pull Requests 2020 IEEE 27th International Conference Analysis Evolution Reengineering
SANER IEEE Computer Society 6 Nathan Cassee Bogdan Vasilescu Alexander Serebrenik 2020 silent helper impact continuous integration code reviews 27th IEEE International Conference Analysis Evolution Reengineering SANER IEEE 49–60 7 Linda Erlenhov Francisco Gomes de Oliveira Neto Riccardo Scandariato Philipp Leitner 2019 Current Future Bots Development Proceedings 1st International Workshop Bots Engineering Montreal Quebec Canada BotSE ’19 IEEE Press Piscataway NJ USA 7–11 httpsdoiorg101109BotSE201900009 8 Georgios Gousios Diomidis Spinellis 2012 GHtorrent GitHub’s data firehose 2012 9th IEEE Working Conference Mining Repositories MSR IEEE 12–21 9 David Kavaler Asher Trockman Bogdan Vasilescu Vladimir Filkov 2019 Tool choice matters JavaScript quality assurance tools usage outcomes GitHub projects Proceedings 41st International Conference Engineering IEEE Press 476–487 10 Carlene Lebeuf Alexey Zagalsky Matthieu Foucault MargaretAnne Storey 2019 Defining Classifying Bots Faceted Taxonomy Proceedings 1st International Workshop Bots Engineering Montreal Quebec Canada BotSE ’19 IEEE Press Piscataway NJ USA 1–6 httpsdoiorg101109BotSE201900008 11 Bin Lin Alexey Zagalsky MargaretAnne Storey Alexander Serebrenik 2016 developers slacking Understanding teams use slack Proceedings 19th ACM Conference Computer Supported Cooperative Work Social Computing Companion ACM 333–336 12 Dongyu Liu Micah J Smith Kalyan Veeramachaneni 2020 Understanding UserBot Interactions SmallScale Automation OpenSource Development Extended Abstracts 2020 CHI Conference Human Factors Computing Systems Honolulu HI USA CHI EA ’20 Association Computing Machinery New York NY USA 1–8 httpsdoiorg10114533344803382998 13 Shane McIntosh Yasutaka Kamei Bram Adams Ahmed E Hassan 2014 impact code review coverage code review participation quality case study qt vtk itk projects Proceedings 11th Working Conference Mining Repositories 192–201 14 Samim Mirhosseini Chris Parnin 2017 Automated Pull Requests Encourage
Developers Upgrade Outofdate Dependencies Proceedings 32nd IEEEACM International Conference Automated Engineering UrbanaChampaign IL USA ASE ’17 IEEE Press Piscataway NJ USA 94–94 httpdlacmorgcitationcfmid31555623155577 15 Martin Monperrus 2019 Explainable Bot Contributions Case Study Automated Bug Fixes Proceedings 1st International Workshop Bots Engineering Montreal Quebec Canada BotSE ’19 IEEE Press Piscataway NJ USA 12–15 httpsdoiorg101109BotSE201900010 16 KF Mulder 2013 Impact new technologies assess intended unintended effects new technologies Handb Sustain Eng 2013 17 Elahe Paikari André van der Hoek 2018 Framework Understanding Chatbots Future Proceedings 11th International Workshop Cooperative Human Aspects Engineering Gothenburg Sweden CHASE ’18 ACM New York NY USA 13–16 httpsdoiorg10114531958363195859 18 Luyao Ren Shurui Zhou Christian Kästner Andrzej Wąsowski 2019 Identifying Redundancies Forkbased Development 2019 IEEE 26th International Conference Analysis Evolution Reengineering SANER IEEE 230–241 19 Edward Smith Robert Loftin Emerson MurphyHill Christian Bird Thomas Zimmermann 2013 Improving developer participation rates surveys 2013 6th International Workshop Cooperative Human Aspects Engineering CHASE IEEE 89–92 20 Igor Steinmacher Gustavo Pinto Igor Scaliante Wiese Marco Gerosa 2018 Almost Study Quasicontributors Open Source Projects Proceedings 40th International Conference Engineering Gothenburg Sweden ICSESEIP ’18 ACM New York NY USA 256–266 httpsdoiorg10114531801553180208 21 Igor Fábio Steinmacher 2015 Supporting newcomers overcome barriers contribute open source projects PhD Dissertation Universidade de São Paulo 22 MargaretAnne Storey Alexander Serebrenik Carolyn Penstein Rosé Thomas Zimmermann James Herbsleb 2020 BOTse Bots Engineering Dagstuhl Seminar 19471 Dagstuhl Reports 9 11 2020 84–96 23 MargaretAnne Storey Alexey Zagalsky 2016 Disrupting Developer Productivity One Bot Time Proceedings 2016 24th ACM SIGSOFT International Symposium
Foundations Engineering Seattle WA USA FSE 2016 ACM New York NY USA 928–931 httpsdoiorg10114529502902938989 24 MargaretAnne Storey Alexey Zagalsky Fernando Figueira Filho Leif Singer Daniel German 2017 Social Communication Channels Shape Challenge Participatory Culture Development IEEE Trans Softw Eng 43 2 Feb 2017 185–204 httpsdoiorg101109TSE20162584053 25 Simon Urli Zhongxing Yu Lionel Seinturier Martin Monperrus 2018 Design Program Repair Bot Insights Repairinator Proceedings 40th International Conference Engineering Engineering Practice Gothenburg Sweden ICSESEIP ’18 ACM New York NY USA 95–104 httpsdoiorg10114531835193183540 26 Rijnard van Tonder Claire Le Goues 2019 Towards EngineerBot Principles Program Repair Bots Proceedings 1st International Workshop Bots Engineering Montreal Quebec Canada BotSE ’19 IEEE Press Piscataway NJ USA 43–47 httpsdoiorg101109BotSE201900019 27 Mairieli Wessel Bruno Mendes de Souza Igor Steinmacher Igor Wiese Ivanilton Polato Ana Paula Chaves Marco Gerosa 2018 Power Bots Characterizing Understanding Bots OSS Projects Proceedings ACM Conference Computer Supported Cooperative Work Social Computing 2 CSCW Article 182 Nov 2018 1821–18218 httpsdoiorg1011453274451 28 Mairieli Wessel Alexander Serebrenik Igor Scaliante Wiese Igor Steinmacher Marco Aurelio Gerosa 2020 Effects Adopting Code Review Bots Pull Requests OSS Projects IEEE International Conference Maintenance Evolution IEEE Computer Society 29 Mairieli Wessel Igor Steinmacher 2020 Inconvenient Side Bots Pull Requests Proceedings 2nd International Workshop Bots Engineering BotSE httpsdoiorg10114533879403391504 30 Marvin Wyrich Justus Bogner 2019 Towards Autonomous Bot Automatic Source Code Refactoring Proceedings 1st International Workshop Bots Engineering Montreal Quebec Canada BotSE ’19 IEEE Press Piscataway NJ USA 24–28 httpsdoiorg101109BotSE201900015 31 Yue Yu Huaimin Wang Vladimir Filkov Premkumar Devanbu Bogdan Vasilescu 2015 Wait Determinants Pull Request Evaluation Latency
GitHub 2015 IEEEACM 12th Working Conference Mining Repositories 367–371 httpsdoiorg101109MSR201542 32 Yangyang Zhao Alexander Serebrenik Yuming Zhou Vladimir Filkov Bogdan Vasilescu 2017 impact continuous integration development practices largescale empirical study Proceedings 32nd IEEEACM International Conference Automated Engineering IEEE Press 60–71 33 Thomas Zimmermann 2016 Cardsorting text themes Perspectives Data Science Engineering Elsevier 137–141
::::
Developers Use Trivial Packages Empirical Case Study npm Rabe Abdalkareem Olivier Nourry Sultan Wehaibi Suhaib Mujahid Emad Shihab Datadriven Analysis DAS Lab Department Computer Science Engineering Concordia University Montreal Canada rababduonourrysalwehasmujahieshihabencsconcordiaca ABSTRACT Code reuse traditionally seen good practice Recent trends pushed concept code reuse extreme using packages implement simple trivial tasks call ‘trivial packages’ recent incident trivial package led breakdown popular web applications Facebook Netflix made imperative question growing use trivial packages Therefore paper mine 230000 npm packages 38000 JavaScript applications order study prevalence trivial packages found trivial packages common increasing popularity making 168 studied npm packages performed survey 88 Nodejs developers use trivial packages understand reasons drawbacks use survey revealed trivial packages used perceived well implemented tested pieces code However developers concerned maintaining risks breakages due extra dependencies trivial packages introduce objectively verify survey results empirically validate cited reason drawback find contrary developers’ beliefs 452 trivial packages even tests However trivial packages appear ‘deployment tested’ similar test usage community interest nontrivial packages hand found 115 studied trivial packages 20 dependencies Hence developers careful trivial packages decide use CCS CONCEPTS • engineering → libraries repositories maintenance tools KEYWORDS JavaScript Nodejs Code Reuse Empirical Studies ACM Reference Format Rabe Abdalkareem Olivier Nourry Sultan Wehaibi Suhaib Mujahid Emad Shihab 2017 Developers Use Trivial Packages Empirical Case Study npm Proceedings 2017 11th Joint Meeting European Engineering Conference ACM SIGSOFT Symposium Foundations Engineering Paderborn Germany September 4–8 2017 ESECFSE’17 11 pages httpsdoiorg10114531062373106267
::::
1 INTRODUCTION Code reuse often encouraged due multiple benefits fact prior work showed code reuse reduce timetomarket improve quality boost overall productivity 3 32 37 Therefore surprise emerging platforms Nodejs encourage reuse everything possible facilitate code sharing often delivered packages modules available package management platforms Node Package Manager npm 7 39 However good news many cases code reuse negative effects leading increase maintenance costs even legal action 2 29 35 41 example recent incident code reuse Nodejs package called leftpad used Babel caused interruptions largest Internet sites eg Facebook Netflix Airbnb Many referred incident case ‘almost broke Internet’ 33 45 incident lead many heated discussions code reuse sparked David Haney’s blog post “Have Forgotten Program” 26 real reason leftpad incident npm allowed authors unpublish packages problem resolved 40 raised awareness broader issue taking dependencies trivial tasks easily implemented 26 Since many discussions use trivial packages Loosely defined trivial package package contains code developer easily code himherself hence worth taking extra dependency Many developers agreed Haney’s position stated every serious developer knows ‘small modules nice theory’ 8 suggesting developers implement functions rather taking dependencies trivial tasks work showed npm packages tend large number dependencies 13 14 highlighted developers need use caution since dependencies grow exponentially 4 fact dataset found 11 trivial packages 20 dependencies million dollar question “why developers resort using package trivial tasks checking variable array” time questions regarding prevalent trivial packages potential drawbacks using trivial packages remain unanswered Therefore performed empirical study involving 230000 npm packages 38000 JavaScript applications better understand developers resort using trivial packages empirical study qualitative nature based survey results 88 Nodejs developers also 
quantitatively validate commonly developercited reason drawback related use trivial packages Since best knowledge first study examine developers use trivial packages first propose definition constitutes trivial package based feedback JavaScript developers also examine prevalent trivial packages npm widely used Nodejs applications findings indicate Trivial packages common popular 231092 npm packages dataset 168 trivial packages Moreover 38807 Nodejs applications GitHub 109 directly depend one trivial packages developers consider use trivial packages bad practice survey 88 JavaScript developers 579 said consider use trivial packages bad practice whereas 239 consider bad practice finding shows clear consensus issue trivial package use Trivial packages provide well implemented tested code increase productivity Developers believe trivial packages provide well implementedtested code increase productivity time increase dependency overhead risk breakage applications two cited drawbacks Developers need careful trivial packages use empirical findings show many trivial packages dependencies fact found 437 trivial packages least one dependency 115 trivial packages 20 dependencies addition aforementioned findings study provides following key contributions provide way quantitatively determine trivial packages best knowledge first study examine prevalence reasons drawbacks using trivial packages Nodejs applications study also one largest studies JavaScript applications involving survey 80 JavaScript developers 231092 npm packages 38807 Nodejs applications perform empirical study validate commonly cited reasons drawbacks using trivial packages developer survey make dataset responses provided npm developers publicly available paper organized follows Section 2 provides background introduces datasets Section 3 presents determine trivial package Section 4 examines prevalence trivial packages use Nodejs applications Section 5 presents results developer survey presenting reasons 
perceived drawbacks developers use trivial packages Section 6 presents empirical validation commonly cited reason drawback using trivial packages implications findings noted section 7 discuss related works section 8 limitations study section 9 present conclusions section 10
::::
2 BACKGROUND DATASETS JavaScript used write client server side applications popularity steadily grown thanks popular frameworks Nodejs active developer community 7 46 JavaScript projects classified two main categories packages used projects applications used standalone Node Package Manager npm provides tools manage Nodejs packages npm official package manager Nodejs registry contains 250000 packages 25 perform study gather two datasets two sources obtain Nodejs packages npm registry applications use npm packages GitHub Packages Since interested examining impact ‘trivial’ packages mined latest version Nodejs packages npm May 5 2016 package obtained source code GitHub cases package publisher provide GitHub link case obtained source code directly npm total mined 252996 packages Applications also want examine use packages JavaScript applications Therefore mined Nodejs applications GitHub ensure indeed obtaining applications GitHub npm packages compare URL GitHub repositories URLs obtained npm packages URL GitHub also npm flagged npm package removed application list determine application uses npm packages looked ‘packagejson’ file specifies amongst others npm package dependencies used application eliminate dummy applications may exist GitHub choose nonforked applications 100 commits 2 developers Similar filtering criteria use prior work Kalliamvakou et al 31 total obtained 115621 JavaScript applications removing applications use npm platform left 38807 applications
::::
3 TRIVIAL PACKAGES ANYWAY Although trivial package loosely defined past eg blogs 27 28 want precise objective way determine trivial packages determine constitutes trivial package conducted survey asked participants considered trivial package indicators used determine package trivial devised online survey presented source code 16 randomly selected Nodejs packages range size 4 250 JavaScript lines code LOC Participants asked 1 indicate thought package trivial 2 specify indicators use determine trivial package opted limit size Nodejs packages survey maximum 250 JavaScript LOC since want overwhelm participants review excessive amounts code asked survey participants indicate trivial packages list Nodejs packages provided provided survey participants loose definition trivial package ie package contains code easily code hence worth taking extra dependency Figure 1 shows example trivial package called isPositive simply checks number positive survey questions divided three parts 1 questions participant’s development background 2 questions classification provided Nodejs packages 3 questions indicators participant would use determine trivial package sent survey 22 developers colleagues familiar JavaScript development received total 12 responses
module.exports = function (n) { return toString.call(n) === '[object Number]' && n > 0; };
Figure 1 Package isPositive npm
Participants Background Experience 12 respondents 2 undergraduate students 8 graduate students 2 professional developers Ten 12 respondents least 2 years JavaScript experience half participants developing JavaScript five years Survey Responses asked participants list indicators use determine package trivial indicate packages considered trivial 12 participants 11 92 state complexity code 9 75 state size code indicators use determine trivial package Another 3 20 mentioned used code comments indicators eg functionality indicate package trivial Since clear size complexity common
indicators trivial packages use two measures determine trivial packages mentioned participants could provide 1 indicator hence percentages sum 100 Next analyze packages marked trivial total received 69 votes 16 packages ranked packages ascending order based size tallied votes voted packages find 79 votes consider packages less 35 lines code trivial also examine complexity packages using McCabe’s cyclomatic complexity find 84 votes marked packages total complexity value 10 lower trivial important note although provide source code packages participants explicitly provide size complexity packages participants biased metrics ie size complexity classification Based aforementioned findings used two indicators JavaScript LOC $\leq 35$ complexity $\leq 10$ determine trivial packages dataset Hence define trivial packages $X_{\text{LOC}} \leq 35 \cap X_{\text{Complexity}} \leq 10$ $X_{\text{LOC}}$ represents JavaScript LOC $X_{\text{Complexity}}$ represents McCabe’s cyclomatic complexity package $X$ Although use aforementioned measures determine trivial packages consider possible way determine trivial packages survey indicates size complexity commonly used measures determine package trivial Based analysis packages $\leq 35$ JavaScript LOC McCabe’s cyclomatic complexity $\leq 10$ considered trivial
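The definition above (JavaScript LOC of at most 35 and McCabe complexity of at most 10) can be expressed as a small predicate. This is a minimal sketch rather than the study's tooling: the `PackageMetrics` container and its field names are hypothetical, and the LOC and complexity values are assumed to have been measured beforehand with test code excluded.

```python
from dataclasses import dataclass

# Thresholds reported in the text for a trivial package.
MAX_LOC = 35
MAX_COMPLEXITY = 10


@dataclass(frozen=True)
class PackageMetrics:
    """Hypothetical container; the field names are not taken from the study."""
    name: str
    loc: int         # JavaScript lines of code, test code excluded
    complexity: int  # total McCabe cyclomatic complexity


def is_trivial(pkg: PackageMetrics) -> bool:
    """Return True when both thresholds hold (the intersection in the definition)."""
    return pkg.loc <= MAX_LOC and pkg.complexity <= MAX_COMPLEXITY
```

Both conditions must hold at once, so a short but highly branching package, or a long but simple one, is not counted as trivial under this sketch.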
::::
4 PREVALENT TRIVIAL PACKAGES section want know prevalent trivial packages examine prevalence two aspects first aspect npm’s perspective interested knowing many packages npm trivial second aspect considers use trivial packages JavaScript applications 41 Many npm’s Packages Trivial use two measures LOC complexity determine trivial packages use quantify number trivial packages dataset dataset contained total 252996 npm packages package calculated number JavaScript code lines removed packages zero LOC removed 21904 packages left us final number 231092 packages package removed test code since mostly interested actual source code packages identify remove test code similar prior work 22 44 48 look term “test” variants file names file paths 231092 npm packages mined 38845 168 packages trivial packages addition examined growth trivial packages npm Figure 2 shows percentage trivial packages published npm per month see increasing trend number trivial packages time approximately 15 packages added every month trivial packages investigated spike around March 2016 found spike corresponds time npm disallowed unpublishing packages 40 npm posts dependedupon packages website 38 measured number trivial packages exist top 1000 dependedupon packages find 113 trivial packages finding shows trivial packages prevalent increasing number also popular among developers making 113 1000 depended npm packages Trivial packages make 168 studied npm packages Moreover proportion trivial packages increasing trivial packages make 113 top 1000 depended npm packages
::::
42 Many Applications Depend Trivial Packages trivial packages exist npm mean actually used Therefore also examine number applications use trivial packages examine packagejson file contains dependencies application installs npm However cases application may install package use avoid counting instances parse JavaScript code examined applications use regular expressions detect require dependency statements indicates application actually uses package code Finally measured number packages trivial set packages used applications Note consider npm packages since popular package manager Nodejs packages package managers manage subset packages eg Bower 9 manages frontendclientside frameworks libraries modules find 38807 applications dataset 4256 109 directly depend least one trivial package 38807 Nodejs applications dataset 109 depend least one trivial package
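The step of detecting require statements with regular expressions could look roughly like the sketch below. The pattern and the helper name `required_packages` are assumptions for illustration; the study does not give its exact expression. The sketch only handles string-literal arguments and does not skip require calls that appear inside comments.

```python
import re

# Matches require('name') or require("name") with a string-literal argument.
REQUIRE_RE = re.compile(r"""\brequire\s*\(\s*(['"])([^'"]+)\1\s*\)""")


def required_packages(js_source: str) -> set[str]:
    """Return npm package names referenced by require() calls in js_source."""
    names = set()
    for match in REQUIRE_RE.finditer(js_source):
        spec = match.group(2)
        if spec.startswith((".", "/")):
            continue  # relative or absolute path, not an npm package
        parts = spec.split("/")
        # Scoped packages look like @scope/name, so keep the first two segments.
        name = "/".join(parts[:2]) if spec.startswith("@") else parts[0]
        names.add(name)
    return names
```

Intersecting the returned names with the dependencies listed in an application's package.json file would give the packages that are both installed and actually used in code.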
::::
5 SURVEY RESULTS surveyed Nodejs developers understand reasons drawbacks using trivial packages use survey allows us obtain firsthand information developers use trivial packages order select relevant participants sent survey developers use trivial packages used Git’s pickaxe command lines contain required dependency statements applications procedure provided us email name developer introduced trivial package dependency Survey Participants mitigate possibility introducing misunderstood misleading questions initially sent survey two JavaScript developers incorporated minor suggestions improve survey Next sent survey 1055 developers 1696 applications select developers ranked based number trivial packages use took sample 600 developers use trivial packages another 600 indicated least use trivial packages survey emailed 1200 selected developers however since emails returned various reasons eg email account exist anymore etc could reach 1055 developers Note package required application exist break application survey listed trivial package application detected trivial package received 88 responses survey translates response rate 83 survey response rate line even higher typical 5 response rate reported questionnairebased engineering surveys 42 88 respondents 83 identified developers working either industry 68 full time independent developers 15 remaining 5 identified casual developers 2 3 including one student two developers working executive positions npm development experience survey respondents majority 67 respondents 5 years experience 14 35 years 7 13 years experience fact respondents experienced JavaScript developers gives us confidence survey responses
::::
51 Developers Consider Trivial Packages Harmful first question survey participants “Do consider use trivial packages bad practice” reason ask question bluntly allows us gauge deterministic way Nodejs developers felt issue using trivial packages provided three possible replies Yes case provided text box elaborate 88 participants 51 579 stated consider use trivial packages bad practice Another 21 239 stated indeed think using trivial package bad practice remaining 16 182 stated really depends circumstances time available critical piece code package used thoroughly tested surveyed developers 579 believe using trivial packages bad practice
::::
52 Developers Use Trivial Packages answered question whether developers think using trivial packages bad practice interested developers resort using trivial packages view drawbacks using trivial packages Therefore second part survey asks participants list reasons resort using trivial packages ensure bias responses developers answer fields questions freeform text ie predetermined suggestions provided gathering responses grouped categorized responses twophase iterative process first phase first two authors carefully read participant’s answers came number categories responses fell Next discussed groupings agreed extracted categories Whenever failed agree category third author asked help break tie categories decided two authors went answers classified respective categories majority cases two authors agreed categories classifications responses measure agreement two authors used Cohen’s Kappa coefficient 10 Cohen’s Kappa coefficient used evaluate interrater agreement levels categorical scales provides proportion agreement corrected chance resulting coefficient scaled range 1 1 negative value means less chance agreement zero indicates exactly chance agreement positive value indicates better chance agreement 18 categorization level agreement measured authors 090 considered excellent interrater agreement Table 1 shows five reasons using trivial packages reported survey respondents another category used group ‘no reason’ responses Table 1 presents different reasons description category frequency reasons listed order popularity R1 Well implemented tested 546 cited reason using trivial packages provide well implemented tested code half responses mentioned reason particular although may easy developers code trivial packages difficult make sure details addressed eg one needs carefully consider edge cases example responses mention issues stated participants P68 P4 cite reasons using trivial packages follows P68 “Tests already written lot edge cases captured ” P4 “There may 
elegantefficientcorrectcrossenvironmentcompatilble solution trivial problem yours” R2 Increased productivity 477 second cited reason improved productivity using trivial packages enables Trivial tasks writing code requires time effort hence many developers view use trivial packages way boost productivity particular early developer want worry small details would rather focus efforts implementing difficult tasks example participants P13 P27 state P13 “ save time think best implement even simple things” P27 “Don’t reinvent wheel task done before” aforementioned clear examples developers would rather code something even trivial course comes cost discuss later R3 Well maintained code 91 less common cited reason using trivial packages fact maintenance code need performed developers essence outsourced community contributors trivial packages example participant P45 states “Also highly used trivial package probable well maintained” Even tasks bug fixes dealt contributors trivial packages attractive users trivial packages reported participant P80 “ leveraging feedback larger community fix bugs etc” R4 Improved readability reduced complexity 91 Participants also reported using trivial packages improves readability reduces complexity code example P34 states “immediate clarity use readability developers commonly used packages” P47 states “Simple abstract brings less complexity” R5 Better performance 34 participants stated using trivial packages improves performance since alleviates need application depend large frameworks example P35 states “ depend huge utility library need part” small percentage 80 respondents stated see reason use trivial packages two cited reasons using trivial packages 1 provide well implemented tested code 2 increase productivity
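The Cohen's Kappa agreement measure described above is the observed agreement between two raters corrected for the agreement expected by chance. The sketch below computes it from two equally long lists of category labels; the function name and the label values in the usage note are illustrative and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one label per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(["R1", "R1", "R2", "R2"], ["R1", "R1", "R2", "R1"])` has observed agreement 0.75 and chance agreement 0.5, which gives a kappa of 0.5.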
::::
53 Drawbacks Using Trivial Packages addition knowing reasons developers resort trivial packages wanted understand side coin perceive drawbacks decision use packages drawbacks question part survey followed aforementioned process analyze survey responses case drawbacks Cohen’s Kappa agreement measure 086 considered excellent agreement Table 2 lists drawback mentioned survey respondents along brief description frequency drawback
Table 2 Drawbacks using trivial packages
Drawback | Description | Resp
Dependency overhead | Using trivial packages results dependency mess hard update maintain | 49 (55.7%)
Breakage applications | Depending trivial package could cause application break package becomes unavailable breaking update | 16 (18.2%)
Decreased performance | Trivial packages decrease performance applications includes time install build application | 14 (15.9%)
Slows development | Finding relevant high quality trivial package challenging time consuming task | 11 (12.5%)
Missed learning opportunities | practice using trivial packages leads developers learning experiencing writing code trivial tasks | 8 (9.1%)
Security | Using trivial packages open door security vulnerability | 7 (8.0%)
Licensing issues | Using trivial packages could cause licensing conflicts | 3 (3.4%)
No drawbacks | | 7 (8.0%)
I1 Dependency overhead 557 cited drawback using trivial packages increased dependency overhead eg keeping dependencies date dealing complex dependency chains developers need bear 7 situation often referred ‘dependency hell’ especially trivial packages additional dependencies drawback came clearly many comments example P41 states people don’t actively manage dependency versions could exposed serious problems P40 Hard maintain lot tiny packages Hence trivial packages may provide well implementedtested code improve productivity developers clearly aware management additional dependencies something need deal Breakage applications 182 Developers also worry potential breakage application due specific package version becoming unavailable example leftpad
issue main reason breakage removal leftpad P4 states Obviously whole leftpad crash exposed issue However since incident npm disabled possibility package removed 40 Although disallowing removal solves part problem packages still updated may break application nontrivial package may worth take risk however trivial packages may worth taking risk Decreased performance 159 issue related dependency overhead drawback Developers mentioned incurring additional dependencies slowed build time increased application installation times example P64 states many metadata download store real code P34 states slow installs make noisy unintuitive attempting cobble together many disparate pieces instead targeted code mentioned earlier cases fact trivial package adds dependency cases trivial package depends additional packages negatively impacts performance even Slows development 125 cases use trivial packages may actually reverse effect slow development example P23 P15 state P23 actually slow team matter trivial package developer hasn’t required read docs order double check rather reading lines source P15 problem locating packages useful trustworthy difficult find relevant trustworthy package Even others try build code much difficult go fetch package learn rather read lines code Missed learning opportunities 91 certain cases use trivial packages seen missed learning opportunity developers example P24 states Sometimes people forget things could lead lack control knowledge languagetechnology using clear example using package rather coding solution lead less knowledge code base Security 80 cases trivial packages may security flaws make application vulnerable issue pointed developers example P15 mentioned earlier difficult find packages trustworthy P57 also mentions depend public trivial packages careful selecting packages security reasons case dependency one takes always chance security vulnerability could exposed one packages Licensing issues 34 cases developers concerned potential 
licensing conflicts trivial packages may cause example P73 states possibly licenseissues P62 risk trivial package might licensed GPL must replaced anyway prior shipping also 8 responses stated see drawbacks using trivial packages two cited drawbacks using trivial packages 1 increase dependency overhead 2 may break applications due package specific version becoming unavailable incompatible
::::
6 PUTTING DEVELOPER PERCEPTION MICROSCOPE developer survey provided us great insights developers use trivial packages perceive drawbacks However whether empirical evidence support perceptions remains unexplored Thus examine commonly cited reason using trivial packages ie fact trivial packages well tested drawback ie impact additional dependencies based findings Section 5 61 Examining ‘Well Tested’ Perception shown Table 1 546 responses indicate use trivial packages since well implemented tested developers good reasons believe npm requires developers provide test script name submission packages listed packagejson file fact 812 31521 38845 trivial packages dataset test script name listed However since developers provide script name field difficult know package actually tested examine whether package really well tested implemented two aspects first check package tests written Second since many cases developers consider packages ‘deployment tested’ also consider usage package indicator well tested implemented 47 carefully examine whether package really well tested implemented use npm online search tool known npms 11 measure various metrics related well packages tested used valued provide ranking packages npms mines calculates number metrics based development eg tests usage eg downloads data use three metrics measured npms validate ‘well tested implemented’ perception developers 1 Tests considers tests’ size coverage percentage build status looked npms source code find Tests metric calculated $\text{testsSize} \times 0.6 + \text{buildStatus} \times 0.25 + \text{coveragePercentage} \times 0.15$ use Tests metric determine package tested trivial packages compare nontrivial packages terms well tested One example motivates us investigate well tested trivial package response P68 says “Tests already written lot edge cases captured ” 2 Community interest evaluates community interest packages using number stars GitHub npm forks subscribers contributors find source code npm Community interest simply
sum aforementioned metrics measured $\text{starsCount} + \text{forksCount} + \text{subscribersCount} + \text{contributorsCount}$ use metric compare interested community trivial nontrivial packages measure community interest since developers view importance trivial packages evidence quality stated P56 says “ Using isolated module welltested vetted large community helps mitigate chance small bugs creeping in” 3 Download count measures mean downloads last three months number downloads package often viewed indicator package’s quality P61 mentions “this code tested used many makes trustful reliable” initial step calculate number trivial packages Tests value greater zero means trivial packages tests find 452 trivial packages tests ie Tests value $> 0$ addition compare values Tests Community interest Download count Trivial nonTrivial packages focus values aforementioned metric values trivial packages however also present results nontrivial packages put results context Figure 3 shows beanplots Tests Community interest Download count figures show cases trivial packages median smaller Tests value Community interest value Download count compared nontrivial packages said observe Figure 3 distribution Tests metric similar trivial nontrivial packages packages Tests value zero small pockets packages values approx 025 06 08 10 case Community interest Download count metrics see similar distributions although clearly median values lower trivial packages examine whether difference metric values trivial nontrivial packages statistically significant performed MannWhitney test compare two distributions determine difference statistically significant p value 005 also use Cliff’s Delta nonparametric effect size measure interpret effect size trivial nontrivial packages suggested 23 interpret effect size value small $d < 0.33$ positive well negative values medium $0.33 \leq d < 0.474$ large $d \geq 0.474$ Table 3 shows p values effect size values observe cases differences statistically significant however effect size small results show
although majority trivial packages tests written
Table 3
Metrics | p value | effect size
Tests | < 2.2e-16 | 0.119 small
Community interest | < 2.2e-16 | 0.269 small
Downloads count | < 2.2e-16 | 0.245 small
Contrary developers’ perception 452 trivial packages actually tests Albeit trivial packages lower Tests Community interest Download count values values metrics seem large difference compared nontrivial packages ie trivial packages similar nontrivial packages terms well tested 62 Examining ‘Dependency Overhead’ Perception discussed Section 5 top cited drawback using trivial packages fact developers need take maintain extra dependencies ie dependency overhead Examining impact dependencies complex wellstudied issue eg 1 12 15 examined multitude ways choose examine issue application package perspectives Applications compared coding trivial tasks using trivial package imposes extra dependencies One problematic aspects managing dependencies applications dependencies update causing potential break application Therefore first step examined number releases trivial nontrivial packages intuition developers need put extra effort assure proper integration new releases Figure 4 shows trivial packages fewer releases nontrivial packages median 2 trivial 3 nontrivial packages hence trivial packages require less effort nontrivial packages fact trivial packages updated less frequently may attributed fact trivial packages ‘perform less functionality’ hence need updated less frequently Next examined developers choose deal updates trivial packages One way application developers reduce risk package impacting application ‘version lock’ package Version locking dependencypackage means updated automatically specific version mentioned packagejson file used stated responses survey eg P8 “Also people don’t lock versions pain” different types version locks ie updating major releases updating patches updating minor releases lock means package automatically updates version locks specified packagejson file next every package name examined frequency
trivial nontrivial packages locked find average trivial packages locked 149 time whereas nontrivial packages locked 117 time However Wilcoxon test shows difference statistically significant pvalue 005 Hence cannot say developers version lock trivial packages Packages package level investigate direct indirect dependencies trivial packages particular would like determine trivial packages dependencies makes dependency chain even complex trivial nontrivial package install count actual number direct indirect dependencies package requires allows us know true direct indirect dependencies package requires Note simply looking json file require statements provide direct dependencies indirect dependencies Figure 5 shows distribution dependencies trivial nontrivial packages Since trivial packages dependencies median 0 Therefore bin trivial packages based number dependencies calculate percentage packages bin Table 4 shows percentage packages respective number dependencies observe majority trivial packages 563 zero dependencies 279 110 dependencies 43 1120 dependencies 115 20 dependencies table shows trivial packages many dependencies indicates indeed trivial packages introduce significant dependency overhead
Table 4 Dependencies Direct Indirect percentage packages
Packages | zero | 1-10 | 11-20 | >20
Trivial | 56.3 | 27.9 | 4.3 | 11.5
Non Trivial | 34.8 | 30.6 | 7.3 | 27.3
Trivial packages fewer releases developers less likely version locked nontrivial packages said developers careful using trivial packages since cases trivial packages numerous dependencies fact find 437 trivial packages least one dependency 115 trivial packages 20 dependencies 7 RELEVANCE IMPLICATIONS common question asked empirical studies implications findings would practitioners care findings discuss issue relevance study developer community based responses survey highlight implications study 71 Relevance Practitioners care start study sure practically relevant study trivial packages However surprised interest developers study fact one developers P39
explicitly mentioned lack research topic stating “There enough research I’ve taking note people’s proposed “quick simple” code handle functionality trivial packages it’s surprised see high percentage times proposed code buggy incomplete” Moreover conducted study asked respondents would like know outcome study provide us email address 88 respondents 66 approx 74 provided email us provide outcomes study respondents hold high level leadership roles npm us indicator study outcomes high relevance npm Nodejs development community 72 Implications Study study number implications engineering research practice Implications Future Research study mostly focused determining prevalence reasons drawbacks using trivial packages Based findings find number implicationsmotivations future work First survey respondents indicated choice use trivial packages black white many cases depends team example one survey respondent stated team less experienced developers likely use trivial packages whereas experienced developers would rather write code trivial tasks issue experienced developers likely trust code less experienced likely trust external package Another aspect maturity survey respondents pointed much likely use trivial packages early waste time trivial tasks focus fundamental tasks However matures start look ways reduce dependencies since pose potential points failure Hence study motivates future work examine relationship team experience maturity use trivial packages Second survey respondents also pointed using trivial packages seen favourably compared using code QA sites StackOverflow Reddit compared using code StackOverflow developer know posted code else uses whether code may tests using trivial package npm much better option case using trivial packages seen best choice certainly better choice Although many studies examined developers use QA sites StackOverflow aware studies compare code reuse QA sites trivial packages findings motivate need study Practical Implications direct 
implication findings trivial packages commonly used others perhaps indicating developers view use bad practice Moreover developers assume trivial packages well implemented tested since findings show otherwise npm developers need expect trivial packages submitted making task finding relevant package even harder Hence issue manage help developers find best packages needs addressed extent npms recently adopted npm specifically address aforementioned issue Developers highlighted lack decent core standard JavaScript library causes resort trivial packages Often want install large frameworks leverage small parts framework hence resort using trivial packages Therefore need Nodejs community create standard JavaScript API library order reduce dependence trivial packages However issue creating standard JavaScript library much debate
::::
8 RELATED WORK Studies Code Reuse Prior research code reuse shown many benefits include improving quality development speed reducing development maintenance costs 3 32 36 37 example Sojer Henkel 43 surveyed 686 open source developers investigate reuse code findings show experienced developers reuse source code 30 functionality open source OSS projects reuse existing components Developers also reveal see code reuse quick way start new projects Similarly Haefliger et al 24 conducted study empirically investigate reuse open source development practices developers OSS triangulated three sources data developer interviews code inspections mailing list data six OSS projects results showed developers used tools relied standards reusing components Mockus 36 conducted empirical study identify largescale reuse open source libraries study shows 50 source files include code OSS libraries hand practice reusing source code challenging drawbacks including effort resource required integrate reused code 16 Furthermore bug reused component could propagate target system 17 study corroborates findings main goal define empirically investigate phenomenon reusing trivial packages particular Nodejs applications Studies Ecosystems recent years analyzing characteristics ecosystems engineering gained momentum 4 5 15 34 example recent study Bogart et al 6 7 empirically studied three ecosystems including npm found developers struggle changing versions might break dependent code Witter et al 46 investigated evolution npm ecosystem extensive study covers dependence npm packages download metrics usage npm packages real applications One main findings npm packages updates packages steadily growing Also 80 packages least one direct dependency package studies examined size characteristics packages ecosystem German et al 21 studied evolution statistical computing GNU R aim analyzing differences code characteristics core usercontributed packages found usercontributed packages growing faster core 
packages Additionally reported usercontributed packages typically smaller core packages R ecosystem Kabbedijk Jansen 30 analyzed Ruby ecosystem found many small large projects interconnected many ways study complements previous work since instead focusing packages ecosystem specifically focus trivial packages Moreover examine reasons developers use trivial package view drawbacks study reuse trivial packages subset general code reuse Hence expect overlap prior work Like many empirical studies confirm prior findings contribution Moreover paper adds prior findings example validation developers’ assumptions Lastly believe study fills real gap since 74 participants said wanted know study outcomes
::::
9 THREATS VALIDITY Construct validity considers relationship theory observation case measured variables measure actual factors define trivial packages surveyed 12 JavaScript developers mostly graduate student professional experience However find clear vote considered trivial package Also although data suggested packages ≤ 35 LOC complexity ≤ 10 trivial packages believe definitions possible trivial packages said 88 survey participants emailed using trivial packages 1 mentioned flagged package trivial package even though fit criteria us confirmation definition applies vast majority cases although clearly perfect use LOC complexity code determine trivial packages cases may measures need considered determine trivial packages example trivial packages dependencies may need taken consideration However experience tells us developers look package dependencies determining trivial said would interesting replicate questionnaire another set participants confirm enhance definition trivial Nodejs package list reasons drawbacks using trivial packages based survey 88 Nodejs developers Although large number developers results may hold Nodejs developers different sample developers may result different list ranking advantages disadvantages mitigate risk due sampling contacted developers different applications responses show experienced developers Also potential survey questions may influenced replies respondents However minimize influence made sure ask freeform responses minimize bias publicly share survey anonymized survey responses used npms measure various quantitative metrics related testing community interest download counts measurements accurate npms however given main search tool npm confident npms metrics distinguish domain npm packages may impact findings However help mitigate bias analyzed 230000 npm packages cover wide range domains removed test code dataset ensure analysis considers JavaScript source code identified test code searching term ‘test’ variants file names 
file paths Even though technique widely accepted literature 22 44 48 confirm whether technique correct ie files term ‘test’ names paths actually contain test code took statistically significant sample packages achieve 95 confidence level 5 confidence interval examined manually External validity considers generalization findings findings derived open source Nodejs applications npm packages hence findings may generalize platforms ecosystems said historical evidence shows examples individual cases contributed significantly areas physics economics social sciences even engineering 19 believe strong empirical evidence built studies individual cases studies large samples
::::
10 CONCLUSION use trivial packages increasingly popular trend development Like development practice proponents opponents goal study examine prevalence reasons drawbacks using trivial packages findings indicate trivial packages commonly widely used Nodejs applications also find majority developers oppose use trivial packages main reasons developers use trivial packages due fact considered well implemented tested However cite fact additional dependencies’ overhead drawback using trivial packages said empirical study showed considering trivial packages well tested misconception since half trivial package studied even tests written however trivial packages seem ‘deployment tested’ similar Tests Community interest Download count values nontrivial packages addition find trivial packages dependencies studied dataset 115 trivial packages 20 dependencies Hence developers careful trivial packages use ACKNOWLEDGMENTS authors grateful many survey respondents dedicated valuable time respond surveys REFERENCES 1 Pietro Abate Roberto Di Cosmo Jaap Boender Stefano Zacchiroli 2009 Structural Dependence Components Proceedings 2009 3rd International Symposium Empirical Engineering Measurement ESEM ’09 IEEE Computer Society 89–99 2 Rabe Abdalkareem Emad Shihab Juergen Rilling 2017 Code Reuse StackOverflow exploratory study Android apps Information Technology 88 C 2017 148–158 3 Victor R Basili Lionel C Briand Walcélio L Melo 1996 Reuse Influences Productivity Objectoriented Systems Commun ACM 39 10 October 1996 104–116 4 Gabriele Bavota Gerardo Canfora Massimiliano Di Penta Rocco Oliveto Sebastiano Panichella 2013 Evolution Interdependencies Ecosystem Case Apache Proceedings 2013 IEEE International Conference Maintenance ICSM ’13 IEEE Computer Society 280–289 5 Remco Bloemen Chintan Amrit Stefan Kuhlmann Gonzalo Ordóñez Matamoros 2014 Gentoo Package Dependencies Time Proceedings 11th Working Conference Mining Repositories MSR ’14 ACM 404–407 6 Christopher Bogart Christian Kästner 
James Herbsleb 2015 Breaks Breaks Ecosystem Developers Reason Stability Dependencies Proceedings 2015 30th IEEEACM International Conference Automated Engineering Workshop ASEW ’15 IEEE Computer Society 86–89 7 Christopher Bogart Christian Kästner James Herbsleb Ferdian Thung 2016 Break API Cost Negotiation Community Values Three Ecosystems Proceedings 2016 24th ACM SIGSOFT International Symposium Foundations Engineering FSE ’16 ACM 109–120 8 Stephan Bonnemann 2015 Dependency Hell Froze httpsspeakerdeckcombonnemanndependencyhelljustfrozeover September 2015 accessed 08102016 9 Bower 2012 Bower package manager web httpsbowerio 2012 accessed 08232016 10 J Cohen 1960 coefficient agreement nominal scales Educational Psychological measurement 20 1960 37–46 11 Andre Cruz Andre Duarte 2017 npmjs httpsnpmjsorg 012017 accessed 02202017 12 Cleidson R B de Souza David F Redmiles 2008 Empirical Study Developers’ Management Dependencies Changes Proceedings 30th International Conference Engineering ICSE ’08 ACM 241–250 13 Alexandre Decan Tom Mens Maëlick Claes 2016 Topology Package Dependency Networks Comparison Three Programming Language Ecosystems Proceedings 10th European Conference Architecture Workshops ECSAW ’16 ACM Article 21 4 pages 14 Alexandre Decan Tom Mens Maëlick Claes 2017 Empirical Comparison Dependency Issues OSS Packaging Ecosystems Proceedings 24th International Conference Analysis Evolution Reengineering SANER ’17 IEEE 15 Alexandre Decan Tom Mens Philippe Grosjean others 2016 GitHub Meets CRAN Analysis InterRepository Package Dependency Problems Proceedings 23rd IEEE International Conference Analysis Evolution Reengineering SANER ’16 Vol 1 IEEE 493–504 16 Roberto Di Cosmo Davide Di Ruscio Patrizio Pelliccione Alfonso Pierantonio Stefano Zacchiroli 2011 Supporting evolution componentbased FOSS systems Science Computer Programming 76 12 2011 1144–1160 17 Mehdi Dogguy Stephane Glondu Sylvain Le Gall Stefano Zacchiroli 2011 Enforcing TypeSafe Linking using 
InterPackage Relationships Studia Informatica Universalis 9 11 2012 129–157 18 J L Fleiss J Cohen 1973 equivalence weighted kappa intraclass correlation coefficient measures reliability Educational Psychological Measurement 33 1973 613–617 19 Bent Flyvbjerg 2006 Five misunderstandings casestudy research Qualitative Inquiry 12 2 2006 219–245 20 Thomas Fuchs 2016 great standard library JavaScript httpsmediumcomthomafuchswhatifwehadagreatstandardlibraryinjavascript52692342ee3f Mar 2016 accessed 02242017 21 German B Adams AE Hassan 2013 Programming language ecosystems evolution r Proceedings 37th European Conference Maintenance Reengineering CSMR ’13 IEEE 243–252 22 Georgios Gousios Andy Zaidman 2014 Dataset Pullbased Development Research Proceedings 11th Working Conference Mining Repositories MSR ’14 ACM 368–371 23 Robert J Grissom John J Kim 2005 Effect sizes research broad practical approach Lawrence Erlbaum Associates Publishers 24 Stefan Haefliger Georg Von Krogh Sebastian Spaeth 2008 Code reuse open source Management Science 54 1 2008 180–193 25 Quin Hanam Fernando N N Brito Ali Mesbah 2016 Discovering Bug Patterns JavaScript Proceedings 24th ACM SIGSOFT International Symposium Foundations Engineering FSE ’16 ACM 144–156 26 Dan Haney 2016 NPM leftpad Forgotten Program httpwwwhaneycodesnetnpmleftpadhaveweforgottenhowtoprogram March 2016 accessed 08102016 27 Rich Harris 2015 Small modules it’s quite simple httpsmediumcomRichHarrissmallmodulesitsnotquitethatsimple3ca5352d5d4e Jul 2015 accessed 08242016 28 Hemanth HM 2015 Oneline node modules issue10sindresorhusama httpsgithubcomsindresorhusamaissues10 2015 accessed 08102016 29 Katsuro Inoue Yusuke Sakai Pei Xia Yuki Manabe 2012 Code Come Go Integrated Code History Tracker Open Source Systems Proceedings 34th International Conference Engineering ICSE ’12 IEEE Press 331–341 30 Jaap Kabbedijk Slinger Jansen 2011 Steering insight exploration ruby ecosystem Proceedings Second International Conference Business ICSOB 
’11 Springer 44–55 31 Eirini Kalliamvakou Georgios Gousios Kelly Blincoe Leif Singer Daniel German Daniela Damian 2014 Promises Perils Mining GitHub Proceedings 11th Working Conference Mining Repositories MSR ’14 ACM 92–101 32 Wayne C Lim 1994 Effects Reuse Quality Productivity Economics IEEE 11 5 1994 23–30 33 Fiona Macdonald 2016 programmer almost broke Internet last week deleting 11 lines code httpwwwsciencealertcomhowaprogrammeralmostbroketheinternetbydeleting11linesofcode March 2016 accessed 08242016 34 Konstantinos Manikas 2016 Revisiting ecosystems research longitudinal literature study Journal Systems 117 2016 84–103 35 Stephen McCamant Michael Ernst 2003 Predicting Problems Caused Component Upgrades Proceedings 9th European Engineering Conference Held Jointly 11th ACM SIGSOFT International Symposium Foundations Engineering ESECFSE ’03 ACM 287–296 36 Audris Mockus 2007 LargeScale Code Reuse Open Source Proceedings First International Workshop Emerging Trends FLOSS Research Development FLOSS ’07 IEEE Computer Society 7– 37 Parastoo Mohagheghi Reidar Conradi Ole Killi Henrik Schwarz 2004 Empirical Study Reuse vs DefectDensity Stability Proceedings 26th International Conference Engineering ICSE ’04 IEEE Computer Society 282–292 38 npm 2016 dependedupon packages httpwwwnpmjscombrowsedepended August 2016 accessed 08102016 39 npm 2016 npm Node Package Management Documentation httpsdocsnpmjscomgettingstartedwhatisnpm July 2016 accessed 08142016 40 npm Blog 2016 npm Blog changes npm’s unpublish policy httpblognpmjsorgpost141953680000changestounpublishpolicy March 2016 accessed 08112016 41 Heikki Orsila Jaco Geldenhuys Anna Ruokonen Imed Hammouda 2008 Update propagation practices highly reusable open source components Proceedings 4th IFIP WG 213 International Conference Open Source Systems OSS ’08 159–170 42 Janice Singer Susan E Sim Timothy C Lethbridge 2008 engineering data collection field studies Guide Advanced Empirical Engineering Springer London 9–34 43 Manuel 
Sojer Joachim Henkel 2010 Code Reuse Open Source Development Quantitative Evidence Drivers Impediments Journal Association Information Systems 11 12 2010 868–901 44 Jason Tsay Laura Dabbish James Herbsleb 2014 Influence Social Technical Factors Evaluating Contribution GitHub Proceedings 36th International Conference Engineering ICSE ’14 ACM 356–366 45 Chris Williams 2016 one developer broke Node Babel thousands projects 11 lines JavaScript httpwwwtheregistercouk20160323npmleftpadchaos March 2016 accessed 08242016 46 Erik Wittern Philippe Suter Shriram Rajagopalan 2016 Look Dynamics JavaScript Package Ecosystem Proceedings 13th International Conference Mining Repositories MSR ’16 ACM 351–361 47 Dan Zambonini 2011 Testing deployment Practical Guide Web App Success Owen Gregory Ed Five Simple Steps Chapter 20 accessed 02022017 48 Jiaxin Zhu Minghui Zhou Audris Mockus 2014 Patterns Folder Use Popularity Case Study GitHub Repositories Proceedings 8th ACMIEEE International Symposium Empirical Engineering Measurement ESEM ’14 ACM Article 30 4 pages
::::
Deliberate change without hierarchical influence case collaborative OSS communities Abstract Purpose – Deliberate change strongly associated formal structures topdown influence Hierarchical configurations used structure processes overcome resistance get things done deliberate change also possible without formal structures hierarchical influence DesignMethodologyApproach – longitudinal qualitative study investigates opensource OSS community named TYPO3 case exhibits formal hierarchical attributes study based mailing lists interviews observations Findings – study reveals deliberate change indeed achievable nonhierarchical collaborative OSS community context However presupposes presence active involvement informal change agents paper identifies specifies four key drivers change agents’ influence Originalityvalue – findings contribute organizational analysis providing deeper understanding importance leadership making deliberate change possible nonhierarchical settings points importance ‘changebyconviction’ essentially based voluntary behaviour open door reducing negative side effects deliberate change also hierarchical organizations Keywords Opensource communities deliberate change change agents change conviction hierarchical influence Introduction widespread agreement research well management practice deliberate change key organisation’s success longterm survival 2005 Teece Pisano Shuen 1997 hand also generally acknowledged deliberate change challenges organisations potentially stresses members disturbs existing structures causes disorder Schumpeter 1934 violates truce existing routines Nelson Winter 1982 drives people comfort zones evokes resistance Hon Bloom Crant 2011 Waddell Sohal 1998 Therefore deliberate change also typically associated strong leaders execution power Kotter 2007 Thus general agreement hierarchical influence particularly needed implementation stage order get things done overcome resistance Somech 2006 Strong leaders also needed promote change 
organisations create sense urgency Higgs Rowland 2011 Yates 2000 happens informal leaders formal positional power organisational members basically left whatever want exactly situation many collaborative communities opensource OSS communities many communities participation voluntary leaders limited formal power known hierarchical organizations communities handle challenges deliberate change without formal power successfully secure efficient consistent planning procedures overcome resistance get things done collaborative communities able change doomed fail long term Differently put mean OSS communities change deliberately Organisational scholars already shown extensive interest OSS communities collaborative communities general MartinezTorres DiazFernandez 2014 Key topics interest include motivation participate contribute collaborative communities Cromie Ewing 2009 Hars Ou 2002 Lerner Tirole 2002 structures division labour Mockus Fielding Herbsleb 2002 governance structures processes communities Demil Lecocq 2006 Markus 2007 coordination communication mechanisms Lee Cole 2003 extant research thus provides detailed picture OSS communities work studies yet examined deliberate change OSS communities studies address change found change OSS communities fluid tacit emergent task execution typically dependent informal structures voluntary contributions members Sharma Sugumaran Rajagopalan 2002 aim study investigate deliberate change accomplished OSS communities specifically empirical foundation research based longitudinal singlecase study Data collected one OSS community called TYPO3 2006–2010 refer deliberate change change intended planned Change therefore residual outcome multitude processes even though might disparities plans outcomes Burnes 1996 2009 Kanter Stein Jick 1992 data collection observed various deliberate change initiatives TYPO3 strategic well organisational level focus paper one strategic change initiative carried order redirect project’s focus towards 
product usability results show deliberate change possible OSS communities change agents play essential role change processes summarise findings model structuring success factors change agents Two main contributions offered First paper advances knowledge change processes nonhierarchical structures OSS communities increasing relevance economic activity relevant know informal nonhierarchical organisations allow executing deliberate change possible organizations likely become old Second much important investigation changes OSS communities gives new insights deliberate change nonhierarchical organisational settings possible shows organisations master ‘change conviction’ ie organisational members forced change accept adapt change voluntarily discuss insights study may used reduce tensions frictions change traditional business organisations well Structure governance OSS communities OSS community consists individuals voluntarily contribute development opensource MartinezTorres DiazFernandez 2014 Opensource freely available public open license based unrestricted access source code Bonaccorsi Rossi 2003 Wellknown examples OSS Linux Firefox Apache Lakhani von Hippel 2003 OSS communities typically demonstrate classic textbook principles organisations form entity distinguishable environment Lawrence Lorsch 1967 ii specific goals Etzioni 1964 iii purposive actions realise goals Mooney Reiley 1939 iv dependent affected external environment Scott 1981 However time OSS communities distinguish traditional business organisations basically open anyone participate participation voluntary high degree selfassignment don’t physical location like headquarters enabled modularization distributed activities allowing rather loosely managed structured development processes leave developers free choose tasks execute Vujovic Ulhøi 2008 Demil Lecocq 2006 argue open license indeed unique contractual framework generated new type governance structure distinct familiar governance modes hierarchy 
network market Although OSS communities differ terms structure size formalisation appears ‘ideal type ground architecture’ identified many communities main characteristics architecture also apply TYPO3 OSS communities often managed twolayer task structure containing core peripheral layer Lee Cole 2003 core consists leaders maintainers leadership projects eg Linux centralised one undisputed leader projects eg Apache committee solves particular leadership tasks disagreements conflicts voting consensus Lerner Tirole 2002 one hand communities align definition shared leadership—“distributed phenomenon several formally appointed andor emergent leaders within group”—and generally focuses emergence leaders Mehra Smith Dixon Robertson 2006 p 233 hand investigations shared leadership stem mainly context organizational teams emphasize importance formal leaders set stage informal leadership roles arise create conditions maximize successful outcome shared leadership teams Denis Langley Sergi 2012 stands contrast OSS communities based formal leadership traditional sense leadership fact required informal leaders emerge OSS communities OSS informal leadership positions emerge reputational gains based “technical acumen managerial skill” Fleming Waguespack 2007 p 165 addition trust requirement leaders selected community O’Mahony Ferraro 2007 Usually founders count leaders earned credibility act leaders contributing initial source code demonstrating expertise leaders typically act visionaries providing recommendations work tasks milestones etc community Another important leadership task attract new members posing challenging programming problems potential contributors Lerner Tirole 2002 p 220 nature leadership OSS communities changes communities grow mature O’Mahony Ferraro 2007 time leaders perform less technical tasks programming organisational building tasks ibid periphery OSS community often structured development bugfixing team Lee Cole 2003 Members periphery loosely connected 
community Task assignment mostly completely voluntary ibid Participation OSS communities driven intrinsic eg fun enjoyment extrinsic eg peer recognition signalling skills career benefits rewards Lerner Tirole 2002 Lakhani von Hippel 2003 p 923 emphasize three motivations participation OSS communities needdriven participation eg need enjoymentdriven participation reputation enhancement Reputation lowranking incentive join contribute OSS community ibid However reputation achieved member’s desire maintain reputation encourages member continue provide quality contributions Sharma et al 2002 structure supported number governance mechanisms help direct control coordinate individual efforts OSS communities Markus 2007 mechanisms include selfassignment tasks Crowston Li Wei Eseryel Howison 2007 peer review Lee Cole 2003 bug reporting voting procedures process determining requirements Scacchi 2002 Collaboration enabled platforms provide infrastructure sharing solutions asking help etc Services tools mailing lists discussion forums archives blogs key infrastructures enable communication collaboration OSS communities Fjeldstad Snow Miles Lettl 2012 OMahony Ferraro 2007 sum OSS communities welldeveloped structures resembling structures traditional business organisations also leaders involved organising structuring processes major difference leaders formal authority thus execution power Participation OSS communities voluntary tasks selfassigned Leaders cannot therefore exert hierarchical influence lead based expertise persuasion power reputation among peers literature called type influence informal leadership De Souza Klein 1995 Hongseok Labianca MyungHo 2006 Lakhani von Hippel 2003 p 923 found informal leaders OSS communities capable organising “mundane necessary” tasks daytoday business also capable mastering challenges change already difficult master formal companies leadership power needed Deliberate change organisations Like organisations OSS communities change concerns 
“organisation’s direction structure capabilities” Moran Brightman 2001 sense nothing unusual basic nature substance change OSS communities resembles basic structure demands organisational change processes Many researchers emphasised process character organisational change Bullock Batten 1985 Hayes 2010 Lewin 1951 Van de Ven Poole 1995 identified 20 models structure change processes different ways However vast majority models identify three key tasks deliberate change processes deal First need change recognised change process initiated Kirzner 1997 need typically results opportunities threats addressed change change initiative put organisation’s agenda order secure action taken Kotter 2012 Organisational change strategic level genuine management task recognition change needs might come ‘ordinary’ employees exclusive right management acknowledge initiatives put agenda Kesting Ulhøi 2010 least traditional business organisations main rationale behind governance structure secure consistency—between different initiatives organisational activities also shareholder stakeholder interests Second deliberate change tends based planning decisionmaking activities 2005 Goals defined information acquired analysed results process management decisions documents like road maps business plans traditional business organisations leaders drive structure process creating sense urgency involving organisational members keeping track process Kotter 2012 distinction deliberate emergent change acknowledged strategy literature Mintzberg Waters 1985 change management literature Liebhart GarciaLorenzo 2010 aspects like contingency choice also included discussion review 2005 shows complex heterogeneous inconsistent distinction paper intend contribute discussion argumentation paper sufficient specify substance deliberate change two attributes purpose reason understanding deliberate change neither implies everything goes according plan goals realised exactly planned way Dunphy Stace 1993 argue 
organizational change takes place dynamic environment organizations adapt plans accordingly background posit deliberate change rule emergent element Rather implies change grounded intention change view corresponds Mintzberg’s 1994 view change element strategy process contrast change completely emergent simply accumulated result series unrelated decisions events change strategic perspective Third change executed decisions implemented means organisation members make effort bring change Also routines altered order adapt change literature conflict resistance caused change del Val 2003 Huy Corley Kraatz 2014 emphasises leadership execution power particularly necessary get things done overcome resistance resolve conflicts Leadership power thus required three tasks however implementation Change often burdens organisations stresses people Leadership power needed change behaviour overcome resistance Traditional business organisations therefore often rely topdown implementation planned change Howell Avolio 1993 Leadership vision needed motivate organisational members challenges handled informal leaders resistance overcome without use formal power governance structure OSS communities handle deliberate organisational change Currently research addressing questions systematically However one concept change leadership offers theoretical grounding answer also important analysis article concept change agent Based Caldwell’s findings 2003 define change agents individuals initiate direct manage andor implement specific change initiatives Like many concepts concept change agents also used heterogeneously Wylie Sturdy Wright 2014 closely related concepts like product champions literature Ginsberg Abrahamson 1991 key point study change agents individuals drive change initiatives ie create momentum ensure decisions made actions taken change agents assume complex sensemaking Brown Colville Pye 2015 sensegiving Petkova Rindova Gupta 2013 roles essential attract collective attention gain 
legitimacy change initiatives Change agents assigned leaders formal given responsibilities even outsiders like consultants Volberda Van Den Bosch Mihalache 2014 However traditional business organisations authorised supported formal leaders Therefore activity change agents also based hierarchical influence even though mostly indirectly change agents thus might power order change supporting formal leaders possess power case sensegiving ie “the processes strategic change framed disseminated organization’s constituents” Fiss Zajac 2006 p 1173 particularly relevant change agents attract management attention promote initiatives outlined deliberate change cannot decided enforced management OSS communities like traditional business organisations Even initiatives come core based initiative promoted community sensegiving may particularly relevant change agents way attract attention community andor even attract media attention order promote change initiatives Sensegiving support positions “symbolic struggles purpose direction organization” Fiss Zajac 2006 p 1173 coming periphery requires even initiative change OSS community deliberately Therefore expected change agents play important role However conditions fundamentally different OSS communities management support hierarchical influence upon draw change agents realise change initiatives Methods Two main criteria guided selection focal case First case representative example OSS community Second community mature case already established formalised work procedures guidelines rules Studying change developed growing community would hold promises providing intensive rich case would “manifest phenomenon interest intensely extremely” extreme cases may distort manifestation phenomenon Patton 2002 p 234 Accordingly selected OSS community named TYPO3 study line research objective first identified deliberate changes various stages followed process underlying changes tracing mechanisms used address changes unit analysis community ie 
focus intraorganisational level Study setting TYPO3 public since 2000 time study community experiencing continuous growth see Figure 1 TYPO3 system enterpriseclass content management system CMS offering outofthebox operation standard modules httptypo3org system aimed two different groups authors ii administrators content managers TYPO3’s core team members play central role community contribute source code manage design development voluntary basis study started approximately half core team members ie nine individuals comprised project’s RD committee members also belonged project’s teams working groups Moreover members committee could described project’s central coordination body responsibilities included supervising coordinating development ii providing knowledge contacts financial support iii supervising supporting communitydriven teams chose committee point departure study responsibilities 855 discussions focussing governance issues Table 1 relevance RD committee members informants undeniable addition interviewing seven RD committee members two core team members interviewed directly involved specific organisational changes joining core team ie still belonged community’s periphery study unfolded hundreds informants pertaining community’s periphery became involved observations relevant mailing lists TYPO3 website Table 2 Table 1 Figure 1 Starting year 2003 TYPO3 began grow fast number registered developers doubled year 2003 2005 continuous growth trend set stage community changes focus study time lag growth registered 2003 2005 Figure 1 start data collection process 2006 necessary see community would respond growth Data sources Multiple sources data Table 2 employed strengthen design study capture complexities case question data sources allowed us triangulate data validate theoretical constructs data collected several occasions 2006 2010 study began TYPO3 addressing organisational issues surfaced growing size community However soon discovered TYPO3 experienced 
organisational challenges past Therefore learning project’s history prior development important illuminating current development collected data interviews observations facetoface RD committee meetings three relevant community mailing lists archival data introductory interview founder also acted leader 2000 2007 provided deeper understanding community history development point structure internal work processes products current future strategies rest interviews community manager TYPO3 Association RD committee core team members—some recently made move periphery core community—were focussed managing deliberate changes TYPO3 interviews addressed following main themes change initiatives ii activities roles practices related identified change initiatives iii motivation iv background interview guide used throughout process new relevant information emerged specific community changes additional questions incorporated following interviews interviews lasted 60 minutes average recorded transcribed Furthermore twoday period 2006 18 hours spent observing facetoface meetings among RD committee members method yielded insights range organisational issues related community’s development background deliberate change initiatives review 235 posts RD committee mailing list gave access content type discussions contributions roles various individuals work coordination delegation particular source information allowed us obtain deeper understanding organisational challenges facing community time period challenges resolved interviews observations RD committee’s meetings RD committee mailing list together led uncovering number change processes TYPO3 community Additional relevant mailing list data namely humancomputer interaction HCI team’s mailing list core team’s mailing list included data collection Using archival data allowed us crosscheck facts uncovered observation activities interviews Data analysis Since interested deliberate changes possible specific context case study design deemed 
appropriate specifically studying contemporary activities andor events researcher limited control case study research obvious choice Yin 1994 Qualitative techniques used analyse data Eisenhardt 1989 Miles Huberman 1984 Strauss Corbin 1998 Overall analysis focussed organisational practices change structuring paying specific attention grounded concepts proceeded three steps First constructed case studies Eisenhardt 1989 identified organisational change initiative focussed major change initiatives affected entire community time study four change initiatives ongoing reorganisation product development ii establishment nonprofit organisation called TYPO3 Association central hub support active developers iii installation usability mindset thus replacing strong technical mindset community iv restructuring entire community create efficiency transparent structure clear responsibilities increased team autonomy Although general character three initiatives structural one cultural usability initiative changes involved changes structures practices Second divided coding process open axial selective coding employed constant comparative method within coding phase identify concepts relationships relevant type change Locke 2001 Strauss Corbin 1998 Third crosscase analysis Eisenhardt 1989 Miles Huberman 1984 used identify similarities differences across three change types process repeated several times time resulting conceptual insights refined developed analysis generated four core categories represent mechanisms employed TYPO3 address deliberate changes Table 4 interviews observations RD committee meetings data three mailing lists enabled us determine precisely timing order deliberate changes intended effects data sources used trace unintended emergent effects identified deliberate changes However three mailing lists documented reactions lack reactions entire community played central role interviews played central role establishing timeline parts change processes eg decision making 
took place offline preliminary findings presented discussed leader two core team members provided valuable comments confirmed elaborated upon uncovered theoretical constructs Findings observed multiple change initiatives community successful less successful significant summarised Table 3 Change agents played decisive role key tasks observed change management processes recognition decision making implementation observed initiatives one change agent originated community’s core One reason prevalence core member change agents might fact identified initiatives major expected widescale effect community Table 3 sketch four change initiatives Table 3 elaborating aims initiative ii made deliberate iii specifying change agents iv whether implementation successful first change initiative “Reorganization product development” launched product development process inefficient characterized lack release discussions core community community’s failure test enough different versions failure read existing instructions different contributions ie release management procedures testing instructions poor planning subprojects eg many postponements unrealistic deadlines part Core Team capacity respond inquiries proposals general input meeting arranged potential solutions discussed demonstrating explicit intent plan execute needed change Core Team RD Committee member charge release process time proposed solution subsequently adopted Release management consequently improved introducing rotating release manager function July 2007 change process RD Committee’s tasks taken Core Team one hierarchical layer got removed created flexibility readiness Core Team easier access new contributions Additionally core development mailing list opened created direct communication channel core periphery activity level increased drastically mailing list initiative doubled amount incoming patches core list thus freed Core Team members also able pursue larger projects much higher extent initiative thus successfully 
implemented second change initiative “Founding nonprofit organization called TYPO3 Association” intended create committee structure resembled functional organizational structure consisted establishing nonprofit organization called TYPO3 Association initiated founder complex task demanded deliberate action took many discussions especially Core Team meetings TYPO3 conferences main goals Association support core development steadier basis improve efficiency “providing central hub support active developers well concentrate members pool regular contributors” mailing list TYPO3 Association meant support core development providing funds take care development taken care commercial interests One way donations ie individuals earn income part using open source choose give income back community form donations Another way membership ie firms individuals could become members Association paying annual fee used sponsor development TYPO3 Furthermore Association able create transparency regarding decisionmaking roles activities change initiative thus successfully implemented Association created period growth goaloriented integrative leadership board whose chairman leader third change initiative “New team structure” deliberate direct response rapid community growth founder change agent behind initiative sought make particular responsibilities tasks explicit order create transparency activities upper echelons Association team level therefore determined following apply team leaders’ tasks leaders solely responsible team ii members appointedaccepted leader iii decisions made leader however agreement sought team members far possible iv delegation tasks encouraged v minimum timeframe set leader’s response team members’ requests defining responsibilities community attempted introduce measure accountability team performance considered vital virtual context due voluntary nature participation formalize responsibilities tasks founder thus introduced “team contracts” contracts served purpose 
creating synergy already existing teams elaboration written mission statement minimum contained following team information team’s position organizational structure ie committee team belong description team’s mission specification team’s responsibilities name team leader rules becoming team member Although contracts introduced tasks still taken selfassignment motive underlying team contracts define two aspects responsibility authority However team contracts never really gained momentum attempts introducing formal authority team level succeed either initiative failed attempted structure left degrees freedom contributors type executed authority resembled hierarchy Demil Lecocq 2006 Powell 1990 unintentionally led authority erosion accentuated need autonomy regard following one’s “personal itch” Finally aim fourth initiative “Installing usability mindset” redirect project’s focus towards product usability time project’s focus almost entirely technical nature limited product’s appeal customer segments low technical skills eg secretary edits content company website “A lot OSS created technicians technicians … users use every third week don’t demand many functions demand don’t need remember works using every third week” interview founder wish introduce greater degree product usability put forward newcomer TYPO3 community 2001 newcomer ie periphery member community became change agent made explicit decision launch process change making initiative case deliberate change designer profession realized need TYPO3 improve design idea remained background 2006 leader established humancomputer interaction HCI team appertaining mailing list intended act “the melting pot ideas usability improvements” HCI team mailing list However progress slow breakthrough first came change agent started making focussed effort implement usability idea end change initiative successfully implemented findings based analysis observed initiatives community selected fourth initiative “Installing usability 
mindset” representative initiative illustrate general traits organisational change mechanisms drove success change initiatives focusing presentation study’s results one particular change initiative intention promote clarity comprehensibility findings following present findings consist four mechanisms analysis revealed central drivers successful deliberate change management community Table 4 Table 4 Individual initiative data first reveal community cannot expected embrace change initiative regardless inherent value community unless persistent change agent bring initiative point inception successful implementation direct consequence absence formal power hierarchical influence OSS communities Since community members cannot ordered something persuaded become active change agent HCI expressed difficulties saying “You find developers interested design topics don’t really get far that’s experienced HCI team…a lot” interview change agent Even change agent right idea engages right community members enough set change motion consequence change agent persevered four years concept usability penetrated prevailing mindset culture community Persistence involves high dose patience primarily community also needs time adapt organisational changes need pointed one core member TYPO3 “There gap design organisation letting organisation accumulate around design…giving time people flock teams” RD committee meeting core member found clear indications less organisational planning decision making individual effort achievement motivate community members contribute change initiative decisions matter OS communities one thing matters actually done Post factum situation things people make decisions make decision doesn’t mean people motivated implement work thing matters action Consult people hook knowledge resources hope would like expect… think service providers RD committee meeting core member one key statements investigation outlining structure individual initiative clearly possible view also
supported founder TYPO3 short statement “First things others follow” interview founder taking action change agent HCI reflected upon motivated developers work leader found key driver leader’s “front guy guru status” fact “he usually keeps promises able huge workloads” interview change agent Based insight change agent tried motivate others participate HCI team “I tried find guys motivated work work me” interview change agent success approach evident already 2007 change agent became HCI team leader success also recognised community members Someone usability mailing list comes nifty goodlooking screenshot proposes usability changes core developers fascinated go implement seems like really great idea Especially change agent successful way getting suggestions implemented he’s HCI team leader interview core member even don’t know many seen PDF change agent produced saw also met Frankfurt PHP conference core team member name joined meeting leader—and hard impressive work done core team mailing list core member end found role change agents communities similar product champions experience progress time persistent enthusiastic effort Tushman Anderson 1986 Persistence leading example traits define change agent’s degree individual initiative Persistent change agents able selfmotivate selfdirect performance ie exercise selfleadership Manz 1986 essential part organisational change initiative OSS communities takes great deal time persuasion garner acceptance support organisational change change agent demonstrating high levels commitment personal motivation skills may develop mutual cognitivebased trust turn may strengthen community members’ readiness engage collaborate Chowdhury 2005 McAllister 1995 Thus put forward following proposition grounded similar behaviours observed three change initiatives Table 3 Proposition 1 individual initiative change agents positively related successful implementation deliberate organisational change initiatives communities Reputation reputation 
lending Power struggles visible change process initiative instance observed RD committee meeting one member left room frustrated rest group support views arguing excessively predetermined team structure implemented However lost debate arguing stance change agent responsible particular change initiative higher status within community later revealed opposing member actually right team structure fact prescriptive example shows difficult accomplish anything without support community members higher social statuses difficulty exists even difference social status change agent supporting highstatus member rather low eg members core team find lending reputations lowerstatus members highstatus members share influence clearly recognised founder “And clear individuals kind naturally given power example natural individuals appoint close us easily gain influence” interview founder situations change agent rather lower status community case early days HCI team change agent gain influence teaming one community members enjoy highstatus reputation case HCI team change agent “did lot work founder” establish worthy community member Eventually invited TYPO3 Board meeting discuss usability issues “With founder T3 Board talked Drupal easier TYPO3 WordPress easier TYPO3” linking highstatus members way change agent gained respect support highstatus core members addressed change agent complimentary terms praised work “As usability guru please give feedback description two mentioned features page tree below…” core team mailing list core member appointed HCI team leader evident yet gained respect members systematically circumventing HCI team instead discussed usability issues core team’s mailing list effort made redirect attention towards HCI team particular towards role change agent endorsing building authority examples include way user interface change committed get approval change agent core team mailing list core member agree anyone else properly educated questions trust anyone else HCI 
field TYPO3 one showed good HCI skills far change agent one core team mailing list core member might also watched podcast issue 2 change agent demonstrates great ideas usability improvements TYPO3 seen PDF 3 core team mailing list core member subsequent period activity levels HCI team increased significantly However seemed obvious relationship content change initiatives skills highstatus members supporting initiatives finding implies potential spillover effect reputations rooted technical contributions reputations rooted organisational contributions also instances highstatus members eg team leaders core team members respected members met change agents halfway data show leaders TYPO3 work community’s initiatives process mutual adjustments leaders notice promising initiatives assess try provide necessary resources tried motivate build team around noticed way try enable people work It’s bit intuitive also working already ten years system foundation something like probably already laid couple years back interview community manager type leadership emphasises intuition alertness main task consists providing support change initiatives form knowledge resources without making decisions behalf community members Rather leaders establish infrastructure framework hopefully assist community change agents paving way intended improvements changes Highstatus members lend lateral authority reputation change agent providing type visible support even verbal nature One reason method works highstatus members’ support provides change agent credibility crucial initiative stand chance implemented Markus Benjamin 1996 finding suggests community leadership shared via reputation lending also facilitates organisational changes communities Therefore based similar behaviours observed three initiatives Table 3 make following prediction Proposition 2 Reputation lending high status lower status members positively related successful implementation deliberate organisational change initiatives 
communities Changeoriented communication found communication change initiatives essential successful implementation meetings presentations small large target audiences various community events change agents TYPO3 communicated rationales arguments behind initiatives Still took change agent behind HCI initiative long time realise communicating idea usability vital success change agent attracted support usability initiative communicating changeoriented fashion basic ideas behind concept several rounds presentations developer community “This founder decided maybe need find change point view guide developers different direction—so typical marketing communication thing” interview change agent 2007 2008 change agent tried motivate community communicating relevance usability TYPO3 presentations community’s main yearly events first presentation usability flaws ten major usability flaws … Developer Days 2007 2008 T3Con held presentation done positive way usability solutions future interfaces like example interfaces “Minority Report” … look back second phase motivate people saying “Look that’s possible work together” “Wouldn’t fun amazing interfaces there” interview change agent observed projects presentations helped change agents gain community’s trust capabilities showed presentations could really get done kind trusted words said usually it’s inner circle developers developers could trust language comes strange design guy says “You everything wrong change everything don’t even knowledge understand wrong” doesn’t really end trust interview change agent addition establishing trustworthiness change agent Gurtman 1992 changeoriented communication process TYPO3 also helped stimulate community members participate process also aimed educate target audience attempted changes community developers target “Then Usability Week started way educate people” interview change agent facilitation community participation resembles particular dimension shared leadership called voice known 
increase person’s social influence among members community Carson Tesluk Marrone 2007 change initiatives successful outcome change agents excelled initiating facilitating constructive changeoriented dialogue debates around community achieve needed changes Thus voice boosted change agents’ level social influence increasing immersion participation various means opening core team’s mailing list set rules rest community implementing rotating release managers presenting ideas community events establishing Usability Week Voice form changeoriented communication may associated successful change implementations voice based interpersonal events promote communication feedback according Ryan Deci 1985 catalyse feelings competence thereby stimulate intrinsic motivation Based similar behaviours exhibited three initiatives Table 3 make following prediction Proposition 3 Changeoriented communication positively related successful implementation deliberate organisational change initiatives communities Motivation challenging tasks selfassignment principle Crowston et al 2007 one major challenges opensource communities motivating developers work tasks uninteresting necessary complete Lakhani von Hippel 2003 see problem extends organisational change initiatives also recognised change agent HCI “… usability topics really challenging developers usually It’s removing staff making staff simple that’s usually challenge developers It’s challenge designer” interview change agent resulting challenge put generally one member core team “We uncertain get people boring timeconsuming essential tasks” interview core team member Working usability demanded developers overcome three fundamental tasks First developers needed become motivated work usability issues Second TYPO3 community attract skilled designers possessed necessary knowledge regarding usability Third change agent find way stimulate developers follow designers’ recommendations motivate developers work usability issues change agent came 
idea create “fake challenges … motivate finish goals” interview change agent approach based idea developers would willing work tasks perceived challenging came idea ‘Usability Week’ concept pretty simple rented castle one week locked 30 developers castle certain task needed solve within one week challenge way needed solve problem one week kind tough problems took huge solve one week challenge even task simple time pressure interview change agent Usability Week five mixed teams created team consisted three developers one core developer one manager one designer day event three meetings took place meetings designed streamline tasks motivate teams attract designers TYPO3 community usability change agent used different set tools created entrance barrier designers needed overcome could join community major wish Usability Week wasn’t solve tasks find designers able motivated join TYPO3 community idea make interesting make little bit complicated apply Usability Week 60 70 applications 30 places end five designers 50 could join somehow charmed could attend others couldn’t really worked really stuck today design work interview change agent Finally motivate developers change agent needed make tasks related usability issues challenging achieved incorporating novel task structure content ii freedom execute tasks different way usual simple problems change agent successfully motivated developers solve problems example structure website something called ‘page tree’ looks like tree Explorer Windows machine that’s kind old style done … However framework called XJS written Java Script interesting developers it’s new technology way new framework it’s hard implement need change lot decided use XJS page tree even don’t need would sure end would page tree wished would challenging task actually instead writing lines change page tree interview change agent really freedom totally change core… Actually way … worked… taking beta version 39 back time coded anything liked inside core Usually 
someone creates extension told “never touch core file” could really go deeply inside delete files replace files totally focus keeping compatible old code compatible old … extensions interview developer case HCI Usability Week turned quite successful challenged whether could reach goals really moved hugely forward one week … end say didn’t reach goals … got pretty far really gave whole usability new motivation interview change agent selfassignment tasks prime mechanism work division task allocation OSS communities obviously issue tasks attract enough interest consequently remain undone Task challenge refers continuum ranging lowto highstimulation tasks eg highly routinized tasks versus nonstandardized original tasks case TYPO3 shows increases task challenge due example entrance barriers competition level withintask stimulation task novelty freedom execute task new way compensate initial lack personal desire would normally drive selfassignment tasks analysis shows case tasks related implementation organisational change initiatives change agent needs increase perceived task challenge accordance skills interests targeted members Thus task challenge seen dynamic factor dependent persontask interaction Campbell 1988 Task challenge associated increased participation appeals intrinsic motivation primary motivational factor opensource communities Lakhani Wolf 2005 turn increased participation improves performance Hackman Oldham 1976 Herzberg 1959 Furthermore creating entrance barriers team membership proved effective activating sense achievement recognition stimuli Herzberg 1959 Hence based three observed change initiatives Table 3 make following prediction Proposition 4 Increased task challenge positively related successful implementation deliberate organisational change initiatives communities Discussion study offers first comprehensive investigation deliberate change OSS communities presents clear indications OSS communities indeed capable changing deliberately therefore 
doomed fail long run change deliberate desired community member—the change agent—and supported sufficient coalition within community observed HCI change initiative carried clear goal improving usability TYPO3 study also shows OSS communities deliberate change highly dependent change agents play essential role managing key tasks change processes change agents recognise need change translate organisational goals ii create sense urgency convince community members make decisions matter iii push change process ensure things getting—often things clear contrast hierarchical business organisations change mostly driven leaders positional power andor special functions change agents play secondary role background study deliberate change OSS communities focuses investigation change agents success drivers initiatives insights study summarised simple model Figure 2 findings first relevant research nonhierarchical organizational settings OSS communities provide insights area vastly underresearched far addition knowledge change important collaborative communities traditional business organisations allows designing change processes purposefully ii provides insights longterm behaviour collaborative communities relation competitive environment long based similar governance structure good reason assume findings also apply types communities practice related development BridwellMitchell 2015 gives broader relevance findings since importance communities increasing information knowledgebased economy O’Mahony Ferraro 2007 However findings study also include quite interesting relevant findings go beyond communities also concern change processes traditional business organisations way paper also contribute broader change literature elements change model completely new already know change agents informal power leadership investigations contexts new important however complete absence formal power prevent execution deliberate change critical role change agents drive process OSS leaders core team 
members formal command authority enforce decisions von Hippel von Krogh 2003 also clearly illustrated especially third change initiative “New team structure” Table 3 leader founder change agent Although kept team contracts agenda two years unable implement initiative kind formal fiat community initiative would probably lead different outcome OSS communities “do rely employment contracts unable governed formal authority case hierarchy” Demil Lecocq 2006 p 1454 allows quite interesting perspectives insights first important finding apparent irrelevance decision making hierarchical sense expressed community members point needs clarification mean deliberate planning decision making taking place OSS communities Instead statements relate power structure article Finkelstein 1992 distinguished various forms management power outlined OSS communities characterised inherent absence formal power ‘structural power’ terminology Finkelstein 1992 p 509 ie “legislative right exert influence” others forms informal power like ‘expert power’ ‘prestige power’ exist OSS communities play important role informal leadership provides foundation significance community’s core team Fleming Waguespack 2007 O’Mahony Ferraro 2007 Individual initiative proposition 1 mechanism change resembles change factors observed ‘traditional’ organizations formal leadership ie hierarchies Demil Lecocq 2006 Similarly community change agents agents hierarchies make use exemplary change leading example Kotter 2012 Also individual initiative bears resemblance tasks performed change champions Ulrich 1997 product champions Day 1994 providing impetus strongly promoting change initiative However apparent irrelevance decision making community change points structural power deficit change agents regard change initiatives Change agents able convince relevant community members decisions made tasks distributed often result action situations decisions relevant legitimise activities change agents trigger action Often change 
agents keep pushing get things done cases complete tasks background individual initiative strategy exert influence without formal power Yet noted strategy works locally informal power still needed change agents points Individual initiative might even result acquisition expert prestige power makes change agents abilities visible date meaning individual initiative structure lowpower contexts well understood might expected individual initiative also plays role highpower contexts strategy exert influence without power However research needed regard Another interesting point observations named ‘reputation lending’ proposition 2 already research reputation advancement communities organisations without vertical lines authority Fleming Waguespack 2007 Research knows lot authority means flat hierarchies ii authority acquired Dahlander O’Mahony 2011 context hierarchies reputation lending parallels coalition formation support building gaining sponsorship individuals organizational clout formal authority access resources Connor 1998 Day 1994 Kanter 1994 Kotter 2012 actions help legitimize change initiative change agent well create acceptance change affected Buchanan Boddy 1992 Conceptually reputation lending also somewhat close leader support hierarchies Amabile Schatzel Moneta Kramer 2004 Leader support means using formal power managers support activities lesspowerful organisational members often relation innovation change activities support include resources time autonomy support organisational decision making Mumford Scott Gaddis Strange 2002 contrast reputation lending implies using informal power community leaders support change agents activities mostly giving recognition letting participate board meetings decisionmaking procedures making initiatives visible community informal form support described far literature Still interesting elements visibility acceptance play minor role leader support finding indirectly confirms research showing importance informal networks policy 
systems change agent success Battilana Casciaro 2012 also discovered interesting findings regards motivation community members carry changerelated tasks discussed conceptual section motivation already focus previous research Lakhani von Hippel 2003 found participation OSS communities quite rewarding since “98 effort expended information providers fact returns direct learning benefits providers” p 923 However observed changerelated tasks rewarding rather challenging motivate community members work regard observed strategy socalled ‘fake challenges’ proposition 4 underlying approach combine unattractive tasks motivating elements like competitions social gatherings interesting early description principle fence episode novel Adventures Tom Sawyer Mark Twain 1876 readers perhaps remember Tom paint Aunt Polly’s fence punishment dirtied clothes fight hated work however one friends came spot Tom able create impression privilege pleasure paint fence even able sell painting permissions fellows sense change agent successful creating sense exclusivity restricting spaces challenge transformed boring work socially attractive event knowledge strategy described research OSS communities far Ultimately strategy creating challenging tasks expected improve community members understanding sense ownership change initiative eventually enhance motivation participate executing change sense approach objective instance empowerment organizational members important element change leadership literature within context hierarchies Caldwell 2003 Gill 2003 Goffee Scase 1992 strategies thus seek remove obstacles change fact other’s opposites One strategy uses task design deal downsides innate characteristic OSS communities ie member autonomy however seeks increase member autonomy hierarchical setting strong administrative controls provide formal powers supervise regulate behaviour organizational members Demil Lecocq 2006 Although change processes theorized practiced variety ways one finding 
deliberate change OSS communities mostly common change hierarchies related changeoriented communication proposition 3 frequent communication change agents create opportunities organizational members understand give input change process Kotter 2012 Practicing openness widespread communication Buchanan Boddy 1992 change process increases chance successful implementation organizational communication plays central role eroding existing path dependencies Cohen Levinthal 1990 thus paving way organizational change Yet important finding study perhaps observation OSS communities succeed handling deliberate change processes without formal preassigned power Certainly informal power persuasion group pressure relevant manage deliberate change OSS communities certain extent Situations arise organisational members faced decision accept change leave community Still community member ordered accept change like traditional business organisations Nobody laid sanctioning possibilities generally limited community members comply change believe least accept majority decision change supported critical mass community successful call type deliberate change ‘change conviction’ relevant people comply change voluntarily good chance negative side effects resulting enforcement reduced even though completely eliminated group members might submit change unwillingly leave community Indeed found indications data even though directly looking convinced findings may also applicable hierarchical business organisations latter learn lot OSS communities reduce level enforcement change processes thereby decreasing levels demotivation insecurity resistance Consequently relevance findings much broader concern nonhierarchical settings OSS communities helps shed additional light deliberate organisational change general research however needed substantiate findings clarify impact different elements change negative side effects explore possibilities traditional business organisations Managerial implications 
obvious managerial implication communities need aware central role change agents deliberate change organise change processes accordingly study emphasizes role importance individuals taking initiatives responsibilities outlining critical success factors realizing deliberate change nonhierarchical settings OSS communities Another implication hierarchical organizations need also reconsider use appreciation change agents including selfappointed ones Change agents already used hierarchical business organisations often unsystematic way However results study suggest would useful base major change projects change agents well decisions made change agents simply assigned endowed necessary power supported top managers Contrary nonhierarchical case analysed study specific individual initiative needed point hierarchical organisations Still might important change agents care usual second driver model build reputation right person organise change process among organisational members involved two last drivers point communication education well motivation convinced lot done smooth change projects hierarchical business organisations might even possible establish regime change conviction Limitations future research first limitation study theoretical nature investigating deliberate change OSS communities touching variety different themes including leadership reputation building informal power motivation innovation others themes developed many might potentially offer new insights sake rigour decided focus change meaning change agents drivers change agent success targeted study primarily toward research conversations communities change decision made keep study focused detailed Second study looking organisational context factors mediate effect success drivers change agent activities like cultural context size age community degree formalisation others also look antecedents change agent activities means study far offering complete model change agent activity communities Still think 
propositions useful stepping stones towards holistic model Analysing classic concepts andor phenomena deliberate change entirely different newer organizational regimes important helps clarify organizational settings work also sheds new light phenomenon investigation study realization phenomenon manifested form selfappointment change agents necessary phenomenon exist completely different nonhierarchical organizational setting also holds potential applied hierarchical settings Conclusion study provides evidence indeed possible change complex organisations deliberately without formal power hierarchical influence change initiatives observed grounded individual commitment change agents However also found success change agents’ initiatives depended ability get sufficient support within organisation Key drivers individual initiative reputation reputation lending changeoriented communication education motivation challenging tasks reason assume insights also hold broader range organisations including hierarchical business organisations relevant indications change conviction reduces negative side effects deliberate change References Amabile Schatzel E Moneta G B Kramer J 2004 Leader behaviors work environment creativity Perceived leader support Leadership Quarterly 15 1 532 Battilana J Casciaro 2012 Change Agents Networks Institutions Contingency Theory Organizational Change Academy Management Journal 55 2 381398 Bonaccorsi Rossi C 2003 Open Source Succeed Research Policy 32 12431258 BridwellMitchell E N 2015 Collaborative Institutional Agency Peer Learning Communities Practice Enables Inhibits MicroInstitutional Change Organization Studies Brown Colville Pye 2015 Making sense sensemaking organization studies Organization Studies 36 2 265277 Buchanan Boddy 1992 expertise change agent London Prentice Hall Bullock R J Batten 1985 Phase Going Review Synthesis OD Phase Analysis Group Organization Studies 10 4 383412 Burnes B 1996 thing one best way manage organizational change 
Management Decision 34 10 11 Burnes B 2009 Managing change strategic approach organisational dynamics 5th ed Harlow England New York Prentice HallFinancial Times R 2005 Organisational change management critical review Journal Change Management 5 4 369380 Caldwell R 2003 Models change agency fourfold classification British Journal Management 14 131142 Campbell J 1988 Task complexity review analysis Academy Management Review 13 1 4052 Carson J B Tesluk P E Marrone J 2007 Shared leadership teams investigation antecedent conditions performance Academy Management Journal 50 5 12171234 Chowdhury 2005 role affect cognitionbased trust complex knowledge sharing Journal Managerial Issues 17 3 310327 Cohen W Levinthal 1990 Absorptive capacity new perspective learning innovation Administrative Science Quarterly 35 1 128152 Connor R 1998 Managing speed change Chichester UK John Wiley Sons Cromie J G Ewing 2009 rejection brand hegemony Journal Business Research 62 218230 Crowston K Li Q Wei K Eseryel U Howison J 2007 Selforganization teams freelibre open source development Information Technology 49 564–575 Dahlander L OMahony 2011 Progressing Center Coordinating Work Organization Science 22 4 961979 doi 101287orsc11000571 Day 1994 Raising radicals Different processes championing innovative corporate ventures Organization Science 5 148173 De Souza G Klein H J 1995 Emergent leadership group goalsetting process English Small group research 26 4 475496 del Val P 2003 Resistance change literature review empirical study Management Decision 41 2 148 Demil B Lecocq X 2006 Neither market hierarchy network emergence bazaar governance Organization Studies 27 10 14471466 Dunphy Stace 1993 strategic management corporate change Human Relations 46 8 905920 Eisenhardt K 1989 Building theories case study research Academy Management Review 14 4 532 Etzioni 1964 Modern organization Englewood Cliffs NJ PrenticeHall Inc Finkelstein 1992 Power top management teams dimensions measurement validation 
Academy Management Journal 35 3 505538 Fiss P C Zajac E J 2006 symbolic management strategic change sensegiving via framing decoupling Academy Management Journal 49 6 11731193 Fjeldstad Ø Snow C C Miles R E Lettl C 2012 architecture collaboration Strategic Management Journal 33 734750 Fleming L Waguespack 2007 Brokerage Boundary Spanning Leadership Open Innovation Communities Organization Science 18 2 165180 Gill R 2003 Change management — change leadership Journal Change Management 3 4 307318 Ginsberg Abrahamson E 1991 Champions change strategic shifts role internal external change advocates Journal Management Studies 28 2 173190 Goffee R Scase R 1992 Organizational change corporate career restructuring managers’ job aspirations Human Relations 45 4 363384 Gurtman B 1992 Trust distrust interpersonal problems circumplex analysis Journal Personality Social Psychology 62 9891002 Hackman J R Oldham G R 1976 Motivation design work test theory Organizational Behavior Human Performance 16 2 250250 Hars Ou 2002 Working free Motivations participating opensource projects International Journal Electronic Commerce 6 3 25–39 Hayes J 2010 theory practice change management 3rd ed New York Palgrave Macmillan Herzberg F 1959 motivation work New York John Wiley Sons Higgs Rowland 2011 Take Implement Change Successfully Study Behaviors Successful Change Leaders Journal Applied Behavioral Science 47 3 309335 Hon H Bloom Crant J 2011 Overcoming Resistance Change Enhancing Creative Performance Journal Management 40 3 919941 Hongseok Labianca G MyungHo C 2006 mulitlevel model group social capital Academy Management Review 31 3 569582 Howell J Avolio B J 1993 Transformational leadership transactional leadership locus control support innovation key predictors consolidatedbusinessunit performance English Journal applied psychology 78 6 891902 Huy Q N Corley K G Kraatz 2014 Support Mutiny Shifting Legitimacy Judgments Emotional Reactions Impacting Implementation Radical Change Academy 
Management Journal 57 6 16501680 Kanter R 1994 change masters London Allen Unwin Kanter R Stein B Jick 1992 challenge organizational change companies experience leaders guide New York Free Press Kesting P Ulhøi J P 2010 Employeedriven innovation extending license foster innovation Management Decision 48 1 6584 Kirzner 1997 Entrepreneurial Discovery Competitive Market Process Austrian Approach Journal Economic Literature 35 1 6085 Kotter J P 2007 Leading Change Transformation Efforts Fail Harvard Business Review 85 1 96103 Kotter J P 2012 Leading change Boston Mass Harvard Business Review Press Lakhani K R von Hippel E 2003 open source works free usertouser assistance Research Policy 32 2003 923943 Lakhani K R Wolf R G 2005 hackers Understanding motivation efforts freeopen source projects Hissam B Fitzgerald J Feller K R Lakhani Eds Perspectives free open source pp 321 Cambridge MIT Press Lawrence P R Lorsch J W 1967 Organization environment Managing differentiation integration Cambridge Harvard University Press Lee G K Cole R E 2003 firmbased communitybased model knowledge creation case Linux Kernel Development Organization Science 14 6 633649 Lerner J Tirole J 2002 Simple Economics Open Source Journal Industrial Economics 50 2 197234 Lewin K 1951 Field theory social science selected theoretical papers 1st ed New York Harper Liebhart GarciaLorenzo L 2010 planned emergent change decision maker’s perceptions managing change organisations International Journal Knowledge Culture Change Management 10 5 214225 Locke K 2001 Grounded theory management research London Sage Publications Manz C C 1986 Selfleadership toward expanded theory selfinfluence processes organizations Academy Management Review 11 585600 Markus L 2007 governance freeopen source projects Monolithic multidimensional configurational Journal Management Governance 11 2 151163 Markus L Benjamin R 1996 Change agentry next frontier MIS Quarterly 20 4 385407 MartinezTorres R DiazFernandez C 2014 Current issues 
research trends opensource communities Technology Analysis Strategic Management 26 1 5568 McAllister J 1995 Affect cognition based trust foundations interpersonal cooperation organizations Academy Management Journal 38 1 2459 Mehra Smith B Dixon Robertson B 2006 Distributed leadership teams network leadership perceptions team performance Leadership Quarterly 17 232–245 Miles B Huberman 1984 Qualitative data analysis sourcebook new methods Beverly Hills CA Sage Publications Mintzberg H 1994 rise fall strategic planning New York NY Free Press Mintzberg H Waters J 1985 strategies deliberate emergent Strategic Management Journal 6 3 257273 Mockus Fielding R Herbsleb J 2002 Two case studies open source development Apache Mozilla ACM Transactions Engineering Methodology 11 3 309–346 Mooney J Reiley C 1939 principles organization New York Harper Brothers Moran J W Brightman B K 2001 Leading organizational change Career Development International 6 2 111118 Mumford Scott G Gaddis B Strange J 2002 Leading creative people Orchestrating expertise relationships Leadership Quarterly 13 6 705 Nelson R R Winter G 1982 evolutionary theory economic change Cambridge Mass Belknap Press Harvard University Press O’Mahony Ferraro F 2007 emergence governance open source community Academy Management Journal 50 5 10791106 Patton Q 2002 Qualitative research evaluation methods 3rd ed Thousand Oaks CA Sage Publications Petkova P Rindova V P Gupta K 2013 news bad news sensegiving activities media attention venture capital funding new technology organizations Organization Science 24 3 865888 Powell W W 1990 Neither market hierarchy network forms organization Research Organizational Behavior 12 295336 Ryan R Deci E L 1985 Intrinsic extrinsic motivations Classic definitions new directions Contemporary Educational Psychology 25 5467 Scacchi W 2002 Understanding requirements developing open source systems IEE ProceedingsSoftware 149 1 2439 Schumpeter J 1934 theory economic development inquiry 
profits capital credit interest business cycle Cambridge Mass Harvard University Press Scott W R 1981 Organizations rational natural open systems Englewood Cliffs NJ Prentice Hall Sharma Sugumaran V Rajagopalan B 2002 framework creating hybridopen source communities Information Systems Journal 12 725 Somech 2006 Effects Leadership Style Team Process Performance Innovation Functionally Heterogeneous Teams Journal Management 32 1 132157 Strauss Corbin J 1998 Basics qualitative research techniques procedures developing grounded theory 2nd edition ed London SAGE Publications Teece J Pisano G Shuen 1997 Dynamic capabilities strategic management Strategic Management Journal 18 7 509533 Tushman L Anderson P 1986 Technological Discontinuities Organizational Environments Administrative Science Quarterly 31 3 439466 Twain 1876 adventures Tom Sawyer Toronto Belford Bros Ulrich 1997 Human resource champions Cambridge Harvard University Press Van De Ven H Poole 1995 Explaining development change organizations Academy Management Review 20 3 510540 Volberda H W Van Den Bosch F J Mihalache R 2014 Advancing Management Innovation Synthesizing Processes Levels Analysis Change Agents Organization Studies 35 9 12451264 von Hippel E von Krogh G 2003 Open Source PrivateCollective Innovation Model Issues Organization Science Organization Science 14 2 209223 Vujovic Ulhøi J P 2008 Online innovation case open source development European Journal Innovation Management 11 1 142156 Waddell Sohal 1998 Resistance constructive tool change management Management Decision 36 78 543 Wylie N Sturdy Wright C 2014 Change agency occupational context lessons HRM Human Resource Management Journal 24 1 95110 Yates 2000 Developing leaders global landscape J Giber L Carter Goldsmith Eds Linkage Incs best practices leadership development handbook Case studies instruments training 1st ed San Francisco CA JosseyBassPfeiffer Yin R K 1994 Case study research design methods 2nd ed Thousand Oaks CA Sage Publications 
Biographies Sladjana Nørskov External Lecturer Department Management Aarhus University received PhD Aarhus School Business research interests include organizational development usercentered innovation processes community governance new organizational forms Peter Kesting Associate Professor Management Aarhus University Denmark research interests primarily concern innovation management cognitive conceptual foundations routine decisionmaking negotiations life work Joseph Schumpeter John Parm Ulhøi Professor Organization Management Theory Aarhus University research interests include organisational development new forms organising human social capital innovation entrepreneurship years served TIMDivision Board Member Academy Management Editorial Board member various journals served member various International Expert Boards example DirectorateGeneral Research European Commission Israel Science Foundation European Science Foundation Belgian Office Scientific Technical Cultural Affairs Research Council Norway Figure 1 growth TYPO3 depicted number registered developers references extensions 20032005 Source httptypo3com Footnote 1 graph shows number registered developers 2003 2005 Unfortunately reliable statistics ensuing years could obtained Figure 2 Model moderators change initiatives OSS communities Table 1 Topics discussed RD Committee’s mailing list Number Governancerelated postings Technical postings Sum 201 21 13 235 855 90 55 100 Table 2 Data sources Data source Description Purpose Time Mailing list 235 postings RD Committee mailing list Insight contributions role Committee member indepth understanding organizational 
tasks issues addressed 2006 Mailing list II 1088 postings HCI Team mailing list Understanding organizational developments within HCI Usability Team Related particular change initiative 20062009 Mailing list III 1191 postings selected relevance total 13587 postings Core Team mailing list Understanding interactions core periphery interactions developed time Actions reactions related identified change processes 20062008 Interviews 11 interviews 1 interview founder 1 interview community manager 9 interviews 9 Core Team members 7 also members RD Committee Understanding community history development change TYPO3 Managing change TYPO3 followup specific developments change initiatives 20062010 Observation 18 hours twoday RD Committee facetoface meeting Insight issues regularly addressed RD Committee observations revealed range organizational issues 2006 Archival documentation description bylaws videos conferences meetings summaries meetings news Learning formal regulations structures community Crosschecking facts uncovered observation activities interviews 20062010 Table 3 four change initiatives Change initiative Components change initiative Rationale behind changes Change agent Outcome Reorganization product development New work processes Feedback Gate keeping Closer interactions Release management Motivate contributors via feedback gate keeping closer interactions expected act rewards retention mechanisms Release management improved setting strict development phases Core member Successfully implemented Founding nonprofit organization called TYPO3 Association Create committee structure similar functional structure Support core development steadier basis Improve efficiency providing central hub support active developers well concentrate members pool regular contributors founder Successfully implemented New team structure Establishing Team Contracts team Implement transparent structure clear responsibilities increased team autonomy elaborate structure Ensure responsibility 
accountability task role founder Unsuccessful Installing usability mindset Usability mindset Changing mindset developers Bringing developers designers together Create team would work increase usability TYPO3 system Developers usually lack user perspective Designers needed create userfriendly Periphery member Successfully implemented Individual initiative Persistence need extremely enthusiastic afraid setbacks experience many take long time make changes happen Interview core member Leading example creating credibility merit community gain followers change initiative didn’t work couldn’t motivate persons follow guidance changes created would say 200 mockups 10 percent realized TYPO3 today Interview change agent need prove skills able assess solutions Interview change agent Reputation reputation lending Endorsement highstatus members change agents also realized change agent’s name– one active participants – continuously working lot TYPO3 HCI Topics … New Installer 20 Backend interface improvements TYPO3 42 TemplaVoilá 2 together name Starting work Extension Manager 2 name finally change agent’s name also active member TYPO3org redesign group Core Team mailing list core member Redirecting attention work efforts towards initiative Could tell us bit Maybe developer list Answer HCI Please continue discussion … resend mail HCI list please feel like want continue discussion Core Team mailing list Proactive recognition support initiatives highstatus members It’s keeping big overview picking cherries dynamic system never idea sudden … It’s mostly things already way Interview core member work mostly things going try find little suggestions ask someone else “What think idea anything add that” … It’s mostly already ongoing projects community manager see okay guy working guy working try connect Interview community manager Changeoriented communication Inform educate community rationale arguments behind initiatives breakthrough presentation 50 guy called name presentation spirit 
community changed saw really possible … Interview change agent watched HCI podcast really impressed get proud flexible product userfriendly product well ‘outsider’ HCI team produced two random thoughts would like share … viewing presentation overwhelmed thinking would mean achieve really get consistent look field would require rewriting lot code adapting tons extensions things like installer might easier since better modularized achieve major changes strongly feel would best focus 50 development HCI mailing list developer Motivation challenging tasks Novel task structure content exciting developers use framework powerful new many functions already inside using framework could use lot things box could never pluck old system Interview core member Freedom work new ways removing everything replacing totally new components whole frame page tree really going bring something totally new coding driven huge set features Every one us coding past position coding extensions customer … create new menu items never possible past … really point freedom drop compatibility quite helpful go fast forward say ok let’s delete everything create new Interview core developer
::::
Make Breaking Changes Policies Practices 18 Open Source Ecosystems CHRIS BOGART CHRISTIAN KÄSTNER JAMES HERBSLEB Carnegie Mellon University USA FERDIAN THUNG Singapore Management University Singapore Open source projects often rely package management systems help projects discover incorporate maintain dependencies packages maintained people systems save great deal effort ad hoc ways advertising packaging transmitting useful libraries coordination among teams still needed one package makes breaking change affecting packages Ecosystems differ approaches breaking changes general theory explain relationships features behavioral norms ecosystem outcomes motivating values address two empirical studies interview case study contrast Eclipse NPM CRAN demonstrating different norms coordination breaking changes shift costs using maintaining among stakeholders appropriate ecosystem’s mission second study combine survey repository mining document analysis broaden systematize observations across 18 ecosystems find ecosystems share values stability compatibility differ values Ecosystems’ practices often support espoused values surprisingly diverse ways data provides counterevidence easy generalizations ecosystem communities CCS Concepts • engineering → Collaboration development development process management libraries repositories • Humancentered computing → Empirical studies collaborative social computing Additional Key Words Phrases ecosystems dependency management semantic versioning collaboration qualitative research ACM Reference format Chris Bogart Christian Kästner James Herbsleb Ferdian Thung 2021 Make Breaking Changes Policies Practices 18 Open Source Ecosystems ACM Trans Softw Eng Methodol 30 4 Article 42 July 2021 56 pages httpsdoiorg1011453447245 work supported NSF awards 1901311 1546393 1302522 1322278 0943168 1318808 1633083 1552944 Science Security Lablet H9823014C0140 US Department Defense Systems Engineering Research Center grant Alfred P Sloan Foundation 
Authors’ addresses C Bogart C Kästner J Herbsleb Carnegie Mellon University Institute Research TCS Hall 430 4665 Forbes Avenue Pittsburgh PA 15213 emails cbogart ckaestner jherbslebcscmuedu F Thung Singapore Management University School Computing Information Systems 80 Stamford Road Singapore 178902 email ferdiant2013smuedusg © 2021 Association Computing Machinery 1 INTRODUCTION ecosystems communities built around shared programming languages shared platforms shared dependency management tools allow developers create packages import build others’ functionality ecosystems become important paradigm organizing open source development maintaining reusing code packages Development within ecosystems efficient sense common functionalities need developed maintained tested single team instead many authors reimplementing functionality Coordination major challenge ecosystems since packages tend highly interdependent yet independently maintained 2 3 6 21 55 68 least ecosystems JavaScript transitive dependency networks growing rapidly 46 Improvements maintainer makes shared package may affect many users package example incorporating new features making APIs simpler improving maintainability 10 actions may require rework developers whose depends package Package users may invest regular rework keep changes collaborate upstream projects minimize impact changes decline update latest versions risk missing bug fixes security updates replicate functionality avoid dependencies first place 6 17 19 72 Package maintainers turn 
many ways reduce burden users example refrain performing changes announce clearly label breaking changes help users migrate old new versions 6 36 65 67 Many different practices contribute managing change adopting various practices shift cost form effort among different classes ecosystem participants maintainers package users endusers eg Reference 28 much known individual practices managing change yet understand practices occur wild combine establish full design space practices Managing change takes time effort upstream downstream developers depending community’s practices cost may distributed differently However fully understand distributions costs result various practices practices related ecosystem culture technologies important research perspective acquire understanding ecosystem coordination mechanisms also practitioners sponsors may need tune distribution costs accommodate changing conditions example ecosystem accumulates large rapidly growing base applications use particular packages community may wish adopt practices increase stability packages avoid imposing costs change large growing base users practices could accomplish set practices likely compatible adopting ecosystem’s culture values perform two studies address questions like First conducted multiple case study Study 1 three open source ecosystems different philosophies toward change Eclipse RCRAN Nodejsnpm studied developers plan manage coordinate change within ecosystem changerelated costs allocated developers influenced influence changerelated expectations policies tools ecosystem ecosystem studied public policies policy discussions interviewed developers expectations communication decisionmaking regarding changes found developers employ wide variety practices shift delay costs change within ecosystem Expectations handle change differ substantially among three ecosystems influence costbenefit tradeoffs among develop packages used others call upstream developers developerusers packages call 
downstream developers endusers argue differences arise different values community reinforced peer pressure policies tooling example longterm stability central value Eclipse community achieved “prime directive” practice never permitting breaking changes practice imposes costs upstream developers may accept substantial opportunity costs technical debt avoid breaking client code contrast Nodejsnpm community values ease simplicity upstream developers technical infrastructure breaking changes accepted signaled clearly version numbering second study builds expands scope first investigating prevalence practices attitudes toward ecosystems values Study 1 larger set 18 ecosystems combine several methods accomplish including data mining repositories identify practices leave visible traces document analysis identify policylevel practices stated explicitly largescale survey ask developers many practices well importance various values within ecosystem Study 2 find practices values indeed often cohesive within ecosystem diverse across different ecosystems also find even ecosystems share similar values often achieve different ways sometimes fail achieve promoting practices never widely adopted work well Together results provide map distribution values practices across ecosystems allow us examine relationships values practices Beyond findings make full anonymized results available research community hopes useful future studies example providing basis selecting cases particular combinations practices values work builds extends previously published conference paper 6 including much material Section 4 data available archived dataset 7 well interactive web page1 contributions include description breaking changerelated values practices three ecosystems taxonomy values practices mapping values practices across 18 ecosystems derived survey data mining policy analysis
::::
2 CONCEPTS DEFINITIONS ecosystems study define ecosystems communities built around shared programming languages shared platforms shared dependency management tools allowing developers create packages import build others’ functionality line definitions Lungu 50 Jansen Cusumano 43 focus “collections projects developed coevolve together environment” 50 p 27 interdependent independently developed packages generally share technology platform set standards 43 ecosystems typically center means package version often host artifacts manage dependencies among 1 47 51 61 74 Note term “software ecosystem” overloaded used different definitions different lines research 52 including ones focus commercial platforms enhanced thirdparty contributions 40 56 81 83 focus especially opensource communities developing interdependent libraries eg Maven npm CPAN rather centralized platforms usually independent extensions provide single application build eg Photoshop plugins Android apps also exclude ecosystems repackage projects dependencies deployment eg Debian packages homebrew often managed independent volunteers rather original developers Footnote 1 httpbreakingapisorg Breaking changes many relevant development concerns maintaining interdependent artifacts community focus coordination issue deciding whether perform breaking changes downstream developers respond article define breaking change change package would cause fault dependent package blindly adopt change thus include cases change API would cause downstream package fail compile also cases program behavior would change leading incorrect results unacceptable performance examine breakingchange related practices quite broadly including reactions actual breaking changes practices meant signal mitigate prevent breaking changes Maintaining dependencies updating one’s code react breaking changes significant cost driver using otherwise free opensource dependencies Breaking changes common practice 3 5 6 14 22 29 39 44 48 53 54 66–68 89 90 example 
Decan et al 22 found 5 package updates CRAN backward incompatible causing 41 errors released dependent packages Xavier et al 90 report 28 releases frequently used Java libraries break backward compatibility rate breaking changes increasing time Information hiding 63 centralized change control 29 73 change impact analysis 8 84 guide decision making cannot entirely prevent need breaking changes practice given largescale open distributed nature ecosystems 6 59 62 76 90 Package managers structure problem make dependencies versions explicit 3 47 51 practices like semantic versioning assign semantics version numbers eg breaking vs nonbreaking changes 65 67 help manage change prevent problem support decision making perform breaking changes Values practices “why” “how” managing breaking changes ecosystems values practices Shared values—judgments important preferred—can explain developers make similar decisions Values studied societal scale psychology 4 ethics 16 related fields 12 37 eg education influences personal value systems however values influence practices studied mostly narrow contexts engineering Pham et al studied testing culture 64 MurphyHill et al found creativity communication nonengineers valued game developers application developers resulting less testing architecture practices game development 58 use concept values analyze common shared beliefs important ecosystem focus changerelated issues practices refer broadly activities developers engage primarily focus managing change Practices may include specific release strategies deciding perform changes mitigating impact changes documenting migration paths reaching developers monitoring changes dependencies deciding whether update dependencies many 6 ecosystems practices may encouraged mandated policies example npm Eclipse mandate use semantic versioning documentation may supported even enforced tools example Eclipse community’s API Tools detect even subtle breaking changes CRAN runs automated checks enforce 
coding standards resolve incompatibility issues 6 simplicity use term practice broadly including policies tools Governance open source ecosystems covers communitywide decisions eg integrate thirdparty contributions 11 model decision making generally appropriate 45 60 open ecosystem 85 people different roles allowed participate 86 governance research discusses need evolvability stability organization 83 research focuses general market mechanisms process documentation conformance 41 45 technical steps engineer might take
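Semantic versioning, mentioned above as a practice that assigns meaning to version numbers (breaking vs nonbreaking changes), can be illustrated with a minimal Python sketch. The function names are hypothetical and not taken from any package manager, and versions are assumed to be plain `major.minor.patch` triples without pre-release tags.

```python
from typing import Tuple


def parse_semver(text: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify the step from `old` to `new` as major, minor, patch, or none."""
    o, n = parse_semver(old), parse_semver(new)
    if n <= o:
        return "none"  # same version or a downgrade
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


def signals_breaking_change(old: str, new: str) -> bool:
    """Under semantic versioning only a major bump announces a breaking change.

    This reflects what the version number promises, not what the code actually does.
    """
    return bump_kind(old, new) == "major"
```

A downstream developer could use such a check to decide which updates to adopt automatically, keeping in mind that a minor or patch release may still break clients in practice.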
::::
3 METHODS 31 Research Design stated introduction goal research create highlevel map values practices relating breaking change across many ecosystems approached question exploratory sequential mixedmethods design 15 beginning qualitative preliminary case study first understand community deals prevents breaking changes deal way first study takes constructivist view focusing problem breaking changes look perspective participants asking approach collaboration problem way use inform second primarily quantitative study second study intended specifically confirm findings generalize although confirmatory check Section 51 rather broad look see generalizes pattern combinations values practices see larger landscape outside three case study ecosystems Study 2 casts broad net cost depth asking highlevel questions many communities however recognize call research particular practices values ecosystems followed depth bringing resources bear focused questions Study 2 shows simple relationship practices values—we found communities often act value different ways 32 Study 1 Interview Case Study first look ecosystem practices performed multiple case study interviewing 28 developers three ecosystems Case studies appropriate investigating “how” “why” questions current phenomena 92 selected three contrasting cases aim theoretical replication 92 means investigate proposition phenomena differ across contrasting cases predictable reasons Eclipse Nodejsnpm served cases contrast sharply approach change Eclipse interfaces changed decade Nodejsnpm relatively new fastmoving platform expected Eclipse’s policies tools might impose costs developers way encouraged act consistently ecosystem’s values stability RCRAN ecosystem serves useful third theoretical replication since policy favors compatibility among latest versions packages Eclipse’s longterm compatibility past versions addition CRAN acts gatekeeper centralized repository contrast npm’s intentionally low hurdles contributions began mining 
lists packages dependency relationships three ecosystems assembled database packages dependency relationships version change histories npm repository metadata retrieved https://registry.npmjs.org json format CRAN repositories scraping metadata web pages starting http://cran.r-project.org/web/packages/available_packages_by_name.html git repositories Eclipse https://git.eclipse.org/c

Table 1 Interviewees R2 N4 Pairs Close Collaborators Identified R2a R2b N4a N4b
Code | Case | Field | Occupation
E1 | Eclipse | Programming tools/HCI | University
E2 | Eclipse | Soft Eng/CS Education | University
E3 | Eclipse | Soft Eng/Research | University
E4 | Eclipse | CS Education | University
E5 | Eclipse | engineering | Retired
E6 | Eclipse | engineering | Industry
E7 | Eclipse | Eclipse infrastructure | Industry
E8 | Eclipse | engineering | Industry
E9 | Eclipse | engineering | Industry
R1 | CRAN | Soil science | Government
R2ab | CRAN | Statistics | University
R3 | CRAN | Medical imaging | University
R4 | CRAN | Genetics | University
R5 | CRAN | Soil science | University
R6 | CRAN | Web apps | Industry
R7 | CRAN | Data analysis | Industry
R8 | CRAN | R infrastructure | Industry
R9 | CRAN | R infrastructure | Industry
R10 | CRAN | R infrastructure | University
N1 | NPM | Telephony | Industry
N2 | NPM | Tools API dev | Industry
N3 | NPM | Web framework | Startup
N4ab | NPM | Web framework | Startup
N5 | NPM | Cognitive Science | University
N6 | NPM | Database Node infrastr | Startup
N7 | NPM | Database Node infrastr | Industry

owned packages upstream downstream dependencies pursued two complementary recruitment strategies interviews find package maintainers would recent relevant insight managing dependencies sides dependency relationship used mined repository datasets identify packages least two downstream dependencies two upstream dependencies focal package least one upstream dependencies version update year interview 2015² emailed random sample packages’ owners choosing random package list mentioned small batches handwriting emails authors using emails details supplied npm CRAN repositories Eclipse commit logs set interviews people responded also interviewed three
developers colleagues knew personally contacted 92 people conducted 26 interviews interviews focused personal practices experiences managing upstream downstream dependencies 20 interviews hearing similar ideas new interviewee recognized need deeper experience ecosystemwide origins impacts ecosystem’s policies decided additionally interview individuals role current historical development ecosystem’s tools policies individuals fewer demands time attempted find key people ecosystem thus recruited 8 additional developers asking questions also adding questions ecosystem’s history policy values

²The code implementing filtering available httpsgithubcomcbogartdepalyzeblob1d867cc92d7a5f18274358ae02574915026a30d5depalyzeversionhistorypyL354

28 interviewees active developers multiple years experience background ranged university research startup companies Table 1 gives overview conducted semistructured phone interviews lasted 30–60 minutes generally followed interview script shown Appendix tailored questions toward interviewees’ personal experiences interviewees’ consent recorded interviews keeping constructivist approach first study analyzed interviews using Thematic Analysis 9 transcribed recordings tentatively coded transcripts looking interesting themes using Dedoose 23 iteratively discussed redefined recoded Codes emerged first round included labels “expectations towards change” “communication channels” “opportunity costs backward compatibility” “monitoring” combined redundant codes eliminated ones recur address research questions grouped remainder seven highlevel themes “Change planning reasons changes” “change planning costs changer” “Change planning Technical means practices” “Change planning reasoning cost tradeoffs” “Coping change” “Communication” “Ecosystemwide policy technology” Next gathered tagged quotes highlevel category two researchers checked agreed lowlevel tags quote category revising disagreements discussion Thematic analysis claim find reproducible
phenomena within interviews example attempt compute interrater reliability since make claim two researchers trained reliably identify exactly utterances interviewees examples “expectations towards change” exhaustively identified instances expectation among interviewees apply statistics qualitative results attach much importance counts purpose interviews thematic analysis discover broad categories attitudes strategies towards change interviewees experienced illustrative examples typical practices motivations constitute strategies complement interviews explored policies public discussions meeting minutes tools ecosystem analysis distinguish decisions made roles upstream downstream developer depicted Figure 1 Validity check validate findings case study adapted Dagenais Robillard’s methodology 18 check fit applicability defined Corbin Strauss 13 p 305 presented interviewees summary full draft Sections 42–43 along questions prompting look correctness areas agreement disagreement ie fit insights gained reading experiences developers platforms ie applicability Six interviewees responded comments results six indicated general agreement eg R5 “It brings structure coherence issues loosely aware rarely centre focus everyday work” corrections included small factual errors eg number CRAN packages increased since initial writeup 14000 suggestions ways sharpen analysis eg R7 noted CRAN’s policy contact downstream developers apply majority users outside CRAN incorporated feedback consistent recheck data added clarifications otherwise 33 Study2 conducted systematic mapping values practices broad sample ecosystems primarily making use survey large number diversity practices Tables 4 5 6 could measure one methodology asked large subset survey eg research dependencies using bottom section Table 6 also analyzed documentation policies identify practices enacted ecosystemwide organizations tools eg Ecosystemwide synchronized release Table 4 finally mined Github repositories librariesio 
package metadata dataset practices leave visible traces eg “Continue critical updates older versions” Table 5 55 practices identify 19 attempt measure Study 2 eg socially connected developers following Twitter going conferences top section Table 6 First describe survey methods subsequent subsections describe policy analysis Section 335 data mining Section 336 methods 331 Ecosystems solicited survey participants ecosystems dependency network structure packages depend packages standardized infrastructure helps sharing compatibility started list repositories Wikipedia’s “Software Repository” page added additional ecosystems active community could find excluded ecosystems flat structure packages depend single shared platform eg Android ecosystems obviously small hope get least dozen responses also excluded ecosystems different enough possible write clear questions would apply across ecosystems excluded example operatingsystemlevel package managers apt rpm brew scientific workflow engines conducted survey 31 ecosystems analysis somewhat arbitrarily set minimum number participants ecosystem 15 feeling would give us reasonable claim breadth responses led us exclude 13 ecosystems CBoost Bower Perl 6 Smalltalk TexCTAN Julia Clojureclojars Meteor Wordpress SwiftPM PHP’s PEAR Racket Dartpub leaving us 18 ecosystems analysis shown Table 2 2 40 complete responses 332 Survey Goals Recruitment survey consisted 108 questions seven long free text questions marked optional opportunities clarification three short text questions ecosystem package name gender rest multiplechoice scales informed consent screen participants first asked choose ecosystem published used package could choose list type another grouped rare answers “other” analysis 333 Recruitment invested significant outreach activities recruit participants survey First created web page Twitter account describe state current research area form easily accessible practitioners encouraged readers web page take survey contribute 
additional knowledge values ecosystems Second attended community events including npmcamp 2016 talk developers community leaders multiple ecosystems research result several prominent community members tweeted web page survey resulting surges responses CRAN npm particularly Third promoted web page survey ecosystemspecific forums mailing lists “developers write packages” hoping web page would spark interest topic also posted Twitter hashtags appropriate different ecosystems Finally 21 ecosystems outreach activity yield sufficient answers solicited individuals directly email sent 8137 emails package authors sampled authors packages culled librariesio targeted ecosystems Participants demographics succeeded recruiting 2321 participants partially fully complete survey August November 2016 number 932 completed survey however put value questions near beginning 1466 answers questions Statistical analysis answers early questions reveal systematic differences people completed survey mean difference answers 65 Likertscale questions respondents completed survey 3httpsbreakingapisorg 013 scale points 4 5 depending question maximum difference 83 scale points maximum difference among questions one “incomplete” respondent answered 54 Likertscale points Since partial responses similar full responses include data incomplete responses correct careless responses people appeared answering many questions without careful consideration excluded “careless” sections person’s response rated items exactly performed test eight sections survey number excluded blocks ranged 11 set upstream practices 76 set downstream practices people excluded one block responses questions appear outliers mean difference answers 65 Likertscale questions respondents excluded block respondents 015 scale points 4 5 depending question maximum difference 50 question “How important think following values community stability” answers similar questions exclude entire people apparently careless eight blocks Table 2 shows 
participation ecosystem Participants averaged 88 years development experience 72 years open source 46 ecosystem answered Slightly half 59 college degrees CS frequently claimed role ecosystem package lead developer 59 Others ranged 85 claimed role founding core team ecosystem 11 drew ecosystem packages projects average age 33 152 18–24yearolds 6 65 gave gender 959 identified male 32 female 08 gave another gender demographic proportions quite similar contemporaneous Github community survey 31 334 Survey Design goal survey investigate prevalence values practices across many ecosystems feasible asked larger number questions typical survey sort Long surveys often reduced completion rates however mitigated keeping questions diverse hopefully interesting participants putting questions interested front result got reasonably high completion rate 40 partial completion rate 62 value questions beginning considering length survey resulting encouragingly rich deep dataset article focus describing values practices responses additional data available accompanying data release 7 Values explore complete list possible values relevant managing change began values derived interviews Study 1 searched web pages candidate ecosystems clues potential values example “fun” mentioned explicit value Ruby community interview Ruby founder Matsumoto said “That primary goal designing Ruby want fun programming myself” 82 Note values initially seem directly related breaking change included thought could indirectly influence breaking change practices example expected perhaps practices efficient less rewarding carry “fun”valuing ecosystem might avoid assembled list 11 values following descriptions Stability Backward compatibility allowing seamless updates “do break existing clients” Innovation Innovation fast potentially disruptive changes • Replicability Longterm archival current historic versions guaranteed integrity exact behavior code replicated • Compatibility Protecting downstream developers 
endusers struggling find compatible set versions different packages • Rapid Access Getting package changes endusers quickly release “no delays” • Quality Providing packages high quality eg secure correct • Commerce Helping professionals build commercial • Community Collaboration communication among developers • Openness Fairness Ensuring everyone community say decisionmaking community’s direction • Curation Selecting set consistent compatible packages cover users’ needs • Fun personal growth Providing good experience package developers users survey asked participants perceived values community—“How important think following values community” used sevenpoint rating scale adapted Schwartz’s value study 71 “extremely important” “very important” “important” “somewhat important” “not important” “community opposes value” “I don’t know” first five options separated visually last two make clear former designed approximate regular intervals recommended Dillman et al 27 addition asked participants similar value question scale values respect single package worked ecosystem encourage participants think concrete work asked name specific package worked used package question “How important values development personally” Recognizing despite taking values multiple sources may captured values relevant managing change asked survey participants openended question values important ecosystem answers summarized Section 52 Practices practices part survey asked many softwareengineering practices many mention throughout analysis Tables 4 5 6 full list exact phrasing questions found Appendix B Surveyed practices encompassed participant’s personal practices experiences respect documentation support timing version numbering releases selecting packages depend monitoring dependencies changes asked appropriate either agreement Likert scale frequency scale “never” “several times day” subset 15 questions relating communication developers downstream packages skipped participants indicated maintain 
package used others limit length survey focused primarily questions cannot answered difficult answer mining repositories reading explicit policy documents see “M” “P” labels Tables 4 5 6 Study 2 Methods column Survey analysis 483 participants 21 gave answer least one seven optional freeresponse questions 11 people gave answers seven used grounded approach analyze answers question values one researcher performed open coding identify set candidate codes two researchers iteratively combined revised achieve consensus set codes apply responses Layout Figures Figures 2 3 4 drawn eliminating skipped “don’t know” values merging “Not important” “opposed value” answers drawing violin plot diamond symbol mean position violin bodies smoothed image portrays mean rough distribution Table 10 wanted derive ranking importance values ecosystem provide indication consensus around ranking method adopted calculates highest ranked values ecosystem identifying person ecosystem highest rating 11 values incrementing count values person assigned highest rating effect counting number people ranked value highest accounting ties table lists values three highest counts consensus numbers described caption 335 Policy Analysis Method examined ecosystem’s online presence summarized sanctioned practices Practices ecosystems derived documentation pages within language’s repository’s websites specifically seeking documentation define package submit repository documents typically communicate policies authors clear actionable way columns table defined follows Dependencies outside repository Standard tools two ecosystems Stackage LuaRocks allow developers additionally specify packages part standard repository example reference GitHub repository alternate specialized site checked documentation package manager’s syntax declare dependencies see way specify URL package formally repository marked feature could specified directly URL “alternate repo” could accomplished alternate repository custom server mimics 
repository’s API Central Repository captures whether ecosystem supplies packages central repository simply provides index authorhosted download sites Access dependency versions denotes whether ecosystem documentation recommends examples documentation page packages refer dependencies version number simply assume latest version dependency desired RCRAN Go two cases Stackage Bioconductor set mutually compatible versions provided used together set Gatekeeping Standards Ecosystem repositories vary amount vetting packages include determined looking submission requirements packages open circle table means cursory metadata name package list dependencies required closed circle means platform tools volunteers perform deeper investigation package vetting submitter automated manual tests package packages depend virus checks Two marked “staged releases” submissions tested collectively along cohort packages released simultaneously Synced Ecosystem simply denotes whether ecosystem packages important subset released regular synchronized schedule
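The value-ranking method described for Table 10 in the survey analysis above (for each person, find the highest rating they gave, then credit every value that received that rating, so ties count for all tied values) can be sketched in Python. This is an illustration rather than the authors' analysis script: the function name is hypothetical, and ratings are assumed to be encoded as integers where a larger number means more important, with skipped and "don't know" answers simply left out of each respondent's dictionary.

```python
from collections import Counter
from typing import Dict, List


def top_value_counts(responses: List[Dict[str, int]]) -> Counter:
    """Count, per value, how many respondents gave it their highest rating.

    Each response maps a value name to a numeric rating (assumed encoding).
    Ties are handled by crediting every value that shares the top rating.
    """
    counts: Counter = Counter()
    for ratings in responses:
        if not ratings:
            continue  # respondent answered none of the value questions
        best = max(ratings.values())
        for value, score in ratings.items():
            if score == best:
                counts[value] += 1
    return counts


# Three hypothetical respondents from one ecosystem.
example = [
    {"stability": 4, "innovation": 2, "fun": 4},  # tie: stability and fun
    {"stability": 3, "innovation": 3, "fun": 1},  # tie: stability and innovation
    {"stability": 4, "innovation": 1, "fun": 2},  # stability alone
]
ranking = top_value_counts(example)
```

Calling `ranking.most_common(3)` would then list the three values with the highest counts for that ecosystem, which is the form the table reports.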
::::
336 Data Mining mined data two sources capture data prevalence seven additional practices First list packages query derived librariesio librariesiodata crossecosystem package index Librariesio lists versions release dates dependencies version constraints source repositories available subset 18 ecosystems Atom RCRAN PerlCPAN RubyRubygems RustCargo PythonPypi NuGet Maven PHPPackagist NodejsNPM Erlang ElixirHex Partial information available CocoaPods

4 Recommendations evolved since 2016 Go see https://blog.gopheracademy.com/advent-2016/saga-go-dependency-management

Table 3 Ecosystem Statistics
Ecosystem | Founded | Num Pkgs | Avg deps | 3 deps | 0 deps
Atom plugins | 2014 | 4424 | 12 | 100 | 382
CocoaPods | 2001 | 14493 | 04 | 17 | 211
Eclipse plugins | 2001 | 14954 | 64 | 557 | 100
ErlangElixirHex | 2013 | 1304 | 10 | 53 | 505
Go | 2013 | 76632 | 106 | 571 | 883
Haskell CabalHackage | 2003 | 8593 | 64 | 579 | 916
Haskell StackStackage | 2012 | 1337 | 83 | 650 | 939
LuaLuarocks | 2007 | 966 | 08 | 57 | 347
Maven | 2002 | 114404 | 21 | 206 | 418
NodejsNPM | 2010 | 229202 | 56 | 498 | 812
NuGet | 2010 | 66486 | 16 | 114 | 583
PerlCPAN | 1995 | 31641 | 76 | 565 | 796
PythonPyPi | 2002 | 65622 | 02 | 20 | 81
PHPPackagist | 2012 | 63860 | 31 | 281 | 827
RBioconductor | 2001 | 1104 | 49 | 489 | 742
RCRAN | 1997 | 7922 | 29 | 279 | 867
RustCargo | 2014 | 3727 | 21 | 201 | 715

Packagedependency founding year data ecosystems Pkgs number packages repository checked January 2016 Avg deps average number dependencies sampled packages 3 deps percentage packages three dependencies 0 deps percentage dependencies Hackage dependencies Dependency counts Bioconductor Hackage Stackage Lua Eclipse CocoaPods scraped respective repository websites find Go dependencies listed centrally repository extracted information World Code 57 massive mirror GitHub GitLab Bitbucket open source repositories indexed searchable ways make convenient data mining GitHub’s APIs allow One data product World Code provides dependencies packages parsed source code files used count Go dependencies Table 3 shows packages ecosystems interdependent widely differing degrees Beyond package counts dependencies
information packages queried packages ecosystems World Code 57 Dependency Version Constraints ran patternmatching dependency constraints packages librariesio packages released 2016 flagged package whether used particular type constraint one dependencies time year Note percentages add 100 since package may use one kind dependency constraint Exact Dependency version constrained fully specified version number 132 Min Version constraints 132 use conventions like caret npm effect eg 13 130 Range Constraints minimum maximum version like 13220 use conventions like tilde npm effect eg 132 means 13220 — Unconstrained dependency name specified version constraints either constraint blank symbol like “” used5 finegrained analysis version constraints across many ecosystems see Dietrich et al 26 Lock files Using World Code 57 examined files committed 2016 ecosystem’s packages looking references lock file specifies exact versions dependencies direct transitive ie dependencies dependencies differ ecosystem vary canonical use filenames used search shown Table 11 Appendix Including lock file enduser distribution program makes likely program run correctly since preserves exact versions dependencies program tested However developers including many dependencies projects may prefer specify exact versions transitive dependencies since may conflict means opportunity resolve conflicts perhaps locking consistent set dependencies producing release users 78 Maintaining old versions Making bug fixes outdated versions code even backporting new features helpful users cannot update cuttingedge versions reason define priorversion maintenance operationally simply release whose version number smaller expected hence sequence example sequence releases “201” “202” “153” “203” identify “153” likely bugfix backported feature introduced 201 202 introduced courtesy users currently using 152 choose upgrade 20 series Specifically measure captures percentage packages ecosystem whose version number ever 
decreased 2016 per data Librariesio Cloning measured percentage packages repository whose projects borrowed file 2016 another package building list SHA hashes files blobs associated commit ecosystem World Code 57 looking overlaps count cloned file commit incorporates blob 1 kb 2016 previously seen package ecosystem considered blobs derived packages ecosystem’s repository ones derived projects broader realm open source chose count withinrepository clones specifically since developer could tried use ecosystem’s dependency management system incorporate desired code reference chose Previous research also mapped cloning behaviors 33 49
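The operational definition of prior-version maintenance given above (a release whose version number is lower than expected in the release sequence) can be illustrated with a short Python sketch. It is not the authors' mining code: the function names are hypothetical, versions are assumed to be purely numeric and dot-separated (reading the example sequence as 2.0.1, 2.0.2, 1.5.3, 2.0.3), and each release list is assumed to be in chronological order.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.0.1' into a comparable tuple (2, 0, 1)."""
    return tuple(int(part) for part in text.split("."))


def backported_releases(releases: List[str]) -> List[str]:
    """Return releases whose version is lower than the highest version seen earlier.

    `releases` must be ordered by release date (assumption of this sketch).
    """
    flagged: List[str] = []
    highest = None
    for text in releases:
        version = parse_version(text)
        if highest is not None and version < highest:
            flagged.append(text)  # version number went down: likely a backport
        else:
            highest = version
    return flagged


def share_with_backports(histories: Dict[str, List[str]]) -> float:
    """Percentage of packages whose version number ever decreased."""
    if not histories:
        return 0.0
    hits = sum(1 for releases in histories.values() if backported_releases(releases))
    return 100.0 * hits / len(histories)
```

For the example sequence in the text, `backported_releases(["2.0.1", "2.0.2", "1.5.3", "2.0.3"])` flags only `1.5.3`. Comparing against the running maximum rather than the immediately preceding release is a design choice of this sketch; at the package level both give the same yes/no answer.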
::::
34 Threats Validity chose methods carefully answer research questions survey particular differs typical statistically focused survey technique therefore describe threats validity study presenting results readers mind read findings described Study 1 used case selection criteria 92 appropriate contrasting cases may typical ecosystems one needs careful generalizing beyond three cases results may affected selection bias developers want interviewed may different experiences Finally differences found among cases 5Note weighs heavily state packages versions released dependencies may confounded reasons selected popularity availability data Study 2 typical surveys field survey sample truly random may selection bias relating able reach via venues chose tried mitigate recruiting forums Twitter direct email survey also quite long advertised front People less patience long surveys less interest questions breaking changes values practices may selfselected could significant people impatience long surveys also different softwareengineering practices beliefs Another possible concern respondents may apply different standards ratings example expectation stability extremely high particular ecosystem participants may rate perceived importance stability lower applying stringent standard focused everyone stability similar focus stability different ecosystem might lead participants ecosystem rate importance stability higher tried mitigate requiring least 15 participants ecosystem give breadth experience behind responses tried avoid using terminology differed among ecosystems always successful example word “snapshot” means different things different ecosystems’ practices caused confusion Even term “breaking change” may interpreted differently example might define narrowly change simply would cause downstream compilation fail intended also include changes would cause wrong behavior downstream Respondents may also given answers questions influenced social desirability example may felt 
obliged say “quality” extremely important “right” answer people follow certain practices know expected mitigation approach ensuring confidentiality responses avoiding extent possible questions clear desirable undesirable responses difficulty recruiting sufficient participants smaller ecosystems Perl 6 Clojure small ecosystems may different characteristics large ones two small ecosystems Stackage Lua outliers ways exploration small ecosystems example interviews analysis artifacts priority future work
::::
4 STUDY 1 QUALITATIVE MULTIPLECASE STUDY Study 1 investigated decisionmaking involved making breaking changes practices adopt ease burden RQ11 developers make decisions whether perform breaking changes mitigate delay costs developers also wanted see developers responded breaking changes affected RQ12 developers react manage change dependencies Finally wanted know whether developers perceived tensions platform policies intended effects RQ13 platform policies tools ever unintended consequences 41 Case Overview understand identified different practices policies important understand purpose history ecosystem following provide brief description three ecosystems values informed public documentation interviews Platformlevel features practices relevant breaking change identified Table 4 411 Eclipse Eclipse foundation publishes 250 open source projects flagship Eclipse IDE created 2001 IDE built ground around plugin architecture used general purpose GUI platform plugins depend extend plugins Projects apply join Eclipse foundation incubation process practices come Eclipse management umbrella also common practice develop commercial opensource packages separately foundation publish common format thirdparty server addition “Eclipse marketplace” popular registry listing 1600 external Eclipse packages installed thirdparty servers GUI dialog Eclipse foundation coordinates “simultaneous release” Eclipse IDE year 2016 three “update releases” new features Many external developers align dates well Eclipse foundation backed corporate members IBM SAP Oracle policies biased toward backward compatibility packages eg commercial business solutions developed 10 years ago often still work current Eclipse revision without modification core value Eclipse community backward compatibility value evident many policies “API Prime Directive evolving Component API release release break existing Clients” 25 Although entirely uncontroversial explain value confirmed many interviewees 412 RCRAN 
Comprehensive R Archive Network CRAN managed distributed packages written R language since 1997 R interpreted language designed statistics R language updated approximately every six months new development snapshots available daily R multiple repositories different policies expectations including Bioconductor RForge focus CRAN largest one CRAN formally exists umbrella R Foundation sets policies CRAN contains 8000 packages 29 either required “recommended” bundled binary installs 2200 cataloged useful 33 different specializations finance medical imaging Distributing R CRAN package gives high visibility since installation CRAN automated commandline version R popular IDE RStudio 69 R CRAN used many developers without formal computerscience programming background CRAN pursues snapshot consistency newest version every package compatible newest version every package repository Older versions “archived” available repository harder install new package version submitted CRAN evaluated CRAN team’s partly automated process package must pass tests must break tests downstream packages CRAN depend without first alerting package’s authors make corresponding fixes Package owners need react changes platform upstream packages within weeks otherwise package may archived core value RCRAN community make easy endusers install uptodate packages Although explicitly represented policy documents value apparent many interviews example R10 said “CRAN primarily academic users mind want timely access current research” Table 4 Platform Communitylevel Practice Choices Platform Upstream Downstream 3 Third party Study 2 Method Policy Analysis Survey Mining Study 2 Method Practice P P Existence centralized repository directory packages P P Mechanism referring dependencies distributed outside official repositories eg via github directly P P Make historical versions package easy difficult rely P P Mechanism remove reassign unmaintained packages eg maintainers respond emails P Releasing changes fixed 
advertised schedule per package P SP Ecosystemwide synchronized release P P Repository personnel check standards submitted code making available repository P Allow multiple versionsonly one version package loaded time PU “Stability attributes” Rust saying API points change P Use nightly unstable builds get exciting new features cost compatibility downstream users P Disallow wildcard dependencies P Test compiler changes published using prevent breaking things P Constrained rules version numbering eg cargo disallowing wildcards 3 P Thirdparty curation sets useful packages compatible versions P Dynamic language feature help backward compatibility optional parameters R P Centralized testing infrastructure packages P Vulnerability tracking eg Node security platform U Private arrangement among package authors release time ecosystembyecosystem breakdown policies see Section 5 413 Nodejsnpm Nodejs runtime environment serverside JavaScript applications released initially 2009 npm default package manager npm provides tools managing packages JavaScript code online registry packages revisions npm repository contains 250000 packages rapid growth rates Nodejsnpm platform somewhat unusual characteristic multiple revisions package coexist within user use two packages require different revision third package case npm install revisions distinct places package use different implementation core value Nodejsnpm community make easy fast developers publish use packages addition community open rapid change Ease developers one principles motivating designer npm 75 Therefore npm explicitly act gatekeeper review testing requirements fact npm repository contains large number test stub packages focus convenience developers instead endusers apparent interviews 42 Study 1 Results Planning Changes RQ11 first discuss managing change perspective developer planning perform changes may affect downstream users observed similar forces concerns regarding change across three ecosystems observed 
differences community values affect ways package maintainers mitigate delay costs downstream users 421 Breaking Changes Reasons Opportunity Costs Although breaking changes APIs costly downstream users terms interruptions rework interviewees gave many reasons perform changes corresponding opportunity costs arise deciding perform change cost maintaining obsolete code working around known bugs postponing desirable new features Obvious expected reasons breaking changes included requirements context changes rippling effects upstream changes Beyond found surprisingly frequent mentions stylistic performance reasons well difficult bug fixes Technical debt Surprisingly 12 interviewees E3 E9 R1 R3 R4 R5 R6 R7 R8 N1 N7 mentioned concerns technical debt rather bugs new features rippling upstream changes trigger breaking changes technical debt refer code functionally sufficient outstanding stylistic issues developers want fix poorly chosen object models method names lack extensibility maintainability littleused longdeprecated methods conjecture reason interviewees brought kinds changes often discussion thought depth Technical debt often arises tension tools practices encourage developers preserve backward compatibility eg Eclipse’s “prime directive” versus general pressure evolution improvement Developers often postpone breaking changes technical debt becomes intolerable example E3 mentioned reason planning finally remove deprecated code “What provide old methods deprecated gets quite messy one point almost half methods deprecated” E9 similarly told us upcoming longpostponed major version change “since don’t often probably every five years let’s take advantage opportunity things would good couldn’t before” Old interfaces come seem old fashioned unattractive swiftly changing community Three interviewees said made breaking changes syntactic reasons harmonize syntax R1 improve “weird” “bad” names R3 R4 interfaces N7 talked adopting new JavaScript programming paradigm far 
attractive N7 “You can’t stay old stuff forever it’s going work drastically rewrote internals transport stream that’s sort essentially right Like it’s little stream takes logs sends places” However four interviewees E1 E5 E6 R6 talked consequences able make changes ie preserve old interfaces long periods caused opportunity costs since hindered attracting new developers lured cuttingedge things E6 example told us “If hip things get people create new APIs top order example create next graphical editing framework build efficient text editors things don’t happen Eclipse platform anymore” Efficiency Four interviewees E6 R1 R4 N1 reported cases efficiency improvements required breaking changes example N1’s package offered API requesting paged data server could provide efficiently deprecated eventually removed function rather spending money hardware Bugs Bug fixes another reason breaking changes E4 E7 R7 R9 Bug fixes break downstream packages packages depend actual broken behavior instead intended behavior lack welldefined contracts implementations makes assigning blame responsibilities difficult practice E5 told us “If someone likes broken semantics they’re going like fixed semantics” Thus even fixing obvious mistake code control single person require significant coordination among many people Throughout interviews heard many examples bug fixes effectively broke downstream packages difficulty knowing advance fixes would cause problems example R7 told us reimplementing standard string processing function finding broke code downstream users depended bugs tests caught R9 commented opportunity cost fixing bug deference downstream users’ workarounds “If downstream package implemented workaround bug fix actually breaks workaround sort fallback … pause gets nasty” 422 Dividing Delaying Change Costs previous discussion already hinted flexibility regarding bears costs breaking change instance package’s developer decide making breaking change pushing costs rework maintainers 
downstream packages making change accepting opportunity costs technical debt Even deciding make change developer faces strategic choices whether invest effort making change reduce interruption rework costs downstream users well affect timing costs paid Table 5 example documenting upgrade developer invests effort reduce effort downstream maintainers Different developers different communities different attitudes toward pay costs change show Awareness Costs Downstream Users Almost 24 28 interviewees stated possible avoid breaking changes would affect downstream users Reasons included looking users’ best interests knowing costs affected users would come back users ask help adapting change ask change reverted seek alternative packages Two interviewees E1 R4 specifically mentioned concern downstream users’ scientific research R4 “We’re improving method results might change that’s also worrying—it makes hard reproducible research” Interviewees’ concern impacts users tied size visibility user base perceived importance appropriateness usage Nine interviewees across ecosystems E4 E5 E6 R1 R4 R6 R7 R9 N7 aware users concerned specifically number users affected quantity complaints change would imply eg R9 “I wanted rename something specifically describes actually new V8 context know can’t many packages already importing new context function” N1 “we happen know paging feature … often used Node module customers” Another npm developer said N7 “…that strictly breaking change feature really didn’t want break community feature Like didn’t want 700 give ‘the code you’re using upgrade…Good luck bro” RCRAN developer said R7 “I’m cautious making changes make changes often regret Even small change package used lot people improves 90 people’s lives makes 10 people’s lives worse 1 complain package lot people” Three interviewees E1 R4 R8 noted sensitivity toward avoiding breaking changes grew experience growing user base learned feedback received earlier breaking changes course developers 
also work downstream packages Four interviewees mentioned E5 N4 N7 R6 see discussion Section 431 presumably aware impact changes make packages four developers particularly worried breaking changes Three E6 N1 N5 strong ties users felt could help individually N5 “We try avoid breaking code—but it’s easy update code” Interviewee N6 expressed “out sight mind” attitude “Unfortunately someone suffers silently know reach contact something yeah that’s bad suffering person sort tree woods falls doesn’t make sound” Finally developers described tradeoffs fixing mistakes downstream users come depend E8 talked stuck poor design “If make mistake API … sorry you’re stuck kind work around it” R9 mentioned circumstances users depended buggy behavior upstream code fixed anyway “After upgrading parser people complained script longer working problem syntax invalid begin It’s obviously fault” Techniques Mitigate Delay Costs Despite strong general preference avoiding breaking changes many cases opportunity costs making change high interviewees identified several different strategies package maintainers routinely invest effort reduce delay impact changes downstream users Maintaining old interfaces Across ecosystems preserving old interface alongside new one common approach mitigate immediate impact change downstream users specifics depend language tools common strategies avoid breaking downstream implementations include documenting methods deprecated providing default implementations new extension points parameters strategies package developer invests additional effort preserve backward compatibility accepting technical debt form extra code maintain time exchange preventing immediate downstream impact change developer may later time clean code affecting downstream users updated meantime 68 Similarly many interviewees E2 E3 E5–E8 R1 R6–R9 N1 N7 told us various techniques perform changes without breaking binary compatibility prevent rework costs existing users accepting complicated 
implementations harder maintenance changed package possibly also creating costs new downstream users deal complicated mechanisms Parallel Releases Seven developers E5 E6 R1 R2 R4 R7 R8 reported strategies maintain multiple parallel releases downstream developers incorporate minor nonbreaking changes eg bug fixes without adopt major revisions Nodejsnpm’s caret operator allows package authors support parallel releases different version numbers author publish update 101 version 100 even 200 released users wish stay 1 series still receive updates may refer version 1 1x receive anything less 200 common practice provide security patches including older releases contrast CRAN supports sequential version numbering causing developers fork packages eg reshape2 introduced backward incompatible revision reshape However R8 told us discouraged CRAN R8 “Because 2 it’s second version point freeze API leave 6httpsdocsnpmjscommiscsemver 7Current npm security alerts listed httpswwwnpmjscomadvisories 8eg httpswwwnpmjscomadvisories1482 9According httpscranrprojectorgwebpackagespolicieshtml “Updates previouslypublished packages must increased version” jump n1 version continue think there’s lingo CRAN’s instructions package authors they’d rather that” case fact adding code multiple versions suggests developers investing significant additional effort reduce immediate impact downstream users example N1 told us conservative making major new versions since package “has changed major version numbers lot last years many things backported earlier versions irritating major revisions every couple months” variant strategy maintain separate interfaces different user groups different stability commitments within package see façade pattern Reference 30 example interviewee E5 provided parallel detailed frequently changing API expert users simpler stable API insulated less sophisticated users changes Similarly interviewee R1 split packages smaller packages intention user could depend parts relevant 
would exposed less change cases developer accepts higher design maintenance costs multiple APIs reduced impact specific groups users distinct needs Release Planning Individual developers communities may take consideration downstream users planning release changes R1 keeps versions package quickly changing API separate repository batches multiple updates together CRAN less frequently wants release version broader audience RCRAN Nodejsnpm packages released individuals whenever want core packages Eclipse community coordinate around synchronized yearly releases 10 strategy also common package systems Debian 11 Bioconductor 12 Delaying releases may incur coordination overhead opportunity costs slowing development changer reduces frequency though necessarily severity downstream users exposed changes gives downstream users planning horizon Communication users Finally developers communicate various ways users reduce impact breaking change Seven interviewees E6 R4 R7 R8 R9 N6 N7 made early announcements create awareness receive feedback R7 explained “two weeks month actual release sort prerelease announcement Twitter tell people use README” told us validation phase since written script email downstream maintainers release Another reason communicating downstream users help deal aftermath change simplest case developer could invest effort documenting upgrade Nine interviewees E7 R2 R3 R7–R9 N1 N4 N5 mentioned aware users personally could reach individually example N1 contacted users still using old API help migrate N5 users present onsite could therefore help migrate code E7 went far create individual patches downstream packages within Eclipse core get adopt new interface move away old deprecated one cases package maintainers invest effort reduce costs downstream users 423 Influence Community Values previously discussed techniques mechanisms developers use tweaking pays costs change Individual developers often adopt patterns fact six interviewees E1 R3 R4 R5 R8 N6 
described gradual 10httpswikieclipseorgSimultaneousRelease 11httpswwwdebianorgdocmanualsdebianhandbooksectreleaselifecyclerohtml 12According httpswwwbioconductororgdeveloperspackagesubmission “There two releases year around April October” Table 5 Practices Mostly Upstream Communicate Mitigate Effects Change Study 2 Method Practice U Freeze APIs protect downstream users change U Release major change new package name rather new version U Mark API points deprecated warn future removal U Remove deprecated API points eventually U Parallel releases protect users want upgrade U Release changes batch rather made make less churn users U Write new code backward compatible possibly cost incurring technical debt U Proactively notify users upcoming changes U Assist users trouble upgrading new version breaking change U Write migration guide help users upgrade U Write change log document compatibility problems prior releases U Use semantic versioning signal kinds changes made Platform rules requiring package authors negotiate compatibility releasing snapshot consistency U Continue critical updates older versions give users way avoid expensive major upgrade Ways check APIs changed eg API tools since tags documentation adoption formal processes time learned value experience time could clearly observe attitudes practices differ significantly among three ecosystems heavily influenced ecosystem values tools policies Eclipse Developers willing accept high costs opportunity costs Eclipse’s value backward compatibility especially core packages community developed educational material explaining Java’s binary compatibility giving recommendations backward compatible API design 24 25 API Tools community developed sophisticated tool support detect even subtle breaking changes enforce changerelated policies adding since tags API documentation Breaking changes core packages fact rare 38 Even though arguably make platform harder learn maintain Eclipse developers identified documented 25 part 3 
workarounds extending interface maintaining old interfaces creating additional interfaces avoid modifying existing ones eg IDetailPane2 IDetailPane3 IHandler2 runtime weaving Deprecating interfaces methods common actually removing example like many methods orgeclipsecoreruntimePluginstartup publication still included despite deprecated 15 years E6 noted backward compatibility prevents modernizing APIs replacing arrays collections 13httpswwweclipseorgpdepdeapitools 14eg guide published Eclipse foundation evolving APIs says “Obsolete API elements marked deprecated point new customers new API replaces need continue working advertised couple releases expense breakage low enough deleted” 25 15This method deprecated 2004 httpsgithubcomeclipseeclipseplatformruntimecommita46e757a1938edb0a7109dafef349c3a3ffc58ea still present 2020 httpsgithubcomeclipseeclipseplatformruntimeblob9aedff3f2141631a8bc5fa6d1abe005ea633f107bundlesorgeclipsecoreruntimesrcorgeclipsecoreruntimePluginjava Eclipse community invests significant effort release planning cost resulting friction reported multiple interviewees E9 “Eclipse release process projects release time platform projects day projects day you’re expected available little bit make sure bills properly right that’s kinda complexity” required coordination invested toward ensuring stability smooth transitions plannable times downstream users Eclipse release complex process steps aimed maintaining technical interoperability prior versions also maintaining consistent level legal compatibility usability standards security httpswikieclipseorgDevelopmentResourcesHOWTOReleaseReviews culture conservative change contrasts example R developer told us R7 “On one hand try careful hand don’t want inflict harm like paralyzed fact anything might make someone’s life worse Sometimes like go ahead accept things going break it’s end world” Eclipse maintenance releases old major revisions common Table 7 presumably backward compatibility users simply 
told update latest release RCRAN RCRAN community values making easy users get consistent uptodate installation developers invest significant effort achieve consistency policy CRAN packages making changes affect larger body code outside CRAN However changes affect CRAN packages upstream developers asked bear significant extra cost reaching coordinating maintainers affected packages httpscranrprojectorgwebpackagespolicieshtmlSubmission termed “forward impact management” De Souza Redmiles DeSouza2019 Downstream maintainers may also bear cost pressure update packages first upstream developer make breaking change ensure CRAN packages consistent CRAN’s policy requires verifies developers maintain constant synchronization 5 10 interviewees R2 R3 R7 R8 R9 specifically mentioned reaching individually known downstream developers contrast three Nodejs interviewees N1 N4 N5 one Eclipse interviewee E7 Synchronization thus continuous decentralized localized Eclipse’s simultaneous releases Among interviewees five developers specialized R packages targeted small close communities knew users personally example R3 mentioned “no one used” feature asked knew replied “statisticians working lot medical imaging type applications R small community There’s many people know” R3 said got know users interactions dependency one Node Eclipse interviewees E6 mentioned personal connections downstream users sample small sure sampling bias Consistency enforced manual automated checks package update httpscranrprojectorgwebpackagespolicieshtmlSubmission change management process collaborative also demanding maintainers time R7 said timeline adapt upstream change “might relatively short timeline two weeks month that’s difficult deal try sort focus one couple weeks time remain productive” Node developers contrast ignore changes feel like updating N5 “Why don’t upgrade often It’s work you’d hope” Eclipse developers rarely need worry change eg E1 “When new version comes every year July 
whenever I’d go ahead test plugin works correctly new version don’t care much New features mostly irrelevant didn’t care much that” platform conducive multiple parallel releases—on CRAN package revision must higher version number one supersedes old major version cannot updated policies also discourage forking submitting separate name central release planning perhaps perceived slow access cuttingedge research Overall observed much communication coordination downstream users individual changes Eclipse also flexibility regard performing breaking changes Nodejsnpm Nodejsnpm community values ease upstream developers possibility move fast 75 much less demanding developer make breaking change Six Nodejs interviewees talked importance signaling change semantic versioning sharply contrasts R developers asked two R interviewees spoke semantic versioning example R7 “I’m familiar semantic versioning stuff It’s don’t find useful personally R users aren’t familiar think convention little bit ridiculous side R users don’t think version numbers send terribly strong signal likely know version using currently anyway” Semantic versioning Node allows developers make breaking changes long clearly indicate intentions technical platform allows downstream developers still easily use old version without fearing version inconsistencies breaking changes easily cause rippling effects immediate costs downstream users still avoid breaking changes employ various strategies maintain old interfaces interviews Nodejsnpm developers generally willing perform breaking changes name progress fighting technical debt including experimenting APIs right example N6 told us downstream user concerned breaking change “I could tell person well look problem least workaround simple Change dependency exact dependency instead saying depend package foo version Change exactly version still using old one know love postpone problem day need new thing that’s come longer backported old version knowing kind feel kind 
confident enough say yeah we’re gonna bump major version we’re gonna announce whatever takes don’t really feel much desire kind read backward compatible people” mitigation strategy maintenance releases old versions common made easy platform associated tools Analyzing npm repository found 24 100 “starred” packages least common Eclipse RCRAN Table 7 Summary RQ11 results Developers motivated change code many reasons requirements context changes bugs new features rippling effects upstream changes technical debt postponed changes also opportunity costs 19httpscranrprojectorgwebpackagespolicieshtmlSubmission forgoing postponing changes Opposing motivation awareness costs downstream users changes especially userbase large visible cases developers want avoid imposing costs users choice binary however ways softening impacts change maintaining old interfaces making parallel releases making communicating plans upcoming changes Developers weigh choices differently depending ecosystem’s values Eclipse core package developers discouraged heavily change thus opt techniques allow strictly backwardcompatible additions RCRAN developers officially discouraged making changes aware ecosystems rules parallel releases onus downstream users update burdensome downstream users emphasize communication collaboration updates Nodejsnpm developers encouraged make changes mechanisms signal downstream users changes yet insulating requirement adopt changes result upstream developers quite likely opt change police others’ rigorous use signaling mechanisms change semantic versioning 43 Study 1 Results Coping Upstream Change RQ12 upstream developers flexibility planning changes may affect downstream developers downstream developers flexibilities regarding whether react upstream change influenced values policies technologies Table 6 monitor react upstream change significant burden developers eg mismatch schedules shown barrier collaboration 42 urgency reacting change depend significantly development 
context platform mechanisms discussing frequently react upstream change interviewees described spectrum ranging never updating E3 closely monitoring changes upstream packages N1 N2 R9 Two interviewees mentioned explicitly ignoring certain upstream changes N3 N7 others upgraded dependencies time releases N3 N5 deliberate housecleaning sweeps N7 E2 Even platform require updates developers often prefer update dependencies incorporate new fixes features E3 N2 avoid accumulating technical debt R6 N5 avoid updating updates require much effort eg causing complicated conflicts N5 E3 cause much disruption downstream N7 431 Monitoring Change developers want react timely fashion upstream changes need monitor upstream projects way platform eg Nodejs R core CRAN infrastructure often additional source changes developers need keep interviews discovered many different strategies monitoring including technical social strategies strategies varied along urgency needs active monitoring upstream activity general social awareness upstream activities purely reactive stance developers wait kind notifications Active monitoring four interviewees E5 R9 N1 N4 reported actively monitoring upstream changes sense maintaining personal awareness upstream changes regularly looking activity going upstream dependencies R9 N1 N2 said used GitHub’s notification feed regularity N2 changes Nodejs platform upstream packages N4 kept following Twitter feeds blogs attending conferences R7 indicated raw notification feeds current form significant burden low signal noise ratio saying “The quantity notifications get GitHub already point overwhelming don’t even mostly read unless I’m actually working moment” later told us interview tried scaling back watching three five projects actively working one interviewee R9 feel overwhelmed saying occasional skimming GitHub feeds useful way get overview activity Upstream participation seven cases developers mentioned monitoring upstream changes outsiders following stream 
data active participants projects collaborating influence toward needs E5 N4 N7 R6 providing direct contributions packages E7 E9 R7 example describing challenge getting upstream projects prioritize changes needed Eclipse developer said “I touch everything care it’s really hard convince people things need find much easier learn projects need something myself” aligns de Souza Redmiles’ observation exchange personnel common strategy cooperation among dependent projects19 developers wear hats projects maintain active awareness upstream downstream developers upstream developers downstream work informs understanding upstream project’s requirements Others like E5 actively compiled tested development versions upstream dependencies emphasizing importance giving timely reactions “if report within week there’s better chance developer might remember … provides good chance revert change hit milestone” Social awareness Many interviewees tried maintain broad awareness change various social means frequently mentioned mechanism especially Nodejs community Twitter E9 R7–R9 N2 N3 N4a N4b N6 N7 example N4a commented “the people write actual fairly well connected Twitter … like water cooler type thing tend know what’s going elsewhere” ecosystem interviewees E5 R9 N4 N6 mentioned importance facetoface interactions conferences awareness important changes ecosystem mentioned social mechanisms learn change personal networks R6 R8 blogs E1 R4 R7 R8 N4 N7 curated mailing lists N1 Reactive monitoring Although research questions led us probe interviewees aforementioned active social monitoring practices reactive strategy also possible dependencies rather maintain awareness understanding plans activity upstream example watching Github feed keeping track follow changes might relevant developer may instead ignore upstream projects’ activity given actionable evidence needs adapt way developer waits hear problems others advance things broken Upstream developers contacting breaking changes failing 
tests dependency updates platform maintainers warning changes would affect tools enable reactive stance generate targeted notifications certain kinds changes specific tools differ among platforms support different practices policies Policies common practices eg testing practices platform strongly turn affect reliability reactive strategy corresponding tools Four developers R3 E5 N2 N7 mentioned use continuous integration detect compiletime issues caused breaking changes upstream packages early tools gemnasium 32 greenkeeper 35 allowed Nodejsnpm developers get notifications new Table 6 Practices Mostly Downstream Monitor Change Manage Avoid Effects Study 2 Method Practice Awareness coordination Reactively track upstream packages breaks you’re notified somehow Proactively track maintain awareness via github notifications mailing lists etc Submit feature requests bug reports upstream package authors Participate decisionmaking upstream package’s future Toolbased notifications upstream changes eg Greenkeeper Regularly test unreleased development versions dependency give timely feedback P Socially connected group developers following Twitter going conferences etc P Political work among core people get buy making breaking change Protection potential change update dependencies leave old versions known work Upgrade dependencies making new release Dependency hell manual manipulation dependency version constraints get set dependencies mutually compatible U Violate semantic versioning trivial changes prevent rippling updates version change would require Lock file fix versions upstream packages incl transitive dependencies release Report wrong semantic versioning bug MS Specify exact version number specific dependency MS Specify range legal version numbers dependencies eg allow minor major upgrades MS Specify dependency’s name constrain version used Protection dependencies significant research dependency weighing whether adopt Wrap dependency abstraction layer decrease risk 
change Avoid use dependencies roll SM Clone dependency’s code maintain new code MS Copy dependency code repository “vendoring” get exact version needed releases upstream packages Gemnasium alerted developers package releases fix known vulnerabilities whereas greenkeeper submitted pull requests automate continuous integration run new release either case developers could react notifications email pull requests CRAN’s requirement upstream developers notify downstream dependents change coming appears encourage downstream developers across ecosystem take reactive stance contrast Eclipse Nodejsnpm individual downstream developers need employ optional monitoring tools R7 defended practice waiting told breaking changes principled attentionpreserving choice consistent ecosystem norms R2 apologetic reactive “I guess I’ll sound crass say things like would wait hear CRAN something broke don’t think keep it” CRAN enforces policy manual automated checking package update running package’s tests test downstream packages repository well static checks CRAN team may warn affected downstream developer upcoming change email 432 Reducing Exposure Change Many developers developed strategies reduce exposure change upstream modules thus reduce monitoring rework efforts degree developers adopt mitigation strategies depends technology policies values discuss Limiting dependencies CRAN Eclipse interviewees asked 11 interviewees R1 R2 R3 R4 R6 R7 E1 E2 E4 E5 E9 felt better fewer dependencies Reasons limiting dependencies included limiting one’s exposure upstream changes burdening one’s users lot modules install potential version conflicts “dependency hell” Interviewee E5 represents common view “I depend things really worthwhile basically everything depend going give pain every often that’s inevitable” Apart removing longer needed dependencies tooling provided Eclipse six developers described aggressive actions avoid dependencies including copying R4 recreating R1 R6 R7 N6 functionality another 
package N6 fork recreate upstream dependency temporary measure licensing issue feel dependencies burden generally contrast due Nodejsnpm’s ability use old versions Eclipse’s stability three developers E3 N1 N5 specifically said see dependencies burden Selecting appropriate dependencies limiting appropriate dependencies interviewees mentioned variety different signals looked fell five categories Trust developers Seven interviewees E4 R1 R5 R6 R7 N4 N6 mentioned basing decisions personal trust package maintainers Criteria included large organization E4 reputation high quality code R6 N6 consistent maintenance R6 One interviewee R7 deliberately sent bug reports package test whether developer would responsive depending Activity level Five interviewees E4 N6 N2 R1 R6 considered activity level community developers example distinguishing “real” ongoing abandoned research prototype high low activity levels positive indicator depending state stated N2 “Ones activity mostly better maintained lots people contributing like express It’s likely community eyes ball consider backward compatibility ramifications … Ones little activity small projects don’t change often change isn’t issue either” Size identity user base Four developers mentioned size user base using signals daily download counts E2 N3 N5 whether projects trusted developers use N6 E2 said “Whether I’ll actually jump perceive projects using it” N5 told us “We look see many people using number downloads per day it’s low that’s clue it’s sketchy perfect heuristic” history Four interviewees said assumed past stable behavior package would predict future stability R1 R4 R6 E2 Signals included experience package N4 E5 status part platform’s core set packages E4 visible version history lack recent updates version number 10 E3 N1 N4 artifacts Finally developers mentioned signals artifacts including coding style R1 R6 documentation R1 good maintenance N6 perceived ease adoption R1 code size E2 N4 N7 conflicts dependencies N5 
Encapsulating change Interestingly almost mention traditional encapsulation strategies isolate impact changes upstream modules contrary expectations typical softwareengineering teaching 63 73 88 N6 mentioned developing abstraction layer package upstream dependency implemented anticipated change Questions encapsulation interview protocol ask specifically one possible explanation since upstream package already generally try avoid gratuitous API changes ones necessary would require changes encapsulating class’s API obviating point encapsulation
::::
433 Platform Values Developer Values policies tools practices support different values ecosystem impose different costs developers depending whether attitude towards particular dependency aligned conflicted community’s broader values situations developers treat dependency fixed resource draw functionality also termed API contract 20 situations treat interface open negotiation change also API communication mechanism 20 Eclipse’s value backward compatibility predictable release planning convenient developers corporate stakeholders wish rely released core platform code fixed resource Stability ensures developers relying platform packages need monitor upstream changes reacting yearly releases Signals whether trust upstream package primarily social sense trust packages part core supported corporations known invested stability platform According E6 developers working within volatile parts Eclipse ecosystem using code outside stable core indevelopment features core greater need monitoring may exposed change sometimes encountering friction associated E6 told us “there different understanding important compatibility means start platform outer circles Eclipse” E5 talked recompiling upstream code often report bugs within week Thus although Eclipse deeply values stability necessarily sphere activity active collaboration change value appropriately set aside CRAN’s emphasis consistency timely access research seems encourage API communication rather API contract 20 view dependencies snapshot consistency approach forces maintainers react breaking upstream changes quickly typically weeks 87 causes apparent friction researchers might otherwise wish publish move things Many interviewees limited dependencies sometimes quite aggressively replicating code reacting notifications change rather actively following community upstream developers However active socially connected subset developers R7–R9 seemed welcome collaboration Although R7 advocated reacting upstream changes rather trying 
anticipate R7 R8 R9 emphasized Twitter conferences maintain upstream awareness Nodejsnpm’s emphasis convenience developers led infrastructure seems decouple upstream downstream developers collaborate since downstream depend old versions upstream long like logically lead less urgency monitor upstream changes except patching security vulnerabilities Developers nonetheless often choose take collaborative approach development using tools continuous integration greenkeeper 32 force stay date despite platform’s permissiveness Summary RQ12 results Downstream developers motivated update dependencies take advantage bug fixes new features avoid technical debt However updates complex risky disrupt downstream users may require awareness ongoing activity upstream Strategies balance costs risks include different levels awareness upstream projects social technical participation active merely reactive monitoring chunking work making updating decisions periodically limiting problem carefully vetting dependencies begin upstream change decisions ecosystem’s context affects participants’ choices Eclipse’s extreme interface stability allows downstream developers least outside core trust ignore possibility change CRAN’s policy global consistency among packages creates pressure package maintainers actively collaborate upstream counterparts core community seems spurred active collaboration Twitter conferences peripheral community limits dependencies avoid necessity Finally NPM’s tooling decouples downstream developers immediate impact upstream changes developers nonetheless wish stay date adopt tools like greenkeeper remind encourage update
::::
44 RQ13 Unintended Consequences Interviewees told us instances policies combinations led unintended consequences Eclipse One Eclipse developer said “political” nature making changes drive away developers users “You patient know talk whatnot really know play game get patches accepted think it’s intimidating new people come on” explained many interdependent packages managed different people mandate change interfaces implementing rippling change require negotiations among people conflicting interests Another consequence Eclipse’s stability along use semantic versioning many packages changed major version number 10 years However E8 told us strict semantic versioning impractical follow even cases breaking changes clearly documented release notes removing deprecated functions major versions often increased Updating major version number ripple version updates downstream packages entail significant work many downstream projects hardcoded major version numbers dependencies NodejsNPM Nodejsnpm contrast rapid rate changes automatic integration patches raise concerns reproducibility commercial deployments many cases community builds tools work around issues providing tools take specific snapshot installation including transitive package dependencies eg “npm shrinkwrap” RCRAN’s packrat “In npm install today tomorrow you’ll get 100s dependencies something may changed even version servers could running slightly different code customer facing code differ hard reproduce” RCRAN CRAN similar issue regarding scientific rather deployment reproducibility community’s goal timely access current research conflicts many researchers’ goal ensure reproducibility studies 61 RCRAN opposite dynamic Node evident versioning policy official policy version numbers requires version numbers increase submission20 permissive form semantic versioning used recommended many developers 87 91 conflicts unintended consequences suggest design ecosystem practices solved problem Summary RQ13 results Unexpected 
community responses policies included creative use semantic versioning innovative ways promoting replicability stagnation
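The Eclipse ripple effect described in this section, where bumping a major version forces edits in every downstream project that hardcodes the old major number, can be sketched in a few lines. This is only an illustration under stated assumptions: versions are plain MAJOR.MINOR.PATCH strings, and the `Pin` record, the package names, and the helper `needs_manual_update` are hypothetical, not part of any ecosystem's tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Split a plain MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify an upstream release as a major, minor, or patch bump."""
    o, n = parse(old), parse(new)
    if n <= o:
        raise ValueError("new version must be greater than old version")
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


@dataclass(frozen=True)
class Pin:
    """Hypothetical record: a downstream project hardcoding an upstream major version."""
    downstream: str
    pinned_major: int


def needs_manual_update(pins: List[Pin], new_version: str) -> List[str]:
    """Return downstream projects whose hardcoded major no longer matches upstream."""
    new_major = parse(new_version)[0]
    return [p.downstream for p in pins if p.pinned_major != new_major]


# Hypothetical downstream projects, all pinned to upstream major version 3.
pins = [Pin("plugin-a", 3), Pin("plugin-b", 3), Pin("plugin-c", 4)]
kind = bump_kind("3.8.1", "4.0.0")
affected = needs_manual_update(pins, "4.0.0")
```

Under these assumptions, a single major bump upstream touches every project still pinned to the old major, which is the cost E8 describes.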
::::
5 STUDY 2 SURVEY VALUES PRACTICES PREVALENCE CONSENSUS RELATIONSHIPS research questions Study 2 emerged large part results first study Study 2 endeavored expand scope beyond three cases ask questions raised results Study 1 revealed substantial differences three cases practices used manage breaking changes values practices appeared serve raises question prevalent differences values may nearly universal practices may fundamental wellknown effective employed nearly ecosystems However different ecosystems make use different technologies evolved different cultures serve different constituencies suggesting least values practices may vary perhaps dramatically among ecosystems questions Study 2 therefore RQ21 extent values practices managing breaking changes shared among diverse set ecosystems Moreover making assumption ecosystems tend shared view values practices across ecosystem ie characteristics ecosystems rather individual projects subecosystem clusters projects seems important test assumption hence RQ22 extent individual ecosystems exhibit consensus within community values practices Finally observed Study 1 seems practices designed serve ecosystem’s values eg insulate installed base applications changes Eclipse make easy endusers install use latest RCRAN allow developers contribute code simply possible Nodejs particular values always associated specific practices value ask generally RQ23 relationship ecosystem values practices Anonymized survey data available 7 20httpscranrprojectorgwebpackagespolicieshtml 51 Study 2 Results Validation Study 1 presenting new results survey take opportunity validate results Study 1 since available hundreds survey responses covering similar questions three ecosystems study Study 1 characterized practices values three ecosystems based interviews developers ecosystem values inferred Eclipse NodejsNPM align data Eclipse participants seem value backward compatibility postulated Stability compatibility two highest ranked values Table 10 
Aligning findings interviews Eclipse developers topranked claiming make design compromises name backward compatibility Figure 3c Aligning interview result showed Nodejs developers value ease contributions developers Nodejs participants survey top ranked valuing innovation ranked highly making frequent changes package Figure 3a facing breaking changes dependencies Figure 4a although midrank feeling less constrained making changes ecosystems Figure 3b CRAN survey participants highly rank rapid access expected interviews averse adopting dependencies predicted shown although predicted claim clone code shown Aligning interview results discussing personal contacts among upstream downstream developers top ranked reporting personally warned changes dependencies Figure 4e contrary expectations low ranked warning downstream users Figure 3h contrast particular ie frequently warned rarely issuing warnings suggests RCRAN interviews may overweighted toward downstream developers Although survey largely validates interview results differences highlight fact different methods different sampling strategies produce somewhat different results even design intentions core members responsible promulgating practices necessarily propagated whole community 52 Study 2 Results Extent Values Practices Shared across Ecosystems RQ21 survey policy analysis data mining revealed interesting pattern similarity differences values practices across ecosystems vary across ecosystems rare see clear division ecosystems two distinct groups Rather sorting tends generate smooth curve extremes Visible differences ecosystems either end spectrum generally statistically significant often ecosystems stand discuss plot answers many survey questions Figures 2 3 4 Table 7 values except commerce Figure 2 considered least “somewhat important” ecosystems Stability quality community nearly universal values compatibility rapid access replicability also rated highly across ecosystems see bottom rows Figure 2 exceptions 
quality particular participants felt even strongly consistently high importance personally ecosystem whole mean personal value quality 08 scale points higher mean ecosystem value Still see strong differences ecosystems end spectrum Personal values correlate strongly perceived community values Spearman rho 0416 p 00001 n 10878 comparing two answers eleven values person separate observation participants average rated quality much higher personally Table 7 Comparison Datamined Practices Data librariesio World Code 57 see Section 336 Details Ecosystem Exact b min c range unconstrained e Cloning f Lock Files g Maint old vers Atom plugins 225 155 737 129 262 01 18 CocoaPods – – – – – 837 385 Eclipse plugins – – – – – na – ErlangElixirHex 909 925 816 00 – 657 395 Go – – – – 324 144 v – Haskell CabalHackage – – – – – 05 104 Haskell StackStackage – – – – – 0 na LuaLuarocks – – – – 321 0 – Maven 1000 0 0 0 072 Java na 254 NodejsNPM 163 044 786 367 703 08 396 NuGet 527 887 601 0 – 72 176 PerlCPAN 1000 00 00 00 230 10 272 PHPPackagist 213 372 667 799 116 169 106 PythonPyPi 146 345 586 441 817 na 607 RBioconductor – – – – 359 02 na RCRAN 00 244 00 756 269 08 010 RubyRubygems 378 496 463 094 176 174 454 RustCargo 386 214 936 40 690 146 14 Dependency Version Constraints versions packages data packages’ dependencies proportion dependencies constrained Exact version number specified minimum version range versions left version unconstrained Dash– means data dependencies tracked librariesio language files indexed WoC common type constraint ecosystem bolded Cloning percent packages repository whose projects borrowed file another package Maint old vers percent packages whose version number increase monotonically Lock files percentage packages use lock file set exact version transitive dependencies na equivalent lock file v Go includes projects “vendor” directory similar effect lock file compared rated ecosystem value 9 Likert scale points paired ttest p0001 also tended rate fun 
slightly higher personally 6 Likert scale paired ttest p0001 differences within half Likert scale point Additional values openended questions also asked openended question values important ecosystem Common themes counted Table 8 Answers included usability 15 responses social benevolence good conduct altruism empowerment making resources available 17 responses interesting pair contrasting values considered standardization 12 responses technical diversity 17 responses Technical diversity advocates valued freedom implement things interact developers diversity ways “the package creator charge deciding best manage hisher package organize contributors …” NodejsNPM respondent standardization advocates said ecosystem limited choice save developers time effort promoting wide adherence standards eg Python respondent said platform’s “open ecosystem proposes commonly used sensible ways solve popular problems enforces de facto standards” decried chaos “NIH Invented syndrome” Table 8 Number Respondents Suggesting Ecosystem Values Usability Social Benevolence Standardization Technical Diversity Documentation Modularity Testability Ecosystem Usability Social Benevolence Standardization Technical Diversity Documentation Modularity Testability Atom plugins 1 CocoaPods 2 2 Eclipse plugins ErlangElixirHex 1 1 1 Go 1 4 4 2 1 1 Haskell CabalHackage Haskell StackStackage LuaLuarocks 1 Maven NodejsNPM 1 1 3 7 NuGet PHPPackagist PerlCPAN 2 2 3 5 2 1 5 PythonPyPi 1 2 1 2 2 RBioconductor 4 RCRAN RubyRubygems 3 3 2 2 4 RustCargo 1 1 1 1 1 1 1 1 1 responses question deemed really ecosystem values rather favored technical qualities code package level 64 responses might promoted ecosystem culture good documentation 11 responses 4 Bioconductor participants high modularity 16 responses 7 NodejsNPM testability 11 responses 4 Ruby Perl Finally 13 8 responses objected framing question claiming either community existed could said share values 5 respondents 3 Maven saying multiple subcommunities 
existed differing values 8 respondents including 2 ErlangHex 2 HaskellCabal recent surveys 34 77 used similar sets values light responses survey propose revised list values Appendix C new list adds new values Standardization Technical Diversity Usability Social Benevolence removes Quality since distinguish among ecosystems Change planning practices Participants across ecosystems indicated survey Figure 3 perform breaking changes rarely median less year changes participants perform Figure 3a breaking changes package faces dependencies Figure 4a Although prior research suggests breaking changes “frequent” Section 2 relative overall frequency change Applying backofenvelope estimate Decan et al 21’s findings example report 5 updates actually caused breakages background rate 1.2 updates per year per package 1029 updates 1710 packages sixmonth window one breakage every 17 years Given breakages may evenly distributed packages multiple recursive dependencies developers work multiple packages experiencing breakage year range Table 9 Comparison Sanctioned Practices Features Ecosystem Dependencies outside repository b Central Repository c Access old dependency versions e Gatekeeping standards f Synced ecosystem Atom plugins ● ● ● ● ● CocoaPods ● ● ● ● ● Eclipse plugins ● ● ● ● ● ErlangElixirHex ● ● ● ● ● Go ● ● ● ● ● Haskell CabalHackage ● alt repo ● ● ● ● Haskell StackStackage ● ● ● ● ● LuaLuarocks ● ● ● ● ● Maven ● ● ● ● ● NodejsNPM ● ● ● ● ● NuGet ● alt repo ● ● ● ● PerlCPAN ● alt repo ● ● ● ● PHPPackagist ● ● ● ● ● PythonPyPi ● ● ● ● ● RBioconductor ● alt repo ● ● ● ● RCRAN ● alt repo ● ● ● ● RubyRubygems ● ● ● ● ● RustCargo ● ● ● ● ● ● ecosystem feature ○ feature □ feature group packages individual packages alt repo reference alternative repository staged releases groups packages debugged together released group submitter author package vetted core core packages See Section 335 details plausibility perhaps actual experience dealing breaking change may infrequent even 
breaking changes frequent overall ecosystem Respondents every ecosystem agreed average used semantic versioning comparable versioning strategies Figure 3f batch multiple changes single release Figure 3d document changes Figure 3e conservative adding dependencies projects Figure 4c seem generally considered good softwareengineering practices independent programming language ecosystem Answers varied dramatically among ecosystems included reluctance make breaking changes Figure 3b willingness compromise design backward compatibility Figure 3c synchronizing users releasing changes Figure 3h Data mining reveals ecosystems also vary considerably often make updates previous versions ranging high 25 Maven projects least 01 RCRAN projects Turning shared community resources two ecosystems studied supply central repository server packages could downloaded automatically needed Table 9b Two Go Eclipse maintain indexes maintainers’ servers must supply package metadata standard way Advertised submission requirements packages show ecosystems differed level vetting Table 9e packages repositories apply Haskell’s CabalHackage system unusual vets maintainers apply accounts handchecked human reviewers apply minimal automated standards submitted packages CRAN strict standards package submissions updates vetted hand well automated tests Three ecosystems released regular synchronized schedule Table 9f core set packages Eclipse well whole Bioconductor synchronized releases R runtime CPAN work staged sequence development build worked consistent parts released group official supported release ecosystems allow developers release packages whenever authors wish similar practices operatingsystemlevel ecosystems Debian’s APT repackage variety languages ecosystems compatible releases operating system Note Stackage’s sets compatible packages curated together post hoc development synchronized unless developers collaborate Practices coping dependency changes Sixteen 18 ecosystems offer optional Table 
9b widely used central repository Table 9a packages usually encouraging packages refer dependencies name version number asked specifically package’s exposure breaking changes upstream packages participants across ecosystems reported low frequencies Figure 4a quarter participants indicated saw breaking change per year Participants ecosystems conservative change practices eg Eclipse Erlang Perl exposed slightly fewer breaking changes Participants across ecosystems indicated conservative adding dependencies Figure 4c perform significant research first Figure 4d contrast learn updates Figures 4e–g eg personal contacts tools rate may skip Figure 4h declare version constraints dependencies Figure 4i depends significantly ecosystem Data mining Table 7 reveals file cloning rare less 10 projects every ecosystem measured developers instead rely package dependency infrastructure Table 7e Mining also confirmed survey answers users packages chose constrain versions packages depended Maven almost universally relies fixed version number eg package might depend precisely version 321 package B ecosystems typically constrain dependencies version number ranges NodejsNPM Atom PHP RustCargo specifying minimum version NuGet RubyRubyGems leaving versions unconstrained PythonPyPi RCRAN Survey mining results differed one ecosystem however PerlCPAN users claimed ecosystem’s typical practice specify name 43 respondents version range 36 dependencies yet mining librariesio revealed nearly 100 use exact version numbers may matter developer perception librariesio apparently measures precise dependencies captured published repository tools DistZillaPluginDistINI generate lessconstrained numbers specified developers Universal distinctive considerable nuance differences among ecosystems overall results suggest several values seem universal least 21 httpscranrprojectorgwebpackagespolicieshtml 22 httpswikidebianorgApt 23 httpsgithubcomcommercialhaskellstackagefrequently askedquestions 18 ecosystems 
surveyed Chief among stability quality community compatibility rapid access replicability achieved nearuniversal status unique personality ecosystem however seems derive either key distinctions values practices set apart many examples including Bioconductor Eclipse stand coordinating releases synchronized fixed schedule survey Figures 3i j Table 9f valuing curation Figure 2 Table 9e Go distinctive version numbering practice require version updates changes Figure 3g Table 9c CRAN Bioconductor strict requirements submission update packages Figure 3k Table 9e Lua developers value fun feel least constrained making changes code generally coordinate much others Figures 3bh Rust strong stance openness least prone make design compromises backward compatibility Figure 3b c Data mining Cargo projects show rarely port fixes earlier code releases Table 7g CPAN developers universally claim write change logs Figure 3e Value differences ecosystem statistically significant values KruskalWallis run separately value check differs ecosystem p 000001 chi2 ranging 53704 quality 17869 commerce Summary RQ21 results Stability quality community compatibility rapid access replicability important across ecosystems openness curation standardization technical diversity values universal differ ecosystem Breaking changes experienced rarely one developer order yearly even though common within ecosystem whole Differing ecosystem circumstances lead great variety developers’ willingness make breaking changes conversely compromise designs ensure backward compatibility turn consumers’ eagerness incorporate upstream changes
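The back-of-envelope estimate built on Decan et al.'s figures earlier in this section can be checked with a short calculation. The sketch below reads the figures as 5% of updates causing breakage and 1,029 updates across 1,710 packages in a six-month window. The function names are illustrative, and the per-developer extrapolation assumes breakages are spread evenly and independently across packages, which the text itself warns may not hold.

```python
def updates_per_package_year(updates: int, packages: int, window_years: float) -> float:
    """Background update rate per package, scaled to one year."""
    return updates / packages / window_years


def years_between_breakages(rate_per_year: float, breaking_fraction: float) -> float:
    """Expected years between breaking updates for a single dependency."""
    return 1.0 / (rate_per_year * breaking_fraction)


def expected_breakages_per_year(n_dependencies: int, years_between: float) -> float:
    """Expected breakages per year across several dependencies.

    Assumes an even, independent spread across packages (a simplification).
    """
    return n_dependencies / years_between


rate = updates_per_package_year(1029, 1710, 0.5)   # about 1.2 updates per year
gap = years_between_breakages(rate, 0.05)          # about 16.6 years, i.e. roughly 17
per_year_17_deps = expected_breakages_per_year(17, gap)  # about one breakage per year
```

On this reading, a developer exposed to roughly 17 direct and transitive dependencies would see about one breakage per year, which is consistent with the low frequencies survey participants reported.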
::::
53 Study 2 Results Extent Consensus within Ecosystems Values Practices RQ22 distribution value ratings within ecosystem particularly wide values replicability openness curation indicating generally less consensus values evidence broad consensus highest ranked values ecosystems Table 10 conspicuously cases value clearly aligns core purpose ecosystem illustrative example Stackage CabalHackage two Haskellbased ecosystems contrasted strongly compatibility curation participants rated values much important Stackage HackageCabal Stackage also rated markedly lower rapid access ecosystems values consistent stated goals Stackage “to create stable builds complete package sets” Stackage built top Cabal express purpose curating compatible sets versions Hackage submissions require submitted developer whose identity manually vetted Table 9e Volunteer curators wait set consistent package versions assembled release Table 10 Values Commonly Rated Highest Ecosystem Ecosystem Top 3 values Consensus HaskellStack compatibility replicability curation 75 55 45 PerlCPAN stability replicability quality 64 40 31 Maven replicability stability quality 64 38 32 LuaLuarocks fun replicability quality 64 35 17 Eclipse stability compatibility quality 62 48 37 NuGet replicability compatibility stability 59 37 20 Go quality stability fun 56 37 19 RBioconductor replicability quality compatibility 52 32 26 CocoaPods quality stability compatibility 52 30 17 RustCargo replicability stability community 51 31 23 PHPPackagist quality stability compatibility 50 32 23 NodeNPM rapidaccess community innovation 50 24 15 Atom rapidaccess fun openness 50 26 17 Erlang quality fun stability 46 24 18 HaskellCabal quality innovation replicability 43 17 8 Python replicability quality stability 42 20 14 Ruby fun community rapidaccess 41 18 12 RCRAN replicability compatibility innovation 36 20 8 Consensus Cn percent respondents ecosystem rate value higher ecosystem’s highest n values Top three values listed ecosystem 
indicates relative popularity values indicates ties unit trading rapid release tested compatibility StackageHackage choice controversial Haskell community may make perceived differences values practices visible examples include Maven primarily build tool comes centralized hosting platform Java packages designed collaborative platform purpose reflected strongly valuing replicability least valuing community openness fun Bioconductor platform scientific computation specifically analysis genomic data molecular biology replicability research results key asset commerce clearly focus Lua widely used embedded scripting language games prior work shown culture game developers significantly different application developers 58 example game development communities value creativity communication designers rigid specifications makes extensive automated testing impractical Others like RCRAN markedly less consensus least regarding set values surveyed practice differences explained enforced policies design choices platform tools example Nodejsnpm sets version range dependencies default dependency added Figure 4i Bioconductor core packages Eclipse synchronized central release Figure 3i j Table 9f Bioconductor CRAN require reviews packages included repository Figure 3k Table 9e practices supported optional tooling ecosystem tools create notifications dependency updates Nodejs Ruby community Figure 4i eg gemnasium greenkeeperio practices seem mere community conventions—for example providing change logs encouraged documentation CPAN enforced yet practice apparently universal Figure 3e Interestingly cases practices surprisingly little consensus ecosystems given know tools policies ecosystem example 266 Nodejs respondents indicated “package meet strict standards accepted repository” Figure 3k even though community’s npm repository checks Table 9e fact contains many junk packages may ecosystem members aware design space practices ecosystems employ biased interpretation “strict standard” 
Alternatively participants may members subcommunities contrasting values practices example may vetting revisions among developers within specific subcommunity also hosted npm role roles wanted explore possibility survey respondents’ differences perceived values practices may explained role respondent ecosystem ecosystem may appear different depending one’s responsibilities perspective survey asked people role ecosystem choices user committer submitter package lead central package lead aka lead founder analyzed core lead founder roles differed rest within ecosystem suspected core peripheral ecosystem participants may different values found little evidence case tested ratings perceptions 11 values found one value replicability statistically significant difference ttest p 0044 n 1504 however difference small average rating 35 5 core 368 noncore thus difference 018 scale points evidence value perceptions differed values ttest p 13 73 n ranging 1492 1504 Core people seemed enmeshed community roles sense likely collaborate upstream packages chi21 N932 16571 p 0001 21 likely answer yes question “In last 6 months participated discussions made bugfeature requests worked development another package one packages depends on” downstream dependencies chi21 N925 24132 p 0001 18 likely answer yes question “Have contributed code upstream dependency one packages last 6 months one you’re primary developer” claim know users’ needs chi21 N932 62947 p 0001 29 likely answer “Strongly” “Somewhat agree” question “I know changes users want” People core roles felt slightly confident answers community values questions chi21N932 62247 p 05 8 likely answer “Confident” “Very confident” question “How confident ratings values above” difference statistically significant large short features distinguish core community members rest seem culturally part communities perceive values Summary RQ22 results Ecosystems tend many values distinguish virtue distinctive values strongly related purpose audience 
Consensus practices largely entirely driven affordances shared tooling policies enforce encourage Core peripheral members ecosystem community share ecosystem’s values core members collaborative practices 54 Study 2 Results Relationship Values Practices Case Stability RQ23 One might expect ecosystems share similar values would adopt similar practices support values practices case averaged value practice answer within ecosystem get summary ecosystem mean answers looked correlations value practice among columns within 18 rows strong correlations values practices 418 valuepractice comparisons 29 significantly correlated Spearman test p 005 however even may due chance small sample size n 18 large number comparisons applying HolmBonferroni correction rules taking correlations conclusive fact practices universally associated particular values implies value associated adoption different practices example practices shown violin plots one perception ecosystem’s use exact version numbers refer dependencies Figure 4i choice E significantly correlated perceived value stability ecosystem Spearman correlation mean answers within ecosystem rho 0506 p 05 n 18 ecosystems investigate relationship comparison practices associated stability three ecosystems high ratings high consensus stability Eclipse Perl Rust Figure 2 Table 10 survey results indicate ecosystems achieved stability different sometimes nearly opposite practices Eclipse stability strict standards gatekeeping Eclipse’s leadership strongly promotes stable plugin APIs mentioned earlier official developer documentation includes “prime directive” “When evolving Component API release release break existing Clients” 25 Eclipse developers rated stability higher ecosystem smallest variance mean ratings stability Figure 2 strong consensus stability highest value cf Table 10 Survey answers practices show Eclipse relies gatekeeping Figure 3k developers claim make design compromises achieve backward compatibility Figure 3c police 
others’ backward compatibility release together sure break legacy code Figure 3i developers feel constrained making changes Figure 3b Rust stability dependency versioning stability attributes Rust contrast ranked lowest design compromises backward compatibility Figure 3c rarely maintains outdated versions Table 7g high semantic versioning Figure 3f Rust’s Cargo infrastructure prevents use wildcards dependency versions although allows ranges Figure 4i almost universally used 936 Cargo packages Table 7c 24Figures 3b–k Figures 4c–h j four answers Figure 4i taken separately Users thus prodded use older versions dependencies rather letting tools upgrade automatically burdening upstream packages bug reports things change stability features include “lock” file records exact versions dependencies used version Table 7f feature called “stability attributes” tag API elements guaranteed stable contrast new features might change 80 Survey results show Rust developers acknowledged community’s stated value stability Figure 2 despite fact participants also perceived ecosystem’s packages fact relatively unstable Figure 4b Rust language developers consistent promising stability “stable” branch language extent test compiler changes entire corpus Rust programs find GitHub analysis community’s 2016 user survey 79 summarized many users complained instability many packages “crates” relied unstable “nightly” development versions compiler take advantage interesting new features concluded “consensus formed around need move ecosystem onto stable language away requiring nightly builds compiler” CPAN stability centralized testing Finally Perl unlike Rust low semantic versioning Figure 3f fact likely ecosystem claim refer dependencies name version number Figure 4i indicate gatekeeping design compromises extent Eclipse Figure 3c k However response openended question values covered survey 12 40 30 PerlCPAN participants gave comments mentioned testability many referring Perl’s extensive battery 
tests run CPAN packages volunteers one explicitly claimed test facility helped stability Perl packages CPAN stages changes releases packages together Table 9f almost entirely specifying fixed version numbers dependencies Table 7a HaskellHackage participant mentioned CPAN’s kwalitee metric operationalization quality employed testing facilities attributed ecosystem’s “focus stability compatibility” three ecosystems work towards stability different ways Eclipse longstanding corporate support able dictate upstream developers pay cost maintaining backward compatibility RustCargo although users clamor stability eager attract developers cannot impose cost stability fiat Eclipse instead apply gentle pressure upstream developers various ways easing pressure downstream developers discouraging automatic major updates CPAN finally large cadre volunteers CPAN Testers built infrastructure taking task thorough testing comparison stability practices demonstrates relationships practices values contextdependent thus hard generalize comprehensive theory incorporating insights task future work hope dataset questions suggests provide useful launching point Contrasts revealed survey ripe investigation researchers find appropriate subjects case studies values pursued contrasting ways conversely practices associated contrasting values case analyzing differences three ecosystems suggests theory practices values take account factors including presence availability motivations different kinds developers confirmed however exhaustive study 25 Testability value surveyed recommend new value expanded list since many survey takers suggested ecosystems practice contrasts Ecosystem communities dissatisfied practices use starting place find alternative combinations practices others using Summary RQ23 results Many ecosystems clear distinctions key values practices Often consensus important values high practices actually enforced policies platform tools However values particularly quality nearly 
universal value engineers little variance among ecosystems Breaking changes also generally avoided though strategies achieved difficult perceived depends specifics ecosystem
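The version-declaration practices compared above (exact versions, ranges, and name-only or wildcard dependencies, as in Figure 4i) can be illustrated with a small classifier. This is a minimal sketch, not the study's instrument: the function names `classify_requirement` and `summarize` and the simplified requirement syntax are assumptions made for illustration, and real package managers differ (for example, Cargo reads a bare `1.2.3` as a caret range, while npm reads it as an exact version).

```python
import re
from collections import Counter

# Hypothetical, simplified syntax: "=1.2.3" or a bare "1.2.3" counts as exact,
# "*" or an empty string counts as a wildcard (name only, newest version wins),
# and anything else (e.g. "^1.2", ">=2.1, <2.4", "3.x") counts as a range.
EXACT = re.compile(r"=?\s*\d+\.\d+\.\d+")


def classify_requirement(spec: str) -> str:
    """Return 'wildcard', 'exact', or 'range' for one requirement string."""
    spec = spec.strip()
    if spec in ("", "*"):
        return "wildcard"
    if EXACT.fullmatch(spec):
        # Assumption: bare full versions are exact pins. Cargo would instead
        # treat a bare "1.2.3" as the caret range "^1.2.3".
        return "exact"
    return "range"


def summarize(specs):
    """Count how often each declaration strategy appears in a list of specs."""
    return Counter(classify_requirement(s) for s in specs)


if __name__ == "__main__":
    print(summarize(["*", "=1.2.3", ">=2.1, <2.4", "3.x", "0.9.1"]))
```

Tallying such labels over the manifests of one ecosystem would give a rough share of each strategy, comparable in spirit to the self-reported shares in the survey.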
::::
6 DISCUSSION FUTURE WORK
article makes several contributions toward understanding ecosystems go critical task managing breaking changes practices reflect culture values ecosystem participants
Study 1 contributes qualitative accounting different ways three contrasting ecosystems manage change differences relate different values different ideas classes participants bear costs Prior work [19, 36, 67, 72] examined particular practices change management noted prevalence breaking changes [22, 48, 54, 90] contribution characterize types change negotiation practices found three different ecosystems show different sets practices require varying amounts effort different classes ecosystem participants also show different sets practices reflect ecosystem values community community needs take precedence
Study 2 builds examining practices values larger set 18 ecosystems find values appear universal nearly within set ecosystems perhaps reflecting broader open source culture values show considerable divergence appears substantial component ecosystems’ distinctive “personalities” Within ecosystems values appear reflect consensus among participants views others highly variable perhaps reflecting diverse views subsets projects individuals rather ecosystemwide values also show relationship practices values simple illustrate apparent nature relationships contrasting different practices several ecosystems employ pursuit stability value highly following subsections outline new interesting research questions brought light work
::::
6.1 Practices Conflict Complementary
seems highly unlikely practices treated independent one another ecosystem considering adopting new practice eg enhance stability outcome trying implement various stabilityenhancing practices likely contingent set practices already place example introducing semantic versioning signal breaking changes would make sense snapshot consistency current versions everything must compatible already enforced Complementarity side coin Certain practices may effective certain practices adopted well example centralized testing likely effective ecosystem repository strong gatekeeping mechanism norm dissuades developers using alternative repositories suspect many conflicts complementarities among practices much subtle greater insight relations among practices would helpful clarifying feasible paths achieving ecosystem goals survey data contains many starting points investigations example allowing researchers identify ecosystems various combinations values practices targets exploration
6.2 Assimilation Ecosystem Selection
survey indicates developers’ personal values usually align well values ecosystems Figure 2 operate Understanding alignment comes would help predict outcome attempted interventions design interventions likely effective least two major possibilities Developers may join ecosystems reasons unrelated values eg application domain technical characteristics exposed ecosystem values may assimilate time adapting behavior personal values experience around However alignment may come primarily valuebased selection developers join ecosystems resonate system’s values two possibilities often carry different implications interventions developers tend assimilate ecosystem’s values existing community might steered toward different practices expect developers adapt time contrast developers pick ecosystems based compatible values would likely mean substantial changes would attract new valuealigned developers risk significant disruption longterm 
contributors rebel leave one might expect degree selection assimilation understanding values practices easily adapted tend resistant change could big help designing effective interventions survey data provide insights causation provide starting points investigations combined external data approach questions took small step direction illustrate possibilities developers tend assimilate practices values around would expect values practices shared among ecosystems relatively large overlap participating developers relatively small overlap preliminary study investigated whether ecosystems share many developers26 similar practices values pairs ecosystems found sizable correlation similarity average responses ecosystem practice questions depicted Figures 3 4 overlap committers ecosystems Spearman rho 0341 p 00001 n 289 pairs ecosystems correlating average perceived ecosystem value pair ecosystems developer overlap Interestingly perceived values ecosystem seem align developer overlap rho 005 p 044 n 289 correlating average personal value pair ecosystems developer overlap number interpretations relationships possible data consistent idea practices diffuse among ecosystems large developer overlap values Future work using time series data developer overlap historic participation ecosystems would allow researchers identify specific developers moved ecosystems different similar practices values according survey data use interviews surveys data mining see behavior changed
26 To measure developer overlap assembled list packages ecosystem librariesio Cargoio LuaRockscom identified Eclipse plugins nonfork packages GitHub containing “plugin.xml” file Using authors commits packages’ GitHub projects archived Mockus [57] counted percent ecosystem’s contributors also contributed ecosystem excluded Bioconductor clear mapping GitHub repositories
6.3 Attempted Changes Broadly Adopted
Collecting cases effective ineffective past changes ecosystems help understand conditions favor broadly adopted 
changes Examples attempted policy practices changes often found surveys survey text answers contrasting ecosystems often explained practices deliberately designed Five Perl developers example described extensive centralized testing infrastructure CPAN Testers added improve quality compatibility CPAN modules Perhaps beginning results conducting new interviews surveys possible unearth many examples attempted change determine outcome second approach could identify conflicts values practices suggest ineffective changes case Rust example high value stability Figure 3a also high perception instability Figure 4b led us investigate Rust’s struggle mentioned promote practices leading stable versions libraries despite community’s eagerness innovate new features Edgar Schein’s work organizational culture recommendations 70 p 323ff changing organization include strong role models new behaviors lowering learning anxiety raising survival anxiety ie making people confident learn new practices aware community fail Elements advice visible practices ecosystems tried change values Rust example compiler team models stability practices packages might follow 80 Rust’s stability attributes packages may reduce learning anxiety making easier downstream users create stable interfaces Rust’s annual survey helps developers see others’ agreement problems stability
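The preliminary analysis in Section 6.2 relates developer overlap between pairs of ecosystems to the similarity of their average survey responses using a Spearman rank correlation. The following is a minimal sketch of that statistic in plain Python, assuming one overlap score and one similarity score per ecosystem pair. The helper names (`average_ranks`, `pearson`, `spearman_rho`) and the toy numbers are illustrative assumptions, not the study's data or code.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical values, one entry per ecosystem pair.
    developer_overlap = [0.02, 0.10, 0.05, 0.30]
    practice_similarity = [0.1, 0.4, 0.2, 0.9]
    print(spearman_rho(developer_overlap, practice_similarity))
```

A rank-based measure suits this setting because overlap percentages and averaged Likert-style responses have no reason to be linearly related; only their ordering is compared.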
::::
7 CONCLUSION managing change long important topic engineering particularly interesting context open source ecosystems since projects tend highly interdependent yet independently maintained variety practices used manage change considerable perhaps interestingly might think political dimension selection practices Whose interests served adoption one set practices rather others costs primarily effort distributed types ecosystem participants values practices actually serve attempted provide somewhat detailed description practices used three ecosystems well broader characterization 18 ecosystems believe studies scratch surface however much work remains done understanding practices fit values effective changes made address ecosystem weaknesses hope work data making publicly available contributed better understanding issues APPENDICES STUDY 1 INTERVIEW PROTOCOL following lists questions interview script ask question interviewee instead directed towards areas personal experience Given iterative approach questions script added modified earlier interviews maintainers upstream packages work plan strategy interface evolve people come depend Think recent larger change backwardcompatible impact expect would packages depend package1 Follow consider alternative ways making change1 would less impact users package1 Follow made change1 would happened differently package1 ’s future Follow position backward compatibility platform helphinder evolution decisions change1 platform mechanism alternative mechanism developers upstream dependencies work package1 there’s useful looking package claims provide functionality need decide whether adopt What’s general strategy choosing version package depend think it’s reasonable expected package change interface prefer stable stale rapidly evolving unstable dependency rate interface change often burden many dependencies give example package you’ve considered felt like stability consideration positively negatively keep changes packages depend change1 
happened upstream package1 first find ever watching development activity releases using Github notification mechanism whywhy could ideal notification system get important changes would system look like changes would notify think change1 appropriate change left alone developers experience working platform asked questions specific policies intentions consequences example questions CRAN CRAN differs repositories asks package authors notify reverse dependency packages submitting update breaks API —Was anything specific precipitated policy —Did consider options solving problem tradeoffs thought —How successful policy far generally CRAN stricter requirements authors package repositories factors CRAN team take consideration deciding quality standard worth effort instituting enforcing Bioconductor coordinated releases packages CRAN lets packages update schedule —How two repositories end different policies —What consequences two repositories —Will likely stay way CRAN makes easy install latest version package repositories let users install old versions done way • CRAN permissive expectations version number changes platforms current system sufficient considered altering policies numbering • tell something potential breaking changes handled among developers base recommended packages — developers communicate coordinate synchronize changes — work differently base recommended among ordinary packages CRAN repository B STUDY 2 SURVEY QUESTIONS transparency replicability list evaluated questions survey including exact phrasing exclude small number questions power structures community health motivation used article Part Ecosystem • Please choose ONE ecosystem publish package don’t publish packages pick ecosystem whose packages use “Software ecosystem” community people using developing packages depend using shared language platform “Package” distributable separately maintained unit ecosystems names “libraries” “modules” “crates” “cocoapods” “rocks” “goodies” we’ll use “package” 
consistency selection textfield substituted remainder survey Ecosystem Role • Check statement best describes role ecosystem — I’m founder core contributor ie language platform repository — I’m lead maintainer commonlyused package — I’m lead maintainer least one package — commit access least one package — submitted patch pull request package — used packages code scripts I’ve written • many years using way — 1 year — 1–2 years — 2–5 years — 5–10 years — 10–20 years — 20 years Ecosystem values • important think following values community personally we’ll ask separately See Section 332 11 value questions results shown Figure 2 • confident ratings values — confident — Slightly confident — Confident — confident • value community emphasizes asked describe Part II Package • following going ask experience working one particular package Please think one package contributed recently familiar haven’t contributed package name you’ve written relies packages packages may use pseudonym concerned keeping responses anonymous — text fields substituted remainder survey • submit package chose athe repository associated Choose “no” ecosystem central repository — yesno • maintained people depends package chose — yesno • package chose installed default part standard basic set packages platform tools — yesno • important values development personally See Section 332 11 value questions • OPTIONAL value important personally mentioned — text fields • often face breaking changes upstream dependencies require rework Results shown Figure 4a — Never — Less year — Several times year — Several times month — Several times week — Several times day • often make breaking changes ie changes might require endusers downstream packages change code — frequency scale aboveResults shown Figure 3a Making changes • feel constrained make many changes • potential impact users Results shown Figure 3b — Strongly agree — Somewhat agree — Neither agree disagree — Somewhat disagree — Strongly disagree — don’t know • 
know changes users want — agreementdon’t know scale • multiple breaking changes make try batch single release — agreementdon’t know scale aboveResults shown Figure 3d • release fixed schedule users aware — agreementdon’t know scale aboveResults shown Figure 3j • Releases coordinated synchronized releases packages authors — agreementdon’t know scale aboveResults shown Figure 3i • working make technical compromises maintain backward compatibility users — agreementdon’t know scale aboveResults shown Figure 3c • working often spend extra time working extra code aimed backward compatibility eg maintaining deprecated outdated methods — agreementdon’t know scale • working spend extra time backporting changes ie making similar fixes prior releases code backward compatibility — agreementdon’t know scale Releasing Packages • large part community releases updatesrevisions packages together time — agreementdon’t know scale • package meet strict standards accepted repository — agreementdon’t know scale aboveResults shown Figure 3k • packages sometimes small updates without changing version number — agreementdon’t know scale • packages version greater 100 increment leftmost digit version number change might break downstream code — agreementdon’t know scale • sometimes release small updates users without changing version number — agreement scale without “don’t know”Results shown Figure 3g • packages whose version greater 100 always increment leftmost digit change might break downstream code semantic versioning — agreement aboveResults shown Figure 3f • making change usually write explanation changed change log — agreement aboveResults shown Figure 3e • working usually communicate users performing change get feedback alert upcoming change — agreement aboveResults shown Figure 3h • making breaking change usually create migration guide explain upgrade — agreement • making breaking change usually assist one users individually upgrade eg reaching affected users submitting patchespull 
requests offering help — agreement Part IV Dependencies • last 6 months participated discussions made bugfeature requests worked development another package one packages depends — yesno • contributed code upstream dependency one packages last 6 months one you’re primary developer — yesno • often communicate developers packages depend eg participating mailing lists conferences Twitter conversations filing bug reports feature requests etc — frequency scale aboveResults shown Figure 4f dependencies packages rely way typically become aware change dependency might break package read dependency project’s internal media eg dev mailing lists general public announcements — agreement scale read dependency project’s external media eg general announcement list blog Twitter etc — agreement scale developer typically contacts personally bring change attention — agreement scale aboveResults shown Figure 4e Typically get notification tool new version dependency likely break package — agreement scale aboveResults shown Figure 4f Typically find dependency changed something breaks try build package — agreement scale aboveResults shown Figure 4g typically declare version numbers packages depends — Results shown Figure 4i specify exact version number specify range version numbers eg 3xx 21 24 specify package name always get newest version specify range name take snapshot dependencies eg shrinkwrap packrat common practice declaring version numbers dependencies — scale previous “don’t know” Using avoiding dependencies adding dependency usually significant research assess quality package maintainers relying package seems provide functionality need — agreement scale aboveResults shown Figure 4d It’s worth adding dependency adds substantial amount value — agreement scale aboveResults shown Figure 4c often choose update use latest version dependencies — agreement scale aboveResults shown Figure 4h adding dependency usually create abstraction layer ie facade wrapper shim protect internals code 
changes — agreement scale
• working often copy rewrite segments code packages package avoid creating new dependency — agreement scale
• working must expend substantial effort find versions dependencies work together — agreement scale
• OPTIONAL Compare ecosystems you’ve used heard – one features adopt name ecosystems describe features — text field
• OPTIONAL think people chose design ecosystems differently — text field
Part V Demographics motivations
• Age — 18–24 — 25–34 — 35–44 — 45–54 — 55–64 — 65
• Gender — malefemaleother
• Formal computer science educationtraining — None — Coursework — Degree
• many years contributing open source way including writing code documentation engaging discussions etc — time scale “years used ecosystem”
• many years developing maintaining — previous
• OPTIONAL anything else asked would help us better understand experience community values breaking changes tell us — text field
C SUGGESTED SET VALUES FUTURE STUDIES
propose following list values appear distinguish ecosystems derived Study 1 results plus examination ecosystem webpages modified based survey results adding values suggested survey respondents Standardization Technical Diversity Usability Social Benevolence removing one distinguish meaningfully among developers ecosystems Quality
• Stability: Backward compatibility allowing seamless updates “do break existing clients”
• Innovation: Innovation fast potentially disruptive changes
• Replicability: Longterm archival current historic versions guaranteed integrity exact behavior code replicated
• Compatibility: Protecting downstream developers endusers struggling find compatible set versions different packages
• Rapid Access: Getting package changes endusers quickly release “no delays”
• Commerce: Helping professionals build commercial
• Community: Collaboration communication among developers
• Openness Fairness: ensuring everyone community say decisionmaking community’s direction
• Curation: Selecting set consistent compatible packages cover users’ needs
• Fun personal growth: Providing good experience package developers users
• Standardization: Promote standard tools practices limiting developers choice save time effort
• Technical Diversity: Allowing developers freedom develop interact diversity ways
• Usability: Ensuring tools libraries easy developers use ensuring resulting easy endusers use
• Social Benevolence: ethical community empowering others making resources available
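The survey's semantic versioning question (packages at version 1.0.0 or above increment the leftmost digit whenever a change might break downstream code) can be written as a small check. This is a sketch under stated assumptions: the names `parse_version` and `follows_semver` are hypothetical, versions are plain `MAJOR.MINOR.PATCH` strings without pre-release or build tags, and whether a change is breaking is supplied by the caller rather than detected.

```python
def parse_version(version: str) -> tuple:
    """Split a plain 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def follows_semver(old: str, new: str, breaking: bool) -> bool:
    """Check one release step against the rule described in the survey question.

    Assumptions: the new version must be strictly greater than the old one,
    0.y.z releases may change anything (initial development), and from 1.0.0
    onward a breaking change requires a higher major number.
    """
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        return False  # versions must increase
    if old_v[0] == 0:
        return True  # pre-1.0.0: no compatibility promise
    if breaking:
        return new_v[0] > old_v[0]
    return True


if __name__ == "__main__":
    print(follows_semver("1.4.2", "2.0.0", breaking=True))
    print(follows_semver("1.4.2", "1.5.0", breaking=True))
```

Applied to a release history with known breaking changes, such a check gives a rough picture of how closely an ecosystem's stated Stability value is matched by its versioning practice.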
::::
LOCK FILE NAMES ECOSYSTEM
Ecosystem | Lock file | Notes
Atom plugins | package-lock.json, npm-shrinkwrap.json | see Node.js/NPM
CocoaPods | podfile.lock |
Eclipse plugins | N/A | function would done within project’s regular metadata files plugin.xml pom.xml could measured readily technique
Erlang/Elixir/Hex | mix.lock |
Go | GoPkg.lock, vendor | Preceding GoPkg.lock file canonical method locking dependency versions simply include snapshot source code looked vendor directory
Haskell Cabal/Hackage | cabal.config |
Haskell Stack/Stackage | cabal.config | Although possible never used since Stackage’s main distinguishing feature constrain versions set packages
Lua/Luarocks | N/A | could find evidence canonical even common practice way locking Lua versions
Maven | N/A | function would done within project’s regular metadata file pom.xml could measured readily technique
Node.js/NPM | package-lock.json, npm-shrinkwrap.json | npm lockfiles semantic differences npm-shrinkwrap intended published package-lock however found GitHub projects
NuGet | project.lock.json | NuGet blog suggests saving file repository lock dependency versions
Perl/CPAN | cpanfile.snapshot | could find evidence canonical way CPAN one recommendation thirdparty package called Carton creates snapshot file
PHP/Packagist | composer.lock |
Python/PyPi | N/A | could find evidence canonical way Pypi StackOverflow post suggested several nonstandard alternatives
R/Bioconductor | packrat.lock | canonically standard common wellknown However mostly irrelevant Bioconductor since set mutually compatible packages released unit
R/CRAN | packrat.lock | canonically standard common wellknown
Ruby/Rubygems | Gemfile.lock |
Rust/Cargo | Cargo.lock |
27 httpsdocsnpmjscomfilespackagelockjson
28 httpsblognugetorg20181217Enablerepeatablepackagerestoresusingalockfilehtml
29 httpsmetacpanorgpodCarton
30 httpsstackoverflowcomquestions8726207whatarethepythonequivalentstorubysbundlerperlscarton
ACKNOWLEDGMENTS
want thank Audris Mockus WoC University Tennessee Knoxville access WoC archive [57] data mining many people interviewed surveyed helped design promotion 
survey REFERENCES 1 Pietro Abate Roberto DiCosmo Ralf Treinen Stefano Zacchiroli 2011 MPM modular package manager Proceedings International Symposium Component Based Engineering CBSE’11 ACM Press New York 179–188 DOI httpsdoiorg10114520002292000255 2 Rabe Abdalkareem 2017 Reasons drawbacks using trivial npm packages developers’ perspective Proceedings 11th Joint Meeting Foundations Engineering ESECFSE’17 ACM New York NY 1062–1064 3 Cyrille Artho Kuniyasu Suzaki Roberto Di Cosmo Ralf Treinen Stefano Zacchiroli 2012 packages conflict IEEE International Working Conference Mining Repositories 141–150 4 Anat Bardi Shalom H Schwartz 2003 Values behavior Strength structure relations Personal Soc Psychol Bull 29 10 2003 1207–1220 5 Gabriele Bavota Gerardo Canfora Massimiliano Di Penta Rocco Oliveto Sebastiano Panichella 2015 Apache community upgrades dependencies evolutionary study Empir Softw Eng 20 5 2015 1275–1317 6 Christopher Bogart Christian Kästner James Herbsleb Ferdian Thung 2016 break API Cost negotiation community values three ecosystems Proceedings International Symposium Foundations Engineering FSE’16 ACM Press New York 7 Christopher Bogart Anna Filippova James Herbsleb Christian Kastner 2017 Culture Breaking Change Survey Values Practices 18 Open Source Ecosystems DOI httpsdoiorg101184R15108716v1 8 Shawn Bohner Robert Arnold 1996 Change Impact Analysis IEEE Computer Society Press Los Alamitos CA 9 Virginia Braun Victoria Clarke 2006 Using thematic analysis psychology Qualit Res Psychol 3 2 2006 77–101 DOI httpsdoiorg1011911478088706qp063oa 10 Brito L Xavier Hora Valente 2018 Java developers break APIs Proceedings IEEE 25th International Conference Analysis Evolution Reengineering SANER’18 255–265 11 Javier Luis Cánovas Izquierdo Jordi Cabot 2015 Enabling definition enforcement governance rules open source systems Proceedings International Conference Engineering ICSE’15 505–514 DOI httpsdoiorg101109ICSE2015184 12 Jaepil Choi Heli Wang 2007 promise managerial 
values approach corporate philanthropy J Bus Ethics 75 4 2007 345–359 13 Juliet Corbin Anselm Strauss 2014 Criteria evaluation Basics Qualitative Research Techniques Procedures Developing Grounded Theory 3rd ed Sage Publications Inc 14 Bradley E Cossette Robert J Walker 2012 Seeking ground truth retroactive study evolution migration libraries Proceedings International Symposium Foundations Engineering FSE’12 ACM Press New York 55 15 John W Creswell J David Creswell 2014 Research Design Qualitative Quantitative Mixed Methods Approaches 4th ed Sage Publications 16 Mary Crossan Daina Mazutis Gerard Seijts 2013 search virtue role virtues values character strengths ethical decision making J Bus Ethics 113 4 2013 567–581 17 Laura Dabish Colleen Stuart Jason Tsay Jim Herbsleb 2012 Social coding GitHub Transparency collaboration open repository Proceedings Conference Computer Supported Cooperative Work CSCW’12 1277–1286 18 Barthélémy Dagenais Martin P Robillard 2010 Creating evolving developer documentation Understanding decisions open source contributors Proceedings ACM International Symposium Foundations Engineering 127–136 DOI httpsdoiorg10114518822911882312 19 Cleidson R B de Souza David F Redmiles 2008 empirical study developers’ management dependencies changes Proceedings International Conference Engineering ICSE’08 20 Cleidson R B De Souza David F Redmiles 2009 roles APIs coordination collaborative development Comput Supp Coop Work 18 56 2009 445–475 DOI httpsdoiorg101007s1060600991013 21 Alexandre Decan Tom Mens Maëlick Claes Philippe Grosjean 2016 GitHub meets CRAN analysis interrepository package dependency problems Proceedings International Conference Analysis Evolution Reengineering 493–504 DOI httpsdoiorg101109SANER201612 22 Alexandre Decan Tom Mens Maëlick Claes 2017 empirical comparison dependency issues OSS packaging ecosystems Proceedings International Conference Analysis Evolution Reengineering SANER’17 23 Dedoose 2016 Version 7023 Web Application 
Managing Analyzing Presenting Qualitative Mixed Method Research Data SocioCultural Research Consultants LLC Los Angeles CA Retrieved wwwdedoosecom 24 Jim des Rivières 2005 API First Retrieved httpwwweclipseconorg2005presentationsEclipseCon2005122APIFirstpdf 25 Jim des Rivières 2007 Evolving Javabased APIs Retrieved httpswikieclipseorgEvolvingJavabasedAPIs 26 Jens Dietrich David J Pearce Jacob Stringer Kelly Blincoe 2019 Dependency versioning wild Proceedings Conference Mining Repositories MSR’19 349–359 DOI httpsdoiorg101109MSR201900061 27 Dillman Jolene Smyth Leah Melani Christian 2014 Internet Phone Mail Mixedmode Surveys Tailored Design Method John Wiley Sons 28 Alexander Eck 2018 Coordination across open source communities Findings rails ecosystem Tagungsband Multikonferenz Wirtschaftsinformatik MKWI’18 109–120 29 Stephen G Eick Todd L Graves Alan F Karr J Marron Audris Mockus 2001 code decay Assessing evidence change management data IEEE Trans Softw Eng 27 1 Jan 2001 1–12 DOI httpsdoiorg10110932895984 30 Erich Gamma Richard Helm Ralph Johnson John Vlissides 1995 Design Patterns Elements Reusable ObjectOriented AddisonWesley Boston 31 R Stuart Geiger 2017 Summary analysis 2017 GitHub open source survey CoRR abs170602777 2017 32 Gemnasium 2017 Gemnasium Retrieved 28 April 2021 httpswebarchiveorgweb20180324121439httpsgemnasiumcom 33 Mohammad Gharehyazie Baishakhi Ray Vladimir Filkov 2017 Crossproject code reuse GitHub Proceedings IEEE International Working Conference Mining Repositories 291–301 DOI httpsdoiorg101109MSR201715 34 GitHub Inc 2017 Open Source Survey 2017 Retrieved httpopensourcesurveyorg2017 4282021 35 Neighbourhoodie GmbH 2017 Greenkeeperio Retrieved 28 April 2021 httpswebarchiveorgweb20180224075015httpsgreenkeeperio 36 Johannes Henkel Amer Diwan 2005 CatchUp Capturing replaying refactorings support API evolution Proceedings International Conference Engineering ICSE’05 ACM Press New York 274–283 37 Steven Hitlin Jane Allyn Piliavin 2004 Values 
Reviving dormant concept Ann Rev Sociol 30 1 2004 359–393 38 Reid Holmes Robert J Walker 2010 Customized awareness Recommending relevant external change events Proceedings International Conference Engineering ICSE’10 ACM Press New York 465–474 DOI httpsdoiorg10114518067991806867 39 Daqing Hou Xiaojia Yao 2011 Exploring intent behind API evolution case study Proceedings Working Conference Reverse Engineering WCRE’11 IEEE Computer Society Los Alamitos CA 131–140 40 Marco Iansiti Roy Levien 2004 Keystone Advantage New Dynamics Business Ecosystems Mean Strategy Innovation Sustainability Harvard Business Press Boston 41 Javier Luis Cánovas Izquierdo Jordi Cabot 2015 Enabling definition enforcement governance rules open source systems Proceedings International Conference Engineering ICSE’15 IEEE 505–514 42 Steven J Jackson David Ribes Ayse G Buyuktur Geoffrey C Bowker 2011 Collaborative rhythm Temporal dissonance alignment collaborative scientific work Proceedings Conference Computer Supported Cooperative Work CSCW’11 245–254 43 Slinger Jansen Michael Cusumano 2013 Defining ecosystems survey platforms business network governance Ecosystems Analyzing Managing Business Networks Industry Edward Elgar Publishing 44 Puneet Kapur Brad Cossette Robert J Walker 2010 Refactoring references library migration Proceedings International Conference Objectoriented Programming Systems Languages Applications OOPSLA’10 ACM Press New York 726–738 DOI httpsdoiorg10114518694591869518 45 Smitha Keertipati Sherlock Licorish Bastin Tony Roy Savarimuthu 2016 Exploring decisionmaking processes Python Proceedings International Conference Evaluation Assessment Engineering ACM 43 46 Riivo Kikas Georgios Gousios Marlon Dumas Dietmar Pfahl 2017 Structure evolution package dependency networks Proceedings 14th International Conference Mining Repositories MSR’17 IEEE Press Piscataway NJ 102–112 47 Daniel Le Berre Pascal Rapicault 2009 Dependency management eclipse ecosystem Eclipse P2 metadata resolution 
Received August 2019 revised December 2020 accepted January 2021
::::
Code Reuse Open Source Development Quantitative Evidence Drivers Impediments March 2010 Manuel Sojer¹ Joachim Henkel¹ ² ¹Technische Universität München Schöller Chair Technology Innovation Management Arcisstr 21 D80333 Munich Germany sojerhenkelwitumde ²Center Economic Policy Research CEPR London Abstract focus existing open source OSS research individuals firms add commons public OSS code—that “giving” side open innovation process contrast research corresponding “receiving” side innovation process scarce address gap studying existing OSS code reused serves input OSS development findings based survey 686 responses OSS developers interesting results multivariate analyses developers’ code reuse behavior point developers larger personal networks within OSS community experience greater number OSS projects reuse presumably network size broad experience facilitate local search reusable artifacts Moreover find development paradigm calls releasing initial functioning version early—as “credible promise” OSS—leads increased reuse Finally identify developers’ interest tackle difficult technical challenges detrimental efficient reusebased innovation Beyond OSS discuss relevance findings companies developing receiving side open innovation processes general Keywords Innovation development open source code reuse reuse grateful Oliver Alexy Timo Fischer Stefan Haefliger Francesco Rullani seminar participants PreECIS 2009 Open Source Innovation Workshop TUMImperial Paper Development Workshop 2009 Open Source Innovation Entrepreneurship Workshop 2010 helpful comments 1 Introduction public development open source OSS1 specific instance open innovation term coined Chesbrough 2003 large body empirical work addressed “giving” side open innovation process exploring question individuals eg Ghosh et al 2002 Hars Ou 2002 Hertel et al 2003 Lakhani Wolf 2005 Henkel 2009 firms eg West 2003 Dahlander 2005 Gruber Henkel 2005 Bonaccorsi et al 2006 Henkel 2006 Rossi Lamastra 2009 make developments 
freely available others use build upon contrast research “receiving” side innovation process2 extent drivers impediments reuse existing OSS code subsequent OSS development scarce either based highlevel code dependency analyses German 2007 Mockus 2007 Spaeth et al 2007 Chang Mockus 2008 case studies von Krogh et al 2005 Haefliger et al 2008 research suggests code reuse major importance OSS development largescale quantitative study phenomenon level individual developers lacking better understanding code reuse OSS desirable also yield insights reuse beyond OSS Reuse long recognized crucial overcome “software crisis” Naur Randell 1968 allows efficient effective development higher quality Krueger 1992 Kim Stohr 1998 generally literature innovation management points knowledge reuse important factor mitigating cost innovation eg Langlois 1999 Majchrak et al 2004 Despite significant advances reuse research especially reuse commercial firms still without issues antecedents fully understood yet eg Desouza et al 2006 Sherif et al 2006 scholars suspect reuse failure often related individual developer issues eg Isoda 1995 Morisio et al 2002 However 1 better readability use term Open Source article work also refers Libre Free differs open source ideological considerations technical ones See httpwwwgnuorgphilosophyfreeswhtml information 2 Also users OSS obviously receive code however since base innovations consider “receiving” side OSS innovation process paucity especially quantitative research addressing view individual developers reuse eg Sen 1997 Ye Fischer 2005 aim fill gap regarding “receiving” side OSS innovation leverage findings augment general reuse literature adding insights regarding perspectives individual developers reuse surveybased empirical study code reuse public OSS development quantitatively assess importance code reuse one form reuse development OSS explore drivers impediments level individual developers empirical approach relies webbased survey via email 
invited 7500 developers SourceForgenet largest OSS development platform results point code reuse play major role OSS development developers reported average 30 percent functionality implemented current main projects based reused code Investigating drivers reuse multivariate analyses find developers believe effectiveness efficiency quality benefits reuse developers see reuse means work preferred development tasks rely existing code presumably larger network experience greater number projects provide access local search reusable artifacts developers larger personal networks within OSS community experience greater number OSS projects reuse Moreover find development paradigm calls releasing initial functioning version product early delivering “credible promise” leads increased reuse Finally developers’ interest tackle difficult technical challenges identified detrimental efficient reusebased innovation developers’ commitment OSS community leads increased reuse behavior remainder paper organized follows next section reviews relevant literature reuse OSS followed section presents research model hypotheses elaborate data measures present analyses results last section concludes summary discussion supplemental appendix contains tables referred paper included main body space considerations 2 Literature Review theoretical foundation paper draws two streams literature First review relevant engineering literature reuse implementation firms Second scholarly work OSS development provides context work establishing basic concepts developers contribute OSS projects summary small base scholarly work code reuse OSS development concludes literature review 21 Reuse Development reuse softwarespecific form knowledge reuse eg Langlois 1999 Majchrak et al 2004 “… process creating systems existing rather building systems scratch” Krueger 1992 p 131 artifacts commonly reused development components pieces encapsulate functionality developed specifically purpose reused snippets multiple lines 
code existing systems Krueger 1992 Kim Stohr 1998 study focuses two artifacts refer reuse “code reuse” reuse promises increased development efficiency reduced development times also improved quality better maintainability developers develop everything scratch rather rely existing proven thoroughly tested artifacts Frakes Kang 2005 Despite compelling benefits reuse still fails frequently commercial firms sometimes technical often human organizational reasons eg Morisio et al 2002 importance individual developer successful reuse undisputed Isoda 1995 p 183 instance concedes “unless engineers find benefits applying reuse … … perform reuse” Still paucity reuse research focuses individual developer Sen 1997 Ye Fischer 2005 OSS seems unique opportunity enhance knowledge role individuals successful reusebased innovation reuse particular two reasons First contrary commercial developers often restricted limited amount code available firms’ reuse repositories abundance OSS code available licenses generally permit reuse OSS projects provides OSS developers broad options reuse existing code wish Second broad scholarly knowledge motivations beliefs OSS developers helpful analyzing perspectives individual developers reuse next section establishes communitybased public OSS development empirical setting analysis 22 Open Source Development Strictly speaking OSS comes open source license license grants users right access inspect modify source code distribute modified unmodified versions it3 Since much OSS developed informal collaboration public OSS projects Crowston Scozzi 2008 term “OSS” often also understood imply developed “OSS fashion” von Krogh et al 2008 Typically development OSS projects differs strongly development traditional commercial setups Crowston et al 2009 context motivation developers spend considerable time OSS projects process OSS development particular relevance study large body literature emerged addresses first topic Common work finding OSS developers work 
projects intrinsic extrinsic reasons intrinsic motivations scholars identified identification OSS community resulting wish support Hertel et al 2003 ideological support OSS movement Stewart Gosain 2006 desire help others Hars Ou 2002 importantly fun enjoyment developers experience working projects Lakhani Wolf 2005 Based psychology research Amabile et al 1994 Sen et al 2008 differentiate fun enjoyment “flow” feelings Csíkszentmihályi 1990 developers perceive writing code satisfaction solving challenging technical problems Extrinsic motivations 3 Whether license open source license determined Open Source Initiative httpwwwopensourceorg OSS developers may derive wish enhance reputation OSS community Lakhani Wolf 2005 hone development skills Hars Ou 2002 develop adapt functionality needs Hertel et al 2003 signal skills potential employers business partners Lerner Tirole 2002 Also may paid directly OSS work example part job Ghosh et al 2002 Regarding process OSS development OSS projects often started individual developer need certain functionality yet exist Raymond 2001 initialization developer typically wants attract developers participate incentive others join offers interesting tasks also seems feasible von Krogh et al 2003 founder enhance recruitment process delivering “credible promise” Lerner Tirole 2002 p 220 describe “a critical mass code programming community react Enough work must done show doable merit” However founder prove worthy support others also developers interested joining often show possess skills required solving technical issues currently facing von Krogh et al 2003 23 Code Reuse Open Source Development scant research code reuse OSS far largescale quantitative data developer level exist Initial academic work however suggests code reuse practiced OSS projects even high level Analyzing code large number OSS projects Mockus 2007 Chang Mockus 2008 measure overlap filenames among OSS projects database 387 thousand OSS projects conclude 50 percent 
components exist one Mockus’s 2007 data even suggests code reuse popular OSS development traditional commercial closed source arena Following different approach German 2007 Spaeth et al 2007 rely dependency information available Linux distributions show packages distributions require packages reuse functionality Using case studies individual developer level rather largescale code analyses von Krogh et al 2005 Haefliger et al 2008 confirm OSS developers reuse existing code—in form components snippets—as well abstract knowledge—such algorithms methods Diving mechanics code reuse OSS Haefliger et al 2008 find OSS developers reuse code want make development work efficient lack skills implement certain functionality prefer specific development work tasks want deliver “credible promise” authors point exist equivalents components corporate reuse programs OSS repositories like SourceForgenet substitute internal reuse repositories within firms reuse frequency component serve proxy component’s quality thus substitutes certification Research Questions Hypotheses Building existing research code reuse OSS presented paper seeks use largescale quantitative data obtained survey among OSS developers answer question conditions developers prefer reusing existing code developing code scratch context following specific research questions addressed important code reuse OSS development projects OSS developers perceive benefits code reuse see issues impediments degree code reuse open source developers’ work determined characteristics first question establishes extent OSS developers reuse existing code subsequent questions explore behavior understood explained Question three addressed using regression analyses guide choice explanatory variables formulate hypotheses research model developed following section provide solid theoretical base research model builds wellestablished Theory Planned Behavior TPB Ajzen 1991 refined extended interviews literature code reuse OSS 31 Theory Planned 
Behavior Initially developed context social psychology TPB behavioral model found wide adoption various fields information systems research TPB parsimonious rather generic model explaining human behavior thus provides excellent starting point investigate code reuse one particular form behavior Research related topic study relied TPB sister model TAM Technology Acceptance Model Davis et al 1989 explain example developers’ application various development methodologies CASE tools Riemenschneider Hardgrave 2001 objectoriented development Hardgrave Johnson 2003 generally formalized development processes Riemenschneider et al 2002 Hardgrave et al 2003 Following encouraging results stream research base research model TPB TPB posits behavior determined intention predicted three factors 1 attitude toward behavior 2 subjective norms 3 perceived behavioral control Attitude formed individual’s beliefs consequences outcomes positive negative behavior Subjective norms refer pressure social environment perceived individual perform perform behavior Lastly perceived behavioral control perception individuals ability perform behavior broken individuals’ “capability” performing behavior “controllability” Ajzen 2002 individuals behavior whether decision perform behavior 32 Research Model Hypotheses Using TPB starting point research model see Figure 1 argue developers’ reuse behavior influenced attitude toward code reuse subjective norms code reuse behavioral control perceive regarding code reuse Contrary typical work relying TPB employ generic scales measure constructs cases rather operationalize unique scales single items explicitly framed OSS code reuse context second deviation typical TPB research test research model different regressions either use intention reuse dependent variable employ actual reuse behavior dependent variable Since combine intention behavior one construct rather employ one regression models stay true TPB assumption two concepts related Comparing results 
regressions different dependent variables adds robustness findings Note research model aims explaining developers’ reuse behavior without explicitly differentiating component snippet reuse conventional development component reuse typically considered blackbox reuse implying developers neither access nor modify source code components reuse Thus component reuse assumed follow different drivers whitebox reuse eg snippet reuse access source code given Ravichandran Rothenberger 2003 context OSS however also source code components available reusing developers survey data indicate 50 percent developers exercise option modify thus do not expect fundamental differences drivers component snippet reuse treat forms code reuse jointly research model Based interviews4 existing research identified five main drivers influence developers’ attitude toward code reuse since determine whether developers expect positive negative outcomes reuse drivers developers’ perceptions 1 effectiveness reuse 2 efficiency reuse 3 quality attained reuse 4 task selection benefits resulting reuse 5 potential loss control might come reuse link reuse effectiveness efficiency quality straightforward addition code reuse might result task selection benefits developers avoid certain tasks reusing existing code Haefliger et al 2008 fifth driver reuse lead control loss developer reusing code another might become dependent develop code fix bugs Since developers positive perception drivers hold positive attitude toward reuse TPB suggests rely more reusing existing code work Based logic following hypotheses derived five drivers
Developers reuse more existing code…
H1a …the more strongly perceive effectiveness benefits reuse
H1b …the more strongly perceive efficiency benefits reuse
H1c …the more strongly perceive quality benefits reuse
H1d …the more strongly perceive task selection benefits reuse
H1e …the less strongly perceive loss control risks code reuse
Since primary interest research understand individual developer characteristics influence reuse subjective
norms perceived behavioral control two parts TPB besides attitude treated control variables model controllability portion perceived behavioral 4 See next section overview interviews control operationalized six variables relating attributes Two dummy variables indicate whether exist policies supporting discouraging code reuse Four Likertscale variables capture intensity general impediments code reuse lack reusable code specific requirements developer’s conflicts license developer’s license code reused incompatibilities programming languages code reused written different language developer’s Haefliger et al 2008 programming language focal makes difficult include code foreign languages architecture developer’s modular enough allow easy reuse existing code Baldwin Clark 2006 capability portion perceived behavioral control operationalized developer’s selfreported skill level development arguing without proficiency developers able understand integrate foreign code TPB research posits attitude toward behavior subjective norms perceived behavioral control explain behavior comprehensively Ajzen 1991 stay true assumption add groups hypotheses control variables hereinafter additional groups could incorporated three original TPB groups attitude subjective norms perceived behavioral control however choose display hypotheses independent groups better illustrate ideas behind Moreover control variables shown group influence attitude subjective norms perceived behavioral control rather indirect first additional hypotheses group argue developers’ access local search leads increased code reuse Banker et al 1993 show developers reuse costs searching integrating existing code lower developing scratch costs searching integrating lower OSS developers turn experience fellow OSS developers point code need assure quality explain works best integrate Haefliger et al 2008 Consequently posit developers larger personal network OSS developers reuse code reap benefits local search H2a Similarly 
developers active OSS projects past also show increased code reuse behavior H2b Summarizing following two hypotheses derived regarding developers’ access local search Developers reuse existing code… H2a …the larger personal OSS network H2b …the greater number OSS projects involved also conjecture relationship maturity OSS code reuse behavior developers pointed literature review section OSS developers launching strive deliver “credible promise” quickly possible order attract developers’ support Code reuse excellent tool accomplish allows addition large blocks functionality new limited effort Haefliger et al 2008 code reuse help new overcome “liabilities smallness” Aldrich Auster 1986 quickly close gap established competing projects domain Lastly code reuse helpful early phases life OSS expect importance decline reached certain level maturity point implemented required basic functionality turns toward finetuning aspects make unique definition difficult reused code Thus posit less mature OSS code developers reuse H3 H3 Developers reuse existing code less mature final group hypotheses argue compatibility code reuse developers’ goals influence extent code reuse behavior important “attitudes”group model presented captures developers’ general attitude toward code reuse “compatibility”group presented following help link general attitudes developers’ work one specific follow Moore Benbasat 1991 p 195 define compatibility degree code reuse “is perceived consistent existing values needs past experiences” OSS developer focus primarily “values” “needs” “experiences” addressed H2b argumentation regarding compatibility developers’ goals reuse behavior based motivations developers participate OSS projects described earlier Sen et al 2008 show empirically developers tackling difficult technical problems main motivation work try limit number team members involved besides want solve problems without help others similar fashion developers work tackle difficult technical challenges 
reuse less existing code reuse would solve challenges H4a order able focus solving difficult technical challenges developers might well show increased reuse behavior parts control effect including developers’ perception task selection benefits reuse see H1d Also supportive argumentation DiBona et al’s 1999 p 13 description “satisfaction ultimate intellectual exercise” developers feel “after completing debugging hideously tricky piece recursive code source trouble days” seems likely reuse would reduce joy described thus developers challenge seeking major motivation reuse less existing code Related effect challenge seeking reuse also lower importance developers work pleasure experience writing code H4b Code reuse would reduce need write code thus reduce pleasure derived Hars Ou 2002 p 28 provide nice illustration argumentation quote OSS developer explaining motivation work “innate desire code code code day die” seems plausible developer feeling way coding would ceteris paribus reuse less challenge seeking one might argue developers code fun might reuse order focus enjoyable tasks However statistically controlled including developers’ perception task selection benefits reuse see H1d goal improve one’s development skills could affect reuse intensity two directions One could conjecture developers want hone skills purposefully reinvent wheel order learn done Yet argue countervailing effects dominate developers skill improvement important also reuse existing code H4c rationale based DiBona’s 2005 finding OSS developers leverage existing code starting point learning study modify improve skills also found confirmation stance interviews5 developers example told us “used code reuse way learning” pointed “reusing code snippets really help learn new programming language” Also supportive argumentation finding survey6 50 developers modify components reuse thus practice blackbox reuse get touch source code components Regarding community commitment motivation argue developers feel 
strongly committed OSS community want successful reuse more code H4d Code reuse helps developers write better faster allows make community stronger contributing last two motivations conjectured influence developers’ reuse behavior turn reputation building first within OSS community second purpose signaling skills potential commercial partners employers Regarding developers’ reputation within OSS community argue developers seeking improve reputation reuse more code H4e Code reuse make better thus create attention within OSS community also developers associated argumentation receives support Sen et al 2008 find developers OSS reputation building important prefer part successful many developers one developers less successful One could object OSS developer’s reputation grounded technical skills best proves unique (that is, not reusebased) contributions OSS community Yet argumentation refuted von Krogh et al’s 2003 finding developers need prove worthiness join making initial contributions often include reused code Furthermore Raymond’s 2001 p 24 famous saying “good programmers know write Great ones know rewrite reuse” also leans toward hypothesis developers reputation building OSS community important reuse more existing code Finally basically following argumentation posit developers want signal development skills potential employers business partners reuse more code parties outside OSS community likely become aware successful OSS projects developers H4f
5 See next section overview interviews
6 survey introduced detail next section
Summarizing posit following hypotheses addressing compatibility developers’ motivations work code reuse
Developers reuse more existing code…
H4a …the less important challenge seeking…
H4b …the less important coding fun enjoyment…
H4c …the more important skill improvement…
H4d …the more important community commitment…
H4e …the more important OSS reputation building…
H4f …the more important commercial signaling…
…is motivation work
Finally multiple additional control variables included model account
contextual differences code reuse behavior control variables encompass four groups First account characteristics size number team members technical complexity project’s position stack whether aims creating standalone executable application reusable component addition control level professionalism seriousness developers work current main including number years already involved OSS average weekly hours invest current main share functionality developed current main compared team members whether worked work professional developers Moreover account developers’ education training reuse shown determinant reuse behavior development firms previous research eg Frakes Fox 1995 Finally accommodate developers’ geographic residence continent level Subramanyam Xia 2008 shown developers different geographies prefer example different levels modularity OSS projects Following line thought geographic origin might also antecedent reuse behavior Research Design Data Measures collected data study using webbased survey developed based 12 interviews OSS developers existing literature Moreover questionnaire items questions assessed clarity fellow researchers OSS developers qualitative pretest survey asked developers experiences code reuse context current main OSS order capture high heterogeneity OSS projects developers chose largest OSS repository SourceForgenet platform selected survey participants April 2009 two rounds quantitative pretests total 2000 developers invited conducted assess quality questionnaire terms content scope language Following minor refinements based analysis pretest feedback respondents main survey took place July 2009 email sent 7500 developers SourceForgenet inviting participate survey developers selected random SourceForgenet developers active platform first half
::::
7 number years developer active OSS treated control variable not included local search hypotheses intensity experience eg measured number years rather breadth experience eg measured number projects involved conjectured facilitate better access local search consequently code reuse
8 Ten interviews conducted via phone Internetbased voice communication two others conducted via email exchange Nine voicebased interviews taped transcribed average length 49 minutes
received total 686 responses equaling response rate 9.6 percent 338 invitations could not be delivered rate similar obtained recent surveys among SourceForgenet developers eg Wu et al 2007 Sen et al 2008 Eleven responses eliminated due inconsistent corrupt entries leaving us 675 completed surveys demographic profile developers participating study see Table 1 largely consistent reported studies among OSS developers eg Lakhani Wolf 2005 Sen et al 2008 particular find no indication nonresponse bias sample overrepresent less serious OSS developers special relevance endeavor fact 92 percent 624 developers surveyed actually write code OSS projects developers not writing code cannot practice code reuse analyses focus 624 developers starting analysis data briefly assess multiitem constructs employed measure developers motivation work main items constructs adopted prior research OSS domain Hars Ou 2002 Lakhani von Hippel 2003 Roberts et al 2006 psychological motivation research Amabile et al 1994 Clary et al 1998 measured sevenpoint Likert scales “strongly disagree” “strongly agree” took several steps ensure validity reliability measures Content validity qualitatively assessed building existing OSS literature whenever possible discussions fellow OSS researchers two rounds pretests Reliability assessed via Cronbach’s alpha multiitem variable Cronbach’s alpha values exceed Straub’s 1989 rule thumb 0.8 or exceed Nunnally’s 1978 threshold 0.6 see Table A1 Appendix Convergent validity assessed factor analysis confirms items highest loading respective
intended construct loadings higher 0.5 Hair et al 2006 see Table A1 Appendix Discriminant validity demonstrated showing square root average variance extracted construct greater correlations constructs see Table A2 Appendix thus satisfying FornellLarcker criterion Fornell Larcker 1981
9 Given large number surveys among SourceForgenet developers one might suspect especially active developers platform would show signs “survey fatigue” However comparing selfreported weekly hours developers spend working main survey mean 8.8 first SourceForgenet survey ever taken Lakhani Wolf 2005 mean 7.5 mitigates concerns additional finding 69 percent developers survey worked professional developers still working professional developers average tenure 7.9 years rules out concern less skilled programmers took part survey
Table 1 Demographics Survey Participants (percentage)
Age (mean 31.8, median 30)
  1-19: 5
  20-29: 42
  30-39: 35
  40-49: 13
  50 or more: 5
Residence
  North America: 26
  South America: 5
  Europe: 54
  Asia rest world RoW: 15
Highest education level
  Nonuniversity education: 15
  Undergraduate equivalent: 35
  Graduate equivalent: 30
  PhD higher: 20
Task profile open source projects
  Includes writing code: 93
  Does not include writing code: 7
Hours spent working main OSS per week (mean 8.8, median 5)
  1-4: 48
  5-9: 19
  10-19: 21
  20 or more: 12
Size personal OSS network (mean 29.9, median 8)
  1-9: 70
  10-19: 18
  20 or more: 12
Number OSS projects ever involved (mean 3.7, median 2)
  1-4: 65
  5-9: 26
  10-14: 6
  15 or more: 3
order reduce common method bias employed several measures data collection suggested Podsakoff et al 2003 taken care formulate simple unambiguous questions survey discussing questionnaire items interview partners conducting multiple rounds pretests survey respondents assured survey introduced responses would treated strictly confidentially Moreover much survey items address motivations attitudes beliefs nature no right wrong answers estimate presence common method bias data survey completion employed Harman’s test variables model loaded onto single factor principal component factor
analysis significant amount common method bias assumed exist one factor explains large portion variance data Podsakoff et al 2003 data find maximum variance explained one factor 93 percent hint toward strong common method bias Results Discussion Following research questions presented section consists four parts first establish importance code reuse OSS development Next present perceived benefits issues reuse well impediments address question OSS developers reuse code third part presents core study form multivariate analysis code reuse behavior used test research model final fourth part discuss potential threats validity limitations study 51 Importance Code Reuse measuring code reuse focused component snippet reuse survey component reuse defined “reusing functionality external components form libraries included files Eg implementing cryptographic functionality OpenSSL functionality parse INI files external class included Please count functionalities libraries part development language C libraries” similar fashion snippet reuse defined “reusing snippets several existing lines code copied pasted external sources modified code copying pasting eg renaming variables adjusting specific library use would still considered … reuse …” Three different measures depicted Table 2 employed investigate importance code reuse First related example Cusumano Kemerer 1990 Frakes Fox 1995 asked developers indicate share functionality based reused code added current main found average nearly one third mean30 median20 functionality OSS developers added based reused code points code reuse indeed important element OSS development interpretation supported fact six percent developers surveyed report reused code Furthermore maximum share reused functionality 99 percent shows developers rely heavily code reuse see role mainly writing “gluecode” integrate various pieces reused code second measure employed selfdeveloped fouritem scale directly measure perceived importance reuse individual 
developers’ work main project 10 sevenpoint Likert scales developers indicated agreement four statements described various ways reuse “very important” mean 474 median 525 58 percent developers least “somewhat agreeing” statements important role code reuse OSS development confirmed Finally third approach using selfdeveloped fouritem scale 11 asked developers indicate intent reuse existing code future development current main results largely similar obtained using second measure perceived importance reuse past work indicating code reuse important However mean median significantly lower mean 457 median 475 previous measure finding might first indication supporting hypothesis H3 states code reuse important earlier phases OSS 10 scale developed based interviews developers research general knowledge reuse Watson Hewett 2006 also draws intention behavior scales commonly employed TAM TPB research domain example Riemenschneider et al 2002 Mellarkod et al 2007 statements scale “Reusing extremely important past work current main project” “Without reusing current main would today” “I reuse much past work current main project” “My past work current main would possible without reusing” scale explains 834 percent total variance Cronbach’s alpha 093 11 statements scale “Reusing extremely important future work current main project” “Realizing future tasks goals current main possible without reusing” “I reuse much developing current main future” “Realizing future tasks goals current main difficult without reusing” scale explains 838 percent total variance Cronbach’s alpha 094 Measure Mean Median SD Min Max Share implemented functionality based reused code 300 200 265 00 990 Importance reuse past work sevenpoint Likert scale 474 525 186 100 700 Importance reuse future work sevenpoint Likert scale 457 475 169 100 700 Measure based four single items N624 Despite prominent role code reuse consistently indicated three measures high standard deviations also reveal large heterogeneity 
developers’ code reuse behavior Developers’ individual reasons code reuse development suspected largely drive heterogeneity explored following section 52 Developers’ Reasons Code Reuse analysis developer’s reasons code reuse differentiate three sets factors First analyze benefits code reuse perceived OSS developers Second investigate drawbacks issues developers see code reuse finally address importance general impediments 12 code reuse Based interviews well existing literature identified eight distinct benefits code reuse Survey participants asked indicate agreement sevenpoint Likert scale statements regarding benefits Results displayed Figure 2 show statements received rather high shares agreement two statements highest level agreement point efficiency effects reuse followed statement pertaining effectiveness effects benefits ranks four higher agreement drops significantly compared rank three yet still quite high Ranked fourth fifth statements addressing effects reuse quality developed making stable compatible standards statement ranked eighth effects code reuse security also pertains group however receives considerably less agreement could explained fact many OSS projects develop types security major concern example games Ranked sixth seventh statements position reuse means developers select tasks preference avoid mundane jobs example “outsourcing” maintenance work original developers reused code fix bugs implement new functionality code reusing developer benefits without work 12 “general impediments” rather objective compared developers’ beliefs benefits issues may still reflect individual developer’s opinions measured asking developers Reuse benefits perceived developers developers Share agreement Share disagreement 1 Reusing helps developers realize goals tasks faster 92 3 2 Reusing allows developers spend time important tasks 91 9 3 Reusing allows developers solve difficult problems lack expertise 85 14 4 Reusing helps developers create reliable stable eg 
less bugs 74 12 5 Reusing ensures compatibility standards eg look feel GUIs 72 14 6 Reusing allows developers spend time development activities fun 67 24 7 Reusing allows developers “outsource” maintenance tasks certain parts code developers outside 60 19 8 Reusing helps developers create secure eg less vulnerabilities 57 19 Note share developers “indifferent” statements shown N624 Figure 2 Share Developers DisagreeAgree Reuse Benefits order check consistency responses construct factor scores used multivariate analyses later exploratory factor analysis carried four components explains 772 percent total variance yields good quality measures KMO 076 p00001 resulting components interpreted development efficiency ranks 1 2 quality ranks 4 5 8 task selection ranks 6 7 development effectiveness rank 3
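The reliability checks reported for the multi-item scales in this study rest on Cronbach's alpha. As a minimal sketch of how that statistic is computed, the block below uses only the Python standard library and the usual formula, alpha = k/(k-1) * (1 - sum of item variances / variance of summed scores). The function name `cronbach_alpha` and the five response rows are hypothetical illustrations, not the authors' code or survey data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Return Cronbach's alpha for one multi-item scale.

    `responses` is a list of rows; each row holds one respondent's
    answers to the k items of the scale (e.g. 1-7 Likert values).
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's summed scale score.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical answers of five respondents to a four-item Likert scale.
answers = [
    [7, 6, 7, 6],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [6, 6, 5, 6],
    [3, 2, 3, 3],
]
alpha = cronbach_alpha(answers)
print(f"Cronbach's alpha: {alpha:.3f}")
```

Values close to 1 indicate that the items of a scale vary together across respondents, which is the property the reliability thresholds cited in the text are meant to check.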
::::
13 better interpretability resulting components components Eigenvalue less 1 also extracted fourth component Eigenvalue 079 14 factor analysis uses principal component analysis Varimax rotation Cronbach’s alpha components quality development efficiency task selection 080 072 047 respectively See Table A3 Appendix detailed factor loadings Following benefits code reuse nine issues drawbacks identified interviews existing literature shown Figure 3 presented participants asked indicate agreement respective statements highest share agreement received statement pointing loss control developer may accept reusing code Statements ranked second third also relate losing control however significantly lower levels agreement statement ranked second points difficult install build use endusers due technical dependencies statement ranked third reflects developer’s obligation check integrate updates reused code 15 Ranked fourth fifth eighth significantly lower levels agreement previous statements two potential issues code reuse point quality security risks statements ranked sixth seventh ninth describe situations development scratch efficient code reuse however receive least 50 percent disagreement emphasizes developers deem searching understanding adapting reusable code inefficient Figure 3 Share Developers DisagreeAgree Reuse Issues Drawbacks 15 statements mainly refer component reuse partially applicable snippet reuse exploratory factor analysis issues drawbacks explains 690 percent total variance three components yields good quality measures KMO 072 p00001 resulting components interpreted control loss ranks 1 2 3 quality risks ranks 4 5 8 inefficiency reuse ranks 6 7 9 16 consolidate number variables multivariate model employed later factor analysis merged quality benefits quality risks one component development efficiency benefits merged inefficiency reuse 
five final components used multivariate model effectiveness benefits efficiency benefits quality benefits task selection benefits loss control risks benefits issues drawbacks code reuse subjective perceived individual developer also exist general impediments reuse general impediments resulted interviews existing literature make code reuse difficult impossible even individual developer wanted rely existing code see Figure 4 Interestingly however four statements offered surveyed developers received disagreement agreement statement “there exist reusable resources current main project” ranked first 39 percent developers agreeing Oneway ANOVA analysis used identify projects exist least reusable resources found target operating system significant influence availability reusable code p00497 Projects developed POSIX operating system systems eg Linux Windows less reusable code disposal Neither type eg “Software Development” “Scientific Engineering” “Games Entertainment” significant influence p02440 graphical user interface employed 01171 Ranked second general impediment code reuse 24 percent agreement license incompatibilities situation would occur example programmer wanted reuse code snippets licensed GPL licensed BSD license expected license developer’s main significantly influences general impediment Oneway ANOVA p00001 developers working GPL licensed projects least likely perceive issue 16 factor analysis uses principal component analysis Varimax rotation Cronbach’s alpha components control loss quality risks inefficiency reuse 066 076 085 respectively See Table A4 Appendix detailed factor loadings However low share agreement surprising Three possible explanations finding seem plausible First might exist enough reusable code license category Second developers might able mitigate license incompatibilities modular architectures clearly separate modules different licenses thus avoid contamination issues Henkel Baldwin 2009 Third developers knowledgeable 
license incompatibilities ignore potential issues Ranked third fourth 17 percent nine percent agreement respectively architecture developer’s current main modular enough allow easy integration reusable code rank 3 incompatibilities project’s main programming language programming language code developer wants reuse rank 4 significantly dependent programming language developer’s Oneway ANOVA p00036 p00001 rank 3 rank 4 respectively C Java objectoriented languages posing least issues General impediments reuse perceived developers developers 1 exist reusable resources current main 2 License issues make reusing current main difficult eg reusing GPL component would require license current main changed GPL well 3 architecture current main makes reusing difficult eg architecture current main projects modular 4 programming language current main projects makes reusing difficult eg programming language current main projects makes including popular libraries difficult Note share developers “indifferent” statements shown N624 Figure 4 Share Developers DisagreeAgree General Reuse Impediments 53 Multivariate Analysis Reuse Behavior Following descriptive analysis objective research model explain observed heterogeneity developers’ reuse behavior found earlier developer characteristics test research model three different measures reuse behavior dependent variables three different regression models order ensure robustness results 17 three models tested using Tobit regressions dependent variables restricted either 0100 17 18 summary research model hypotheses support received multivariate analyses presented Table 3 detailed regression tables containing Tobit models depicted Table 4 robustness check ran specifications three models successive elimination insignificant variables results robustness check largely consistent results main models shown Table A7 Appendix results multivariate analyses presented discussed following Hypotheses Confirmed Attitude toward 
reuse Developers reuse existing code… H1a …the strongly perceive effectiveness benefits reuse ✓ H1b …the strongly perceive efficiency benefits reuse ✓ H1c …the strongly perceive quality benefits reuse ✓ H1d …the strongly perceive task selection benefits reuse ✓ H1e …the less strongly perceive loss control risks code reuse ✗ Access local search Developers reuse existing code… H2a …the larger personal OSS network ✓ H2b …the greater number OSS projects involved ✓ maturity H3 Developers reuse existing code less mature ✓ Compatibility goals Developers reuse existing code… H4a …the less important challenge seeking motivation work ✗ H4b …the less important coding fun enjoyment motivation work ✗ H4c …the important skill improvement motivation work ✓ H4d …the important community commitment motivation work ✓ H4e …the important OSS reputation building motivation work ✗ H4f …the important commercial signaling motivation work ✗ Legend ✓ fully confirmed ✓ partially confirmed ✗ supported 17 Descriptive statistics explanatory variables depicted Table A5 Appendix correlation matrix shown Table A6 Appendix 18 contrast OLS regression Tobit model accounts censoring dependent variable present case means example share functionality reused resources cannot less zero percent larger 100 percent Table 4 Multivariate Analysis Developers’ Reuse Behavior Past importance reuse 1 Likert scale 2 Percentage scale 3 Future importance reuse Likert scale Attitude toward reuse BenefitEffectiveness H1a 0222 0076 2701 1021 0168 0063 BenefitEfficiency H1b 0653 0084 5959 1114 0517 0069 BenefitQuality H1c 0303 0081 1800 1073 0250 0067 BenefitTaskSelection H1d 0155 0078 3528 1041 0132 0064 IssueControlLoss H1e 0030 0077 0506 1036 0004 0064 Access local search DevOSSNetsize log H2a 0165 0083 2098 1102 0230 0069 DevOtherProjects H2b 0022 0016 0398 0208 0032 0013 maturity ProjPhase H3 0149 0070 3227 0928 0219 0057 Compatibility goals MotChallenge H4a 0148 0083 2559 1103 0067 0068 
MotFun H4b 0098 0080 0575 1072 0055 0066 MotLearning H4c 0003 0080 1438 1053 0015 0066 MotCommunity H4d 0177 0086 1964 1150 0148 0071 MotOSSReputation H4e 0005 0057 0128 0758 0065 0047 MotSignaling H4f 0054 0061 0336 0817 0013 0051 Subjective norms DevNorm 0140 0066 2372 0887 0197 0055 Perceived behavioral control ProjPolSupport 0440 0200 0946 2670 0297 0165 ProjPolDiscourage 1087 0457 4977 6161 1279 0383 ConditionLack 0250 0044 2317 0589 0168 0036 ConditionLicense 0065 0045 0309 0599 0018 0037 ConditionLanguage 0030 0060 0071 0802 0060 0049 ConditionArchitecture 0017 0052 0481 0698 0017 0043 DevSkill 0075 0095 0123 1270 0018 0078 control variables ProjSize 0000 0002 0021 0024 0002 0001 ProjComplexity 0131 0092 2194 1236 00190 0076 ProjStack 0210 0091 1499 1209 0135 0074 ProjStandalone 0118 0197 0233 2633 0203 0163 DevOSSExperience 0010 0018 0076 0249 0000 0015 DevProjTime 0014 0008 0039 0107 0008 0007 DevProjShare 0003 0002 0031 0033 0001 0002 DevProf 0056 0186 0214 2492 0184 0154 DevEduReuse 0127 0165 1177 2201 0266 0136 DevProfEduReuse 0603 0237 5883 3094 0378 0193 ResidenceN America 0159 0181 3310 2408 0120 0149 ResidenceS America 0236 0359 3424 4743 0013 0294 ResidenceAsia RoW 0102 0226 0764 3031 0109 0187 Constant 3026 0888 23275 1187 2545 0731 Observations 624 624 624 Pseudo R² 0107 0029 0119 Likelihood ratio Χ²3526742 p00001 Χ²3516274 p00001 Χ²3528955 p00001 σ 1790 24337 1493 Notes models Tobit models standard errors parentheses significant 10 significant 5 significant 1 Electronic copy available httpsssrncomabstract1489789 531 Attitude Toward Reuse regression results confirm hypotheses H1a H1d Developers perceive higher effectiveness efficiency quality task selection benefits code reuse attribute higher importance practice coefficients four hypotheses positive significant dependent variables specifications contrast hypothesis H1e confirmed data show developers fear lose control reuse less code surprising descriptive analysis loss control ranked main issue 
developers code reuse plausible interpretation developers’ concerns losing control affect decision code reuse affect total amount code reuse example developers concerned losing control might choose reuse components developed projects proven track record fixing bugs quickly keeping structure code stable Haefliger et al 2008 532 Access Local Search effect developers’ access local search reuse behavior captured logarithm size OSS network H2a number OSS projects involved H2b Hypothesis H2a confirmed models H2b confirmed partially coefficient significant model 1 Nonetheless coefficients positive models supporting assumption developers access evaluate understand integrate reusable code easily due local search practice code reuse finding number years developer involved OSS exhibit significant effect reuse behavior see control variable DevOSSExperience consistent argumentation regarding local search claimed developers turn personal OSS network experience OSS projects reuse better access local search greater number years involved OSS alone yet facilitate better access example developer ten years OSS work spent one access local search regarding code projects use solve particular problem 533 Maturity hypothesis developers reuse less code matured H3 confirmed across dependent variables specifications19 Developers indeed seem leverage reuse tool deliver “credible promise” early overcome liabilities newness get par competing existing projects later phases call specific refinements projects less available code reuse 534 Compatibility Goals Regarding compatibility code reuse developer’s individual goals hypothesis H4d community commitment confirmed models except model 2 H4a challenge seeking confirmed models past reuse dependent variable models 1 2 5 hypotheses coding fun enjoyment H4b skill improvement H4c OSS reputation building H4e commercial signaling H4f null hypothesis cannot rejected support hypothesis H4d highlights developers feel part OSS community want grow successful 
rely code reuse developers Code reuse compatible goal contributing OSS community leveraging code reuse contribute higher quality20 partial confirmation hypothesis H4a supports assumption developers’ goal seek tackle technical challenges impedes code reuse reusing existing code developers would denied pleasure solving problem Thus would rather refrain code reuse challenge seeking major importance OSS work finding respective coefficient significant dependent variable developers’ future intent reuse may due desire 19 Note models 1 2 4 5 past reuse behavior dependent variable amount reused code reported developers projects later development phases average reuse level including assumed high levels code reuse early phases proposed lower levels later phases However reuse goes maturity proposed also average reuse decreases lifetime 20 Moreover developers sympathetic toward OSS community might also affected general positive attitude toward reuse community eg Raymond 2001 effect however captured via subjective norms control variable developers may solve problem without external help something occur spontaneously thus difficult predict turn hypotheses supported argued similarly challenge seeking fun enjoyment developers experience writing code leads reuse less code H4b cannot confirm hypothesis fact respective coefficients negative expected positive though insignificant remaining unconfirmed hypotheses skill improvement H4c OSS reputation building H4e commercial signaling H4f partially show varying coefficient signs could contrary assumptions code reuse could supportive well detrimental goals reused code could used example improve programming skills could also hamper learning developers treat reused code black box Regarding reputation building commercial signaling expected developers make projects successful help code reuse regarded highly OSS community present better developers potential employers business partners However also possible certain situations code created 
developers without help code reuse important build OSS reputation signal skills potential employers partners situations developers would refrain code reuse reputation building signaling main motivation OSS work 535 Control Variables Due large number control variables included model point main results social norms perceived developers show consistently significant positive influence predicted TPB Consequently OSS developers feel peers appreciate reusing existing code reuse variables describing developers’ perceived behavioral control lack reusable code consistently negative significant influence reuse behavior exception one dependent variable policies discouraging reuse lead reduced code reuse policies promoting reuse found significantly increase reuse behavior three models 1 4 6 Lastly developers received training reuse companies practice significantly code reuse developers learned reuse academic education differ code reuse behavior developers reuse curriculum summarize regression analyses shed light developers’ code reuse behavior particular partially confirmed hypotheses H2 access local search H3 maturity H4a challenge seeking provide interesting findings also relevant beyond scope OSS 54 Possible threats validity limitations study following employ four generally accepted criteria validity Cook Campbell 1979 structure Construct validity internal validity statistical conclusion validity external validity Construct validity threats concern ability measure interested measuring pointed sections 4 5 measures employed study based existing measures studies interviews measures assessed clarity researchers OSS developers pretests described Furthermore multiitem constructs quantitatively gauged regards reliability convergent validity discriminant validity thus consider study possess sufficient construct validity Nonetheless potential issue whether developers able accurately estimate level code reuse questionnaire However additional verification results using objective 
measure code reuse certainly worthwhile developers pretests convinced us considerable precision estimate degree code reuse Furthermore ensure robustness findings employed three different measures code reuse survey Finally also many reuse studies rely reported reuse levels eg Frakes Fox 1995 Lee Litecky 1997 Internal validity maintaining exist alternative explanations relationships identified research model constructs also given since research model relies well established TPB included multiple control variables derived interviews OSS reuse literature potential issue approach deal component snippet reuse simultaneously component reuse OSS development equaled blackbox reuse might exist different drivers snippet reuse However find 50 surveyed developers modify components reuse argue least OSS context component reuse constitute typical blackbox reuse Consequently expect component snippet reuse influenced largely drivers addition also consider results valid regard statistical conclusions since based sample considerable size backed significance levels hypotheses well largely consistent results various model specifications various dependent variables Finally external validity threats concern generalization findings line main studies individual OSS developers drew sample SourceForgenet developers pointed chapter 4 reason believe sample representative SourceForgenet developers Thus generalization frequently researched group OSS developers feasible ensure external validity generalizing OSS developers registered platforms eg projects larger traditional developers working proprietary commercial firms would necessary replicate study settings However data well research model suggest generalization contexts yield similar results example data side find significant differences reuse behavior paid hobbyist OSS developers Regarding research model would surprising find rather general hypotheses effect network size challenge seeking work differently context proprietary development 
Conclusion paper set use quantitative data obtained survey explain understand code reuse OSS projects Contributing emerging stream scholarly work code reuse OSS present strong evidence code reuse major importance OSS development contributed success show OSS developers perceive efficiency effectiveness main benefits code reuse relevance OSS research also domains engineering receiving side open innovation processes general investigation drivers code reuse finds developers better access local search due larger personal OSS network exposure different OSS projects reuse existing code presumably costs accessing code lower developers convinced benefits code reuse efficiency effectiveness gains enhanced quality chance work preferred tasks practice developers use code reuse support goal serving OSS community Moreover developers see code reuse means kickstart new projects helps deliver “credible promise” close gap existing competing projects quickly Lastly find partial support hypothesis developers desire solve technical problems satisfaction refrain reuse thus make projects less efficient effective could academic work code reuse OSS begun merits research study addressed development reuse future work investigate development reuse OSS projects develop components primarily intended reused projects Questions relevance context developers bear reportedly large additional costs writing reusable code21 found ways mitigate Additionally pointed Haefliger et al 2008 strategies OSS developers employ make reusable code known reused deserve investigation Moreover limitations work open several research avenues First dependent variables reflect developers’ subjective perception importance code reuse OSS work alternative way potentially adding robustness findings importance reuse could captured objectively analyzing code Similarly independent variables captured data sources could added model example social network data derived SourceForgenet eg Fershtman Gandal 2009 could employed extend 
test hypotheses local search Moreover described code reuse general differentiating various forms components snippets algorithms finegrained analysis using dimensions might yield insights mechanics 21 example Tracz 1995 estimates writing reusable code leads 100 percent additional effort code reuse OSS projects Finally focused developers projects determinants code reuse future work could employ even detailed approach analyze single reuse incidents incorporating developers projects artifacts consider reuse approach could instance analyze impact quality relationship “giving” “receiving” side open innovation process code reuse Beyond scholarly implications findings also relevance managerial practice highlight high level reuse within OSS community provide motivation firms also leverage existing OSS code development thereby partly mitigating typically high upfront investment costs building internal reuse library artifacts firmspecific Frakes Kang 2005 intend pursue avenue reusing OSS code commercial firms encourage support employees enhance access local search OSS code building personal OSS networks becoming involved various OSS projects Beyond reuse OSS code modified incentives development processes based findings could support internal corporate reuse activities engineering beyond part modifications developers could provided option select tasks according preference could compensated according work results delivered based time spent work could required deliver “credible promises” new development projects Haefliger et al 2008 Lastly accommodate desire developers tackle difficult technical challenges makes reuse less could firms could consider job enrichment eg Herzberg 1968 integrate challenges developers’ work best interest firm thereby accommodating needs developer firm 22 Obviously accordance licenses OSS code However welldesigned product architectures mitigate many issues potentially arising Henkel Baldwin 2009 7 References Ajzen 1991 Theory Planned Behavior 
Organizational Behavior Human Decision Processes 50 2 pp 179211 Ajzen 2002 Constructing TpB Questionnaire Conceptual Methodological Considerations Manuscript University Massachusetts Available URL httppeopleumasseduaizenpdftpbmeasurementpdf Aldrich H E Auster 1986 Even Dwarfs Started Small Liabilities Age Size Strategic Implications Cummings L B Staw Eds Research Organizational Behavior San Francisco CA JAI Press pp 165198 Amabile TM KG Hill Hennessey EM Tighe 1994 Work Preference Inventory Assessing Intrinsic Extrinsic Motivational Orientations Journal Personality Social Psychology 66 5 pp 950967 Armitage C Conner 2001 Theory Planned Behavior British Journal Social Psychology 40 4 pp 471499 Baldwin CY KB Clark 2006 Architecture Participation Code Architecture Mitigate Free Riding Open Source Development Model Management Science 52 7 pp 11161127 Banker RD RJ Kauffman Zweig 1993 Repository Evaluation Reuse IEEE Transactions Engineering 19 4 pp 379389 Bonaccorsi Giannangeli C Rossi 2006 Entry Strategies Competing Standards Hybrid Business Models Open Source Industry Management Science 52 7 pp 10851098 Chang HFA Mockus 2008 Evaluation Source Code Copy Detection Methods FreeBSD International Working Conference Mining Repositories Leipzig Germany Chesbrough HW 2003 Open Innovation New Imperative Creating Profiting Technology Boston Harvard Business School Press Clary EG Snyder RD Ridge J Copeland AA Stukas J Haugen 1998 Understanding Assessing Motivations Volunteers Functional Approach Journal Personality Social Psychology 74 6 pp 15161530 Cook TD DT Campbell 1979 QuasiExperimentation Design Analysis Issues Field Setting Chicago IL Rand McNally Crowston K B Scozzi 2008 Bug Fixing Practices within FreeLibre Open Source Development Teams Journal Database Management 19 2 pp 130 Crowston K K Wei J Howison Wiggins 2009 FreeLibre Open Source Development Know Know 07072009 Working Paper Available URL httpflosssyreduStudyPReview20Paper070709pdf 
Csíkszentmihályi 1990 Flow Psychology Optimal Experience New York NY Harper Row Cusumano C Kemerer 1990 Quantitative Analysis US Japanese Practice Development Management Science 36 11 pp 13841406 Dahlander L 2005 Appropriation Appropriability Open Source International Journal Innovation Management 9 3 pp 259285 Davis FD RP Bagozzi RP Warshaw 1989 User Acceptance Computer Technology Comparison Two Theoretical Models Management Science 35 8 pp 9821002 Desouza KC Awazu Tiwana 2006 Four Dynamics Bringing Use Back Reuse Communications ACM 49 1 pp 96100 DiBona C 2005 Open Source Proprietary Development DiBona C Cooper Stone Eds Open Source 20 Continuing Evolution Sebastopol CA OReilly Media DiBona C J Ockerbloom Stone 1999 Introduction DiBona C Ockman Stone Eds Open Sources Voices Open Source Revolution Sebastopol CA OReilly Associates pp 117 Fershtman C N Gandal 2009 RD Spillovers Social Network Open Source 16052009 Working Paper Available URL httpwwwtauacilgandalOSSpdf Fornell C F Larcker 1981 Evaluating Structural Equation Models Unobservable Variables Measurement Error Journal Marketing Research 13 1 pp 3950 Frakes WB CJ Fox 1995 Sixteen Questions Reuse Communications ACM 38 6 pp 7587 Frakes WB K Kang 2005 Reuse Research Status Future IEEE Transactions Engineering 31 7 pp 529 536 German DM 2007 Using Distributions Understand Relationship among Free Open Source Projects 4th International Workshop Mining Repositories Minneapolis MN Ghosh RA R Glott B Krieger G Robles 2002 FreeLibre Open Source Survey Study Deliverable D18 Final Report Part IV Survey Developers Available URL httpwwwinfonomicsnlFLOSSreportFLOSSFinal4pdf Gruber J Henkel 2005 New Ventures Based Open Innovation Empirical Analysis Startup Firms Embedded Linux International Journal Technology Management 33 4 pp 354372 Haefliger G von Krogh Spaeth 2008 Code Reuse Open Source Management Science 54 1 pp 180193 Hair JF Jr RL Tataham JE Anderson W Black 2006 Multivariate Data Analysis 
Upper Saddle River NJ Pearson Prentice Hall Hardgrave BC FD Davis CK Riemenschneider 2003 Investigating Determinants Developers Intentions Follow Methodologies Journal Management Information Systems 20 1 pp 123151 Hardgrave BC RA Johnson 2003 Toward Information Systems Development Acceptance Model Case ObjectOriented Systems Development IEEE Transactions Engineering Management 50 3 pp 322336 Hars Ou 2002 Working Free Motivations Participating OpenSource Projects International Journal Electronic Commerce 6 3 pp 2539 Henkel J 2006 Selective Revealing Open Innovation Processes Case Embedded Linux Research Policy 35 7 pp 953969 Henkel J 2009 Champions Revealing Role Open Source Developers Commercial Firms Industrial Corporate Change 18 3 pp 435471 Henkel J CY Baldwin 2009 Modularity Value Appropriation Drawing Boundaries Intellectual Property March 2009 Working Paper Harvard Business School Hertel G Niedner Hermann 2003 Motivation Developers Open Source Projects InternetBased Survey Contributors Linux Kernel Research Policy 32 7 pp 11591177 Herzberg F 1968 One Time Motivate Employees Harvard Business Review 46 1 pp 5362 Isoda 1995 Experience Reuse Journal Systems 30 pp 171186 Kim YE EA Stohr 1998 Reuse Survey Research Directions Journal Management Information Systems 14 4 pp 113147 Krueger CW 1992 Reuse ACM Computer Surveys 24 2 pp 131183 Lakhani KR E von Hippel 2003 Open Source Works Free UsertoUser Assistance Research Policy 32 6 pp 923943 Lakhani KR RG Wolf 2005 Hackers Understanding Motivation Effort FreeOpen Source Projects Feller J B Fitzgerald Hissam KR Lakhani Eds Perspectives Free Open Source Cambridge MIT Press pp 322 Langlois RN 1999 Scale Scope Reuse Knowledge Dow SC PE Earl Eds Economic Organization Economic Knowledge Cheltenham UK Edward Elgar pp 239254 Lee NY CR Litecky 1997 Empirical Study Reuse Special Attention Ada Transactions Engineering 23 9 pp 537549 Lerner J J Tirole 2002 Simple Economics Open Source Journal Industrial Economics 50 2 pp 197234 
Majchrak LP Cooper OP Neece 2004 Knowledge Reuse Innovation Management Science 50 2 pp 174188 Mellarkod V R Appan DR Jones K Sherif 2007 MultiLevel Analysis Factors Affecting Developers Intention Reuse Assets Empirical Investigation Information Management 44 7 pp 613625 Mockus 2007 LargeScale Code Reuse Open Source 1st International Workshop Emerging Trends FLOSS Research Development Minneapolis MN Moore GC Benbasat 1991 Development Instrument Measure Perceptions Adopting Information Technology Innovation Information Systems Research 2 3 pp 192222 Morisio Ezran C Tully 2002 Success Failure Factors Reuse IEEE Transactions Engineering 28 4 pp 340357 Naur P B Randell 1968 Engineering Report Conference Nato Science Committee Brussels Belgium NATO Science Affairs Division Nunnally JC 1978 Psychometric Theory New York NY McGrawHill Podsakoff PM SB MacKenzie J Lee NP Podsakoff 2003 Common Method Biases Behavioral Research Critical Review Literature Recommended Remedies Journal Applied Psychology 88 5 pp 879903 Ravichandran Rothenberger 2003 Reuse Strategies Component Markets Communications ACM 46 8 pp 109114 Raymond ES 2001 Cathedral Bazaar Sebastopol CA OReilly Associates 2nd Edition Riemenschneider CK BC Hardgrave 2001 Explaining Development Tool Use Technology Acceptance Model Journal Computer Information Systems 41 4 pp 18 Riemenschneider CK BC Hardgrave FD Davis 2002 Explaining Developer Acceptance Methodologies Comparison Five Theoretical Models IEEE Transactions Engineering 28 12 pp 11351145 Roberts JA Hann SA Slaughter 2006 Understanding Motivations Participation Performance Open Source Developers Longitudinal Study Apache Projects Management Science 52 7 pp 984999 Rossi Lamastra C 2009 Innovativeness Comparison Proprietary FreeOpen Source Solutions Offered Italian SMEs RD Management 39 2 pp 153169 Sen 1997 Role Opportunism Design Reuse Process IEEE Transactions Engineering 23 7 pp 418436 Sen R C Subramaniam ML Nelson 2008 Determinants Choice Open Source License 
Journal Management Information Systems 25 3 pp 207239 Electronic copy available httpsssrncomabstract1489789 Sherif K R Appan Z Lin 2006 Ressources Incentives Adoption Systematic Reuse International Journal Information Management 26 1 pp 7080 Spaeth Stuermer Haefliger G Von Krogh 2007 Sampling Open Source Development Case Using Debian GNULinux Distribution 40th Annual Hawaii International Conference System Sciences Waikoloa HI Stewart KJ Gosain 2006 Impact Ideology Effectiveness Open Source Teams MIS Quarterly 30 2 pp 291314 Straub 1989 Validating Instruments MIS Research MIS Quarterly 13 2 pp 147169 Subramanyam R Xia 2008 FreeLibre Open Source Development Developing Developed Countries Conceptual Framework Exploratory Study Decision Support Systems 46 1 pp 173186 Tracz W 1995 Confessions Used Program Salesman Institutionalizing Reuse Reading AddisonWesley von Krogh G Spaeth Haefliger 2005 Knowledge Reuse Open Source Exploratory Study 15 Open Source Projects 38th Annual Hawaii International Conference System Sciences Big Island HI von Krogh G Spaeth Haefliger Wallin 2008 Open Source Know Know Motives Contribute April 2008 Working Paper DIME Working Papers Intellectual Property Available URL httpwwwdimeeuorgfilesactive0WP38vonKroghSpaethHaefligerWallinIPROSSpdf von Krogh G Spaeth KR Lakhani 2003 Community Joining Specialization Open Source Innovation Case Study Research Policy 32 7 pp 12171241 Watson K Hewett 2006 MultiTheoretical Model Knowledge Transfer Organizations Determinants Knowledge Contribution Knowledge Reuse Journal Management Studies 43 2 pp 141173 West J 2003 Open Open Enough Melding Proprietary Open Source Platform Strategies Research Policy 32 7 pp 12591285 Wu CG JH Gerlach CE Young 2007 Empirical Analysis Open Source Developers’ Motivations Continuance Intentions Information Management 44 3 pp 253262 Ye G Fischer 2005 ReuseConducive Development Environments Automated Engineering 12 2 pp 199235
::::
Appendix
::::
Table A1 Factor Analysis Reliability Developer Motivation Constructs

Construct / item             1      2      3      4      5      6      Cronbach’s α
1 Challenge seeking                                                    0.807
  Chal1                      0.052  0.794  0.137  0.203  0.007  0.043
  Chal2                      0.031  0.891  0.119  0.135  0.034  0.019
  Chal3                      0.020  0.794  0.075  0.172  0.026  0.026
2 Coding fun enjoyment                                                 0.746
  Fun1                       0.021  0.176  0.122  0.763  0.024  0.111
  Fun2                       0.008  0.284  0.217  0.718  0.100  0.005
  Fun3                       0.038  0.165  0.077  0.839  0.010  0.002
3 Community commitment                                                 0.640
  Com1                       0.068  0.043  0.109  0.055  0.154  0.743
  Com2                       0.138  0.112  0.010  0.027  0.099  0.691
  Com3                       0.051  0.017  0.089  0.033  0.186  0.832
4 Skill improvement                                                    0.758
  Learn1                     0.101  0.148  0.832  0.162  0.003  0.044
  Learn2                     0.192  0.120  0.831  0.159  0.027  0.058
  Learn3                     0.034  0.093  0.721  0.005  0.190  0.125
5 OSS reputation building                                              0.901
  OSSRep1                    0.253  0.004  0.053  0.035  0.892  0.098
  OSSRep2                    0.240  0.021  0.055  0.010  0.900  0.091
6 Commercial signaling                                                 0.866
  ComSig1                    0.847  0.004  0.178  0.065  0.095  0.019
  ComSig2                    0.857  0.027  0.087  0.007  0.250  0.016
  ComSig3                    0.800  0.056  0.045  0.009  0.359  0.031

Notes factor analysis uses principal component analysis Varimax rotation high factor loadings component rotated matrix indicated bold text gray shading N = 624
::::
Table A2 Discriminant Analysis Developer Motivation Constructs

Construct                    1      2      3      4      5      6
1 Challenge seeking          0.757
2 Coding fun enjoyment       0.444  0.705
3 Community commitment       0.112  0.132  0.657
4 Skill improvement          0.285  0.323  0.207  0.751
5 OSS reputation building    0.033  0.064  0.194  0.189  0.906
6 Commercial signaling       0.047  0.063  0.026  0.254  0.495  0.832

Notes diagonal bolded entries square roots average variance extracted AVE respective construct offdiagonal entries standardized correlations constructs correlation significant 10% correlation significant 5% correlation significant 1% level N = 624
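The reading of Table A2 can be sketched as a small check. This is an illustration only, not the authors' procedure: it assumes the usual Fornell-Larcker criterion (the square root of each construct's AVE should exceed its correlations with every other construct), it assumes the four-digit entries are decimals (0757 read as 0.757), and it compares magnitudes only, since any signs were lost in extraction. The names `LOWER` and `fornell_larcker_violations` are hypothetical.

```python
# Lower-triangular matrix from Table A2 (assumed decimals; signs unknown).
# Diagonal entries are square roots of AVE, off-diagonal entries are
# correlations between constructs.
LOWER = [
    [0.757],
    [0.444, 0.705],
    [0.112, 0.132, 0.657],
    [0.285, 0.323, 0.207, 0.751],
    [0.033, 0.064, 0.194, 0.189, 0.906],
    [0.047, 0.063, 0.026, 0.254, 0.495, 0.832],
]


def fornell_larcker_violations(lower):
    """Return (i, j, r) for every pair where |r| >= sqrt(AVE) of construct i."""
    violations = []
    n = len(lower)
    for i in range(n):
        sqrt_ave = lower[i][i]
        for j in range(n):
            if i == j:
                continue
            # The matrix is stored as a lower triangle, so look up (max, min).
            r = lower[max(i, j)][min(i, j)]
            if abs(r) >= sqrt_ave:
                violations.append((i, j, r))
    return violations


if __name__ == "__main__":
    print(fornell_larcker_violations(LOWER))
```

Under these assumptions the largest off-diagonal magnitude (0.495) is below the smallest diagonal entry (0.657), so no pair is flagged.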
::::
Table A3 Exploratory Factor Analysis Reuse Benefits

Item                 Rank (Figure 2)   1      2      3      4
Difficult Problem    Rank 3            0.081  0.171  0.090  0.948
Faster               Rank 1            0.181  0.793  0.001  0.326
Important            Rank 2            0.176  0.834  0.236  0.062
Fun                  Rank 6            0.021  0.414  0.743  0.021
Outs Maintenance     Rank 7            0.332  0.029  0.779  0.162
Reliable SW          Rank 4            0.840  0.278  0.130  0.031
Secure SW            Rank 8            0.872  0.124  0.113  0.090
Standard SW          Rank 5            0.739  0.002  0.097  0.237

Notes factor analysis uses principal component analysis Varimax rotation high factor loadings component rotated matrix indicated bold text gray shading N = 624
::::
Table A4 Exploratory Factor Analysis Reuse Issues Drawbacks Item Rank Figure 3 1 2 3 Finding Rank 9 0854 0089 0036 Understanding Rank 7 0876 0125 0073 Adapting Rank 6 0847 0165 0087 Quality Risks Rank 5 0156 0934 0100 Security Risks Rank 4 0088 0935 0084 Performance Loss Rank 8 0231 0451 0284 Installation Rank 2 0152 0089 0764 Dependence Rank 1 0051 0118 0785 Additional Work Rank 3 0162 0162 0707 Notes factor analysis uses principal component analysis Varimax rotation high factor loadings component rotated matrix indicated bold text gray shading loading item construct rather low however retained due good overall Cronbach’s alpha construct 076 N624 Table A5 Descriptive Statistics Explanatory Variables Used Table 6 Variable Dummy variable equal “1” if… Frequency “0” Frequency “1” ProjPolSupport Developer’s current main policy encouraging developers reuse 438 70 186 30 ProjPolDiscourage Developer’s current main policy discouraging developers reuse 606 97 18 3 ProjStandalone Developer’s current main standalone executable application component 162 26 462 74 DevProf Developer working professional developer worked professional developer firm 191 31 433 69 DevEduReuse Developer received training reuse education 412 66 212 34 DevProfEduReuse Developer received training reuse working developer firm 544 87 80 13 ResidenceNAmerica Developer resides North America 455 73 169 27 ResidenceSAmerica Developer resides South America 594 95 30 5 ResidenceAsiaRoW Developer resides Asia Africa Australia Oceania 536 86 88 14 Variable Explanation Min Max Med Mean SD BenefitEffectiveness Factor score exploratory factor analysis… developer’s perception effectiveness effects code reuse 4762 2047 0178 0 1 BenefitEfficiency …on developer’s perception efficiency effects code reuse 3568 2313 0093 0 1 BenefitQuality …on developer’s perception quality effects code reuse 3972 2909 0027 0 1 BenefitTaskSelection …on developer’s perception task selection effects code reuse 3884 3026 0033 0 1 
IssueControlLoss …on developer’s perception control loss effects code reuse 3781 2376 0065 0 1 DevOSSNetsize log Size developer’s personal OSS network logarithm 0 6217 2197 2001 1033 DevOtherProjects Number OSS projects besides current main developer ever involved 0 48 2 3617 5388 ProjPhase Development phase developer’s current main 1PreAlpha 2Alpha 3Beta 4StableProduction 5Mature 1 5 3 3221 1184 MotChallenge Index variable constructed challenge scale 1Strongly disagree… 7Strongly agree 1 7 5333 5128 1060 MotFun Index variable constructed fun scale 1Strongly disagree… 7Strongly agree 1667 7 5000 5152 1092 MotLearning Index variable constructed learning scale 1Strongly disagree… 7Strongly agree 1 7 5333 5317 1100 Electronic copy available httpsssrncomabstract1489789 Variable Description Mean SD Median N MotCommunity Index variable constructed community commitment scale 1Strongly disagree… 7Strongly agree 1 7 5667 5614 1003 MotOSSReputation Index variable constructed OSS reputation scale 1Strongly disagree… 7Strongly agree 1 7 4000 3609 1621 MotSignaling Index variable constructed signaling scale 1Strongly disagree… 7Strongly agree 1 7 4667 4312 1527 DevNorm Index variable constructed subjective norms scale 1Strongly disagree… 7Strongly agree 1 7 4000 3927 1555 ConditionLack Developer’s agreement 1Strongly disagree… 7Strongly agree to… lack reusable code impediment reuse 1 7 4 3784 1823 ConditionLicense … issues license incompatibilities impediment reuse 1 7 2 3006 1852 ConditionLanguage … issues programming language incompatibilities impediment reuse 1 7 2 2154 1401 ConditionArchitecture … issues architecture impediment reuse 1 7 2 2630 1597 DevSkill Selfassessment developer’s development skills compared average OSS developer 1Much worse… 5Much better 1 5 3 3269 0989 ProjSize Size developer’s current main number developers 1 999 2 6091 44420 ProjComplexity Complexity developer’s current main compared average SourceForgenet 1Much less complex… 5More complex 1 5 3 
2947 1029 ProjStack Position developer’s current main stack 1Very low… 5Very high 1 5 4 3333 0921 DevOSSExperience Number years developer active working OSS projects 1 40 5 5668 4709 DevProjTime Average weekly hours developer works current main 05 58 5 8775 10723 DevProjShare Share work done developer current main opposed team members 5 100 90 67436 36998 main developer Linux high number team members seems reasonable developer claims involved OSS even got started assume implies already working later became OSS point time N624 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 BenefitEffectiveness 100 2 BenefitEfficiency nm 100 3 BenefitQuality nm nm 100 4 BenefitTaskSelection nm nm nm 100 5 IssueControlLoss nm nm nm nm 100 6 DevOSSNetsize 014 011 100 7 DevOtherProjects 008 031 100 8 ProjPhase 007 010 017 017 100 9 MotChallenge 008 100 10 MotFun 008 009 008 044 100 11 MotLearning 008 016 009 012 029 032 100 12 MotCommunity 009 014 014 007 022 013 010 011 013 021 100 13 MotOSSReputation 015 010 008 013 015 009 019 019 100 14 MotSignaling 007 016 025 050 100 15 DevNorm 007 019 026 021 009 007 010 012 018 012 100 16 DevSkill 012 007 013 010 016 015 009 17 ProjPolSupport 010 009 023 019 015 010 009 019 011 012 012 100 18 ProjPolDiscourage 009 012 007 008 007 009 011 100 19 ConditionLack 008 019 008 20 ConditionLicense 010 015 21 ConditionLanguage 023 22 ConditionArchitecture 016 008 007 007 23 ProjSize 007 008 019 011 009 009 008 011 24 ProjComplexity 011 012 009 018 019 021 009 011 038 030 25 ProjStack 015 26 ProjStandalone 007 27 DevOSSExperience 013 008 026 029 029 015 012 013 025 009 28 DevProjTime 009 013 013 011 012 011 010 021 009 017 029 29 DevProjShare 021 008 022 007 30 DevEduReuse 009 011 31 DevProfEduReuse 008 32 DevProf 013 011 008 008 008 010 014 007 015 008 039 007 33 ResidenceN America 007 34 ResidenceS America 007 35 ResidenceAsia RoW 008 009 18 ProjPolDiscourage 19 ConditionLack 20 ConditionLicense 21 ConditionLanguage 22 ConditionArchitecture 23 ProjSize 
24 ProjComplexity 25 ProjStack 26 ProjStandalone 27 DevOSSExperience 28 DevProjTime 29 DevProjShare 30 DevEduReuse 31 DevProfEduReuse 32 DevProf 33 ResidenceN America 34 ResidenceS America 35 ResidenceAsia RoW 18 ProjPolDiscourage 19 ConditionLack 100 20 ConditionLicense 017 100 21 ConditionLanguage 023 024 100 22 ConditionArchitecture 024 011 033 100 23 ProjSize 008 100 24 ProjComplexity 009 018 009 016 100 25 ProjStack 013 007 007 011 100 26 ProjStandalone 0101 0 013 037 100 27 DevOSSExperience 013 007 023 008 100 28 DevProjTime 008 014 015 038 011 100 29 DevProjShare 020 008 008 017 034 010 015 100 30 DevEduReuse 007 100 31 DevProfEduReuse 009 008 100 32 DevProf 012 008 008 009 014 009 018 025 100 33 ResidenceN America 015 009 100 34 ResidenceS America 008 008 nm 100 35 ResidenceAsia RoW nm nm 100 Notes correlations p01 shown nm meaningful variables dummy variables coding characteristic scores exploratory factor analysis Table A7 Multivariate Analysis Developers Reuse Behavior – Robustness Check 4 Likert scale 5 Percentage scale 6 Future importance reuse Likert scale Attitude toward reuse BenefitEffectiveness H1a 0220 0076 2464 1010 0146 0062 BenefitEfficiency H1b 0634 0080 6047 1059 0499 0066 BenefitQuality H1c 0322 0079 2262 1048 0273 0065 BenefitTaskSelection H1d 0157 0077 3368 1026 0144 0064 IssueControlLoss H1e Access local search DevOSSNetsize log H2a 0172 0080 2307 1047 0246 0066 DevOtherProjects H2b 0030 0015 0465 0196 0034 0013 maturity ProjPhase H3 0124 0066 2984 0871 0204 0054 Compatibility goals MotChallenge H4a 2466 0962 MotFun H4b MotLearning H4c MotCommunity H4d 0180 0081 1912 1067 0163 0066 MotOSSReputation H4e MotSignaling H4f Subjective norms DevNorm 0120 0065 2133 0870 0205 0054 Perceived behavioral control ProjPolSupport 0405 0180 0335 0143 ProjPolDiscourage 1210 0447 1299 0375 ConditionLack 0236 0042 2355 0564 0160 0035 ConditionLicense ConditionLanguage ConditionArchitecture DevSkill control variables ProjSize ProjComplexity ProjStack 0232 
0083 0172 0069 ProjStandalone DevOSSExperience DevProjTime 0016 0007 DevProjShare DevProf DevEduReuse DevProfEduReuse 0573 0232 5581 3012 0414 0189 ResidenceN America ResidenceS America ResidenceAsia RoW Constant 3145 0622 34228 8393 2858 0509 Observations 624 624 624 Pseudo R² 0101 0026 0112 Likelihood ratio chi21525281 p00001 chi21214936 p00001 chi21427267 p00001 sigma 1814 24600 1514 Notes models Tobit models standard errors parentheses significant 10 significant 5 significant 1 Eliminated variables also jointly insignificant
::::
impact using trivial packages empirical case study npm PyPI Rabe Abdalkareem1 · Vinicius Oda1 · Suhaib Mujahid1 · Emad Shihab1 Published online 9 January 2020 © Springer ScienceBusiness Media LLC part Springer Nature 2020 Abstract Code reuse traditionally encouraged since enables one avoid reinventing wheel Due npm leftpad package incident trivial package led breakdown popular web applications Facebook Netflix questioned reuse Reuse trivial packages particularly prevalent platforms npm date study examines reason developers reuse trivial packages npm Therefore paper study two large platforms npm PyPI mine 500000 npm packages 38000 JavaScript applications 63000 PyPI packages 14000 Python applications study prevalence trivial packages found trivial packages common making 160 105 studied platforms performed surveys 125 developers use trivial packages understand reasons drawbacks use surveys revealed trivial packages used perceived well implemented tested pieces code However developers concerned maintaining risks breakages due extra dependencies trivial packages introduce objectively verify survey results validate cited reason drawback find contrary developers’ beliefs around 28 npm 49 PyPI trivial packages tests However trivial packages appear ‘deployment tested’ similar test usage community interest nontrivial packages hand found 184 29 studied trivial packages 20 dependencies npm PyPI respectively Keywords Trivial packages · JavaScript · Nodejs · Python · npm · PyPI · Code reuse · Empirical studies 1 Introduction Code reuse form combining related functionalities packages encouraged due fact reduce timetomarket improve quality Communicated Arie van Deursen Rabe Abdalkareem rababduencsconcordiaca Extended author information available last page article boost overall productivity Basili et al 1996 Lim 1994 Mohagheghi et al 2004 Therefore surprise platforms Nodejs encourage reuse attempt facilitate code sharing often delivered packages modules1 available package 
management platforms Node Package Manager npm Python Package Index PyPI npm 2016 Bogart et al 2016 However good news many cases code reuse negative effects leading increase maintenance costs even legal action McCamant Ernst 2003 Orsila et al 2008 Inoue et al 2012 Abdalkareem et al 2017a example incident code reuse JavaScript package called left-pad used Babel caused interruptions largest Internet sites eg Facebook Netflix Airbnb Many referred incident case ‘almost broke Internet’ Macdonald 2016 Williams 2016 incident led many heated discussions code reuse sparked David Haney’s blog post “Have Forgotten Program” Haney 2016 real reason left-pad incident npm allowed authors unpublish packages problem resolved npm Blog 2016 raised awareness broader issue taking dependencies trivial tasks easily implemented Haney 2016 previous work Abdalkareem et al 2017 defined examined trivial packages npm discovered number relevant findings Trivial JavaScript packages tend small size less complex Trivial packages prevalent making approximately 16.8% packages npm JavaScript developers generally use trivial packages since believe trivial packages provide well tested implemented code however concerned management extra dependencies addition found cases trivial JavaScript packages dependencies imposing significant overhead However one major limitation original work deep focus JavaScript npm particular Abdalkareem et al 2017 example questions existence trivial packages defined package management platforms remain Also whether perceived advantages eg trivial packages well tested disadvantages eg management additional dependencies using trivial packages generalized beyond JavaScript developers remain unanswered Hence paper extended previous work Abdalkareem et al 2017 strengthen empirical evidence use trivial packages replicating extending study Python Package Index PyPI chose examine PyPI package management platform since 1 Python one popular
general purpose programming languages 2 Python one main wellestablished package platform PyPI 3 PyPI mature package management platform existence twelve years extended study provides following key additions – extended study npm package management platform increased npm dataset 231092 501001 packages – provide definition PyPI trivial packages examine prevalence trivial packages Python ecosystem – surveyed 37 Python developers investigate reasons drawback using trivial packages PyPI package management platform – examine top main reasons drawbacks using PyPI trivial packages based developers survey (Footnote 1 In paper use term package refer library published studied package management platforms) Altogether study involves 500000 npm packages 38000 JavaScript applications 63000 PyPI packages 14000 Python applications study also contains survey results 125 JavaScript Python developers findings indicate definition trivial packages JavaScript Python developers two different package management platforms tended definition trivial packages found original paper Abdalkareem et al 2017 npm trivial packages packages ≤ 35 LOC McCabe’s cyclomatic complexity ≤ 10 also found PyPI trivial packages definition Trivial packages common popular npm PyPI management platforms 501001 npm 63912 PyPI packages dataset 16.0% 10.6% trivial packages Moreover 38807 JavaScript 14717 Python applications GitHub 26.1% 6.9% directly depend one trivial packages JavaScript Python developers differ perception trivial packages 23.9% JavaScript developers considered use trivial packages bad whereas 70.3% Python developers consider use trivial package bad practice Developers believe trivial packages provide well implementedtested code increase productivity time increase dependency overhead risk breakage applications two cited drawbacks Developers need careful trivial packages use empirical findings show many trivial packages dependencies npm 43.2% trivial packages least one dependency 18.4% trivial packages
20 dependencies PyPI 36.8% trivial packages least one dependency 2.9% 20 dependencies facilitate replicability work make dataset anonymized developer responses publicly available Abdalkareem et al 2019
::::
11 Paper Organization paper organized follows Section 2 provides background introduces datasets Section 3 presents determine trivial package Section 4 examines prevalence trivial packages use JavaScript Python applications Section 5 presents results developer surveys presenting reasons perceived drawbacks developers use trivial packages Section 6 presents quantitative validation commonly cited reason drawback using trivial packages implications findings noted Section 7 discuss related works Section 8 limitations study Section 9 present conclusions Section 10 2 Background Case Studies section provide background two studied package management platforms npm PyPI also provide overview dataset collected used rest study 21 Node Package Manager npm JavaScript used write client server side applications popularity JavaScript steadily grown thanks popular frameworks Nodejs active developer community Bogart et al 2016 Wittern et al 2016 JavaScript projects classified two main categories JavaScript packages used applications JavaScript applications used standalone Node Package Manager npm provides tools manage JavaScript packages perform study gather two datasets two sources obtain JavaScript packages npm registry applications use npm packages GitHub npm Packages Since interested examining impact ‘trivial packages’ mined latest version JavaScript packages npm September 30 2017 package obtained source code npm registry total mined 549629 packages GitHub JavaScript Applications also want examine use npm packages JavaScript applications Therefore mined JavaScript applications GitHub obtain list JavaScript applications extracted applications identified JavaScript application GHTorrent dataset Gousios et al 2014 ensure indeed obtaining JavaScript applications GitHub npm packages compare URL GitHub repositories GHTorrent URLs obtained npm packages URL GitHub also npm flagged npm package removed application list determine application uses npm packages looked ‘packagejson’ file 
specifies amongst others npm package dependencies used application Finally eliminate dummy applications may exist GitHub choose nonforked applications 100 commits 2 developers Similar filtering criteria used prior work Kalliamvakou et al 2014 total obtained 115621 JavaScript applications removing applications use npm platform left 38807 JavaScript applications 22 Python Package Index PyPI PyPI official package management platform Python programming language Python one popular programming language today mainly due strong community support versatility ie Python used many different domains game development server side applications Vasilescu et al 2015 Ray et al 2014 distinguish Python packages used Python applications standalone Python applications typically use Python packages Similar case JavaScript gather two datasets two sources perform study obtain Python packages PyPI registry applications use PyPI packages GitHub PyPI Packages collected latest versions Python packages PyPI order determine packages ‘trivial packages’ PyPI contains around 118324 packages Librariesio 2017 September 30 2017 total able obtain 116905 packages PyPI registry since packages exist anymore GitHub Python Applications examine usage ‘trivial packages’ Python applications mined Python applications hosted GitHub provided GHTorrent dataset Gousios et al 2014 followed aforementioned process used gather JavaScript applications ensure indeed obtaining Python applications GitHub PyPI package repositories nutshell compare URL GitHub repositories URLs obtained PyPI packages URL GitHub also PyPI flagged PyPI package removed application list total obtained 14717 Python applications hosted GitHub addition eliminate dummy immature Python applications may exist GitHub performed filtering steps JavaScript application choose nonforked Python applications 100 commits 2 developers
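The application filtering described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' tooling: the `Repo` record, `normalize_url`, and `select_applications` are hypothetical names, the thresholds are read as strict lower bounds (more than 100 commits, more than 2 developers) because the extracted text does not say whether they are inclusive, and `uses_registry` stands in for "has a dependency file such as package.json that lists registry packages".

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Repo:
    """Hypothetical record for one GitHub repository."""
    url: str
    is_fork: bool
    commits: int
    developers: int
    uses_registry: bool  # assumed flag: declares npm/PyPI dependencies


def normalize_url(url: str) -> str:
    """Make repository URLs comparable (case, trailing slash, '.git')."""
    cleaned = url.strip().lower().rstrip("/")
    if cleaned.endswith(".git"):
        cleaned = cleaned[: -len(".git")]
    return cleaned


def select_applications(
    repos: Iterable[Repo],
    package_repo_urls: Iterable[str],
    min_commits: int = 100,
    min_developers: int = 2,
) -> List[Repo]:
    """Keep repositories that look like real applications, not packages."""
    package_urls = {normalize_url(u) for u in package_repo_urls}
    selected = []
    for repo in repos:
        if normalize_url(repo.url) in package_urls:
            continue  # the repository is itself a published package
        if repo.is_fork:
            continue  # forks are excluded
        if repo.commits <= min_commits or repo.developers <= min_developers:
            continue  # assumed strict thresholds against dummy projects
        if not repo.uses_registry:
            continue  # the application must depend on registry packages
        selected.append(repo)
    return selected
```

For example, a repository whose URL also appears as a package's repository URL is dropped even if it passes the activity thresholds.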
::::
3 Defining Trivial Packages Although trivial package loosely defined past eg blogs Hemanth 2015 Harris 2015 want precise objective way determine trivial packages determine constitutes trivial package conducted two separate surveys one studied package management platforms npm PyPI mainly asked participants considered trivial package indicators used determine package trivial conducted two different surveys since 1 two studied package management platforms serve different programming languages 2 developers two package management platforms may different perspective consider ‘trivial packages’ package management platform npm PyPI devised online survey presented source code 16 randomly selected packages range size 4 250 JavaScriptPython lines code LOC Participants asked 1 indicate thought package trivial 2 specify indicators use determine trivial package opted limit size selected packages surveys maximum 250 JavaScriptPython LOC since want overwhelm participants review excessive amounts code asked survey participants indicate trivial packages list packages provided provided survey participants loose definition trivial package ie package contains code easily code hence worth taking extra dependency Figure 1 shows example trivial JavaScript package called isPositive simply checks number positive

    module.exports = function (n) {
      return toString.call(n) === '[object Number]' && n > 0;
    };

Fig 1 Package isPositive npm

survey questions divided three parts 1 questions participant’s development background 2 questions classification provided packages 3 questions indicators participant would use determine trivial package npm survey sent survey 22 developers colleagues familiar JavaScript development received total 12 responses also sent PyPI survey 18 developers colleagues familiar Python development received total 13 responses important note sent two surveys different groups developers make sure participants one survey biased experience participating ie first survey Participants’ Background
Experience first four columns Table 1 show background participants npm survey 12 respondents 2 undergraduate students 8 graduate students 2 professional developers Ten 12 respondents least 2 years JavaScript experience half participants developing JavaScript five years last four columns Table 1 show background participants PyPI survey 13 participants survey 9 identified graduate students 4 professional developers working industry 7 participants 5 years Python development experience 2 respondents 3 5 years 3 others 2 3 years experience finally one person less 1 year Python practice happy majority respondents wellexperienced Python Result asked participants two surveys list indicators use determine package trivial indicate packages considered trivial 12 participants JavaScript survey 11 92 state complexity code 9 75 state size code indicators use determine trivial package Also 3 20 mentioned used code comments indicators eg functionality indicate package trivial results Python survey reveal 9 69 developers use size code 9 69 use complexity code main indicators determine trivial packages Also 7 54 participants stated use source code comments determine trivial Python packages 3 23 participants mentioned indicators use identify trivial package example one participant related trivial Python package “If it’s one function” npm Experience JavaScript Developers’ position PyPI Experience python Developers’ position 1 2 Undergrad Student 2 1 1 Undergrad Student 0 2 – 3 3 Graduate Student 8 2 – 3 3 Graduate Student 9 3 – 5 1 Professional Developer 2 3 – 5 2 Professional Developer 4 5 6 – – 5 7 – – Total 12 Total 12 Total 13 Total 13 Since clear size complexity common indicators trivial packages universal measure measured JavaScript Python use two measures determine trivial packages mentioned participants could provide one indicator hence percentages sum 100 Next analyze packages marked trivial two surveys main goal analysis find values size complexity metrics indicative trivial 
packages npm Survey Responses total received 69 votes 16 packages ranked packages ascending order based size tallied votes voted packages find 79% votes consider packages less 35 lines code trivial also examine complexity packages using McCabe’s cyclomatic complexity find 84% votes marked packages total complexity value 10 lower trivial important note although provide source code packages participants did not explicitly provide size complexity packages participants, to avoid bias towards specific metrics PyPI Survey Responses received 89 votes 16 packages Similar case npm ranked packages ascending order based size tallied votes voted packages find 76.4% votes consider packages equal less 35 lines code trivial also examine complexity packages using McCabe’s cyclomatic complexity find 79.8% votes marked packages total complexity value 10 lower trivial Python package Similar npm also did not provide metric values packages avoid bias Based aforementioned findings used two indicators JavaScriptPython LOC ≤ 35 complexity ≤ 10 determine trivial packages dataset Hence define trivial JavaScriptPython packages $X_{LOC} \leq 35 \cap X_{Complexity} \leq 10$ where $X_{LOC}$ represents JavaScriptPython LOC and $X_{Complexity}$ represents McCabe’s cyclomatic complexity package $X$ Although use aforementioned measures determine trivial packages do not consider only possible way determine trivial packages
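The definition above translates directly into a predicate. The following is a minimal sketch, assuming LOC and McCabe complexity have already been measured per package by some external tool; `is_trivial` and `trivial_share` are hypothetical helper names, not part of the study's tooling.

```python
from typing import Iterable, Tuple

MAX_LOC = 35         # threshold on lines of code from the definition
MAX_COMPLEXITY = 10  # threshold on McCabe's cyclomatic complexity


def is_trivial(loc: int, complexity: int) -> bool:
    """A package is trivial when both thresholds hold (logical AND)."""
    return loc <= MAX_LOC and complexity <= MAX_COMPLEXITY


def trivial_share(packages: Iterable[Tuple[int, int]]) -> float:
    """Fraction of (loc, complexity) pairs classified as trivial."""
    measured = list(packages)
    if not measured:
        return 0.0
    return sum(is_trivial(loc, cc) for loc, cc in measured) / len(measured)
```

Both bounds are inclusive, so a package with exactly 35 LOC and complexity 10 still counts as trivial, while exceeding either bound is enough to exclude it.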
::::
4 Prevalent Trivial Packages section want know prevalent trivial packages examine prevalence two aspects first aspect package management platforms npm PyPI perspective interested knowing many packages two package management platforms trivial second aspect considers use trivial packages JavaScript Python applications identify trivial packages two datasets calculate LOC complexity npm PyPI packages LOC calculate number lines source code removing white space source code comments complexity use McCabe’s complexity since widely used industry academia Ebert Cain 2016 package removed test code since mostly interested actual source code packages identify remove test code similar prior work Gousios et al 2014 Tsay et al 2014 Zhu et al 2014 look term “test” variants ‘tests’ andor ‘TESTcode’ file names file paths calculate LOC complexity every package datasets use Understand tool SciTools httpsscitoolscom Understand source code analysis tool provides various code metrics extensively used work eg Rahman et al 2019 Castelluccio et al 2019 41 Many npm’s PyPI’s Packages Trivial npm use two measures LOC complexity determine trivial packages use quantify number trivial packages dataset dataset contained total 549629 npm packages package calculated number JavaScript code lines removed packages zero LOC removed 48628 packages eliminated npm packages zero LOC since present dummy empty packages developers publish different reasons reserve unique package name left us final number 501001 packages 501001 npm packages mined 80232 160 packages trivial packages addition examined growth trivial packages npm Figure 2 shows percentage trivial packages published npm per month see increasing trend number trivial packages published time growth trivial packages became stable around beginning 2015 Overall approximately 140 packages added every month trivial packages investigated spike around March 2016 found spike corresponds time npm disallowed unpublishing packages npm Blog 2016 addition see 
effect leftpad incident number published trivial packages investigate number published trivial npm packages leftpad incident 216309 npm packages published leftpad incident found 34750 161 trivial packages leftpad incident 284692 published found 45482 160 trivial packages PyPI PyPI dataset also interested discerning trivial packages others terms LOC complexity mined 116905 available packages PyPI platform got 116905 packages PyPI register However package PyPI could releaseddistributed different formats able process found 42242 PyPI packages platform exclusive eg windows exe mac dmg corrupted compressed gz files could analyzed process left us 74663 PyPI packages measure LOC complexity remove packages zero LOC removed another 10751 packages remove packages zero LOC since want count empty packages exist PyPI various reasons learning publish packages PyPI analysis reveals 63912 PyPI packages analyzed 6759 106 packages trivial packages PyPI package management platform examined growth trivial packages PyPI Figure 3 shows percentage trivial packages published PyPI per month time period 2011 2017 see slight increase trend publishing trivial packages PyPI platform trend starts decrease late 2013 also found approximately 11 packages added every month trivial packages also looked percentage trivial packages publish leftpad incident found 33335 PyPI package published prior leftpad incident 3717 112 trivial packages 3042 100 packages published leftpad incident trivial 42 Many Applications Depend Trivial Packages JavaScript Applications trivial packages exist npm mean actually used also examine number applications use trivial packages examine packagejson file contains dependencies application installs npm However cases application may install package use avoid counting instances parse JavaScript code examined applications use regular expressions detect required dependency statements indicates application actually uses package code2 Finally measured number packages trivial set 
packages used applications Note consider npm packages since popular package manager JavaScript packages package managers manage subset packages eg Bower 2012 manages frontendclientside frameworks libraries modules find 38807 applications dataset 10139 261 directly depend least one trivial package Python Applications Similar case JavaScript also analyzed Python applications depend trivial packages contrast JavaScript’s availability ‘packagesjson’ file analyzing Python applications presents challenges fully identify given script’s dependency set reasons described previously Section 41 statically parse source code relevant “import” like clauses along statements allow verifying packages effectively put use ie package supposed installed functionsdefinitions indeed called rather merely imported used facilitate analysis use popular snakefood httpfuriuscasnakefood tool tool generates dependency graphs Python code parsing Abstract Syntax Tree Python files analysis showed 14717 examined Python applications 1024 69 found depend one trivial PyPI package
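The JavaScript step described above (declared dependencies are only counted when the code actually requires them) can be sketched with regular expressions. This is an illustrative approximation, not the authors' scripts: the patterns, function names, and the example package names are assumptions, and a regex scan will miss dynamic `require` calls and may match text inside comments.

```python
import re

# Assumed patterns for the two common ways a JavaScript file pulls in a package.
REQUIRE_RE = re.compile(r"""\brequire\(\s*['"]([^'"]+)['"]\s*\)""")
IMPORT_RE = re.compile(r"""\bimport\s+(?:[^'"]*?\s+from\s+)?['"]([^'"]+)['"]""")


def package_name(specifier):
    """Map a module specifier to a package name, or None for local files."""
    if specifier.startswith((".", "/")):
        return None  # relative or absolute path, not an installed package
    parts = specifier.split("/")
    if specifier.startswith("@"):
        # Scoped packages are named "@scope/name".
        return "/".join(parts[:2]) if len(parts) >= 2 else None
    return parts[0]


def used_packages(js_source):
    """Collect package names referenced by require()/import statements."""
    names = set()
    for pattern in (REQUIRE_RE, IMPORT_RE):
        for specifier in pattern.findall(js_source):
            name = package_name(specifier)
            if name:
                names.add(name)
    return names


def used_trivial_dependencies(js_source, declared, trivial):
    """Trivial packages that are both declared and actually required."""
    return used_packages(js_source) & set(declared) & set(trivial)


if __name__ == "__main__":
    source = (
        'const pad = require("left-pad");\n'
        'import thing from "@scope/pkg/lib";\n'
        'var util = require("./util");\n'
    )
    declared = ["left-pad", "lodash", "@scope/pkg"]
    trivial = {"left-pad", "is-odd"}
    print(used_trivial_dependencies(source, declared, trivial))
```

Intersecting with the declared list mirrors the text's point that a package listed in the dependency file but never required should not be counted as used.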
::::
5 Survey Results surveyed developers understand reasons drawbacks using trivial packages used survey allows us obtain firsthand information developers use trivial packages order select relevant participants sent survey developers use trivial packages used Git’s pickaxe command lines contain required dependency statements JavaScript Python applications helped us identify name email developer introduced trivial package dependency 2Note package required application exist break application Survey Participants mitigate possibility introducing misunderstood misleading questions initially sent survey two developers incorporated minor suggestions improve survey npm participants sent survey 1055 JavaScript developers 1696 applications select developers ranked based number trivial packages use took sample 600 developers use trivial packages another 600 indicated least use trivial packages survey emailed 1200 selected developers however since emails returned various reasons eg email account exist anymore etc could reach 1055 developers also sent survey Python developers filtering invalid duplicated developers’ emails successfully sent survey 460 Python developers introduce trivial Python packages PyPI 1024 Python applications dataset designed survey using Google Forms survey listed trivial package application detected trivial package total received 125 developer responses First received 88 responses survey JavaScript developers translates response rate 83 survey response rate higher typical 5 response rate reported questionnairebased engineering surveys Singer et al 2008 left part Table 2 show JavaScript experience position developers majority 67 respondents 5 years experience 14 35 years 7 13 years experience position survey respondents 88 respondents 83 identified developers working either industry 68 full time independent developers 15 remaining 5 identified casual developers 2 3 including one student two developers working executive positions npm Second received 37 survey 
responses Python developers yielding response rate 8.04% accordance supposedly observed studies engineering domain Singer et al 2008 right part Table 2 shows Python experience position developers vast majority respondents 92% identified five years Python development experiences 3 respondents identified development experience Python range 3 five years Regarding current position survey respondents 27 respondents refer developers working industry 4 developers identified full time independent developers rest respondents identified casual developers (1) other (5) including researchers students
Table 2 Experience position survey respondents npm PyPI
npm Experience JavaScript | # | Developers’ Position | # | PyPI Experience Python | # | Developers’ Position | #
1–3 years | 7 | Industrial | 68 | 1–3 years | 0 | Industrial | 27
3–5 years | 14 | Independent | 15 | 3–5 years | 3 | Independent | 4
>5 years | 67 | Casual | 2 | >5 years | 34 | Casual | 1
– | – | Other | 3 | – | – | Other | 5
Total | 88 | Total | 88 | Total | 37 | Total | 37
fact respondents experienced JavaScript Python developers gives us confidence survey responses 5.1 Developers Consider Trivial Packages Harmful first question survey participants “Do consider use trivial packages bad practice” reason ask question bluntly allows us gauge deterministic way developers felt issue using trivial packages provided three possible replies Yes case provided text box elaborate Figure 4 shows distribution responses JavaScript Python developers 88 JavaScript participants 51 (57.9%) stated consider use trivial packages bad practice Another 21 (23.9%) stated indeed think using trivial package bad practice remaining 16 (18.2%) stated really depends circumstances time available critical piece code package used thoroughly tested Contrary case JavaScript 26 (70.3%) Python developers responded survey generally consider use trivial packages bad practice 3 (8.1%) survey participants stated think using trivial package bad practice remaining 8 (21.6%) indicate really depends circumstances example PPyPI 3 states “If language doesn’t provide common inherently useful functionality fixing oversight use thirdparty library
reasonable Moreover little functionality actually ‘trivial’ may short implement likely mistake introduce bug program surely mistake something ‘nontrivial’” Fig 4 Developer responses question “is using trivial package bad” JavaScript developers answered whereas Python developers answered yes 5.2 Developers Use Trivial Packages answered question whether developers say using trivial packages bad practice interested developers resort using trivial packages view drawbacks using trivial packages Therefore second part survey asks participants list reasons resort using trivial packages ensure bias responses developers answer fields questions freeform text ie predetermined suggestions provided analyze separately responses two surveys JavaScript Python gathering responses grouped categorized responses twophase iterative process first phase two authors carefully read participant’s answers independently came number categories responses fell Next discussed groupings agreed extracted categories Whenever failed agree category third author asked help break tie categories decided two authors went answers independently classified respective categories majority cases two authors agreed categories classifications responses measure agreement two authors used Cohen’s Kappa coefficient Cohen 1960 Cohen’s Kappa coefficient used evaluate interrater agreement levels categorical scales provides proportion agreement corrected chance resulting coefficient scaled range −1 +1 negative value means less chance agreement zero indicates exactly chance agreement positive value indicates better chance agreement Fleiss Cohen 1973 categorization level agreement measured authors 0.90 0.83 npm survey PyPI survey respectively considered excellent interrater agreement Table 3 shows reasons using trivial packages reported respondents JavaScript Python surveys see table two cited reasons ie wellimplemented tests increased productivity npm PyPI package management platforms However comes 3 less common reasons slight difference npm PyPI notably reason trivial packages provide better performance evident survey
Table 3 Reasons using trivial packages npm PyPI
Reason | Description | npm # Resp (%) | PyPI # Resp (%)
Wellimplemented tested | Participants state trivial packages effectively implemented tested | 48 (54.6%) | 20 (54.1%)
Increased productivity | Trivial packages reduce time needed implement existing source code | 42 (47.7%) | 12 (32.4%)
Wellmaintained code | eases source code maintenance since developers maintain trivial package | 8 (9.1%) | 2 (5.4%)
Improved readability reduced complexity | Using trivial packages improve source code quality terms readability reduce complexity | 8 (9.1%) | 5 (13.5%)
Better performance | Trivial packages improve performance web applications compared use large frameworks | 3 (3.4%) | 0 (0.0%)
No reason | – | 7 (8.0%) | 7 (18.9%)
Next discuss reasons presented Table 3 detail R1 Wellimplemented tested cited reason using trivial packages provide well implemented tested code half responses mentioned reason 54.6% 54.1% responses JavaScript Python respectively particular although may easy developers code trivial packages difficult make sure details addressed eg one needs carefully consider edge cases example responses mention issues stated participants Pnpm 68 Pnpm 4 PPyPI 5 cite reasons using trivial packages follows Pnpm 68 “Tests already written lot edge cases captured ” Pnpm 4 “There may elegantefficientcorrectcrossenvironmentcomplatable solution trivial problem yours” PPyPI 5 “They covered extra cases would thought initially” R2 Increased productivity second cited reason improved productivity using trivial packages enables 47.7% 32.4% JavaScript Python respectively Trivial tasks writing code requires time effort hence many developers view use trivial packages way boost productivity particular early developer want worry small details would rather focus efforts implementing difficult tasks example participants Pnpm 13 Pnpm 27 JavaScript survey state Pnpm 13 “ save time think best implement even simple things” Pnpm 27 “Don’t reinvent wheel task done before” Another example Python survey participant
PPyPI 17 states “Often write code package reusable module don’t write later point whether module authored someone else mostly irrelevant What’s relevant get avoid repeatedly implementing functionality new project” aforementioned clear examples developers would rather code something even trivial course comes cost discuss later R3 Wellmaintained code less common 91 54 responses JavaScript Python cited reason using trivial packages fact maintenance code need performed developers essence outsourced community contributors trivial packages example participants Pnpm 45 PPyPI 1 states Pnpm 45 “Also highly used trivial package probable well maintained” PPyPI 1 “The simple advantages may trivial used many people therefore potentially maintained developers” Even tasks bug fixes dealt contributors trivial packages attractive users trivial packages reported participant Pnpm 80 “ leveraging feedback larger community fix bugs etc” R4 Improved readability reduced complexity Participants also reported using trivial packages improves readability reduces complexity code 91 13 responses two package management platforms example Pnpm 34 states “immediate clarity use readability developers commonly used packages” Pnpm 47 states “Simple abstract brings less complexity” Python developers report advantage using trivial packages example PPyPI 5 states “Code clarity many two liners become one liners saves space whole point batteries included mentally” R5 Better performance JavaScript participants 34 stated using trivial packages improves performance since alleviates need application depend large frameworks Notably load time trivial packages compared larger JavaScript packages small speeds overall load time applications example Pnpm 35 states “ depend huge utility library need part” JavaScript developers reported trivial packages improve performance Python developers report claim One explanation JavaScript used develop frontend applications often sensitive performance ie load time whereas 
Python used implement applications wide variety domains Overall developer responses show different perception using trivial package among developers two package management platforms small percentage 8.0% respondents JavaScript stated see reason use trivial packages However Python developers 18.9% respondents believe advantages using trivial packages 5.3 Drawbacks Using Trivial Packages addition knowing reasons developers resort trivial packages wanted understand side coin perceive drawbacks decision use packages drawbacks question part survey followed aforementioned process analyze survey responses case drawbacks Cohen’s Kappa agreement measure 0.86 0.91 npm PyPI respectively considered excellent agreement Table 4 lists drawback mentioned survey respondents along brief description frequency drawback see table top two cited drawbacks ie dependency overhead breakage applications npm PyPI However less cited drawbacks npm developers cited performance development slow missed learning opportunities next set drawbacks whereas PyPI developers consider security development slow decreased performance next set drawbacks worth noting however little difference individual drawbacks eg security vs development slow within two package management platforms ie npm PyPI
Table 4 Drawback using trivial packages npm PyPI
Drawback | Description | npm # (%) | Python # (%)
Dependency overhead | Using trivial packages results dependency mess hard update maintain | 49 (55.7%) | 25 (67.6%)
Breakage applications | Depending trivial package could cause application break package becomes unavailable breaking update | 16 (18.2%) | 12 (32.4%)
Decreased performance | Trivial packages decrease performance applications includes time install build application | 14 (15.9%) | 3 (8.1%)
Slows development | Finding relevant high quality trivial package challenging time consuming task | 11 (12.5%) | 4 (10.8%)
Missed learning opportunities | practice using trivial packages leads developers learning experiencing writing code trivial tasks | 8 (9.1%) | 0 (0%)
Security | Using trivial packages open door security vulnerability | 7 (8.0%) | 5 (13.5%)
Licensing issues | Using trivial packages could cause licensing conflicts | 3 (3.4%) | 2 (5.4%)
No drawbacks | – | 7 (8.0%) | 3 (8.1%)
Next discuss drawbacks detail D1 Dependency overhead cited drawback using trivial packages increased dependency overhead eg keeping dependencies date dealing complex dependency chains developers need bear Bogart et al 2016 Mirhosseini Parnin 2017 situation often referred ‘dependency hell’ especially trivial packages additional dependencies drawback came clearly many comments account 55.7% responses from JavaScript developers example Pnpm 41 states “ people don’t actively manage dependency versions could exposed serious problems ” Pnpm 40 “Hard maintain lot tiny packages” Python developers percentage responses related dependency overhead high 67.6% well example responses Python developers mention issues stated participants PPyPI 2 PPyPI 4 PPyPI 13 state PPyPI 2 “it’s difficult distribute something dependency doesn’t come Python” PPyPI 4 “Lots brittle dependencies” PPyPI 13 “When projects consist lot trivial modules becomes almost impossible track update time might forget even do” Hence trivial packages may provide wellimplementedtested code improve productivity developers clearly aware management additional dependencies something need deal D2 Breakage applications Developers also worry potential breakage application due specific package version becoming unavailable JavaScript developers stated issue 18.2% responses percentage 32.4% Python developers example leftpad issue main reason breakage removal leftpad Pnpm 4 states “Obviously whole ‘leftpad crash’ exposed issue” PPyPI 22 states “potential breaking NPM leftpad situation” However since incident npm disabled possibility package removed npm Blog 2016 Although disallowing removal solves part problem packages still updated may break application issue clear one responses PPyPI 7 stated “Potential breaking changes version version” nontrivial package may worth take risk however trivial packages may
worth taking risk D3 Decreased performance issue related dependency overhead drawback Developers mentioned incurring additional dependencies slowed build run time increased application installation times 159 81 example Pnpm 64 states “Too many metadata download store real code” Pnpm 34 states “ slow installs make noisy unintuitive attempting cobble together many disparate pieces instead targeted code” Another Python developer PPyPI 1 states “If modules ubiquitous needing dependency real drag one install Also job done may run much faster easier understand mentioned earlier cases fact trivial package adds dependency cases trivial package depends additional packages negatively impacts performance even D4 Slows development cases use trivial packages may actually reverse effect slow development 125 108 responses JavaScript Python developers example Pnpm 23 Pnpm 15 state Pnpm 23 “Can actually slow team matter trivial package developer hasn’t required read docs order double check rather reading lines source” Pnpm 15 “ problem locating packages useful “trustworthy” ” difficult find relevant trustworthy package Even others try build code much difficult go fetch package learn rather read lines code Python developers also agree issue example PPyPI 15 states “If finding reading understanding documentation module takes longer reading implementation hiding functionality thirdpart trivial modules obscures source base” D5 Missed learning opportunities certain cases reported JavaScript developers 91 use trivial packages seen missed learning opportunity developers example Pnpm 24 states “Sometimes people forget things could lead lack control knowledge languagetechnology using” clear example using package rather coding solution lead less knowledge code base contrast JavaScript developers Python developers seem worried issue since use trivial packages common within Python developer community JavaScript developers D6 Security cases trivial packages may security flaws make application 
vulnerable issue pointed developers 80 135 example Pnpm 15 mentioned earlier difficult find packages trustworthy Also Pnpm 57 mentions “If depend public trivial packages careful selecting packages security reasons” PPyPI 3 states “more dependencies greater likelihood knowing code actually works lower level security issues” case dependency one takes always chance security vulnerability could exposed one packages D7 Licensing issues 34 cases responses 34 54 JavaScript Python developers concerned potential licensing conflicts trivial packages may cause example Pnpm 73 states “ possibly licenseissues” Pnpm 62 “ risk ‘trivial’ package might licensed GPL must replaced anyway prior shipping” PPyPI 23 also mentions “Can licensing hell” general observe similar concerns regarding use trivial packages two managements platforms studied also approximately 8 responses package management platforms stated see drawbacks using trivial packages
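The inter-rater agreement scores reported in Sections 5.2 and 5.3 rely on Cohen's Kappa, which corrects the observed agreement for agreement expected by chance. The sketch below shows the standard two-rater formula; the function name and the toy labels are illustrative assumptions, not data from the survey.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical category labels assigned by two raters to four answers.
    rater_1 = ["R1", "R1", "R2", "R2"]
    rater_2 = ["R1", "R2", "R2", "R2"]
    print(cohens_kappa(rater_1, rater_2))
```

In this toy case the raters agree on three of four items (0.75) while chance agreement is 0.5, which gives a Kappa of 0.5, well below the values reported for the two surveys.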
::::
6 Putting Developer Perceptions Microscope developer surveys provided us valuable insights developers use trivial packages perceive drawbacks Whether empirical evidence support perceptions remains unexplored Thus examine commonly cited reason using trivial packages ie developers’ belief trivial packages well tested drawback ie impact additional dependencies based findings Section 5 6.1 Examining ‘Well Tested’ Perception shown Table 3 half responses studied package management platforms indicate use trivial packages developers believe well implemented tested However really case trivial packages really well tested section want examine whether belief grounds 6.1.1 Node Package Manager npm npm requires developers provide test script name submission packages listed package.json file fact 73.7% (59110 of 80232) trivial packages dataset test script name listed However since developers provide script name field difficult know package actually tested examine whether npm package really well tested implemented two aspects first check package tests written Second since many cases developers consider packages ‘deployment tested’ means trivial packages used many developers also consider usage package indicator well tested implemented Zambonini 2011 carefully examine whether package really well tested implemented use npm online search tool known npms Cruz Duarte 2017 measure various metrics related well packages tested used valued provide ranking packages npms mines calculates number metrics based development eg tests usage eg downloads data use three metrics measured npms validate ‘well tested implemented’ perception developers 1 Tests considers tests’ size coverage percentage build status looked npms source code found Tests metric calculated $\text{testsSize} \times 0.6 + \text{buildStatus} \times 0.25 + \text{coveragePercentage} \times 0.15$ use Tests metric determine package tested trivial packages compare nontrivial packages terms well tested One example motivates us investigate well tested trivial package
response Pnpm 68 says “Tests already written lot edge cases captured ” 2 Community interest evaluates community interest packages using number stars GitHub npm forks subscribers contributors find source code npms Community interest simply sum aforementioned metrics measured $\text{starsCount} + \text{forksCount} + \text{subscribersCount} + \text{contributorsCount}$ use metric compare interested community trivial nontrivial packages measure community interest since developers view importance trivial packages evidence quality stated Pnpm 56 says “ Using isolated module welltested vetted large community helps mitigate chance small bugs creeping in” 3 Download count measures mean downloads last three months number downloads package often viewed indicator package’s quality Pnpm 61 mentions “this code tested used many makes trustful reliable” initial step calculate number trivial packages Tests value greater zero means trivial packages tests find 284 trivial packages tests ie Tests value 0 addition compare values Tests Community interest Download count Trivial nonTrivial packages focus values aforementioned metric values trivial packages however also present results nontrivial packages put results context Figure 5 shows beanplots Tests Community interest Download count cases trivial packages median smaller Community interest value Download count compared nontrivial packages except Tests value Fig 5a shows Tests metric trivial packages median similar value nontrivial packages said observe Fig 5a distribution Tests metric similar trivial nontrivial packages packages Tests value zero small pockets packages values approx 0.30 0.6 0.9 1.0 [Footnote 3: important note motivation full derivation eg put weight 0.15 test coverage etc metrics beyond scope paper refer interested readers npms documentation details Cruz Duarte 2017 make paper selfsufficient include metrics calculated] case Community interest Download count metrics see similar distributions although clearly median values lower trivial packages examine
whether difference metric values trivial nontrivial packages statistically significant performed MannWhitney test compare two distributions determine difference statistically significant p-value < 0.05 also use Cliff’s Delta $d$ nonparametric effect size measure interpret effect size trivial nontrivial packages suggested Grissom Kim 2005 interpret effect size value small $d < 0.33$ positive well negative values medium $0.33 \leq d < 0.474$ large $d \geq 0.474$ Table 5 shows pvalues effect size values observe cases differences statistically significant however effect size small results show although majority trivial packages tests written statistically lower Community interest Download count values effect size smaller nontrivial packages
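The effect-size step above can be illustrated with a direct implementation of Cliff's Delta and the three interpretation bands quoted in the text. This is a minimal sketch with hypothetical function names and toy data; it uses the simple pairwise definition, which is quadratic in the sample sizes and would need a rank-based variant for datasets as large as the one in the study.

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def interpret_effect_size(delta):
    """Bands as quoted in the text, applied to the magnitude of delta.

    The text lists only small / medium / large; any finer category
    (such as 'negligible') is not defined there and is not modelled here.
    """
    magnitude = abs(delta)
    if magnitude < 0.33:
        return "small"
    if magnitude < 0.474:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Toy download counts, not data from the study.
    trivial = [1, 2, 2, 3]
    non_trivial = [2, 3, 4, 5]
    delta = cliffs_delta(trivial, non_trivial)
    print(delta, interpret_effect_size(delta))
```

A negative value means the first sample tends to have lower values than the second, which is why the bands are applied to the absolute value.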
::::
612 Python Package Index PyPI Since PyPI collect metadata show Python package tested use data sources examine well tested perception use two ways examine whether Python packages tested 1 use source code packages hosted GitHub 2 relied information Python packages Metrics pvalue Tests 22e16 0222 small Community interest 22e16 0225 small Downloads count 22e16 0261 small collected open source service librariesio httpslibrariesio librariesio monitors collects metadata open source packages across 36 different package management platforms falls CCBYSA 40 licenses used research work eg Decan et al 2018a b obtain extracted metadata information related PyPI package management examine testing perception three complementary ways 1 Tests examine package test code written Since standard way determine Python application tests eg exist 100 Python testing tools httpswikipythonorgmoinPythonTestingToolsTaxonomy manually investigate whether PyPI package contains test code written idea developers writes tests put tests package repository One example motivated us look test code package developer response PPyPI 11 stated “Shorter code overall welltested code fundamental tasks helps smooth language nits” Since heavily manual process decide examine representative sample packages Therefore take statistically significant sample 6759 Python packages identify trivial Python packages Section 41 sample size selected randomly attain 5 confidence interval 95 confidence level sampling process result 364 PyPI trivial packages two authors manually examine code bases sampled packages looking test code identify packages test measure Cohen’s Kappa coefficient evaluate level agreement two annotators Cohen 1960 result process find level agreement two authors 097 consider excellent agreement Finally two authors discuss cases agree come agreement 2 Community interest evaluates community interest packages using number stars GitHub forks subscribers contributors adopted formula defined npms basically sum 
aforementioned metrics measured textstarsCount textforksCount textsubscribersCount textcontributorsCount use metric compare interested community trivial nontrivial packages measure community interest since developers view importance trivial packages evidence quality 3 Usage count represents number applications use package applications using Python package popular package may also indicate package high quality example PPyPI 11 indicated “The simple advantages may trivial used many people therefore potentially maintained developers” Hence use usage count metric since indicates package quality thus many developers use applications calculate number Python applications use PyPI trivial packages use librariesio dataset provides list Python applications packages depend Also PyPI package dataset count number Python applications use package found 364 sampled trivial Python packages manually examined 185 5082 packages test code 179 4918 examined packages test code written important note analysis examines whether trivial package tests whether tests actually effective completely different issue one reasons examining two metrics Community interest Usage count Figure 6 shows beanplots Community interest Usage count values trivial nontrivial Python packages dataset figures show two cases trivial Python packages median smaller Community interest value Usage count compared nontrivial packages said observe Fig 6a case Community interest metric see clearly median values lower trivial packages Figure 6b shows distribution Usage count metric similar trivial nontrivial packages examine whether difference metric values trivial nontrivial packages statistically significant performed MannWhitney test compare two distributions determine difference statistically significant also use Cliff’s Delta measure effect size PyPI trivial nontrivial packages Table 6 shows pvalues effect size values observe cases community interest usage count differences statistically significant effect size small 
negligible respectively 62 Examining ‘Dependency Overhead’ Perception discussed Section 5 top cited drawback using trivial packages developers need take maintain extra dependencies ie dependency overhead Examining impact dependencies complex wellstudied issue eg de Souza Redmiles 2008 Decan et al 2016 Abate et al 2009 examined multitude ways choose examine issue application package perspectives 621 Applicationlevel Analysis compared coding trivial tasks using trivial package imposes extra dependencies One problematic aspects managing dependencies Table 6 MannWhitney Test pvalue Cliff’s Delta trivial vs non trivial packages PyPI Metrics pvalue Community interest 22e16 −0251 small Usage count 0004557 −0039 negligible Applications dependencies updated causing potential break application Therefore first step examined number releases trivial nontrivial packages intuition developers need put extra effort ensure proper integration new releases beanplots Figs 7 8 show distribution number releases studied package management platforms Figure 7a shows trivial packages npm less releases nontrivial packages median 1 trivial 2 nontrivial packages However examine number different release types found trivial nontrivial npm packages similar numbers minor major releases Fig 7c b patch releases trivial npm packages less patch releases Fig 8a also observe trivial packages PyPI less releases nontrivial packages examine number releases PyPI packages based release type Figures 8b c show distribution minor major patch releases trivial nontrivial PyPI packages Fig 8b c see difference trivial nontrivial packages minor major releases patch releases observe trivial PyPI packages smaller number patch releases fact trivial packages updated less frequently may attributed fact trivial packages ‘perform less functionality’ hence need updated less frequently addition examine whether differences distribution type releases trivial nontrivial packages statistically significant performed Wilcox test 
also use Cliff’s Delta examine effect size Table 7 shows pvalues effect size releases types npm PyPI shows releases types differences statistically significant pvalues 0.05 Also effect size values small negligible Next examined developers choose deal updates trivial packages One way application developers reduce risk package impacting application ‘version lock’ package example JavaScript application use npm packages version locking dependencypackage means updated automatically specific version mentioned packagesjson file used stated responses survey eg Pnpm 8 “ Also people don’t lock versions pain” general different types version locks ie updating major releases updating patches updating minor releases lock means package automatically updates version locks specified configuration file next every package name example npm defines packagesjson file examined frequency trivial nontrivial packages locked npm find average trivial packages locked 263 time whereas nontrivial packages locked 282 time Wilcox test also shows difference statistically significant pvalue 0.05 pvalue 9.116e-07 hand PyPI find average trivial packages locked 317 time whereas nontrivial packages locked 362 time Also Wilcox test shows difference statistically significant pvalue 9.707e-08 findings show trivial packages locked less npm true PyPI trivial packages locked less nontrivial packages cases however find large difference percentage packages trivial vs nontrivial locked
Fig 7 Distribution different types releases trivial nontrivial npm packages
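The effect sizes reported in this section are Cliff's Delta values. As a minimal sketch of how such a value can be computed, the Python below counts, over all cross-group pairs, how often a value from the first group is larger or smaller than one from the second. The function names, the sample release counts, and the magnitude cut-offs (0.147, 0.33, 0.474) are assumptions for illustration; the text does not state which cut-offs the authors used, although the labels in Table 6 (−0.251 small, −0.039 negligible) are consistent with them.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's delta: share of pairs with x > y minus share with x < y."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Label |delta| using conventional cut-offs (assumed, not from the text)."""
    size = abs(delta)
    if size < 0.147:
        return "negligible"
    if size < 0.33:
        return "small"
    if size < 0.474:
        return "medium"
    return "large"


# Hypothetical release counts, for illustration only.
trivial_releases = [1, 1, 2, 1, 3]
nontrivial_releases = [2, 3, 1, 4, 2]
delta = cliffs_delta(trivial_releases, nontrivial_releases)
print(delta, magnitude(delta))
```

A negative value means the first group (here, trivial packages) tends to have smaller values than the second. The pairwise loop is quadratic in the sample sizes, so a rank-based formulation would be preferable for datasets as large as the ones studied.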
::::
Table 7
Release type   npm pvalue   npm effect size   PyPI pvalue   PyPI effect size
               2.2e-16      0.2016            2.2e-16       0.2995
Minor          2.2e-16      0.0823            2.2e-16       0.2447
Major          2.2e-16      0.1185            2.2e-16       0.1276
Patch          2.2e-16      0.1985            2.2e-16       0.2729
6.2.2 Packagelevel Analysis package level investigate direct indirect dependencies trivial packages particular would like determine trivial packages dependencies makes dependency chain even complex trivial nontrivial package npm install count actual number direct indirect dependencies package requires allows us know true direct indirect dependencies package requires Note simply looking json file require statements provide direct dependencies indirect dependencies Hence downloaded packages npm dataset mock installed4 build dependency graph npm platform Similarly PyPI count actual number direct indirect dependencies package requires leveraged metadata provided Valiev et al 2018 study Valiev et al extracted list direct indirect dependencies package PyPI resort use data provided Valiev et al 2018 since recently extracted data covers history PyPI six years read dependencies package build dependency graph PyPI platform Figure 9 shows distribution dependencies trivial nontrivial packages npm PyPI Since trivial packages dependencies median zero Therefore bin trivial packages based number dependencies calculate percentage packages bin Table 8 shows percentage packages respective number dependencies npm PyPI observe majority npm trivial packages 56.9% zero dependencies 21% 1-10 dependencies 3.8% 11-20 dependencies 18.4% more than 20 dependencies table also shows PyPI trivial packages much dependencies npm packages fact 63.2% PyPI packages zero dependencies approx 34% trivial packages 1-20 dependencies approx 3% PyPI trivial packages more than 20 dependencies Interestingly table shows trivial packages npm many dependencies indicates indeed trivial packages introduce significant dependency overhead also shows PyPI trivial packages small number dependencies One explanation difference Python language mature standard API provides needed utility
functionalities
Footnote 4 modified npm code intercept install call counted installations needed every package
Table 8 Percentage packages vs number dependencies used npm PyPI package management platforms
Dependencies Direct Indirect
Packages      npm 0   npm 1-10   npm 11-20   npm >20   PyPI 0   PyPI 1-10   PyPI 11-20   PyPI >20
Trivial       56.9    21         3.8         18.4      63.2     29.6        4.3          2.9
Non Trivial   37.1    24.1       6.8         32.1      42.5     39.4        10.7         7.4
Trivial packages fewer releases less likely version locked nontrivial packages said developers careful using trivial packages since cases trivial packages numerous dependencies fact find 43.4% npm trivial packages least one dependency 18.4% npm trivial packages more than 20 dependencies 36.8% PyPI trivial packages least one dependency 2.9% PyPI trivial packages more than 20 dependencies
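The package-level analysis counts every direct and indirect dependency of a package and then bins packages by that count. The sketch below illustrates the idea, assuming a mapping from each package to its direct dependencies is already available. The authors obtained this information by mock-installing npm packages and from the Valiev et al 2018 data for PyPI; the dictionary, the package names, and the function names here are hypothetical.

```python
def transitive_dependencies(package, direct_deps):
    """Return the set of all direct and indirect dependencies of `package`."""
    seen = set()
    stack = list(direct_deps.get(package, ()))
    while stack:
        dep = stack.pop()
        # Skip packages already counted and guard against dependency cycles.
        if dep in seen or dep == package:
            continue
        seen.add(dep)
        stack.extend(direct_deps.get(dep, ()))
    return seen


def dependency_bin(count):
    """Map a dependency count to the bins used in Table 8."""
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 20:
        return "11-20"
    return ">20"


# Hypothetical dependency graph, for illustration only.
graph = {
    "tiny-pad": [],
    "tiny-format": ["tiny-pad", "tiny-trim"],
    "tiny-trim": ["tiny-pad"],
}
for name in graph:
    deps = transitive_dependencies(name, graph)
    print(name, len(deps), dependency_bin(len(deps)))
```

An iterative traversal with a visited set keeps the count correct when the same package is reachable through several paths, which is the case that reading only the direct dependency list would miss.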
::::
7 Relevance Implications common question asked empirical studies implications findings would practitioners care findings discuss issue relevance study developer community based responses survey highlight implications study 71 Relevance Practitioners care start study sure practically relevant study trivial packages However surprised interest developers study fact one developers Pnpm 39 explicitly mentioned lack research topic stating “There enough research I’ve taking note people’s proposed “quick simple” code handle functionality trivial packages it’s surprised see high percentage times proposed code buggy incomplete” Moreover conducted studies asked respondents would like know outcome study provide us email address 125 JavaScript Python respondents 81 aprox 65 provided email us provide outcomes study respondents hold high level leadership roles npm us indicator study outcomes high relevance JavaScript Python development communities 72 Implications Study study number implications engineering practice research 721 Practical Implications direct implication findings trivial packages commonly used others perhaps indicating developers view use bad practice especially JavaScript developers Moreover developers assume trivial packages well implemented tested since findings show otherwise npm developers need expect trivial packages submitted making task finding relevant package even harder Hence issue manage help developers find best packages needs addressed example Pnpm 15 indicated “ problem locating packages useful ‘trustworthy’ ever growing sea packages” extent npms recently adopted npm specifically address aforementioned issue Developers highlighted lack decent core standard JavaScript library causes resort trivial packages Often want install large frameworks leverage small parts framework hence resort using trivial packages example Pnpm 35 “especially JavaScript relieves thinking cross browser compatibility special casescoming polyfills testing edge cases Basically 
it’s substitute missing standard library depend huge utility library need part” PPyPI 23 “Usually indication inadequacy standard library seems particularly JavaScript might find using many modules” Therefore need JavaScript community create standard JavaScript API library order reduce dependence trivial packages issue creating standard JavaScript library much debate Fuchs 2016 722 Implications Future Research study mostly focused determining prevalence reasons drawbacks using trivial packages two large package management platforms npm PyPI Based findings find number implications motivations future work First survey respondents indicated choice use trivial packages black white many cases depends team application example one survey respondent stated team less experienced developers likely use trivial packages whereas experienced developers would rather write code trivial tasks issue experienced developers likely trust code less experienced likely trust external package Another aspect maturity application survey respondents pointed much likely use trivial packages early development life cycle waste time trivial tasks focus fundamental tasks application application matures start look ways reduce dependencies since pose potential points failure application study motivates future work examine relationship team experience application maturity use trivial packages Second survey respondents also pointed using trivial packages seen favourably compared using code Questions Answers QA sites StackOverflow Reddit example Pnpm 84 stated “I’d research solve particular problem peruse questions answers StackOverflow Reddit Coderanch find recent readable solution among everything I’ve found write go work simply ‘require’ someone else’s solution continue working towards goal matter seconds” compared using code StackOverflow developer know posted code else uses whether code may tests using trivial package npm andor PyPI seen much better option case using trivial packages seen best 
choice certainly better choice Although many studies examined developers use QA sites StackOverflow Abdalkareem et al 2017a b Wu et al 2018 Baltes Diehl 2018 aware studies compare code reuse QA sites trivial packages findings indicate need study
::::
8 Related Work section discuss work related study divided related work work related code reuse general work studied ecosystems 81 Studies Code Reuse Prior research code reuse shown many benefits include improving quality development speed reducing development maintenance costs Mockus 2007 Lim 1994 Mohagheghi et al 2004 Basili et al 1996 example Sojer Henkel 2010 surveyed 686 open source developers investigate reuse code findings show experienced developers reuse source code 30 functionality open source OSS projects reuse existing components Developers also reveal see code reuse quick way start new projects Similarly Haefliger et al 2008 conducted study empirically investigate reuse open source development practices developers OSS triangulated three sources data developer interviews code inspections mailing list data six OSS projects results showed developers used tools relied standards reusing components Mockus 2007 conducted empirical study identify largescale reuse open source libraries study shows 50 source files include code OSS libraries hand practice reusing source code challenging drawbacks including effort resource required integrate reused code Di Cosmo et al 2011 Furthermore bug reused component could propagate target system Dogguy et al 2011 study corroborates findings main goal define empirically investigate phenomenon reusing trivial packages particular JavaScript Python applications 82 Studies Ecosystems recent years analyzing characteristics ecosystems engineering gained momentum Bavota et al 2013 Bloemen et al 2014 Manikas 2016 Decan et al 2016 example recent study Bogart et al 2015 Bogart et al 2016 empirically studied three ecosystems including npm found developers struggle changing versions might break dependent code Wittern et al 2016 investigated evolution npm ecosystem extensive study covers dependence npm packages download metrics usage npm packages real applications One main findings npm packages updates packages steadily growing 80 packages 
least one direct dependency studies examined size characteristics packages ecosystem German et al 2013 studied evolution statistical computing GNU R aim analyzing differences code characteristics core usercontributed packages found usercontributed packages growing faster core packages Additionally reported usercontributed packages typically smaller core packages R ecosystem Kabbedijk Jansen 2011 analyzed Ruby ecosystem found many small large projects interconnected Decan et al 2018b investigated evolution package dependency networks seven packaging ecosystems findings reveal studied packaging ecosystems grow time term number published updated packages also observed increasing number transitive dependencies packages works investigate challenges using external packages ecosystem including identify conflicts JavaScript package Patra et al 2018 examine pull requests help developers upgrade outofdate dependencies applications Mirhosseini Parnin 2017 study usage repository badges npm ecosystem Trockman et al 2018 usage dependency graph discover hidden trend ecosystem Kula et al 2018 many ways study complements previous work since instead focusing packages ecosystem specifically focus trivial packages studied two different package management platforms npm PyPI Moreover examine reasons developers use trivial package view drawbacks study reuse trivial packages subset general code reuse Hence expect overlap prior work Like many empirical studies confirm prior findings contribution Hunter 2001 Seaman 1999 Moreover paper adds prior findings example validation developers’ assumptions Lastly believe study fills real gap since 65 participants said wanted know study outcomes
::::
9 Threats Validity section discuss threats validity case study 91 Internal Validity Internal validity concerns factors may influenced results datasets collection process study reasons drawback using trivial packages surveyed developers potential survey questions may influenced replies respondents However minimize influence made sure ask freeform responses publicly share survey anonymized survey responses Abdalkareem et al 2019 Moreover way asked survey questions might affected response respondents causing responses advocate advocate use trivial packages reduce bias ensure participants’ anonymity Also study may impacted fact overlap exist developer groups participated two user studies ie defining trivial packages understanding developers’ perception use trivial packages find second survey served confirmation observations made first survey participants however given two different populations may reported different observations removed test code dataset ensure analysis considers production source code identified test code searching term ‘test’ variants eg ‘TESTcode’ file names file paths Even though technique widely accepted literature Gousios et al 2014 Tsay et al 2014 Zhu et al 2014 confirm whether technique correct ie files term ‘test’ names paths actually contain test code took statistically significant sample packages achieve 95 confidence level 5 confidence interval examined manually found examined cases contain test code addition examine welltested perception PyPI trivial packages first two authors manually examined source code trivial packages classify whether test code written ensure validity classification measure classification agreement two authors found classification agreement two authors excellent Cohen’s Kappa value 097 92 Construct Validity Construct validity considers relationship theory observation case measured variables measure actual factors define trivial packages surveyed 12 JavaScript 13 Python developers However find consensus considered 
trivial package Although analysis shows packages ≤ 35 LOC complexity ≤ 10 trivial packages believe definitions possible trivial packages said 125 survey participants emailed using trivial packages 2 mentioned flagged package trivial package even though fit criteria us confirmation definition applies vast majority cases although clearly perfect addition determine considered trivial package conducted experiment JavaScript Python developers mostly students undergraduate graduate students professional experience may present professional developers per se Sjoberg et al 2002 prior work shown experiment students provide results professional developers engineering domain Salman et al 2015 Höst et al 2000 identify JavaScript Python applications examine study rely metadata provided GHTorrent dataset Gousios et al 2014 Thus selection JavaScript Python applications heavily depends correctness applications’ programming language listed GHTorrent use LOC cyclomatic complexity code determine trivial packages cases may measures need considered determine trivial packages example trivial packages dependencies may need taken consideration experience tells us developers look package dependencies determining trivial said replicated questionnaire another set participants Python language community found developers seem confirm definition trivial JavaScriptPython packages Abdalkareem et al 2019 Based user study defined trivial npm packages package ≤ 35 LOC Cyclomatic Complexity ≤ 10 However one threat definition 10 cyclomatic complexity high package trivial examine concern calculate cyclomatic complexity nontrivial packages dataset found average nontrivial npm packages cyclomatic complexity 803 indicates 10 Cyclomatic complexity value definition still significantly smaller compared one nontrivial packages study trivial packages PyPI package management platform able extract 63912 packages Collecting packages may provide details trivial packages PyPI package management platform Also
identify Python applications use PyPI trivial packages use snakefood tool httpfuriuscasnakefood extract applications dependencies Hence limited accuracy snakefood extracting used packages Python applications study understand developers use trivial packages conducted two user surveys JavaScript Python developers two surveys performed different dates consequence may affect outcome survey results However given two package management platforms independent envision impact date shift significant study identify developers used trivial packages applications use regular expressions identify packages process may flag wrong package developers mitigate threat analysis make sure extract right packages several rounds manual checking results addition none developers contacted indicated shehe use identified packages serves slight confirmation methodology incorrect study npm used npms measure various quantitative metrics related testing community interest download counts measurements accurate npms however given main search tool npm confident npms metrics also use librariesio calculate community interested usage count metrics PyPI packages measurements accurate librariesio resort use librariesio data since used prior work eg Decan et al 2018a b addition use dataset provided Valiev et al 2018 measure direct indirect dependencies packages PyPI analysis also use different R packages perform analysis analysis may impacted accuracy used R packages mitigate threat make dataset used tools available online Abdalkareem et al 2019 93 External Validity External validity considers generalization findings findings derived open source JavaScript applications npm packages replication Python PyPI packages Even though believe two studied package management platforms amongst commonly used ones findings may generalize platforms ecosystems said historical evidence shows examples individual cases contributed significantly areas physics economics social sciences even engineering Flyvbjerg 2006 believe 
strong empirical evidence built studies individual cases studies large samples list reasons drawbacks using trivial packages based survey 88 JavaScript 37 Python developers Although large number developers results may hold developers different sample developers may result different list ranking advantages disadvantages mitigate risk due sampling contacted developers different applications responses show experienced developers distinguish domain studied packages may impact findings However help mitigate bias analyzed 500000 npm 74663 PyPI packages cover wide range package domains Lastly study based open source applications hosted GitHub therefore study may generalize open source commercial applications
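Two of the heuristics discussed in this section are simple enough to sketch in code: flagging test code by searching for the term 'test' in file names and paths, and classifying a package as trivial when it has at most 35 LOC and a cyclomatic complexity of at most 10. The Python below is an illustration under those stated definitions only. The function names and sample paths are hypothetical, and the LOC and complexity values are assumed to come from an external measurement tool.

```python
import re

# Case-insensitive match so variants such as 'TESTcode' are also caught.
TEST_PATTERN = re.compile(r"test", re.IGNORECASE)


def is_test_path(path):
    """Flag a file as test code if 'test' appears anywhere in its path."""
    return TEST_PATTERN.search(path) is not None


def is_trivial(loc, cyclomatic_complexity, max_loc=35, max_complexity=10):
    """Apply the trivial-package definition: LOC <= 35 and complexity <= 10."""
    return loc <= max_loc and cyclomatic_complexity <= max_complexity


# Hypothetical file paths, for illustration only.
paths = ["lib/index.js", "test/index.spec.js", "src/TESTcode/helper.py"]
production_files = [p for p in paths if not is_test_path(p)]
print(production_files)
print(is_trivial(loc=20, cyclomatic_complexity=4))
```

A plain substring match also flags unrelated names that happen to contain the letters 'test' (for example a file called latest.js), which is why a manual check of a sample, as described above, is a sensible complement to this kind of filter.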
::::
10 Conclusion use trivial packages increasingly popular trend development Abdalkareem et al 2017 Abdalkareem 2017 Like development practice proponents opponents goal study extend understanding use trivial packages examine prevalence reasons drawbacks using trivial packages different package management platforms Thus consider trivial packages PyPI addition previous studied npm Abdalkareem et al 2017 results indicate trivial packages commonly widely used JavaScript Python applications also find majority JavaScript developers study oppose use trivial packages majority Python developers believe using trivial packages could harmful Additionally based developers’ responses developers two package management platforms stated main reasons developers use trivial packages due fact considered well implemented tested cite additional dependencies’ overhead drawback using trivial packages empirical study showed considering trivial packages well tested misconception since half studied trivial package even tests However trivial packages seem ‘deployment tested’ similar Community interest DownloadUsage count values nontrivial packages addition find trivial packages dependencies studied dataset 184 npm 29 PyPI trivial packages 20 dependencies Hence developers careful trivial packages use Based findings provide following practical suggestions developers – Developers assume trivial packages welltested implemented since found 284 492 npm PyPI trivial packages test code – Due fact trivial packages dependencies developers aware using trivial packages would increase dependency overhead applications Acknowledgments authors grateful many survey respondents dedicated valuable time respond surveys Also authors would like thank anonymous reviewers editor thoughtful feedback suggestions help us improve study References Abate P Di Cosmo R Boender J Zacchiroli 2009 Strong dependencies components Proceedings 2009 3rd International Symposium Empirical Engineering Measurement ESEM ’09 IEEE Computer 
Society pp 89–99 Abdalkareem R 2017 Reasons drawbacks using trivial npm packages developers’ perspective Proceedings 2017 11th Joint Meeting Foundations Engineering ESECFSE 2017 ACM pp 1062–1064 Abdalkareem R Nourry Wehaibi Mujahid Shihab E 2017 developers use trivial packages empirical case study npm Proceedings 11th Joint Meeting Foundations Engineering ESECFSE ’17 ACM pp 385–395 Abdalkareem R Oda V Mujahid Shihab E 2019 impact using trivial packages empirical case study npm pypi httpsdoiorg105281zenodo3095009 Abdalkareem R Shihab E Rilling J 2017 code reuse Stack Overflow exploratory study Android apps Inf Softw Technol 88C148–158 Abdalkareem R Shihab E Rilling J 2017 developers use crowd study using Stack Overflow IEEE Softw 34253–60 Baltes Diehl 2018 Usage attribution Stack Overflow code snippets gitHub projects Empirical Engineering Basili VR Briand LC Melo WL 1996 reuse influences productivity objectoriented systems Commun ACM 3910104–116 Bavota G Canfora G Penta MD Oliveto R Panichella 2013 evolution interdependencies ecosystem case Apache Proceedings 2013 IEEE International Conference Maintenance ICSM ’13 IEEE Computer Society pp 280–289 Blais snakefood Python Dependency Graphs httpfuriuscasnakefood accessed 09232018 Bloemen R Amrit C Kuhlmann Ordóñez Matamoros G 2014 Gentoo package dependencies time Proceedings 11th Working Conference Mining Repositories MSR ’14 ACM pp 404–407 Bogart C Kastner C Herbsleb J 2015 breaks breaks ecosystem developers reason stability dependencies Proceedings 2015 30th IEEEACM International Conference Automated Engineering Workshop ASEW ’15 IEEE Computer Society pp 86–89 Bogart C Kästner C Herbsleb J Thung F 2016 break API Cost negotiation community values three ecosystems Proceedings 2016 24th ACM SIGSOFT International Symposium Foundations Engineering FSE ’16 ACM pp 109–120 Bower 2012 Bower package manager web httpsbowerio accessed 08232016 Castelluccio L Khomh F 2019 empirical study patch uplift rapid release development 
pipelines Empir Softw Eng 2453008–3044 Cohen J 1960 coefficient agreement nominal scales Educ Psychol Meas 2037–46 Cruz Duarte 2017 npms httpsnpmsio accessed 02202017 de Souza CRB Redmiles DF 2008 empirical study developers’ management dependencies changes Proceedings 30th International Conference Engineering ICSE ’08 ACM pp 241–250 Decan Mens Constantinou E 2018a impact security vulnerabilities npm package dependency network International Conference Mining Repositories Decan Mens Grosjean P 2018b empirical comparison dependency network evolution seven packaging ecosystems Empirical Engineering Decan Mens Grosjean P et al 2016 github meets CRAN analysis interrepository package dependency problems Proceedings 23rd IEEE International Conference Analysis Evolution Reengineering volume 1 SANER ’16 IEEE pp 493–504 Di Cosmo R Di Ruscio Pelliccione P Pierantonio Zacchiroli 2011 Supporting evolution componentbased FOSS systems Sci Comput Program 76121144–1160 Dogguy Glondu Le Gall Zacchiroli 2011 Enforcing typeSafe linking using interpackage relationships Studia Informatica Universalis 91129–157 Ebert C Cain J 2016 Cyclomatic complexity IEEE Softw 33627–29 Fleiss JL Cohen J 1973 equivalence weighted kappa intraclass correlation coefficient measures reliability Educ Psychol Meas 33613–619 Flyvbjerg B 2006 Five misunderstandings casestudy research Qual Inq 122219–245 Fuchs 2016 great standard library JavaScript – medium httpsmediumcomthomasfuchswhatifwehadagreatstandardlibraryinjavascript52692342ee3fpw7d4cq8j accessed 02242017 German Adams B Hassan 2013 Programming language ecosystems evolution R Proceedings 17th European Conference Maintenance Reengineering CSMR ’13 IEEE pp 243–252 Gousios G Vasilescu B Serebrenik Zaidman 2014 Lean ghtorrent Github data demand Proceedings 11th Working Conference Mining Repositories MSR ’14 ACM pp 384–387 Grissom RJ Kim JJ 2005 Effect sizes research broad practical approach Lawrence Erlbaum Associates Publishers Haefliger Von Krogh G Spaeth 
2008 Code reuse open source Manag Sci 541180–193 Haney 2016 Npm leftpad forgotten program httpwwwhaneycodesnetnpmleftpadhaveweforgottenhowtoprogram accessed 08102016 Harris R 2015 Small modules it’s quite simple httpsmediumcomRichHarrissmallmodulesitsnotquitethatsimple3ca532d65de4 accessed 08242016 Hemanth HM 2015 Oneline node modules issue10 sindresorhusama httpsgithubcomsindresorhusamaissues10 accessed 08102016 Höst Regnell B Wohlin C 2000 Using students subjects—a comparative study students professionals leadtime impact assessment Empir Softw Eng 53201–214 Hunter JE 2001 desperate need replications J Consum Res 281149–158 Inoue K Sasaki Xia P Manabe 2012 code come go integrated code history tracker open source systems Proceedings 34th International Conference Engineering ICSE ’12 IEEE Press pp 331–341 Kabbedijk J Jansen 2011 Steering insight exploration Ruby ecosystem Proceedings Second International Conference Business ICSOB ’11 Springer pp 44–55 Kalliamvakou E Gousios G Blincoe K Singer L German DM Damian 2014 promises perils mining gitHub Proceedings 11th Working Conference Mining Repositories MSR ’14 ACM pp 92–101 Kula RG Roover CD German DM Ishio Inoue K 2018 generalized model visualizing library popularity adoption diffusion within ecosystem 2018 IEEE 25th International Conference Analysis Evolution Reengineering volume 00 SANER ’18 pp 288–299 Librariesio Librariesio open source discovery service httpslibrariesio accessed 05202018 Librariesio 2017 Pypi httpslibrariesiopypi accessed 03082017 Lim WC 1994 Effects reuse quality productivity economics IEEE Softw 11523–30 Macdonald F 2016 programmer almost broke Internet last week deleting 11 lines code httpwwwsciencealertcomhowaprogrammeralmostbroketheinternetbydeleting11linesofcode accessed 08242016 Manikas K 2016 Revisiting ecosystems research longitudinal literature study J Syst Softw 11784–103 McCamant Ernst MD 2003 Predicting problems caused component upgrades Proceedings 9th European Engineering 
Conference Held Jointly 11th ACM SIGSOFT International Symposium Foundations Engineering ESECFSE ’03 ACM pp 287–296 Mirhosseini Parnin C 2017 automated pull requests encourage developers upgrade outofdate dependencies Proceedings 32Nd IEEEACM International Conference Automated Engineering ASE ’17 IEEE Press pp 84–94 Mockus 2007 Largescale code reuse open source Proceedings First International Workshop Emerging Trends FLOSS Research Development FLOSS ’07 IEEE Computer Society p 7– Mohagheghi P Conradi R Killi OM Schwarz H 2004 empirical study reuse vs defectdensity stability Proceedings 26th International Conference Engineering ICSE ’04 IEEE Computer Society pp 282–292 npm 2016 npm — node package managment documentation httpsdocsnpmjscomgettingstartedwhatisnpm accessed 08142016 npm Blog 2016 npm blog changes npm’s unpublish policy httpblognpmjsorgpost141905368000changestounpublishpolicy accessed 08112016 Orsila H Geldenhuys J Ruokonen Hammouda 2008 Update propagation practices highly reusable open source components Proceedings 4th IFIP WG 213 International Conference Open Source Systems OSS ’08 pp 159–170 Patra J Dixit PN Pradel 2018 Conflictjs Finding understanding conflicts javaScript libraries Proceedings 40th International Conference Engineering ICSE ’18 ACM pp 741–751 Python Python testing tools taxonomy python wiki httpswikipythonorgmoinPythonTestingToolsTaxonomy accessed 05162018 Rahman MT Rigby PC Shihab E 2019 modular feature toggle architectures google chrome Empir Softw Eng 242826–853 Ray B Posnett Filkov V Devanbu P 2014 large scale study programming languages code quality gitHub Proceedings 22Nd ACM SIGSOFT International Symposium Foundations Engineering FSE ’14 ACM pp 155–165 Salman Misirli Juristo N 2015 students representatives professionals engineering experiments 2015 IEEEACM 37th IEEE International Conference Engineering volume 1 ICSE ’15 IEEE pp 666–676 SciTools Understand tool httpsscitoolscom accessed 04162019 Seaman CB 1999 Qualitative methods 
empirical studies engineering IEEE Trans Softw Eng 254557–572 Singer J Sim SE Lethbridge TC 2008 engineering data collection field studies Guide Advanced Empirical Engineering Springer london pp 9–34 Sjoberg DIK Anda B Arisholm E Dyba Jorgensen Karahasanovic Koren EF Vokac 2002 Conducting realistic experiments engineering Proceedings International Symposium Empirical Engineering IEEE pp 17–26 Sojer Henkel J 2010 Code reuse open source development Quantitative evidence drivers impediments J Assoc Inf Syst 1112868–901 Trockman Zhou Kästner C Vasilescu B 2018 Adding sparkle social coding empirical study repository badges npm ecosystem Proceedings International Conference Engineering ICSE ’18 ACM Tsay J Dabbish L Herbsleb J 2014 Influence social technical factors evaluating contribution gitHub Proceedings 36th International Conference Engineering ICSE ’14 ACM pp 356–366 Valiev Vasilescu B Herbsleb J 2018 Ecosystemlevel determinants sustained activity opensource projects case study pyPi ecosystem Joint European Engineering Conference Symposium Foundations Engineering ESECFSE ’18 ACM Vasilescu B Yu Wang H Devanbu P Filkov V 2015 Quality productivity outcomes relating continuous integration gitHub Proceedings 2015 10th Joint Meeting Foundations Engineering ESECFSE ’15 ACM pp 805–816 Williams C 2016 one developer broke Node Babel thousands projects 11 lines JavaScript httpwwwtheregistercouk20160323npmleftpadchaos accessed 08242016 Wittern E Suter P Rajagopalan 2016 look dynamics javaScript package ecosystem Proceedings 13th International Conference Mining Repositories MSR ’16 ACM pp 351–361 Wu Wang Bezemer CP Inoue K 2018 developers utilize source code Stack Overflow Empirical Engineering Zambonini 2011 Practical Guide Web App Success chapter 20 Five Simple Steps accessed 02232017 Gregory ed Zhu J Zhou Mockus 2014 Patterns folder use popularity case study gitHub repositories Proceedings 8th ACMIEEE International Symposium Empirical Engineering Measurement ESEM ’14 ACM pp 
301–304 Rabe Abdalkareem postdoctoral fellow Analysis Intelligence Lab SAIL Queen’s University Canada received PhD Computer Science Engineering Concordia University Montreal Canada research investigates adoption crowdsourced knowledge affects development maintenance Abdalkareem received masters applied Computer Science Concordia University work published premier venues FSE ICSME MobileSoft well major journals TSE IEEE EMSE IST Contact rababduencsconcordiaca httpusersencsconcordiacarababdu Vinicius Oda MASc student Department Computer Science Engineering Concordia University Montreal research interests include Engineering Ecosystems Mining Repositories among others Suhaib Mujahid PhD student Department Computer Science Engineering Concordia University received masters Engineering Concordia University Canada 2017 obtained Bachelors Information Systems Palestine Polytechnic University research interests include wearable applications quality assurance mining repositories empirical engineering find httpusersencsconcordiacasmujahi Emad Shihab Associate Professor Concordia University Research Chair Department Computer Science Engineering Concordia University research interests Engineering Mining Repositories Analytics work published prestigious SE venues including ICSE ESECFSE MSR ICSME EMSE TSE serves steering committees PROMISE SANER MSR three leading conferences analytics areas work done collaboration adopted biggest companies Microsoft Avaya BlackBerry Ericsson National Bank senior member IEEE homepage httpdasencsconcordiaca Affiliations Rabe Abdalkareem Vinicius Oda Suhaib Mujahid Emad Shihab Vinicius Oda vodaencsconcordiaca Suhaib Mujahid smujahiencsconcordiaca Emad Shihab eshihabencsconcordiaca DataDriven Analysis DAS Lab Department Computer Science Engineering Concordia University Montréal Canada
::::
Understanding Usage Impact Adoption NonOSI Approved Licenses Rômulo Meloca1 Gustavo Pinto2 Leonardo Baiser1 Marco Mattos1 Ivanilton Polato1 Igor Scaliante Wiese1 Daniel German3 1Federal University Technology – Paraná UTFPR 2University Pará UFPA 3University Victoria ABSTRACT license one important nonexecutable pieces system However due nontechnical nature developers often misuse misunderstand licenses Although previous studies reported problems related licenses clashes inconsistencies paper shed light important yet overlooked issue use nonapproved opensource licenses licenses claim opensource formally approved Open Source Initiative OSI developer releases nonapproved license even interest make opensource original author might granting rights required use uncover reasons behind use nonapproved licenses conducted mixmethod study mining data 657K opensource projects 4367K versions surveying 76 developers published projects Although 1058554 versions employ least one nonapproved license nonapproved licenses account 2151 license usage also observed uncommon developers change nonapproved approved license asked developers mentioned transition due better understanding disadvantages using nonapproved license perspective particularly important since developers often rely package managers easily quickly get dependencies working CCS CONCEPTS • engineering → Open source model KEYWORDS Open Source license OSI approved ACM Reference Format Rômulo Meloca1 Gustavo Pinto2 Leonardo Baiser1 Marco Mattos1 Ivanilton Polato1 Igor Scaliante Wiese1 Daniel German3 2018 Understanding Usage Impact Adoption NonOSI Approved Licenses Proceedings MSR ’18 15th International Conference Mining Repositories Gothenburg Sweden May 28–29 2018 MSR ’18 11 pages httpsdoiorg10114531963983196427
::::
1 INTRODUCTION licenses one important nonexecutable part system 5 Particularly relevant opensource OSS opensource licenses drive one use OSS also ensure extent others reuse 19 Similarly code licenses change 27 evolve 25 relicensing indeed commonplace opensource world 7 example Facebook recently relicensed four key opensource softwares BSD Patents MIT license1 According change motivated unhappy community looking alternatives permissive licenses concern however pertains large companies maintain opensource softwares since license common good opensource Therefore surprise licensing active research field 1 4 16 23 Despite importance developers fully understand problems related license usage 1 lack licenses license inconsistencies way developers develop exacerbates problem since simple actions copying code snippet web potential infringing license 12 13 issue becomes even relevant opensource era constant flow new opensource born regular basis 10 developers myriad codebases refer way might infringe license consequently whole chain depends Another relevant yet fully understood problem use opensource licenses approved OSI Open Source Initiative see Section 2 details licenses formally approved opensource regulator therefore vetted opensource Currently OSI maintains list 83 approved opensource licenses2 licenses went rigorous review process licenses submitted approved eg CC0 license3 submitted approved According website purpose OSI’s license review process “1 Ensure approved licenses conform Open Source Definition OSD 2 Identify appropriate License Proliferation Category 3 Discourage vanity duplicative Licenses”4 Furthermore OSI defined open source Open Source Definition claims “only licensed OSIapproved Open Source license labeled ‘Open Source’ software”5 1httpscodefacebookcomposts300798627056246 2httpsopensourceorglicensesalphabetical 3httpsopensourceorgfaqcczero 4httpsopensourceorgapproval 5httpsopensourceorgfaq study investigate extent licenses provide opensource 
guarantees “nonapproved licenses” short used opensource projects published package managers Package managers particularly relevant license usage due least two reasons 1 growing faster terms number libraries available packages published 3 28 2 since packages obey standardized architecture 22 installing reusing thirdparty package comes pain Therefore packages published package managers might higher number dependencies rely package manager shall see Section 4 average package NPM 480 dependencies 3rd Quartile 5 Max 792 paper study three wellknown package managers NPM Node Package Manager RubyGems CRAN Comprehensive R Archive Network one package managers downloaded investigated packages available process ended comprehensive list 657811 packages scattered three wellknown long lived package managers Specifically investigated 510964 NPM packages 11366 CRAN packages 135481 RubyGems packages Still order provide evolutionary perspective license usage packages studied 4367440 different packages versions 3539494 NPM 816580 RubyGems 11366 CRAN manually analyzed license employed one package versions paper makes following contributions conducted largest study licenses usage evolution targeting ∼660k packages 43 million versions published three wellknown package managers NPM RubyGems CRAN studied impact use nonapproved licenses comprehending whole dependency chain deployed survey 76 package publishers package developers owners authors understand use nonapproved licenses
::::
2 BACKGROUND OPENSOURCE LICENSES Open Source Definition 17 published OSI defines 10 properties license must satisfy called Open Source OSI also established approval process license approved Open Source today 83 licenses approved although many submitted organizations also approve licenses open source Free Foundation FSF Debian Foundation two call Free Licenses—with one exception NASA Open Source Agreement 13 OSI approved licenses considered free FSF7 scope paper consider licenses approved OSI decision motivated fact differently FSF develop approve licenses OSI develop — approves — licenses Since license submitted anyone interested contributing opensource community participation crucial aspect modern opensource 2 18 much strong OSI side better understand approval process implications using OSI approved license conducted semistructured interview OSI’s board member According anybody submit license OSI approval certification process everyone invited participate review discussion license goal certification process make sure submitted license meets criteria stated OpenSource Definition licenses satisfies requirements set OpenSource definition license approved One main benefits using OSI approved license guarantee OSI—and open source community large—has vetted license license widely known Therefore community understand trust use license Otherwise OSI everyone could develop new license claim opensource would require using hire lawyers understand license means even license popular domains Create Commons Zero CC0 license released CC0 opensource According board member importantly threat applies recursively “if ‘A’ uses OSI approved license depends ‘B’ use OSI approved license would dangerous ‘A’ using OSI approved license” Nevertheless one interested publishing assets data images opensource data safely released CC0 requirements OSD apply assets similar issue occurs one state license case original author granted rights recipient without permission original author one use 
redistribute create derivative works clearly opposite opensource concepts
::::
3 METHOD section present research questions method data gathered ground definitions 31 Research Questions main goal study gain indepth understanding nonapproved opensource licenses designed following three research questions guide research RQ1 common nonapproved licenses packages RQ2 impact nonapproved licenses package managers ecosystem RQ3 developers adopt nonapproved licenses answer questions conducted twophase research adopting sequential mixedmethod approach First collected data license usage evolution corpus ∼660k packages Section 32 performed survey targeting 76 package publishers Section 33 32 First study mining license usage 321 Package Package Managers first study mined license information packages hosted three wellknown longlived package managers NPM RubyGems CRAN package managers studied following characteristics NPM manages indexes Nodejs packages Nodejs JavaScript runtime environment NPM package manager launched 2009 October 2017 contains 521K packages Although offers support maintaining packages insite version control system packages available maintained elsewhere eg GitHub submit package NPM user must create account push package using NPM utility RubyGems manages indexes Ruby packages RubyGems launched 2009 October 2017 contains 192K packages also offers support maintaining packages insite packages published maintained elsewhere eg GitHub RubyGems distributes binaries ie gem file web interface Anyone interested submitting package RubyGems must create account push package using gem utility CRAN manages indexes R packages Differently NPM RubyGems CRAN distributes source binary code packages published CRAN launched 1998 October 2017 contains 11K packages One interested submitting package CRAN needs create account submit package CRAN web interface package managers host several wellknown nontrivial packages including React NPM Rails RubyGems ggplot2 CRAN Packages package managers downloaded millions times per month instance September 2017 NPM packages 
BlueBirdtext11 Reacttext12 Lodashtext13 total downloaded 69 million times 18 mi 6 mi 45 mi respectively Package managers also make available package releases ie new version Table 1 presents distribution versions per package see 56 packages published NPM three version 58 RubyGems 75 CRAN Packages 10 versions also common 17 NPM 16 RubyGems 05 CRAN Generally speaking CRAN less package versions NPM RubyGems 322 Data Collection created infrastructure download extract data match dependencies package versions infrastructure downloaded metadata packages available three package managers NPM RubyGems provide API collect relevant datatext14 infrastructure gathers CRAN metadata navigating public HTML files CRAN NPM collected data September 7th 2017 collected RubyGems metadata September 15th 2017 Table 2 depicts metadata download package version package manager Versions CRAN NPM RubyGems 1 8848 150546 42668 2 1942 80243 22720 3 360 55028 15089 4 140 39890 10743 5 67 30192 7688 6 38 22886 5814 7 30 18190 4549 8 12 15105 3550 9 17 12000 2870 ≥10 67 86884 19790 downloading metadata infrastructure validated whether downloaded package X depends also downloaded package validated dependencies using version number stated package X version number defined package three package managers use notion delimiters express range possible versions compatible given package Example delimiters include characters “” “” “sim” “wedge” example package X depends ‘react’ package declare dependency “reactsim1500” indicates package X depends version compatible react1500 addition NPM RubyGems package publishers could use “x” character specify small range versions eg 11x 1x match dependencies selected first version available matched pattern example NPM package ‘gulp’ version ‘260’ gulp260 short depends package eventstream30x result infrastructure successfully matched package gulp260 eventstream300 dependency match procedure important impact analysis RQ2 downloaded data using three Google Cloud Platform VMs 
used one dualcore VM 75Gb main memory 20Gb SSD two singlecore VMs 35Gb main memory 10Gb hard disk downloading dataset occupied 12Gb disk space 11Gb NPM data 46Mb CRAN data 182Mb RubyGems data RubyGems data infrastructure used well data collected found companion websitetext15 Table 3 shows distribution number licenses per package version Licenses CRAN NPM RubyGems 0 0 369914 394582 1 5346 3158391 419095 2 5881 10287 2411 3 130 669 355 4 6 222 29 5 1 11 61 6 2 0 46 10 0 0 1 see majority packages single license Interestingly package license could found CRAN happens CRAN publish packages without selection licensetext16 Still package versions two licenses common instance package sixarmrubyunaccent112 published RubyGems released 10 licenses apache20 artistic20 bsd3clause ccbyncsa40 agpl30 gpl30 lgpl30 mit mpl20 ruby Table 4 presents number dependencies per version package Approximately 29 NPM package versions dependencies 39 CRAN 30 RubyGems respectively Dependencies CRAN NPM RubyGems 0 6435 1047089 258810 1 1782 537283 194312 2 1701 412121 143616 3 1517 322234 84679 4 1183 241449 51338 5 978 180349 31424 6 733 139429 22698 7 521 111070 13720 8 436 85631 11302 9 323 69024 8699 ≥10 1060 472466 32879 Although average number dependencies per package version 38 outliers found instance CRAN package seurat201 41 dependencies RubyGems package awssdkresources310 105 dependencies NPM package primengcustom400beta1 500 dependencies 323 License Groups aforementioned downloaded metadata 657811 packages 510964 NPM packages 11366 CRAN packages 135481 RubyGems packages spanning 4367440 versions 3539494 NPM 816580 RubyGems 11366 CRAN analyzing licenses version released found included typos wrong names happened NPM RubyGems allow one fill license field information manually normalized license found normalization process conducted pairs followed conflict resolution meetings license two authors checked 1 approved OSI 2 approved defined somewhere else ie Package Data Exchangetext17 3 approved 
neither defined anywhere else Licenses found OSI list neither SPDX allocated category check whether license already defined searched specification blog posts QA websites mailing lists formal specification license found license included nonapproved license group process ended six license groups namely OSI licenses licenses approved OSI case also fixed small issues trivial typos example successfully normalized apache 2 license correct form apache20 Incomplete licenses probably approved license although could fix issues instance package publishers often omit version number eg bsd lgpl could sure license version used SPDX OSI licenses licenses listed SPDX License Listtext18 formally approved OSI group include popular defined licenses Fuck Want Public License WTFPL Creative Commons Zero CC0 license Missing Absence license aggregated group package versions without license ie package publishers left empty license field developers filled explicit NONE word license field subcategory copyright licenses discussed Section 2 license declared original authors retains rights licenses undefined typos wrong names even curses Examples include license specified license Additionally included group licenses packager publisher put external link license information inspect file individually data included analysis conducted represent less 05 Copyright licenses occurs package publishers explicitly mention retain copyright Examples include license c Copyright license rights reserved license end normalization process ended 973 distinct licenses 758 NPM 46 CRAN 336 RubyGems 15httpsgithubcomrmelocaEcosystemsAnalysis 16httpscranrprojectorgwebpackagespolicieshtml 17httpsspdxorg 18httpsspdxorglicenses Nonapproved licenses comprehend licenses OSI licenses Incomplete licenses
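The grouping described in Section 323 can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' tooling: the lookup sets are tiny, hand-picked samples rather than the full OSI and SPDX lists, the identifiers are written in an SPDX-like lowercase form, and the names `classify` and `is_nonapproved` are hypothetical. The study's normalization was done manually in pairs.

```python
# Abbreviated, illustrative lookup sets (assumption): the study checked
# licenses against the full OSI and SPDX lists, not these samples.
OSI_APPROVED = {"mit", "isc", "apache-2.0", "bsd-3-clause", "gpl-3.0"}
SPDX_NOT_OSI = {"wtfpl", "cc0-1.0"}
VERSIONLESS_FAMILIES = {"bsd", "gpl", "lgpl", "agpl", "mpl", "epl"}
# Hypothetical alias table standing in for the manual typo fixes.
ALIASES = {"apache 2": "apache-2.0"}


def classify(raw):
    """Map a raw license field to one of the six groups of Section 323."""
    text = (raw or "").strip().lower()
    if text in ("", "none"):
        return "MISSING"            # empty field or explicit NONE
    text = ALIASES.get(text, text)  # fix trivial typos first
    if text in OSI_APPROVED:
        return "OSI"
    if text in VERSIONLESS_FAMILIES:
        return "INCOMPLETE"         # probably approved, version omitted
    if text in SPDX_NOT_OSI:
        return "SPDX_NOT_OSI"
    if "copyright" in text or "all rights reserved" in text:
        return "COPYRIGHT"
    return "OTHER"                  # undefined names, typos, external links


def is_nonapproved(group):
    """Nonapproved means every group except OSI and Incomplete."""
    return group not in ("OSI", "INCOMPLETE")


for field in ["Apache 2", "bsd", "WTFPL", "", "(c) Copyright", "see file"]:
    group = classify(field)
    print(repr(field), group, is_nonapproved(group))
```

Checking the OSI set before the versionless families keeps a fully specified license such as `bsd-3-clause` out of the Incomplete group.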
::::
33 Second study survey package publishers second study deployed survey package publishers NPM package manager focused package manager 1 email addresses package publishers could recovered 2 packages package manager exhibits greatest number dependencies likely affectbe affected license inconsistency found used following criteria identify population selected package publishers packages versions released nonapproved license least one dependency ensures irregularity propagates packages apply criteria obtained 385 package publishers different survey based recommendation Smith et al 21 employing principles increasing survey participation sending personalized invitations allowing participants remain completely anonymous asking closed direct questions much possible survey 14 questions three open grouped three broad interests demographics eg gender profession understanding nonapproved adoption eg choose aware implications usage frequency eg often use nonapproved licenses often declare license open questions analyzed pairs followed conflict resolution meetings Participation voluntary estimated time complete survey 510 minutes sending invitation email 8 messages delivered due technical reasons received 76 responses representing 20 response rate survey available httpsgooglJiuwzp
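The reported response rate can be checked with simple arithmetic. The text does not say whether the rate is computed over all invited publishers or only over delivered invitations, so the sketch below computes both; reading the 8 messages as undelivered is an assumption.

```python
invited = 385      # package publishers selected by the criteria
undelivered = 8    # assumption: messages that could not be delivered
responses = 76

rate_over_invited = responses / invited
rate_over_delivered = responses / (invited - undelivered)

print(f"over invited:   {rate_over_invited:.1%}")
print(f"over delivered: {rate_over_delivered:.1%}")
```

Both denominators round to the 20 percent stated in the text.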
::::
4 RESULTS section report results study grouped research question
::::
41 RQ1 common nonapproved licenses packages normalization process found total 973 distinct licenses licenses declared total 4369024 times number license declarations higher number package versions given one package often employs one license showed Table 3 Table 5 shows distribution license group see nonapproved licenses licenses defined Section 323 except OSI licenses Incomplete licenses used 858311 times corresponds roughly 20 overall license usage nevertheless related absence license found 764496 package versions without license declaration accounts 89 nonapproved license usage particular RubyGems missing licenses correspond 48 total license used 1041 NPM also studied license usage evolutionary perspective order provide general overview Table 6 groups evolution patterns license changes pairwise analyzed versions available order verify many times license changed one group another results show package versions regardless package manager tend propagate license used versions Therefore main diagonal always higher values instance NPM found 311455 package versions without license associated still nonapproved license next version
::::
Table 5 License Groups Package Versions
Group            CRAN    NPM       RubyGems   TOTAL
OSI              15724   3009782   403693     3429199
INCOMPLETE       34      73647     7833       81514
SPDX (not OSI)   162     30688     6215       37065
MISSING          8       400618    396178     796804
OTHER            220     10978     4953       16151
COPYRIGHT        0       7106      1185       8291
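As a consistency check, the totals quoted in Section 41 can be recomputed from Table 5. The sketch below copies the table values verbatim. The label OTHER for the unlabeled row is inferred from the OTH column of Table 6, and the dictionary keys are illustrative names.

```python
# Table 5 values as (CRAN, NPM, RubyGems) per license group.
table5 = {
    "OSI":          (15724, 3009782, 403693),
    "INCOMPLETE":   (34, 73647, 7833),
    "SPDX_NOT_OSI": (162, 30688, 6215),
    "MISSING":      (8, 400618, 396178),
    "OTHER":        (220, 10978, 4953),   # label inferred from Table 6
    "COPYRIGHT":    (0, 7106, 1185),
}

# Nonapproved groups: everything except OSI and Incomplete (Section 323).
approved_like = {"OSI", "INCOMPLETE"}

total = sum(sum(row) for row in table5.values())
nonapproved = sum(
    sum(row) for group, row in table5.items() if group not in approved_like
)
share = nonapproved / total

print(total, nonapproved, f"{share:.1%}")
```

The sums reproduce the 4369024 license declarations and the 858311 nonapproved declarations reported in the text, which is just under 20 percent of overall usage.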
::::
Table 6 Patterns license evolution
NPM
From/To   OSI       INC     SPDX    MISS     OTH    COP
OSI       2576692   3012    2060    2125     423    116
INC       4573      61535   44      144      363    205
SPDX      2153      26      25489   182      78     56
MISS      8911      321     256     337711   87     23
OTH       502       345     99      51       9231   241
COP       200       212     58      19       267    6424
RubyGems
From/To   OSI       INC     SPDX    MISS     OTH    COP
OSI       336639    505     574     380      553    37
INC       854       6575    99      5        116    0
SPDX      618       82      5095    51       270    1
MISS      8112      329     279     324153   185    10
OTH       808       119     272     9        4197   14
COP       50        1       1       5        15     1029

Since changes approved nonapproved relevant ones study counted many times package version changed OSIapproved license nonapproved license viceversa identified changes 12491 packages RubyGems 24075 packages NPM Among package RubyGems 10442 package versions changed nonapproved approved license case publishers corrected wrong license presented Table 8 Interestingly number changes approved nonapproved license much lesser RubyGems found 2049 package versions changed approved license nonapproved license similar behavior occurred NPM number changes nonapproved license much greater opposite 16339 package versions changed nonapproved license approved one whereas 7736 package versions changed approved nonapproved one example upgrading zorg001 zorg0010 NPM package changed know ISC license license performed analysis CRAN provide information provide finegrained perspective evolution patterns analyzed top 10 common changes approved license nonapproved license viceversa Table 7 presents evolution patterns focusing changes approved nonapproved license majority changes observed changing MIT license license 1286 instances found NPM 248 RubyGems effects missing license exactly opposite developer might think applies copyright instead opening source code Therefore migration missing license MIT license explained correction effect specially due permissive characteristics license evidence supported Almeida 1 findings developers might fully understand licensing process

Table 7 10 Common License Evolution Patterns Approved NonApproved
NPM Evolution Patterns          RubyGems Evolution Patterns
mit → missing          1286     mit → missing          248
isc → missing          604      apache20 → missing     85
apache20 → missing     116      bsd3clause → missing   33
bsd2clause → missing   37       lgpl20 → missing       4
gpl30 → missing        20       gpl30 → missing        4
bsd3clause → missing   19       bsd2clause → missing   2
gpl20 → missing        12       gpl20 → missing        2
lgpl30 → missing       9        lgpl30 → missing       1
fair → missing         9        mspl → missing         1
mpl20 → missing        7        —                      —

RQ1 Summary found 1058554 packages versions 2423 released nonapproved licenses Packages published RubyGems affected ones 55 employed nonapproved license missing lack license license widespread license change occurs package versions keep license although changes nonapproved approved license viceversa common

Table 8 10 Common License Evolution Patterns NonApproved Approved
NPM Evolution Patterns          RubyGems Evolution Patterns
missing → mit          6667     missing → mit          6556
missing → isc          831      missing → apache20     614
missing → apache20     633      missing → gpl30        239
missing → bsd3clause   262      missing → gpl20        153
missing → gpl30        137      missing → bsd3clause   133
missing → bsd2clause   91       missing → lgpl30       86
missing → gpl20        85       missing → bsd2clause   81
missing → lgpl30       61       missing → artistic20   73
missing → mpl20        49       missing → agpl30       33
missing → agpl30       35       missing → lgpl21       31

42 RQ2 impact nonapproved licenses package managers ecosystem
understand impact nonapproved license calculated two types metrics irregular affected three different granularities graph order
Irregular package called irregular least one versions direct dependency package released nonapproved license package irregular means affect packages depends
Affected package affected least one versions direct indirect dependency package irregular
Direct dependency one package father affected depends child irregular
Indirect dependency one level affect irregular packages
metrics analyzed whole dependency graph package versions Table 9 shows impact nonapproved licenses terms packages versions dependencies terms packages although NPM
irregular affected packages RubyGems presents higher proportion irregular 46 vs 18 affected 55 vs 38 packages NPM suggests almost half package versions RubyGems irregular low number packages versions dependencies affected CRAN CRAN prevents absence licenses requiring package publishers choose least one license selection projected impact including indirect dependencies package version impact NPM higher RubyGems NPM packages versions provide detailed example Figure 1 shows fragment dependency graph package request0810 particular package 23205 direct dependencies 6840 irregular 42938 indirect dependencies parents Moreover omitted Figure 1 regular direct dependencies figure solid lines edges regular dependencies dotted lines edges irregular dependencies Double border lines vertexes regular package versions whereas single solid border ones irregular Table 9 Impact caused nonapproved licenses package manager Graph Order Metric CRAN NPM RubyGems Packages 11366 510964 135481 Irregular 1082 78224 62967 Proportion 0095 0153 0464 Affected 1455 194741 75475 Proportion 0128 0381 0557 Versions 11366 3539494 816580 Irregular 35 690703 440443 Proportion 0003 0195 0539 Affected 36 1619248 520967 Proportion 0003 0457 0637 Dependencies 1086 15521508 1765288 Irregular 59 1364281 1088298 Proportion 0054 0087 0616 Dotted border vertexes represents affected packages Notice package might irregular affect time also observed fragment graph three packages nonapproved missing license associated assertplus verror extsprintf worth mention package assertplus extsprintf considered regular packages dependency package version released nonapproved license Figure 1 Example affected package version dependency tree Another example occurs RubyGems package manager package activesupport actually version 426 downloaded 174538434 times entire life cycle version 400 released 2013 25th June package depending unlicensed packages minitest420 multijson133 threadsafe010 tzinfo0337 activesupport also depending 
MITlicensed package i18n064 particular version downloaded 3107216 times used 1093 another published packages directly 16526 packages taking account direct indirect dependencies package activesupport toolkit extracted Rails framework’s core provide extra perspective impact nonapproved licenses compared number irregular affected values incomplete licenses chose incomplete licenses interpreted wrong licenses since correct name version license Table 10 presents common incomplete licenses per package manager Among incomplete licenses observed package publishers using number licenses omitting version Table 10 Top 10 Incomplete Licenses CRAN NPM RubyGems License License License agpl 12 bsd 59132 bsd 4280 bsd 11 gpl 7904 gpl 1783 cecill 6 lgpl 2747 lgpl 1067 mpl 2 epl 1173 agpl 304 epl 2 mpl 854 artistic 166 bsl 1 agpl 832 epl 71 —— free 218 mpl 50 —— ibm 216 free 36 —— apl 194 osl 26 —— cecill 179 afl 16 sense Table 11 presents impact Incomplete licenses worth mention even consider incomplete licenses inconsistent licenses nonapproved licenses 9 presented higher impact Incomplete licenses instance number irregular packages caused nonapproved licenses 62154 63329 irregular packages caused Incomplete licenses RubyGems ratio difference 813362 almost 25 times higher compare affected versions RubyGems impact nonapproved licenses almost 69 times higher Incomplete Licenses general way also found NPM affected Incomplete licenses RubyGems Finally CRAN packages highly impacted Incomplete licenses mostly due lack license version behavior turns ∼11 CRAN packages irregulars affects almost 15 published packages recognize nonapproved licenses dangerous package authors publishers package managers users – create explicit publish package direct dependencies published packages – uncertainty whether dependencies desiredtopublish package regular fact package publishers look whole dependency chain However factors might imply presence irregularities package managers height package dependency 
tree presence newcomers Table 11 Impact caused Incomplete licenses package manager Graph Order Metric CRAN NPM RubyGems Packages 11366 510964 135481 Irregular 1256 94515 63329 Proportion 0110 0184 0467 Affected 1480 197626 75455 Proportion 0130 0386 0556 Versions 11366 3539494 816580 Irregular 38 825520 443072 Proportion 0003 0233 0542 Affected 38 1639430 520836 Proportion 0003 0463 0637 Dependencies 1086 15521508 1765288 Irregular 62 1759643 1098489 Proportion 0057 0113 0622 open source community might completely aware license constraints RQ2 Summary Nonapproved licenses impact packages NPM RubyGems making packages irregular affecting direct indirect dependencies Nonapproved licenses considered harmful incomplete licenses since impact higher compared amount irregular affected packages versions License group
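The two RQ2 metrics can be illustrated on a toy dependency graph. The sketch follows the definitions given in the text: a package version is irregular when it directly depends on a version released under a nonapproved license, and affected when it depends, directly or indirectly, on an irregular one. The function name, the graph encoding, and the package names are hypothetical, and this is not the authors' infrastructure.

```python
from collections import deque


def irregular_and_affected(deps, nonapproved):
    """deps maps each package version to the list of its direct dependencies;
    nonapproved is the set of versions released under a nonapproved license."""
    # Irregular: at least one direct dependency has a nonapproved license.
    irregular = {
        pkg for pkg, direct in deps.items()
        if any(dep in nonapproved for dep in direct)
    }

    # Reverse edges: for each package, which packages depend on it directly.
    dependents = {}
    for pkg, direct in deps.items():
        for dep in direct:
            dependents.setdefault(dep, set()).add(pkg)

    # Affected: walk upward from every irregular package (breadth-first).
    affected = set()
    queue = deque(irregular)
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return irregular, affected


# Hypothetical toy graph: app -> lib -> util, where util declares no license.
deps = {
    "app@1.0.0": ["lib@1.0.0"],
    "lib@1.0.0": ["util@1.0.0"],
    "util@1.0.0": [],
}
irregular, affected = irregular_and_affected(deps, nonapproved={"util@1.0.0"})
print(sorted(irregular), sorted(affected))
```

In this toy graph `util@1.0.0` is itself regular, because it has no dependency with a nonapproved license, which mirrors the remark about assertplus and extsprintf. A package can end up in both sets when it is irregular and also sits above another irregular package.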
::::
43 RQ3 developers adopt nonapproved licenses answer question report results survey 76 package publishers target population 94 male 96 work development industry 53 created contribute 30 opensource projects 18 created contribute 100 opensource projects Still 48 respondents believe 20 createdcontributed opensource projects use nonapproved license interestingly however fact 27 respondents idea many projects contribute use nonapproved license Similarly Section 41 showed evidence 18 package versions studied use nonapproved license asked use nonapproved license found 26 respondents care specific license terms Along line one respondent mentioned chose WTFPL license really don’t care use modules share code people it’s pleasure know someone finds useful Maybe wrote something really great like Facebook’s React would think fame Also 17 respondents acknowledged using nonapproved license naive decision thought appropriate Still small projects seem prone licensed nonapproved license Yet 5 respondents aware nonapproved license makes sense licensing nonsoftware projects instance fits content repository best source code repository contains data Finally developers adopt nonapproved licenses claim simpler 6 occurrences open 4 occurrences instance one respondent said likes idea WTFPL Makes everything pretty clear want Right afterwards asked whether aware implications using nonapproved license 43 respondents mentioned lack awareness mentioned aware implications asked cite one example implication Among answers found developers believe nonapproved license might limit adoption 12 occurrences example one respondent said use license others never heard others less likely contribute andor may wary using Code thefts also recurring implication mentioned 7 respondents Finally one respondent raised fact main implication using nonapproved license can’t automatically recognized machines categorize license may exclude search results particularly interesting since Github helps owners choose correct 
license repositories However Github help documentation also highlight developers responsible define correct license see paragraph GitHub provides information asis basis makes warranties regarding information licenses provided disclaims liability damages resulting using license information next five following questions asked often Q9 investigate license chose conforms license depends Q10 declare license Q11 use nonapproved license Q12 use copyright license one opensource Q13 use one license either approved Figure 2 shows results figure shows couple interesting information First see 46 respondents Never Rarely take account license used software’s dependencies believe important result discussed Section 2 licenses inconsistencies directly impact depends upon similar implications 11 respondents “Always” “Very Often” declare license One respondent even mentioned “Frequently forget declare license seems unimportant” Similarly 25 respondents “Always” “Very Often” use nonapproved license Finally 94 mentioned “Never” “Rarely” use one license either approved One respondent mentioned reasons uses one license related forkbased model “TypoPRO collection fonts font already distinct Open Source license upstream vendor TypoPRO stays union set upstream licenses” RQ3 Summary 26 respondents care license used respondents believe nonapproved licenses open simpler use Among implications 12 respondents believe nonapproved licenses limit adoption 46 respondents take license account choosing package dependency
::::
5 IMPLICATIONS research implications different kinds stakeholders Three possible groups discussed Package managers Since observed NPM RubyGems require developers inform license many packages published packages managers either 1 use license 2 state wrong incomplete license name RQ1 problem hinders researchers conducting indepth studies license usage also potential confusing developers interested using package Package managers therefore might introduce mechanisms prevent introduction wrong even nonexisting license names Researchers Although licensing established research topic notion nonapproved licenses yet fully explored RQ1 implications unclear RQ2 Researchers expand comprehension nonapproved licenses many ways First researchers could introduce mechanisms automatically detect use nonapproved licenses Still since packages tend propagate licenses releases RQ1 researchers create techniques avoid nonapproved license propagation CS Professors Educators also benefit findings study Since license common misunderstood topic among developers 1 engineering professors could bring problems related license usage classroom invite students discuss possible solutions compare perception professional developers RQ3 Similarly order make licenses appealing aspiring engineers professors use license inconsistency graph RQ2 advanced datastructure classes invite students understand license inconsistencies complex deeper graphs
::::
6 THREATS VALIDITY study proportion always many limitations threats validity First could retrieve data 2140 packages 1079 NPM packages 1052 RubyGems packages 9 CRAN packages happened packages metadata could located However packages represent 004 whole universe packages study Second normalization process manual therefore errorprone mitigated threat using pairreview work author independently analyzed set licenses subsequent conflict resolution meetings original normalized license sets available future analysis choose analyze external FILE licenses package versions hosted GitHub would require manual search license file repositories CRAN 1391 package versions file license declared NPM 19010 RubyGems 20000 package version using FILE license Third one might argue packages studied might full simple trivial projects However packages available package managers often mature compared projects hosted coding websites Github often personal projects class projects 9 Fourth rely licenses approved OSI Even license commonplace — instance found 4927 package versions using creative commons zero CC0 license 104 CRAN 3022 NPM 1801 RubyGems — still consider licenses nonapproved Although aware many institutions Free Foundation FSF Debian Foundation approve licenses decided stick OSI approval 1 licenses submitted anyone interested get OSI approve 2 licenses approved OSI commonly used — shown Table 5 licenses found dataset recognized OSI Finally double checked whether license informed package manager indeed declared official package website chose validate license used due two reasons first package publisher often core member charge declaring license used given published version one package publisher would confident state correct license used second manually studied hundreds thousands packages packages often hosted thirdparty coding website eg GitHub BitBucket store license information using distinct ways eg Github shows license name project’s first page algorithm succeed inferring license 
always case BitBucket hand explicitly demand license creating repository Additionally proper license file display license project’s cover page problem exacerbates considering license information per version release Therefore due lack standards substantial sample size performing manual process would prohibitive
::::
7 RELATED WORK Recent studies investigated licenses inconsistencies similar concept nonapproved licenses Since nonapproved licenses also introduce inconsistencies one see nonapproved subset license inconsistencies However believe implication nonapproved licenses greater known problems related licenses inconsistencies best knowledge work first analyze usage adoption NonApproved licenses also discussed impact NonApproved licenses compared incomplete licenses package manager context attracted attention practitioners researchers since NPM CRAN RubyGems growing faster becoming increasingly popular summarize related work terms licenses maintenance evolution licenses inconsistencies Di Penta et al 4 proposed method track evolution licensing investigated relevance six open source projects inconsistencies found related files without license Vendome et al 24 27 conducted large empirical study investigating developers adopt change licenses Recently Vendome et al 26 performed another largescale empirical study change history 51K FOSS systems investigate prevalence known license exceptions presenting categorization Machine LearningBased Detection algorithm identify license exceptions Santos 20 analyzed set 756 projects FLOSSmole repository Sourceforgenet data changed source code distribution allowances author found 88 projects “none” license – might leave projects exposed legally unattended – 55 times projects changed current state license one license German et al 8 investigated licenses declared packages consistent source code files Fedora ecosystem Manabe et al 15 extended proposing graph visualization understand relationships found GPL Licenses likely include licenses Apache Licenses tend contain files license authors reported changes valid license none cases nonvalid license changed valid license Wu et al 30 31 investigated license inconsistencies caused redistributors removed modified license header source code authors described categorized different types license 
inconsistencies proposing method detect Debian ecosystem authors found average 24 packages relationship “none” license however effect discussed Wu et al 29 also studied whether issues license inconsistencies properly solved analyzing two versions Debian investigating evolution patterns license inconsistencies disappear downstream projects get synchronized Lee et al 14 compared machinebased algorithms identify potential license violations guide nonexperts manually inspect violations authors reported accuracy crowds comparable experts machine learning algorithm Interesting note approximately 25 files 227 projects 794 projects analyzed license Almeida et al 1 conducted survey 375 developers understand whether understand violations assumptions three popular open source licenses GNU GPL30 GNU LGPL 30 MPL 20 alone combination authors confront answers expert’s opinion found answers consistent 62 42 cases Although previous work understanding licenses pointed “None” frequently choose files packagers neither scenario involved aspect Van der Burg et al 23 proposed approach construct analyze Concrete Build Dependency Graph CBDG system tracing system calls buildtime case study seven open source systems authors showed constructed CBDGs accurately classify sources included excluded deliverables 88100 precision 98100 recall uncover license compliance inconsistencies real systems German Di Penta 6 presented method open source license compliance Java applications authors implemented tool called Kenen mitigate potential legal risk developers reuse open source components Kapitsaki et al 11 compared tools used detect licenses components avoid license violations classifying three types License information identification source code binaries metadata stored code repositories license modeling associated reasoning actions
::::
8 CONCLUSION paper conducted largescale study nonapproved licenses terms usage impact adoption Nonapproved licenses license approved OSI Open Source Initiative released nonapproved license cannot claimed opensource original author retains rights Nonapproved licenses include licenses typos wrong names even curses even missing licenses eg package publishers fill license information mining data 657k opensource projects observed hundreds nonapproved licenses exist 24 packages released used least one nonapproved licenses majority nonapproved licenses found fact absence license Still found package publishers tend propagate license used though package versions Nonapproved licenses impact packages NPM RubyGems Incomplete licenses compared amount irregular affected packages versions Finally asked packagers publishers nonapproved license found 46 respondents take license account choosing package dependency respondents believe nonapproved licenses open simpler use hand 12 respondents believe nonapproved licenses may limit adoption future work plan investigate evolution nonapproved licenses finegrained way eg commits instead version releases would deepen understanding nonapproved licenses adopted Still since CRAN developers might diverse background eg biologists mathematicians among others plan get touch understand motivations behind usage nonapproved licenses ACKNOWLEDGMENTS work supported Fundação Araucária CNPq 40630820160 43064220164 PROPESPUFPA FAPESP 2015245273 REFERENCES 1 Almeida G C Murphy G Wilson Hoye 2017 Developers Understand Open Source Licenses 2017 IEEEACM 25th International Conference Program Comprehension ICPC 1–11 httpsdoiorg101109ICPC 2 Jailton Coelho Marco Tulio Valente 2017 Modern Open Source Projects Fail 25th International Symposium Foundations Engineering FSE 186–186 3 Eirini Kalliamvakou Tom Mens 2017 Empirical Comparison Developer Retention RubyGems Npm Ecosystems Innov Syst Softw Eng 13 23 Sept 2017 101–115 httpsdoiorg101007s1133401700304 4 Eirini 
Kalliamvakou Georgios Gousios Kelly Blincoe Leif Singer Daniel German Daniela Damian 2016 indepth study promises perils mining GitHub Empirical Engineering 21 5 2016 2035–2071 httpsdoiorg101007s1066401593935 5 Karl Fogel 2017 Producing Open Source Run Successful Free second ed OReilly Media httpwwwproducingosscom 6 German Di Penta 2012 Method Open Source License Compliance Java Applications IEEE 29 3 May 2012 58–63 httpsdoiorg101109MS201250 7 Daniel German Jesús GonzálezBaralona 2009 Empirical Study Reuse Licensed GNU General Public License Springer Berlin Heidelberg Berlin Heidelberg 185–198 httpsdoiorg101007978364202032217 8 German Di Penta J Davies 2010 Understanding Auditing Licensing Open Source Distributions 2010 IEEE 18th International Conference Program Comprehension 84–93 httpsdoiorg101109ICPC201048 9 Eirini Kalliamvakou Georgios Gousios Kelly Blincoe Leif Singer Daniel German Daniela Damian 2016 indepth study promises perils mining GitHub Empirical Engineering 21 5 2016 2035–2071 httpsdoiorg101007s1066401593935 10 Eirini Kalliamvakou Georgios Gousios Kelly Blincoe Leif Singer Daniel German Daniela Damian 2014 Promises Perils Mining GitHub Proceedings 11th Working Conference Mining Repositories MSR 2014 92–101 11 Georgia Kapitsaki Nikolaos Tselikas Ioannis E Foukarakis 2015 insight license tools open source systems Journal Systems 102 2015 72 – 87 httpsdoiorg101016jjss201412050 12 Cory Kapser Michael W Godfrey 2008 “Cloning considered harmful” considered harmful patterns cloning Empirical Engineering 13 6 2008 645–692 13 Miryung Kim L Bergman Lau Notkin 2004 ethnographic study copy paste programming practices OOPL Empirical Engineering 2004 ISESE ’04 Proceedings 2004 International Symposium 83–92 14 Sanghoon Lee Daniel German Seungwon Hwang Sunghun Kim 2015 Crowdsourcing Identification License Violations Journal Computing Science Engineering 9 4 2015 190–203 15 Yuki Manabe Daniel German Katsuro Inoue 2014 Analyzing Relationship License Packages Files Free 
Open Source Springer Berlin Heidelberg Berlin Heidelberg 51–60 httpsdoiorg10100797836425512946 16 Trevor Maryka Daniel German Germán PooCaamaño 2015 Variability BSD MIT Licenses Springer International Publishing Cham 146–156 httpsdoiorg101007978331917837014 17 OSD 2018 Open Source Definition Annotated 2018 httpsopensourceorgosdannotated 18 Gustavo Pinto Igor Steinmacher Marco Aurélio Gerosa 2016 Common Think Indepth Study Casual Contributors IEEE 23rd International Conference Analysis Evolution Reengineering SANER 2016 Suita Osaka Japan March 1418 2016 Volume 1 112–123 httpsdoiorg101109ICPC 19 Lawrence Rosen 2004 Open Source Licensing Freedom Intellectual Property Law Prentice Hall PTR Upper Saddle River NJ USA 20 Carlos Denner dos Santos 2017 Changes free open source licenses managerial interventions variations attractiveness Journal Internet Services Applications 8 1 07 Aug 2017 11 httpsdoiorg101186s1317401700623 21 E Smith R Loftin E MurphyHill C Bird Zimmermann 2013 Improving developer participation rates surveys 2013 6th International Workshop Cooperative Human Aspects Engineering CHASE 89–92 httpsdoiorg101109CHASE20136614738 22 Diomidis Spinellis 2012 Package Management Systems IEEE 29 2 2012 84–86 23 Sander van der Burg Eelco Dolstra Shane McIntosh Julius Davies Daniel German Armijn Hemel 2014 Tracing Build Processes Uncover License Compliance Inconsistencies Proceedings 29th ACMIEEE International Conference Automated Engineering ASE ’14 ACM New York NY USA 731–742 httpsdoiorg10114526429372643013 24 Christopher Vendome Gabriele Bavota Massimiliano Di Penta Mario LinaresVásquez Daniel German Denys Poshyvanyk 2017 License usage changes largescale study gitHub Empirical Engineering 22 3 01 Jun 2017 1537–1577 httpsdoiorg101007s1066401694384 25 Christopher Vendome Gabriele Bavota Massimiliano Di Penta Mario LinaresVásquez Daniel Germán Denys Poshyvanyk 2017 License usage changes largescale study gitHub Empirical Engineering 22 3 2017 1537–1577 26 Christopher 
Vendome Mario LinaresVasquez Gabriele Bavota Massimiliano Di Penta Daniel German Denys Poshyvanyk 2017 Machine Learningbased Detection Open Source License Exceptions Proceedings 39th International Conference Engineering ICSE ’17 IEEE Press Piscataway NJ USA 118–129 httpsdoiorg101109ICSE201719 27 Christopher Vendome Mario LinaresVasquez Gabriele Bavota Massimiliano Di Penta Daniel German Denys Poshyvanyk 2015 Developers Adopt Change Licenses Proceedings 2015 IEEE International Conference Maintenance Evolution ICSME ICSME ’15 IEEE Computer Society Washington DC USA 31–40 httpsdoiorg101109ICSM20157332449 28 Erik Wittern Philippe Suter Shriram Rajagopalan 2016 Look Dynamics JavaScript Package Ecosystem Proceedings 13th International Conference Mining Repositories MSR ’16 ACM New York NY USA 351–361 httpsdoiorg10114529017392901743 29 Yuhao Wu Yuki Manabe Daniel German Katsuro Inoue 2017 Developers Treating License Inconsistency Issues Case Study License Inconsistency Evolution FOSS Projects Springer International Publishing Cham 69–79 httpsdoiorg10100797833195773578 30 Wu Manabe Kanda German K Inoue 2015 Method Detect License Inconsistencies LargeScale Open Source Projects 2015 IEEEACM 12th Working Conference Mining Repositories 324–333 httpsdoiorg101109MSR201537 31 Yuhao Wu Yuki Manabe Tetsuya Kanda Daniel German Katsuro Inoue 2017 Analysis license inconsistency large collections open source projects Empirical Engineering 22 3 01 Jun 2017 1194–1222 httpsdoiorg101007s1066401694878
::::
Beyond Dependencies Role CopyBased Reuse Open Source Development MAHMOUD JAHANSHAHI DAVID REID AUDRIS MOCKUS University Tennessee USA Open Source resources open reuse introducing dependencies copying resource contrast dependencybased reuse infrastructure systematically support copybased reuse appears entirely missing aim enable future research tool development increase efficiency reduce risks copybased reuse seek better understanding reuse measuring prevalence identifying factors affecting propensity reuse identify reused artifacts trace origins method exploits World Code infrastructure begin set theoryderived factors related propensity reuse sample instances different reuse types survey developers better understand intentions results indicate copybased reuse common many developers aware writing code propensity file reused varies greatly among languages source code binary files consistently decreasing time Files introduced popular projects likely reused least half reused resources originate “small” “medium” projects Developers various reasons reuse generally positive using package manager CCS Concepts • engineering → creation management • General reference → Empirical studies Additional Key Words Phrases Reuse Open Source Development Copybased Reuse Supply Chain World Code
::::
1 INTRODUCTION reuse refers practice developing systems existing rather creating scratch 55 Starting scratch may demand time effort reusing preexisting highquality code fits required task Developers therefore opportunistically frequently reuse code 48 Programming clearly defined problems often starts search code repositories typically followed careful copying pasting relevant code 85 fundamental principle Open Source OSS lies “openness” enables anyone access inspect reuse artifact could significantly enhance efficiency development process Platforms GitHub increase reuse opportunities enabling community developers curate projects promoting improving process opportunistic discovery reuse artifacts 46 significant portion OSS intentionally built reused offering resources functionality projects 39 thus reuse categorized one building blocks OSS Indeed developers open source community seek opportunities reuse existing highquality code also actively promote wellcrafted artifacts others utilize 33 Authors’ address Mahmoud Jahanshahi mjahanshvolsutkedu David Reid dreid6volsutkedu Audris Mockus audrisutkedu Department Electrical Engineering Computer Science University Tennessee Knoxville TN USA © 2025 Copyright held ownerauthors ACM 1557739220251ART httpsdoiorg1011453715907 ACM Trans Softw Eng Methodol widely reused increases popularity maintainers providing job prospects 79 also may bring new maintainers well corporate support 46 commonly code reuse refers introduction explicit dependencies functionality provided readymade packages libraries frameworks
platforms maintained projects referred dependencybased blackbox reuse external code modified developer generally committed project’s repository relied upon via package manager Copybased reuse whitebox reuse hand refers case source code reusable artifacts reused copying original code committing duplicate code new repository may remain modified developer reuse specifically focus copybased reuse study generally accepted programs modular 75 internal implementation details exposed outside module copybased reuse exactly opposite OSS’s copybased reuse source code file even code snippet reused another may result multiple possibly modified instances source code replicated across various files repositories copies may undergo changes maintenance leading multiple different versions originally identical code existing latest releases corresponding projects Unifying multiplicity versions copybased reuse refactor single package projects could depend upon may always tractable problem Moreover reuse process continues across various projects possibly modifications data related initial design authorship copyright status licensing could lost 76 loss could impede future enhancements bugfixing efforts might also diminish motivation original authors seek recognition work lead legal complications downstream users issues impact reuse code also dependent least one package involves reused code 20 landscape Open Source OSS expands tracing origins source code identifying highquality code suitable reuse deciphering simultaneous progression code across numerous projects become increasingly challenging pose risks spread potentially lowquality vulnerable code 46 eg orphan vulnerabilities 78 Despite sustained attention potential benefits risks associated reuse exact scale prevalent practices possible negative impacts related OSSwide reuse thoroughly explored primarily due formidable task tracking code throughout entirety OSS 46 Gaining comprehensive understanding reuse practices could guide future 
research towards developing methods tools enhance productivity mitigating inherent risks associated reuse Specifically aim quantify several aspects concerning extent nature reuse OSS providing information necessary investigate approaches support common activity making efficient safer use measurement framework created Jahanshahi Mockus 46 tracks versions artifacts referred blobs 1 across repositories 1 In alignment terminology used Git version control system use term “blob” refer single version file approach first time blob committed repository identified repository blob tuples sorted based commit time first appearance unique blob repository repository earliest commit time identified originating repository person made commit recognized creator blob Reuse instances identified pairing originating repository subsequent repositories commit blob work investigates much kind wholefile reuse happens scale OSS findings could help guide future research tool development support common potentially risky activity First show existing studies ignoring “small” inactive projects miss almost half code reused even “largest” active projects necessity indepth study fully comprehend abundant yet unseen “dark matter” projects contribute reuse activity Second theorize investigate empirically properties artifacts originating projects influence likelihood file reuse addressing key question previous work predominantly focused copy detection techniques missed investigate historic reuse trends also introduce timelimited measure reuse findings reveal several surprising patterns showing copying varies programming language properties blob originating projects insights could help prioritize articulate research tool development supports common reuse patterns Third obtain responses 374 developers code reused originated respondents write code explicit expectation reused Developers reuse code several reasons concerned bugs reused code willing use package managers reused code tools provided Overall find
despite questionable reputation due inherent risks code copying common useful many developers keep mind writing code summary ask following research questions RQ1 much copybased reuse occurs factors affect propensity reuse extensive copying entire OSS landscape b copybased reuse limited particular group projects c characteristics blob affect probability reuse characteristics originating affect probability reuse RQ2 developers perceive engage copybased reuse foster reproducibility made replication package study including datasets creation scripts analysis notebooks publicly available httpszenodoorgrecords14743941
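The origin-identification step described in this introduction (first appearance of each blob per repository, earliest repository treated as the originating one, later repositories paired with it as reuse instances) can be sketched as follows. This is a minimal illustration only, not the World of Code implementation: the `Commit` record, its field names, and the `find_reuse` function are hypothetical, and the tie-break by repository name is an assumption added for determinism.

```python
from collections import defaultdict
from typing import NamedTuple


class Commit(NamedTuple):
    """One (repository, blob, time) observation; field names are illustrative."""
    repo: str
    blob: str   # content hash identifying one version of a file
    time: int   # commit timestamp, e.g. seconds since the epoch


def find_reuse(commits):
    """Return {blob: (originating_repo, [reusing repos by first commit time])}."""
    # Keep only the earliest commit time of each blob within each repository.
    first_seen = defaultdict(dict)
    for c in commits:
        seen = first_seen[c.blob]
        if c.repo not in seen or c.time < seen[c.repo]:
            seen[c.repo] = c.time

    result = {}
    for blob, per_repo in first_seen.items():
        # Sort repositories by first appearance; ties are broken by name
        # (an assumption of this sketch, not something the text specifies).
        ordered = sorted(per_repo, key=lambda r: (per_repo[r], r))
        result[blob] = (ordered[0], ordered[1:])
    return result
```

Under these assumptions, a blob seen in only one repository yields an empty reuse list, and repeated commits of the same blob inside one repository do not count as reuse.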
::::
2 BACKGROUND section structured provide comprehensive understanding context foundation research begins exploration types reuse supply chains Following delve associated risks discussing potential vulnerabilities legal issues challenges arise reuse third subsection introduces social contagion theory SCT helps select factors likely affect diffusion adoption reuse practices within open source development community 21 Reuse Supply Chains supply chain comprises various components libraries tools processes used develop build publish artifacts covers stages initial development final deployment including proprietary open source code configurations binaries plugins container dependencies infrastructure required integrate elements supply chain ensures right components delivered right places right times create functioning products reuse one form supply chain enhances efficiency reduces costs mitigates risks associated developing new scratch context open source reuse supply chains categorized based open source components integrated utilized within projects 69–71 211 Dependencybased Reuse Dependencybased reuse involves using open source libraries packages dependencies dependencies typically managed package managers NPM JavaScript pip Python Maven Java reliance dependencies introduce vulnerabilities risks properly managed 98 web application using React library turn depends numerous libraries example reuse kind supply chain 212 Copybased Reuse Copybased reuse type reuse investigated work copybased reuse code open source projects copied directly example developer might copy utility function open source repository integrate approach quick lead challenges maintaining updating copied code essential track manage copies ensure secure uptodate 56 213 Knowledgebased Reuse Knowledgebased reuse involves using knowledge practices derived open source projects without directly copying code using dependencies includes adoption development methodologies architectural patterns best practices open 
source communities example implementing microservices architecture inspired successful open source projects explicitly detailed many researchers concept knowledgebased supply chains inferred broader discussions open source influence development practices 100 22 Associated Risks reuse potentially reduce development costs always beneficial could introduce certain risks might eventually escalate overall costs risks include limited security vulnerabilities compliance spread bugs lowquality code 31 46 221 Security relationship security reuse possess dualnature system become secure leveraging mature dependencies also become vulnerable creating larger attack surface exploitable dependencies 35 context copybased reuse extensive code copying lead widespread dissemination potentially vulnerable code artifacts may reside inactive projects still publicly available others reuse potentially spread vulnerability also highly popular active projects 78 Understanding copybased supply chain helps identifying potential security risks implementing appropriate safeguards 73 Therefore detecting reused code aids identifying consistently patching vulnerabilities across affected systems 56 222 Compliance Many open source licenses come specific requirements must met Unintentional reuse code subject intellectual property IP rights licensing restrictions lead legal complications Understanding supply chain detecting reused artifacts ensures compliance licensing agreements protects IP infringements 59 100 systems evolve licenses evolve well evolution driven various factors changes legal environment commercial code licensed free open source code reused open source systems evolution licensing impact system parts subsequently reused 46 Therefore monitoring evolution important 19 However keeping track vast amount data across entire OSS landscape challenging task result many developers fail adhere licensing requirements 2 32 example investigating subset codes reused Stack Overflow environment 
revealed extensive number potential license violations 2 Even license requirements known challenge combining components different possibly incompatible licenses create application complies licenses potentially persists great importance 32 individual files reused licensing information may lost findings study might suggest approaches identify remediate problems 223 Quality Ensuring components supply chain meet quality standards essential reliability performance final product 9 Copied code thoroughly vetted tested introduce bugs defects identifying evaluating reused code organizations ensure meets quality standards 69 Code reuse assumed escalate maintenance costs specific conditions also seen prone defects inconsistent modifications duplicated code result unpredictable behavior 48 Additionally failure consistently modify identifiers variables functions types etc throughout reused code lead errors often bypass compiletime checks transform hidden bugs extremely challenging detect 58 Apart bugs introduced code reuse source code could inherent bugs low quality issues propagate similarly security vulnerabilities spread patterns reuse identified study could potentially suggest strategies leverage information gathered multiple projects reused code thereby reducing risks 23 Social Contagion Theory Reusing code instance technology adoption One key questions want ask may affect propensity adopting copying blob Social Contagion Theory SCT 14 widely used theory examining dynamic social networks human behavior context technology adoption 3 84 field engineering used explain developers select packages 64 using SCT theorize dynamics code reuse conceptualizing terms exposure infectiousness susceptibility SCT helps us frame research questions providing structured way analyze code reuse spreads within open source community Specifically explore developers become aware reusable code inherent qualities code make likely reused characteristics projects developers make likely adopt reusable 
code dimensions guide formation research questions enabling us systematically investigate factors influencing reuse activity open source key value SCT case help articulate factors affecting copy propensity via three dimensions Exposure Exposure intuitive notion order copy artifact first learn find Infectiousness Infectiousness property artifact affects propensity reused Susceptibility Susceptibility property destination developer reflects much benefit would believe would derive reusing artifact First blob infectious agent reused developer needs become aware words needs exposed open source community population Social coding platforms GitHub provide various crowdsourced signals popularity Developers may consider characteristics popularity health choosing resource use 23 61 considerations suggest developers likely exposed code popular active projects Therefore used properties proxy likelihood awareness primarily addresses RQ1b RQ1d study second concept SCT infectiousness means highly virulent infectious agent likely spread context measured characteristics blob corresponding RQ1c literature reuse primarily focused aspect reused resource final concept theory susceptibility refers vulnerability target population infectious agent case approximated characteristics target author reuses blob example use value much blob needed copies characteristics definition highly specific target making challenging measure aim shed light aspect RQ2
::::
3 RELATED WORK CONTRIBUTIONS benefits risks associated code reuse seem tangible extent types reuse across entirety OSS remain unclear prioritize risks benefits explore methods minimize maximize respectively employ approach introduced previous work 46 method allows us track copybased reuse scale commensurate vast size OSS scope copying activity fully encompassed previous studies based convenience samples illustrate results section aware curation system operates level blob finer granularity easy way determine extent OSSwide copybased reuse level Methods identifying reuse one introduced Kawamitsu et al 50 designed find reuse specific input projects easily scale detect reuse across OSS repositories 46 methods use identify characterize reuse could therefore serve foundation tools expose difficulttoobtain yet potentially important phenomenon 46 acknowledge actual extent reuse likely much higher find bloblevel granularity Nevertheless believe results present still insightful especially lower bound extent copybased reuse activity entirety OSS first differentiate copybased reuse related fields discuss contributions 31 Related Research Areas comprehensively understand copybased reuse essential discuss two closely related fields clone detection cloneandown practice Following discussion focus differentiating copybased reuse dependencybased reuse clone detection cloneandown practices situating within broader context code reuse literature 311 Code Reuse Analysis Code Reuse Analysis encompasses techniques practices aim maximize efficiency reliability development leveraging existing code Techniques static analysis dependency analysis repository mining help identify reusable components within codebase 52 methods code reuse analysis seeks reduce redundancy enhance maintainability Frakes Kang 25 show systematic code reuse significantly reduce development time costs improving quality 312 Clone Detection Clone Detection technique within code reuse analysis identifying similar identical 
code fragments codebase process involves using tools detect exact slightly modified duplicates refactored reusable components Techniques range textual tokenbased methods advanced semantic abstract syntax tree AST analyses 80 91 methods focus identifying code clones within constrained contexts often limited small code snippets within projects 92 Clone detection helps managing redundancy maintaining code quality highlighting areas code simplified reused 80 effectiveness clone detection tools validated various studies showing significant improvements maintainability 49 313 Cloneandown Cloneandown practice existing components copied modified meet new requirements approach often utilized product line engineering situations rapid development important Cloneandown allows developers quickly adapt existing solutions lead maintenance challenges due proliferation similar independently maintained code fragments 54 82 practice common open source development involves significant modifications independent maintenance often leading divergent development paths 7 30 clone detection focuses technical identification code snippets cloneandown practice highlights importance customization independent management forked projects cloneandown practice involves technical customization significant social factors community engagement governance models understanding aspects important managing forked projects 7 30 Although cloneandown supports purpose code reuse facilitating quick adaptation often results code duplication complicating longterm maintenance Research shown cloneandown prevalent practice due simplicity effectiveness short term 4 314 Copybased Reuse Copybased reuse form code reuse involves copying existing code potentially modifying use new contexts method allows rapid development shares maintenance challenges associated cloneandown duplicated code must managed across different parts summary code reuse analysis encompasses techniques like clone detection manage redundancy practices like
cloneandown adapt existing code new purposes clone detection code reuse analysis share goal improving code quality maintainability identifying managing redundancy cloneandown focuses rapid adaptation rather efficient redundancy management despite serving similar purpose promoting reuse copybased reuse clone detection address code duplication differ significantly methodologies scopes Copybased reuse research exemplified work provides broader ecosystemlevel perspective incorporating social aspects characteristics entire projects contrast clone detection focuses technical identification code snippets within specific contexts cloneandown practice emphasizes customization independent maintenance forked projects 32 Contributions contribution work three aspects follows 321 Accuracy study leverages World Code WoC infrastructure analyze reuse nearly entire open source landscape allows capture instances copying would missed subset public repositories analyzed contrast previous studies often focused samples mostly “popular” repositories drawn specific communities subsets programming languages either mostly concentrated specific community eg Java language Android apps etc 21 39 40 43 68 86 sampled single hosting platform eg GitHub 33 34 consequently prevented identification intercommunity outofsample copies Even research comprehensive programming language coverage study Lopes et al 60 studies Hata et al 41 42 analyze subset programming languages additionally use convenience sampling methods excluding less active “unimportant” repositories results demonstrate even inactive “small” projects appear provide many artifacts reused OSS even “largest” active projects Existing literature code cloning primarily focuses empirical studies case studies tool evaluations Empirical studies typically analyze code clones within specific projects samples open source repositories datasets large exhaustive entire OSS ecosystem example studies Juergens et al 48 Roy et al 81 examine hundreds 
thousands files repositories providing valuable partial insights Case studies offer indepth analysis cloning practices within individual projects organizations giving detailed context limiting scale specific cases study Tool evaluations involve benchmark studies clone detection tools evaluating performance curated datasets studies contribute important information tool effectiveness cover entire OSS ecosystem Unlike studies rely selective sampling analysis encompasses nearly entire open source ecosystem providing broad necessary foundation understanding code reuse fundamental requirement accurately tracking origin files within entire OSS helps uncover accurate trends patterns would biased analyses based samples data offering accurate understanding reuse practices 322 Methodology Focus Copybased reuse explored thoroughly dependencybased reuse eg 15 26 74 example Mili et al 66 shown dependencybased reuse lead sustainable architectures promoting componentbased design reducing redundancy Additionally Brown Wallnau 11 demonstrated leveraging welldefined interfaces reusable libraries dependencybased reuse significantly improve maintainability scalability Nevertheless similar analyses exist regarding copybased reuse Copybased reuse potentially less important much less understood form reuse 46 studies copybased reuse domain focus clone detection tools techniques 1 40 47 81 97 rather characteristics entire source code files possibly make reuse less likely Furthermore almost studies reviewed focus solely source code reuse whereas track artifacts whether code reusable development resources 46 using World Code research infrastructure encompasses nearly entire OSS ecosystem identified analyzed copying activity scale first time contrast clone detection primarily involves identifying similar code snippets within specific directories domains 45 90 research addresses broader context entire files diverse artifacts across OSS ecosystem providing comprehensive understanding reuse 
method bridges clone detection cloneandown approaches detecting instances reuse whether kept without changes modified reuse thereby encompassing technical managerial aspects code reuse existing clone detection literature several methods employed identify code clones methods include textbased tokenbased treebased graphbased techniques Textbased methods detect clones comparing raw text straightforward less accurate due variations formatting Tokenbased methods improve converting code tokens detecting similarities abstract level enhancing accuracy still susceptible variations code structure Treebased methods parse code abstract syntax trees ASTs identify clones comparing trees providing structured semantically meaningful detection Graphbased methods abstract code control flow data flow graphs allowing detection complex semantic clones 81 clone literature primarily employs detection methods understand broader landscape code cloning example Juergens et al 48 utilized combination techniques analyze cloning practices projects methods effective identifying different types clones exact parameterized semantic clones often focus similarities patterns rather exact matches contrast research employs method focused identifying reuse bloblevel specifically detecting exact versions code copied misses instances single code snippet copied approach rely abstractions patterns method involves obtaining hashes versions entire open source ecosystem detect identical code segments ensuring every version code tracked origin exhaustive detailed approach allows comprehensive analysis copybased supply chains OSS level Since supply chains form network entire OSS feasible study sampling projects representative samples large graphs notoriously difficult obtain see eg 57 addition ensuring entire file copied committed method easily scales entire OSS ecosystem avoids need look similarities among tens billions versions utilizing hashes Traditional clone detection techniques would need substantially 
modified work scale discuss potential approaches Section 81 323 Influencing Factors Social Aspects study explores characteristics OSS projects influence propensity artifacts reused examining social aspects Previously focus primarily desired functionality code 29 87 also investigate social aspects phenomenon open source community literature clone detection research explore social aspects code reuse different perspectives varying emphases social technical factors Existing literature clone detection primarily focuses technical aspects identifying code clones understanding impact maintenance quality instance studies Juergens et al 48 Roy Cordy 80 delve reasons code cloning improving productivity learning avoiding reimplementation similar functionalities studies often highlight technical motivations behind code cloning reusability rapid prototyping also touch upon social aspects like collaborative development knowledge sharing within teams However primary emphasis remains technical detection management code clones contrast research takes broader view examining characteristics open source projects influence propensity artifacts reused includes detailed analysis social technical factors study explores diverse motivations implications reuse OSS community considering aspects size community engagement collaborative nature OSS development highlight importance social dynamics code reuse including factors like community contributions reputation projects collaborative environment fosters code sharing reuse examining social technical factors study provides comprehensive understanding motivations behind code reuse OSS community draw parallels factors influencing copybased reuse ease access code open collaborative nature OSS projects role community support documentation broader perspective allows us highlight diverse sometimes conflicting motivations code reuse ranging technical efficiency social recognition collaborative learning
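The blob-level matching described in Section 3.2.2 relies on comparing hashes of whole file versions rather than on similarity search. The sketch below is only an illustration: it assumes the hashes are Git blob object IDs (SHA-1 over a `blob <size>\0` header plus the file bytes), which the text does not state explicitly, and `git_blob_sha1` is a hypothetical helper name.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the Git object ID of a blob holding `content`."""
    # Git hashes the header "blob <size>\0" followed by the raw file bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Two byte-identical files map to the same ID, wherever they are committed.
original = git_blob_sha1(b"print('hi')\n")
exact_copy = git_blob_sha1(b"print('hi')\n")
# A single extra space produces a different blob, so this copy is not matched.
modified_copy = git_blob_sha1(b"print('hi') \n")
```

Because equality of IDs stands in for equality of content, detection reduces to grouping by hash instead of pairwise comparison, at the cost of missing any copy that was edited before being committed.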
::::
4 METHODOLOGY begin briefly describing World Code infrastructure utilized study followed presenting methods introduced previous work 46 identify instances copying Next explain time complexity method discuss rationale behind choice second third subsections discuss methods used answer research question detail make subsequent discussion precise first introduce definitions time unique blob b first committed P denoted $t_b(P)$ first repository $P_o(b) = \text{ArgMin}_P\, t_b(P)$ referred originating repository b first author creator pairs consisting originating commit destination one subsequent commits producing blob $(P_o(b), P_d(b))$ identified reuse instances reuse propensity likelihood blob copied least one modeled based type file represented blob activity popularity characteristics originating projects 41 Identification Reused Blobs 411 World Code Infrastructure Finding duplicate pieces code tracking revisions code across open source projects data computationintensive task due vast number OSS projects hosted numerous platforms 46 Previous studies reuse consequently often focused relatively small subset open source potentially missing full extent reuse could obtained nearly complete collection 46 World Code WoC 62 63 infrastructure aims address challenges regularly discovering retrieving indexing crossreferencing information new updated version control repositories publicly available WoC operationalizes copybased reuse mapping blobs versions source code commits projects created means copybased reuse detected entire file duplicated without alterations 46 reuser commits reused blob making modifications method find however commit making alterations original file identified Given study focuses solely wholefile copying activity Consequently different versions originally file treated distinct entities since different blobs 412 Deforking understand reuse across entirety open source important identify distinct projects Git commits based Merkle Tree structure uniquely identifying modified blobs therefore
shared commits repositories typically indicate forked repositories distributed version control system VCS Git facilitates cloning via git clone GitHub fork button resulting numerous repositories serve distributed copies feature enables distributed collaboration also leads many clones original repository 72 differentiate copybased reuse forking use deforking map p2P provided WoC 72 Using community detection algorithms map provides clearer picture distinct projects linking forked repositories p single deforked P based shared commits advantage map using fork data platforms like GitHub WoC’s p2P map based shared commits providing higher recall missing forks occur GitHub’s forking option rather cloning repository Additionally forks clones hosted different platforms cannot traced easily WoC map platformindependent constraint Moreover forks may diverge significantly original repository still considered forks hosting platforms WoC’s deforking algorithms use community detection via shared commits forks diverge substantially via maintenance forking community detection algorithm would recognize distinct projects reduces false positives increases precision Whenever mention “project” paper actually referring “deforked project” defined ensures discussions reuse based unique instances development projects rather duplicated efforts forks 413 Dataset Creation understand identification reused blobs important explain dataset used 46 created Despite key relationships WoC offers several obstacles resolved initial step pinpoint first instance denoted tbP approximately 16 billion blobs appeared almost 108 million projects goal first c2fbb map result diff commit commit file blob old blob lists blobs created commit joined c2dat map full commit data obtain date time commit result joined c2P map commit identify projects containing commit 2See httpsgithubcomwochacktutorial information WoC map naming convention result new c2btP map commit blob time create timeline blob data sorted blob time 
resulting b2tP map b P blob time deforked contain desired timeline Finally blob timelines (footnote 3) used identify instances reuse Ptb2Pt map first originating project (footnote 4) second destination reused blob meaning blob created later time resulting Ptb2Pt map contains instances blob reuse data flow reuse identification shown Figure 1
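The timeline step above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the WoC implementation: `find_reuse` and the sample rows are hypothetical, rows are taken to be (blob, time, project) tuples, and the output field order merely echoes the Ptb2Pt map name. Only the first appearance of a blob in each project is kept, the earliest project is treated as the originating project, and every later project yields one copy instance.

```python
from itertools import groupby
from operator import itemgetter


def find_reuse(b2tP_rows):
    """Yield (origin_project, origin_time, blob, dest_project, dest_time)."""
    # Sort by blob, then by time, mirroring the sorted b2tP timeline.
    rows = sorted(b2tP_rows, key=itemgetter(0, 1))
    for blob, group in groupby(rows, key=itemgetter(0)):
        first_seen = {}
        for _, t, project in group:
            # Keep only the earliest appearance of the blob in each project.
            first_seen.setdefault(project, t)
        # Dict order follows insertion order, which here is time order.
        timeline = list(first_seen.items())
        origin, t_origin = timeline[0]
        for dest, t_dest in timeline[1:]:
            yield (origin, t_origin, blob, dest, t_dest)


# Hypothetical toy data: b1 starts in P1 and later appears in P2 and P3.
rows = [
    ("b1", 30, "P2"), ("b1", 10, "P1"), ("b1", 20, "P1"),
    ("b1", 40, "P3"), ("b2", 5, "P1"),
]
instances = list(find_reuse(rows))
```

A blob that never leaves its first project, such as `b2` here, produces no instances. At full scale the in-memory sort would be replaced by an external sort over partitioned files.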
::::
414 Time Complexity Analysis evaluate complexity time requirements methodology identifying reuse analyze time complexity step provide benchmark execution time typical computer setup overall time complexity dominated sorting operations involved processing large maps Data preparation joining involve merging precalculated maps WoC namely c2fbb c2P c2dat maps Since maps already sorted split 128 partitions join complexity $128 \times O(l + n)$ l n number rows maps respectively drop commit hashes sort joined b2tP map based blob time computationally intensive step complexity $O(n \log n)$ n total number rows b2tP map Identifying reuse instances given data already sorted blob complexity $O(n)$ n total number copy instances Using highperformance workstation benchmark 8core processor 3.5 GHz 128 GB RAM 2 TB SSD calculate execution time step Data preparation joining lineartime merge primarily involve reading writing large files sequential readwrite speed approximately 500 MB/s SSDs joining maps total size around 128 billion rows expected take roughly 1-2 hours Sorting created b2tP map requires external sorting 74 billion rows necessitates multiple passes data Based empirical data modern external sorting algorithm 8 cores handle around 0.5 billion rows per hour Hence sorting map would take approximately 148 hours Identifying reuse instances involving efficient IO operations estimated take 4-6 hours total entire process estimated take approximately 153-156 hours (about 6.5 days) Detecting code reuse finer granularity bloblevel syntax tree parsing text similarity techniques would offer comprehensive view code reuse However methods involve several computational challenges resource constraints making impractical study Parsing abstract syntax tree AST file detect structural similarities involves several computational steps First file must parsed AST representation operation $O(n)$ n total number unique blobs dataset 16 billion blobs parsing step alone would extremely resourceintensive Following parsing comparing AST identify
potential reuse instances would require pairwise comparisons pairwise comparison complexity $O(n^2)$ resulting infeasible $O((16 \times 10^9)^2)$ complexity Text similarity measures hand Levenshtein distance cosine similarity involve comparing blob’s contents every blob methods typically operate complexity $O(n^2)$ pair files resulting infeasible $O((16 \times 10^9)^2)$ complexity Even optimizations like localitysensitive hashing approximation techniques scale data renders approach impractical Given significant computational complexity resource requirements detecting code reuse finer granularity bloblevel not feasible study Instead chosen focus bloblevel reuse detection provides practical scalable solution approach limited detecting exact file copies ensures analysis remains within bounds available computational resources time constraints thereby enabling thorough efficient examination code reuse OSS landscape
Footnote 3: All first commit time creating blob dropped blob often reused within repository
Footnote 4: See section 7 limitations identifying originating
Fig 1 Reuse Identification Data Flow Diagram
42 RQ1 much copybased reuse occurs factors affect propensity reuse 421 RQ1a extensive copying entire OSS landscape investigate widespread wholefile copying OSS actually first want establish baseline fraction blobs ever reused reused many downstream projects Specifically RQ1a showing number blobs originating well destination projects deforked copy instances across entire OSS ecosystem numbers not estimates actual numbers calculated complete dataset 422 RQ1b copybased reuse limited particular group projects One may argue results RQ1a not necessarily important “small” projects may reuse code copybased manner see actually case randomly sampled 5 million reuse instances 128 files data divided based first two bytes hash blobs resulted total 640 million instances analysis approach ensured sample distributed across entire dataset capturing diverse range copy instances sample size 640 million instances constitutes approximately 2.67%
entire dataset Although small fraction data sufficiently large ensure statistical reliability representativeness analysis large absolute size sample guarantees statistical reliability according Central Limit Theorem going need define qualitative importantly subjective terms “small” “big” projects quantitative justified measures Crowston Howison 17 Koch Schneider 51 shown activity measured commit frequency strong indicator health sustainability Additionally use stars metric wellsupported literature represent form user endorsement correlated visibility perceived quality 77 choose two metrics number commits number stars indicators project’s activity popularity Commits reflect ongoing development maintenance efforts important sustainability evolution Stars hand reflect community’s interest endorsement indicating project’s visibility influence metrics widely used empirical engineering research evaluate health impact open source projects 8 47 define projects 100 commits 10 stars “big” projects mean 3rd quantile values number commits dataset 46 12 respectively aligns established practices literature thresholds often set significantly average isolate highly active projects setting threshold double mean ensure topperforming projects classified big Similarly threshold 10 stars set based mean 233 3rd quantile value 0 stars indicates majority projects receive stars reflecting popularity community engagement levels selecting projects least 10 stars focus significant community recognition capturing less 1 dataset representing influential projects thresholds chosen “small” group hand projects stars fewer 10 commits ensure projects indeed small inactive approach ensures small group comprising 62 projects includes minimal activity engagement consistent findings Gousios Spinellis 37 large proportion open source projects relatively inactive consider projects fall either big small categories “medium” group medium group captures middle ground excluding extremes thus providing balanced 
representation majority active projects Using taxonomy counted number unique blobs involved copy instances groups mentioned blob several downstream projects necessarily fall group Therefore considered biggest downstream analysis purposes example blob originated medium reused big small count “medium big” category Considering biggest downstream unique blob ensures significant reuse instances captured approach supported research indicating impact code reuse often determined size activity downstream projects utilizing code 68 95 focusing largest downstream ensure analysis reflects substantial influential reuse cases particular blob 423 RQ1c characteristics blob affect probability reuse third part research question RQ1 focuses properties reused artifacts address obtained large random sample blobs comprising 1128 blobs point unlike RQ1b randomly sampled copy instances meaning blobs involved reused least sampling b2tP map includes blobs whether reused dataset divided 128 files based first two bytes blob hash Hash functions design distribute input data evenly across output space use hash functions divide data ensures uniform distribution across resultant files 67 using one 128 files sample given vast size dataset ensure unbiased representation entire dataset sample size sufficient achieve high statistical power accuracy analyses employed logistic regression model response variable one reused blobs zero nonreused blobs Logistic regression robust statistical method used model probability binary outcome based one predictor variables widely used empirical engineering understand factors influencing development practices 44 using logistic regression quantify effect various predictors likelihood blob reused research question concerned infectiousness based Social Contagion Theory Specifically looking properties artifacts affect propensity reused first predictor model programming language blob Different programming languages associated distinct package managers development 
environments community cultures influence reuse practices 6 example ease dependency management languages like Python via pip JavaScript via NPM might facilitate reuse languages less mature package management systems Thus including programming language predictor helps capture contextual differences anticipate source code programming languages C lack package managers likely copied frequently source code languages sophisticated package managers JavaScript second predictor time blob creation factor helps account temporal dynamics indicating period blob created reflecting different reuse practices time hypothesize older blobs likely reused due fewer available reusable artifacts OSS landscape time However time creation inherently includes effect blob’s availability duration $t_b(P_d) - t_b(P_o)$ meaning older blobs time discovered reused Previous research Weiss Lai 95 indicates age visibility code artifacts influence reuse isolate examine influence creation period without confounding effect longer availability introduce concept timelimited reuse focusing copies occurring within specific time intervals blob’s creation remove advantage longer visibility better assess creation period influences reuse (footnote 5) evaluated oneyear twoyear intervals found similar results evaluating intervals finding similar results enhance robustness conclusions maintain conciseness avoid repetition report findings twoyear interval Reporting twoyear interval results provides balance sufficient observation time reuse events practical need concise reporting Consequently excluded blobs created after May 1 2020 ensuring blobs least two years potentially reused providing consistent time frame analysis 96 approach ensures findings not skewed varying availability periods
Footnote 5: definition used solely purposes regression model subsequent analysis not applied RQ1a RQ1b RQ2
third predictor whether blob source code binary hypothesize binaries identified git treatment file extensions like tar jpeg zip may exhibit different reuse patterns
compared source code expect binary files images might copied often easy understand reuse difficult recreate Unlike types files developers cannot easily extract specific parts functionalities binary files source code blobs directly reusable modifiable whereas binaries might reused asis without modification distinction important affects ease necessity reuse 27 Therefore comes wholefile reuse definition reuse work anticipated binary blobs likely copied last factor hypothesize might affect propensity blob reused size size blob influence reuse several reasons Larger blobs may contain functionality making attractive reuse Conversely smaller blobs may simpler integrate existing projects Previous research Capiluppi et al 12 Mockus 68 indicated size code artifacts impact maintainability comprehensibility ultimately reuse investigate whether difference exists sizes copied noncopied blobs exclude binary blobs analysis size binary blobs comparable size source code blobs due fundamentally different nature Binary blobs often include compiled code media files compressed archives provide meaningful comparison plain text source code terms size differences incorporate blob size predictor logistic regression model Including binary blobs could skew results lead misleading conclusions Instead perform ttest compare sizes copied blobs noncopied blobs ttest robust statistical method used determine whether significant difference means two groups 88 applying ttest rigorously assess whether blob size influences likelihood reuse 424 RQ1d characteristics originating affect probability reuse fourth part RQ1 concerns chances finding aware blob approximated signals level exposure factor Social Contagion Theory conduct study use WoC’s MongoDB database randomly sample one million projects comprising nearly 1 projects indexed WoC achieve balance statistical validity computational feasibility sample size one million large enough provide representative snapshot entire population search reuse instances 
C1 3 C1 3 Ptb2Pt map determine originated least one reused blob logistic regression model response variable one introduced least one reused blob zero otherwise constructed predictors projectlevel model include number commits blobs authors forks earliest commit time activity duration time first last commit binary ratio ratio binary blobs total blobs programming language also use number GitHub stars predictor data WoC number stars sourced GHTorrent 36 choice predictors model based current literature relevant properties Number Commits Number commits strong indicator activity maintenance Koch Schneider 51 show projects higher commit frequencies tend active development likely reused due perceived reliability continuous improvement Number Blobs Number blobs represents volume content potential reusable components Larger projects blobs likely offer opportunities reuse 68 also indicate project’s complexity modularity Projects files may modular provide reusable components • Number Authors Number authors reflects collaborative nature Projects contributors tend diverse expertise supports innovation decentralized communication improving development process 17 potentially increasing likelihood reuse • Number Forks Number forks proxy project’s popularity community engagement Projects forks often viewed valuable trustworthy 93 increasing reuse potential • Earliest Commit Time Activity Duration Earliest commit time activity duration provide insights project’s maturity stability Older longactive projects likely wellestablished reused 28 • GitHub Stars GitHub stars form social endorsement indicating community approval interest Projects stars likely considered highquality reliable making attractive reuse 8 • Binary Ratio Binary ratio defined ratio binary blobs total blobs impact reuse potential Binary blobs compiled code media files often indicate prepackaged functionalities resources ready use higher binary ratio may suggest provides readytouse components facilitate reuse 68 
Regarding language assignment bloblevel WoC’s b2sl map used blob language detection based file extensions method straightforward effective identifying programming languages individual blobs Nevertheless assigning primary language complex due use multiple languages projects WoC’s MongoDB database provides counts files language extension allowing us pick frequent extension project’s main language study considered subset blobs specifically originating blobs blobs first seen OSS within assumed common language among blobs project’s primary language approach aligns practice determining dominant language based primary contributions 94 43 RQ2 developers perceive engage copybased reuse second research question study aims triangulate quantitative results understand developers perceive engage copybased reuse quantitative research often focuses metrics frequency intensity duration behavior qualitative methods better suited explore beliefs values motives underlying behaviors 13 Using questionnaire triangulation allows us obtain selfreported data confirm challenge quantitative findings method helps identify discrepancies provides deeper understanding participant behavior 18 study questionnaire included direct question “Did create copy file” gather selfreported data whether participants copied blob offering direct measure compare quantitative results Additionally based Social Contagion Theory SCT hypothesize characteristics destination andor author influence reuse activity However treating reusers could problematic developers may fundamentally different reasons reuse Motivations reuse vary widely based individual needs requirements perceived benefits reused code 24 68 primary focus understand motivations categorize different types reuse potentially providing insight measuring susceptibility future research categorizing motivations aim identify distinct patterns factors influencing reuse behavior facilitating development targeted strategies enhance code reuse practices approach 
aligns qualitative research methods seek explore complex phenomena detailed contextualized analysis 16 gain insights motivations behind copybased reuse conducted online survey targeting authors commits introducing reused blobs authors commits originating repositories survey aimed capture range experiences perceptions related copybased reuse 431 Survey Content Questions survey included questions nature file needed chosen whether developers would use tools manage reused files General questions repositories developers’ expertise also included Notably question reason needing file openended capture unbiased detailed responses motivations reuse questions optional except first one asked respondent created reused file chose directly ask developers choose copy avoid provoking legal ethical concerns copybased reuse reason instead asked “Why file needed help project” Furthermore asked developers file resides intended used people Understanding whether creators intend resources reused helps assess cultural strategic aspects OSS development significant portion creators design code reuse mind indicates collaborative ecosystem resources shared built upon also asked series Likert scale scale 1 5 questions follows “To extent file help you” Gauging helpful creators reusers find reused blobs provides quantitative data perceived value reused code Comparing ratings creators reusers highlights discrepancies alignment perceived usefulness “To extent concerned potential bugs file” Investigating reusers’ concerns bugs reused code sheds light perceived risks associated practice Understanding level concern indicate much trust reusers place original code’s quality “How important know original file changed” Understanding reusers’ concerns changes original files helps identify potential issues related stability continuity reused code Frequent changes disrupt functionality dependent projects “How likely would use package manager could handle changes file one” Understanding likelihood reusers 
adopting package manager available provides insights demand tools streamline manage code reuse 432 Sampling Strategy ensure representative comprehensive sample stratified data along several dimensions Stratified sampling ensures relevant subgroups adequately represented survey enhancing generalizability findings 16 considering multiple dimensions productivity popularity copying patterns file types temporal aspects ensure comprehensive analysis captures diversity reuse behaviors OSS community Productivity Popularity Based number commits stars differentiated high low productivitypopularity projects similar RQ1b Copying Patterns distinguished instances files copied versus multiple files might indicate different reuse behaviors File Extension included various file types programming languages capture diverse range reuse scenarios 6The survey procedure approved institutional review board ensuring adhered ethical guidelines research involving human subjects 7See online appendix survey questions • Temporal Dimensions considered blob creation time delay creation reuse understand temporal patterns reuse behavior 433 Survey Design copy instance targeted author commit introducing blob destination repository author commit originating repository dual perspective allowed us capture originator’s reuser’s viewpoints offering comprehensive understanding reuse dynamics conducted three rounds surveys progressively expanding sample size refining questions based feedback preliminary results chose conduct survey three steps ensure thorough iterative approach understanding developer motivations behind copybased reuse handpicked 24 developers 12 creators 12 reusers initial survey openended questions round aimed gather indepth qualitative data identify key themes small purposive sample size allows deep exploratory insights important initial stages qualitative research 38 survey sent 724 subjects 329 creators 395 reusers mix openended multiplechoice questions round helped validate refine 
themes identified first round increased sample size round provides data ensure themes patterns observed idiosyncratic rather indicative broader trends intermediate sample size balances need extensive data still allowing qualitative depth 65 survey expanded 8734 subjects 2803 creators 5931 reusers questions multiplechoice facilitate quantitative analysis except openended question reason needing file large sample size final round ensures findings statistically significant generalizable across broader population developers involved copybased reuse sample size aligns recommendations achieving sufficient statistical power survey research 53 reason behind seemingly random numbers survey subjects three rounds sampling data perform data cleansing preparation reach survey target audience process normally caused samples removed Initially chose sample sizes 30 1000 10000 respondents three rounds respectively data cleansing process actual numbers lower 434 Thematic Analysis thematic analysis allows us systematically identify patterns themes within qualitative data providing deep insights reasons behind copybased reuse 10 analyze survey responses followed structured thematic analysis process outlined Yin 99 Compiling First author compiled responses Disassembling author individually analyzed coded responses identify ideas concepts similarities differences 5 89 Reassembling coded responses organized meaningful themes author independently focusing identifying different types reuse 10 Interpreting Concluding authors discussed compared themes clarifying organizing ensure coherent comprehensive understanding final themes used reclassify interpret survey responses
::::
5 RESULTS DISCUSSIONS numbers presented section derived version U WoC recent version available time analysis
::::
8 explicitly disclosed email address public profile 9 httpsbitbucketcomswscoverview 51 RQ1 much copybased reuse occurs factors affect propensity reuse 511 RQ1a extensive copying entire OSS landscape identified nearly 24 billion copy instances unique tuples containing blob originating destination projects encompassing 1 billion distinct blobs approximately 16 billion blobs entire OSS landscape approximated WoC 69 blobs reused least reused blob copied average 24 projects see Table 1 Count Total Reuse instances 23914332270 Blobs 1084211945 15698467337 69 Originating projects 31706416 107936842 294 Destination projects 86483266 107936842 801 Nearly 32 million projects 30 nearly 108 million deforked OSS projects indexed WoC originated least one reused blob 86 million projects copied blobs meaning 80 OSS projects reused blobs another least RQ1a Key Findings 1 identified nearly 24 billion copy instances encompassing 1 billion distinct blobs 2 69 blobs entire OSS reused least 3 30 OSS projects originated least one reused blob 80 projects reused blobs least extensive reuse observed highlights efficiency gains OSS development projects benefit existing code accelerate development cycles reduce costs widespread reuse also raises security concerns vulnerabilities copied code propagate across numerous projects necessitates improved vulnerability detection management practices ensure integrity reused code Additionally License violations due improper code reuse lead legal challenges compliance issues underscoring importance clear licensing adherence open source policies Furthermore identification bloblevel reuse accounts exact matches slight modifications suggests actual extent code reuse might even higher findings advocate development better tools infrastructure manage copybased reuse including automated detection security legal risks tools maintaining code quality reused components 512 RQ1b copybased reuse limited particular group projects numbers already demonstrate prevalence 
copybased reuse OSS community understand reuse activity distributed across different groups projects constructed contingency table explained methods section blob’s originating unique falls one three categories big medium small However downstream projects unique consider largest downstream blob analysis revealed nearly 112 million unique blobs reused 640 million sample copy instances nearly 13 million blobs reused least one big see Table 2 indicates 11 blobs reused least least one big showing copybased reuse limited small projects widespread phenomenon OSS community Table 2 Blob Counts Reuse Sample (rows: upstream project size; columns: biggest downstream project size Big, Medium, Small, then row total and share of all blobs) Big upstream: 6748621, 22273811, 6515122, total 35537554 (31.8%); Medium upstream: 5348651, 36434732, 14552148, total 56335531 (50.3%); Small upstream: 691644, 10151838, 9231618, total 20075100 (17.9%); Column totals: 12788916 (11.4%), 68860381 (61.5%), 30298888 (27.1%), overall 111948185 However still unclear reused blobs predominantly introduced big projects case one could presume blobs mostly good quality errorprone making costs managing tracking code propagation reuse potentially outweigh benefits Sampling copy instances revealed big projects responsible 30 reused blobs remaining 70 introduced medium small projects Specifically nearly 18 blobs introduced small projects remaining 50 coming medium projects Furthermore even big projects almost 50 blobs reuse originate medium small projects see Table 2 Therefore evident big projects serve upstream sources copybased reuse Indeed many blobs introduced medium small projects widely reused Even widely reused blobs exclusively introduced big projects copybased reuse still requires management several reasons example security vulnerabilities may continue spread even main fixed issue 78 RQ1b Key Findings 32 reused blobs originate big projects comprise 1 total projects 18 reused blobs originate small projects make 62 total projects 50 reused blobs originate medium projects represent 37 total projects Nearly 50 blobs reused big projects originate medium small projects
highlighting significant crosscategory reuse findings demonstrate nonnegligible portion reused code OSS community comes medium small projects challenging assumption highquality code predominantly originates large projects implies diverse quality spectrum reused code underscores importance ensuring quality security across sizes vulnerabilities smaller projects propagate widely Tools track origin usage blobs essential ensure timely updates fixes across OSS ecosystem mitigating risks associated vulnerabilities outdated code widespread nature code reuse across projects sizes emphasizes need quality assurance effective management community collaboration maintain health sustainability OSS landscape 513 RQ1c characteristics blob affect probability reuse section first demonstrate reuse trends followed logistic regression model predicting probability blob reused Additionally present reuse propensity per language show difference blob size reused nonreused blobs Finally discuss case study using JavaScript example (Table 2 share blobs reused big projects originating medium small projects: $\frac{5348651 + 691644}{12788916} \approx 0.47$) Reuse Trends explained methods section use 2yearlimited copying definition RQ1c RQ1d models results means consider blob reused reused within 2 years creation definition 7.5% blobs reused Figure 2a shows total counts new blobs copied blobs quarter since year 2000 (see footnote 11) counts exhibit rapid growth although growth new blob creation appears outpace copying investigate difference Figure 2b shows reuse propensity measured via reuse ratio reused blobs divided total blobs confirming new blob creation outpaced copied blobs since 2006 ratio began decline Fig 2 Quarterly Reuse Trends Logistic Regression Model expect nature blob affect propensity reused test hypothesis use logistic regression model response variable set one blob copied least ie committed least two projects within two years creation zero otherwise used WoC definition programming language associated blob categorized less common programming languages sample “other” descriptive
statistics variables presented Table 3 Variable Statistics Reused Yes 6419388 75 78136705 925 Language Counts JavaScript 11122849 Java 4579458 C 3460733 65393053 Creation Time Date 5 Median 7292012 Mean 272018 95 5282017 Binary Yes 18516721 218 66039372 782 textsuperscript11The number projects blobs much smaller 2000 sample dataset predominantly composed blobs written JavaScript significant counts also Java C Additionally distribution blob creation time provided showing median date February 7 2018 Furthermore notable proportion blobs 218 binary results logistic regression model shown Tables 4 5 model shows coefficients predictors statistically significant pvalues less 00001 meaning impact probability blob reused see Table 4 Estimate Std Error z value Prz Intercept 180293 00186 96707 2 × 1016 Binary 04775 00010 46016 2 × 1016 Creation Time 08108 00010 82834 2 × 1016 C 07142 00017 42632 2 × 1016 C 01277 00033 3815 2 × 1016 Go 03095 00065 4774 2 × 1016 JavaScript 00832 00015 5621 2 × 1016 Kotlin 05606 00133 4202 2 × 1016 ObjectiveC 00810 00066 1230 2 × 1016 Python 00327 00030 1097 2 × 1016 R 04070 00083 4922 2 × 1016 Rust 00879 00095 930 2 × 1016 Scala 06168 00123 5021 2 × 1016 TypeScript 01827 00046 3938 2 × 1016 Java 00794 00019 4237 2 × 1016 PHP 03561 00024 15114 2 × 1016 Perl 07664 00082 9295 2 × 1016 Ruby 04782 00044 10858 2 × 1016 ANOVA table Table 5 provides insights significance different variables see predictors pvalue equal zero meaning null hypothesis12 rejected null deviance 45438151 represents deviance model intercept Adding Binary variable reduces deviance 124114 indicating strong influence reuse likelihood Creation Time variable reduces deviance 830322 highlighting importance predicting reuse “Language” variable also reduces deviance 230614 Although reductions might seem small relative null deviance statistically significant given large sample size high degrees freedom involved assess direction size predictor effects need go logistic regression model 
positive coefficient estimate indicates predictor variable increases odds outcome occurring increase negative coefficient estimate indicates predictor variable increases odds outcome occurring decrease Since coefficients represent change logodds outcome oneunit increase predictor transform coefficients odds ratios exponentiating interpret actual impact predictor odds ratio indicates odds outcome change oneunit increase predictor results shown Figure 3 graph displays odds ratios 12H0 reduced model without predictor provides fit data significantly worse full model predictor suggests predictor significantly improve model’s fit Table 5 Bloblevel Model ANOVA Table Df Deviance Resid Df Resid Dev pvalue NULL 84556092 4543815100 Binary 1 12411420 84556091 4531403680 2 times 1016 Creation Time 1 83032263 84556090 4448371417 2 times 1016 Language 15 23061417 84556075 4425310000 2 times 1016 various predictors logistic regression model blob level odds ratio greater 1 indicates increase likelihood reuse odds ratio less 1 indicates decrease Fig 3 Bloblevel Model Logistic Regression Odds Ratios creation time highest positive coefficient time variable model represents time elapsed blob’s creation current time meaning older blobs higher time values positive coefficient indicates newer blobs smaller time values less likely reused visible shorter duration controlled timebound definition reuse likely due factors hypothesized fewer artifacts available reuse time creation Binary blobs show significant increase reuse likelihood odds ratio 163 Given confirmed effect calculated reuse propensity binary nonbinary blobs separately results showed 95 binary blobs reused compared 70 nonbinary blobs sample Different programming languages show varied impacts reuse likelihood Blobs written Perl C R PHP Go TypeScript ObjectiveC Java Rust likely reused Perl showing highest odds ratio contrast blobs written Kotlin Scala Ruby C JavaScript Python less likely reused Kotlin Scala showing significant 
negative coefficients variability suggests certain languages perhaps due prevalence specific use cases conducive code reuse PerLanguage Propensity Following logistic regression results demonstrated programming language statistically significant factor reuse probability blob calculated propensity copy programming language measured percentage reused blobs within language see Table 6 results show blobs written Perl highest propensity reused 185 indicating strong tendency code reuse among Perl developers Conversely Kotlin lowest propensity 30 suggesting minimal code reuse language Languages C 152 PHP 99 also show high reuse rates Python 64 JavaScript 55 TypeScript 63 lower rates languages like Java 78 Go 79 R 98 fall middle range moderate reuse rates Language Ratio Language Ratio Language Ratio C 152 ObjectiveC 84 TypeScript 63 C 60 Python 64 Java 78 Go 79 R 98 PHP 99 JavaScript 55 Rust 67 Perl 185 Kotlin 30 Scala 38 Ruby 51 JavaScript Example role programming language reuse activity might several underlying reasons previously discussed One reason presence reliable package manager true improvements package manager reduce propensity reuse artifact examine analyzed timeline reuse ratio JavaScript shown Figure 4 figure indicates sharper decrease slope around 2010 year NPM package manager introduced downward trend continues mid2013 copying activity rate drops around 7 levels pattern supports hypothesis introduction adoption NPM significantly reduced code reuse copying However important note illustration research needed understand phenomenon fully current study focused aspect conduct indepth analysis Additional investigations data points comparisons languages introduced similar improvements package management systems necessary confirm observed effect coincidental specific JavaScript alone Blob Size final predictor hypothesized affect reuse probability blob size investigate whether significant difference sizes copied noncopied blobs conducted ttest comparing sizes analysis 
revealed significant difference pvalue 22e16 indicating average copied blobs smaller noncopied blobs However effect varies language Specifically perlanguage ttests reveal copied blobs smaller languages like JavaScript TypeScript larger languages C Python remain unchanged ObjectiveC detailed Table 7 example JavaScript tvalue 599 suggesting copied blobs significantly smaller C tvalue 1959 indicating copied blobs larger Similar patterns observed languages TypeScript showing tvalue 359 smaller copied blobs Python tvalue 58 also smaller copied blobs Conversely languages like Java tvalue 1207 PHP tvalue 286 show copied blobs tend larger Table 7 Size Difference Reused nonReused Blobs Positive value means larger reused blobs Language value pvalue Language value pvalue C 1959 2 times 1016 Rust 78 2 times 1016 C 125 2 times 1016 Scala 91 2 times 1016 Go 155 2 times 1016 TypeScript 359 2 times 1016 JavaScript 599 2 times 1016 Java 1207 2 times 1016 Kotlin 145 2 times 1016 PHP 286 2 times 1016 ObjectiveC 07 0430298 Perl 58 2 times 1016 Python 58 2 times 1016 Ruby 249 2 times 1016 R 76 2 times 1016 3649 2 times 1016 variation highlights relationship blob size reuse propensity complex influenced languagespecific factors findings demonstrate general trend smaller copied blobs differing patterns across languages suggest underlying factors may play RQ1c Key Findings reuse ratio decreasing time 75 blobs reused within two years creation Older blobs controlling confounding effect increased visibility likely reused Binary blobs 63 likely reused Programming languages significantly impact reuse likelihood Blobs written languages like Perl C R PHP Go TypeScript ObjectiveC Java Rust likely reused written Kotlin Scala Ruby C JavaScript Python less likely reused reuse ratio timeline JavaScript shows notable decrease slope around year NPM package manager introduced Copied blobs generally smaller noncopied blobs consistent across different languages size difference varies language reused blobs 
C Java PHP Go C Scala Perl ObjectiveC larger nonreused blobs JavaScript TypeScript Ruby Kotlin Rust R Python reused blobs smaller nonreused blobs higher reuse propensity among binary blobs suggests binaries inherently reusable likely due compiled nature allows easy integration across projects lower reuse likelihood newer blobs indicates potential issue integration acceptance recent contributions possibly due rapid technological advancements shifts development practices significant impact programming languages reuse likelihood highlights importance languagespecific tools ecosystems Languages higher reuse rates Perl C benefit mature ecosystems newer niche languages like Kotlin Scala show lower reuse rates potentially due smaller communities decline JavaScript code reuse postNPM introduction suggests improved package management reduce need direct code copying promoting modular maintainable codebases Regarding blob size general trend indicates smaller code artifacts reusable likely due simplicity ease integration However trend varies significantly across different programming languages example languages like JavaScript TypeScript copied blobs tend smaller supporting idea writing concise modular code enhance reusability contrast languages like C Python copied blobs often larger suggesting nature use cases languages might necessitate larger reusable components variation underscores importance understanding languagespecific factors considering code reuse management strategies
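The step from logistic regression coefficients to odds ratios used in this subsection can be sketched as follows. The extracted coefficient table lost its decimal points and minus signs, so the two values below (Perl as 0.7664, Kotlin as -0.5606) are assumed readings. They were chosen only because the text states that Perl has the highest odds ratio and that Kotlin has a negative coefficient, and they should not be treated as the authors' exact figures.

```python
import math


def odds_ratio(coefficient):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coefficient)


def percent_change_in_odds(coefficient):
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio(coefficient) - 1.0) * 100.0


# Assumed readings of two language coefficients (signs and decimals restored
# by hand from the garbled table; treat them as illustrative only).
assumed_coefficients = {"Perl": 0.7664, "Kotlin": -0.5606}

for name, beta in assumed_coefficients.items():
    print(
        f"{name}: odds ratio {odds_ratio(beta):.2f}, "
        f"change in odds {percent_change_in_odds(beta):+.0f}%"
    )
```

An odds ratio above 1 corresponds to a positive coefficient and higher odds of reuse; one below 1 corresponds to a negative coefficient and lower odds, relative to the model's reference category.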
::::
514 RQ1d characteristics originating affect probability reuse section first present logistic regression model demonstrate perlanguage reuse propensity compare bloblevel results Finally analyze binary blob reuse Logistic Regression Model applied logistic regression model determine likelihood introducing least one reused blob response variable binary 1 introduced reused blob 0 otherwise Descriptive statistics model variables presented Table 8 Consistent bloblevel data frequent languages sample JavaScript Java Table 8 Projectlevel Model Descriptive Statistics Variable Description Statistics Reused least 1 reused blob Yes 205140 337 403195 663 5 Median Mean 95 Blobs Number generated blobs 1 15 1627 397 Binary Binary blobs total blobs ratio 0 0 01 06 Commits Number commit 1 5 570 84 Authors Number authors 1 1 25 3 Forks Number forks 0 0 15 1 Stars Number GitHub stars 0 0 34 2 Time Earliest commit time 7182013 3262018 9152017 332020 Activity Total months active 1 1 25 8 Language JavaScript Java Python PHP C 86065 43172 40503 24659 22258 391678 Spearman’s correlation analysis suitable observed heavily skewed distributions presented Table 9 number commits shows high correlation two predictors activity time 068 number blobs 067 high correlations indicate redundancy number commits add significant information beyond already captured activity time number blobs redundancy lead multicollinearity potentially distorting model’s coefficients reducing interpretability Consequently remove number commits model simplifying without sacrificing explanatory power correlations 052 concerning Table 9 Projectlevel Model Spearman’s Correlations Predictors Blobs Binary Commits Authors Forks Stars Time Activity Blobs 100 046 067 034 022 022 009 052 Binary 100 018 012 006 005 002 014 Commits 100 045 027 026 005 068 Authors 100 032 022 005 038 Forks 100 048 014 028 Stars 100 013 028 Time 100 005 Activity 100 results projectlevel logistic regression model shown Tables 10 11 variables model pvalues 
less 005 indicating statistically significant predicting likelihood introducing reused blobs see Table 10 demonstrates strong evidence null hypothesis suggesting variables effect reuse Examining ANOVA results Table 11 provides insight impact significance predictors see predictors pvalue equal zero meaning null hypothesis rejected deviance values ANOVA table indicate reduction model deviance predictor included example adding number blobs model reduces deviance 13121953 Table 10 Projectlevel Model Coefficients Estimate Std Error z value Prz Intercept 479 016 3001 2 × 1016 Blobs 061 000 22894 2 × 1016 Binary 077 002 4009 2 × 1016 Authors 009 001 824 2 × 1016 Forks 031 001 2772 2 × 1016 Stars 006 001 719 661 × 1013 Time 010 001 1200 2 × 1016 Activity 007 001 1048 2 × 1016 C 033 002 1960 2 × 1016 C 030 002 1574 2 × 1016 Go 029 004 770 133 × 1014 JavaScript 021 001 2258 2 × 1016 Kotlin 023 005 430 175 × 105 ObjectiveC 013 003 363 0000288 Python 019 001 1478 2 × 1016 R 027 005 593 304 × 109 Rust 048 007 665 287 × 1011 Scala 027 007 379 0000153 TypeScript 088 003 3457 2 × 1016 Java 025 001 2090 2 × 1016 PHP 029 001 1959 2 × 1016 Perl 031 010 320 0001395 Ruby 063 002 3318 2 × 1016 substantial reduction underscores important role model results confirm importance predictors explaining variability likelihood reuse Table 11 Projectlevel Model ANOVA Table Df Deviance Resid Df Resid Dev pvalue NULL 608334 77766048 Blobs 1 13121953 608333 64644095 2 × 1016 Binary 1 66294 608332 64577801 2 × 1016 Authors 1 92669 608331 64485132 2 × 1016 Forks 1 208402 608330 64276730 2 × 1016 Stars 1 6377 608329 64270353 144 × 1015 Time 1 15698 608328 64254654 2 × 1016 Activity 1 13931 608327 64240724 2 × 1016 Language 15 517820 608312 63722903 2 × 1016 understand size direction impacts look odds ratios inferred logistic regression coefficients odds ratio calculated exponential coefficient odds ratio greater 1 indicates positive impact odds ratio less 1 indicates negative impact results shown 
Figure 5 logistic regression analysis shows several predictors significantly impact likelihood reused blob TypeScript Binary Ruby Blobs strongest positive effects indicating increases variables substantially raise odds reused positive predictors include Forks PHP JavaScript Time Authors Activity Stars also increase likelihood though lesser extent Conversely predictors like Rust C Perl C Go Scala R Java Kotlin Python ObjectiveC negatively impact odds suggesting increases variables decrease likelihood introducing reused blob interpreting time variable important note since earliest commit timestamp represented number calculated time elapsed earliest commit current date better interpretability larger time value indicates older earliest commit model shows time positive coefficient suggesting older earliest commit higher probability introducing reused blobs result could influenced two factors First bloblevel model already observed older blobs higher probability reused Additionally timebound definition reuse controls confounding effect longer visibility blob level account longer visibility Therefore observed result might also affected project’s age implies longer visibility even though blob reused within two years creation PerLanguage Propensity projectlevel model highlights significance programming languages likelihood introducing reused blob explore calculated percentage projects language introduced reused blobs previous analysis RQ1a know approximately 29 projects introduced least one reused blob using timebound definition copying ratio increased 33 sample results language shown Table 12 Languages Ratio Language Ratio Language Ratio C 332 ObjectiveC 400 TypeScript 623 C 370 Python 305 Java 362 Go 313 R 285 PHP 464 JavaScript 412 Rust 315 Perl 299 Kotlin 400 Scala 360 Ruby 512 ratio projects introduced reused blobs varies significantly across different programming languages offering new insights compared bloblevel analysis example projects dominated TypeScript highest 
probability 62 introducing least one reused blob finding particularly interesting blob level propensity copy TypeScript lower average discrepancy suggests TypeScript projects acting upstream language’s supply chain less centralized Developers language seem inclined incorporate code various possibly unknown projects languages also show distinct patterns instance Ruby projects high probability 51 reusing blobs whereas Python projects lower probability 305 variation indicates likelihood code reuse strongly influenced primary language reflecting different practices community norms across languages insights emphasize importance considering programming language studying code reuse patterns projects ensure results comparable bloblevel analysis calculated copied blob ratio copied blobs total blobs took average ratio projects language important difference bloblevel propensity blob level language assignment based file extension blob binary blobs categorized “Other” projectlevel analysis language blob determined predominant language belongs example Pythonwritten blob Cdominated counted C blob Similarly binary blobs assigned language dominant language respective projects results new definition shown Table 13 Language Ratio Language Ratio Language Ratio C 154 ObjectiveC 95 TypeScript 56 C 47 Python 73 Java 58 Go 67 R 72 PHP 95 JavaScript 88 Rust 51 Perl 212 Kotlin 34 Scala 35 Ruby 53 propensity copy varies using projectlevel definition compared bloblevel definition see Table 6 example propensity copy JavaScriptdominated projects higher JavaScript blobs general 88 vs 55 indicates greater likelihood reuse within JavaScript projects compared individual JavaScript blobs various projects could attributed modularity strong reuse culture JavaScript ecosystem libraries frameworks frequently shared integrated JavaScript projects often incorporate multiple languages HTML CSS web development serverside languages backend functionality enhancing reuse shared components evolution JavaScript 
projects involving various tools libraries also contributes higher reuse rate within context Perldominated projects propensity reuse higher Perl blobs general 212 vs 185 suggests blobs within Perl projects likely reused compared individual Perl blobs different projects Perl’s strong culture code reuse sharing exemplified Comprehensive Perl Archive Network CPAN encourages use distribution reusable code modules Perl projects often include wide range scripts utilities shared across different applications enhancing reuse Furthermore Perl’s use scripting text processing system administration often requires reuse common patterns libraries contributing higher reuse rate within projects Conversely Rdominated projects show lower propensity reuse compared R blobs general 72 vs 98 implies individual R blobs likely reused blobs within Rdominated projects R primarily used statistical computing data analysis specific scripts functions reused across different analyses However R projects often tailored specific datasets analyses resulting lower overall reuse within context specialized nature many R projects unique data processing analysis pipelines limits reuse compared individual reusable components like functions libraries Javadominated projects exhibit lower propensity reuse compared Java blobs general 58 vs 78 indicates individual Java blobs likely reused blobs within Javadominated projects Java widely used across various domains reusable components like libraries frameworks common across different projects However Java projects tend large complex specific architectures dependencies may limit crossproject reuse high degree customization specificity Java enterprise applications reduces reuse rate within context compared reuse individual Java blobs libraries analyses reflect differing dynamics code reuse various programming ecosystems Understanding differences help improve strategies fostering code reuse optimizing development practices across different languages contexts Binary 
Blob Analysis Although previous analyses indicated binary blobs likely reused aimed investigate whether propensity varies across projects dominated different programming languages blob level feasible ascertain programming language binary blob However level analysis becomes possible Therefore examined reused binary blob ratio percentage reused binary blobs total reused blobs within language compared binary blob ratio percentage binary blobs total blobs within language utilizing ttest identify significant differences Consistent bloblevel analysis reused binary blob ratio exceeds general binary blob ratio across programming languages indicating higher likelihood reuse binary blobs observation raises questions languagespecific differences binary blob reuse Specifically hypothesize binary blobs frequently reused certain languages compared others words want know identifying reused binary blob allows us infer likely originate projects written particular languages findings confirm hypothesis proportion reused binary blobs varies significantly among different programming languages Nevertheless hypothesize least difference stems general difference binary blob ratios different languages limited reuse statistical tests reveal binary blob ratios indeed differ significantly across languages Consequently ratio reused binary blobs also exhibits significant variation among different languages suggesting difference necessarily mean varying binary reuse practices among want determine higher number reused binary blobs certain language solely due general prevalence binary blobs language languages tend reuse binary blobs control confounding effect normalize binary blob reuse ratio based total binary blob ratio Given binary blobs ratio br binary blobs total blobs defined reused binary ratio cbr binary reused blobs total reused blobs binary ratio br metric metric cbrbr averaged 4104 projects sample using linear regression project’s primary language predictor obtained results shown Table 
14 $\frac{cbr}{br} = \frac{cbc / cc}{bc / c}$ normalized binary reuse metric cbr copied binary ratio br binary ratio cbc copied binary count cc copied count bc binary count c total count Language Metric pvalue: C 333 0.810722; Rust 606 0.422024; C 492 0.025270; Scala 538 0.545028; Go 573 0.173372; TypeScript 517 0.063922; JavaScript 704 < 2 × 10^-16; Java 491 0.000497; Kotlin 542 0.306698; PHP 449 0.035326; ObjectiveC 217 0.217673; Perl 332 0.975449; Python 219 0.005547; Ruby 351 0.951277; R 265 0.614773 analysis reveals reused binary blobs binary blobs metric varies across programming languages Notably C JavaScript Python Java PHP exhibit statistically significant differences pvalue < 0.05 particular JavaScript projects demonstrate higher tendency reuse binary blobs Python projects show lower tendency suggests JavaScriptdominated projects reusing binary blobs likely efficient costeffective reusing code Conversely Python projects might benefit reusing code rather binary blobs complete coefficients regression ANOVA tables available online appendix RQ1d Key Findings properties significantly impact probability blobs reused binary ratio number blobs forks authors activity duration stars positive impact Older projects likely introduced reused blobs Blobs residing projects dominated different programming languages varying probabilities reuse TypeScript Ruby PHP JavaScript higher probabilities Rust C Perl C Go Scala R Java Kotlin Python ObjectiveC lower probabilities average 33.7% projects introduced least one reused blob percentage varies significantly languages TypeScript 62.3% Ruby 51.2% highest propensity R 28.5% Perl 29.9% lowest tendency reuse binary blobs much higher JavaScript projects Python projects show lower tendency projectlevel analysis reveals various factors significantly influence likelihood code reuse open source projects Projects blobs binary blob ratio longer activity tend exhibit higher reuse rates aligns hypothesis health activity popularity signals play important role promoting reuse
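The normalized binary reuse metric reported in Table 14 divides the share of binary blobs among a project's reused blobs (cbr) by the share of binary blobs among all of its blobs (br). A minimal sketch of the per-project calculation is given here. The function name and the example counts are hypothetical, and the study additionally averages the metric over projects and regresses it on the primary language, which this sketch does not attempt.

```python
def normalized_binary_reuse(cbc, cc, bc, c):
    """Return cbr / br for a single project.

    cbc: copied (reused) binary blob count
    cc:  copied (reused) blob count
    bc:  binary blob count
    c:   total blob count
    """
    if cc <= 0 or c <= 0:
        raise ValueError("project needs at least one blob and one reused blob")
    if bc <= 0:
        raise ValueError("metric is undefined for projects without binary blobs")
    cbr = cbc / cc   # binary share among reused blobs
    br = bc / c      # binary share among all blobs
    return cbr / br


# Hypothetical project: 200 blobs, 25 of them binary; 40 blobs were reused,
# 10 of which are binary.
value = normalized_binary_reuse(cbc=10, cc=40, bc=25, c=200)
print(f"normalized binary reuse: {value:.2f}")
```

A value above 1 means binary blobs are over-represented among the reused blobs relative to their overall share in the project, which is how the normalization controls for languages that simply contain more binaries.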
variation reuse likelihood across different programming languages underscores influence languagespecific ecosystems practices consistent bloblevel results instance TypeScript Ruby projects show highest propensity reuse may due robust ecosystems strong community practices encourage code sharing reuse Conversely languages like Python Perl lower reuse rates suggesting different reuse dynamics possibly need improved tools practices foster reuse However impact blob’s language language resides differs suggests underlying factors behind differences technical aspects languages tools also community culture practices significant reuse binary blobs particularly languages like JavaScript indicates binary artifacts valuable assets projects might due efficiency ease integrating precompiled binaries compared source code However lower reuse rate binary blobs Python suggests language’s ecosystem favors source code reuse could due dynamic nature extensive use interpreted scripts findings important implications development support tools facilitate reuse different programming languages languages like JavaScript binary blob reuse prevalent enhancing asset libraries could beneficial contrast languages like Python code reuse advantageous improving code package managers would appropriate differentiation underscores necessity tailored support tools optimize reuse practices various programming environments findings highlight impact context reuse patterns suggest different definitions granularity levels yield varying insights code reuse behaviors 52 RQ2 developers perceive engage copybased reuse Across three rounds received 247 complete responses reusers 127 creators also 360 178 partial responses making total 607 305 responses reusers creators respectively results shown Table 15 discussed Section 712 identified originating repository might always true creator blob 39 developers identified creators reported reusing blob another source Additionally reusers might obtained blob another reuser 
original creator see Section 713 Among Table 15 Survey Participation Total Started Completed Response Rate Completion Rate Creator 3144 305 127 970 404 Reuser 6338 607 247 958 390 Total 9482 912 374 962 394 reusers confirmed reusing blob 43 acknowledged originating source 48 reported copying elsewhere 9 answer question findings provide important estimates fraction reuse within open source OSS least 61 fraction reuse originating projects least 43 data essential understanding dynamics code reuse within OSS highlighting significance direct reuse original projects secondary reuse intermediate projects Furthermore 60 identified reusers confirmed reusing blob remaining 40 claimed created see Table 16 discrepancy attributed several factors First individuals might indeed original authors blob originating implying reused resources Second gap could explained activities private repositories eg Developer creates file private repository Developer B copies public repository Developer reuses another public repository Third mentioned Section 43 concerns potential licensing violations might made many reusers uncomfortable admitting reuse explicitly Additionally developers’ faulty memory could play role especially reuse instances occurred long time ago One potential area investigation could examining owners commit authors copy instance gain better understanding gap However pursued study main focus Exploring factors future research could provide deeper insights complexities code reuse attribution within open source projects Table 16 Identified vs Claimed Creators Reusers Identified Creators Reusers Total Claimed Creator 77 61 99 40 176 Reuser 50 39 148 60 198 Total 127 247 374 Another dimension survey explored intentions creators others reuse artifacts Sixtytwo percent creators indicated resources intended reuse others asked helpfulness particular blob scale 1 5 5 helpful reusers rated average helpfulness 381 creators rated 424 suggests developers well aware reuse potential artifacts 
even blob may essential primarily projects background sections discussed risks associated type reuse asked reusers concerned risks well scale 1 to 5 (5 concerned) average concern bugs reused file 1.83 average concern changes original file 2.35 Several factors might contribute low level concern among developers including trust original code’s quality confidence testing processes However lack concern could facilitate spread potentially harmful code even creator fixes original code fact reusers not significantly worried risks amplifies potential risk OSS supply chain level Next asked participants likely would use package manager one available particular blob scale 1 to 5 (5 likely) average likelihood using package manager 2.93 indicates although developers may not concerned bugs changes potential improvements many would still use tool available suggests “package-manager” type tools refactoring least maintaining reused code might gain traction developed results shown Table 17
Question (audience) | Responses | Average | Median | StdDev
Helpfulness (creators) | 156 | 4.25 | 5 | 1.15
Helpfulness (reusers) | 185 | 3.82 | 4 | 1.32
Concern bugs (reusers) | 185 | 1.85 | 1 | 1.33
Concern changes original file (reusers) | 187 | 2.33 | 2 | 1.56
Likelihood using package manager (reusers) | 184 | 2.89 | 3 | 1.64
Finally thematic analysis reasons reuse specifically responses question “why” revealed eight themes 162 responses received see Table 18 analysis provides nuanced understanding motivations behind code reuse highlighting several key themes
Table 18
Theme | Description | Frequency
Demo | demonstration test prototype | 14
Dependency | part library | 11
Education | learning purposes | 16
Functionality | specific functionality | 39
Own | own reuse | 2
Resource | image style dataset license | 30
Template | template starting point framework | 14
Tool | parser plugin SDK configuration | 23
expected one main reasons reuse provide specific functionality indicates developers often reuse code incorporate existing functionalities projects saving time effort development practice well-documented literature [48] underscores importance
reusable components efficient development Another observed theme reuse various resources including datasets instructions license files graphical design objects eg PNG JPEG fonts styles aligns significant reuse binary blobs identified RQ1 inclusion diverse resources indicates developers often depend readily available materials enhance projects’ visual functional aspects literature acknowledges practice findings suggest slightly higher emphasis resource reuse indicates resource management might more important developers previously thought
Footnote 14: Since survey participants chosen stratified sampling frequencies not represent actual data distribution
Reusing tools parsers plugins SDKs configuration files mentioned 23 times practice noted practicality efficiency setting development environments ensuring consistency across projects highlights role auxiliary components streamlining development processes providing necessary infrastructure functionality Assignments school projects learning objectives similar concepts another prominent theme emphasizes role code reuse development knowledge supply chain developers reuse existing code understand learn new concepts Code reuse demonstration testing prototyping purposes identified 14 times theme suggests developers often reuse code quickly create prototypes test scenarios without focusing quality security licensing reused code priority cases achieve rapid results aligns findings Juergens et al [48] developers often clone code create prototypes perform tests quick prototypes however may end active projects Templates starting points frameworks mentioned 14 times Developers often clone templates frameworks solid foundation projects practice supported findings Roy Cordy [80] approach leverages existing structures expedite development ensure consistency Part library dependency management cited 11 times practice highlighted studies emphasize importance managing dependencies within development process study Roy Cordy [80] Although checking library files
not considered best practice many developers maintain specific versions avoid potential issues updates changes conscious decision highlights tradeoff best practices practical needs Reusing one’s code mentioned twice theme “own reuse” developers clone code reuse new projects less prominently featured literature compared reasons code cloning Developers clone code ensure consistency save time leverage previously written tested code practice practical efficient especially developers familiar code functionality However literature not emphasize reason strongly studies acknowledge broader concept code reuse focus reusing code external sources libraries educational purposes [48] [80] discrepancy suggests “own reuse” might underexplored area existing research indicates developers recognize practice frequently may not thoroughly documented emphasized academic literature gap highlights opportunity investigation developers engage “own reuse” impact development processes also 13 instances responses either incomprehensible respondent not remember file reason reuse
RQ2 Key Findings
39% identified creators stated reused blob another source
Among reusers 43% acknowledged originating (direct reuse) 48% copied elsewhere (indirect reuse)
Reuse within OSS landscape least 61%
60% reusers confirmed reuse 40% claimed creation
62% creators intended resources reuse
Reusers showed low concern potential bugs changes original file
Reusers moderately willing use package manager available
Main reuse themes: functionality, resources, tools, education, demo/testing/prototyping, templates, dependencies, own reuse
findings reveal nonnegligible portion developers engage copy-based reuse within OSS community practice common many reusers sourcing code directly original creators intermediaries Understanding dynamics important improving transparency traceability reused code could potentially enhance code quality security discrepancies identified claimed creators highlight complexities attribution ownership Additionally survey respondents’ replies not always accurate true
complicates understanding true origins code gap underscores need better tracking mechanisms within repositories accurately reflect code origins Future research could delve deeper factors offering insights could inform policy tooling improvements OSS development Creators often intend code reused creators reusers recognize utility artifacts positive perception suggests promoting reuse beneficial community fostering collaboration innovation However difference helpfulness ratings indicates might room improving clarity documentation reusable code better meet reusers’ needs Despite low concern potential risks like bugs changes moderate interest package management tools suggests opportunity developing solutions help maintain refactor reused code tools could mitigate risks providing updates improvements managed manner enhancing overall reliability reused code thematic analysis reuse motivations provides comprehensive view developers opt copybased reuse Reusing specific functionality underscores importance modular reusable code development also highlights potential benefits welldocumented easily integrable code components readily reused others practice including library files suggests deliberate effort maintain stability avoid uncertainties might come updates changes However also highlights potential area improvement developer education best practices well importance tools help manage dependencies effectively insights contribute understanding motivations behind code reuse practical considerations developers face maintaining projects reusing demo testing accelerate development innovation also raises potential risks Developers may inadvertently propagate vulnerabilities violate licenses leading broader issues within supply chain Highlighting importance balancing speed security testing phases inform best practices educational efforts Educational use underscores educational value code reuse Reusing existing code allows learners understand realworld applications coding practices 
fostering skill development However also emphasizes need proper guidance resources ensure educational reuse done ethically effectively Encouraging educators integrate lessons best practices code reuse enhance quality learning adherence legal ethical standards proportion meaningful answers recalling file indicate reuse instances welldocumented remembered developers lack clarity hinder understanding traceability reuse practices highlights need better documentation tracking mechanisms ensure reasons contexts reuse transparent wellunderstood Implementing measures improve management reused code resources reducing potential risks associated undocumented reuse
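The percentages reported in Table 15 and Table 16 earlier in this section follow directly from the raw counts. A minimal sketch of that arithmetic is given below. The counts are copied from the two tables, while the function and variable names are illustrative assumptions and not part of the study's tooling.

```python
# Counts copied from Table 15: (invited, started, completed) per group.
participation = {
    "Creator": (3144, 305, 127),
    "Reuser": (6338, 607, 247),
}


def rates(invited, started, completed):
    """Return (response rate, completion rate) as percentages."""
    response = round(100 * started / invited, 2)
    completion = round(100 * completed / invited, 2)
    return response, completion


# Column-wise totals across both groups, then one rate pair per row.
totals = tuple(sum(column) for column in zip(*participation.values()))
table15 = {group: rates(*counts) for group, counts in participation.items()}
table15["Total"] = rates(*totals)

# Counts copied from Table 16: identified role -> (claimed creator, claimed reuser).
claims = {"Creator": (77, 50), "Reuser": (99, 148)}


def claim_shares(claimed_creator, claimed_reuser):
    """Share of each claimed role within one identified role, in whole percent."""
    total = claimed_creator + claimed_reuser
    return round(100 * claimed_creator / total), round(100 * claimed_reuser / total)


table16 = {role: claim_shares(*counts) for role, counts in claims.items()}

for group, (response, completion) in table15.items():
    print(f"{group}: response {response:.2f}%, completion {completion:.2f}%")
for role, (as_creator, as_reuser) in table16.items():
    print(f"identified {role}: claimed creator {as_creator}%, claimed reuser {as_reuser}%")
```

Both rates use the number of invited participants as the denominator, which is the reading under which the computed values match the tables.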
::::
6 IMPLICATIONS 61 Developers Copybased reuse enables developers save time effort leveraging existing code However introduces risks maintenance fragmentation security vulnerabilities outdated dependencies address challenges developers adopt tools practices track reused code ensure compliance licensing requirements mitigate risks associated unverified code quality Fostering practice systematically reviewing documenting reused code enhances reliability maintainability also contributes overall sustainability projects Additionally staying informed updates reused code integrating updates promptly reduce risks associated outdated insecure components 62 Businesses Businesses rely open source must proactively address inherent risks copybased reuse including security vulnerabilities potential noncompliance licensing terms Investing robust tools tracking maintaining reused code critical safeguarding supply chain effort encompass implementing workflows regularly updating reviewing reused components Moreover businesses actively support smaller open source projects provide valuable code contributions support enhances quality reliability businesscritical also fosters goodwill collaboration within open source community taking steps businesses effectively mitigate risks strengthening ecosystem upon rely 63 Open Source Community open source community plays important role ensuring safe effective reuse code promoting best practices ethical secure reuse adopting standardized licensing improving quality benchmarks community minimize risks build trust shared resources Equally important supporting small mediumsized projects contribute significantly reusable code base Providing mentorship funding collaboration opportunities bolster overall open source ecosystem fostering innovation cooperation across projects Additionally establishing centralized repositories resources facilitate traceability offer detailed metadata provenance authorship licensing streamline reuse process mitigate 
associated risks efforts collectively enhance reliability sustainability scalability open source 64 Researchers Educators Researchers unique opportunity investigate finergrained reuse patterns instances involving slight modifications partial reuse better understand factors influencing reuse longterm impact quality security insights guide development tools methodologies promote safe effective reuse practices Educators integrate lessons ethical reuse practices licensing compliance dependency management engineering curricula leveraging realworld case studies addressing practical challenges balancing development speed security concerns educators equip future developers navigate complexities reuse responsibly approach help ensure next generation professionals actively supports sustainability growth open source ecosystems 65 OSS Platform Maintainers Platforms like GitHub GitLab wellpositioned enhance practices surrounding copybased reuse Improving traceability mechanisms preserve provenance authorship licensing metadata essential minimizing risks unintentional license violations outdated dependencies Integrating features automated detection license conflicts dependency vulnerabilities changes reused code empower developers manage projects efficiently securely Additionally platforms offer educational resources inplatform guidance encourage best practices reuse compliance fostering culture informed collaborative reuse platform maintainers contribute significantly longterm sustainability resilience open source ecosystem
::::
7 LIMITATIONS 7.1 Internal Validity 7.1.1 Commit Time identification first occurrence consequently building reuse timeline blob based commit timestamp time not necessarily accurate depends user’s system time dataset utilized followed suggestions Flint et al [22] methods eliminate incorrect questionable timestamps increases reliability reuse timeline also used version history information ensure time parent commits not postdate child commits [46] adds extra layer consistency validation enhancing accuracy data 7.1.2 Originating accuracy origination estimates highly reliant completeness data Even assume World Code WoC collection exhaustive possible blobs may originated private repository copied public one means originating repository WoC may not actual creator blob scenario suggests even comprehensive dataset could instances code reuse remain undetected adding another layer complexity understanding full extent reuse across open source projects example 3D cannon pack asset (footnote 15) committed 38 projects indexed WoC However asset originally created earlier Unity Asset Store [46] utilizing extensive WoC collection provide broad detailed analysis code reuse capturing significant portion open source activity even instances private-to-public transitions missed Additionally examples identified 3D cannon pack asset highlight practical implications real-world relevance findings demonstrating robustness analysis despite potential data gaps approach addresses inherent challenges tracking code origination reuse offering framework refined expanded future research improve accuracy comprehensiveness
Footnote 15: https://assetstore.unity.com/packages/3d/props/weapons/stylish-cannon-pack-174145
7.1.3 Copy Instance unique combination blob originating destination might not always accurately represent actual pattern reuse destination projects could potentially reuse blob different source originating instance three projects A, B, C in order blob creation C might copy either A or B Additionally certain blobs not reused created independently repository
empty string standard template automatically generated common tool 46 blobs excluded using list provided WoC 62 Despite limitation results remain significant recognizing potential indirect reuse independently created blobs provide nuanced understanding reuse landscape accounting complexity code propagation across projects Excluding independently created blobs utilizing WoC’s comprehensive list ensures analysis focuses genuine reuse instances enhancing reliability findings 72 External Validity 721 Bloblevel Reuse work focuses solely reuse entire blobs deliberately excluding reuse partial code segments within files bloblevel reuse common covers subset broader code reuse landscape Bloblevel reuse relevant scenarios larger code blocks consisting entire files even groups files reused compared statement functionlevel reuse means results might implicit bias towards programming languages ecosystems rely heavily complete files potentially overlooking reuse practices prevalent languages favor modular snippetbased reuse limitation also implies different versions file even differ one character generate different blobs due distinct file hashes Consequently blob reuse equate file reuse Defining file reuse challenging difficult determine constitutes equivalence files different projects 46 could potential reason higher level reuse binary blobs relatively harder modify Despite limitations results remain significant several reasons Prevalent Pattern concentrating entire blob reuse address prevalent impactful pattern development allows us provide valuable insights substantial portion code reuse practices Clarity Precision Analyzing entire blobs offers clear precise method identifying reuse avoiding ambiguity complexity associated defining partial file reuse clarity enhances reliability findings Efficiency Scalability Bloblevel analysis computationally efficient scalable enabling us process large datasets draw meaningful conclusions extensive data scalability important comprehensive 
empirical studies Foundation Future Research work lays groundwork future studies build findings explore partial file reuse nuanced aspects code reuse addressing welldefined scope provide solid foundation subsequent research summary focus blob reuse introduces certain limitations also provides clear scalable impactful insights code reuse practices targeted approach enables us contribute valuable findings field despite inherent complexities defining analyzing file reuse Although bloblevel reuse less granular statement methodlevel reuse findings blob level would also apply subbloblevel analysis adjust bloblevel reuse Future studies needed investigate extent different levels types code reuse overlap differ 722 Survey Response Rate relatively low response rate survey may due perception respondents copying code sensitive subject concerns may impacted responses even cases developers chose participate suggests work may needed design surveys create impressions Additionally since many reuse instances happened long time ago developers might forgotten Therefore important conduct regular surveys capture experiences developers still remember practices
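Section 7.2.1 notes that two versions of a file differing by a single character produce different blobs. A minimal sketch of why is given below, assuming blobs are identified the way Git identifies them: a SHA-1 digest over the header "blob <size>", a NUL byte, and the file bytes. The helper name git_blob_sha1 and the sample contents are illustrative.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Compute the identifier Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two file versions that differ by exactly one character.
version_a = b"int x = 1;\n"
version_b = b"int x = 2;\n"

hash_a = git_blob_sha1(version_a)
hash_b = git_blob_sha1(version_b)
print(hash_a)
print(hash_b)
print("same blob:", hash_a == hash_b)

# The empty file always maps to the same identifier, regardless of who creates it,
# which is why such blobs appear in many repositories without any copying.
empty_hash = git_blob_sha1(b"")
print(empty_hash)
```

Because the identifier depends only on the content, identical files in unrelated repositories collapse to one blob, while any edit, however small, yields an unrelated identifier. This is the reason blob reuse undercounts file reuse.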
::::
8 FUTURE WORK 8.1 Code-Snippet Granularity discussed methodology section going finer granularity blob-level detect code reuse not practically feasible Nevertheless approaches make relatively tractable problem Specifically hashing abstract syntax tree AST code snippet classes functions blob mapping blobs hashes could potentially make finer-grained code reuse detection feasible Assuming average $k$ code snippets per blob 16 billion blobs parsing hashing operation complexity resulting $O(16 \times 10^9 \times k)$ perform self-join created map blob syntax tree hash b2AST using AST hash key self-join complexity depends number unique hashes distribution worst case every blob unique hashes join operation would approach $O((16 \times 10^9 \times k)^2)$ However join complexity would typically significantly less many common hashes realistic estimate assumes number unique AST hashes $h$ much smaller total number entries b2AST map making join complexity closer $O(h \times 16 \times 10^9 \times k)$ join although potentially large more feasible pairwise comparisons entire blobs due efficient handling common hashes examining code reuse granularity code snippets could potentially uncover far intricate network reuse approach might reveal patterns practices not noticeable looking solely whole-file blob-level reuse Although increased complexity challenging manage offers valuable opportunities comprehensive analysis reuse [46] 8.2 Dependency-Based Reuse work aimed demonstrate prevalence importance copy-based reuse gain comprehensive understanding code reuse important analyze copy-based dependency-based reuse type reuse reveals different aspects developers leverage existing code projects studying side side paint complete picture extent nuances reuse development Ignoring one favor would provide incomplete narrative [46] 8.3 Upstream Repository highlighted limitations section currently lack precise knowledge source repository reuses file tend assume originating repository instances copying However assumption may not capture real-world complexity reuse enhance
understanding developers identify suitable repositories reuse could potentially leverage metaheuristic algorithms artificial intelligence techniques advanced methods might enable us predict actual source reused artifacts instance copying greater accuracy [46] 8.4 Open Source Supply Chain Network Directed Acyclic Graphs DAGs instrumental clone detection reuse literature due ability model analyze complex relationships dependencies various components context copy-based reuse dataset created using World Code WoC (footnote 16) infrastructure leveraged construct DAGs represent flow reuse across different repositories dataset’s detailed tracking blob copies including origins destinations provides rich source data map relationships accurately drawing DAGs researchers visualize analyze propagation reused blobs identifying critical nodes projects blobs play central role reuse network visualization helps understanding structure dynamics reuse highlighting patterns reused blobs central projects reuse network potential vulnerabilities licensing issues propagating reused blobs DAGs reveal reuse spreads across projects helping identify projects primary sources reusable blobs code flows different projects mapping reuse network possible pinpoint critical points vulnerabilities licensing issues could propagate allowing targeted interventions mitigate risks Understanding reuse network also aids developing better tools practices managing code quality ensuring reused code maintained updated consistently across projects use Studies large-scale clone detection Sajnani et al [83] Koschke [52] provide foundational methodologies leveraging DAGs contexts methodologies adapted extended using dataset enhance understanding copy-based reuse open source development 8.5 Tool Development discussed background section different types code reuse impacts several critical areas including security licensing code quality Understanding implications addressing important advancing development practices Security
Reused code propagate vulnerabilities across multiple projects [78] instance security flaw exists reused blob potentially affect projects include blob Analyzing reuse patterns help identify critical points vulnerabilities might spread allow proactive mitigation measures notable incidents widespread code reuse led security breaches example Heartbleed bug OpenSSL far-reaching impacts due extensive reuse affected code across numerous projects Future research focus developing automated tools scan reused code known vulnerabilities suggest patches proactive approach enhance security posture systems
Compliance Reused code may carry licensing obligations need respected Failure comply obligations lead legal disputes financial penalties understanding reuse patterns organizations ensure meet licensing requirements instances companies faced legal challenges due improper reuse code restrictive licenses example using GPL-licensed code proprietary without complying GPL terms led lawsuits Developing tools automatically check license compliance code reused help organizations avoid legal pitfalls tools flag potential issues provide guidance resolve
Code Quality Reused code may not always meet quality standards adopting Ensuring reused code adheres best practices coding standards essential maintaining overall code quality Poorly written code lead maintenance challenges degraded performance adopting projects Future work focus creating tools assess quality reused code suggest improvements tools analyze code adherence coding standards detect code smells recommend refactoring
Footnote 16: For information access data please visit https://github.com/woc-hack/tutorial
Package Managers Developing package managers tailored different programming languages communities highly beneficial managers offer relevant effective support managing code reuse specific environments Additionally enhancing existing package managers features reuse tracking version control automated updates improve development efficiency
reduce associated risks code reuse Community Engagement Engaging open source communities develop tools practices address unique needs different ecosystems collaborating communities ensure widespread adoption effectiveness Continuously gathering user feedback iterating tools enhance functionality usability also important iterative process helps create robust reliable tools meet evolving needs developers
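The snippet-level detection idea from Section 8.1 can be illustrated at toy scale. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses Python's standard ast module as the parser, treats each function or class definition as one snippet, hashes the output of ast.dump with SHA-1, and groups blobs by hash in place of comparing blob pairs. The names snippet_hashes and shared_snippets, and the three sample blobs, are hypothetical.

```python
import ast
import hashlib
from collections import defaultdict
from typing import Dict, Set


def snippet_hashes(source: str) -> Set[str]:
    """Hash the AST of every function and class definition in one blob."""
    hashes = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast.dump omits line and column attributes by default, so layout,
            # comments, and position in the file do not affect the hash.
            canonical = ast.dump(node)
            hashes.add(hashlib.sha1(canonical.encode("utf-8")).hexdigest())
    return hashes


def shared_snippets(blobs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build the blob-to-hash map, invert it, and keep hashes seen in 2+ blobs."""
    blobs_by_hash = defaultdict(set)
    for blob_id, source in blobs.items():
        for digest in snippet_hashes(source):
            blobs_by_hash[digest].add(blob_id)
    return {d: ids for d, ids in blobs_by_hash.items() if len(ids) > 1}


# Hypothetical blobs: blob_a and blob_b differ as whole files but share `add`.
blobs = {
    "blob_a": "def add(a, b):\n    return a + b\n\nprint(add(1, 2))\n",
    "blob_b": "# copied helper\ndef add(a, b):\n    return a + b  # sum\n\n\n"
              "def sub(a, b):\n    return a - b\n",
    "blob_c": "def mul(a, b):\n    return a * b\n",
}

shared = shared_snippets(blobs)
for digest, blob_ids in shared.items():
    print(digest[:12], sorted(blob_ids))
```

Grouping by hash touches each (blob, hash) entry once, which is the property that makes the self-join on the AST hash cheaper than pairwise comparison of whole blobs. A whole-file blob comparison would miss the shared function here, since the two files have different content.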
::::
9 CONCLUSIONS conclusion study highlights nonnegligible role copybased reuse open source development leveraging extensive World Code WoC dataset provided comprehensive analysis code reuse revealing substantial portion open source projects engage practice findings indicate 69 blobs OSS reused least 80 projects reused blobs another widespread reuse emphasizes efficiency gains OSS development also raises concerns security legal compliance variation reuse patterns across programming languages underscores influence languagespecific ecosystems practices Moreover higher propensity binary blob reuse suggests need tailored tools support different types reuse Future research focus improving accuracy comprehensiveness reuse detection exploring impact partial file reuse survey results enrich understanding reuse practices found many creators intended resources reuse indicating collaborative mindset among developers Reusers generally found reused blobs helpful Despite positive perceptions reusers showed relatively low concern potential bugs changes original files low level concern could suggest either high level trust quality reused code lack awareness associated risks Additionally survey revealed moderate interest using package managers handle changes reused files indicates potential demand tools streamline manage code reuse effectively Overall work provides insights patterns factors affecting code reuse advocating better management support tools enhance sustainability security OSS addressing identified risks leveraging collaborative nature OSS community improve code reuse practices outcomes ACKNOWLEDGMENTS work supported part National Science Foundation Award Numbers 1901102 2120429 authors additionally thank Dr James Herbsleb Dr Bogdan Vasilescu valuable advice insightful comments helped improve work authors also thank reviewers constructive feedback suggestions helped enhance quality paper REFERENCES 1 Qurat Ul Wasi Haider Butt Muhammad Waseem Anwar Farooque Azam Bilal 
Maqbool 2019 systematic review code clone detection IEEE access 7 2019 86121–86144 2 Le Ons Mlouki Foutse Khomh Giuliano Antoniol 2017 Stack overflow code laundering platform 2017 IEEE 24th International Conference Analysis Evolution Reengineering SANER IEEE 283–293 3 Corey Angst Ritu Agarwal Vallabh Sambamurthy Ken Kelley 2010 Social contagion information technology diffusion adoption electronic medical records US hospitals Management Science 56 8 2010 1219–1241 4 Giuliano Antoniol Massimiliano Di Penta Ettore Merlo 2004 automatic approach identify class evolution discontinuities Proceedings 7th International Workshop Principles Evolution 2004 IEEE 31–40 5 Zubin Austin Jane Sutton 2014 Qualitative research Getting started Canadian journal hospital pharmacy 67 6 2014 436 6 Tegawendé F Bissyandé Ferdian Thung David Lo Lingxiao Jiang Laurent Réveillere 2013 Popularity interoperability impact programming languages 100000 open source projects 2013 IEEE 37th annual computer applications conference IEEE 303–312 7 Kelly Blincoe Jyoti Sheoran Sean Goggins Eva Petakovic Daniela Damian 2016 Understanding popular users Following affiliation influence leadership GitHub Information Technology 70 2016 30–39 8 Hudson Borges Andre Hora Marco Tulio Valente 2016 Predicting popularity github repositories Proceedings 12th international conference predictive models data analytics engineering 1–10 9 Lina Boughton Courtney Miller Yasemin Acar Dominik Wermke Christian Kästner 2024 Decomposing Measuring Trust OpenSource Supply Chains Proceedings 2024 ACMIEEE 44th International Conference Engineering New Ideas Emerging Results 57–61 10 Virginia Braun Victoria Clarke 2006 Using thematic analysis psychology Qualitative research psychology 3 2 2006 77–101 11 Alan W Brown Kurt C Wallnau 1998 current state CBSE IEEE 15 5 1998 37–46 12 Andrea Capiluppi Patricia Lago Maurizio Morisio 2003 Characteristics open source projects Seventh European Conference onSoftware Maintenance Reengineering 2003 
Proceedings IEEE 317–327 13 Ashley Castleberry Amanda Nolen 2018 Thematic analysis qualitative research data easy sounds Currents pharmacy teaching learning 10 6 2018 807–815 14 Nicholas Christakis James H Fowler 2013 Social contagion theory examining dynamic social networks human behavior Statistics Medicine 32 2013 556–577 Issue 4 httpsdoiorg101002sim5408 15 Russ Cox 2019 Surviving Dependencies reuse finally comes risks Queue 17 2 2019 24–47 16 John W Creswell J David Creswell 2017 Research design Qualitative quantitative mixed methods approaches Sage publications 17 Kevin Crowston James Howison 2005 social structure free open source development 18 Norman K Denzin 2017 research act theoretical introduction sociological methods Routledge 19 Massimiliano Di Penta Daniel German YannGaël Guéhéneuc Giuliano Antoniol 2010 exploratory study evolution licensing 2010 ACMIEEE 32nd International Conference Engineering Vol 1 IEEE 145–154 20 Muyue Feng Weixuan Mao Zimu Yuan Yang Xiao Gu Ban Wei Wang Shiyang Wang Qian Tang Jiahuan Xu Su et al 2019 Opensource license violations binary large scale 2019 IEEE 26th International Conference Analysis Evolution Reengineering SANER IEEE 564–568 21 Felix Fischer Konstantin Böttinger Huang Xiao Christian Stransky Yasemin Acar Michael Backes Sascha Fahl 2017 Stack Overflow Considered Harmful Impact CopyPaste Android Application Security 2017 IEEE Symposium Security Privacy SP 121–136 httpsdoiorg101109SP201731 22 Samuel W Flint Jigyasa Chauhan Robert Dyer 2021 Escaping time pit Pitfalls guidelines using timebased git data 2021 IEEEACM 18th International Conference Mining Repositories MSR IEEE 85–96 23 William Frakes Carol Terry 1996 reuse metrics models ACM Computing Surveys CSUR 28 2 1996 415–435 24 William B Frakes Christopher J Fox 1995 Sixteen questions reuse Commun ACM 38 6 1995 75–ff 25 William B Frakes Kyo Kang 2005 reuse research Status future IEEE transactions Engineering 31 7 2005 529–536 26 William B Frakes Giancarlo Succi 2001 
industrial study reuse quality productivity Journal Systems 57 2 2001 99–106 27 Mark Gabel Zhendong Su 2010 study uniqueness source code Proceedings eighteenth ACM SIGSOFT international symposium Foundations engineering 147–156 28 Jonas Gamalielsson Björn Lundell 2014 Sustainability Open Source communities beyond fork LibreOffice evolved Journal systems 89 2014 128–145 29 CJ Michael Geisterfer Sudipto Ghosh 2006 component specification study perspective component selection reuse Fifth International Conference CommercialofftheShelf COTSBased Systems ICCBSS’05 IEEE 9–pp 30 Daniel German 2002 evolution GNOME Proceedings 2nd Workshop Open Source Engineering 20–24 31 Daniel German Massimiliano Di Penta YannGael Gueheneuc Giuliano Antoniol 2009 Code siblings Technical legal implications copying code applications 2009 6th IEEE International Working Conference Mining Repositories IEEE 81–90 32 Daniel German Ahmed E Hassan 2009 License integration patterns Addressing license mismatches componentbased development 2009 IEEE 31st international conference engineering IEEE 188–198 33 Mohammad Gharehyazie Baishakhi Ray Vladimir Filkov 2017 Crossproject code reuse github 2017 IEEEACM 14th International Conference Mining Repositories MSR IEEE 291–301 34 Mohammad Gharehyazie Baishakhi Ray Mehdi Keshani Masoumeh Soleimani Zavosht Abbas Heydarnoori Vladimir Filkov 2019 Crossproject code clones GitHub Empirical Engineering 24 3 2019 1558–1573 35 Antonios Gkortzis Daniel Feitosa Diomidis Spinellis 2021 reuse cuts ways empirical analysis relationship security vulnerabilities Journal Systems 172 2021 110653 36 Georgios Gousios 2013 GHTorrent dataset tool suite 2013 10th Working Conference Mining Repositories MSR IEEE 233–236 37 Georgios Gousios Diomidis Spinellis 2012 GHTorrent’s data firehose 2012 9th IEEE Working Conference Mining Repositories MSR IEEE 12–21 38 Greg Guest Arwen Bunce Laura Johnson 2006 many interviews enough experiment data saturation variability Field methods 18 1 2006 
59–82 39 Stefan Haefliger Georg Von Krogh Sebastian Spaeth 2008 Code reuse open source Management science 54 1 2008 180–193 40 Steve Hanna Ling Huang Edward Wu Saung Li Charles Chen Dawn Song 2012 Juxtapp scalable system detecting code reuse among android applications International Conference Detection Intrusions Malware Vulnerability Assessment Springer 62–81 41 Hideaki Hata Raula Gaikovina Kula Takashi Ishio Christoph Treude 2021 Research artifact potential metamaintenance GitHub 2021 IEEEACM 43rd International Conference Engineering Companion Proceedings ICSECompanion IEEE 192–193 42 Hideaki Hata Raula Gaikovina Kula Takashi Ishio Christoph Treude 2021 file different changes potential metamaintenance github 2021 IEEEACM 43rd International Conference Engineering ICSE IEEE 773–784 43 Lars Heinemann Florian Deissenboeck Mario Gleirscher Benjamin Hummel Maximilian Irlbeck 2011 extent nature reuse open source java projects International Conference Reuse Springer 207–222 44 David W Hosmer Jr Stanley Lemeshow Rodney X Sturdivant 2013 Applied logistic regression John Wiley Sons 45 Katsuro Inoue Yuya Miyamoto Daniel German Takashi Ishio 2021 Finding codeclone snippets large sourcecode collection CCgrep Open Source Systems 17th IFIP WG 213 International Conference OSS 2021 Virtual Event May 12–13 2021 Proceedings 17 Springer 28–41 46 Mahmoud Jahanshahi Audris Mockus 2024 Dataset Copybased Reuse Open Source 2024 IEEEACM 21st International Conference Mining Repositories MSR IEEE 42–47 47 Lingxiao Jiang Ghassan Misherghi Zhendong Su Stephane Glondu 2007 Deckard Scalable accurate treebased detection code clones 29th International Conference Engineering ICSE’07 IEEE 96–105 48 Elmar Juergens Florian Deissenboeck Benjamin Hummel Stefan Wagner 2009 code clones matter 2009 IEEE 31st International Conference Engineering IEEE 485–495 49 Cory J Kapser Michael W Godfrey 2008 “Cloning considered harmful” considered harmful patterns cloning Empirical Engineering 13 2008 645–692 50 
Naohiro Kawamitsu Takashi Ishio Tetsuya Kanda Raula Gaikovina Kula Coen De Roover Katsuro Inoue 2014 Identifying source code reuse across repositories using lcsbased source code similarity 2014 IEEE 14th international working conference source code analysis manipulation IEEE 305–314 51 Stefan Koch Georg Schneider 2002 Effort cooperation coordination open source GNOME Information Systems Journal 12 1 2002 27–42 52 Rainer Koschke 2007 Survey research clones 53 Robert V Krejcie Daryle W Morgan 1970 Determining sample size research activities Educational psychological measurement 30 3 1970 607–610 54 Charles W Krueger 2001 Easing transition mass customization International Workshop ProductFamily Engineering Springer 282–293 55 Charles W Krueger 1992 reuse ACM Computing Surveys CSUR 24 2 1992 131–183 56 Piergiorgio Ladisa Henrik Plate Matias Martinez Olivier Barais 2023 Sok Taxonomy attacks opensource supply chains 2023 IEEE Symposium Security Privacy SP IEEE 1509–1526 57 Jure Leskovec Christos Faloutsos 2006 Sampling large graphs Proceedings 12th ACM SIGKDD international conference Knowledge discovery data mining 631–636 58 Zhenmin Li Lu Suvda Myagmar Yuanyuan Zhou 2006 CPMiner Finding copypaste related bugs largescale code IEEE Transactions Engineering 32 3 2006 176–192 59 Long Liang Xiaobo Wu Jing Deng Xin Lv 2022 Research Risk Analysis Governance Measures Opensource Components Information System Transportation Industry Procedia Computer Science 208 2022 106–110 httpsdoiorg101016jprocs202210017 60 Cristina V Lopes Petr Maj Pedro Martins Vaibhav Saini Di Yang Jakub Zitny Hitesh Sajnani Jan Vitek 2017 DéjàVu map code duplicates GitHub Proceedings ACM Programming Languages 1 OOPSLA 2017 1–28 61 Adolfo LozanoTello Asunción GómezPérez 2002 BAREMO choose appropriate component using analytic hierarchy process Proceedings 14th international conference engineering knowledge engineering 781–788 62 Yuxing Chris Bogart Sadika Amreen Russell Zaretzki Audris Mockus 2019 World code 
infrastructure mining universe open source VCS data 2019 IEEEACM 16th International Conference Mining Repositories MSR IEEE 143–154 63 Yuxing Tapajit Dey Chris Bogart Sadika Amreen Marat Valiev Adam Tutko David Kennard Russell Zaretzki Audris Mockus 2021 World code Enabling research workflow mining analyzing universe open source vcs data Empirical Engineering 26 2 2021 1–42 64 Yuxing Audris Mockus Russel Zaretzki Randy Bradley Bogdan Bichescu 2020 methodology analyzing uptake technologies among developers IEEE Transactions Engineering 48 2 2020 485–501 65 Mark Mason et al 2010 Sample size saturation PhD studies using qualitative interviews 66 Hafedh Mili Fatma Mili Ali Mili 1995 Reusing Issues research directions IEEE transactions Engineering 21 6 1995 528–562 67 Michael Mitzenmacher Eli Upfal 2017 Probability computing Randomization probabilistic techniques algorithms data analysis Cambridge university press 68 Audris Mockus 2007 Largescale code reuse open source First International Workshop Emerging Trends FLOSS Research Development FLOSS’07 ICSE Workshops 2007 IEEE 7–7 69 Audris Mockus 2019 Insights open source supply chains keynote Proceedings 2019 27th ACM Joint Meeting European Engineering Conference Symposium Foundations Engineering Tallinn Estonia ESECFSE 2019 Association Computing Machinery New York NY USA 3 httpsdoiorg10114533389063342813 70 Audris Mockus 2022 Tutorial Open Source Supply Chains httpsmockusorgpapersSSCISEC22pdf 71 Audris Mockus 2023 Securing Large Language Model Supply Chains httpsmockusorgpaperswocllmpdf ASE’23 LLMs Engineering 72 Audris Mockus Diomidis Spinellis Zoe Kotti Gabriel John Dusing 2020 complete set related git repositories identified via community detection approaches based shared commits Proceedings 17th International Conference Mining Repositories 513–517 73 Chinenye Okafor Taylor R Schorlemmer Santiago TorresArias James C Davis 2022 Sok Analysis supply chain security establishing secure design properties Proceedings 2022 
ACM Workshop Supply Chain Offensive Research Ecosystem Defenses 15–24 74 Joel Ossher Sushil Bajracharya Cristina Lopes 2010 Automated dependency resolution open source 2010 7th IEEE Working Conference Mining Repositories MSR 2010 IEEE 130–140 75 David Lorge Parnas 1972 criteria used decomposing systems modules Commun ACM 15 12 1972 1053–1058 76 Shi Qiu Daniel German Katsuro Inoue 2021 Empirical study dependencyrelated license violation javascript package ecosystem Journal Information Processing 29 2021 296–304 77 Baishakhi Ray Daryl Posnett Vladimir Filkov Premkumar Devanbu 2014 large scale study programming languages code quality github Proceedings 22nd ACM SIGSOFT international symposium foundations engineering 155–165 78 David Reid Mahmoud Jahanshahi Audris Mockus 2022 extent orphan vulnerabilities code reuse open source Proceedings 44th International Conference Engineering 2104–2115 79 Jeffrey Roberts IlHorn Hann Sandra Slaughter 2006 Understanding motivations participation performance open source developers longitudinal study apache projects Management Science 52 7 July 2006 984–999 80 Chanchal Kumar Roy James R Cordy 2007 survey clone detection research Queen’s School Computing TR 541 115 2007 64–68 81 Chanchal K Roy James R Cordy Rainer Koschke 2009 Comparison evaluation code clone detection techniques tools qualitative approach Science computer programming 74 7 2009 470–495 82 Julia Rubin Marsha Chechik 2013 survey feature location techniques 29–58 pages 83 Hitesh Sajnani Vaibhav Saini Jeffrey Svajlenko Chanchal K Roy Cristina V Lopes 2016 Sourcerercc Scaling code clone detection bigcode Proceedings 38th international conference engineering 1157–1168 84 Mohammadreza Samadi Alexander Nikolaev Rakesh Nagi 2016 subjective evidence model influence maximization social networks Omega 59 2016 263–278 85 Susan Elliott Sim Charles LA Clarke Richard C Holt 1998 Archetypal source code searches survey developers maintainers Proceedings 6th International Workshop 
Program Comprehension IWPC’98 Cat 98TB100242 IEEE 180–187 86 Manuel Sojer Joachim Henkel 2010 Code reuse open source development Quantitative evidence drivers impediments Journal Association Information Systems 11 12 2010 2 87 Chintakindi Srinivas Vangipuram Radhakrishna CV Guru Rao 2014 Clustering classification component efficient component retrieval building component reuse libraries Procedia Computer Science 31 2014 1044–1050 88 Student 1908 probable error mean 25 pages 89 Jane Sutton Zubin Austin 2015 Qualitative research Data collection analysis management Canadian journal hospital pharmacy 68 3 2015 226 90 Jeffrey Svajlenko Iman Keivanloo Chanchal K Roy 2013 Scaling classical clone detection tools ultralarge datasets exploratory study 2013 7th International Workshop Clones IWSC IEEE 16–22 91 Jeffrey Svajlenko Chanchal K Roy 2014 Evaluating modern clone detection tools 2014 IEEE international conference maintenance evolution IEEE 321–330 92 Jeffrey Svajlenko Chanchal K Roy 2015 Evaluating clone detection tools bigclonebench 2015 IEEE international conference maintenance evolution ICSME IEEE 131–140 93 Jason Tsay Laura Dabbish James Herbsleb 2014 Influence social technical factors evaluating contribution GitHub Proceedings 36th international conference engineering 356–366 94 Bogdan Vasilescu Alexander Serebrenik Vladimir Filkov 2015 data set social diversity studies github teams 2015 IEEEACM 12th working conference mining repositories IEEE 514–517 95 David Weiss Chi Tau Robert Lai 1999 productline engineering familybased development process AddisonWesley Longman Publishing Co Inc 96 Katrin Weller Katharina E KinderKurlanda 2016 manifesto data sharing social media research Proceedings 8th ACM Conference Web Science 166–172 97 Martin White Michele Tufano Christopher Vendome Denys Poshyvanyk 2016 Deep learning code fragments code clone detection 2016 31st IEEEACM International Conference Automated Engineering ASE IEEE 87–98 98 Dapeng Yan Yuqing Niu Kui Liu Zhe 
Liu Zhiming Liu Tegawendé F Bissyandé 2021 Estimating attack surface residual vulnerabilities open source supply chain 2021 IEEE 21st International Conference Quality Reliability Security QRS IEEE 493–502 99 Robert K Yin 2015 Qualitative research start finish Guilford publications 100 Yuhang Zhao Ruigang Liang Xiang Chen Jing Zou 2021 Evaluation indicators opensource review Cybersecurity 4 2021 1–24
::::
Forking Changed Last 20 Years Study Hard Forks GitHub Shurui Zhou Carnegie Mellon University USA Bogdan Vasilescu Carnegie Mellon University USA Christian Kästner Carnegie Mellon University USA ABSTRACT notion forking changed rise distributed version control systems social coding environments like GitHub Traditionally forking refers splitting independent development branch call hard forks research hard forks conducted mostly preGitHub days showed hard forks often seen critical may fragment community Today social coding environments opensource developers encouraged fork order contribute community call social forks may also influenced perceptions practices around hard forks revisit hard forks identify study classify 15306 hard forks GitHub interview 18 owners hard forks forked repositories find among others hard forks often evolve social forks rather planned deliberately perception hard forks indeed changed dramatically seeing often positive noncompetitive alternative original ACM Reference Format Shurui Zhou Bogdan Vasilescu Christian Kästner 2020 Forking Changed Last 20 Years Study Hard Forks GitHub 42nd International Conference Engineering ICSE ’20 May 23–29 2020 Seoul Republic Korea ACM New York NY USA 12 pages httpsdoiorg10114533778113380412
::::
1 INTRODUCTION notion forking opensource evolved Traditionally forking practice copying repository splitting new independent development often new name forking rare typically intended compete supersede original 15 30 32 Nowadays forks distributed version control systems public copies repositories developers make changes potentially necessarily intention integrating changes back original repository rise social coding explicit support distributed version control systems forking repositories explicitly promoted sites like GitHub Bitbucket GitLab indeed become popular 19 34 example identified 114120 GitHub projects 50 forks 9164 projects 500 forks June 2019 numbers rising quickly However modern forks forks traditional sense prior work 53 distinguish social forks referring creating public copy repository social coding site like GitHub often goal contributing original hard forks referring traditional notion splitting new development branch Hard forks discussed controversially throughout history free opensource one hand free opensource licenses codified right create hard forks seen essential guaranteeing flexibility fostering disruptive innovations 15 30 32 useful encouraging survivalofthefittest model 48 hand hard forks frequently considered risky projects since could fragment community lead confusion developers users 15 26 30 36 strong norm forking many well known hard forks exist eg LibreOffice Jenkins iojs see Fig 1 many well known cases communities survived healthy hard fork prominent exception BSD variants Prior research forking free opensource projects focused motivations behind hard forks 8 12 13 26 31 39 47 controversial perceptions around hard forks 6 15 26 30 36 49 outcomes hard forks including studying factors influence outcomes 39 49 However essentially research conducted rise social coding much SourceForge GitHub launched 2008 became dominant opensource hosting site around 2012 cf Fig 1 paper argue perceptions practices around forking could changed 
significantly since SourceForge’s heydays contrast strong norm forking back conjecture promotion social forks sites like GitHub often blurry line social hard forks may encouraged forking lowered bar also hard forks time advances tooling especially distributed version control systems like Git 7 transparency mechanisms social coding sites 10 may enabled new opportunities Figure 1 Timeline popular opensource forking events popularity approximated Google Trends changed common practices perceptions professionalization opensource development increasing involvement corporations even corporate ownership opensource projects may tilted perceptions Therefore argue time revisit replicate extend research hard forks asking central question work perceptions practices around hard forks changed Updating deepening understanding regarding practices perceptions around hard forks inform design better tools management strategies facilitate efficient collaboration Furthermore attempt automate process identifying hard forks among social forks quantifying frequent hard forks across GitHub previous research cover Using mixedmethods empirical design combining repository mining 18 developer interviews investigate Frequency hard forks attempt quantify frequency hard forks among mostly social forks GitHub Specifically design refine classifier automatically detect hard forks find 15306 instances showing hard forks significant concern even though relative numbers low Common evolution patterns hard forks classify evolution hard forks corresponding upstream repository observe outcomes including whether fork upstream repositories sustain activities whether synchronize development develop classification visualizing qualitatively analyzing evolution patterns using card sorting subsequently automate classification process analyze detected hard forks find many hard forks sustained extended periods substantial number hard forks still least occasionally exchange commits upstream repository Perceptions 
hard forks interviews 18 opensource maintainers forks corresponding upstream repositories solicit practices perceptions regarding hard forks analyze whether align ones reported presocialcoding research find ‘stigma’ often reported around hard forks largely gone indeed forks including hard forks generally seen positive many hard forks complementing rather competing upstream repository Furthermore social forking encouraging forks contribution mechanism find many hard forks deliberately planned evolve slowly social forks Overall contribute 1 method identify hard forks 2 dataset 15306 hard forks GitHub 3 classification analysis evolution patterns hard forks 4 results interviews 18 open source developers reasons hard forks interactions across forks perceptions hard forks research focuses development practices GitHub far dominant opensource hosting platform cf Fig 1 key establishing social forking phenomenon Even large projects primarily hosted sites often public mirror GitHub allowing us gather fairly representative picture entire opensource community main research instruments semistructured interviews openended questions repository mining GitHub API research planned exact replication prior work exceeds scope prior studies comparing social hard forks many facets seek replicate prior findings eg regarding motivations outcomes hard forks considered conceptual replication 24 43
::::
2 PAST RESEARCH FORKING
::::
21 Types Forking popularly understood ‘forking project’ changed last decades line prior work 53 distinguish hard forks social forks Hard forks Traditionally forking refers copying order continue separate often competing line development name direction also typically change Developers might fork eg unhappy direction governance deciding create divergent version line vision 15 preGitHub days ways contribute opensource varied widely rather using public forks one would typically create local copies make changes send patch files Social forks Popularized GitHub ‘forking’ also refers public copies opensource repositories often created shortterm feature implementation often intention contributing back upstream repository fork GitHub thus typically intended start independent development line uniform mechanism distributed development thirdparty contribution ie pull requests 10 19 fact forking function GitHub frequently used even bookmarking mechanism keep copy without intention performing changes 25 GitHub nowadays forms forking exist conjecture vast majority forks social forks however obvious distinguish two kinds without closer analysis technical level forks created cloning repositories distributed version control systems case fork maintains history upstream simply copying files starting new history latter common preGitHub days forks created directly GitHub clone automatically created GitHub tracks visually shows relationship fork upstream projects significant research hard forks social forks hardforking research typically older conducted almost exclusively GitHub social forking Research social forking recent focuses much contribution process issues around managing contributions single
::::
22 Motivations Forking Reasons developers might create hard fork existing opensource vary widely Motivations forks studied primarily SourceForge advent social coding environment 8 12 13 26 31 39 47 per Robles GonzálezBarahona 39 common motivations hard forks Technical Variants targeting specific needs user segments accommodated upstream common motivation 31 grows matures contributors’ goals perspectives may diverge may want take different direction taken extreme hard forks used variant management multiple related different projects originating source maintained separately 3 13 14 45 Governance disputes contributors created hard forks feel feedback heard maintainers accepting patches slowly original hard fork even threat creating one help developers negotiate governance disputes 17 recent examples hard forks caused governance disputes include Nodejs 42 50 Docker 51 common forms disputes occur companies involved try influence direction try closesource monetize future versions Hudson OpenOffice Discontinuation original hard fork revive original developers ceased work example back 1990s Apache web server took abandoned NCSA HTTPd Commercial forks Companies sometimes fork opensource projects create branded version sometimes enhanced closedsource features example Apple’s fork KDE’s KHTML rendering engine Webkit Legal reasons might consider different licenses trademark dispute may arise changes laws eg regarding encryption require technical changes Hard forks used split development different jurisdictions Personal reasons Interpersonal disputes irreconcilable differences nontechnical nature lead rift various parties forks OpenBSD classic example contrast older work hard forks recent work also investigated motivation practices behind social forks example Fung et al 16 report 14 percent active forks nine popular JavaScript projects integrated back changes Subsequently researchers studied social forks larger scale reported around 50 percent forks GitHub never integrate code 
changes back 23 53 addition Jiang et al 23 reported 10 percent study participants used forks backup purposes study revisit question motivation hard forks explore whether changed rise social coding
::::
23 Outcomes Hard Forks Wheeler 49 Robles GonzálezBarahona 39 distinguish five possible outcomes hard forks
Successful branching typically differentiation original fork succeed remain active prolonged period time fragmenting community smaller subcommunities BSD variants notable examples
Fork merges back upstream fork sustain independence merges changes back upstream eg resolving dispute triggered hard fork first place iojs fork Nodejs 50
Discontinuation fork fork initially active sustain activity example libc split glibc glibc maintainers invested improvements win back users fork failed
Discontinuation upstream fork outperforms upstream upstream discontinued fork revives already dead upstream example XFree86 moved away GPLcompatible license forked created Xorg quickly adopted developers users soon XFree86 core team disbanded development ceased
fail projects fail fork fails revive dead
Wheeler 49 conjectured rare fork upstream sustain activities Robles GonzálezBarahona 39 quantified frequency outcome sample 220 forked opensource projects referenced Wikipedia 2011 ie selection biased toward wellknown projects achieved certain level success found successful branching common 43.6% followed discontinuation fork 29.8% discontinuation upstream 13.8% failure merges relatively rare 8.7% 3.2%
::::
24 Pros Cons Hard Forks Hard forks long discussed controversially 90s 2000s forking seen important right also something avoid possible unless last resort strong norm forking fragments communities cause hard feelings people involved free movement traditionally seen forking something avoid forks split community introduce duplicate effort reduce communication may produce incompatibilities 39 Specifically tear community apart meaning people community pick sides 6 15 26 30 36 49 fragmentation also threaten sustainability opensource projects scarce resources additionally scattered changes need performed redundantly across multiple projects eg 3D printer firmware Marlin fixed issue PR 10119 two years problem fixed hard fork Ultimaker PR 118 time right forking also seen important political tool community threat fork alone cause leaders pay attention issues may ignore otherwise issues actually important potentially improve current practices 49 contrast social forks seen something almost exclusively positive actively encouraged 4 mechanism contribute opensource projects actively embrace external contributors 19 46 Although maintainers complain burden dealing many thirdparty contributions 21 46 researchers warn inefficiencies regarding lost contributions duplicate work 38 52 53 aware calls constrain social forking Importantly though study show distinction social hard forks fluent Social coding platforms contain kinds forks always easy distinguish Diffusion efforts fragmentation communities always feared discussions hard forks observed also GitHub Many secondary forks ie forks forks contribute forks original repository forks slowly drift apart 16 45 key question thus whether popularity social forking encourages also hard forks causes similar fragmentation sustainability challenges feared past believe necessary revisit hard forking rise social coding GitHub Specifically aim understand hardfork phenomenon current socialforking environment understand perceptions practices may 
changed
::::
3 RESEARCH QUESTIONS METHODS described Sec 2 conventional use term forking well corresponding tooling changed rise distributed version control social coding platforms conjecture also influenced hard forks Hence overall research question perceptions practices around hard forks changed explore different facets hard forks including motivations outcomes perceived stigma cf Sec 2 also attempt identify frequent hard forks across GitHub discuss developers navigate tension often blurry line social hard forks adopt concurrent mixedmethod exploratory research strategy 9 combine repository mining – identify hard forks outcomes – interviews maintainers forks upstream projects – explore motivations perceptions Mixing multiple methods allows us explore research question simultaneously multiple facets triangulate results addition use results repository mining guide selection interviewees explicitly decided exact replication 24 43 prior work contexts changed significantly Instead guide research previously explored facets hard forks revisit part repository mining interviews contrast findings reported preGitHub studies addition limit research previously explored facets explicitly explore new facets tension social hard forks emerged technology changes discovered interviews 31 Instrument Visualizing Fork Activities created commit history graphs custom visualization commit activities forks illustrated Figure 2 help develop debug classifiers Sec 32 33 also prepare interviews Given pair fork corresponding upstream repositories clone analyze joint commit graph two assigning every commit two one five states 1 created forking point 2 upstream synchronized 3 fork unmerged 4 created upstream synchronized fork 5 created fork merged upstream Technically nutshell build prior commit graph analysis 53 merge edges assigned weight 1 edges weight 0 shortest path commit branch either fork upstream repository identifies commit originates whether merged direction1 subsequently plot activities two 
repositories time aggregated threemonth intervals larger dots indicate commits plots include additional arrows synchronization upstream fork merge fork upstream activities plots quickly visually inspect development activities forking point well whether fork upstream repository interact

32 Identifying Hard Forks Identifying hard forks reliably challenging PreGitHub work often used keyword searches descriptions eg ‘software fork’ relied external curated sources eg Wikipedia 39 Today sites like GitHub hard forks use mechanisms social forks without explicit distinction

Classifier development work want gather large set hard forks even approximate frequency hard forks among 47 million forks GitHub end need scalable automated classifier aware existing classifier except prior work 53 classified forks hard forks least two pull requests least 100 unmerged commits project’s name changed Unfortunately found classifier missed many actual hard forks false negatives thus went back drawing board develop new one proceeded iteratively repeatedly trying validating combining various heuristics would try heuristic detect hard forks manually sample significant number classified forks identify false positives false negatives revising heuristic combining steps Commit history graphs cf Sec 31 qualitative analysis forks Sec 33 useful debugging devices process iterated reached confidence results low rate false positives

Footnote 1 There nuances process due technicalities Git GitHub example upstream repository deletes branch forking joint commit graph would identify code exclusive fork end discard commits older forking timestamp GitHub details available opensource implementation httpsgithubcomshuiblueVisualHardFork

final classifier proceeds two steps first use multiple simple heuristics identify candidate hard forks second use detailed expensive analysis decide candidates actual hard forks first step identify candidate hard forks among repositories labeled forks GitHub
Contain phrase “fork of”
description H1 use GitHub’s search API find repositories contain phrase “fork of” description fork another idea inspired prior work 31 look projects explicitly label forks defined “selfproclaimed forks” ie developers explicitly change description cloning upstream repository work around GitHub’s API search limit 1000 results per query partitioned query based different time ranges repository created Next compare description fork upstream make sure description copied upstream ie upstream already selfproclaimed fork
Received external pull requests H2 Using June 2019 GHTorrent dataset 18 identified GitHub repositories labeled forks received least three pull requests excluding pull requests issued fork’s owner avoid counting developers use process feature branches consider external contributions fork signal fork may attracted community
substantial unmerged changes H3 Using GHTorrent dataset identify forks least 100 commits indicating significant development activities beyond typical social forks
least 1year development activity H4 Similar previous heuristic look prolonged development activities beyond common social forks Specifically identify forks candidates time first last commit spans one year
changed name H5 check fork’s name GitHub changed upstream repository’s name Levenshtein distance ≥ 3 heuristic comes observation social forks change names forks intending go different direction create separate community tend change names commonly eg Jenkins forked Hudson

repository meets least one criteria considered candidate show many candidates heuristic identified second column Fig 3b Note heuristics use GHTorrent additionally validated results checking whether fork upstream pair still exist GitHub whether measures align reported GitHub API

line prior work 25 53 remove repositories using GitHub document storage course submission – among forked projects GitHub Specifically manual review discard repositories containing keywords ‘homework’ ‘assignments’ ‘course’ ‘codecamp’
‘documents’ description
discard repositories whose name starts ‘awesome’ usually document collections
discard repositories programminglanguagespecific files per GitHub’s language classification queried API
discard candidates fewer three stars GitHub Stars lightweight mechanism developers indicate interest common measure popularity threshold three stars low still requires minimum amount public interest According GHTorrent data 125 million GitHub repositories 2 million repositories 1.6% three stars
discard candidates without commits fork typically projects performed name change single postfork action
discard candidates 30 commits fork merged upstream indicates social forks active contributions upstream
candidates identified 100 commits 1 year activity discard thresholds met considering unmerged commits exclusive fork
discard candidates owned developers contributed 30 commits pull requests upstream repository typically indicates core team members upstream using social forks feature development
discard candidates fork created right upstream stopped updating fork owned organization account upstream owned user account common pattern observed indicating ownership transfer

classifier identifies total 15306 hard forks across GitHub Fig 3b show heuristics identified hard forks overlap different heuristics Fig 3a

Rule   Candidates  Actual
H1     10609       551
H2     23109       7043
H3     14956       810
H4     33073       11268
H5     20358       5568
Total  63314       15306

Classifier validation validate precision classifier manually inspected random sample 300 detected hard forks manually analyzing fork’s upstream repository’s history commit messages classified 14 detected hard forks likely false positives suggesting acceptable accuracy 95% Note manual labeling best effort approach well distinction social hard fork always clear see also discussion interview results Sec 44 Analyzing false negatives recall challenging hard forks rare projects listed previous papers old detect GitHub dataset aware labeled dataset manually curated
list known hard forks mentions web resources mentions interviews 3 hard forks fork upstream repository GitHub detect classifier size labeled dataset small make meaningful inferences recall
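
The first classifier step above (heuristics H1 to H5) can be illustrated with a small sketch. This is not the authors' implementation: the `ForkInfo` record, its field names, and the function names are hypothetical, the one-year span is approximated as 365 days, and names are compared case-insensitively as an assumption. Only the thresholds (the phrase "fork of", three external pull requests, 100 commits, one year of activity, Levenshtein distance of at least 3) come from the text. The second-step filters (stars, merged-commit share, ownership transfer) are not covered here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ForkInfo:
    """Hypothetical record holding the metadata the heuristics need."""
    name: str
    upstream_name: str
    description: str
    upstream_description: str
    external_pull_requests: int  # pull requests not opened by the fork's owner
    commits: int                 # commits made in the fork
    first_commit: date
    last_commit: date


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def candidate_heuristics(f: ForkInfo) -> set:
    """Return the labels of all heuristics that flag this fork as a candidate."""
    hits = set()
    desc = f.description.lower()
    # H1: self-proclaimed fork whose description was not copied from upstream
    if "fork of" in desc and desc != f.upstream_description.lower():
        hits.add("H1")
    # H2: received external pull requests (threshold of three, per the text)
    if f.external_pull_requests >= 3:
        hits.add("H2")
    # H3: at least 100 commits
    if f.commits >= 100:
        hits.add("H3")
    # H4: first and last commit at least one year apart (assumed: 365 days)
    if (f.last_commit - f.first_commit).days >= 365:
        hits.add("H4")
    # H5: name differs from the upstream name by Levenshtein distance >= 3
    if levenshtein(f.name.lower(), f.upstream_name.lower()) >= 3:
        hits.add("H5")
    return hits
```

A fork is a candidate when `candidate_heuristics` returns a non-empty set. Returning the set of labels rather than a boolean keeps the per-heuristic overlap available, which is the kind of information the candidate counts per rule summarize.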
::::
33 Classifying Evolution Patterns identified different evolution patterns among analyzed forks using iterative approach inspired card sorting 44 Evolution patterns describe hard fork corresponding upstream coevolve help characterize forking outcomes addition used evolution patterns diversify interviewees Specifically printed cards commit history graphs 100 randomly selected hard forks see Sec 32 three authors jointly grouped cards identified common patterns cardsorting open meaning predefined groups groups emerged evolved analysis process Afterward manually built classifier detects forks identified pattern applied classifier entire dataset inspected automatically classified forks actually fit patterns intended refining classifier thresholds needed picked another 100 hard forks fit none previously defined patterns sorted looking additional patterns similarly proceeded within pattern looking 100 hard forks see whether split pattern repeated process could identify patterns several iterations arrived stable list 15 patterns could classify 977 hard forks list patterns corresponding example commit history graph Tab 2 patterns use characteristics relate previously found outcomes fork upstream discontinued also consider additional characteristics corresponding features available easily observable distributed version control eg whether fork upstream merge synchronize present patterns hierarchical form process revealed classification fairly obvious tree structure specifically looking hierarchical structure
::::
34 Interviews solicit views perceptions conducted 18 semistructured interviews developers typically 20 40 minutes Despite reaching fewer developers opted interviews rather surveys due exploratory nature research Interviews allow indepth exploration emerging themes Interview protocol designed protocol 2 covers relevant dimensions earlier research touches expected changes including reasons forking perceived stigma forking distinction possible tensions social hard forks asked fork owners decision process lead hard fork practices afterward eg renamed projects current relationship upstream eg whether still monitor even synchronize future plans contrast asked owners upstream projects extent aware interact monitor hard forks degree concerned forks even take steps avoid addition asked participants long history opensource activity observed changes practices perceptions others time interviews semistructured allowing exploration topics brought participants interview protocol evolved interview reacted confusion questions insights found earlier interviews refined added questions explore new insights detail subsequent interviews – example first interviews added questions tradeoff inclusive changes versus risking hard forks questions regarding practices tooling coordinate across repositories ground interview concrete experience rather vague generalizations focused interview single repository interviewee involved bringing questions back specific repository discussion became generic Participant recruitment selected potential interviewees among maintainers 15306 identified hard forks corresponding upstream repositories consider maintainers public email address GitHub profile active analyzed repositories within last 2 years reduce risk misremembering sampled candidates evolution patterns Sec 33 sent 242 invitation emails Overall 18 maintainers volunteered participate study 7 response rate Ten opted interviewed email one
::::
Table 1 Background information participants
Par Domain StarsU StarsF LOC Role Expyr
P1 Blockchain 20 10 10K F 19
P2 Reinforcement learning 10K 1K 30K F 3
P3 Mobile processing 70 20K F 6
P4 Video recording 100 300K F 18
P5 Helpdesk system 2K 10 800K F 5
P6 CRM system 30 200 800K F 10
P7 Physics engine 300 100K F 15
P8 Social platform 500 230 500K F 20
P9 Reinforcement learning 20 20 30K 2ndF 3
P10 Game Engine 500 10 200K 2ndF 21
P11 Networking 300 100 500K F 10
P12 Email library 10K 20K FU 32
P13 Game engine 3K 70 20K F 11
P14 Machine learning 30K 50 60K F 8
P15 Image editing 70 10 20K F 20
P16 Image editing 70 10 20K U 10
P17 Microcontrollers 9K 1K 300K U 6
P18 Maps 400 10 100K U 9
F Hard Fork Owner U Upstream Maintainer 2ndF Fork Hard Fork upstream projects GitHub number stars unknown Numbers rounded one significant digit
chat app others phone teleconferencing Table 2 map interviewees evolution pattern primary fork discussed though interviewees may multiple roles different projects Naturally interviewees biased toward hard forks still active response rate also lower among maintainers upstream repositories maybe less invested talking forking Table 1 list information interviewees primary hard fork discussed interviewees experienced opensource developers specifically many 10 years experience participating opensource community meaning interacted earlier opensource platform SourceForge interviews reached saturation last interviews provided marginal additional insights Analysis analyzed interviews using standard qualitative research methods 41 transcribing interviews two authors coded interviews independently authors subsequently discussed emerging topics trends Questions disagreements discussed resolved together needed asking follow questions interviewees
::::
35 Threats Validity Credibility study exhibits threats validity credibility typical expected kind exploratory interview studies used analysis archival GitHub data Distinguishing social hard forks difficult even human raters distinction primarily one intention experience make judgment call high interrater reliability forks always repositories cannot accurately classified without additional information build evaluate classifiers based best effort strategy discussed check later steps data GitHub API early steps identify candidate hard forks may affected missing incorrect data GHTorrent dataset addition history Git repositories reliable timestamps may incorrect users rewrite histories fact addition merges difficult track code changes merged new commit ‘squashing’ rather traditional merge commit consequence despite best efforts inaccuracies classification hard forks individual commits expect lead underreporting hard forks underreporting merged code analyze data rightcensored time series data detect projects ceased activity past cannot predict future thus seeing larger chance older forks discontinued study limited hard forks fork upstream repository hosted GitHub forking relationship tracked GitHub GitHub far dominant hosting service open source study cover forks created typically older projects hosted elsewhere forks created manually cloning copying source code new repository addition interviews typical interview studies field biased toward answer developers chose make email public chose answer interview request underrepresented maintainers upstream repositories sample
::::
4 RESULTS explore practices perceptions around hard forks along four facets emerged interviews data
::::
41 Frequency Hard Forks classifier identified 15306 hard forks confirming hard forks generally rare phenomenon June 2019 GitHub tracks 47 million repositories marked forks 5 million distinct upstream repositories among GitHub’s 125 million repositories Among vast majority forks activity forking point stars active forks limited activity indicative social forks 02 GitHub’s 47 million forks 3 stars analysis evolution patterns Tab 2 reveals cases upstream repository hard fork remain active extended periods time common patterns 1 2 4–7 1157 hard forks 88 hard forks actually survive upstream upstream active fork created patterns 8–11 7280 hard forks 476 many also run steam eventually patterns 3 12–15 6671 hard forks 436 hard forks created forks active projects patterns 4–15 14254 hard forks 93 substantial number cases hard fork created revive dead pattern 1–3 1052 hard forks 68 cases even triggering coinciding revival upstream pattern 2 56 hard forks 036 also hard fork sustain activity pattern 3 420 hard forks 27 Discussion implications Even though percentage hard forks low total number attempted sustained hard forks Considering significant cost hard fork put community fragmentation also potential power community hard forks argue hard forks important phenomenon study even comparably rare Whereas previous work typically looked small number hard forks research tooling around hardfork issues typically focus well known projects variants BSD 35 Marlin 28 artificial academic variants 14 22 detected significant number hard forks many recent using many different languages rich pool future research release dataset hard forks corresponding visualizations dataset paper 2
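Several of the group totals and percentages quoted in this subsection can be recomputed from the per-pattern counts listed in Tab 2. The sketch below does that arithmetic; the counts are taken from the table, while the `group_total` and `share` helper names are illustrative assumptions.

```python
# Per-pattern hard fork counts as listed in Tab 2 (pattern id -> count).
PATTERN_COUNTS = {
    1: 576, 2: 56, 3: 420,
    4: 26, 5: 107, 6: 28, 7: 562,
    8: 174, 9: 686, 10: 107, 11: 6313,
    12: 388, 13: 762, 14: 199, 15: 4902,
}

TOTAL = sum(PATTERN_COUNTS.values())


def group_total(pattern_ids) -> int:
    """Number of hard forks falling into the given patterns."""
    return sum(PATTERN_COUNTS[i] for i in pattern_ids)


def share(pattern_ids) -> float:
    """Percentage of all detected hard forks falling into the given patterns."""
    return 100 * group_total(pattern_ids) / TOTAL


revive_dead = group_total(range(1, 4))                  # patterns 1-3
forked_from_active = group_total(range(4, 16))          # patterns 4-15
fork_lived_longer = group_total(range(8, 12))           # patterns 8-11
ran_out_of_steam = group_total([3, *range(12, 16)])     # patterns 3 and 12-15

for label, ids in [
    ("patterns 8-11", range(8, 12)),
    ("patterns 3, 12-15", [3, *range(12, 16)]),
    ("patterns 4-15", range(4, 16)),
]:
    print(f"{label}: {group_total(ids)} hard forks, {share(ids):.1f}%")
```

The per-pattern counts sum to the 15306 hard forks reported by the classifier, and the groups for patterns 8 to 11, 3 plus 12 to 15, and 4 to 15 reproduce the totals given in the text.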
::::
42 Hard Forks Created Avoid first glance interviewees give reasons creating hard forks align well prior findings including especially continuing discontinued projects projects unresponsive maintainers P1 P2 P8 disagreements around governance P2 P12 diverging technical goals target populations P3 P5 Table 2 Evolution patterns hard forks Id Category Total Subcategory Example Count Interviewees 1 Success F active 2 Qt 632 Upstream remains inactive 576 P12 2 Revive Dead Upstream active 56 3 success F active 2 Qt 420 420 4 merge 26 P10 5 Alive 723 sync 107 P2 P13 P15 6 merge sync 28 P9 7 interaction 562 P1 P3 P4 P5 P7 P14 8 merge 174 9 Fork Lived Longer 7280 sync 686 10 Forking Active merge sync 107 11 interaction 6313 P6 P8 P11 12 merge 388 13 Fork live upstream 6251 sync 762 14 merge sync 199 15 interaction 4902 P6 P11 P13 P14 P17 discussed identified 1052 hard forks Tab 2 patterns 1–3 68 forked inactive interesting common theme emerged interviews though many hard forks deliberately created hard forks initially half interviewees described initially created fork intention contributing upstream repository social fork faced obstacles decided continue Common obstacles unresponsive maintainers P1 P2 P8 rejected pull requests P11 P13 P14 typically change considered beyond scope example P2 described “before forking started opening issues pull requests lack response part got news 2 months fork getting interest others” Similarly maintainers reported fork initially created minor personal changes evolved hard fork changes became elaborate others found useful P2 P14 P17 example P14 described upstream constantly evolving code base became quickly incompatible libraries decided fix issue also adding functionality people found fork started migrate Several maintainers also explicit thoughts avoid hard forks maintainers projects forked fork owners may forked largely mirror common reasons forking ie transparent governance responsive inclusive feature requests example P2 suggests 
reactive community thus considers unlikely forked similarly P16 decided generally “respond issues timely manner make good” faith effort incorporate PRs possibly fix issues add features needs arrives” reduce need hard forks Beyond P2 also mentioned created contributing guide issue templates coordinate contributors efficiently P14 suggested “credit contributors” explicitly release notes order keep contributors stay community Discussion Implications Whereas forking typically seen deliberate decision preGitHub days required explicit steps set repository fork find new name nowadays many hard forks seem happen without much initial deliberation Social coding environments actively encourage forking contribution mechanism significantly lowers bar create fork first place without think new name potential consequences like fragmenting communities fork exists initially created social fork seems often gradual development developers explicitly consider fork separate development line fact many hard forks seem triggered rather small initial changes interview results align observation 36 detected hard forks GitHub changed project’s name cf Fig 3a importantly theme emerged throughout interviews hard forks likely avoidable general project’s tension specific begin general one hand projects inclusive community contributions risk becoming large broad become expensive maintain eg P17 suggests maintainers need take maintenance thirdparty contributions niche use cases difficult use eg lots configuration options much complexity hand projects staying close original vision keeping narrow scope may remain focused smaller easier maintain code base risk alienating users fit original vision may create hard forks One could argue hard forks good test bed contributions diverge original despite costs community fork dies might suggest lack support may good decision integrate contributions main context family related projects serve slightly different needs target populations still coordinate may way 
overcome specificitygenerality dilemma supporting multiple projects specific mission together target significant number use cases However current technology support coordination across multiple hard forks well discuss next 43 Interactions Fork Upstream Repository Many interviewees indicate interested coordinating across repositories either merging changes back upstream eventually monitor activity upstream repository incorporate select changes hard fork owners see competing upstream rather part larger instance although fork owner P13 1500 commits ahead upstream still said “I would consider independent relying upstream could make independent stop getting improvements it’s credit make easy many hundreds developers contribute patches accept patches regulate goes well makes merging changes fork much easier” P4 P11 indicate would like merge reason hard fork disappears typically governance practices personal disputes Also upstream maintainers tend usually interested happens forks example P17 maintainer thousands mostly social forks said “I try aware important forks try get know person fork follow activities extent” However even though many interviewees expressed intentions see little evidence actual synchronization merging across forks repositories example P1 P4 P8 P11 mention interested eventually merging back upstream repository done yet concrete plans point Similarly P2 P6 P10 indicate interested changes upstream projects actually monitor synchronized long time evolution patterns similarly show synchronization upstream fork merging fork upstream rare 1618 hard forks active upstream repositories ever synchronize merge Tab 2 patterns 4–6 8–10 12–14 might explain difference intentions observed actions synchronization merging becomes difficult two repositories diverge substantially monitoring repositories becoming overwhelming current tools example P2 reports occasionally synchronize minor improvements fork diverged much synchronize larger changes P10 experienced problems 
synchronizing frequently thus faced incomplete implementations selectively synchronizes features interest line prior observations monitoring change feeds 5 10 33 52 interviewees report systematically monitoring changes repositories onerous current tools like GitHub’s network graph difficult use scale P11 P16 Discussion Implications Tooling changed significantly since preGitHub days prior studies hard forks may allow new forms collaboration across forks Git specifically supports merges across distributed version histories well selectively integrating changes ‘cherry picking’ feature GitHub similar social coding pages track forks allowing developers subscribe changes select repositories generally make changes forks transparent 10 11 52 Essentially interviewees familiar GitHub’s network view 1 visually shows contributions time across forks branches Even though advances tooling provide new opportunities coordination across multiple forks maintainers interested coordinating considering multiple forked projects part larger community current tools support use case well Current tools work well shortterm social forks tend work less well coordinating changes across repositories diverged significantly provides opportunities researchers explore tooling concepts monitor manage integrate changes across family hard forks Recent academic tools improved monitoring 33 52 crossfork change migration 35 37 potentially promising yet accessible easily practitioners Also experimental ideas virtual productline platforms unify development multiple variants 3 14 29 40 45 may provide inspiration maintaining coordinating hard forks though typically currently support distributed nature development competing hard forks technical solution could solve specificitygenerality dilemma cf Sec 42 allowing subcommunities handle specific features without overloading upstream without fragmenting overall community believe dataset 15306 hard forks useful develop evaluate tools realistic setting 44 
Perceptions Hard Forking discussion maintainers confirmed line hard forks social forks somewhat subjective prompted could draw distinctions largely mirror definition longterm focus extensive changes fork community example P2 agree fork independent upstream different goals suggests fork better code quality better community management practices remaining connection upstream bug fixes incorporates time time Also P6 considers fork independent given quicker release cycle significantly refactoring code base interviewees dominant meaning fork social fork asked perceptions forks interviewees initially thought social forks strong positive associations eg others contributing onboarding newcomers finding collaborators generally fostering innovation instance P6 described advantages social forking “it encourages developers go direction original may gone” similarly P9 thought “it could boost creative ideas communities” One interviewee also mentioned young projects primarily focus growth forked positive signal meaning useful people Social forks dominant interviewees’ mind default frequently refocus interview hard forks asked specifically hard forks several interviewees raised concerns potential community fragmentation P4 P6 P17 worried incompatibilities especially confusing end users P3 P9 P14 P17 would preferred see hardfork owners contribute upstream instead P3 P8 P12 However concerns mostly phrased hypotheticals contrasted positive aspects Many interviewed owners hard forks see competing upstream repository consider address different problem target different user population example P10 described fork “light version” upstream targeting different group users understandable hardfork owners see forks justified also interviewed owners upstream projects positive opinions forks example P17 expressed forks good reason focus different target population case beginners forks may benefit larger community bringing users P18 suggested even would support contribute forks occasionally 
contributing long benefit larger community Discussion Implications Overall see perception forking significantly changed compared perceptions reported earlier work Forking used rather negative connotation preGitHub days largely regarded last resort avoided fragment community confuse users GitHub’s rebranding word forking stigma around hard forking seems mostly disappeared word mostly positive connotations developers associated positively external contributors community still concern community fragmentation rarely concrete concern actual reasons behind hard fork Transparent tooling seems help acceptance considering multiple hard forks part larger community mutually benefit expect favorable view combined lower technical barriers Sec 42 higher expectations coordination Sec 43 makes hard forks phenomenon expect see However positive expectations turn frustration disengagement valuable contributors sustain open source fragmentation leads competition confusion coordination breakdowns due insufficient tooling right tooling coordination merging think hard forks powerful tool exploring new larger ideas testing whether sufficient support features ports niche requirements new target audiences eg solving specificitygenerality dilemma discussed Sec 42 deliberate process end though necessary explicitly understand hard forks part larger community around possibly even explicitly encourage hard forks specific explorations beyond usual scope social forks believe many ways support development hard forks coordinate distributed developers beyond social coding site offer small scale today Examples include 1 early warning system alerts upstream maintainers emerging hard forks eg external bots maintainers could use encourage collaboration competition fragmentation desired 2 way declare intention behind fork eg explicit GitHub support dashboard show multiple projects important hard forks interrelate eg pointing hard forks provide ports specific operating systems 3 means identify essence 
novel contributions forks eg history slicing 27 code summarization 52
::::
5 CONCLUSION rise social coding explicit support distributed version control systems forking repositories explicitly promoted sites like GitHub become popular However modern forks hard forks traditional sense paper automatically detected hard forks evolution patterns interviewed opensource developers forks upstream repositories study perceptions practices found perceptions practices indeed changed significantly Among others hard forks often evolve social forks rather planned deliberately developers less concerned community fragmentation frequently perceive hard forks positive noncompetitive alternatives original projects also outlined challenges suggested directions future work Acknowledgements Zhou Kästner supported part NSF awards 1552944 1717022 1813598 AFRL DARPA FA87501620042 Vasilescu supported part NSF awards 1717415 1901311 Alfred P Sloan Foundation REFERENCES 1 2008 GitHub Network View httpshelpgithubcomenarticlesviewingarepositorysnetwork 2 2020 Appendix httpsgithubcomshuiblueICSE20hardforkappendix 3 Michal Antkiewicz Wenbin Ji Thorsten Berger Krzysztof Czarnecki Thomas Schmorleiz Ralf Lämmel Ştefan Stănciulescu Andrzej Wąsowski Ina Schaefer 2014 Flexible Product Line Engineering Virtual Platform Proc Intl Conf Engineering ICSE ACM 532–535 4 Matt Asay 2014 fork next opensource Blog Post httpswwwtechrepubliccomarticlewhyyoushouldforkyournextopensourceproject 5 Christopher Bogatin Christian Kästner James Herbsleb Ferdian Thung 2016 Break API Cost Negotiation Community Values Three Ecosystems Proc Int’l Symposium Foundations Engineering FSE ACM 109–120 6 Pete Bratach 2017 Open Source Projects Fork Blog Post httpsthenewstackioopensourceprojectsfork 7 Caius Brindescu Mihai Codoban Sergiu Shmaruktiachi Danny Dig 2014 Centralized Distributed Version Control Systems Impact Changes Proc Int’l Conf Engineering ICSE ACM 323–333 8 Bee Bee Chua 2017 Survey Paper Open Source Forking Motivation Reasons Challenges 21st Pacific Asia Conference Information Systems PACIS 75 
9 John W Creswell J David Creswell 2017 Research design Qualitative quantitative mixed methods approaches Sage publications 10 Laura Dabbish Colleen Stuart Jason Tsay Jim Herbsleb 2012 Social coding GitHub transparency collaboration open repository Proc Conf Computer Supported Cooperative Work CSCW ACM 1277–1286 11 Laura Dabbish Colleen Stuart Jason Tsay James Herbsleb 2013 Leveraging transparency IEEE 30 1 2013 37–43 12 James Dixon 2009 Forking Protocol Fork Open Source Blog Post httpsjamesdixonwordpresscom20090513differentkindsofopensourceforkssaladdinnerandfish 13 Neil Ernst Steve Easterbrook John Mylopoulos 2010 Code forking opensource requirements perspective arXiv preprint arXiv10042889 2010 14 Stefan Fischer Lukas Linsbauer Roberto Erick LopezHerrejon Alexander Egyed 2014 Enhancing cloneandown systematic reuse developing variants Proc Int’l Conf Maintenance ICSM IEEE 391–400 15 Karl Fogel 2005 Producing open source run successful free O’Reilly Media Inc 16 Kam Hay Fung Aybüke Aurum David Tang 2012 Social Forking Open Source Empirical Study Proc Int’l Conf Advanced Information Systems Engineering CAiSE Forum Citeseer 50–57 17 Jonas Gamalielsson Björn Lundell 2014 Sustainability Open Source Communities beyond Fork LibreOffice Evolved Journal Systems 89 2014 128–145 18 Georgios Gousios 2013 GHTorrent dataset tool suite Proc Working Conf Mining Repositories MSR IEEE Press 233–236 19 Georgios Gousios Martin Pinger Arie van Deursen 2014 exploratory study pullbased development model Proc Int’l Conf Engineering ICSE ACM 345–355 20 Georgios Gousios Bogdan Vasilescu Alexander Serebrenik Andy Zaidman 2014 Lean GHTorrent GitHub data demand Proc Working Conf Mining Repositories MSR ACM 384–387 21 Georgios Gousios Andy Zaidman MargaretAnne Storey Arie Van Deursen 2015 Work Practices Challenges PullBased Development Integrator’s Perspective Proc Int’l Conf Engineering ICSE Vol 1 358–368 22 Wenbin Ji Thorsten Berger Michal Antkiewicz Krzysztof Czarnecki 2015 Maintaining 
Feature Traceability Embedded Annotations Proc Int’l Product Line Conf SPLC ACM 61–70 23 Jing Jiang David Lo Jiuhuan Xin Xia Paveent Singh Kochhar Li Zhang 2017 developers fork GitHub Empirical Engineering 22 1 2017 547–578 24 Natalia Juristo Omar Gómez 2010 Replication engineering experiments Empirical engineering verification Springer 60–88 25 Eirini Kallianvakou Georgios Gousios Kelly Blinco Leif Singer Daniel German Daniela Damian 2016 indepth study promises perils mining GitHub Empirical Engineering 21 5 2016 2035–2071 26 Andrew St Laurent 2004 Understanding Open Source Free Licensing Guide Navigating Licensing Issues Existing New O’Reilly Media Inc 27 Yi Li Chenguang Zhu Julia Rubin Marsha Chechik 2017 Semantic slicing version histories IEEE Trans Softw Eng TSE 44 2 2017 182–201 28 Max Lillack Ştefan Stănciulescu Wilhelm Hedman Thorsten Berger Andrzej Wąsowski 2019 Intentionbased Integration Variants Proceedings 41st International Conference Engineering ICSE ’19 IEEE Press Piscataway NJ USA 831–842 29 Leticia Montalvillo Oscar Díaz 2015 Tuning GitHub SPL development branching models repository operations product engineers Proceedings 19th International Conference Product Line ACM 111–120 30 Linus Nyman 2014 Hackers forking Proc Int’l Symposium Open Collaboration OpenSym ACM 6 31 Linus Nyman Tommi Mikkonen 2011 Fork Fork Fork Motivations SourceForge Projects Proc IEP’11 Int’l Conf Open Source Systems Springer 259–268 32 Linus Nyman Tommi Mikkonen Juho Lindman Martin Fougère 2012 Perspectives Code Forking Sustainability Open Source Open Source Systems LongTerm Sustainability 2012 274–279 33 Rohan Padhye Senthil Mani Vibha Singhal Sinha 2014 NeedFeed Taming Change Notifications Modeling Code Relevance Proc Int’l Conf Automated Engineering ASE ACM 665–676 34 Ayushi Rastogi Nachiappan Nagappan 2016 Forking Sustainability Developer Community Participation—An Empirical Investigation Outcomes Reasons Proc Int’l Conf Analysis Evolution Reengineering SANER Vol 1 IEEE 
102–111 35 Baishakhi Ray Miryung Kim Suzette Person Neha Rungta 2013 Detecting characterizing semantic inconsistencies ported code Proc Int’l Conf Automated Engineering ASE IEEE 367–377 36 Eric Raymond 2001 Cathedral Bazaar Musings linux open source accidental revolutionary O’Reilly Media Inc 37 Loyao Ren 2019 Automated Patch Forging Across Forked Projects Proc Int’l Symposium Foundations Engineering FSE ACM New York NY USA 1199–1201 38 Loyao Ren Shurui Zhou Christian Kästner Andrzej Wąsowski 2019 Identifying Redundancies Forkbased Development Proc Int’l Conf Analysis Evolution Reengineering SANER IEEE 230–241 39 Gregorio Robles Jesús GonzálezBarahona 2012 Comprehensive Study Forks Dates Reasons Outcomes Proc IEP’12 Int’l Conf Open Source Systems 1–14 40 Julia Rubin Marsha Chechik 2013 framework managing cloned product variants Proceedings 2013 International Conference Engineering IEEE Press 1233–1236 41 Johnny Saldana 2015 coding manual qualitative researchers Sage 42 Anand Mani Sankar 2015 Nodejs vs iojs fork Blog Post httpanandmanisankarcompostsnodejsiojswhythefork 43 Stefan Schmidt 2009 Shall really powerful concept replication neglected social sciences Review General Psychology 13 2 2009 90–100 44 Donna Spencer 2009 Card sorting Designing usable categories Rosenfeld Media 45 Stefan Stănciulescu Thorsten Berger Eric Walkingshaw Andrzej Wąsowski 2016 Concepts operations feasibility projectionbased variation control system Proc Int’l Conf Maintenance Evolution ICSME IEEE 323–333 46 Igor Steinmacher Gustavo Pinto Igor Scaliente Wiese Marco Aurélio Gerosa 2018 Almost study quasicontributors opensource projects Proc Int’l Conf Engineering ICSE IEEE 256–266 47 Robert Viseur 2012 Forks impacts motivations free open source projects International Journal Advanced Computer Science Applications 3 2 2012 117–122 48 Steve Weber 2004 success open source Harvard University Press 49 David Wheeler 2015 Open Source SoftwareFree OSSFS FLOSS FOSS Look Numbers Blog Post 
httpsdwheelercomossfswhyhtml 50 Owen Williams 2015 Nodejs iojs settling differences merging back together Blog Post httpsthenextwebcomdd20150616nodejsandiojsaresettlingtheirdifferencesmergingbacktogether 51 Alex Williams Joab Jackson 2016 Docker Fork Talk Split Table Blog Post httpsthenewstackiodockerforktalksplitnowtable 52 Shurui Zhou Ştefan Stãnciulescu Olaf Leßenich Yingfei Xiong Andrzej Wąsowski Christian Kästner 2018 Identifying Features Forks Proc Int’l Conf Engineering ICSE ACM Press 105–116 53 Shurui Zhou Bogdan Vasilescu Christian Kästner 2019 Fork Study Inefficient Efficient Forking Practices Social Coding Proc Europ Engineering ConfFoundations Engineering ESECFSE ACM Press New York NY 350–361
::::
empirical study integration activities distributions open source Bram Adams · Ryan Kavanagh · Ahmed E Hassan · Daniel German Published online 31 March 2015 © Springer ScienceBusiness Media New York 2015 Abstract Reuse components either closed open source considered one important best practices engineering since reduces development cost improves quality However since reused components definition generic need customized integrated specific system useful Since integration systemspecific integration effort nonnegligible increases maintenance costs especially one component needs integrated paper performs empirical study multicomponent integration context three successful open source distributions Debian Ubuntu FreeBSD distributions integrate thousands open source components operating system kernel deliver coherent product millions users worldwide empirically identified seven major integration activities performed maintainers distributions documented activities performed maintainers evaluated refined identified activities input six maintainers three studied distributions documented activities provide common vocabulary component integration open source distributions outline roadmap future research integration Communicated Filippo Lanubile B Adams ✉ MCIS Polytechnique Montréal Montréal Canada email bramadamspolymtlca R Kavanagh · E Hassan SAIL Queen’s University Kingston Canada R Kavanagh email ryancsqueensuca E Hassan email ahmedcsqueensuca German University Victoria Victoria Canada email dmguvicca Keywords integration · reuse · Open source distributions · Debian · Ubuntu FreeBSD
::::
1 Introduction reuse “the use existing knowledge construct new software” Frakes Kang 2005 Reuse roughly consists two major steps Basili et al 1996 1 identifying suitable component reuse 2 integrating target system example vendors mobile phones typically reuse “upstream” ie externally developed operating system component device customized proprietary device drivers control panels utilities Jaaksi 2007 Reuse commonplace shown studies projects different sizes China Finland Germany Italy Norway Chen et al 2008 Hauge et al 2008 2010 Jaaksi 2007 Li et al 2008 2009 example almost half Norwegian companies reuse “Open Source” OSS products Hauge et al 2008 30 functionality OSS projects general reuse existing components Sojer Henkel 2010 Although reuse speeds development leverages expertise upstream general improves quality cost product Basili et al 1996 Gaffney Durek 1989 Szyperski 1998 entirely risk costfree particular integration step reuse consumes large amount effort resources Boehm Abts 1999 Brownsword et al 2000 Di Cosmo et al 2011 Morisio et al 2002 various reasons “Glue code” Yakimovich et al 1999 needs developed maintained make component fit target system developers need continuously assess impact glue code new versions component new version bring unpredictable set bug fixes features Furthermore component might depend components whose bugs could propagate target system undocumented ways Dogguy et al 2010 McCamant Ernst 2003 Orsila et al 2008 Trezentos et al 2010 ability make local changes source code reused component introduces even challenges since integrator typically familiar reused component’s code base hence easily introduce bugs local changes Hauge et al 2010 Li et al 2005 Merilinna Matinlassi 2006 Stol et al 2011 Tiangco et al 2005 Ven Mannaert 2008 Worse local changes contributed back owner reused component organization made changes need maintain possibly reapply future versions component Spinellis et al 2004 Ven Mannaert 2008 Thus far empirical studies 
integration components Brownsword et al 2000 Hauge et al 2010 Li et al 2005 Merilinna Matinlassi 2006 Morisio et al 2002 Stol et al 2011 Ven Mannaert 2008 concentrated base case integrating one component target system practice however organizations tend integrate one two components brings along set unique challenges Morisio et al 2002 Van Der Linden 2009 Ven Mannaert 2008 especially given popularity open source development timespan one release organization needs coordinate integration updates multiple vendors typically totally independent release dates Boehm Abts 1999 Brownsword et al 2000 example Jaaksi 2007 Nokia’s N800 tablet platform reused 428 OSS components 25 reused eg bzip2 GNU Chess 50 changed locally eg graphics subsystem 25 developed inhouse using open source practices “inner source” ISS unclear organizations like Nokia keep system stable secure amidst integration many different components Hauge et al 2010 Furthermore clear need Boehm Abts 1999 Crnkovic Larssom 2002 Merilinna Matinlassi 2006 dedicated training education developers organizations integration since world open source need collaborate providers 3rd party components external contributors benefit external contributions avoid maintain bug fixes customizations oneself paper aims improve understanding multicomponent integration empirically studying documenting major integration activities performed OSS distributions GonzalezBarahona et al 2009 OSS distribution basically “packaging organization” Ruffin Ebert 2004 Merilinna Matinlassi 2006 ie organization integrates upstream components common platform similar product lines Meyer Lehnerd 1997 Pohl et al 2005 ironing bugs intellectual property issues providing extensive documentation training integrated components Reusing OSS component established distribution provides confidence quality component Tiangco et al 2005 hence many companies use OSS distributions basis products like routers mobile phones storage devices Koshy 2013 Examples established OSS 
distributions Eclipse GNOME operating system distributions like Debian Ubuntu focus operating system distributions henceforth called “OSS distribution” bundle customize OSS operating system kernels eg Linux BSD system utilities eg compilers file management tools enduser eg text processors games browsers dependencyaware package system almost 400 active OSS distributions year 26 new ones born Lundqvist 2013 Given growing competition distributions need release new features versions ever shorter time frame Hertzog 2011 Remnant 2011 Shuttleworth 2008 millions desktop users server installations achieve rely hundreds volunteers integrate latest versions bug fixes tens thousands integrated upstream components empirically studied major integration activities three popular successful OSS distributions ie Debian Ubuntu FreeBSD using qualitative analysis accumulated 29 years historical change bug data document activities steps used perform structured format distilling stateofthepractice tools processes followed actors involved activity providing concrete examples comparing findings prior research integration outside context OSS Six members maintenance community analyzed distributions discussed refined documented activities provided feedback usefulness completeness activities Similar concept design patterns Gamma et al 1995 reference architectures Bowman et al 1999 documented activities used 1 organizations common terminology discussing improving integration activities components 2 researchers set road map research integration since integration remains largely unexplored research area Goode 2005 Hauge et al 2010 Stol et al 2011 main contributions paper – Identification documentation seven major integration activities processes follow three major OSS distributions – Identification major challenges tool support research integration activities – Evaluation feedback identified activities challenges six integration maintainers release managers analyzed distributions paper structured 
follows First Section 2 discusses background related work integration OSS distributions Section 3 presents design qualitative analysis Section 4 documents seven integration activities identified analysis followed discussion open challenges identified Section 5 evaluation findings six practitioners Section 6 conclude threats validity Section 7 conclusion Section 8 study
::::
2 Background Related Work section discusses background related work integration open source distributions Table 1 summarizes key technical terms used throughout paper

Table 1
term | meaning
reuse | identification integration component eg class library system
OSS reuse | reuse Open Source
COTS reuse | black box reuse based Commercial Shelf components
ISS reuse | reuse Inner Source ie OSS developed inhouse
integrator | organization integrates third party component product
maintainer | individual team physical integration behalf integrator
downstream | synonym “integrator”
upstream | organization open source company whose components integrated another
upstream component | component developed upstream reused
multicomponent integration | integration one upstream component
packaging organization | integrator whose business goal package upstream components coherent platform offered sale reuse
package | upstream component integrated OSS distribution using distribution’s packaging format eg “rpm”
binary distribution | distribution providing compiled code packages
sourcebased distribution | distribution providing source code packages compilation enduser’s machine
derived distribution | “child” distribution customizes packages existing “parent” distribution adds additional packages

21 Integration Reuse black box white box Frakes Terry 1996 Black box reuse refers “Commercial Shelf” COTS components Boehm Abts 1999 source code typically available Hence components configured plugged target system White box reuse provides access component’s source code customize needs target system either component OSS Spinellis et al 2004 developed inhouse following open source principles “inner source” ISS practice increasingly common large companies like AlcatelLucent HP Nokia Philips SAP Stol et al 2011 OSS ISS reuse also common base platform product lines van der Linden et al 2007 Pohl et al 2005 Van Der Linden 2009 since 95 platform consists “commoditized” features readily available upstream projects general reuse creates winwin
situation reusing organization upstream whose reused former benefits features provided component terms productivity product quality Frakes Kang 2005 Szyperski 1998 upstream benefits financially licensing andor qualitatively various forms feedback form defect reports code contributions user experiences However despite differences COTS OSSISS forms reuse introduce dependency upstream COTSOSS Di Giacomo 2005 Hauge et al 2010 Lewis et al 2000 Mistrík et al 2010 Morisio et al 2002 another division inside organization ISS Van Der Linden 2009 lead hidden maintenance costs reuse studied extensively perspective make system reusable Coplien et al 1998 DeLine 1999 Frakes Kang 2005 Mattsson et al 1999 Parnas 1976 Pohl et al 2005 select components reuse Bhuta et al 2007 Chen et al 2008 Li et al 2009 resolve legal issues regarding reuse German et al 2010 factors impact collaboration component provider integrators Brooks 1995 Curtis et al 1988 Herbsleb Grinter 1999 Herbsleb et al 2001 Seaman 1996 particular Curtis et al 1988 found based interviews need communicate outside team department even company boundaries opens worms eg fingerpointing silos domain knowledge limited communication channels lack contact persons misunderstanding due different context negatively impact integration process Herbsleb Grinter 1999 Herbsleb et al 2001 empirically proved need involve people indeed relates time necessary resolve bugs integration issues contrast concrete activities involved integration reused components well costs studied substantially less detail Especially multicomponent integration one potentially large number typically open source components reused organization time empirical evidence currently lacking Morisio et al 2002 Van Der Linden 2009 Ven Mannaert 2008 Lewis et al 2000 note “The greater number components greater number version releases potentially coming different times” Hence kind activities integration imply activities relate known activities singlecomponent integration 
explaining study addresses questions first discuss prior work COTS OSS ISS reuse 211 COTS Reuse Integration Brownsword et al 2000 studied 30 mediumtolarge commercial projects analyze hidden integration activities COTS reuse found organization important informed new versions promising COTS components continuously monitor impact components organization’s code base also point maintenance issues glue code configuration COTS component fact projects control upstream However findings rather highlevel explain projects coped multicomponent integration Lewis et al 2000 relate experience COTS reuse 16 government organizations especially stress loss control soon contract COTS reuse signed clause adaptation negotiated result additional costs line Changing one’s system looking another COTS component preferable requesting pay component vendor adapt component main question studied organizations’ mind “How upgrade operational system without great deal disruption” consensus whether one always update latest version reused component wait new major version incorporate pressing changes eg security fixes questions aggravated organizations reusing dozens components causes additional coordination issues similar study performed Morisio et al 2002 NASA integration costly aspect COTS reuse yet integration activities varied widely across projects Glue code main means integration authors note successful projects stay contact COTS component provider throughout lifecycle system avoid surprises next version COTS 212 OSS Reuse Integration Merilinna Matinlassi 2006 performed literature survey structured interviews nine smalltomedium Finnish companies reuse OSS components found integration problems primarily due heterogeneous environments components need support well lack documentation forcing companies rely primarily experience Merilinna et al identified three ways deal integration problems using OSS components COTS component changes code contributing changes back upstream using packaging 
organization like OSS distribution mediator upgrading new version reused component also help case thorough analysis OSS component reused avoid many problems Ven Mannaert 2008 performed interviews members commercial reusing OSS components examined detail tradeoff changing code contributing changes back Even though wants avoid maintaining local changes since costly alternative contributing changes upstream also requires investment time resources example get know contribution procedures keep track future evolution upstream Even patch accepted upstream organization developing patch might still required maintain since insight Ven et al recommend contribute patches local changes sufficiently generic maintain patches oneself specific worst case fork upstream even though fork small chance success Merilinna Matinlassi 2006 Ven Mannaert 2008 identified two integration activities also identified study ie Upstream Sync Local Patch approached activities perspective packaging organization multicomponent integration documented structured way 213 ISS Reuse Integration Stol et al 2011 studied emerging practice developing reusing code inhouse using open source practices ISS ISS popular phenomenon large companies since provides benefits OSS reuse without giving control companies offer employees infrastructure ISS reuse others make part development strategy systematic literature study detailed study ISS inside organization shows costly ISS issues due integration addition integration issues related OSS reuse general challenges like backwards compatibility peculiar interplay ISS team teams company identified example ISS team send “delivery advocate” teams help integrate ISS components However various activities company ISS reusespecific example ISS team receives components initially specific team organization integration becomes responsible starts acting upstream teams organization even though original developers still collaborate development component paper OSS distributions upstream 
projects separate independent entities Finally Van Der Linden 2009 reports adoption OSS ISS reuse product lines Meyer Lehnerd 1997 Pohl et al 2005 platform product lines built largely consists common functionality many components available Reuse OSS ISS components functionality improves quality speed development however also introduces dependency upstream projects platform products based platform addition best practices mentioned close collaboration upstream projects symbiotic fashion key keeping track new features changes established reporting fixing bugs Although OSS distributions seen product line study focuses especially identification structured documentation major integration activities context multicomponent integration 22 Open Source Distributions paper focuses maintenance activities involved integration context OSS distributions since context enables us study integration multicomponent open source setting OSS distributions one wellknown open source packaging organizations GonzalezBarahona et al 2009 Ruffin Ebert 2004 distributions integrate collection upstream components consisting operating system kernel eg Linux BSD core libraries compilation tools users like desktop applications web browsers Thanks inclusion OSS distribution integrated upstream projects reach millions users without market Although distributions especially known Linux BSD world even commercial products like Microsoft Windows Mac OS X considered distributions ship ISS projects OSS hundreds OSS distributions integrate thousands upstream components Figure 1 shows total number currently active Linux distributions grown 380 addition 135 discontinued distributions shown increasing less 26 distributions year Lundqvist 2013 BSD family open source kernels twelve currently active distributions Comparison BSD operating systems 2011 addition 22 distributions either discontinued unclear status popular Linux distributions like Debian Ubuntu integrate 24000 OSS components whereas FreeBSD popular BSD 
distribution integrates almost 23000 components Debian distribution doubles size every 2 years passed mark 300 MLOC 2007 GonzalezBarahona et al 2009 Despite large scale integrating OSS project’s components distribution goes far beyond blackbox reuse First upstream components need turned distributable “package” Distributions Debian Ubuntu Fedora compile components particular architecture split compiled libraries executables across one “binary” packages packages together packages depend automatically installed using distributionspecific package management system “apt” “dpkg” “yum” Sourcebased distributions like FreeBSD distribute possibly customized source code upstream component enduser socalled “source” package FreeBSD uses term “port” compilation user’s machine Unless otherwise specified term “package” paper refer “binary” “source” port packages building packaging upstream component new package needs tested delivered enduser package becomes available endusers including integrators real integration maintenance work starts since packages dependent packages need continuously updated new versions packaged component Similarly bugs package detected fixed promptly appropriate patches sent back upstream developed packaged component Local changes package sent back however need maintained kept uptodate distribution User complaints triaged processed distribution well escalating upstream appropriate Organizations reuse component typically Koshy 2013 Merilinna Matinlassi 2006 appoint person group people ie “maintainers” perform coordinate integration activities organization’s behalf Organizations like OSS distributions dealing multiple upstream projects components typically multiple maintainers one responsible group related upstream components Figure 2 shows interactions distribution’s maintainer bold major actors distribution maintainer packages customizes upstream component interacting upstream whenever necessary example understand changes new release communicate reported 
bugs Customizations result local patches applied vanilla upstream component patched component packaged using distribution’s package management tool package tested project’s package community consists volunteering contributors testers stabilized packages also used endusers contribute bug reports suggestions contacting maintainer maintainer’s work ultimately ends official release distribution hence maintainers coordinated release manager charge common activities release manager discussing releasecritical bugs projectwide packaging policies maintainer enforcing deadlines Given size distribution maintainers responsible multiple components packaged one packages Debian around 2400 participants 2013 maintainers 24000 integrated components ratio 10 components per maintainer FreeBSD around 400 freeBSD developers 2013 maintainers 23000 components ratio 575 Ubuntu around 150 Ubuntu universe contributors team 2013 MOTU team 2013 Ubuntu core development team 2013 maintainers 24000 components ratio 160 since packages inherited asis Debian thus requiring less work Given high maintainertocomponent ratios maintainers often team share package responsibilities even still need divide attention limited time across many components addition maintainers developers packages maintaining means even time spent fully understand changes contact upstream developers change Brownsword et al 2000 Stol et al 2011 Finally various proposals launched shorten time frame releases distributions Hertzog 2011 Remnant 2011 even synchronize releases distributions Shuttleworth 2008 complicates task package maintainers paper identifies documents integration activities must done daily basis maintainers three successful OSS distributions Previous research focused exclusively stakeholders Fig 2 governance processes distributions Sadowski et al 2008 release management Michlmayr et al 2007 van der Hoek Wolf 2003 packagedeveloper community Scacchi et al 2006 evolution size complexity packages GonzalezBarahona et al 
2009 dependencies packages German et al 2007 Given central role package maintainers success distribution responsibilities challenges need understood order streamline interaction OSS distribution upstream bring new maintainers quickly uptospeed Furthermore previous work focused especially integration individual components packaging organizations like OSS distributions need deal integration thousands components time users expecting latest versions component integrated Finally open source development forces organizations collaborate external parties reap full benefits quality innovation achieved open source components organizations waste substantial effort example maintain local patches Hence studying integration activities distributions help us understand integration multicomponent open source context following section presents approach followed identify analyze major integration activities three large OSS distributions
::::
3 Case Study Setup goal paper empirically identify document major integration activities use packaging organizations multicomponent OSS integration existing empirical work focused exclusively singlecomponent integration Since wide range packaging organizations exists first step focus experienced integration experts area OSS reuse ie OSS distributions particular perform qualitative analysis three largest successful OSS operating system distributions ie Debian Ubuntu FreeBSD Although results consist integration activities performed OSS distributions activities unique OSS integration subset integration activities performed commercial organizations Whereas commercial setting organizations used buy develop dependencies OSS setting requires one collaborate variety external stakeholders avoid stuck one’s patches customizations Avoiding requires different set integration activities fact activities need trickle back commercial organizations started adopt OSS practices internally ISS reuse help organizations well open source projects paper addresses following question core set activities OSS dealing integration multiple 3rd party components question allows us empirically study done OSS integration done challenges expert integrators still face particular also helps us understand stateoftheart techniques use OSS projects facilitate integration activities section discusses methodology study also illustrated Fig 3 first performed qualitative analysis identify document major integration activities evaluated findings stakeholders three distributions Fig 3 Overview case study methodology 31 Subject Selection obtain representative sample selected mixture binary sourcebased derived independent OSS distributions derived “child” distribution automatically inherits packages “parent” distribution customizes packages also adds packages order enforce uniform lookandfeel focus specific types packages specialize certain set users eg office workers vs music producers Although derived 
distribution saves substantial integration time also leads unique set integration activities since level derivation adds additional layer integration process looking history open source distributions Lundqvist 2013 Debian Ubuntu clearly stand two influential distributions 410 distributions deriving Debian 211 380 active 135 discontinued distributions 90 Ubuntu 17 FreeBSD particular Debian distribution 81 child distributions 105 distributions deriving child distributions “grandchildren” 24 greatgrandchildren 1 greatgreatgrandchild Lundqvist 2013 latter potentially needs integrate packages four ancestors well upstream OSS projects directly Ubuntu 79 children 11 grandchildren Lundqvist 2013 FreeBSD 15 children 1 grandchild 1 greatgrandchild Comparison BSD operating systems 2011 found impact distributions distributions also translated well popularity terms number users contrast mobile app stores official popularity poll ranking OSS distributions However since May 2001 one leading sources OSS distributions distrowatchcom web site contains announcements new versions distributions well detailed historical overviews distribution either Linux BSDbased One major features weekly basis site keeps track many people search click distribution Although ranking map 1to1 number downloads give important indication popularity OSS distributions Despite age first Debian release made 16th August 1993 Debian still fourth popular binary distribution time case study Ubuntu second popular binary derived distribution decided study top binary distribution time case study ie Linux Mint since rather recent distribution derived Ubuntu without sufficient historical data available third popular distribution Fedora since independent DebianUbuntu ecosystem also study distribution source codebased distribution picked popular source codebased BSD distribution ie FreeBSD distribution Note FreeBSD also popular BSD distribution general according 2005 BSD Usage Survey BSD Certification Group 2005 32 Data 
Sampling study integration activities systematically analyzing categorizing revising historical package data Debian Ubuntu FreeBSD create classification integration activities Given large number packages packageversions three distributions Table 2 could examine manually Instead distribution sampled enough packageversions obtain confidence interval length 5 within 95 confidence level taking account large population size Cochran 1963

$$\text{sample size} = \frac{ss}{1 + \frac{ss}{\text{pkg versions}}}, \qquad ss = \frac{Z^2 \cdot p \cdot (1 - p)}{0.05^2}$$

$Z = 1.96$ (95 conf level), $p = 0.5$ (pop unknown variability)

means find integration activity hold $n$ sampled packageversions say 95 certainty $n \pm 5$ packageversions exhibit activity example $7 \pm 5$ would mean activity would hold 95 certainty 2 12 packageversions Although three distributions different number packageversions asymptotic nature sample size formula obtained number packageversions 384 distribution

Table 2 Characteristics data three subject distributions
             | Debian     | Ubuntu     | FreeBSD
start        | 16/08/1993 | 20/10/2004 | 11/1993
start data   | 12/03/2005 | 20/12/2005 | 21/08/1994
end data     | 16/08/2011 | 14/09/2011 | 01/09/2011
components   | 24263      | 25345      | 22733
packages     | 92277      | 66595      | 22733
pkg versions | 896757     | 446324     | 162135
releases     | 4          | 14         | 8 major, 55 minor
maintainers  | 2400       | 150        | 400

33 Data Extraction randomly sampled 384 packageversions distribution automatically extracted selected packageversion corresponding change log message change log basically consists detailed Koshy 2013 bullet list containing highlevel textual summary major changes particular packageversion well explicit IDs fixed bugs Figure 4 shows example change log message Debian packageversion Ubuntu FreeBSD use similar format Except two changes changes Fig 4 fix open bug reports reports’ identifier pasted inside change log distributions stipulate new packageversion documented change log Debian 2011 used change log data starting point analysis packageversion interpret change log’s reported changes manually analyzed referenced bug reports via
distributions’ bug repository explained distribution uses different technology change logs bug repository able write scripts automate fetching logs reports bug reports often contained references emails distribution’s mailing lists sometimes contained patches proposed possible bug fix present also studied messages patches Finally clarify technical terms understand particularly unclear bugs changes used distribution’s developer documentation accessible distribution’s web site worst case relevant web search especially finding relevant communication online fora necessary small number cases discuss obtained data three distributions data found online paper’s replication package Adams et al 2015

Debian obtained names integrated components across Debian’s entire history socalled snapshot archive server containing versions packages time allowing scriptable access via public JSONbased API every integrated component retrieved version numbers timestamps list binary package names associated component since component split across multiple packages sampling 384 packageversions downloaded corresponding change log using simple script Debian’s change log repository Bug reports mentioned change logs found bug repository using bug identifier Related email messages data mentioned bug reports found using web search

Ubuntu used Python API Launchpad collaboration platform retrieve names version numbers Ubuntu packages ever existed Ubuntu derived Debian filtered Ubuntu packages include ones customized Ubuntu since packages identical Debian packages Ubuntucustomized packages version number ending “MubuntuN” “M” “N” numbers following special convention found 133311 package versions belonging 26858 packages Except different location change logs bug reports used approach data extraction Debian

1 http://snapshot.debian.org
2 http://packages.debian.org/changelogs/pool/main
3 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=XYZ (XYZ bug identifier)
4 http://api.launchpad.net/1.0
5 http://changelogs.ubuntu.com/changelogs/pool/main
6 Manual search using bug identifier https://bugs.launchpad.net/ubuntu

FreeBSD data extraction bit involved since sourcebased repository reason retrieved copy FreeBSD version control system CVS contains local file changes ever made reused components Since CVS changes finegrained considered “version” releases coarsegrained multiple port versions exist two official releases reconstruct port versions grouping related CVS changes together used FreeBSD convention port’s Makefile expected PORTREVISION variable changed “each time change made port significantly affects content structure derived package” FreeBSD porter’s handbook 2011 maintainer change PORTREVISION related PORTVERSION variable corresponding changes deemed important enough automatically picked users update installation interpret “changes change PORTREVISION variable define new port version” similar definition “version” binary packages practice determined port timestamps changes change PORTREVISION andor PORTVERSION grouped changes port’s files two consecutive PORTREVISION changes excluding first PORTREVISION change one port version treated changes including first Makefile revision first PORTREVISION account initial import port wrote scripts queried CVS repository commit log messages start end date port version change logs resulting port versions correspond concatenation commit log messages Finally bug reports obtained FreeBSD’s bug repository based bug identifiers mentioned change logs

34 Data Analysis Since classification integration activities start initially first author studied Debian distribution pilot manually interpreted changes documented change log sampled packageversion looked bug reports referenced change log order understand bugs resolved features added done latter bug reports’ comments important source information fully understand scope context complex changes sometimes consult email messages referenced bug reports patches attached case doubt usage unfamiliar technical terms inside stories distribution’s
developer documentation considered worst case web search performed clear exactly integrators done produce analyzed packageversion packageversion tagged observed activity summarize rationale behind version Two examples activities could “new release” “package dependency change” one tag could assigned version since new version package typically consists multiple changes seen earlier Fig 4 repeating procedure sampled Debian versions constantly revising already analyzed versions new tags found initial tagging schema built representing different activities go packageversion

7 ftp://ftp3.ie.FreeBSD.org/FreeBSD/development/FreeBSD-CVS/ports
8 :pserver:anoncvs@anoncvs.tw.FreeBSD.org:/home/ncvs
9 http://www.FreeBSD.org/cgi/query-pr.cgi?pr=XYZ (XYZ bug identifier)

finishing pilot Debian first two authors revised obtained tagging schema leveraging second author’s experience DebianKubuntu maintainer developer tags merged others renamed resulting tagging schema hand revised Debian analysis standardize tags used Afterwards authors analyzed Ubuntu FreeBSD data using tagging schema starting point using approach Debian Conflicts tagging authors manually resolved discussion find additional tags Ubuntu FreeBSD giving us confidence completeness initial tagging schema Eventually obtained seven popular tags two less popular ones catchall tag multiple unique less frequent activities unrelated tags excluded latter three tags analysis come back Section 6 replication package Adams et al 2015 contains tags noteworthy observations sampled package versions 35 Identification Documentation Activities seven popular tags obtained manual analysis correspond unique integration activities however distribution could terminology workflow activity Hence order abstract commonalities variabilities across distributions particular activity tag authors together distilled intent motivation common tasks current practices across distributions based 1 information encountered change logs bug reports mailing lists sampled packageversions well 2
second author’s experience DebianKubuntu developer iterative process trying separate essential steps used integration activity implementation details exceptions particular distribution Typically author would refine one two patterns send next author refinement changes made activity Similar design patterns Gamma et al 1995 “captured activities form people use effectively” integration activity documented rigid format intent motivation major tasks involved activity participants possible interactions activities notable instances activity three studied distributions Debian Ubuntu FreeBSD Interactions based cooccurrence activities data also tried compare activity prior work integration literature put activity context tagging integration activities abstraction pattern form authors encountered recurring issues problems package maintainers issues problems noted author individually compared clustered obtain set challenges across 4 research areas filtering challenges already addressed related work obtained 13 concrete challenges limitations based data seemed hold back maintainers activities crosscheck challenges together activities documented performed validation practitioners next step 36 Validation Activities Practitioners order get feedback correctness usefulness documented integration activities challenges contacted members package maintenance community Debian Ubuntu FreeBSD asked 1 verify correctness activities derived abstracted change log bug report historical data well challenges uncovered 2 provide feedback usefulness activities well activities challenges might missed analyzing sampled packageversions Based extensive experience 3 distribution communities second fourth author first compiled shortlist package maintainers release engineers experienced maintaining large packages contacted people shortlist email since email preferred channel communication maintainers maintainers volunteers spread across world without fixed office played idea creating bug report study since 
maintainers track bug repository package closeby however since bug reports public broadcast medium people would able chime perhaps influence maintainer discarded bug repository purposes eventually received feedback 3 maintainers M1 M2 M3 active Debian Ubuntu one M6 Debian 1 M5 Ubuntu 1 M4 FreeBSD least five ten years experience since role package maintainer release engineer deserved years active involvement distribution Note respect anonymity refer “maintainers” use symbolic name contacting maintainers provided draft paper asked feedback documented activities challenges particular asked following questions evaluate usefulness completeness activities challenges Q1 activities miss Q2 documented activities used Q3 existing tools techniques activities miss Q4 challenges miss Q5 promising toolstechniques see coming address challenges maintainers replied five questions email six also provided higherlevel comments paper one maintainer providing annotated pdf detailed comments Despite busy schedule asynchronous nature email communication one cannot force someone reply two maintainers left two questions blank come back Section 6 email replies analyzed two authors summarized table Table 5 order compare findings across 6 maintainers high level obtained feedback showed us whether activities whole made sense whereas lower level exposed inaccuracies missed workarounds factual errors used feedback flesh description seven documented activities 13 challenges obtain final version activities documented present paper contacted members suggested five additional activities however since sufficient empirical support activities data sample add documented activities Instead discuss additional activities Section 6
Table 3 Overview integration activities prevalence three distributions Activities horizontal line common enough documented
Activity | Explanation | Deb (%) | Ub (%) | Fre (%)
A New Package | Integrating new | 1.04 | 0.78 | 13.54
B Upstream Sync | Updating new upstream version | 40.89 | 43.75 | 57.81
C Dependency Management | Managing changes dependencies | 38.80 | 30.73 | 28.39
D Packaging Change | Changing package’s packaging logic | 43.49 | 44.01 | 38.80
E Productwide Concern | Enforcing policies across packages | 4.95 | 3.13 | 25.00
F Local Patch | Patching upstream source code locally | 22.40 | 28.39 | 12.24
G Maintainer Transfer | Managing unresponsive maintainers | 5.73 | 0.00 | 2.86
----
H Security | Patching security vulnerability | 4.43 | 1.30 | 0.78
I Internationalization | Internationalization packages | 4.17 | 1.56 | 0.26
J Catchall | rare activities | 2.34 | 4.95 | 1.04
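The Table 3 percentages are estimates from samples of 384 package versions per distribution, reported with a ±5 confidence interval at a 95% confidence level. The sketch below shows where a sample of about that size comes from and how wide the interval is for one observed percentage. It assumes the usual normal-approximation formulas for a proportion; the text does not say which formula the authors used. The function names are illustrative, and the 22.40% figure reads the Debian Local Patch entry of Table 3 with its decimal point restored.

```python
import math

Z_95 = 1.96  # standard normal quantile for a two-sided 95% confidence level


def required_sample_size(margin, p=0.5, z=Z_95):
    """Sample size needed to estimate a proportion within +/- margin.

    Large-population formula; p = 0.5 is the worst case (widest interval).
    """
    return (z ** 2) * p * (1.0 - p) / (margin ** 2)


def margin_of_error(p, n, z=Z_95):
    """Half-width of the normal-approximation interval for proportion p over n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Worst-case sample size for a +/- 5 percentage point interval.
    print(round(required_sample_size(0.05), 2))
    # Interval half-width, in percentage points, for an observed 22.40% over 384 samples.
    print(round(margin_of_error(0.2240, 384) * 100, 2))
```

With a ±5 point margin the worst-case size works out to roughly 384, which matches the sample size quoted for each distribution. Percentages far from 50% have a somewhat narrower interval than the quoted ±5.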
::::
4 Integration Activities Distributions Table 3 gives overview short explanation seven major integration activities documented well three less common ones table also provides percentage sampled Debian Ubuntu FreeBSD packageversions involve activities within confidence interval 5 numbers also plotted Fig 5 Since new version component involve multiple integration activities percentages plots add 100 Upstream Sync Dependency Management Packaging Change frequently occurring activities Debian FreeBSD Local Patch also common three projects whereas New Package Productwide Concern common FreeBSD next subsections discuss seven major integration activities detail activity provide Intent Short outline goal activity Motivation Short description role rationale activity Major tasks major steps involved activity Participants list stakeholders Fig 2 involved major tasks activity Popularity integration activities Table 3 384 sampled Debian b Ubuntu c FreeBSD packageversions confidence interval length 5 95 confidence level Interactions Activities cooccurred substantially given activity packageversions hence related Literature Discussion prior work approaches activity well prevalence activity outside context OSS distributions Notable instances Concrete examples activity sampled Debian Ubuntu FreeBSD packageversions New Package Intent Integrating previously unpackaged upstream component distribution Motivation users distribution maintainer package require new functionality provided component identified yet part distribution Major Tasks Recruiting Maintainer responsible integrating new component liaising upstream one important decisions take Koshy 2013 Merilinna Matinlassi 2006 commonly upstream developer motivated enduser requests upstream component integrated distribution One distribution’s maintainers might pick request become maintainer Alternatively upstream developer package component ask distribution maintainer “sponsor” package ie review upload distribution’s package repository 
case although majority integration done upstream maintainer still end responsibility Another possibility distribution appoints maintainer integration new component clear need distribution Packaging Upstream requires access project’s source code except binaryonly packages like Adobe Flash verification license maintainer proceeds determine buildtime runtime dependencies package dependent component yet distribution packaged first process trialanderror trying build package fixing dependency problems maintainer might customize makefiles would build correctly environment distribution porting package platforms Linux GNUbased ones often needed remove dependencies Linux GNUspecific libraries functionality take significant effort Finally maintainer needs make sure package follows distribution’s policies specific locations configuration files manual pages Creating Package’s Metadata maintainer responsible creating package metadata like package name version number list dependent packages metadata necessary add package distribution’s package management system “apt” DebianUbuntu port system FreeBSD enable automatic systematic building packaging deployment Integration Testing package must build run consistently supported architectures Typically two rounds tests used verify package first round involves maintainers ironing obvious functionality platform issues second round involves uploading package staging area eg “unstable” Debian expert endusers install use daily work Bugs identified users reported together possible patches maintainer incorporates feedback new version package reuploaded distributions like Ubuntu tools automatically run integration testing identify integration issues Publishing Package staged package contains severe bugs might temporarily removed staging archive bugs resolved package stable certain period time becomes eligible inclusion upcoming release package either moved release’s archive DebianUbuntu source code repository FreeBSD Participants maintainer 
upstream developer package community expert enduser Interactions New Package prerequisite six activities usually occurs ie packageversion involves New Package activity 23 ± 5 FreeBSD packageversions also involves Local Patch fix bug make package compile Literature context COTS reuse additional tasks involved especially contract negotiations Information Technology Resources Board 1999 Navarrete et al 2005 Lewis et al 2000 note “Vendors driven profits cooperative responsive perceived interest so” Various guidelines risk assessment tools exist help companies federal departments select right COTS components Information Technology Resources Board 1999 Lewis et al 2000 example recommend find COTS components fit existing architecture possibly adjust architecture first rather requiring COTS vendor customize component system hand since could costly different OSS distributions monetary incentives typically exist OSS distributions sometimes carry enough weight convince upstream components adapt rather way around Although applicable case packaging organizations like OSS distributions identification COTSOSS components reuse known challenge well Morisio et al 2002 Stol et al 2011 typically requiring extensive web literature research insightful recommendations experts maintainer recruitment integration testing known research problems tasks less known research Notable Instance New Package customization irssipluginotr Ubuntu IRC client plugin integrated July 2008 first customization changed location documentation Ubuntu default location second customization fixed package’s build process download required header files build since Ubuntu build servers network access B Upstream Sync Intent Bringing package uptodate newer version upstream component Motivation shown Fig 5 synchronizing existing packages distribution newer upstream version forms core activity integration Endusers expect package maintainers update packages latest features bug fixes soon possible maintainers concerned 
longterm stability package Major Tasks 1 Becoming Aware New Upstream Release largely depends distributionspecific dashboards automatically track development progress upstream projects example Debian’s watch file mechanism specifies 1 URL upstream project’s download page releases component well 2 regular expression identify source code version number release highest version number surpasses current version means new release available Derived distributions eg Ubuntu need synchronize upstream projects also parent distribution typically start new release cycle example 167 analyzed Ubuntu packageversions involving Upstream Sync 99 versions synchronized upstream 65 synchronized parent distribution Debian 3 synchronized Since derived distribution leverage Upstream Sync activities performed maintainers parent distribution risk assessment task 2 becomes slightly easier However keeping track patch synchronized upstream requires rigorous bookkeeping Projects use custom dashboards sometimes interfacing bug reporting infrastructure Assessing Risk Upstream Release requires maintainer review changes previous upstream version Rodin Aoki 2011 order estimate whether new version productionready changes run risk breaking important functionality endusers always need new features bug fixes Despite importance analysis practice currently largely manual task supported basic tools like “diff” Rodin Aoki 2011 change commit log messages email communication upstream developers experience outcome risk assessment often update full new release “cherrypick” select number acceptable changes changes made upstream another distribution merge changes current packageversion discarding changes example upcoming release distribution might nearby making full import new version component risky Instead maintainers would cherrypick showstopper bug fixes interested distributions like FreeBSD prefer cherrypick ie either take new version component whole update Updating Customization involves revisiting 
customizations patches performed earlier versions packaged component eg initial New Package later Local Patch activities Maintainers typically submit patches upstream merged consequence patches longer need maintained locally discarded maintainer patches however need updated maintainer cleanly applied new version upstream package like task 2 requires manual analysis patch new packageversion Updating Package’s Metadata cf task 3 New Package Integration Testing cf task 4 New Package Publishing Package cf task 5 New Package Participants maintainer upstream developer Interactions Upstream Sync pivotal activity accompanied activity except New Package definition Upstream Sync occurs mostly together Packaging Change Dependency Management Local Patch sourcebased distributions Productwide Concern Literature Together Local Patch Upstream Sync discussed integration activity literature independent type reuse COTSOSSISS organization OSScommercial Lewis et al 2000 Navarrete et al 2005 source issues related Dependency Management sometimes even preventing Upstream Sync packages example Begel et al Begel et al 2009 report Microsoft 9 775 surveyed engineers rely teams inform changes component rely Researchers Merilinna Matinlassi 2006 de Souza Redmiles 2008 practitioners Koshy 2013 recommend continuously monitor inquire new versions impact system even appointing specific gatekeeper responsible also helps mitigate one largest risks reuse component vendor going business Lewis et al 2000 Since reuse induces dependency provider COTSOSSISS component fully controls component’s evolution Lewis et al 2000 researchers reported two extreme approaches deal dependency swiftly updating new component version Brownsword et al 2000 Stol et al 2011 Van Der Linden 2009 versus sticking particular version patching organization’s particular needs Merilinna Matinlassi 2006 Ruffin Ebert 2004 Van Der Linden 2009 systematic methodology decide two approaches hybrid approaches like cherrypicking Lewis et al 
2000 typically personal experience deciding factor Merilinna Matinlassi 2006 factors like safetycritical nature system play role well Lewis et al 2000 Interestingly many integration issues could fact avoided new component version would backwards compatible previous version Crnkovic Larssom 2002 Stol et al 2011 outside control organization reuses component Notable Instances lowrisk Upstream Sync Gnash Ubuntu Flash player updated upstream version 087 March 2010 522254 [10] right start Ubuntu feature freeze window ie close next release Since new features technically allowed freeze window member Ubuntu release team needed explicitly approve Upstream Sync Gnash package inherited Debian update mostly contained bug fixes version 087 quickly got synced Upstream Sync taking long time Krita 2111 Debian painting program KOffice suite broken early May 2010 one libraries depends libkdcraw7 replaced newer version libkdcraw8 Upstream Sync KDE 443 580782 Unfortunately solution Upstream Sync KOffice 220 took 2 months new version KOffice introduced many new functionalities requiring package tested thoroughly patch cherrypicked another distribution libpt 11010 Ubuntu crossplatform library relied new gspca webcam driver provided 2627 Linux kernel driver work programs libraries consuming webcam stream load libv4l wrapper libraries runtime forcing 62 Ubuntu packages modified Since three weeks earlier patch uploaded Fedora another distribution make changes libpt patch cherrypicked Debian Ubuntu
[10] This notation refers bug report distribution’s bug repository
C Dependency Management Intent Keeping track dependencies package make sure properly built run Motivation Packages depend packages built eg compilers static libraries run eg dynamic libraries services example data set Debian packages containing dynamic libraries average 64 packages depending directly median 20 476 transitively median 30 package many packages “reversedependencies” depend changes example Upstream Sync change might break 
reversedependencies special case change “library transitions” ie changes public interface shared library might force dozens packages rebuilt worst case adapted new interface via source code changes example C runtime library would change packages using C might need changed andor rebuilt Major Tasks Becoming Aware Dependency Changes either happens automatically see Upstream Sync based announcement maintainer dependent package change significantly latter announcement typically sent release manager affected maintainers leaving time discuss repercussions update case announcement done minimum maintainer notice change API updated interface version “SONAME” dynamic library example dynamic library “libfoo” interface version 1 would SONAME “libfooso1” SONAME suddenly changed “libfooso2” upstream maintainers would know API component changed substantially Assessing Risk Dependency Change similar task 2 Upstream Sync Determining whose packages broke change largely manual task requiring insight API used packages whose implementation algorithms typically unknown maintainer Unfortunately tool support available practice assist task Typically build logs checked errors package driven small smoke test scenario Fixing Damage either happens atomically ie changed package reversedependencies updated FreeBSD interleaved ie packages updated independently DebianUbuntu Atomic updates delay new packageversion long broken packages updated successfully least end user impacted inconsistent packages Distributions like Fedora Ubuntu use sandbox build environments atomically update transitioning library reversedependencies isolation without affecting packages hence users Fedora 2011 Whether update model atomic maintainer library causing changes responsible performing rebuilds maintainer analyses build test logs determine packages failed build attempts write patches using knowledge API changes fails needs assist failing packages’ maintainers resolve transition issues similar delivery advocates ISS 
reuse Stol et al 2011 keep track packages already rebuilt release manager maintainers use tracking system Ubuntu Debian use custom library transition tracker Ubuntu sometimes uses bug tracker Updating Packages’ Metadata cf task 3 New Package Integration Testing cf task 4 New Package whole transition complete atomic model updated package separately interleaved model Publishing Package cf task 5 New Package 11If maintainer finds interface change without SONAME update would contact upstream ask update SONAME perform Upstream Sync updated library resuming Dependency Management library’s reversedependencies Participants maintainers changed package reversedependencies release manager Interactions Dependency Management accompanied activity except New Package occurs mostly together Upstream Sync Packaging Change Local Patch sourcebased systems Productwide Concern Literature Similar Upstream Sync Dependency Management independent kind reuse organization Begel et al 2009 observed wide range mitigation techniques dependency problems Microsoft ranging minimizing number dependencies explicitly planning backup strategies deal dependency issues companies one studied de Souza et al 2004 de Souza Redmiles 2008 stressed importance vendorintegrator communication reduce effort required “impact management” reused APIs Managers first build impact network consisting people affecting affected component use frequent email communication people assigned explicitly particular API ISS component Stol et al 2011 manage forward ie teams backward ie team dependency impact Similar major companies like Google Whittaker et al 2012 well studied OSS distributions team required inform clients major API breakage de Souza Redmiles 2008 note however one forget ripple effect “indirect” ie transitive dependencies Similar Upstream Sync backwards compatibility dependent packages avoid many integration issues Crnkovic Larssom 2002 Stol et al 2011 Furthermore many Dependency Management issues due unnecessarily 
high coupling components relying implementation details Spinellis et al 2004 private APIs Stol et al 2011 Hence using components via explicit Stol et al 2011 stable Merilinna Matinlassi 2006 interfaces avoid many problems Finally packaging organizations like distributions eliminate many dependency issues users providing assemblies sets integrated components instead individual components many distributions offer socalled “virtual” packages example integrate core packages Perl KDE GNOME Notable Instances surprise library transition library interface change libfm 01141 Debian file manager library announced upstream developer consequence applications built old version libfm “libfmso0” pcmanfm file manager broke 600387 dynamic linker way knowing “libfmso0” longer original library version packages built rather new version different interface named “libfmso1” Problems nonatomic fixes dependency changes transition Perl 510 Debian Perl programming language ecosystem Perl 512 end April 2011 619117 took slightly two weeks 400 packages directly indirectly depending Perl including highprofile ones vim subversion rxvtunicode GNOME installable staging area dependencies rebuilt consistently Perl 512 dependency change requiring rebuild chances acceptance Boost 1341 Ubuntu generalpurpose C library Ubuntu 710 looked slim since Ubuntu entered “Feature Freeze” bug fixes still accepted upcoming release Boost’s reversedependencies updated However contributor championing new Boost release able convey urgency release fixes showstopper bugs package maintainer verified reversedependencies could rebuilt without source code changes Packaging Change Intent Changing packaging logic metadata fix packaging bugs follow new packaging guidelines change default configuration either binary source packages Motivation packaging process combines build process McIntosh et al 2011 upstream component dependency management packaging machinery distribution Hence understanding packaging process trivial process 
bugs slip frequently Furthermore packaged component evolves packaging requirements evolve well example new features might added need configured package Packaging Change activity covers changes packaging building installation logic metadata package Major Tasks Replicating Reported Problems prerequisite order fix packaging problem Ideally maintainer would like clone packaging environment bug reporter least complete description build platform installed libraries versions Tools exist generate description submitting bug reports yet inexperienced bug reporters often know forget use Understanding Build Packaging Process necessity order able fix packaging bugs enhance packaging logic understanding currently based interpreting build execution logs packages Furthermore trialanderror commonly used changing packaging logic Since dedicated way test build packaging changes maintainer verifies correctness changes manually installing package running unit user tests package Integration Testing cf task 4 New Package Publishing Package cf task 5 New Package Participants maintainer package community testing expert enduser Interactions activity performed activities New Package Upstream Sync Frequently activity requires Local Patch Literature Packaging Change activity discussed thoroughly prior research except wellknown difficulty configuring COTSOSSISS components Stol et al 2011 configuration issues due fact default components need generic contain many features whereas specific integrator needs need adapt packaging logic specific domain packaging organizations OSS distributions subset since mediator upstream components final users hence require upstream components fit package management system Notable Instances package missing files librt shared library implementing POSIX Advanced Realtime specification dropped without warning GNU standard C library Debian libc6 23618 breaking XFS file system package 381881 resolve case Dependency Management XFS Packaging Change made libc6’s package 
metadata indicate librt longer provided Broken packaging changed guidelines Versions 26 32 Python Ubuntu Python programming language ecosystem suddenly failed build Ubuntu 738213 essential libraries like libdb zlib python depended could found anymore build platform change directory layout result work enabling 32 64 bit versions libraries installed single machine Broken packaging upstream changes GNU Octave FreeBSD developers changed layout web site well build logic projects 144512 maintainer fix code fetching script refactor existing build script shared GNU Octave ports separate scripts individual ports E Productwide Concern Intent Applying productwide policies strategic decisions integrated packages Motivation Since distribution integrates thousands packages important rules strategic decisions followed order make distribution coherent consistent example new standard package help files adopted packages either pace Similarly strategic decisions transition new version core library move new default window manager followed uniformly possible involved packages Major Tasks 1 Determining Ownership Timing Changes happens discussions coordinator release manager volunteer productwide concern affected maintainers coordinator notifies affected package maintainers decision explaining motivation Productwide Concern end goal different steps involved getting steps depend enforcement strategy use Enforcing Concern happens either centralized distributed enforcement centralized enforcement Productwide Concern coordinator applies concern’s changes affected packages Maintainers need test package still works report bug distributed enforcement package maintainers briefed coordinator charge change package gives freedom implement Productwide Concern see fit might delay updates packages’ reversedependencies concern enforced coordinator continuously monitors status concern via dashboards mailing lists andor bug reporting systems Debian uses distributed enforcement FreeBSD uses centralized 
enforcement Ubuntu uses Derived distributions like Ubuntu automatically leverage Productwide Concern changes performed contributors parent distribution FreeBSD coordinators use regular expressions change packaging logic hundreds ports thanks strict naming conventions packaging logic Given high risk productwide changes FreeBSD coordinator needs approval release manager whole distribution rebuilt distribution’s build cluster check effects productwide change Integration Testing cf task 4 New Package Publishing Package cf task 5 New Package Participants maintainer coordinator release manager Interactions Productwide Concern typically accompanied Dependency Management Upstream Sync Packaging Change Literature Similar Packaging Change Productwide Concern relatively unknown activity example Curtis et al 1988 identify issue “Projects must aligned company goals affected corporate politics culture procedures” stress “interteam group dynamics” integrator upstream significantly complicates already complex “intrateam group dynamics” However concrete advice discussion tasks involved provided especially context multicomponent integration scale OSS distributions thousands integrated components Notable Instances massive migration GCC 4 Debian July 2005 example Productwide Concern distributed enforcement Since compiler suite broke C programs compiled earlier GCC versions C packages using GCC rebuilt approach typically followed cases like this [12] [13] permanently rename packages rebuilding attaching suffix like “b2” ensures visibility rebuilt packages enabling packages explicitly depend rebuilt versions migration Dash default command shell Ubuntu 610 October 2006 Debian Lenny February 2009 illustrates differences centralized distributed enforcement Ubuntu coordinator instantaneously made Dash default shell breaking many packages’ scripts build files centralized Although several users enraged coordinator consistently referred maintainers upstream developers failing packages 
fix incompatible Bashspecific code “bashisms” web site official migration strategies workarounds provided Debian discussed move Dash independently Ubuntu move [14] Ubuntu coordinator convinced importance clear release goals communication stakeholders Debian developers built tools screen packages known bashisms Maintainers packages containing bashisms notified email requested fix bashisms certain date distributed
[12] http://bit.ly/FOCJHf
[13] http://lwn.net/Articles/160330
[14] http://bit.ly/z3ORxT
F Local Patch Intent Maintaining local fixes andor customizations package Motivation Integrators users find bugs packages bugs packagespecific others due integration package distribution Typically maintainers encouraged send fixes kinds bugs upstream upstream take ownership code maintenance include default practice however many integration bug fixes accepted upstream take time adopted tend end local patches need maintained integrator reapplied integrator upon Upstream Sync holds customization changes specific distribution example Productwide Concern Major Tasks 1 Getting Local Patch Accepted Upstream requires patch fixes bug clean way follows programming guidelines upstream developers thorough testing maintainer submits patch preferred bug reporting system upstream report detailed possible making clear bug fixed version impact users distribution Either patch accepted reasonable period time accepted maintainer discard Local Patch Otherwise maintainer responsible maintaining reapplying Local Patch across future versions package Maintaining Patch upon Upstream Sync maintainer’s responsibility Local Patch accepted upstream ever cf task 3 Upstream Sync Local Patch common activity involving 221 ± 5 Debian 284 ± 5 Ubuntu 122 ± 5 FreeBSD packageversions versions 7 ± 5 Debian 03 ± 5 Ubuntu 0 ± 5 FreeBSD update existing Local Patch whereas 247 ± 5 Debian 119 ± 5 Ubuntu 63 ± 5 FreeBSD could stop maintaining Local Patch included new upstream version 
keep track local patches Debianbased distributions use patch management systems “quilt” “dpatch” “git” FreeBSD maintainers manage patches manually Updating Package’s Metadata cf task 3 New Package Integration Testing cf task 4 New Package Publishing Package cf task 5 New Package Participants maintainer upstream developer bug reporter Interactions Local Patch typically accompanied Upstream Sync Packaging Change Dependency Management Literature paradox one hand submit patch upstream avoid maintenance hand hard time getting patch accepted studied integration challenge literature across different kinds reuse organizations Bac et al 2005 Brownsworth et al 2000 Merilinna Matinlassi 2006 Spinellis et al 2004 Stol et al 2011 silver bullet exists although similar Upstream Sync Dependency Management close collaboration organization upstream generally recommended Stol et al 2011 even case COTS Morisio et al 2002 However collaboration takes lot time effort goodwill also guarantee upstream accept maintain patch Ven Mannaert 2008 fact often happens even accepted patch still needs maintained downstream organization since organization required expertise Jaaksi 2007 opposite approach successful case ISS ISS team reaches teams reuse components help integration Stol et al 2011 Alternatively one could use COTSstyle glue wrapper code avoid changing actual code altogether Di Giacomo 2005 Van Der Linden 2009 However approaches less powerful one loses benefits OSSISS still require maintenance kind middle ground many organizations use packaging organizations like OSS distributions maintenance buffer upstream Merilinna Matinlassi 2006 shifting problem distributions presence sufficient industrial partners one could even consider making independent fork upstream component quite costly end successful practice Ven Mannaert 2008 Note patches local usage configuration never picked upstream hence require eternal maintenance applies especially endusers might local patches top distribution’s package 
Notable Instance patch quickly adopted upstream Debian Ubuntu packages GNOME sensorsapplet DebianUbuntu desktop widget temperature sensors featured “ugly outdated icons” 69800 newer icons comply license policy Debian Ubuntu fix Ubuntu maintainer built local patch top Debian package use newer icons Ubuntu upstream developer contacted icon designer make new icons compatible Debian adding additional license icons example “Disjunctive” legal pattern German Hassan 2009 designer complied Ubuntu maintainer reported license change Debian maintainer could drop Local Patch Local Patch cause havoc notorious security hole OpenSSL Debian package implementation SSLTLS protocols introduced Debian local patch lasted May 2006 May 2008 call function adding randomness cryptographic key accidentally commented Local Patch 363516 Debian maintainer contacted upstream fully disclose plans largely ignored patch never sent upstream inclusion afterwards complicate issue address mailing list contacted Debian real OpenSSL development list since one hidden nondevelopers security hole propagated 44 derived distributions without maintainers contributors involved identifying bug G Maintainer Transfer Intent Maintaining package maintainer absent unwilling incapable maintain package Motivation package maintainer major responsibility since requires mediating upstream projects enduser typically multiple packages time However maintainers may periods cannot spend required time integration may lose interest certain packages could become unresponsive bug reports user requests worst case package could even orphaned maintainer quits prevent packages product based Van Der Linden 2009 stalling OSS distributions need provide means keep packages evolving bypassing overriding maintainer Major Tasks 1 Overriding Maintainer depends distribution organizes package ownership package maintenance shared across distribution developers collectively concept overriding maintainer relevant Ubuntu example packages 
commercially supported Main Restricted archives managed team known Core Developers whereas packages commercially unsupported Universe Multiverse archives supported community guidance team known “Masters Universe” MOTU developer modify package long managed developer’s collective change introduce unnecessary divergences compared upstream case disagreement amongst developers conflict resolution procedures place rarely need used
[15] http://lwn.net/Articles/282038
[16] http://bit.ly/w7rn04
[17] http://www.links.org/?p=327
Distributions individual package ownership hand need Maintainer Transfer policy take role maintainer becomes unresponsive disappears altogether contributor proposing Upstream Sync Dependency Management Infrastructure Change Local Patch fulfils certain criteria explicitly mark change Maintainer Transfer Debian example called “NonMaintainer Upload” NMU valid changes fix important known bug Debian provides “nmudiff” tool help contributors submit NMUs unique property Maintainer Transfer change timer attached delay depending severity proposed change eg FreeBSD typically uses delay 2 weeks Unless maintainer replies change time change set go automatically timer expires maintainer replies time request suspending timer order review change approved contributor needs revise change corresponding maintainer’s comments found 57 ± 5 Debian 29 ± 5 FreeBSD packageversions contain instance Maintainer Transfer Ubuntu collective package ownership hence transfers minmedianmax number days changes accepted 015556 days Debian 116465 days FreeBSD Debian median value low indicating maintainers often commit Maintainer Transfer timer goes FreeBSD timeouts much common cases maximum timeout Debian 325110 FreeBSD 140303 correspond packages temporarily orphaned ie maintainer officially stepped Supporting Orphaned Packages typically done ad hoc team volunteers based casual contributions reported critical bugs Debian QA team typically jumps make changes orphaned packages Adopting Orphaned Packages either happens 
volunteers interested orphaned package convention contributor provides patches orphaned package automatically becomes new maintainer example feedback received patch FreeBSD within three months maintainer deemed abandoned package contributor may assume maintainership FreeBSD Documentation 2011 Section 55 Participants maintainer contributor Interactions Maintainer Transfer cooccur activities except New Package Literature could find reference Maintainer Transfer activity literature However Curtis et al 1988 Lewis et al 2000 stress importance “systemlevel thinkers” maintainers able sufficiently understand specific domain integrated component well overall architecture system According analysis Maintainer Transfer activity would kick soon maintainer component would possess skills Notable Instances NMU helping busy maintainer httrack 340431 Debian offline browser fixed issue file system locations test files bug reported 11th October 2006 followed one week later proposed NMU contributor couple hours later NMU approved maintainer noted 392419 “Thanks lot didn’t yet sic change sic review issue” NMU strings attached maintainer libcdio 0782dfsg121 Debian library accessing CD media warned 20th January 2008 C header file issues upcoming release GCC 43 Productwide Concern Two months later contributor sent NMU patch fixing compiler errors One day later maintainer chimes 461683 “I don’t object NMU know haven’t handling libcdio package best possible way wish NMU please consider applying patches sent bug reports” NMU approved day hostile NMU 18th May 2007 contributor requested Upstream Sync new upstream release 132 libjcalendarjava 12261 Debian calendar picker component also proposed Packaging Change support Kaffe Java VM However since nothing happened one week contributor added comment bug reports stating “I planning NMU nothing happens again” 424981 424982 next day maintainer replied 424981 “I admit I’m reactive NMU checked Jcalendar 132 backwards compatible version 12” Nothing 
happened 15 months NMU timer expired NMU went
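The timer behaviour described for Maintainer Transfer can be sketched as a small state machine. This is a minimal illustration, not any distribution's actual tooling: the class `TransferRequest`, the `Status` values, and the method names are hypothetical, and the two-week default delay is only the typical FreeBSD figure quoted above. Real delays depend on the severity of the proposed change.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Status(Enum):
    PENDING = "pending"                # timer running, no maintainer reply yet
    SUSPENDED = "suspended"            # maintainer replied in time and is reviewing
    APPLIED = "applied"                # change went in (approved or timer expired)
    NEEDS_REVISION = "needs revision"  # contributor must rework the change


@dataclass
class TransferRequest:
    """Hypothetical model of one proposed Maintainer Transfer change."""
    proposed_on: date
    delay: timedelta = timedelta(weeks=2)  # assumed default, severity-dependent in practice
    status: Status = Status.PENDING

    @property
    def deadline(self) -> date:
        return self.proposed_on + self.delay

    def tick(self, today: date) -> Status:
        # If nobody replied before the deadline, the change goes in automatically.
        if self.status is Status.PENDING and today >= self.deadline:
            self.status = Status.APPLIED
        return self.status

    def suspend(self, today: date) -> Status:
        # A maintainer reply only suspends the timer if it arrives in time.
        self.tick(today)
        if self.status is Status.PENDING:
            self.status = Status.SUSPENDED
        return self.status

    def review(self, approved: bool) -> Status:
        # The maintainer either approves the change or sends it back for revision.
        if self.status in (Status.PENDING, Status.SUSPENDED):
            self.status = Status.APPLIED if approved else Status.NEEDS_REVISION
        return self.status
```

A suspended request stays suspended regardless of how much time passes, which matches the description that a timely reply stops the timer until the maintainer has reviewed the change.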
::::
5 Identified Integration Challenges seven discussed integration activities document complexity integration Even simplest case ie black box integration maintainers still need package integrated New Package verify integrated product compatible Upstream Sync follow Dependency Management changes like library transitions case white box integration integrated projects need customized fixed Local Patches streamlined productwide policies Productwide Changes time packaging logic configuration files need kept uptodate Packaging Change maintainer activity needs monitored Maintainer Transfer paraphrase Curtis et al 1988 “are claiming discovered new insights” OSS integration instead identified documented core integration activities maintainers three large OSS distributions perform daily basis “to help identify factors must attacked improve” integration Although distributions guidelines address activities Debian 2011 FreeBSD Documentation 2011 differences terminology eg “NMU” vs “timeout” technical procedures eg centralized vs distributed Productwide Concern make confusing understand compare activities study possible tools techniques improve activities Hence unifying vocabulary provide key understand integrating process upstream components complementing existing work code integration Coplien et al 1998 DeLine 1999 Frakes Kang 2005 Parnas 1976 Pohl et al 2005 selection reusable components Bhuta et al 2007 Chen et al 2008 Li et al 2009 Finally also compared activities prior work particular commercial settings Throughout analyses documentation 7 integration activities distilled 13 concrete challenges summarized Table 4 across four different research areas challenges discussed earlier paper Ubuntu Debian Table 4 Open challenges integration activities Area Challenge packaging · insight upstream build process · automatic buildruntime dependency extraction · accurate replication packaging environment testing · crossplatform testing package dependencies · integration testing packaging · 
accurate replication functionality issues evolution · determining best moment Upstream Sync · insight upstream changes · recommendations important API changes · management ownership package changes merging · prediction integration defects · identifying opportunities cherrypicking · insight merge status Local Patches currently process designing automatic unit integration testing system packaging process Similar defect prediction work code level prediction integration defects effort involved fixing defects would extremely useful initial work Mohamed et al 2008 Yakimovich et al 1999 work needed bring techniques practitioners Similarly kind bugzilla repository managing ownership changes ie update reversedependencies perform Productwide Concern act NMU needed improve communication across involved parties Insight upstream build process Adams et al 2007 Qiang Godfrey 2001 currently relies manual tracing analysis build runtime logs packages rudimentary scripts checking runtime dependencies general however ability accurately replicate bugs code build missing Packaging environments vary widely users certain combinations package distribution versions causing subtle packaging runtime problems Current bug reporting tools automatically include detailed platform information yet information often insufficient identify Dependency Management changes challenges impact even three largest popular OSS distributions powerful tool process support essential OSS integration activities complementing mailing lists bug repositories custom dashboards example track library transitions currently use organizations researchers studying challenges API changes Dagenais Robillard 2008 merge defects Brun et al 2011 Shihab et al 2012 Clearly research needed support maintainers field
::::
6 Evaluation six contacted maintainers pointed small factual errors earlier version documented integration activities recent advances eg regarding automatic test systems built Debian Ubuntu However fundamental errors identified activity discarded identified inaccuracies fixed activity descriptions Regarding completeness usefulness documented activities Table 5 summarizes replies six contacted maintainers explained Section 36 two maintainers M2 M5 provided empty replies least two questions M1 left one question open Hence obtained empty replies Q2 Q3 Q5 discuss question’s answers Q1 activities miss Five maintainers pointed missing activities although many captured form “Upstream Lobbying” fact mentioned part Local Patch M4 found deserved activity Interestingly M6 mentioned inverse kind lobbying ie lobbying derived distributions newly reported fixed bugs Instead splitting Local Patch decided keep activity add detail lobbying part B “Postrelease Maintenance” suggested M4 M2 dedicated integration activity encompassing activities occurring new packageversion made new release distribution M4 notes “while maintainer isn’t required support use product sic often first person contacted someone can’t get build FreeBSD” activities capture activity outcome example form Packaging Change Local Patch many emails could exchanged regarding maintenance problem without M1 M2 M3 Q1 licensecopyright analysis vulnerability resolution postrelease maintenance Q2 people unfamiliar topic reply major activities easytoread way Q3 reply reply detailexamples Q4 license tracking none none Q5 DEP5CDBS license checking reply automated testing autom dep checking autom dep checking M4 M5 M6 Q1 upstream lobbying package endoflife monitoring downstream postrelease maintenance distributions bugspatches Q2 useful overview reply nice intro document activities distro dev Q3 nothing none Q4 timely integration none monitoring status desktop vs enterprise packages hundreds variants distribution Q5 good 
question − reply improvements package process atomic package updates corresponding change log item bug report ie data set capture discussions Although hints less important integration issues since need fixed acted upon form future work analyze mailing list data distributions uncover part integration work C “LicenseCopyright Analysis” mentioned M1 important activity “copyrightlicensing analysis isn’t mentioned anywhere yet it’s often tiresome process creating new package often forgot sic update upstream sync” License analysis occur often data set example Ubuntu samples found one occurrence version “04 − 0Ubuntu1” package “brandingubuntu” case license files specified GPL reason activity captured category “Vulnerability Resolution” pointed M2 missing activity ie steps performed address vulnerability timely manner release Although one top 7 activities hence documented detail us vulnerability resolution occurred relatively often Table 3 occurring 44 ± 5 Debian 13 ± 5 Ubuntu 08 ± 5 FreeBSD packageversions data shows vulnerabilities reported fixed upstream Similar Upstream Sync distributions first become aware vulnerabilities update packages soon fix available reason vulnerability changes tend use NMUs see Maintainer Transfer since security team wants update vulnerable package soon possible overruling maintainer necessary Often vulnerability fixes cherrypicked leaving upstream changes next official Upstream Sync example cupsbase revision 144 FreeBSD 24th January 2005 fixed vulnerability Cups printer server identified reported upstream university student php4 44403ubuntu1 Ubuntu cherrypicked 8 upstream vulnerability fixes php programming language 19th December 2005 Since full details vulnerabilities processed internally available publicly available databases since less common seven documented activities detailed analysis integration activity future work E “Package Endoflife” missing often overlooked activity according M5 packages lose user maintainer interest time hence 
distribution evolves integration activities need performed package either nobody steps substantial effort required maintainers keep package uptodate Similarly older version library rendered obsolete newer one older version starts create conflicts newer one older version needs removed distribution However find evidence activity data samples Maintainer Transfer activity comes closest since one occurs unmaintained package “saved” endoflife new maintainer Surprisingly Internationalization activity ninth frequent activity found Table 3 mentioned maintainer activity comprises work related translation adaptation package cultures eg different currencies Xia et al 2013 Since distributions reach significantly users individual upstream could reach packaged higher chance used nonEnglish locales Hence distributions typically dedicated teams addressing internationalization needs packages example debianl10nenglish team works translation templates packages facilitate job translators often engineering experts Distributions typically solicit Internationalization patches development frozen ie basic new functionality stabilized bug fixes still allowed Although Internationalization changes typically harmless rare cases keep packages executing January 2006 example incomplete Japanese character prevented xchat IRC client FreeBSD executing 1character fix translation template fixed issue Q2 documented activities used M1 M3 M4 agree documented patterns provide clear overview major integration activities useful novices M1 well stakeholder involved integration M3M4 M4 noted activities necessarily need used direct documentation could also used check well distribution collects data monitors progress integration activity M3 informed us structured accessible explanations major integration activities piqued interest two package testers believes success M6 recommended us “reach developers communities documentation Eg could write blog post providing introduction paper targeted distribution devs” 
planning follow suggestion Q3 missing documented activities M3 interested getting details examples activity M4 wanted know recommended practices tools activity documented activities purpose describe major tasks implemented three considered distributions without dedicated section “best practices” Given many challenges identified Section 5 well Section 2 many activities rely manual work hence yet best practices Q4 challenges miss M1 mentioned license tracking M4 noted largest challenge perform activity perform time Given ever shorter time frame releases Hertzog 2011 Remnant 2011 Shuttleworth 2008 indeed important constraint identified challenges Furthermore right activity particular moment also depends enduser “desktop users want updates ASAP enterprise users don’t want change multiple years” echoes known phenomena Microsoft’s monthly “patch Tuesday” Lemos 2003 Mozilla’s extended support releases companies Khomh et al 2012 M4 concluded warning challenges represented hundreds variations build systems versioning schemes projects etc Slightly related M6 noted “something orthogonal management large amount packages getting global overview status easy” ties managementrelated challenges Table 4 identified data Q5 promising toolstechniques see coming address challenges M1 M3 expect automated dependency checking tools become mainstream ie “It may take time make automatic getting closer every day” tools would improve least Upstream Sync Dependency Management activities M1 mentioned two promising license analysis tools M3 remarked “We already automated testing tools Ubuntu see QA team heading right direction here” M6 saw advent atomic Dependency Management packaging process improvements promising development Overall six maintainers liked work found documented activities described daily activities “quite well” M6 would necessarily use documented representation activities targeted towards novices except systematically check activities distribution tracking M4 missing important 
activities identified particular license analysis tracking licensing changes vulnerability resolution postrelease maintenance well missing challenges especially time pressure Finally tool support dependency checking expected arrive medium term however many challenges remain open
::::
7 Threats Validity respect construct validity several threats consider First used change log messages representative record maintainers’ activities based important bug reports identified indepth manual analysis necessary mailing list messages kinds documentation formally verify accuracy data sources completeness Although M6 warned log message first version Debian package always mention whether Local Patch performed none 4 instances New Package found suffered issue evidence suggests logs incorrect three analyzed distributions require maintainers provide log messages Debian 2011 Koshy 2013 since primary input end users maintainers affected changes package fact bug reports mailing lists form official means communication OSS distributions together IRC chat messages cases bug report identifier missing cf Fig 4 either change log item sufficiently clear able find related email message via web search Second analyzed subset packageversions hence change logs mitigate threat randomly sampled large enough subset packageversions obtain confidence interval ±5 95 confidence level Furthermore activities identified Ubuntu FreeBSD add new activity top identified Debian Third algorithm reconstructing “versions” FreeBSD CVS commits depends conventions documented FreeBSD explicitly enforced possible recovered versions either finegrained underapproximating actual number activities performed version coarsegrained overapproximating Feedback package maintainers confirmed algorithm correct deviations guidelines minimal Fourth since study individual packageversions sample could contain multiple versions packages one version packages version remaining packages approach necessary since large projects like KDE GNOME involve integration effort smaller projects hence need weight study addition projects typically also larger number associated packages increases weight risk sampling decision biases observed activities small since ecosystems like KDE GNOME consist hundreds different applications 
tools developed hundreds developers packaged dozens maintainers words even inside one ecosystem still expect large diversity integration activities Regarding internal validity mentioned rely accuracy completeness logs packageversion Even event activities documented logs specific reason believe activities would less documented others hence effect would cancel across different activities example Postrelease Maintenance missed results since “unimportant” discussions ie without explicit bug report patch attached trace change log referenced bug reports across three distributions Furthermore nature manual classification implies might misclassifications activities well challenges overcome logs interpreted two authors experience integration tasks one DebianKubuntu developer discussed decisions order resolve differences obtain consensus discussions also resolved possible bias introduced first set tags derived one authors Furthermore validate discovered patterns integration open challenges reached six maintainersrelease engineers Debian Ubuntu FreeBSD evaluate provide feedback patterns Nonetheless quantitative results paper prevalence activity exploratory extrapolate results evaluation six maintainers performed entirely via email since preferred means communication maintainers bug repositories discussed suited Furthermore asynchronous nature emails provided breathing space maintainers made easier organize feedback amongst voluntary open source activities daytime job Even still observed questions addressed future work might complement asynchronous messages via email synchronous followup via example instant messaging using IRC open replies maintainers well selection maintainers evaluation also could introduce bias M2 provided three open replies M5 two open replies M1 one open reply yielding total 6 open replies 30 20 Due distribution open replies across questions question obtained least four concrete replies two obtained six replies Furthermore open replies spread across 
Debian Ubuntu maintainers reducing overall impact missing data even Regarding selection bias six maintainers experienced maintainers respective OSS distribution covering range different packages according size domain alternative evaluation methodology would first perform survey interview research findings would empirically analyzed validated change log data However would bias results activities stakeholders think would important necessarily important activities actually essential activities never would surfaced respect external validity analyzed three largest OSS distributions exemplars packaging organizations Since integration central activity OSS distributions expect identified activities representative many activities packaging organizations would face case OSS reuse example packaging organizations like GNOME KDE even “regular” Java C systems reuse multiple open source libraries well deal Upstream Sync eg reusing new version log4j Dependency Management eg adding dependencies new version log4j Local Patch eg customizing new version log4j fix bug Nevertheless manual analysis kinds OSS distributions eg Fedorabased packaging organizations general organization performs multicomponent integration necessary confirm conjectures validate generalizability seven integration activities analysis might discover new activities example case package organizations build products endusers rather middleware frameworks companies build 8 Conclusion reuse major tenet engineering yet integration activities accompany COTS OSS ISS context introduce unforeseen maintenance costs Since empirical research necessary area help organizations reuse components successfully since studies thus far focused integration individual components andor nonOSS integration performed largescale study three successful OSS distributions ie Debian Ubuntu FreeBSD Analysis large sample change log messages bug reports historical integration data resulted identification seven major integration activities whose 
processes documented patternlike fashion help organizations researchers understand responsibilities involved integration activities shown nontrivial requiring large amount effort validated six maintainers three distributions Based seven documented activities major challenges turned related cherrypicking safe changes new upstream release management dependencies packages testing packages coordination among maintainers Models tools needed support integration activities providing unified terminology across distributions documenting integration activities structured way catalogue activities enables maintainers open source distributions organizations interested reusing OSS ISS components researchers better understand challenges activities face plan policies tools methods address challenges Together studies integration dedicated training program integration could built aimed developers managers aim reducing least stabilizing maintenance costs caused integration Finally encouragingly distribution maintainers contacted hope documented activities challenges inspire researchers start research program domain reuse integration Acknowledgments authors would like thank maintainers release engineers Debian Ubuntu FreeBSD participated study either directly providing feedback documented activities indirectly providing insights fascinating world OSS distributions References Adams B De Schutter K Tromp H De Meuter W 2007 Design recovery maintenance build systems Proceedings Intl Conf Soft Maint pp 114–123 Adams B Kavanagh R Hassan AE German DM 2015 Replication package httpmcispolymtlcapublications2014integrationossdistributionadamsetalzip Bac C Berger Deborde V Hamet B 2005 howto contribute libre integrate inhouse application Proceedings 1st Intl Conf Open Source Systems OSS113–118 Basili VR Briand LC Melo WL 1996 reuse influences productivity objectoriented systems Commun ACM 3910104–116 Begel Nagappan N Poile C Layman L 2009 Coordination largescale teams Proceedings 2009 ICSE 
Workshop Cooperative Human Aspects Engineering CHASE ’09 pp 1–7 Washington DC USA IEEE Computer Society Bhuta J Mattmann C Medvidovic N Boehm BW 2007 Framework Assessment Selection Components Connectors COTSBased Architectures WICSA page 6 Information Technology Resources Board 1999 Assessing risks commercialofftheshelf applications Technical report ITRB Boehm B Abts C 1999 COTS integration Plug pray Computer 321135–138 Bowman Holt RC Brewster NV 1999 Linux case study extracted architecture Proceedings 21st Intl Conf Engineering ICSE pp 555–563 Brooks FP Jr 1995 Mythical Manmonth Anniversary Ed AddisonWesley Longman Publishing Co Inc Boston USA Brownsword L Oberndorf Sledge CA 2000 Developing new processes COTSbased systems IEEE Softw 17448–55 Brun Holmes R Ernst MD Notkin 2011 Proactive detection collaboration conflicts Proceedings 19th ACM SIGSOFT Symp 13th European Conf Foundations Engineering ESECFSE pp 168–178 Chen W Li J J Conradi R Ji J Liu Chunnian 2008 empirical study development open source components Chinese industry Softw Process 1389–100 Cochran WG 1963 Sampling Techniques 2nd edn John Wiley Sons Inc New York Coplien J Hoffman Weiss 1998 Commonality variability engineering IEEE Softw 1537–45 Crnkovic Larssom 2002 Challenges componentbased development J Syst Softw 613201–212 Curtis B Krasner H Iscoe N 1988 field study design process large systems Commun ACM 31111268–1287 Dagenais B Robillard MP 2008 Recommending adaptive changes framework evolution Proceedings 30th Intl Conf Engineering ICSE pp 481–490 de Souza CRB Redmiles Cheng LT Millen Patterson J 2004 Sometimes need see walls field study application programming interfaces Proceedings 2004 ACM Conference Computer Supported Cooperative Work CSCW ’04 pp 63–71 New York NY USA ACM de Souza CRB Redmiles DF 2008 empirical study developers’ management dependencies changes Proceedings 30th International Conference Engineering ICSE ’08 pages 241–250 New York NY USA ACM participants 2013 
httpwwwdebianorgdevelpeople Debian 2011 Debian Developer’s Reference 2011 edition DeLine R 1999 Avoiding packaging mismatch flexible packaging Proceedings 21st Intl Conf Engineering ICSE pp 97–106 Developer’s Reference Team Barth Di Carlo Hertzog R Nussbaum L Schwarz C Jackson 2011 Debian Debian Di Cosmo R Di Ruscio Pelliccione P Pierantonio Zacchiroli 2011 Supporting evolution componentbased foss systems Sci Comput Program 761144–1160 Di Giacomo P 2005 COTS open source components really different battlefield Proceedings 4th intl conf COTSBased Systems ICCBSS pp 301–310 Dogguy Glondu Le Gall Zacchiroli 2010 Enforcing typesafe linking using interpackage relationships Proc 21st Journées Francophones des Langages Applicatifs JFLA p 25p Frakes W Terry C 1996 reuse metrics models ACM Comput Surv 282415–435 Frakes WB Kang K 2005 reuse research status future IEEE Trans Softw Eng 31529–536 FreeBSD porter’s handbook 2011 httpbitlyFQDPhP freeBSD developers 2013 httpwwwfreebsdorgdocenarticlescontributorsstaffcommittershtml Gaffney JE Durek TA 1989 reuse – key enhanced productivity quantitative models Inf Softw Technol 315258–267 Gamma E Helm R Johnson R Vlissides J 1995 Design patterns elements reusable objectoriented AddisonWesley Longman Publishing Co Inc German DM GonzalezBarahona JM Robles G 2007 model understand building running interdependencies Proceedings 14th Working Conf Reverse Engineering WCRE pages 140–149 German DM Hassan AE 2009 License integration patterns addressing license mismatches componentbased development Proceedings ICSE pp 188–198 German DM Webber JH Di Penta 2010 Lawful engineering Proceedings FSESDP wrksh Future Soft Eng research FoSER pp 129–132 GonzalezBarahona JM Robles G Michlmayr Amor JJ German DM 2009 Macrolevel evolution case study large compilation Empirical Softw Engg 14262–285 Goode 2005 Something nothing management rejection open source australia’s top firms Inf Manage 425669–681 BSD Certification Group 2005 BSD usage survey Technical 
report BSD Certification Group Hauge Ø Ayala C Conradi R 2010 Adoption open source softwareintensive organizations systematic literature review Inf Softw Technol 52111133–1154 Hauge Ø Sørensen CF Conradi R 2008 Adoption open source industry Proceedings 4th IFIP WG 213 Intl Conf Open Source Systems OSS vol 275 pp 211–221 Herbsleb JD Grinter 1999 Splitting organization integrating code Conway’s law revisited Proceedings 21st International Conference Engineering ICSE ’99 pp 85–95 New York NY USA ACM Herbsleb JD Mockus Finholt TA Grinter 2001 empirical study global development distance speed Proceedings 23rd International Conference Engineering ICSE ’01 pp 81–90 Washington DC USA IEEE Computer Society Hertzog R 2011 Towards Debian rolling Debian CUT manifesto httpraphaelhertzogcom20110427towardsdebianrollingmyowndebiancutmanifesto Jaaksi 2007 Experiences product development open source Proc IFIP Working Group 213 Open Source Soft volume 234 pp 85–96 Springer Koshy J 2013 Building products FreeBSD httpwwwfreebsdorgdocenarticlesbuildingproducts 2013 Khomh F Dhaliwal Zou Adams B 2012 faster releases improve quality – empirical case study Mozilla Firefox Proceedings 9th IEEE Working Conf Mining Repositories MSR pp 179–188 Zurich Switzerland Lemos R 2003 Microsoft details new security plan httpnewscnetcomMicrosoftdetailsnewsecurityplan2100100235088846html Lewis P Hyle P Parrington Clark E Boehm B Abts C Manners R 2000 Lessons learned developing commercial offtheshelf COTS intensive systems Technical report SERC Li J Conradi R Bunse C Torchiano Slyngstad OPN Morisio 2009 Development offtheshelf components 10 facts IEEE Softw 2680–87 Li J Conradi R Slyngstad OP Torchiano Morisio Bunse C 2008 stateofthepractice survey risk management development offtheshelf components IEEE Trans Softw Eng 34271–286 Li J Conradi R Slyngstad OPN Bunse C Khan U Torchiano Morisio 2005 empirical study offtheshelf component usage industrial projects Proceedings 6th intl conf Product Focused Process 
Improvement PROFES pp 54–68 van der Linden FJ Schmid K Rommes E 2007 product lines action best industrial practice product line engineering Springer Berlin Heidelberg Van Der Linden F 2009 Applying open source principles product lines Eur J Informa Prof UPGRADE 332–40 Lundqvist 2013 GNULinux distribution timeline httpfuturistsegldt Mattsson Bosch J Fayad 1999 Framework integration problems causes solutions Commun ACM 421080–87 McCamant Ernst MD 2003 Predicting problems caused component upgrades Proceedings Symposium Foundations Engineering pp 287–296 McIntosh Adams B Kamei Nguyen Hassan AE 2011 empirical study build maintenance effort Proceedings ICSE pages 141–150 Merilinna J Matinlassi 2006 State art practice opensource component integration Proceedings 32nd Conf Engineering Advanced Applications EUROMICRO pp 170–177 Meyer MH Lehnerd AP 1997 power product platforms Free Press New York Michlmayr Hunt F Probert 2007 Release management free projects practices problems Open Source Development Adoption Innovation v 234 pp 295–300 Mistrík Grundy J Hoek Whitehead J 2010 Collaborative engineering challenges prospects chapter 19 1st edn Springer Berlin Heidelberg pp 389–402 Mohamed Ruhe G Eberlein 2008 Optimized mismatch resolution COTS selection Softw Process 132157–169 Morisio Seaman CB Basili VR Parra Kraft SE Condon SE 2002 COTSbased development processes open issues J Syst Softw 613189–189 Navarrete F Botella P Franch X 2005 agile COTS selection methods Proceedings 31st EUROMICRO Conference Engineering Advanced Applications EUROMICRO ’05 pp 160–167 Washington DC USA IEEE Computer Society Orsila H Geldenhuys J Ruokonen Hammouda 2008 Update propagation practices highly reusable open source components Proceedings 4th IFIP WG 213 Int Conf Open Source Systems OSS vol 275 pp 159–170 Parnas DL 1976 design development program families IEEE Trans Softw Eng 21–9 Pohl Klaus Böckle G van der Linden FJ 2005 product line engineering foundations principles techniques Springer New 
York Remnant SJ 2011 new release process Ubuntu httpnetsplitcom20110908newubuntureleaseprocess Rodin J Aoki 2011 Debian New Maintainers’ Guide Debian Ruffin Ebert C 2004 Using open source product development primer IEEE Softw 21182–86 Sadowski BM SadowskiRasters Gaby Duysters G 2008 Transition governance mature open source community evidence Debian case Inf Econ Policy 204323–332 Scacchi W Feller J Fitzgerald B Hissam Lakhani K 2006 Understanding freeopen source development processes Softw Process Improv Pract 112 Seaman CB 1996 Communication costs code design reviews empirical study Proceedings 1996 Conference Centre Advanced Studies Collaborative Research CASCON ’96 pp 34– IBM Press Shihab E Bird C Zimmermann 2012 effect branching strategies quality Proceedings ACMIEEE intl symp Empirical Engineering Measurement ESEM pp 301–310 Shuttleworth 2008 art release httpwwwmarkshuttleworthcomarchives146 Sojer Henkel J 2010 Code reuse open source development quantitative evidence drivers impediments J Assoc Inf Syst 11iss12 Spinellis Szyperski C Guest editors’ introduction open source affecting development 2004 IEEE Softw 21128–33 Stol KJ Babar Avgeriou P Fitzgerald B 2011 comparative study challenges integrating open source inner source Inf Softw Technol 53121319–1336 Szyperski C 1998 Component beyond objectoriented programming AddisonWesley Publishing Co Fedora 2011 Package update HOWTO httpfedoraprojectorgwikiPackageupdate FreeBSD Documentation 2011 FreeBSD Porter’s Handbook FreeBSD Foundation Tiangco F Stockwell Sapsford J Rainer Swanton E 2005 Opensource occupational health application case heales medical ltd Procs 1130–134 Trezentos P Lynce Oliveira AL 2010 Aptpbo solving dependency problem using pseudoboolean optimization Proceedings IEEEACM intl conf Automated Engineering ASE pp 427–436 Qiang Godfrey 2001 buildtime architecture view Proceedings ICSM pp 398– MOTU team 2013 httpslaunchpadnet7Emotumembers Ubuntu core development team 2013 
httpslaunchpadnet7Eubuntucoredevmembers Ubuntu universe contributors team 2013 httpslaunchpadnetuniversecontributorsmembers van der Hoek Wolf AL 2003 release management componentbased Softw Pract Exper 3377–98 Ven K Mannaert H 2008 Challenges strategies use open source independent vendors Inf Softw Technol 50910991–1002 Whittaker J Arbon J Carollo J 2012 google tests AddisonWesley Professional Comparison BSD operating systems 2011 httpenwikipediaorgwikiComparisonofBSDoperatingsystems Xia X Lo Zhu F Wang X Zhou B 2013 internationalization localization industrial experience Proceedings 18th Intl Conf Engineering Complex Computer Systems ICECCS pp 222–231 Yakimovich Bieman JM Basili VR 1999 architecture classification estimating cost COTS integration Proceedings 21st Intl Conf Engineering ICSE pp 296–302 Bram Adams assistant professor Polytechnique Montréal Canada obtained PhD GHSEL lab Ghent University Belgium adjunct assistant professor Analysis Intelligence Lab Queen’s University Canada research interests include release engineering general well integration build systems particular work published premier engineering venues TSE ICSE FSE ASE EMSE MSR ICSME addition coorganizing RELENG 2013 2015 1st IEEE SW Special Issue Release Engineering coorganized PLATE ACP4IS MUD MISS workshops MSR Vision 2020 Summer School PC cochair SCAM 2013 SANER 2015 ICSME 2016 Ryan Kavanagh Bachelor Computing Honours student Computing Mathematics Queen’s University research assistant SAIL lab Dr Hassan McGill University Microsoft Research Cambridge Ryan started contributing Ubuntu derived distributions February 2006 high school December 2011 became official Debian developer spare time Ryan avid piper various Canadian titles belt Ahmed E Hassan Canada Research Chair CRC Analytics NSERCBlackBerry Engineering Chair School Computing Queen’s University Canada research interests include mining repositories empirical engineering load testing log mining Hassan received PhD Computer Science 
University Waterloo spearheaded creation Mining Repositories MSR conference research community Hassan also serves editorial boards IEEE Transactions Engineering Springer Journal Empirical Engineering Springer Journal Computing Contact ahmedcsqueensuca Daniel German professor Computer Science University Victoria completed PhD University Waterloo 2000 work spans areas mining repositories open source intellectual property engineering
::::
Variant Forks – Motivations Impediments John Businge∗ Ahmed Zerouali‡ Alexandre Decan† Tom Mens† Serge Demeyer∗ Coen De Roover‡ ∗University Antwerp Antwerp Belgium †University Mons Mons Belgium ‡Vrije Universiteit Brussels Brussels Belgium johnbusinge sergedemeyer uantwerpenbe alexandredecan tommens umonsacbe ahmedzerouali coenderoover vubbe Abstract—Social coding platforms centred around git provide explicit facilities share code projects forks pull requests cherrypicking name Variant forks interesting phenomenon respect permit different projects peacefully coexist yet explicitly acknowledge common ancestry Several researchers analysed forking practices open source platforms observed variant forks get created frequently However little known motivations launching variant fork mainly technical eg diverging features governance eg diverging interests legal eg diverging licences factors come play report results exploratory qualitative analysis motivations behind creating maintaining variant forks surveyed 105 maintainers different active open source variant projects hosted GitHub study extends previous findings identifying number finegrained common motivations launching variant fork listing concrete impediments maintaining coexisting projects Index Terms—Mainlines Variants GitHub ecosystems Maintenance Variability INTRODUCTION collaborative nature open source OSS development led advent social coding platforms centred around git version control system GitHub BitBucket GitLab platforms bring collaborative nature code reuse OSS development another level via facilities like forking pull requests cherrypicking Developers may fork mainline repository new forked repository take governance latter preserving full revision history former advent social coding platforms forking rare typically intended compete original 1–6 rise pullbased development 7 forking become common community typically characterises forks purpose 8 Social forks created isolated development goal contributing 
back mainline contrast variant forks created splitting new development branch steer development new direction leveraging code mainline 9 Several studies investigated motivations behind variant forks context OSS projects 1–6 However conducted rise social coding platforms known GitHub significantly changed perception practices forking 8 social coding era variant projects often evolve social forks rather planned deliberately 8 end social coding platforms often enable mainlines variants peacefully coexist rather compete Little known motivations creating variants social coding era making worthwhile revisit motivation creating variant forks Social coding platforms offer many facilities code sharing eg pull requests cherrypicking projects coexist one would expect variant forks take advantage common ancestry frequently exchange interesting updates eg patches common artefacts Despite advanced codesharing facilities Businge et al observed limited code integration using git GitHub facilities mainline variant projects 10 suggests code sharing facilities enough graceful coevolution making worthwhile investigate impediments coevolution therefore explore two research questions RQ1 developers create maintain variants GitHub literature predating git social coding platforms identified four categories motivations creating variant forks technical eg diverging features governance eg diverging interests legal eg diverging licences personal eg diverging principles RQ1 aims investigate whether motivations variant forks still whether new factors come play RQ2 variant projects evolve respect mainline despite advanced code sharing facilities limited code integration mainline variant projects possible cause could related teams working variants mainline structured Therefore RQ2 investigates overlap teams maintaining mainline variant forks teams interact hope identify impediments coevolution investigations based online survey conducted 105 maintainers involved different active variant forks 
hosted GitHub contributions manifold identify new reasons creating maintaining variant forks identify categorize different code reuse change propagation practices variant mainline confirm little code integration occurs variant mainline uncover concrete reasons phenomenon discuss implications findings tools help achieve efficient code integration collaboration mainlines diverging variant forks replication package found 1 II RELATED WORK Previous research focused motivations creating maintaining variant forks B interaction variant forks mainline Motivations creating maintaining variant forks Several studies investigated motivations creating maintaining variant forks However studies carried SourceForge predating advent social coding platforms like GitHub 1–5 11 Several early studies report perceived controversy around variant forks 5 12–17 Jiang et al 18 state although forking may controversial OSS community encouraged builtin feature GitHub report developers create social forks repositories submit pull requests fix bugs add new features Zhou et al 8 conclude variant forks started social forks perceptions forks changed advent GitHub Robles GonzálezBarahona 2 carried comprehensive preGitHub study carefully filtered list 220 potential forks referenced Wikipedia report motivations outcomes forking 220 projects literature uncovered number motivations creating variants present mainline variant coevolve together motivation reviving abandoned considered study since involve coevolution variants ○ Technical addition functionality Sometimes developers want include new functionality main developers accept contribution example Poppler fork xpdf relying poppler library 2 ○ Governance disputes contributors community create variant feel feedback heard maintainers mainline unresponsive slow accepting patches wellknown example fork GNU Emacs originally Lucid created result significant delays bringing new version support Energize C IDE 19 ○ Legal issues includes disagreements license 
trademarks changes conform rules regulations example XOrg originated XFree86 2 19 XFree86 originally MITX open source license GPLcompatible changed one GPLcompatible caused many practical problems serious uproar community resulting fork XOrg ○ Personal reasons situations developer team disagrees fundamental issues beyond mere technical matters related development process example OpenBSD fork NetBSD One developers NetBSD disagreement rest core developers decided fork focus efforts OpenBSD 20 Focusing variant forks Android ecosystem Businge et al 21 found rebranding simple customizations feature extension implementation different related features main motivations create forks Android apps Zhou et al 8 interviewed 18 developers hard forks GitHub understand reasons forking social coding environments explicitly support forking motivations observed align findings aforementioned studies Sung et al 9 investigated variant forks industrial case study uncover implications frequent merges mainline resulting merge conflicts variant forks implemented tool automatically resolve 40 8 types mainlineinduced build breaks preGitHub studies reported perceived controversy around variant forks Zhou et al 8 report controversy reduced advent GitHub Jiang et al 18 report forking considered controversial traditional OSS communities actually embraced builtin feature GitHub study builds previous studies identify whether motivations variant forks still whether new factors come play B Interaction variant forks mainline encountered two studies investigated interaction variant forks mainlines 8 10 Zhou et al 8 conducted 18 semistructured developer interviews Many respondents indicated interested coordination across repositories either eventually merging changes back mainline monitor activity mainline repository select integrate interesting updates variant Businge et al 10 also investigated interaction mainline variants authors quantitatively investigated code propagation among variants mainline 
three ecosystems found 11 10979 mainline–variant pairs integrated code Since mainlines variants share common code base collaborative maintenance facilities git pullbased development model one would expect interactions mainline variants hypothesise impediments enable interactions Since two aforementioned studies report impediments decided carry exploratory qualitative survey variant maintainers identify possible impediments III STUDY DESIGN understand motivations behind creation maintenance variant forks conducted online survey maintainers variant forks section explain designed survey protocol ii collected mainlinevariant pairs extracted maintainers variant forks iii recruited survey participants Survey Protocol Design designed 12question survey would last 15 minutes Since aimed learn large number projects used online survey data collection approach known scale well 22 survey found here2 questions designed cover two main research questions 8 12 questions closeended respondents could answer either via multiple choice Likert scales optional freetext form provided 3 8 closeended questions allow respondents share additional thoughts feedback 4 remaining questions openended questions carefully formulated bias respondents towards specific answer validated subjecting critical eye 7 colleagues conducting trial runs survey 7 participants B Identifying variant projects participants Given scope survey target respondents involved creation maintenance variant projects Therefore first needed identify variants end relied two data sources Librariesio GitHub Librariesio contains metadata projects distributed various package registries collected metadata projects largest package registries npm Go Maven PyPI Packagist relied metadata identify projects variants another one following variant identification method proposed Businge et al 10 23 considered variants actively maintained parallel mainline counterparts extracted variants mainline–variant pair created 20190401 updated least 
20200401 ie active projects process yielded 227 mainline–variant pairs collected additional mainlinevariant pairs GitHub directly searched mainline projects using GitHub search endpoint looked popular 50 stars forks longlived created 2018 active still updated 2020 repositories focused development repositories whose main language among top 17 popular languages used GitHub eg JavaScript Java Go Python Ruby C etc mainline projects found tried identify collect variant forks process subject known threat validity since previous studies revealed majority forks GitHub inactive 24 25 social forks 21 reduce threat filtered forks based following heuristics ≥ 10 stars ≥ 10 commits ahead mainline ≥ 5 closed pull requests diverging README files manually verified remaining forks ensure corresponded variants corresponding mainline process yielded 264 additional mainlinevariant pairs leading total 491 collected mainline–variant pairs C Participant Recruitment Based collection mainlinevariant pairs identified contributors integrated least one pull request variant retrieved publicfacing emails available using GitHub API ensuring respect GitHub Privacy Statement3 individually contacted total 762 variant maintainers 491 variant projects received total 105 responses response rate 14 representing total 105 variant forks 21 participants required read accept informed consent form taking part survey Analysis used open card sorting 26 3 openended questions identify common responses reported participants analysis grouped similar responses openended questions themes start predefined themes mind instead derived themes openended answers iterating many times needed reaching saturation point first iteration coding themes performed first author paper responses first author unsure decided discussion second author first two authors agreed themes virtual meeting set six authors discuss resulting themes come negotiated agreement 27 allowed us remove duplicates cases generalize specialize themes 
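The fork-filtering heuristics described in the study design (stars, commits ahead of the mainline, closed pull requests, diverging README) can be sketched as a simple predicate. This is a minimal illustration, not the authors' tooling: the `Fork` record, its field names, and the sample data are hypothetical, and the sketch assumes all four conditions must hold together, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Fork:
    """Hypothetical record of the fork metadata the heuristics rely on."""
    name: str
    stars: int
    commits_ahead: int          # commits ahead of the mainline
    closed_pull_requests: int
    readme_differs: bool        # README diverges from the mainline README


def is_variant_candidate(fork: Fork,
                         min_stars: int = 10,
                         min_commits_ahead: int = 10,
                         min_closed_prs: int = 5) -> bool:
    """Return True if the fork passes every heuristic.

    Assumption: the heuristics are combined with a logical AND.
    """
    return (fork.stars >= min_stars
            and fork.commits_ahead >= min_commits_ahead
            and fork.closed_pull_requests >= min_closed_prs
            and fork.readme_differs)


# Invented sample data, for illustration only.
forks = [
    Fork("alice/widget-plus", stars=42, commits_ahead=120,
         closed_pull_requests=9, readme_differs=True),
    Fork("bob/widget", stars=3, commits_ahead=1,
         closed_pull_requests=0, readme_differs=False),
    Fork("carol/widget-fix", stars=15, commits_ahead=12,
         closed_pull_requests=2, readme_differs=True),
]

candidates = [f.name for f in forks if is_variant_candidate(f)]
print(candidates)
```

Forks that pass such a filter would still need the manual verification step the study describes, since the thresholds only reduce the number of inactive and social forks and do not prove that a fork is a variant.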
2105281zenodo5855808 3httpsdocsgithubcomengithubsitepolicygithubprivacystatement IV RQ1 developers create maintain variants GitHub RQ1 aims investigate whether new motivations creating variant forks changed since advent social coding platforms asked survey participants following questions SQ1a motivation creating variant individual community decision SQ1b motivation creating variant mainline SQ1c motivation details relating motivation SQ1b SQ1c presented multiple choice question SQ1a presented Likertscale answer options SQ1b optional openended question latter coded responses themes categorised common themes quoting survey respondents refer using R N notation N respondent’s ID respondents’ answers include selection multiple choice answers well themes resulting coding openended answers underlined openended responses presented italics applicable integrate compare findings related research findings Results Fig 2 summarises responses SQ1a SQ1b Fig 2a shows majority participants responded decision individual Fig 2b shows majority ranked highly technical motivation creating variants also see quite number highly ranked motivations governance others previous studies investigated motivations creating variants study investigated details motivations SQ1c identify details two optional openended questions allowed respondents provide details Likertscale answer SQ1b two questions 1 Kindly provide details selected answers motivation 2 links documented relating choice answers motivation detail kindly point us
::::
100 105 survey respondents answered optional openended question SQ1c Luckily coding process cf Section IIID able identify possible answers 5 respondents answer SQ1c comparing information readmemd files variant mainlines 30 105 respondents provided links documents pull requests issues blogs relating choice answers motivation detail Fig 3 presents Sankey diagram summarising details respondents’ choice motivation based coded themes figure presents distribution responses questions relating RQ1 responses relate thickness edge represents frequency respondents two entities Focusing axes decision motivation confirm observations Fig 2b majority respondents individual technical motivation majority respondents answered question original developers selected none implying majority variants started different developers Since answers SQ1b presented Likert scale participants asked rank appropriate motivations created variant coding motivations details identified respondents ranked highly one motivation category also provided response openended question support highly ranked motivation category scenario highly ranked motivation category would motivation detail respondent end found 105 survey participants chose 145 motivation categories 84 technical 34 governance 3 legal 24 others present common motivation themes specific responses found interesting Technical Maintenance frequently mentioned reason technical motivation 19 84 survey participants selected technical mentioned phrases related performing bugsecurity fixes R59 ranked highly technical governance mentioned “The PR merge fork’s new capabilities mainline code large attempts incorporate feedback PR ended upsetting primary maintainer studiously ignoring pull request three years” respondent also provided GitHub link pull request mainline Indeed found PR made February 2018 accompanied discussion 218 comments mainline maintainer respondent October 2021 PR still open • “I forked original order fix bug However way original 
architected made challenging ended rewriting instead submitting patch original” R79 next prominent technical motivation detail different goals 17 respondents selected technical mentioned phrases related variants present different goals content communities directions • “We list websites accept Bitcoin Cash cryptocurrency opposed mainline lists websites 2 factor authentication” R1 • “The original goal mainline completely different fork variant” R4 • “We wanted take different direction” R100 equally prominent technical motivation detail new features 17 respondents selected technical mentioned phrases related introduction new features mainline • “ add support feature knew would get merged main project” R53 • “Mainline developer bugfixes eventual underlying runtimeSDK upgrades stay current add new features due lack interest ” R67 • “Our variant introduces new experimental functionality yet ready use mainline” R80 Another technical motivation customization 8 respondents selected technical mentioned phrases related variant customizes mainline features • “The “bones” good wanted add aesthetics forked make pretty own” R10 • “The new version vectorized accelerated version original” R37 • “We added syntactic sugar improvements ” R42 next technical motivation unmaintained feature 8 respondents selected technical mentioned phrases related one mainline feature used variant longer maintained • “The ‘shiny’ component mainline declared longer maintained around time created fork like many architectural decisions original opted create fork instead volunteer maintain original” R65 respondent provided extra link issue ‘shiny’ component opened July 2015 closed July 2017 issue contained 93 comments 35 participants closing issue maintainer stated “ somebody bodies community wants fork source code run blessing ” variant created August 2017 • “The mainline made radical shift providing one set features different disjoint set features maintainer thought well users including built workflows 
around one old features reason lifted particular feature separate also published different name package index” R23 respondent also provided us GitHub issue link discussing details issue opened variant maintainer July 2015 eventually closed April 2018 issue 33 comments involving 17 participants • “Mainline dropped support small subset code asked community support create fork support subset” R66 final technical motivation technology 7 respondents selected technical mentioned phrases related variant created depend different technology • “Added support Open Street Maps available map provider mainline willing accept kind contribution” R8 also ranked governance • “The mainline wasn’t updated use NET Core using updated it” R29 • “ keep source code compatible languagecompiler version use Swift Xcode maintainer mainline supporting different one could compile dependency anymore” R54 Governance technical governance secondmost popular motivation responsiveness prominent governance category 18 34 respondents selected governance mentioned phrases related mainline unresponsive pull requests issues long time respondents ranked governance highly motivation also ranked options motivations highly 4 34 ranked governance • “They series commits fixed functionality newer PHP versions never made release waiting year release fork done push newer release ComposerPackagist” R21 • “We submitted bug fixes didn’t hear back maintainer needed progress meet goals forked followed email maintainer merged patches month later point closed archived fork returned using mainline” R15 Merging back original corresponds one outcomes variant forking reported 2 • “ due lack response mainline maintainer months need release lead release new variant intention submit changes mainline anymore even first PR merged mainline year” R56 next governance motivation feature acceptance 15 respondents selected governance mentioned phrases related mainline hesitant willing accept feature • “TECHNICAL Added support Open 
Street Maps available map provider GOVERNANCE exactly governance mainline willing accept kind contribution” R8 coded technology technical respondent also provided GitHub PR link containing extra information PR included 45 conversations 15 participants June 2018 March 2021 closed • “Mainline ready accept changes part maintainers responsive Since time issues dealt variant longer needed though infrastructure creating new release variant remains place event might needed future” R44 • “ even main repo maintainer saying busy please use fork thing X don’t know exact reason stopped maintaining also allow us maintain repo” R89 one multiple choice answers respondent indicated variant created community decision respondent also provided extra link revealing three contributors community interested couple new features missing mainline mainline maintainer seemed busy end two members community took fork maintenance introduced missing features advertised additions readmemd file fork well issue Others prominent motivation others supporting personal projects 8 24 respondents selected others mentioned phrases related variant created support personal projects • “The maintainer interested PR added functionality needed I’m developing considerably easier add logic new library bolt on” R18 ranked technical governance others see participant response phrases like “adding logic” new features technical “was interested PR” feature acceptance governance “functionality needed I’m developing” supporting personal projects others • “In Oct 2017 changed API changes broke mainline used daily needed fix ASAP quick fix started add features mainline fixed refactored projects already depending fork” R56 • “ make sure matter happen mainline repository maintain source access library essential dependency ” R54 response line Nyman et al 1 reported forking provides mechanism safeguarding despotic decisions lead thus guided actions consider best interest community next motivation others supporting mainline 
mentioned 7 respondents selected others • “We fork “main fork” “development fork” FORKNAME case modeling tool maintained fork synchronize everything forks FORKNAME one mainly used develop new features pushed PRs main fork” R61 • “Preparation mainline pull requests mainline repo spammed WIP PRs students Supervisors coaching try improve quality initial mainline pull request Keeping PR open fork reduces number PRs” R73 • “We needed repository tracking ideas keep number issues main repository low” R83 extra link provided revealed mainline variant owned developer “this repository used X make ideas transparent collects issues avoid flooding “official” issue tracker Refined issues migrated official issue tracker” next motivation detail others code quality 3 respondents selected others mentioned phrases related mainline low code quality “The mainline clearly written someone isn’t professional engineer” R63 “The way original architected made challenging ended rewriting instead submitting patch original” R79 Legal motivation legal least popular corresponding 3 105 respondents indicated phrases related closed source present corresponding responses “The main reason creating open source commercial product much features” R7 motivation detail also categorised new features technical supporting personal projects others “5 years ago permissions model GitHub Travis today wanted use Travis granted Travis access primary github account would read access github repos would expose private customer code forked repo permissions model evolved deleted fork” R24 “The founders mainline absent several years came back booted maintainers shifted closed source” R36 respondent provided link extra information showing three maintainers booted original fourth one community joined forces maintaining variant variant currently 739 stars used 35 developers 101 pull requests 195 issues B Discussion Implications RQ1 mainly focused determining motivations creating maintaining variants especially actively 
maintained parallel mainline counterparts identified decision create variants mostly initiated individuals less community observations thereby confirm findings literature study also extends stateoftheart providing finegrained reasons creating maintaining variants relating reported motivations Furthermore study revealed new reasons reported literature categorised others survey include 1 supporting mainline 2 variant supporting personal projects 3 localization purposes 4 variant developers trusting code quality mainline reported findings useful guide followup studies investigating coevolution mainline variant projects Fig 3 presented overview detailed motivations relate involved creating maintaining variants motivations majorly related developers outside core contributors mainlines 82 also observed quite significant number respondents 24 reporting decision create variant initiated community observed openended responses transition social variant fork variant maintainers engage mainline maintainers discussions issues pull requests inline Zhou et al reported many variant forks start social forks 8 Besides motivations creating maintaining variants respondents reported interesting reuse practices variants like categorized themes different goals new features customization technology supporting personal projects supporting upstream localization specific example R70 categorized different goals theme stated cryptocurrency world applications inherit code mother bitcoinbitcoin Downstream applications also monitor immediate upstream hierarchy important updates like bug security fixes well specific updates cryptocurrency applications considered family 21 ecosystem 28 Variants also likely occur dedicated ecosystems like Eclipse Atom Emacs library distributions Java C C Python Go Ruby OS distributions macOS Linux Windows iOS end study opens different research directions aim deeply investigating different reuse practices families variants deeper understanding reuse practices aid 
developing tools support effective reuse Summary – RQ1 Many variant forks start social forks decision createmaintain forks either communitydriven contributing 24 individual 76 majority developers 82 creating forks maintainers mainlines identified 18 variant creationmaintenance motivation details categorized motivations technical accounting 58 responses governance 24 others 16 legal 2 detailed motivations others category newly introduced since social coding era V RQ2 variant projects evolve respect mainline RQ2 aims identify impediments coevolution mainline variant projects question lead two specific focuses reflecting respectively focus aimed identifying developers involved maintaining variants aimed understand variant forks evolve wrt mainline RQ1 refer responses using underlined italics RN Results “who” focus understand creating maintaining variant forks asked two multiplechoice questions SQ b2 many original developers mainline maintained variant first 6 months SQ b1 variant mainline common active maintainers Fig 4a Fig 4b summarise answers SQ b2 SQ b1 respectively majority respondents chose options none SQ b2 none creators variant part mainline SQ b1 common active maintainers implies developers involved creation maintenance variants core maintainers mainline variant forked Fig 3 reveals difference numbers participants selected none SQ2a SQ2b Focusing responses SQ2a—original developers SQ2b—common active maintainers associated one observe respondents selected option none SQ2a went ahead select option SQ2b associations responses SQ2a SQ2b observed well Anecdotally R36 responded SQ2a 6–10 developers mainline involved creation variant responded SQ2b option yes no—“They used common maintainers early stages variant projects technically diverged away common maintainers” Respondents R51 R57 selected SQ2a options 6–10 2–5 respectively selecting option SQ2b implies least two maintainers involved fork creation longer contributing mainline Summarising observations SQ2a SQ2b 
conclude variant forks created maintained developers different mainline counterparts observation concurs earlier findings Businge et al 10
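The motivation-category shares summarised for RQ1 follow from the counts reported earlier (145 category selections: 84 technical, 34 governance, 3 legal, 24 others). The arithmetic can be sketched as below; the variable names are illustrative, and the percentages printed in the RQ1 summary may differ from these by a point because of rounding.

```python
# Counts of highly ranked motivation categories, as reported in the survey results.
counts = {"technical": 84, "governance": 34, "legal": 3, "others": 24}

total = sum(counts.values())

# Share of each category among all selected motivation categories, in percent.
shares = {category: 100 * n / total for category, n in counts.items()}

for category, share in shares.items():
    print(f"{category}: {counts[category]}/{total} = {share:.1f}%")
```

Because respondents could rank more than one category highly, the denominator is the number of category selections, not the 105 respondents.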
::::
B Results “how” focus understand variant forks evolve wrt mainline asked two additional questions SQ2c variant forks upstream still discuss main directions SQ2d variant developers integrate changes upstream repository SQ2c presented four multiple choice answer options corresponding first four answers reported Fig 5 gathering highest number responses allowed respondents provide openended answer felt choice among four proposed options openended answers coded themes listed Fig 5 variant follows mainline→to variant mirror mainline Fig 5 shows half respondents chose option never corresponding never discussion since creation variant Even discussion 107 respondents signal technically diverged corresponding “They used discuss anymore since projects technically diverged other” openended answers also revealed variant responses discuss directions like mainline hostile variant active contact rarely discuss explanation high number variant developers discuss mainline developers direction derived findings SQ2a SQ2b majority variants created maintained developers core developers mainline Also motivation details RQ1 could explain high numbers never example observed majority variants motivation details category different goals unmaintained features mainline issues mainline responsiveness whose features accepted mainline feature acceptance selected never SQ2c conclude reasons majority variant forks discuss directions mainline could attributed diverging range motivations creating variant well variant creators part mainline’s core development team Anecdotally 5 respondents indicated phrases related variant follows mainline Respondent R77 indicated “in crypto world mainline inherits changes BITCOIN example security commits variant merges changes variant interested every change Mainline However variant must maintain specific new features added separately Mainline interested helping Variant this” also observed two interesting cases variants merged back mainline line Robles 
GonzálezBarahona 2 reported one outcomes forking fork merging back SQ2 asked respondents two closedended questions 1 often maintainers variant integrate following types changes mainline 2 often maintainers variant integrate following types changes mainline provided Likertscale options two questions presented optional followup questions openended answers two questions allowing respondents provide extra information Fig 6a presents answers respondents value integrating changes back mainline highly scored changes bug fixes security fixes One observe respondents leaning towards negative side Likert scale implying variants interested integrating changes mainline Fig 6b focuses integrations variants towards mainline observe similar trend Fig 6a even pronounced negative inclination Fig 6c Fig 6d present coded themes extra information gathered openended answers corresponding results Fig 6a Fig 6b respectively Fig 6c summarises results 28 respondents provided extra information Fig 6d summarises results 17 respondents likely variants submit changes mainline prominent response Fig 6c related kept sync signaling desire variants keep sync changes made mainline next prominent response related occasionally pull mainline implying variants time time pull changes made mainline respondents mentioned phrases related specific changes pulled example R63 indicated “It’s mostly changes make library specific iRobot Roomba models new ones example” respondents mentioned phrases related everything except specific changes example R48 mentioned “All noncompiler specific changes pulled” Fig 6d two prominent answers PRs suggested example “Made PRs changes ignored They’re still “open” 0 comments mainline dev” R67 prominent answer changes scope example “We use dependency another often diverging language version mainline little reason us push mainline” R54 C Discussion Implications results RQ2 revealed variants created maintained developers core developers mainline also observed limited interaction 
mainline variants Although found little code integration integration mainline variant frequent variant mainline study confirms extends findings Businge et al 10 provide concrete reasons relating little integration mainline variants include 1 technical divergency variants mainlines offering different goals implementing different technologies variant maintaining part mainline frozen 2 governance disputes mainlines unresponsive pull requests issues variants mainlines willing hesitant accept features variants One respondent also reported mainline actively hostile variants result mainline’s license changing proprietary 3 distinct developers Another reason lack code integration variants maintained developers part core team mainline Furthermore observed mainline–variant pairs interchange code mostly interested patch sets security fixes bug fixes Although maintenance collaboration improved dedicated tooling especially distributed version control systems like Git 29 transparency mechanisms social coding platforms like GitHub 30 tools ideal social forks aim sync changes repositories example code integration using pull requests git tools like mergerebase may best integrating changes mainline variant forks since involve syncing upstreamdownstream changes missing current branch study reveals variant maintainers interested integrating commits specific changes suitable integration mechanism would commit cherry picking since developers choose exact commits want integrate However GitHub’s current setup make easy identify commits cherrypick without digging branch’s history identify relevant changes since last code integration Additionally even though variants diverged mainlines believe since share common code common code may go maintenance perform bug security fixing Since mainline–variant repository pairs maintained uncommon developers chances fixes could missed could fixed different times different developers resulting duplicated effort findings relevant code integration tool 
builders mainline variants prioritise certain categories mainline–variant pairs targeting specific changes Ideally tooling would help identify possibly important fixes commits recommend commits mainline variant developers support efficient reuse promising studies direction focused providing mainline facilities explore nonintegrated changes forks find opportunities reuse 31 crossfork change migration 32 experimental ideas focused virtual productline platforms unified development multiple variants 33–37 Summary–RQ2 Variant forks usually interact mainline coevolution lack interaction could attributed variety reasons including technical divergence variants mainlines offering different features implementing different technologies nothing share ii governance disputes mainlines unresponsive requests community also uninterested features suggested community iii distinct development teams longer interact iv diverging licenses mainline variant changed license integration longer possible result divergences likely important security patch updates could missed duplicated VI Threats Validity Construct validity response categories closed questions survey originated thorough literature review questions carefully phrased avoid biasing respondent towards specific answer validated questions consulting seven colleagues three different universities trial runs survey seven participants Social desirability bias may also influenced answers 38 mitigate issue informed participants responses would anonymous evaluated statistical form Internal validity used open coding process classify participants responses received openended questions coding process known lead increased processing categorization capacity loss accuracy original response alleviate issue lack accuracy allowed one code assigned answer Generalizability study limited variants mainline repositories hosted GitHub claim findings generalize social coding platforms addition set participants interviewed corresponds decided make email 
public accepted take part study de facto representative maintainers variant forks VII Conclusions Thanks social coding platforms like GitHub reuse forking create variant projects rise carried exploratory study 105 maintainers variants focusing answering two key research questions 1 developers create maintain variants GitHub observed motivations reported studies carried preGitHub era still hold identified 18 motivation details variant creation maintenance categorized motivations technical 58 responses governance 24 others 16 legal 2 motivations newly introduced social coding era 2 variants projects evolve respect mainlines found little interaction variants mainlines coevolution reported possible impediments lack interaction include technical ie diverging features variants mainlines offering different goals implementing different technologies nothing share ii governance ie diverging interests mainlines unresponsive requests community also uninterested features suggested community iii legal eg diverging licenses mainline variant changed license integration longer possible findings useful guide followup studies investigating coevolution reuse practices mainline variants deeper understanding practices aid code integration tool builders developing tools support effective reuse mainline projects variant forks Acknowledgment work supported joint FWOVlaanderen FRSFNRS Excellence Science SECOASSIST Grant number O015718F RG43 REFERENCES 1 L Nyman Mikkonen J Lindman Fougère “Perspectives code Forking Sustainability open source software” Open Source Systems LongTerm Sustainability 2012 pp 274–279 2 G Robles J GonzálezBarahona “A comprehensive study forks Dates reasons outcomes” Open Source Systems LongTerm Sustainability 2012 pp 1–14 3 R Viseur “Forks impacts motivations free open source projects” International Journal Advanced Computer Science Applications vol 3 2 February 2012 4 L Nyman J Lindman “Code forking governance sustainability open source software” Technology 
Innovation Management Review vol 3 pp 7–12 January 2013 5 L Nyman Mikkonen “To fork fork Fork motivations SourceForge projects” Open Source Systems Grounding Research 2011 pp 259–268 6 J Gamalielsson B Lundell “Sustainability open source communities beyond fork libreoffice evolved” Journal Systems vol 89 pp 128 – 145 2014 7 G Gousios Pinzger van Deursen “An exploratory study pullbased development model” International Conference Engineering 2014 pp 345–355 8 Zhou B Vasilescu C Kästner “How forking changed last 20 years study hard forks GitHub” International Conference Engineering ACM 2020 pp 268–269 9 C Sung K Lahiri Kaufman P Choudhury C Wang “Towards understanding fixing upstream merge induced conflicts divergent forks industrial case study” International Conference Engineering ACM 2020 pp 172–181 10 J Businge Openja Nadi Berger “Reuse maintenance practices among divergent forks three ecosystems” Journal Empirical Engineering 2021 11 Laurent Understanding Open Source Free Licensing O’Reilly Media 2008 12 B B Chua “A survey paper open source forking motivation reasons challenges” Pacific Asia Conference Information Systems 2017 13 J Dixion “Different kinds open source forks Salad dinner fish” httpsjamesdixonwordpresscom20090513differentkindsofopensourceforkssaladdinnerandfish 2009 14 N Ernst Easterbrook J Mylopoulos “Code forking opensource requirements perspective” ArXiv vol abs10042889 2010 15 L Nyman “Hackers Forking” International Symposium Open Collaboration 2014 pp 1–10 16 E Raymond Cathedral Bazaar Musings linux open source accidental revolutionary O’Reilly Media Inc 2001 17 P Bratach “Why Open Source Projects Fork” httpsthenewstackioopensourceprojectsfork 2017 18 J Jiang Lo J X Xia P Kochhar L Zhang “Why developers fork GitHub” Empirical Softw Engg vol 22 1 pp 547–578 Feb 2017 19 Wheeler “forking” httpsdwheelercomossfswhyhtmlforking 2009 revised July 18 2015 20 de Raadt “Theo de Raadt’s dispute w NetBSD” httpszeustheoscomderaadtcoremailhtml 2006 retrieved 
October 2021 21 J Businge Openja Nadi E Bainomugisha Berger “Clonebased variability management Android ecosystem” International Conference Maintenance Evolution IEEE 2018 pp 625–634 22 F Uwe Introduction Qualitative Research London Sage Publications 2014 23 J Businge Decan Zerouali Mens Demeyer “An empirical investigation forks variants npm package distribution” BelgiumNetherlands Evolution Workshop ser CEUR Workshop Proceedings vol 2912 CEURWSorg 2020 24 J Businge Openja Kasvales E Bainomugisha F Khomh V Filkov “Studying Android app popularity crosslinking GitHub Google Play store” International Conference Analysis Evolution Reengineering 2019 pp 287–297 25 J Businge Kawuma E Bainomugisha F Khomh E Nabaasa “Code authorship faultproneness opensource Android applications empirical study” PROMISE 2017 26 Zimmermann “Cardsorting text themes” Perspectives Data Science Engineering Elsevier 2016 pp 137–141 27 Garrison ClevelandInnes Koole J Kappelman “Revisiting methodological issues transcript analysis Negotiated coding reliability” Internet Higher Education vol 9 pp 1–8 03 2006 28 Decan Mens P Grosjean “An Empirical Comparison Dependency Network Evolution Seven Packaging Ecosystems” Empirical Softw Engg vol 24 1 pp 381–416 Feb 2019 29 C RodríguezBustos J Aponte “How distributed version control systems impact open source projects” Working Conference Mining Repositories IEEE 2012 pp 36–39 30 L Dabbish C Sturt J Tsay J Herbsleb “Social Coding GitHub Transparency Collaboration Open Repository” Conference Computer Supported Cooperative Work 2012 pp 1277–1286 31 L Ren Zhou C Kästner “Poster Forks insight Providing overview GitHub forks” International Conference Engineering Companion ICSECompanion 2018 pp 179–180 32 L Ren “Automated patch porting across forked projects” Joint European Engineering Conference Symposium Foundations Engineering 2019 pp 1199–1201 33 Antkiewicz W Ji Berger K Czarnecki Schmorleiz R Lämmel Stănciulescu Wąsowski Schaefer “Flexible Product Line 
Engineering Virtual Platform” Companion International Conference Engineering 2014 pp 532–535 34 Fischer L Linsbauer R E LopezHerrejon Egyed “Enhancing cloneandown systematic reuse developing variants” International Conference Maintenance Evolution 2014 pp 391–400 35 L Montalvillo Díaz “Tuning GitHub SPL development Branching models repository operations product engineers” International Conference Product Lines 2015 pp 111–120 36 J Rubin Chechik “A framework managing cloned product variants” International Conference Engineering IEEE 2013 pp 1233–1236 37 Stanculescu Berger E Walkingshaw Wasowski “Concepts operations feasibility projectionbased variation control system” International Conference Maintenance Evolution ICSME 2016 pp 323–333 38 Furnham “Response bias social desirability dissimulation” Personality Individual Differences vol 7 3 pp 385–400 1986
::::
“Nip Bud” Moderation Strategies Open Source Projects Role Bots JANE HSIEH Carnegie Mellon University USA JOSELYN KIM Carnegie Mellon University USA LAURA DABBISH Carnegie Mellon University USA HAIYI ZHU Carnegie Mellon University USA Much modern digital infrastructure relies critically upon open sourced communities responsible building cyberinfrastructure require maintenance moderation often supported volunteer efforts Moderation nontechnical form labor necessary often overlooked task maintainers undertake sustain community around OSS study examines various structures norms support community moderation describes strategies moderators use mitigate conflicts assesses bots play role assisting processes interviewed 14 practitioners uncover existing moderation practices ways automation provide assistance main contributions include characterization moderated content OSS projects moderation techniques well perceptions recommendations improving automation moderation tasks hope findings inform implementation effective moderation practices open source communities CCS Concepts • Humancentered computing → Open source Empirical studies HCI Empirical studies collaborative social computing Additional Key Words Phrases moderation automation coordination open source ACM Reference Format Jane Hsieh Joselyn Kim Laura Dabbish Haiyi Zhu 2023 Nip Bud Moderation Strategies Open Source Projects Role Bots Proc ACM HumComput Interact 7 CSCW2 Article 301 October 2023 29 pages httpsdoiorg1011453610092
::::
1 INTRODUCTION Online social coding platforms GitHub facilitate production open source OSS modern digital infrastructure relies heavily upon However excess volumes issues requests filed users overload volunteer maintainers 86 Aggravating situation open source developers become toxic hostile course technical ideological disagreements 26 Incivility suppresses productivity creativity quality workplace 84 semiprofessional production platforms like GitHub misbehaviors caused growing concerns mental wellbeing contributors maintainers 73 Moderation nontechnical form labor necessary often overlooked understudied task maintainers undertake sustain community around OSS date well understood maintainers grapple toxic undesirable behavior projects particularly scale Research described different types conversations around code contributions 104 categorized toxic content insults trolling well displays arrogance entitlement 26 75 time know responding issues pull requests important part maintenance work open source 34 43 Geiger points maintainers must delicately navigate instances mismatch work required merge contribution nonmaintainers’ desires integrate certain piece functionality 43 also know responses interactions around public conversations OSS projects GitHub important signal potential contributors users underscoring importance dealing toxicity 30 85 growing body research CSCW examines users moderate content online communities increasingly leverage automation efficiently control bad behavior 66 93 106 studies describe challenges moderation different platforms eg 56 58 explores novel moderation techniques tools eg 21 22 examine effectiveness different moderation behaviors strategies eg 92 93 example Jhaver et al find moderation transparency matters offering removal explanations Reddit reduces likelihood future post removals 55 66 Lampe Resnick observe timeliness trades accuracy distributed moderation systems like Slashdot automated moderation systems scale well removing 
obviously undesirable content eg spam malware links Chancellor et al note magnify errors 20 making human decisions preferable nuanced 19 54 high stakes contexts 47 online social media communities platforms studied much moderation research vary three important ways open source development First social media forums textbased discussion groups typically informal public spaces people gather share compelling interesting information converse well build communities 81 whereas open source communities aim collaboratively produce entail complex organizational structures highly technical discussions tied code artifacts utilized professional purposes 14 Secondly individual contributor’s activities GitHub implications employment prospects reputation within OSS professional community broadly 1 Unlike peers discussion groups like Reddit participants pseudonymous anonymous large portion GitHub users real name identified often accounts listed personal CVs resumes 29 95 Finally types inappropriate behaviors harmful content present OSS communities diverge traditionally found social media Past work uncovered passiveaggressive behaviors namecalling entitlement prevalent among conversations OSS developers 36 75 findings support results Thus distinctive user goals behaviors inappropriate content found OSS communities might necessitate adoption unconventional moderation strategies study qualitatively examine community moderation open source repositories existing strategies structures techniques used mitigating preventing inappropriate activity conversation Moderation includes activities manage behavior conversations around issues code contributions well code eg use potentially offensive variable names Specifically sought answer following research questions investigate moderation open source communities Research questions moderation look like OSS performs moderation actions projects capacity b strategies moderators use respond diffuse prevent conflicts 1 choices may lead disinhibition 
online 67 99 111 2 current limitations automation moderation potential future improvements order address questions conducted interviews 14 maintainers across 10 projects identify moderation actions performed projects different scales well attitudes towards algorithmic support toxicity moderation prevention find 1 moderation open source conducted different roles depending size structure projects 2 moderators leverage several strategies mitigate prevent emergent conflicts 3 future efforts need address concerns around customizability detection accuracy deploying automation tools help offload labor moderation documenting structures forms labor performed around moderation within open source projects hope enlighten future practitioners available strategies moderating digitallymediated development contexts characterizing potentials limitations automation tools moderation support practitioners understanding anticipating challenges impacts adopting automation also encourage tool designers developers build findings future tools moderation provide improved wider services open source community members
::::
2 BACKGROUND development productoriented collaborative endeavor making open source development environments semiformal working spaces expect professional conduct participants development process collaborators may encounter myriad technical interpersonal conflicts impede work following present study platform notable types open source conflicts prior work participants reported well relevant prior work moderation automation toxicity detection various online communities 21 Study Context GitHub focused data collection open source projects hosted GitHub platform GitHub facilitates collaboration communication among developers users owners projects 30 71 arguably popular hosting site projects June 2022 GitHub reports 83 million developers 5 200 million repositories including least 28 million public repositories Projects GitHub organized code repositories repos short owned personal account usually creator another maintainer organization comprises multiple users Collaborators repository direct write access make commits work together owner maintain Users primarily consist consumers star repos express interest save later reference Within repository contributors users plan work track bugs request new features express maintenance concerns creating issues 71 external noncollaborator developer changes propose submit pull request – special issue posting code contributions others review integrate existing codebase However pull request requires approval one authorized collaborators merged communicate developments collaborators comment issues pull requests well lines code 22 Conflicts Incivility Open Source Open source maintainers responsible tremendous amounts unseen civic labor underlies digital infrastructure many documented overwhelming volumes invisible labor engenders harm mental wellbeing 33 68 Maintainers seldom recognized sufficiently stewardship causing individual stress burnout 86 imperiling projects undermaintenance threatening overall sustainability open source ecosystem 
26 86 Due factors like lack corporate management structure geographic dispersion open source maintainers required undertake plethora complex interpersonal organizational work 43 Community maintenance tasks include providing support internal contributors well technical assistance external users make use product Previous investigations found organizational interpersonal labor play critical role traditional engineering contexts 74 80 101 Due fully public largely voluntary nature discussions actions open source development moderation one necessary task maintainers must undertake avoid overwhelming amount negative content harmful interactions Prior work extensively documented presence incivility conflict general negative emotions across multiple actions open source development including code reviews 10 11 16 31 32 36 87 issue discussions 37 75 well comments actions 41 50 Negative interactions occur among different members community eg core collaborators external contributors well maintainers across different projects stem multiple grounds ranging language cultural differences political disagreements personal feuds dependencies mismatches expectations 38 43 71 Conflicts among internal contributors difficult moderate since organization members cannot ban interventions familiar respected contributors get tricky politically charged misconduct external banned members harmful well Incivil behaviors present semiprofessional volunteer development environment endanger sustainability open source decreasing intrinsic motivation contributors reducing productivity heightening dropout rates newcomers 63 76 84 Rather categorizing types conflict open source focus 26 38 75 one aim study via RQ1 characterize strategies structures maintainers use moderate incivil situations incivility originating internal contributors wellstudied 10 11 16 31 32 36 87 102 frustration follows unrealistic expectations user support cause toxic insults 75 entitlement directed maintainers demanding time 
attention 43 User support involves providing assistance consumers difficulties making use either existing defects consumer’s misunderstanding aspect 65 Swarts identified usability transparency issues causes user needs open source 100 projects scale user support becomes tedious task overwhelming maintainers issues requests demanding time emotional labor 43 Unlike commercial vendors generally rely institutional infrastructures paid dedicated tech support teams open source provides informational user support free charge via small group volunteer users maintainers 65 23 Governance Moderation nontechnical labor moderation often overlooked essential understanding infrastructure open source 33 69 88 According Grimmelmann moderation consisted “governance mechanisms structure participation community facilitate cooperation prevent abuse” 48 context open source define community moderation set activities maintainers designated moderators leverage manage behavior conversations around issues code contributions code effort minimize harmful abusive activities foster collaborative welcoming environment contributors Much like social media eg Discord 58 Reddit 23 62 Twitter 57 peer production contemporaries eg Wikipedia 12 40 44 GitHub communities engage volunteerbased community moderation opposed platformwide commercial moderation voluntary nature moderation maintenance open source forces members community eg maintainers volunteer contributors bear responsibility providing support assistance users unlike support providers commercial products services volunteer contributors uncompensated 65 exacerbate workload prior studies reported maintainers found user support “overwhelming neverending chore particularly projects use GitHubstyle collaboration platforms” 43 staggering volume demands user support feature requests GitHub’s issueposting mechanisms demonstrates instance overuse – form deviant behavior among Grimmelmann’s categorization abuses leads congestion cacophony making harder 
information get thereby hindering users’ information search retrieval processes 48 Existing systems platformic content moderation found vary terms actions styles philosophies values systematic review engaged 86 papers related papers Jiang et al described tradeoffs compared various moderation techniques Grimmelmann’s four broad categories 59 included exclusion – act depriving people access online community often bans timeouts organizing – consisting measures like removing annotating content normsetting – practice issuing warnings “indirect policing” denounce bad behavior well monetary pricing – way using market forces raise prices participation users – though social media users found engage last category 48 59 study volunteer moderators Reddit Facebook Twitch Seering et al showed moderators used excluding normsetting actions eg bans warnings increasingly restrictive rates relied heavily general community members report flag misbehaviors 94 actions excluding organizing normsetting may transferable open source moderation expect distinct forms inappropriate content might motivate adoption unique strategies governance structures sought characterize moderation structures norms roles involved open source via RQ2 past work examined conflict management strategies peer review 52 emergence early governance structures GitHub 82 lack knowledge around specific strategies maintainers use moderate inappropriate problematic behaviors open source Among many forms intervention techniques available purposes Renee et al investigated code conduct document “defines standards engage community signals inclusive environment respects contributions outlines procedures addressing problems members” 6 used moderation moderation tools include documents contributing guidelines “which provides potential contributors short guide help project” 9 moderation policies inhouse features bans locking conversations 8 However Geiger et al uncovered contributors intrinsically motivated engage nontechnical 
maintenance work eg community support documentation complete technical tasks eg feature implementation debugging 45 indicating need comprehensive higherlevel strategies conducting moderation complex situations interpersonal conflicts Maintainers especially discouraged perform moderation work since found cause psychological emotional distress 98 automated assistance moderation 2though GitHub develop set platformwide Acceptable Use Policies 7 appealing solution potential minimize maintainers’ time labor tedious tasks increasing developer productivity 35 107 However exists gap understanding OSS moderation executed practice terms strategies well roles structures established support facilitate moderation study qualitatively investigate infrastructures approaches well uncover maintainers’ perspectives automation support moderation 24 Automated Moderation Bots Open Source Sentiment Bot Safe Space examples tools leverage existing sentiment analysis models help maintainers detect regulate existence toxic comments GitHub Sentiment Bot GitHub App built GitHub’s Probot framework “replies toxic comments maintainer designated reply link repo’s code conduct” 4 Safe Space GitHub action leverages TensorFlow’s toxicity classification model “detect potential toxic comments added PRs issues authors chance edit keep repos safe space” 3 bots make use machine learning classifiers detect toxic content within pull request issue threads respond back comment urges original author modify delete comment whenever problematic content detected Underlying tools sentiment analysis detectors numerous models emerged field engineering improve accuracy domain specificity models include classifiers negative interactions trained conversations surrounding issues 41 61 78 86 89 code reviews 10 16 37 commits 50 51 codes conduct 97 well data contexts support 15 Stack Overflow 18 However bot use open source contexts associated challenges Wessel et al found botgenerated noise form verbosity 
excessiveundesirable tasks causes annoyance contributors disrupts workflow creates additional labor maintainers 108 Meanwhile Huang et al discovered contributors react negatively automated encouragements 52 Outside open source Jhaver et al described subpar removal explanations provided bots Reddit brewed community resentment 55 voicebased communities like Discord bots faced challenges identifying rule violations based nuances tone accent despite widespread adoption bots automate features 58 Jiang et al highlighted tradeoff automation help communities achieve moderation massive scales faster turnarounds human involvement required understand contextual nuances provide clear removal explanations conduct negotiations around norms contribute toward community building 59 Moderators three platforms Seering et al studied also expressed desire personally deal harder nuanced situations despite content automated tools deal egregious unwanted content — authors argue desiderata motivated moderators’ inclination make contextspecific judgments impact community development 94 Smith et al identified community values related design usage machine learningbased predictive tools content moderation Wikipedia 96 open source maintainers’ moderators’ stances toward automation likely differ open source contributors habituated using tooling increasing productivity efficiency whereas efficiency moderation found trade quality 59 66 second part RQ2 aims provide insights well current moderation bots support human maintainers open source contexts improvements needed reduce friction concerns adoption 3 METHOD learn maintainers moderators maintain communities interviewed 14 individuals moderate maintain projects varied sizes ranging 500 87000 stars repos 30 4000 contributors beginning interview recruitment process obtained institutional IRB approval debriefed participants type questions expect prior starting interviews ensure ethicality 31 Recruitment Participants recruited publicly available 
information GitHub requirement criteria participants 1 least age 18 years old 2 either current past maintainer moderator collaborative open source started recruiting participants emailing owners repos used moderation bots soon realized owners limited moderation experience since bot setup resulted forked template expanded recruiting repos designated moderation teams contributing guidelines using search terms “moderating” “moderation team” GitHub also conducted snowball sampling asking participants refer us potential interviewees maintainer’s contact information public requested interview via email 40 potential participants emailed 14 agreed take interview one female – proportion women involved study par overall representation women open source 5 103 concluded recruiting process addition participants stopped generating new emergent themes – signaling theoretical saturation 28 Table 1 displays summary participants’ projects respective roles well descriptive information 32 Interview Protocol started semistructured interviews following protocol scripted questions included questions negative positive interactions detection moderation strategies codes conduct bot use category questions main goal learn strategies maintainers used respond negative interactions violations codes conduct issues bot usage introduced Sentiment Bot Specifically inquired responsibilities moderating members expected norms behaviors community whenever applicable resolution strategies disruptive behaviors past set precedents future incidents interview lasted 3060 minutes participants compensated 15 time via PayPal donation charity organization choice 33 Analysis Using interview recordings transcripts team two researchers engaged bottomup thematic analysis interviews experience team open source contributions ranges novice knowledgeable adopted thematic analysis approach analyze transcribed video recordings followed shared open coding session calibrate coding granularity first two authors developed 
initial lower level codes participant’s data synched weekly resolve disagreements resolving disagreements amongst coders conducted bottomup affinity diagramming process iteratively refine group resulting 375 unique codes 32 firstlevel themes clustered four main themes present ID Pseudonyms Area Role Contributors Stars P1 Honeysuckle Visual diagramming platform MaintainerContributor 20 5k P2 Receptive Differential Privacy Library MaintainerContributor ∼50 300 P3 Apex Runtime Environment Moderation Team Members 3k 85k P4 JaguarAPI Web framework building APIs OwnerFounder ∼300 40k P5 Grunge Programming Language Designated Moderator 35k 65k P6 Hyundai Alternative firmware Designated Moderator 200 17k P7 Vessel Container management Moderation Team Member 3k 87k P9 OwnerFounder 200 500 P10 Silverback Object Storage OwnerFounder 90 95k P11 Community Manager OwnerFounder 80 400 Table 1 Participant Summaries details extracted preserve anonymity references projects pseudonyms
::::
4 RESULTS start characterizing types inappropriate behaviors moderators observed monitored separating common types rule violations found domains implicit forms conflicts emerge technical development environment open source Next describe types moderation roles structures individuals groups assume set order effectively address govern misconduct discuss specific strategies moderators use react address prevent misbehavior incivility Finally summarize maintainers’ stance around adoption tools automate moderation highlighting various concerns overcensorship technical incapabilities well limited customizability 41 Inappropriate behaviors OSS wellstudied online communities intolerable behaviors largely comprised deliberately abusive disruptive misconduct harassment hate speech 56 58 94 However context open source explicitly inappropriate behaviors accompanied subtle acts borderline behaviors miscommunication resistance new practices inquired moderation many maintainers brought strategies used respond mediate miscommunication well ways organizing curbing excessive volume demands prior works categorizing toxic behaviors GitHub also uncovered less severe misbehaviors technical disagreements arrogance 75 make distinction clearly disruptive content detectable toxicity classifiers covert forms incivility require human judgment identify following subsection outline disruptive acts misconduct eg hate speech snarky humor well subtle forms misbehaviors OSS moderators observed guarded follow strategies leveraged address 43 Moderation Strategies 411 Explicitly Aggressive Disruptive Behaviors first class misbehaviors consisted explicitly harmful illintended content start presenting misconduct obvious eg spam egregious eg hate speech harassment follow examples concealed still harmful forms hostility include passive aggressiveness snarky humor Spam hate speech harassment Much P8’s job moderator consisted “moderating spam users” include instances “bot that’s leaving nonsensical comments 
opening garbage pull requests wasting people’s time” Even smaller projects Hyundai “spammers come things political doesn’t anything Hyundai occur twice year” P6 Hate speech like “someone coming saying ‘why people stupid’ worse that” happen fortunately “those spotty” P8 one case banned member threatened “send collaborators bombs” afterwards “he got arrested like FBI made bombs house” P8 commonsense rules like “no sexual harassment discrimination” seems obvious P4 pondered “in cases explicitly stated people violate things probably people wouldn’t guess that” Passiveaggressiveness snarky humor destructive contagious 77 “passiveaggressive comments” sadly present OSS contexts include arrogant “things like ‘I working 10 years 20 years never seen solution like you’re proposing’ – something exclusively saying you’re proposing dumb kind implicitly saying you’re inexperienced hidden way” P4 demeaning insults “can’t ask intelligent question” P4 reports content “we often get within questions threads” similar vein snarky humor also advised “it’s easy offend someone that” “it’s really hard convey mean snarky internet nobody see face” P5 Entitled demands heated complaints Users contributors felt entitled receive responses take significant amount maintainers’ time – “the thing makes time questions issues” “80 time like 90 like feature request question demand like ‘I’m never really user scope’ ” P4 requests easy address eg simple questions feature requests others get quite heated “One user complained Hyundai good wasn’t working device person didn’t read documentation started aggressive ended user complaining documentation wasn’t good enough” P6 Ironically “in many cases like errors code developer who’s asking didn’t realize” P4 Yet someone must attend issues “If ignore people get mad act more” P11 problem entitled comments isn’t comment “it’s knock effects comment people see think it’s okay behave way feel entitled they’ve seen entitlement normalized” P3
::::
412 Misunderstandings technical disagreements resistance new practices contrast explicitly aggressive misbehaviors moderator participants also reported monitoring subversive disagreements misunderstandings arise technical collaborative nature OSS projects Aside intentional misconduct “many times bad behavior misunderstandings” “it boils like miscommunication understanding issue like people talking past people getting little bit heated” P9 According P13 miscommunications occurred frequently “If dig old threads see lot full miscommunication people shouting dealt what” Technical disagreements easy surface development environments “sometimes people simply get riled idea right wrong someone else different idea tech happen” P5 one instance people get heated “a disagreement licensing code” “from another library” contributors unfortunately “felt need use ‘accusatory’ language” Technical projects often need adopt new pipelines packages keep recent updates practices sometimes new standards met “resistance initially usually large changes build pipelines” P2 first must “get transition period” P3 “over time there’s acceptance” P2 “and new norms way everyone horrified used worse” P3 42 Moderation Roles Structures open source communities perceived decentralized bazaarlike emergent governance structures form time 13 64 82 Maintainers sample employed plethora strategies overcome interpersonal technical challenges social coding Depending size organization maintainers varied governing structure strategies Specific moderation actions performed members community moderation team maintainers basic form moderation involved contributors performing selfcensorship volunteer moderators described reported potentially harmful content actions maintainers formal moderation teams following sections describe participants sample described collaboration different roles governing powers conduct moderation together 421 Selfmoderation Volunteer Moderators particular individual violated community rule 
norm selfmoderation constituted first line defense Unlike broader term community selfmoderation Seering proposes 91 consider selfmoderation individual selfcorrective action author edit fix content regardless first noticed questionable content case large projects like Apex maintainers may instate “an explicit policy ask organization members selfmoderate rules ‘allow maintainers way say ‘if made mistake apologize don’t behavior you’ll fine’ way displays norms community” P3 Member status affected received selfmoderation requests – original author “not internal collaborator moderation team summarily decide best” P3 However internal organization members exhibit problematic behaviors “then first thing required always ask selfmoderate” P3 requesting selfmoderation contributors maintainers asked specific actions like editing deleting offensive comment avoid public shaming directed author escalations Since social coding platforms like GitHub working environments producing team members expected treat civility even “You don’t like professional” P13 3 Acts selfmoderation erases many public records accidentally posted harmful content thus suspect practice prevalent often undetected Apex reporting selfmoderation also one established OSS projects Hence expect selfmoderating appear projects well encourage future work explore detection frequency selfinitiated moderation Therefore P13 asked contributors harbor negative feelings toward “to selfmoderate did” general “people usually regretful comment hurtful eager happy selfmoderate” cases uncooperative contributors refused conduct selfmoderation one cause behavior difference cultures one case P13 asked selfmoderation posting request along lines “Hey comment perceived problematic please consider selfmoderating it” recipient US would understand – “you’re really telling that” “in Israeli culture it’s perfectly acceptable say ‘No considered think better understanding you’” contributors refused cooperate moderators escalated direct
measures intervene cover 43 delegate responsibilities maintainers popular libraries Apex distributed moderation work relying community members “most time somebody reports surface saying ‘hey check this’ moderation repo that’s private org members” maintainers would prefer hide contentious content contributors users “in ideal world don’t require somebody report fix it” inevitably goes undetected larger projects “there’s scalability thing there” community reporting serve “a useful filter prevent time taken hunting problems” P3 422 Formal Moderation Teams Larger mature projects designated particular volunteer members community form official moderation team organization Apex moderation team example consisted “8 10 people 56 regularly active” P13 Moderation team members selfnominated role even exclusive contributors “any member say ‘hey want moderator’ nobody objects seven days join team” Team members recertified annually Technical Steering Committee TSC guided advised organization higherlevel directives Among ten projects interviewed six designated moderators appointed due existing demand instance P12’s moderatelysized “project first started getting popular” “no clue moderate” growing attention eventually convinced assign moderators “people demanding moderators quickly choose moderator moderators would tell people calm people respectful” Maintainers often encouraged contributor interactions moderators help offload maintenance responsibilities one case P5 popular programming language Grunge would tell users “If question anything disturbs may think disturbed others contact moderators” P7 mature Vessel also reports often redirect users contributors “talk moderator Slack” whenever community questions doubts around governance actions instituted bans moderators provide appropriate explanations Even nascent repos Silverback P11 explicitly “set community values proactively explains people community look like” way “if someone blocked don’t know blocked think unblocked know get 
touch us” P8 Outside moderating members may responsible onboarding tasks taking training course “It online course went Zoom whole team like weeks training probably another round refreshers people joined ” P13 since labor performing moderation actions eg providing explanations draining 59 moderator’s responsibility selfassess take breaks avoid burnout “Moderation something stop stop ’cause burn out” P8 423 Power Sharing Structures moderating team members hold power execute governing actions eg interaction limits bans also experienced power restrictions Restrictions typically originate higherup governing bodies Technical Steering Committee efforts decentralize democratize moderation also encourage community members call review call misjudgments moderators Technical Steering Committees TSCs tend appear larger projects 3 10 projects interviewed formed one – Apex Vessel Grunge sizable number internal members calls topdown governance Apex instance P13 blocked directly removing internal member “once you’re collaborator can’t really removed” order remove internal collaborator “the Technical Steering Committee needs vote moderation team wouldn’t typically remove collaborator” P13 TSC shoulders many technical governing responsibilities serving “the unifying factor” P13 TSC also exhibits “a strong bias towards inaction design making wrong technical decision lot riskier making technical decision” Finally TSC also consists “a lot people technical don’t like dealing interpersonal issues” P13 combination limited bandwidth composition technical members tendency toward inaction means TSC slow approving requests actions like removing collaborators result maintainers larger projects eventually “determined needed separate body Community Committee Technical Steering Committee handle governance actions membership TSC mean idea handle code conduct report” P8 leading formation official moderation teams Vessel TSC holds powers moderation team eg ability remove internal collaborators 
moderators must additionally “do weekly report TSC moderation actions happened adhere governance documentation” P8 addition moderation teams set structures also encourage members check judgments well ensure democratic distribution moderating powers “We always invite people call mods check actually right get wrong otherwise wouldn’t rules follow would well ‘this mod didn’t like nose banned me’ ” P5 424 Reporting Mechanisms support reporting misconduct volunteers moderators larger projects set “a private moderation repo” “collaborators ∼500600ish them” “open issues notify moderation team ‘here it’s something need look at’ ” P8 moderation repos community reporting works larger projects “for contentious topics issues pull requests happen occasionally projects someone notice surface even though there’s nothing bad yet” addition providing centralized place members submit reports strategy enables moderation team members “start subscribing jump really quickly something happens” addition moderation efforts community reports GitHub constitute another avenue escalation moderators don’t power edit particular posts close specific user accounts instance P8 relates “There definitely blind spots missing parts certain types comments can’t edit delete that’s bit problem contact GitHub gets really bad – gets really bad report user eventually stuff gets deleted GitHub deletes user” Spammers occasionally attacked midsized Hyundai dealt similar way “spammers come things political doesn’t anything Hyundai occur twice year deleteclose issues also reported GitHub may close accounts” Proc ACM HumComput Interact Vol 7 CSCW2 Article 301 Publication date October 2023 Moderation Strategy Description Example Actions Punitive Reactive measures taken eliminate harmful content prohibit interactions cause rapid excessive negative engagements Used situations someone acts clearly outlawed manner activities cause high levels community response Hidingdeleting comments bans interaction limits locking 
conversations calling bad behavior Mediations Diplomatic interventions taken resolve smallscale misunderstandings agreements Used disagreements small number usually internal contributors Correcting misunderstandings forming negotiations Preventative Inhibitory Precautionary measures used prevent development escalations conflicts Used situations maintainers perceive potential escalate expressions indirect hostility inside jokes belittling comments Issuing warnings calling behavior perceived potential escalate Proactive Setting rules workflows avoid repetition similar mistakes future usercontributor frustrations Used repeated offenses Setting private moderation repos codes conduct linters templates topicspecific channels Reformative Educational approaches rehabilitate misbehaviors set acceptable standards Used unintended neglect rules repeated violations multiple members Offering explanations polite admonishment Table 2 Summary Moderation Strategies 43 Moderation Strategies ideal world maintainers monitor respond negative interactions despite best intentions contributors end engaging heated conversations escalate quickly control unexpected situations occurred maintainers reacted utilizing set existing tools GitHub help limit deescalate remove interaction sometimes takes indepth interventions resolve conflict case maintainers moderators performed role conciliator mediate dispute Fortunately many misbehaviors anticipated prevented moderators witnessed intervened similar incidents instances moderators took preventative actions deter escalations avoid future mistakes reformative strategies newcomers distinguish acceptable behaviors inappropriate established norms guided contributors toward productive healthy efficient interactions Table 2 shows definitions moderation strategies uncovered example actions associated strategy 431 Punitive Strategies Punitive strategies consist reactive moderating actions content removal bans locking conversations strict enforcement codes 
conduct guidelines eliminate harmful content disruptive behaviors usually taken immediately severe situations unexpected debates outbursts limit impact inappropriate actions prevent escalations conflict content removal sufficient conclude archive exchange moderators simply hid deleted comments P3 Apex related latest preference hiding comments deletion since GitHub introduced public deletion receipts “I don’t delete comments anymore GitHub leaves record you’ve done it’s effective hide abusive offtopic” Unlike deletion leaves public trail delete receipts folding content via hiding offered transparency found 1 improve legitimacy accountability 2 increase perceived consistency 3 prevent confusion frustration 59 “People still read sucks there’s illusion censorship worse people reading content good content erased” P3 Prior public deletion receipts “if hugely toxic exchange irrelevant issue could sum delete comments nobody see toxic exchange happened” P3 Moderators found crucial enforce existing rules maintain healthy supportive environment case political spam Hyundai P6 recounted delete close issues well report accounts GitHub clearly outline desired behaviors helpful “code conduct open enforcing helps lot people know they’re getting go way acceptable behavior pretty much laid there” P5 P5 additionally emphasized importance invoking existing rules “We moderation team enforce want constructive times accept people harassing people calling names generally negative it’s rather frowned upon Basically criticize code person constructive point” Moderators also called clearly toxic behaviors yet explicitly delineated existing rules instance within popular language Grunge “we call bad behavior see it” P5 concerns raised directly GitHub media P9 team members Vessel “call bad behavior sending screenshots team’s Slack” P13 team Apex post moderation repo encourage accountability “You open issue moderation repo see you’re aware often that’s enough get deescalate people watching” 
Reactive approaches require quick responses since escalations tend unravel quickly “Either don’t notice say ‘hey it’s banned’ ‘work out’ sometimes it’s bad thing catch like one day late point it’s late” P13 disagreements develop heated debates moderators would institute temporary bans “There sometimes heated discussions may institute one even cases sevenday ban cool come back refreshed hopefully” P5 432 Mediations “In OS community implicit foundation contributions valid everybody equal stake something” P11 disagreements occurred maintainers contributors mismatched expectations future state 43 interpersonal conflicts fell moderators hear perspectives mediate underlying conflicts resolve disagreement limit development toxic behaviors Mediation involves communicating multiple parties involved conflict individually group resolve misunderstandings negotiate conflicting objectives collaborating decision P13 described one party engaged conflict sought moderators mediate situations “You find moderator respected parties involved conflict talk they’re nice usually agree facilitate things get hear sides take there” conducted mediations P13 elaborated sequential process mediations start moderator speaks individuals sides conflict “you talk sides try figure conflict try get see person’s perspective” cases someone actually commit wrongdoing misconduct “Sometimes clear person right conflict usually party either admit dig in” likely it’s miscommunication “Often someone wrong it’s like misunderstanding getting people see misunderstanding person’s perspective usually enough” P13 Apex also recounted approaching mediation giving parties benefit doubt “Most people good people good engineers little malice assuming good faith trying approach point like ‘these reasonable decent human beings’ often sufficient terms figuring right side” Meanwhile P14 smaller found mediation negotiatory task “It’s negotiation talk engineer tell don’t like try talk engineer B try see engineer proposing work 
engineer B try come tradeoff” maintainers happy act intermediary beginning instance P1 related “I would rather middleman call anyone toxicity” P14 helped contributors ask clarifications “Sometimes people come say ‘I read sure take it’s personal something’ Usually know interested parties try ask reviewer rephrase message clarify it” founders mature projects like JaguarAPI comfortable mediating “I’m trying mediate strange that’s something would normally wouldn’t normally engage aggressive conversation” P4 due hypervisibility obligations protect community members P4 wound learning learn anyway “I feel like protect community people around around family end stop whoever kind harassing us” 433 Preventative Strategies Mediations punitive strategies describe ways moderators react conflicts different scales techniques taught directly performed new moderator takes experience anticipate prevent budding future disputes Kiesler et al presented ways limit impacts misbehaviors well performance bad behaviors metaanalysis 2 categorized two types strategies inhibitory proactive preventions moderators used former prevent escalations latter proactively set workflows prevent frustrations ensure conformity standards Inhibitory Preventions conflicts end escalating fullblown arguments contributors time human moderators predict onset harmful behaviors Inhibitory preventions involve warningbased reproachful techniques moderators leverage target indirectly hostile behaviors eg inappropriate jokes passive aggressive behaviors limit harm avoid escalations indirectly hostile behavior open source projects analogous concept “toxicity elicitation” online textbased communities 110 comments behaviors elicit high toxic responses doesn’t necessarily contain toxic language preventative actions targeting behaviors included monitoring conversations calling correcting misbehaviors issuing warnings Passive aggressive behaviors classic example indirect hostility participants brought P5 Grunge recounts “we 
always stop nip bud people new language come ask questions that’s always delicate situation” reduce chances newcomers dropping “we’re extra careful protect people knowitalls people ooze negativity” P5 P11 Silverback similarly practiced firm enforcement rules prevent escalations “you firmly enforce creates good culture nip things bud don’t let escalate out” P11 absence existing rules moderators issued preventative warnings deescalate situations “Other bans even proverbial slap wrist call bad behavior fix directly that’s totally okay appreciate everyone best” P5 Even though comments outright blatantly harmful contribute normalization hidden hostility “The problem isn’t offensive toxic comments that’s actually issue It’s actually problem someone entitled comment directly It’s knock effects comment it’s people see think it’s okay behave way it’s people feel entitled they’ve seen entitlement normalized” P3 Proactive Preventions observing repeated instances misconduct moderators proactively established rules workflow standards avoid repetition similar mistakes minimize amount harm bad actors perform guide new contributors toward desired standards practices found issue newcomers 43 Specific structures include codes conduct private moderation repos formatting linters templates help contributors better frame questions suggestions channels organizing existing answers structures directly take place offense presence created structures support information dissemination thereby minimizing questions issues raised users contributors case P8 Apex entire moderation team set reaction conflict “The moderation team set reaction Apex botching situation public relations fiasco” team member P3 also described besides moderation teams contributors also set codes conduct instances conflict “anyone run need moderation codes conduct going quick implement community enter create” minimize edits maintainers need make contributors’ submissions templates helped list necessary components include new 
pull requests issues “In repository topmost folder Contributing document says pull request title commit messages format” P3 assist users drafting issues “I added load information template lot requisites ask people build simple example want” P4 434 Reformative Strategies acts misconduct borne malicious intentions 59 Sometimes moderators observed repeated instances misbehaviors employed nurturing reformative approach doesn’t castigate contributors unintentional offenses Unlike punitive preventative strategies remove member’s content right interactions reformative techniques educational gentle consisting actions like polite admonishments providing explanations long term artifacts reformative approaches eg explanations benefit community establishing acceptable behavioral standards even take time communities adopt offering benefits transparency way establish new norms subsequent community members reformative approaches garnered increased advocacy researchers recent years 53 79 Similarly reformative strategies well received among open source practitioners well P3 Apex related positive feedback community “The polite admonishment word eloquently enough tends gather lots heart thumbup emoji reactions person either apologize dip quiet it’s effective form response” newer P11 redirected raised issue demonstrate efficient response typos “Thanks point However instead raising issue ever see small typos please feel free put pull request fix them” However providing explanations nontrivial amount work sometimes maintainers fall back preventative strategies “I don’t always energy sometimes I’m hostile back sometimes biting comments response effective cost people seeing jerk still establishes behavior acceptable” P3 One side effect politely admonishing community members potential loss contributor often risk outweighed knockon effects unaddressed misbehaviors Establishing behavior acceptable performative it’s showing everyone else arena behavior okay even means person going improve 
it’s always preferred rehabilitate someone convince reevaluate I’d rather lose person forever community rest community see toxic behavior go unchallenged addition polite admonishment P2 new differential privacy library showed reformative actions also offer explanations newly established norms practices “New pipelines introduced meetings introducers explain they’re better accepted contributors explanation” standards took period transition communities adopt demonstrating case normative conflict identified 38 “When community starts moderating it’s overwhelming goal get everyone open tolerant respectful possible goal efficiently achieved immediately jumping list things potentially problem medium get time way norms established everyone accept them” P3 However communities adopted good practice grew appreciate long run “And norms established know it’s safe admonish newcomers behave way incidence reports plummets People don’t screw know norms supposed We’ll get transition period new norms way everyone horrified used worse” P3 Among collaborators adoption frictions usually mitigated group meetings discussion “New pipelines introduced meetings introducers explain they’re better accepted contributors explanation” P2 44 Automation Moderation interactions social coding platforms textbased making wellsuited automation compared social media counterparts 58 result bots GitHub Actions easily leverage available repo artifacts facilitate various workflows protection mechanisms projects 108 Half participants mentioned using considering automated tools facilitate community moderation currently moderation tools set repos Hyundai positive experience using Sentiment Bot Silverback installed alex bot detects instances “gender favoring polarizing race related unequal phrasing text” 109 However interviewees perceived current bots inadequate conducting moderation beyond simple reactive warnings Moderators reported community members view automated moderation tools overcensoring policing forces 
threaten freedom speech especially due tendencies false triggers Furthermore subtle forms misbehaviors found professional space development covered Section 412 Misunderstandings technical disagreements resistance new practices difficult anticipate language models’ underlying moderation tools Meanwhile tools used moderation seldom adapted development context lack access crossplatform information increasing chance false alarms Finally absence customization options privacy notifications breached social boundaries users contributors maintainers exposing deletions callouts public unnecessarily demanding maintainers’ attention excessive overly public notifications despite potential bots perform automated moderation behalf maintainers many participants expressed concerns adoption frictions highlight existing tensions well maintainers’ stance utility impact moderation bots 441 Automated Moderation Breeds OverCensorship public online spaces right free individual expression inevitably trades concerns wellbeing public safety 49 interview participants perceived potential automation lead overcensorship Free speech longstanding associations source code metaphor leveraged early supporters FOSS protect right use modify distribute 27 result open source communities strongly embraced valued right free speech However volunteerbased development contexts also important foster safe space welcomes contributions everyone especially given limited diversity inclusion modern open source communities 71 P8 Apex points shortcoming “As lot organizations struggle representation” Gibson found moderators use punitive strategies within safe spaces 46 leads sense overcensorship Teammate P3 described two opposing value systems “There’s spectrum much want modify language avoid offending people folks generally resist political correctness ones side saying ‘I’m going say whatever want you’re offended that’s problem’ think sucks don’t want extreme problematic damages message turning policing” Maintainers 
delicately balanced friction individual contributors’ desire free speech community’s need create welcoming space female LGBTQ underrepresented groups 71 105 Depending contextual needs maintainers balanced freedom expression contributors broader goals promoting respectful inclusive community P14 eloquently expressed “the tradeoff moderation freedom balance don’t want restrict people also want everybody play nice” maintainers struggled tradeoff free expression enforced civility P10 founder Silverback expressed active resistance demands free speech “This slippery slope we’re going ‘lose words’ ‘it’s going 1984’ ‘we can’t express ourselves’” assert automations powerful enough suppress creativity right speech P10 challenged userscontributors complained test AlexJS telling “if lose like 15 words year red rediscuss you’re smart enough creative enough language large enough you’re going fine” automation assistance participants considered focused language use less forms misbehavior However excessive attention moderation language misuse also derailed conversations topics away development “even people philosophically aligned idea avoiding gender words still irritated normal topic channel distracted disrupted conversation frequently” P3 concerns another set nuances bots would hard time taking account “if get tool like – legitimately seen pedantic tightly wound certain words culture community isn’t ready yet that’s worse saying anything” P3 442 Perceived Technical Limitations Automating Moderation Participants perceived moderation bots deployed GitHub doubly suffered context specificity due 1 situationspecific nuances difficult current tools pick 2 technical terms used development environments one hand lack nuanced understanding situational contexts made difficult models detect new subtle variations misbehaviors hand underlying language models lacked contextual sensitivity technical terms triggering false positives require additional human labor review combination shortcomings 
caused hesitation delegating moderation responsibilities automation tools Inability Anticipate False Negatives Human collaborators easily retrieve information distributed across platforms toggling automation tools access single deployment context platform “a lot mediums discussions public – bot wouldn’t access” P13 Without multiplatform contextual clues bots failed pick interpersonal relationships intended meanings working environment “even inside discussion lot background bot would hard time figure out” P13 instance P6 Hyundai pointed “when people passive aggressive bot cannot understand it’s better interact human” Hence much like moderators social media contexts open source moderators prefer humanreviewed decisions “for interpersonal conflict inside projects since would impossible” automated assistance moderation work 94 P3 also shares stance human involved moderating decisions “If maintainers see fine make judgment call” Context Specificity Raises False Positives Generalpurpose sentiment analysis models struggle pick connotations contextdependent terms causing learningbased models falsely trigger common engineering terms carry negative denotations used everyday contexts Consequently maintainers manually review instances false positives triggers ensure accuracy exacerbating already limited bandwidth instance P3 introduced AlexJS could offend individuals aggressively flagging words slightly negative connotations “AlexJS ended similar archeology conference couldn’t say word bone flagged offensive” Silverback P10 also observed “Alex triggers ‘master’ upstream dependency Vessel” Vessel yet renamed ‘master’ branch ‘main’ One particular example contextual information missing detection models contributor’s primary language someone’s native tongue English comments accidentally trigger bot reactions unintentionally using phrases carry negative connotations innuendos “English first language may affect them” P8 Li et al highlighted moderators must use intuition ie guessing
discern behaviors needing intervention unintentional offenses caused language differences 71 Participants also worried bots treat selfdirected anger one instance founder Hyundai “was answering question said ‘Oh yes don’t worry feature rubbish feature already fixed it’ bot triggered” P6 Likewise P10 “worries use negative language code due personal experience writing code mostly selfdirected” P6 able find humor situation “Sometimes Sentiment Bot flags aggressive phrases it’s funny occurrence” instances may frustrating especially triggering words frequently used maintainers hold opinion false positives cause harm especially add noisy information “false positives become big problem especially they’re distraction” P5 potential false negatives cause disruptions depend context “Sometimes false positives acceptable better missing something there’s also sometimes false anything acceptable it’s better say nothing false result” P3 many false triggers numb maintainers’ responses warnings behavior consistent findings Wessel et al 108 perceiving noise instead thereafter ignoring altogether “We something called stale bot periodically put comments tickets send emails bad whatever reason we’ve learned ignore sometimes” P10 443 Customizations Boundaries Participants reported strong needs tweak customize tooling based specific needs existing automation tools lack personalization options harmed adoption rates instance absence options notification settings caused information overload fatigue maintainers especially given possibility abovementioned false positive triggers Sections 421 424 discussed distribution moderation work volunteer contributors P3 brought inhouse GitHub feature supports volunteer reporting framework – “GitHub reporting facility don’t think many people know turn allows arbitrary users say problem pay attention” Unfortunately feature lacks notification settings “In Apex actually turned turn whole org none it” Privacy another setting maintainers wanted configure 
transparency found improve collaboration open source 30 maintainers ready “to transparent direct” P3 mind set completely transparent configurations “going consequences folks didn’t anticipate private system allows bias” However also conceded later privacy “also allows refined response hence importance agency configure private notifications “Having surface notice also also tweak sensitive instead default setting” exist yet bots intended lessen burden maintainers instead crossed social boundary maintainers entirely comfortable example P3 worried automatic closing issues deprioritized time community contributors “I don’t use Probot primarily usage I’ve seen programmatically closing issues like stale issues think that’s insanely user hostile prioritizing maintainer time feelings users think that’s good trade off” similar vein P1 also claimed “would consider anything that’d directly communicate contributor value every single one them” allowing direct communication automated moderation tools contributors could risk offending losing valuable community members 444 Anticipated Role Bots Moderation Perhaps due shortcomings outlined maintainers repos various sizes indicated projects solely depend automation moderation moderator popular Grunge P5 thought presence would extraneous “the best thing could alert situations may arise people already that” creator nascent Silverback P11 also thought community “should never need catching slipups” “if we’re relying bot solve moderation problems we’ve gone far course” Fortunately future bot adoption entirely dismal P8 expressed appreciation depersonalized nature automated interventions suggesting may place initiating interpersonal interactions mediations “it’s nice tool depersonalizes intervention” like bots platform user abuse possibility “as soon repo system bunch people going go brigade drop every offensive word think see much respond” P3 Self moderation Volunteer Moderators Moderation Teams Punitive Current bot interventions suitable 
reactive selfmoderation customizations contextual sensitivity higher accuracy increase usage Current volunteer moderators often experience false positives triggers moderation bots customized notifications increased accuracy contextual sensitivity encourage adoption Bot interventions help improve efficiency content moderation large projects moderation teams opportunities bots help team members make collaborative decisions onboard new members Mediations NA Conflicts involving mediation usually escalated beyond selfmoderation Bots ask clarifications behalf contributor acting mediator place moderators Bots help depersonalize mediations exists room improvement detecting situations need mediations large projects Preventative Inhibitory opportunities detecting instances potential toxicity indirect hostility could develop serious conflicts eg passive aggressiveness inside jokes minor transgressions Proactive Bots provide suggestions improved workflows observing repeated mistakes unconformity existing standards Reformative Bots help enforce template use surface rules community guidelines codes conduct writers composing potentially harmful comment Table 3 Design Recommendations automation may support moderation structures strategies Additionally participants considered scenarios moderation bots leveraged execute Moderation Strategies – Table 3 overviews ways bots support moderation future situations needing immediate response warnings administered Punitive Strategies P12 recalled instance demanding user bot could intervened user commented “THIS BUG OPEN YEAR FIX AT” caps – definitely see using moderation bot something like that” P4 also contemplated situation Sentiment Bot could taken frontline reactive work moderation “if able take lot first conversations think useful” P8 imagined tool could help selfmoderation alerting wellintentioned commenters accidentally make mistake “This gonna good people good faith commenters it’s gonna effective trolls” communities malpractices 
pervasive P14 imagined moderation bots could help facilitate reform “If see type behavior becoming bad practice team overall community would definitely consider adopting something like that”
::::
5 DISCUSSION examination moderation norms practices among communities various sizes found diverse set structures practices maintainers leverage manage prevent conflicts selfmoderation volunteerbased moderation pervaded wellstudied neighboring communities Wikipedia 40 Stack Overflow 25 found moderation required different set strategies case larger projects formal structures moderation teams also discovered still many gaps forms moderation assistance bots offer terms serve type moderation strategies automate Inspired speculations participants present comparison moderation structures strategies opportunities automation open source versus platforms well design recommendations help guide future automation tools moderation 51 Moderation open source versus platforms 511 Moderated Content terms content prior works content moderation social media largely documented presence explicit forms misbehaviors infamous triad “flaming spamming virtual rape” among forms inappropriate content hate speech insults harassment 24 56 58 94 discussions practitioners GitHub gathered moderators also watched borderline actions technical disagreements resistance new norms may immediately apparent evidence subtle forms disputes mean moderators likely leverage Mediations approach conflicts contributors Another implication automated tools powered language models unlikely detect less obvious misbehaviors subtle also tend situational technical – therefore highly contextdependent 512 Structures Roles prior works discuss potentials selfmoderation communitybased platforms Facebook groups Wikipedia Reddit 17 90 considered moderation communitylevel effort platform peers helped one another moderate similar volunteer moderation attribute community members study Among participants selfmoderation considered individuallyinitiated action contributors selfmonitor edit content behaviors likely benefit automated assistance presented Table 3 terminology consistent one study YouTube 72 another investigation 
subreddits called phenomenon selfcensorship 46 Many communities practice communitylevel selfmoderation rely volunteers conduct moderation opposed centralized models corporate moderation 58 94 However past work suggest reliance online volunteers conduct moderation labor may exploitative meriting reexaminations ethical perspective 70 results revealed moderators OSS shared governing powers higherup authorities TSC well community members broadly Prior work suggest mechanisms distributing power across multiple hierarchical levels beneficial expected larger projects arguing 1 power limitations moderators increase perceived legitimacy decisions 2 2 growth communities increase decentralization moderation platforms Wikipedia 40 establishment formal structures moderation teams introduced section 422 found improve communication norms newcomers 40 perhaps increasing usage actions Reformative Strategies 513 Moderation Strategies Punitive actions hiding deletion content well banning calling rulebreaking behaviors resemble much organizing actions found Reddit Discord Twitter 24 56 58 found evidence strategies transferable OSS context Similarly inhibitory warnings used preventing conflict resemble normsetting practices adopted moderators Wikipedia Facebook Twitch well Reddit 39 94 transition inhibitory warnings punitive actions reflects Ostrom’s fifth design principle graduated sanctions 83 though participants explicitly discuss escalations encourage future work closely examine prevalence OSS contexts Ferreira et al 36 advocated proactive reactive punitive approaches addressing known issues conducting damage control findings provide evidence moderators employing strategies practice Finally mediation strategy almost never observed among extant literature except Wikipedia perhaps similarities open source collaborative peerproduction platform 12 514 Usage Perception Automation Prior works Wikipedia found semi fullautomated tools valuable providing moderators information 
infrastructure connected editors decentralized network facilitated valuation negotiation administration thereby enabling new moderating actions independent existing norms 44 open source Ferreira et al anticipated deployment similarly automated assistance moderation 36 yet many others critiqued existing toxicity detectors yet tailored enough engineering context 10 36 60 findings corroborated Beyond challenges induced limited domain adaptation additionally uncovered presence subtle misbehaviors may contribute inability models anticipate nuanced situations Lastly also highlighted absence customization options caused maintainers resist adoption implementation made difficult lack transparency underlying black box models 42 62 52 Design Implications Automating Moderation Punitive strategy first set strategies uncovered employed early stages conflict included punitive measures halted escalation removed toxic content Presently found moderation bots assistance human contributors reactive capacity ie pointing cases rule violations harmful content authors become aware unintentionally compose inappropriate content However bots tasked calling bad behaviors participants observed prone hypersensitivity causing false positive triggers false alarms scale well negatively impact moderators contributing already overloaded maintenance burdens 43 108 making existing tools helpful cases selfmoderation extend scope reactive support toward volunteer moderators formal moderation teams improvements contextual sensitivity underlying sentiment models supporting current moderation bots needed enhance understanding nuances language used engineering increase accuracy customization options also incorporated tools increase transparency explainability trust among community Mediation Strategy conflicts developed moderators engaged different approaches mitigate resolve issues depending specifics situation encountering disagreements among small parties contributors moderators took mediating actions 
reconcile differences mediations moderation bots help facilitate depersonalized interventions contributors advancements help moderators detect disputes require mediation ask clarifications fellow contributor one side uncertain presence conflict potentially negative connotations comment Preventative Strategy contributors engage indirect forms toxicity passive aggressiveness inappropriate jokes maintainers leveraged inhibitory preventions limit extent bad behaviors Bots support moderators expanding detection scope include forms indirect hostility repeated instances behavioral mistakes occur among different contributors moderators proactively established new rules standards prevent future violations contribute toward proactive preventions bots help monitor detect repeated offenses identify associated workflows cause unconformity suggest improvements based practices observed among communities Reformative Strategy mistakes repeated multiple contributors moderators took reformative approaches set standards proactively prevented future cases similar violations introducing new rules workflows help moderators initiate reformation among community automation utilized surface existing guidelines real time authors writing content prevent public posting potentially harmful content
::::
53 Relation Workflow Automations User Frustration study set find types technical interpersonal conflicts lead toxic uncivil behaviors open source emergent theme pointed entitlement user frustrations especially among prominent projects larger user bases highlighting shortage technical support users contributors problems usually surfaced participants discussed types strategies workflows used set mostly described Sections 433 441 many established reduce masses questions requests prior work touched upon timesensitive neverending chore user support one emotionally draining tasks maintainers 43 100 little known types technical complexities misunderstandings cause extensive amounts frustration Future work seek address missing link specific forms technical interpersonal issues cause emotionallycharged conflicts well mitigation strategies maintainers mentioned us
::::
54 Limitations results indicated three main themes around moderation potentials automation open source presented previous sections However despite efforts recruit diverse group participants projects interviews particular focus variety sizes claim representative open source developers number interviews conducted snowball sampling technique limit representativeness sample also focused solely projects hosted GitHub means scope theory results may generalize social coding platforms Bitbucket GitLab Furthermore would ideal highlight experiences perspectives marginalized underrepresented groups open source scarce availability participants present us opportunity – encourage future work explore gap understanding moderators OSS
::::
6 CONCLUSION paper examined moderation practices open source communities conducting 14 semistructured interviews moderators maintainers Specifically characterized norms roles practices performs moderation different techniques employed various contexts RQ 1 investigated automation tools moderation identified concerns adoption well potential ways future bots support different groups moderators various capacities Based implications results presented set design recommendations practitioners researchers guide future development automation tools moderation ACKNOWLEDGMENTS work supported National Science Foundation NSF Award 1939606 2001851 2000782 1952085 grateful Allen Yao Pranav Khadpe Jim Herbsleb Christian Kästner David Widder well anonymous reviewers crucial input feedback towards initial subsequent drafts work Finally would like thank participants offering us time share expertise insights REFERENCES 1 2011 httpscodedblockorg20110714githubisyournewresumehtml 2 2012 Regulating behavior online communities Building Successful Online Communities MIT Press 3 2021 Safe space Github action httpsgithubcomcharliegerardsafespace 4 2021 sentimentbot httpsgithubcombehaviorbotsentimentbot 5 2022 httpsgithubcomsearchqtypeusertypeUsers 6 2022 Adding code conduct httpsdocsgithubcomencommunitiessettingupyourprojectforhealthycontributionsaddingacodeofconducttoyourproject 7 2022 GitHub Acceptable Use Policies httpsdocsgithubcomensitepolicyacceptableusepoliciesgithubacceptableusepolicies 8 2022 Moderating comments conversations httpsdocsgithubcomencommunitiesmoderatingcommentsandconversations 9 nd Wrangling Web Contributions Build CONTRIBUTINGmd httpsmozillasciencegithubioworkingopenworkshopcontributing 10 Toufique Ahmed Amiangshu Bosu Anindya Iqbal Shahram Rahimi 2017 SentiCR customized sentiment analysis tool code review interactions 2017 32nd IEEEACM International Conference Automated Engineering ASE 106–111 11 K Singh Arneja 2015 Code reviews stressful
httpsmediumcomidyllicgeekscodereviewsdonothavetobestressful919e0a8377a1 Accessed 2022713 12 Matt Billings Leon Watts 2010 Understanding dispute resolution online using text reflect personal substantive issues conflict Proceedings SIGCHI Conference Human Factors Computing Systems Atlanta Georgia USA CHI ’10 Association Computing Machinery New York NY USA 1447–1456 13 Christian Bird 2011 Sociotechnical coordination collaboration open source 2011 27th IEEE International Conference Maintenance ICSM 568–573 14 Christian Bird David Pattison Raissa D’Souza Vladimir Filkov Premkumar Devanbu 2008 Latent social structure open source projects Proceedings 16th ACM SIGSOFT International Symposium Foundations engineering 24–35 15 Cássio Castaldi Araujo Blaz Karin Becker 2016 Sentiment analysis tickets support Proceedings 13th International Conference Mining Repositories Austin Texas MSR ’16 Association Computing Machinery New York NY USA 235–246 16 Amiangshu Bosu Jeffrey C Carver 2013 Impact Peer Code Review Peer Impression Formation Survey 2013 ACM IEEE International Symposium Empirical Engineering Measurement 133–142 17 LIA BOZARTH JANE IM CHRISTOPHER QUARLES CEREN BUDAK 2023 Wisdom Two Crowds Misinformation Moderation Reddit Improve Process—A Case Study COVID19 2023 18 Fabio Calefato Filippo Lanubile Federico Maiorano Nicole Novielli 2018 Sentiment Polarity Detection Development Empirical Engineering 23 3 June 2018 1352–1382 19 Stevie Chancellor Andrea Hu Munmun De Choudhury 2018 Norms Matter Contrasting Social Support Around Behavior Change Online Weight Loss Communities Proceedings 2018 CHI Conference Human Factors Computing Systems Montreal QC Canada CHI ’18 Paper 666 Association Computing Machinery New York NY USA 1–14 20 Stevie Chancellor Yannis Kalantidis Jessica Pater Munmun De Choudhury David Shamma 2017 Multimodal Classification Moderated Online ProEating Disorder Content Proceedings 2017 CHI Conference Human Factors Computing Systems Denver Colorado USA CHI ’17 
Association Computing Machinery New York NY USA 3213–3226 21 Eshwar Chandrasekharan Chaitrali Gandhi Matthew Wortley Mustelier Eric Gilbert 2019 Crossmod CrossCommunity Learningbased System Assist Reddit Moderators Proc ACM HumComput Interact 3 CSCW Nov 2019 1–30 22 Eshwar Chandrasekharan Shagun Jhaver Amy Bruckman Eric Gilbert 2022 Quarantined Examining Effects CommunityWide Moderation Intervention Reddit 26 pages 23 Eshwar Chandrasekharan Umashanthi Pavalanathan Anirudh Srinivasan Adam Glynn Jacob Eisenstein Eric Gilbert 2017 Can’t Stay Efficacy Reddit’s 2015 Ban Examined Hate Speech Proc ACM HumComput Interact 1 CSCW Dec 2017 1–22 httpsdoiorg1011453134666 24 Eshwar Chandrasekharan Umashanthi Pavalanathan Anirudh Srinivasan Adam Glynn Jacob Eisenstein Eric Gilbert 2017 Can’t Stay Efficacy Reddit’s 2015 Ban Examined Hate Speech Proceedings ACM HumanComputer Interaction 1 CSCW 2017 1–22 httpsdoiorg1011453134666 25 Jithin Cheriyan Bastin Tony Roy Savarimuthu Stephen Cranefield 2020 Norm violation online communities – study Stack Overflow comments April 2020 arXiv200405589 csSI 26 Sophie Cohen 2021 Contextualizing toxicity open source qualitative study Proceedings 29th ACM Joint Meeting European Engineering Conference Symposium Foundations Engineering Association Computing Machinery New York NY USA 1669–1671 27 Gabriella Coleman 2009 CODE SPEECH Legal tinkering expertise protest among free open source developers Cult Anthropol 24 3 Aug 2009 420–454 28 John W Creswell Cheryl N Poth 2016 Qualitative inquiry research design Choosing among five approaches Sage publications 29 Laura Dabbish Colleen Stuart Jason Tsay James Herbsleb 2012 Leveraging transparency IEEE 30 1 2012 37–43 30 Laura Dabbish Colleen Stuart Jason Tsay Jim Herbsleb 2012 Social coding GitHub transparency collaboration open repository Proceedings ACM 2012 conference Computer Supported Cooperative Work Seattle Washington USA CSCW ’12 Association Computing Machinery New York NY USA 1277–1286 31 Erik 
Dietrich 2020 Deal Insufferable Code Reviewer Retrieved September 2020 32 Carolyn Egelman Emerson MurphyHill Elizabeth Kammer Margaret Morrow Hodges Collin Green Ciera Jaspan James Lin 2020 Predicting Developers’ Negative Feelings Code Review 2020 IEEEACM 42nd International Conference Engineering ICSE 174–185 33 Nadia Eghbal 2016 Roads Bridges Unseen Labor Behind Digital Infrastructure Ford Foundation 34 Nadia Eghbal 2020 Working public making maintenance open source Stripe Press 35 Linda Erlenhov Francisco Gomes de Oliveira Neto Philipp Leitner 2020 empirical study bots development characteristics challenges practitioner’s perspective 36 Isabella Ferreira Jinghui Cheng Bram Adams 2021 “Shut fk up” Phenomenon Characterizing Incivility Open Source Code Review Discussions 35 pages 37 Isabella Ferreira Ahlaam Rafiq Jinghui Cheng 2022 Incivility Detection Open Source Code Review Issue Discussions June 2022 arXiv220613429 csSE 38 Anna Filippova Hichang Cho 2015 Mudslinging Manners Unpacking Conflict Free Open Source Proceedings 18th ACM Conference Computer Supported Cooperative Work Social Computing Vancouver BC Canada CSCW ’15 Association Computing Machinery New York NY USA 1393–1403 39 Andrea Forte Amy Bruckman 2008 Scaling consensus Increasing decentralization Wikipedia governance Proceedings 41st Annual Hawaii International Conference System Sciences HICSS 2008 IEEE 157–157 40 Andrea Forte Vanesa Larco Amy Bruckman 2009 Decentralization Wikipedia Governance Journal Management Information Systems 26 1 July 2009 49–72 41 Daviti Gachechiladze Filippo Lanubile Nicole Novielli Alexander Serebrenik 2017 Anger Direction Collaborative Development 2017 IEEEACM 39th International Conference Engineering New Ideas Emerging Technologies Results Track ICSENIER 11–14 42 R Stuart Geiger Aaron Halfaker 2016 Open algorithmic systems lessons opening black box Wikipedia AoIR Selected Papers Internet Research 2016 43 R Stuart Geiger Dorothy Howard Lilly Irani 2021 Labor Maintaining 
Scaling Free OpenSource Projects Proc ACM HumComput Interact 5 CSCW1 April 2021 1–28 44 R Stuart Geiger David Ribes 2010 work sustaining order wikipedia banning vandal Proceedings 2010 ACM conference Computer supported cooperative work Savannah Georgia USA CSCW ’10 Association Computing Machinery New York NY USA 117–126 45 R Stuart Geiger Nelle Varoquaux Charlotte MazelCabasse Chris Holdgraf 2018 Types Roles Practices Documentation Data Analytics Open Source Libraries Comput Support Coop Work 27 3 Dec 2018 767–802 46 Anna Gibson 2019 Free Speech Safe Spaces Moderation Policies Shape Online Discussion Spaces Social Media Society 5 1 Jan 2019 2056305119832588 47 Joanne E Gray Nicolas P Suzor 2020 Playing machines Using machine learning understand automated copyright enforcement scale Big Data Society 7 1 Jan 2020 2053951720919963 48 James Grimmelmann 2015 virtues moderation Yale JL Tech 17 2015 42 49 Ted Grover Gloria Mark 2019 Detecting Potential Warning Behaviors Ideological Radicalization AltRight Subreddit ICWSM 13 July 2019 193–204 50 Emitza Guzman David Azócar Yang Li 2014 Sentiment analysis commit comments GitHub empirical study Proceedings 11th Working Conference Mining Repositories Hyderabad India MSR 2014 Association Computing Machinery New York NY USA 352–355 51 Emitza Guzman Bernd Bruegge 2013 Towards emotional awareness development teams Proceedings 2013 9th Joint Meeting Foundations Engineering Saint Petersburg Russia ESECFSE 2013 Association Computing Machinery New York NY USA 671–674 52 Wenjian Huang Tun Lu Haiyi Zhu Guo Li Ning Gu 2016 Effectiveness Conflict Management Strategies Peer Review Process Online Collaboration Projects Proceedings 19th ACM Conference ComputerSupported Cooperative Work Social Computing San Francisco California USA CSCW ’16 Association Computing Machinery New York NY USA 717–728 53 Shagun Jhaver Darren Scott Appling Eric Gilbert Amy Bruckman 2019 suspect post would removed Proc ACM HumComput Interact 3 CSCW Nov 2019 1–33 54 
Shagun Jhaver Iris Birman Eric Gilbert Amy Bruckman 2019 HumanMachine Collaboration Content Regulation Case Reddit Automoderator ACM Trans ComputHum Interact 26 5 July 2019 1–35 55 Shagun Jhaver Amy Bruckman Eric Gilbert 2019 Transparency Moderation Really Matter User Behavior Content Removal Explanations Reddit Proc ACM HumComput Interact 3 CSCW Nov 2019 1–27 56 Shagun Jhaver Sucheta Ghoshal Amy Bruckman Eric Gilbert 2018 Online Harassment Content Moderation Case Blocklists ACM Trans ComputHum Interact 25 2 March 2018 1–33 57 Shagun Jhaver Sucheta Ghoshal Amy Bruckman Eric Gilbert 2018 Online harassment content moderation case blocklists ACM Transactions ComputerHuman Interaction TOCHI 25 2 2018 1–33 58 Jialun Aaron Jiang Charles Kiene Skyler Middler Jed R Brubaker Casey Fiesler 2019 Moderation Challenges Voicebased Online Communities Discord Proc ACM HumComput Interact 3 CSCW Nov 2019 1–23 59 Jialun Aaron Jiang Peipei Nie Jed R Brubaker Casey Fiesler 2022 Tradeoffcentered Framework Content Moderation June 2022 arXiv220603450 csHC 60 Robbert Jongeling Subhajit Datta Alexander Serebrenik 2015 Choosing Weapons Sentiment Analysis Tools Engineering Research 2015 IEEE International Conference Maintenance Evolution ICSME 2015 531–535 httpsdoiorg101109icsm20157332508 61 Robbert Jongeling Proshanta Sarkar Subhajit Datta Alexander Serebrenik 2017 negative results using sentiment analysis tools engineering research Empirical Engineering 22 5 Oct 2017 2543–2584 62 Prerna Juneja Deepika Rama Subramanian Tanushree Mitra 2020 Looking Glass Study Transparency Reddit’s Moderation Practices Proc ACM HumComput Interact 4 GROUP Jan 2020 1–35 63 Rajdeep Kaur Kuljit Kaur 2022 Insights Developers’ Abandonment FLOSS Projects 731–740 pages 64 Terhi Kilamo Valentina Lenarduzzi Tuukka Ahoniemi Ari Jaaksi Jurka Rahikkala Tommi Mikkonen 2020 Cathedral Embraced Bazaar Bazaar Became Cathedral Open Source Systems Springer International Publishing 141–147 65 Karim R Lakhani Eric von Hippel 2004 
open source works “free” usertouser assistance Produktentwicklung mit virtuellen Communities Springer 303–339 66 Cliff Lampe Paul Resnick 2004 Slashdot burn distributed moderation large online conversation space Proceedings SIGCHI Conference Human Factors Computing Systems Vienna Austria CHI ’04 Association Computing Machinery New York NY USA 543–550 67 Noam LapidotLefler Azy Barak 2012 Effects anonymity invisibility lack eyecontact toxic online disinhibition Comput Human Behav 28 2 March 2012 434–443 68 Nolan Lawson 2017 feels like opensource maintainer Read Tea Leaves httpsnolanlawsoncom20170305whatitfeelsliketobeanopensourcemaintainer 2017 69 Charlotte P Lee Paul Dourish Gloria Mark 2006 human infrastructure cyberinfrastructure Proceedings 2006 20th anniversary conference Computer supported cooperative work Banff Alberta Canada CSCW ’06 Association Computing Machinery New York NY USA 483–492 70 Hanlin Li Leah Ajmani Moyan Zhou Nicholas Vincent Sohyeon Hwang Tiziano Piccardi Sneha Narayan Sherae Daniel Veniamin Veselovsky 2022 Ethical Tensions Norms Directions Extraction Online Volunteer Work Companion Publication 2022 Conference Computer Supported Cooperative Work Social Computing Proc ACM HumComput Interact Vol 7 CSCW2 Article 301 Publication date October 2023 71 Renee Li Pavitthra Pandurangan Hana Frluckaj Laura Dabbish 2021 Code Conduct Conversations Open Source Projects Github Proc ACM HumComput Interact 5 CSCW1 April 2021 1–31 72 Renkai Yubo Kou 2021 advertiserfriendly video YouTuber’s Socioeconomic Interactions Algorithmic Content Moderation Proceedings ACM HumanComputer Interaction 5 CSCW2 2021 1–25 73 Pia Mancini et al 2017 Sustain one day conversation open source sustainers–the report Technical report Sustain Conference Organization 74 Gerardo Matturro 2013 Soft skills engineering study demand companies Uruguay 2013 6th international workshop cooperative human aspects engineering CHASE IEEE 133–136 75 Courtney Miller Sophie Cohen Daniel Klug Bogdan 
Vasilescu Christian Kästner 2022 “Did Miss Comment What” Understanding Toxicity Open Source Discussions 44th International Conference Engineering ICSE’22 76 Courtney Miller David Gray Widder Christian Kästner Bogdan Vasilescu 2019 People Give FLOSSing Study Contributor Disengagement Open Source Open Source Systems Springer International Publishing 116–129 77 BUCUREAN Mirela 2019 QUALITATIVE STUDY PASSIVEAGGRESSIVE BEHAVIOUR WORKPLACE Annals University Oradea Economic Science Series 28 2 2019 78 Alessandro Murgia Parastou Tourani Bram Adams Marco Ortu 2014 developers feel emotions exploratory analysis emotions artifacts Proceedings 11th Working Conference Mining Repositories Hyderabad India MSR 2014 Association Computing Machinery New York NY USA 262–271 79 Sarah Myers West 2018 Censored suspended shadowbanned User interpretations content moderation social media platforms New Media Society 20 11 Nov 2018 4366–4383 80 Nachiappan Nagappan Brendan Murphy Victor Basili 2008 influence organizational structure quality empirical case study Proceedings 30th international conference engineering 521–530 81 Ray Oldenburg 1999 great good place Cafes coffee shops bookstores bars hair salons hangouts heart community Da Capo Press 82 Siobhán O’Mahony Fabrizio Ferraro 2007 Emergence Governance Open Source Community AMJ 50 5 Oct 2007 1079–1106 83 Elinor Ostrom 2000 Collective action evolution social norms Journal economic perspectives 14 3 2000 137–158 84 Christine Porath Christine Pearson 2013 price incivility Harv Bus Rev 91 12 Jan 2013 114–21 146 85 Huilian Sophie Qiu Yucen Lily Li Susmita Padala Anita Sarma Bogdan Vasilescu 2019 Signals Potential Contributors Look Choosing Opensource Projects Proc ACM HumComput Interact 3 CSCW Nov 2019 1–29 86 Naveen Raman Minxuan Cao Yulia Tsvetkov Christian Kästner Bogdan Vasilescu 2020 Stress burnout open source toward finding understanding mitigating unhealthy interactions Proceedings ACMIEEE 42nd International Conference Engineering New 
Ideas Emerging Results Seoul South Korea ICSENIER ’20 Association Computing Machinery New York NY USA 57–60 87 Philipp Ranzhin 2020 ruin developers’ lives code reviews I’m sorry Retrieved September 2020 88 David Ribes Steven Jackson Stuart Geiger Matthew Burton Thomas Finholt 2013 Artifacts organize Delegation distributed organization Information Organization 23 1 Jan 2013 1–14 89 Jaydeb Sarker Asif Kamal Turzo Amiangshu Bosu 2020 Benchmark Study Contemporary Toxicity Detectors Engineering Interactions 90 Joseph Seering 2020 Reconsidering SelfModeration Proceedings ACM HumanComputer Interaction 4 CSCW2 2020 1–28 httpsdoiorg1011453415178 91 Joseph Seering 2020 Reconsidering SelfModeration Role Research Supporting CommunityBased Models Online Content Moderation Proc ACM HumComput Interact 4 CSCW2 Oct 2020 1–28 92 Joseph Seering Robert Kraut Laura Dabbish 2017 Shaping Pro AntiSocial Behavior Twitch Moderation ExampleSetting Proceedings 2017 ACM Conference Computer Supported Cooperative Work Social Computing Portland Oregon USA CSCW ’17 Association Computing Machinery New York NY USA 111–125 93 Joseph Seering Tony Wang Jina Yoon Geoff Kaufman 2019 Moderator engagement community development age algorithms New Media Society 21 7 July 2019 1417–1443 httpsdoiorg1011771461444818821316 94 Joseph Seering Tony Wang Jina Yoon Geoff Kaufman 2019 Moderator engagement community development age algorithms New Media Society 21 7 2019 1417–1443 95 Giuseppe Silvestri Jie Yang Alessandro Bozzon Andrea Tagarelli 2015 Linking Accounts across Social Networks Case StackOverflow Github Twitter KDWeb 41–52 96 C Estelle Smith Bowen Yu Anjali Srivastava Aaron Halfaker Loren Terveen Haiyi Zhu 2020 Keeping Community Loop Understanding Wikipedia Stakeholder Values Machine LearningBased Systems Proceedings 2020 CHI Conference Human Factors Computing Systems Association Computing Machinery New York NY USA 1–14 97 Megan Squire Rebecca Gazda 2015 FLOSS Source Profanity Insults Collecting Data 2015 
48th Hawaii International Conference System Sciences 5290–5298 98 Miriah Steiger Timir J Bharucha Sukrit Venkatagiri Martin J Riedl Matthew Lease 2021 Psychological WellBeing Content Moderators Emotional Labor Commercial Moderation Avenues Improving Support Proceedings 2021 CHI Conference Human Factors Computing Systems Yokohama Japan CHI ’21 Article 341 Association Computing Machinery New York NY USA 1–14 99 John Suler 2004 online disinhibition effect Cyberpsychol Behav 7 3 June 2004 321–326 100 Jason Swarts 2019 OpenSource Sciences Challenge User Support Journal Business Technical Communication 33 1 Jan 2019 60–90 101 Damian Tamburri Patricia Lago Hans van Vliet 2013 Organizational social structures engineering ACM Computing Surveys CSUR 46 1 2013 1–35 102 Xin Tan Minghui Zhou 2019 Communicate Submitting Patches Empirical Study Linux Kernel Proc ACM HumComput Interact 3 CSCW Nov 2019 1–26 103 Bianca Trinkenreich Igor Wiese Anita Sarma Marco Gerosa Igor Steinmacher 2022 Women’s participation open source survey literature ACM Transactions Engineering Methodology TOSEM 31 4 2022 1–37 104 Jason Tsay Laura Dabbish James Herbsleb 2014 Let’s talk evaluating contributions discussion GitHub Proceedings 22nd ACM SIGSOFT International Symposium Foundations Engineering Hong Kong China FSE 2014 Association Computing Machinery New York NY USA 144–154 105 Bogdan Vasilescu Daryl Posnett Baishakhi Ray Mark G J van den Brand Alexander Serebrenik Premkumar Devanbu Vladimir Filkov 2015 Gender Tenure Diversity GitHub Teams Proceedings 33rd Annual ACM Conference Human Factors Computing Systems Seoul Republic Korea CHI ’15 Association Computing Machinery New York NY USA 3789–3798 106 Gang Wang Bolun Wang Tianyi Wang Ana Nika Haitao Zheng Ben Zhao 2014 Whispers dark analysis anonymous social network Proceedings 2014 Conference Internet Measurement Conference Vancouver BC Canada IMC ’14 Association Computing Machinery New York NY USA 137–150 107 Mairieli Wessel Alexander Serebrenik Igor 
Wiese Igor Steinmacher Marco Gerosa 2020 Expect Code Review Bots GitHub 108 Mairieli Wessel Igor Wiese Igor Steinmacher Marco Aurelio Gerosa 2021 Don’t Disturb Challenges Interacting Bots Open Source Projects Proc ACM HumComput Interact 5 CSCW2 Oct 2021 1–21 109 Titus Wormer 2015 alex Catch insensitive inconsiderate writing httpsalexjscom Accessed 2022714 110 Yan Xia Haiyi Zhu Tun Lu Peng Zhang Ning Gu 2020 Exploring Antecedents Consequences Toxicity Online Discussions Case Study Reddit Proc ACM HumComput Interact 4 CSCW2 Oct 2020 1–23 111 Gi Woong Yun Sasha Allgayer SungYeon Park 2020 Mind Social Media Manners Pseudonymity Imaginary Audience Incivility Facebook vs YouTube Int J Commun Syst 14 0 June 2020 21 Received January 2023 revised April 2023 accepted July 2023
::::
License Update Migration Processes Open Source Projects Chris Jensen Institute Research University California Irvine Irvine CA USA 926973455 Phone 1 949 8240573 Email cjensenicsuciedu Walt Scacchi Institute Research University California Irvine Irvine CA USA 926973455 Phone 1 949 8244130 Email wscacchiicsuciedu Abstract Open source OSS increasingly subject research efforts Central focus nature distributed used modified causes consequent effects development usage distribution present little understanding happens licenses change motivates changes new licenses created updated deployed Similarly little attention paid agreements contributions made OSS projects impacts changes agreements might also ask questions regarding licenses governing individuals groups contribute OSS projects paper focuses addressing questions case studies processes Apache Foundations creation migration Version 20 Apache License NetBeans projects migration Joint Licensing Agreement Keywords Open source license evolution process Apache NetBeans Introduction process research investigated many aspects open source OSS development last several years including release processes communication collaboration community joining governance central point Lawrence Lessigs book “Code” hardware make cyberspace also regulate cyberspace argues code enables protects certain freedoms also serves control cyberspace licenses codify freedoms regulations setting forth terms conditions use modification distribution system changes made reason others suggested licenses serve contracts collaboration case nonOSS licenses contract may indicate collaboration rather strict separation users developers OSS licenses contrast range widely permissiveness granting rights original authors granting rights consumers OSS research examined OSS licenses great detail beginning understand license evolution OSS code static neither licenses distributed Research license evolution beginning However licenses change contracts collaboration change 
paper seeks provide incremental step understanding changes licensing impact development processes understanding license update migration matter Companies using OSS need know changes affect use modification distribution system License compatibility OSS long topic debate Research beginning provide tools assistance resolving license compatibility 1 OSS participants need understand changes made whether changes align values business models eg enabling new avenues license compatibility offering strategic benefit opening new channels competition sponsor host may concerned best protect system user community also business model typically want license attract large number developers 2 time allowing make profit stay business licenses GNU General Public License GPL Berkeley Distribution BSD license Apache License well known rarely consider another type license agreement critical understanding collaboration OSS projects individual contributor license agreements CLAs organizational contributor license agreements OCLAs contributors organized entities nonOSS development contract collaboration typically employment contract often stating intellectual property rights pertaining source code written employee property employer provides employer complete control rights granted licensed OSS development situation multiple developers contributing system Without copyright assignment CLAs changing license requires consent every contributor system observed situation case Linux kernel suggested without CLA license evolution become inhibited prevented number contributors differing values objectives increases understand changes licenses affect development processes must also investigate changes CLAs address issues two case studies first examines creation deployment Apache License Version 20 second looks update contributor license agreement NetBeans Background Work Legal scholars St Laurent 3 Larry Rosen 4 former general counsel secretary Open Source Initiative OSI written extensively license 
selection note quite often choice license somewhat outside control particular developer certainly case code inherited dependent code either reciprocally licensed least requires certain license sake compatibility However outside cases St Laurent Rosen advocate use existing welltested wellunderstood licenses opposed practice creating new licenses license proliferation seen source confusion among users often unnecessary given extensive set licenses already exist diverse set purposes Lerner Tirole 5 observe specific determinant factors license selection 40000 Sourceforge projects studied projects geared towards endusers tended towards restrictive license terms projects directed towards developers tended towards less restrictive licenses Highly restrictive licenses also found common consumer eg games less common consumeroriented platforms eg Microsoft Windows compared nonconsumeroriented platforms Meanwhile Rosen specifically addresses issue relicensing commenting license changes made fiat likely fracture community case relicensing exactly focus case studies drafting release GNU General Public License Version 30 done public fashion inviting many prominent members OSS community participate process fact even see sort prescriptive process specification outlining high level new license created license revision process interesting perspective license question used one one foundation rather update commonly used open source license practice process update impact revision development wide ranging widely discussed Di Penta et al 6 examined changes license headers source code files several major open source projects three primary research questions sought understand frequently licensing statements source code files change extent changes copyright years change source code files work shows changes observed source code files small though even small changes could signify migration different license authors also note little research available speaks license evolution pointing need 
greater understanding area Lindman et al 2 examine companies perceive open source licenses major factors contribute license choice companies releasing open source study reveals tight connection business model patent potential motivation community members participate development control direction company size network externalities compatibility systems licensing choice Lindman et al provide model company developers users context OSS system developed released corporate environment 2 However systems developed complete isolation Figure 1 model production consumption open source licensing Rather leverage existing libraries components systems developed third parties Moreover Goldman Gabriel point open source source code public place released OSS license 7 communities matter Figure 1 shows production consumption open source highlighting impact licenses contributor license agreements Going step Oreizy 8 describes canonical highlevel customization process systems components highlighting intraorganizational development processes resource flow system application developer addon developer system integrator end user Similarly examined concepts context ecosystems 9 context process interaction license change precipitate integrative forms process interaction case dual multilicensing enabling new opportunities use systems upstream provide added functionality support well projects downstream vis vis use library plugin development support tool development via customization extension cases source becomes resource flowing interacting projects However license change also trigger interproject process conflict new license terms render two systems incompatible point resource flow projects cut downstream consumers source code longer receive updates common example nonOSS license expiration Licensebased interproject process conflicts also manifest unmet dependencies builds inability fix defects add enhancements resulting process breakdown failing recover failure OSS licenses however guarantee 
even conflict occurs recovery possible source available forked Methodology case studies report part ongoing multiyear research discovering modeling open source processes research methodology ethnographically informed applying grounded theory analysis artifacts found OSS projects primary data sources study come mailing list archives Apache NetBeans projects primary data sources mailing list messages However also found supplementary documentation projects websites served inform study supplementary documents often though always referenced messages mailing list Cases regarding NetBeans took place April June 2003 involving 300 email messages whereas Apache cases spread several discrete time periods consisted 350 messages Case selection happened two ways NetBeans cases arose study requirements release processes stood prominent issues facing community time period studied Although observed additional incidents appropriate discussion three cases selected fit together nicely cohesive story approach also used study Apache However due lower incident frequency expanded study longer time period find incidents proved substantial testament nature interaction issues raised mailing list discussions proved shortlived either resolved quickly conversation simply ceased possible suggest normal behavior pattern projects issues proved outliers focused discussions selected study also observed tendency discussions play series shortlived discussions sessions topic would raised receiving little attention later time would raised JCA discussion NetBeans Subversion migration discussion Apache demonstrated conversational resurgence observed general discussion topics carry certain conversational momentum Topics high degree momentum tended lengthier discussion periods frequent discussion sessions fully resolved abandoned topics low degree momentum addressed quickly simply died causes factors affecting changes momentum investigated lay far afield focus study note although consensus attrition cited 
communities eg 10 11 observe effect cases studied rather primary participants discussions remained active respective projects several months following reported incidents creation Apache License version 20 directed us colleague familiar Data Apache licensing case gathered email messages sent mailing list established purpose discussing proposed changes Considering difficulties experienced building search engine support process discovery still faced challenge keeping track process data found building models point approach providing process traceability simply include links artifacts models However strategy help us build models returned search problem back projects using search engines locate process data looking lightweight support discovery current strategy providing computer support process discovery returns using projects search engine locate process information operationalized reference model OWL ontology Protégé ontology editor 12 using OWL class individual constructs store process concepts associated search queries respectively Secondly built Firefox plugin Ontology 13 display reference model ontology Firefox web browser Next enlisted Zotero citation database Firefox plugin store process evidence elicited data integrating two plugins datum added citation database artifact automatically tagged selected reference model entities use citation database research data repository may seem unintuitive Zotero however proven well suited needs Like many Firefox plugins Zotero create records simply highlighted sections web document though creation arbitrary entries gleaned document text selections also possible also save snapshot entire document later review useful given high frequency changes web documents—changes evidence steps processes tag note date fields entry useful recording reference model associations memos entry use constructing process steps ascertaining order screenshot Zotero Ontology appears Figure 2 plugin integration greatly facilitates coding process 
evidence provides traceability raw research data analyzed process models tool set browserbased limited analysis particular data set whether local remote Moreover tool set limit users single ontology Zotero database thereby allowing users construct research models using multiple ontologies describing eg nonOSS process phenomenon reuse tool set analysis additional data sets Thus may easily appropriated grounded theory research fields study elicitation process evidence still search driven Rather use one highly customized search engine examined data repositories search task shifted back organizations study decision several implications comparison previous approach positive negative Using organizations search engine limits ability extract documenttype specific metadata however among organizations studied search tools provide greater coverage document artifact types Lucene handled time Furthermore approach suffer data set limitations imposed web crawler exclusion rules ability query data set scripted fashion lost yet scientists would see gain use computerassisted qualitative data analysis CAQDAS historically put question validity research method results 1516 tool still quite unfinished began governance process discovery modeling added functionality return data sources recapture Although high hope use integrated timeline feature assist process activity composition sequencing time date support within Zoteros native date format insufficiently precise provisions year month day ability capture action sequences happen day adding support greater date time found enter date time every piece data captured rather tedious Eventually prioritize completion discovery modeling ahead computersupport process discovery disable time date entry Unable utilize Zotero intended effect discovery modeling efforts Zotero remain progress pending usability improvements Creation Migration Apache License Version 20 Apache Foundation created new version license end 2003 beginning 2004 Roy Fielding 
director ASF announced license proposal 8 November 2003 17 inviting review discussion mailing list set specifically said purpose Per Roys message motivations proposed license included Reducing number frequently asked questions Apache License Allowing license usable including nonApache projects Requiring patent license contributions necessarily infringe contributors patents Moving full text license specific conditions outside source code Roy indicated desire license compatible OSS licenses notably GPL see Figure 3 discussion took place mid November 2003 fact given ApacheCon conference ran 1619 November see high message density days leading ApacheCon steady rate continuing days afterward Beyond frequency becomes sparse update proposed license announced 24 December 2003 internal review part process publicly visible update prompted brief discussion second active time period observable January 2004 Fielding announces final update 20 January 2004 final version license approved board 18 19 21 January 2004 primary discussion point creation migration 20 version Apache License centered around patent clause proposed license According Brian Behlendorf serving ASF board directors time ASF’s patentrelated goals “prevent company sneaking code codebase covered patent seeking royalties either ASF endusers” 20 clause question read Reciprocity institute patent litigation Contributor respect patent applicable including crossclaim counterclaim lawsuit patent licenses granted Contributor License shall terminate date litigation filed addition institute patent litigation entity including crossclaim counterclaim lawsuit alleging Work excluding combinations Work hardware infringes patents patent licenses granted License Work shall terminate date litigation filed 21 Consequences clause sparked discussion areas mainly surrounding first sentence clause regarding license termination Legal representatives industry stated objections losing usage rights patent litigation regarding even unrelated 
covered license 22 proposing alternative wordings achieve stated license goals restricting trigger litigation pertaining patents covered ASF licensed code 23 Uncertainty regarding roles people license revision process 24 proposed changes 25 created additional confusion regarding patent reciprocity stance Eben Moglen General Counsel Free Foundation FSF adds first sentence license clause carries great risk unintended serious consequences inappropriate vehicle protecting free patent litigation 26 FSF deemed clause causes license incompatible version 2 GPL failing one goals proposed Apache License Brian Carlson reports Debian communitys consensus proposed license meet criteria Free Licenses Debian Free Guidelines 27 Consequently code licensed would sandboxed nonfree archive therefore automatically built Debian distributions receive quality assurance attention license termination aspect reciprocity clause cited critical sticking point 28 several members Debian community arguing free licenses restrict modification distribution usage free patent reciprocity clause entirely rejected support extending provide mutual defense patent litigation attacks open source 29 idea quickly nixed grounds could lead users attacked unable defend someone maliciously violate users patent unrelated piece create open source version scenario user would choose using Apache licensed losing patents 30 18 November Fielding indicates “several iterations patent sentences mostly deal derivative work” 24 mentioning probably include suggested changes patent language recommended one legal representatives industry Fielding notes contact representatives organizations among Apple Sun OSI Mozilla independent attorneys although details portions process remain hidden next milestone process occurs 24 December Fielding mentions second draft version 123 prepared internal review due extensive changes 31 posted proposed licenses website 32 mailing list new proposed license 33 incorporates many proposed changes 
including removal contested first sentence patent reciprocity clause leaving generally agreed upon patent termination condition institute patent litigation entity including crossclaim counterclaim lawsuit alleging Work Contribution incorporated within Work constitutes direct contributory patent infringement patent licenses granted License Work shall terminate date litigation filed 123 version license received little feedback license discussion mailing list Aside definition clarifications inquiry GPL compatibility Behlendorf commented Moglens suggestions incorporated address two issues GPL compliance contacted earlier week take look current draft 34 result Behlendorf 7 January 2004 offers issues presented addressed satisfaction willing propose license board January 2004 meeting 35 However board meeting Fielding announces version 124 featuring change definition “Contributor” 36 125 version shortly thereafter address way “Copyright” represented due various laws use “C” indicate copyright 37 Finally Apache License Version 20 approved ASF board unanimous vote 20 January 2004 18 announced mailing list Fielding following day 19 Per board meeting minutes WHEREAS foundation membership expressed strong desire update license Apache released WHEREAS proposed text new license reworked refined many many months based feedback membership parties outside ASF THEREFORE RESOLVED proposed license found httpwwwapacheorglicensesproposedLICENSE20txt officially named Apache License 20 grant sufficient transition time license used releases Foundation date March 1st 2004 conversation continued briefly address two points Firstly return GPL compatibility discussion Armstrong requested verification whether Moglen FSF identified license GPL compatible Fieldings announcement claimed 38 Fielding responds saying Moglen sent private communication commenting license compatibility furthermore belief ASF “a derivative work consisting Apache Licensed code GPL code distributed GPL” wasnt anything
consider far ASF concerned 39 Incidentally FSF standing due patent issue Apache license 20 GPL3 compatible GPL2 compatible 40 Secondly Vincent Massol requested information moving Apache subproject ASL2 license file license headers used 41 Behlendorf responds 42 flow graph License creation migration process appears Figure 4 Introduction Joint License Agreement Rosen 4 suggests copyright assignment sought two purposes defend court without participation approval contributors give contributor right make licensing decisions relicensing NetBeans case interesting simple copyright assignment rather affords contributor Sun Microsystems specifically equal independent copyright contributed source Joint License Agreement JLA introduced NetBeans 28 April 2003 Evan Adams prominent participant working Sun Microsystems 43 Adams states JLA introduced response observations Suns legal team Mozilla open source projects believed Sun required full copyright authority protect NetBeans legal threats provide Sun flexibility adapt NetBeans license time proposed agreement contributors original authors would retain copyrights independently contributions previous contributions whose authors agree terms JCA would removed source tree discussion spanned ninety messages seventeen individuals nearly two months followup discussion consisting forty six messages fourteen individuals eleven participated earlier discussion third month discussion began end April 2003 continued July sporadic messages extending September long deadline requiring JLA contributions process license format change seems simple particulars proposed license received early focus discussion discussion progressed concern shifted away details license agreement way change proposed course discussion revealed switching JLA idea proposed Sun legal counsel decision adopt done internally unilaterally irrevocably Sun without involvement large adoption decision raised questions regarding decision rights transparency within recognizing 
Sunemployed contributors responsible majority effort nonSun contributors took lack transparency consideration decision making process disenfranchisement followup discussion members expressed fears giving Sun full copyright contributed code could lead reclassification volunteercontributed code objectionable ways significantly feared change could impact copyright projects built upon NetBeans codebase contributed back NetBeans source repository time “corner case” concerns license agreement addressed However ultimately nonSun employed contributors still position trust Sun act acceptable manner grant full copyright Moreover discussion drew larger concerns regarding Suns role position leadership control regarding transparency decision making flow graph JCA introduction process appears Figure 5 Discussion Conclusions two cases presented directly comparable Apache study looks process creating new license used projects domain Apache Foundation NetBeans study focuses adoption new license agreement contributors NetBeans IDE platform source licenses govern rights responsibilities consumers among things use modify distribute Contributor license agreements CLAs hand govern rights responsibilities among things use modify distribute contributions organization contributions submitted retained contributor new CLA stated copyright contributions would jointly owned originating contributors well projects benefactor Sun Microsystems Code contribution agreements may interest end users executables However OSS movement known tendency towards usercontributors users contribute development developers use consider specifically license changes Apache NetBeans projects introduced inevitable changes persons authority founder Roy Fielding Apache Evan Adams Sun Microsystems NetBeans initiators discussion presented rationale making changes Apache move motivated desire increase compatibility licenses reduce number questions Apache license moving text outside source code require patent license 
contributions necessary NetBeans motivations protect legal threats provide Sun ability change license future Apache case motivations making changes went unquestioned discussion focused objectives achieve change best achieve former minority subset participants saw license change opportunity affect development culture altering direction ecosystem means governance macro level latter making sure verbiage license achieved intended objectives license without unintended consequences whose nature former NetBeans case discussion focused differences licenses affect nonsponsoringorganization participants mesolevel governance license Given context surrounded cases structural procedural governance also questioned area NetBeans license change received greatest pushback granting sponsoring organization right change license unilaterally point future right similarly granted ASF Apache contributor license agreement CLA 44 point lost participants NetBeans license change discussions 45 issue receive pushback NetBeans Apache West OMahony 46 suggest caution unlike communityinitiated projects sponsored OSS projects must achieve balance establishing preemptive governance design saw establishing boundaries commercial community ownership control surrounding cases served create atmosphere distrust within distrust led fears contributions sponsoring organization would become closed community perhaps saved organizations commercial version product leaving sponsoring organization freeriders 47 48 profiting efforts others without giving back 49 otherwise limit participants code Perhaps striking difference way two license changes introduced Apache case invited participants well ecosystem public large part change whereas NetBeans case Participants NetBeans left without sense transparency decisionmaking process change put without warning decision made Moreover left without representation decisionmaking process participate determining outcome decision large impact say Apache case entirely transparent 
clear indications messages list conversations held offlist Likewise misconceptions roles participants played participant affiliation However process questioned result conclusion taken first step understanding license change processes impact development processes discovering modeling update process Apache License update contributor license agreement NetBeans observed differences processes introducing change intent influenced response changes put cases context NetBeans underwent two license changes since events described neither received significant pushback community first shifted license CDDL second move dual license NetBeans GPLv2 second licensing shift considered Sun “at request community” 50 Unlike introduction JCA GPL shift presented community Sun feedback August 2007 added option rather complete relicensing change made Thus clearly see change processes used govern community way directly addressed defects projects governance processes circa 2003 Shah 51 echoes concerns observing code ownership firms creates possibility nonfirmemployed contributors denied future access code projects threats lead forking source happened MySQL corporation purchased Sun Microsystems turn recently acquired Oracle Acknowledgements research described report supported grants Center Edge Power Naval Postgraduate School National Science Foundation 0534771 0808783 endorsement implied References 1 Scacchi W Alspaugh Asuncion H Role Licenses Open Architecture Ecosystems Intern Workshop Ecosystems Intern Conf Reuse Falls Church VA September 2009 2 Lindman J Paajanen Rossi Choosing Open Source License Commercial Context Managerial Perspective Engineering Advanced Applications Euromicro Conference pp 237244 2010 36th EUROMICRO Conference Engineering Advanced Applications 2010 3 St Laurent 2004 Understanding Open Source Free Licensing OReilly Media Inc Sebastopol CA 4 Rosen L 2005 Open Source Licensing Freedom Intellectual Property Law Prentice Hall 5 Lerner J Tirole J 2005 Scope Open Source 
Licensing Journal Law Economics Organization 211 2056 6 Di Penta German Guéhéneuc Antoniol G 2010 exploratory study evolution licensing Proceedings 32nd ACMIEEE International Conference Engineering Volume 1 ICSE 10 Vol 1 ACM New York NY USA 145154 7 Goldman R Gabriel R 2004 Innovation Happens Elsewhere Company Participate Open Source Morgan Kaufmann Publishers Inc San Francisco CA USA 8 Oreizy P Open Architecture Flexible Approach Decentralized Evolution PhD Information Computer Sciences Irvine CA University California Irvine 2000 9 Jensen C Scacchi W 2005 Process Modeling Across Web Information Infrastructure Process Improvement Practice 103255272 10 Hedhman N Mailing list message dated 16 Dec 2004 071855 0000 “Re ANN Avalon Closed” available online httpwwwmailarchivecomcommunityapacheorgmsg03889html last accessed 15 September 2009 11 Dailey Mailing list message dated Wed 02 May 2007 103826 0400 “Re Support Existing Content consensus attrition” available online httplistsw3orgArchivesPublicpublichtml2007May0214html last accessed 15 September 2009 12 Protégé Ontology Editor available online httpprotegestanfordedu last accessed 23 June 2008 13 Firefox Ontology Plugin available online httprotterdamicsuciedudevelopmentpadmebrowserontology last accessed 23 June 2008 14 Zotero available online httpwwwzoteroorg last accessed 23 June 2008 15 Bringer J Johnston L H Brackenridge C H Using ComputerAssisted Qualitative Data Analysis Develop Grounded Theory Field Methods 2006 183 245266 16 Kelle U Theory Building Qualitative Research Computer Programs Management Textual Data Sociological Research Online 1997 22 available online httpwwwsocresonlineorguksocresonline221html last accessed 23 June 2008 17 Fielding R Message dated Sat 08 Nov 2003 023909 GMT “Review proposed Apache License version 20” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3cBAAB287A119411D8842D000393753936apacheorg3e last accessed 14 August 2009 18 Board meeting minutes Apache 
Foundation January 2004 available online httpapacheorgfoundationrecordsminutes2004boardminutes20040121txt last accessed 13 August 2009 19 Fielding R Mailing list message dated Sat 24 Jan 2004 013436 GMT “Apache License Version 20 ” available online httpmailarchivesapacheorgmodmboxarchivelicense200401mbox3C781EEF084E0D11D8915D000393753936apacheorg3E last accessed 13 August 2009 20 Behlendorf B Mailing list message dated Sat 22 Nov 2003 073140 GMT “RE termination unrelated trigger considered harmful” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3C20031121232552X38821fezhyperrealorg3E last accessed 13 August 2009 21 Carlson B Mailing list message dated Sat 8 Nov 2003 100355 0000 “Re fieldingapacheorg Review proposed Apache License version 20” available online httplistsdebianorgdebianlegal200311msg00053html last accessed 12 August 2009 22 Peterson SK Mailing list message dated Fri 14 Nov 2003 145254 GMT “termination unrelated trigger considered harmful” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3C6D6463F31027B14FB3B1FB094F2C744704A11176tayexc17americascpqcorpnet3E last accessed 13 August 2009 23 Machovec J Mailing list message dated Fri 14 Nov 2003 164909 GMT “Re termination unrelated trigger considered harmful” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox 24 Fielding R Mailing list message dated Tue 18 Nov 2003 021027 GMT “Re fieldingapacheorg Review proposed Apache License version 20” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3c60AEF3C1196C11D8A8F4000393753936apacheorg3e last accessed 13 August 2009 25 Engelfriet Mailing list message dated Mon 17 Nov 2003 205953 GMT “Re fieldingapacheorg Review proposed Apache License version 20” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3c20031117205953GA95846stacknl3e last accessed 13 August 2009 26 Moglen E Mailing list message dated Fri 14 Nov 2003 212832 GMT “FSF Comments ASL 20 
draft” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3c1630918688540989283163newlawcolumbiaedu3e last accessed 13 August 2009 27 Carlson B Mailing list message dated Thu 13 Nov 2003 053949 GMT “DFSGfreeness Apache Licenses” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3c20031113053949GD23250stonewall3e last accessed 13 August 2009 28 Armstrong Mailing list message dated Fri 14 Nov 2003 043950 GMT “Re DFSGfreeness Apache Licenses” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3c20031114043950GM2707donarmstrongcom3e last accessed 13 August 2009 29 Johnson P Mailing list message dated Wed 12 Nov 2003 020914 GMT “Mutual defence patent clause” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3c003d01c3a8c1f9b55170c6ba400cprotocolcom3e last accessed 12 August 2009 30 Behlendorf B Mailing list message dated Wed 12 Nov 2003 210932 GMT “Re Mutual defence patent clause” available online httpmailarchivesapacheorgmodmboxarchivelicense200311mbox3c20031112130508H497fezhyperrealorg3e last accessed 13 August 2009 31 Fielding R Mailing list message dated 12242003 0416 “Re Review proposed Apache License version 20” available online httpmailarchivesapacheorgmodmboxarchivelicense200312mbox3c464B4006360411D89A9F000393753936apacheorg3e last accessed 12 August 2009 32 Apache License Proposal Website available online httpwwwapacheorglicensesproposed last accessed 13 August 2009 33 Apache License Version 123 available online httpmailarchivesapacheorgmodmboxarchivelicense200312mbox last accessed 13 August 2009 34 Behlendorf B Mailing list message dated Fri 09 Jan 2004 224252 GMT “Re Review proposed Apache License version 20” available online httpmailarchivesapacheorgmodmboxarchivelicense200401mbox3c20040109143803G31301fezhyperrealorg3e last accessed 13 August 2009 35 Behlendorf B Mailing list message dated Wed 07 Jan 2004 221636 GMT “Re Review proposed Apache License version 20” 
available online httpmailarchivesapacheorgmodmboxarchivelicense200401mbox3c20040107140658A23429fezhyperrealorg3e last accessed 13 August 2009 36 Fielding R Mailing list message dated Wed 14 Jan 2004 202550 GMT “Re Review proposed Apache License version 20” available online httpmailarchivesapacheorgmodmboxarchivelicense200401mbox3cD81EA13646CF11D8B08A000393753936apacheorg3e last accessed 13 August 2009 37 Fielding R Mailing list message dated Wed 14 Jan 2004 205426 GMT “Re Review proposed Apache License version 20” available online httpmailarchivesapacheorgmodmboxarchivelicense200401mbox3cD6DB945446D311D8B08A000393753936apacheorg3e last accessed 13 August 2009 38 Armstrong Mailing list message dated Sat 24 Jan 2004 021350 GMT “Re Apache License Version 20” available online httpmailarchivesapacheorgmodmboxarchivelicense200401mbox3C20040124021350GG3060archimedesucredu3E last accessed 13 August 2009 39 Fielding R Mailing list message dated Sat 24 Jan 2004 022929 GMT “Re Apache License Version 20” available online httpmailarchivesapacheorgmodmboxarchivelicense200401mbox3C233851014E1511D8915D000393753936apacheorg3E last accessed 13 August 2009 40 Free Foundation Licenses webpage available online httpwwwfsforglicensinglicensesindexhtmlGPLCompatibleLicenses last accessed 14 August 2009 41 Massol V Mailing list message dated Sun 25 Jan 2004 160119 GMT “How use 20 license” available online httpmailarchivesapacheorgmodmboxarchivelicense200401mbox3C012f01c3e35c78e229d02502a8c0vma3E last accessed 13 August 2009 42 Behlendorf B Mailing list message dated Sun 25 Jan 2004 201706 GMT “Re use 20 license” available online httpmailarchivesapacheorgmodmboxarchivelicense200401mbox3C20040125121456H396fezhyperrealorg3E last accessed 13 August 2009 43 Adams E NBDiscuss mailing list message “Joint Copyright Assignment” available online httpwwwnetbeansorgservletsReadMsglistnbdiscussmsgNo2228 last accessed 6 August 2009 44 Apache Foundation Individual Contributor License Agreement Version 20 
available online httpwwwapacheorglicensesiclatxt last accessed 20 October 2009 45 Brabant V mailing list message dated Tue 15 Jul 2003 185236 0200 “nbdiscuss licenses trees” available online httpwwwnetbeansorgservletsReadMsglistNamenbdiscussmsgNo2547 last accessed 20 October 2009 46 West J OMahony 2005 Contrasting Community Building Sponsored Community Founded Open Source Projects Proceedings Proceedings 38th Annual Hawaii international Conference System Sciences Volume 07 January 03 06 2005 HICSS IEEE Computer Society Washington DC 1963 47 Lerner J J Tirole 2000 simple economics open source NBER Working paper series WP 7600 Harvard University Cambridge 48 von Hippel E von Krogh G 2003 Open source privatecollective innovation model Issues organizational science Organization Science 142209223 49 Hedhman N mailing list message dated Sun 29 Jun 2003 133148 0800 “nbdiscuss licenses trees AntiSun Animosity” available online httpwwwnetbeansorgservletsReadMsglistNamenbdiscussmsgNo2578 last accessed 21 October 2009 50 NBDiscuss mailing list message Available online httpwwwnetbeansorgservletsReadMsglistnbdiscussmsgNo3784 last accessed 28 February 2009 51 Shah SK 2006 Motivation governance viability hybrid forms open source development Management Science 527 10001014
::::
Reuse Open Source Case Study Andrea Capiluppi Brunel University UK KlaasJan Stol Lero Irish Engineering Research Centre University Limerick Ireland Cornelia Boldyreff University East London UK ABSTRACT promising way support reuse based ComponentBased Development CBSD Open Source OSS products increasingly available freely used product development However OSS communities still face several challenges taking full advantage “reuse mechanism” many OSS projects duplicate effort instance many projects implement similar system application domain topic One successful counterexample FFmpeg multimedia several components widely consistently reused OSS projects Documented evolutionary history various libraries components within FFmpeg presently reused 140 OSS projects use blackbox components although number OSS projects keep localized copy repositories eventually modifying needed whitebox reuse cases authors argue FFmpeg successful provides excellent exemplar reusable library OSS components Keywords Case Study ComponentBased Development Empirical Study Open Source Quantitative Study Evolution Reuse INTRODUCTION Reuse components one promising practices engineering Basili Rombach 1991 Enhanced productivity less code needs written increased quality since assets proven one carried next improved business performance lower costs shorter timetomarket often pinpointed main benefits developing stock reusable components Sametinger 1997 Sommerville 2004 Although much research focused reuse OffTheShelf OTS components Commercial OTS COTS Open Source OSS corporate production Li et al 2009 Torchiano Morisio 2004 reusability OSS projects OSS projects recently started draw attention researchers developers OSS communities Lang et al 2005 Mockus 2007 Capiluppi Boldyreff 2008 vast amount code created daily modified stored OSS repositories inherent philosophy around OSS indeed promoting reuse Yet reuse OSS projects hindered various factors psychological technical instance reused could written 
programming language hosting dislikes incompatible hosting might agree design decisions made reused finally individuals hosting may dislike individuals involved reused Senyard Michlmayr 2004 search “email client” topic SourceForge repository httpwwwsourcforgenet produces 128 different projects SourceForge 2011 may suggest similar features domain implemented different projects code features duplication play significant role production OSS code interest practitioners researchers topic reuse focused two predominant questions 1 perspective OSS integrators Hauge et al 2007 select OSS component reused another potentially commercial system 2 perspective endusers provide level objective “trust” available OSS components interest based sound reasoning given increasing amount source code documentation created modified daily starts commercially viable solution browse components existing code select existing working resources reuse building blocks new systems rather building scratch Among reported cases successful reuse within OSS systems components clearly defined requirements hardly affecting overall design ie “S” “P” types systems following original SPE classification Lehman 1980 often proven typically reused resources OSS projects Reported examples include “internationalization” often referred I18N component produces different output text depending language system “install” module Perl subsystems involved compiling code test install appropriate locations Mockus 2007 best knowledge academic literature successful reuse OSS understanding internal characteristics makes component reusable OSS context lacking main focus paper report FFmpeg httpffmpegorg buildlevel components show components currently reused projects cornerstone multimedia domain several dozens OSS projects reuse parts FFmpeg one widely reused libavcodec component domain OSS multimedia applications libavcodec widely adopted reused audiovideo codec coding decoding resource reuse OSS projects widespread since 
represents crosscutting resource wide range systems singleuser video audio players converters multimedia frameworks FFmpeg represents unique case Yin 2003 p40 selected study particular study attempt evaluate whether reusability principle “high cohesion loose coupling” Fenton 1991 Macro Buxton 1987 Troy Zweben 1981 impact evolutionary history FFmpeg components paper makes two contributions studies size FFmpeg components evolve empirical findings show libavcodec component contained FFmpeg “evolving reusable” component “Etype” system Lehman 1980 poses several interesting challenges projects integrate studies architecture FFmpeg components evolve components evolve separated FFmpeg empirical findings show two emerging scenarios reuse resource one hand majority projects reuse FFmpeg components “blackbox” strategy Szyperski 2002 incurring synchronization issues due independent coevolution component hand number OSS projects apply “whitebox” reuse strategy maintaining private copy FFmpeg components latter scenario empirically analyzed order obtain better understanding component reused also integrated host system remainder paper structured following guidelines reporting case study research proposed Runeson Höst 2009 next section provides relevant background information overview related work components OSS systems followed presentation research design study results empirical study presented Followed threats validity study last section concludes key findings provides directions future work BACKGROUND RELATED WORK section presents background related work relevant remainder paper first subsection briefly discusses research OSS reuse followed discussion ComponentBased Development CBSD terminology used paper followed brief overview useful relevant categorization components Since work considers evolution components brief summary Lehman’s classification programs provided section concludes brief discussion related work regarding decay architectural recovery ComponentBased Development 
Terminology mentioned ComponentBased Development CBSD proposed promising approach largescale reuse important however first define clearly meant term “component” word “component” often used context CBSD reusable piece either Commercial OffTheShelf COTS Open Source instance Torchiano Morisio 2004 derived following definition “A COTS product commercially available open source piece projects reuse integrate products” definition considers COTS Open Source product independent unit reused However number authors provided specific definitions commonly cited definition found Szyperski 2002 p 41 “A component unit composition contractually specified interfaces explicit context dependencies component deployed independently subject composition third parties” De Jonge 2005 points “ComponentBased Engineering CBSE mostly concerned executionlevel components COM CCB EJB components” Szyperski 2002 p 3 also speaks components “executable units independent production acquisition deployment composed functioning system” paper following De Jonge 2005 use term “buildlevel component” De Jonge speaks buildlevel components “directory hierarchies containing ingredients application’s build process source files build configuration files libraries on” earlier paper De Jonge 2002 uses term “source code component” context interpret meaning “build level” component equivalent term “module” used Clements et al 2010 p 29 indicate module refers unit implementation source code implementation artifacts Eick et al 2001 also interpret module directory source code file system contains several files though note terminology standard Tran et al 1999 2000 considered individual source files modules Clements et al define “component” runtime entity consistent definition Szyperski Although important issues already known incorporating reusing whole systems larger overarching projects case Linux distributions German Hassan 2009 remainder paper use term “component” refer buildlevel component Components reused different 
ways briefly mentioned blackbox reuse whitebox reuse Szyperski 2002 Blackbox reuse refers reuse component asis without alterations component viewed terms input output typically case proprietary COTS components used source code usually available proprietary hand component’s source code available integrator perform whitebox reuse integrator may make changes component fit intended purpose Obviously availability source code makes OSS components particularly suitable whitebox reuse two scenarios summarized Figure 1 example MPlayer keeps copy library repository eventually modifies “forks” purposes whitebox reuse scenario VLC compilation time requires user provide location uptodate version FFmpeg blackbox reuse Research Open Source Reuse growing body empirical research use OSS components CBSD Ayala et al 2007 Hauge et al 2009 Capiluppi Knowles 2009 Li et al 2009 Ven Mannaert 2008 increasing number OSS products available many become viable alternatives commercial products Fitzgerald 2006 adopting OSS components build products common scenario Hauge et al 2010 Research OSS reuse classified along two dimensions first dimension considers question reuses either Independent Vendor ISV OSS communities second dimension considers reused particular granularity components Haefliger et al 2008 identified different granularities code reuse algorithms methods single lines code components Components may coarse granularity ie complete systems common example called “LAMP stack” Wikipedia nd “ensemble” Linux Apache MySQL scripting language Python Perl PHP Ruby Much literature OSS reuse focuses coarsegrained components ISVs though noteworthy granularity cannot measured discrete scale rather continuous one German et al 2007 discuss dependencies packages define installable unit found Linux distributions define model represent analyze dependencies work led German investigated issue licenses reusing different OSS components German Hassan 2009 German GonzálezBarahona 2009 hand reuse done 
components finer granularity studies focus reuse OSS projects study presented paper also considers components relatively small granularity discuss related work detail Table 1 provides overview study objectives well research methods samples One first studies quantifies reuse Open Source Mockus 2007 study focuses reuse identifying directories source code files share number defined threshold file names therefore study considers whitebox reuse Mockus studied reuse large sample 38700 unique projects 53 million unique file name paths Mockus found approximately half files used one indicates significant reuse among OSS projects Haefliger et al 2008 conducted study 15 OSS projects six studied indepth goal study investigation influence several factors identified literature support code reuse OSS development Factors included standards tools quality ratings certificates incentives found commercial development firms study shows studied projects reuse blackbox reuse predominant form Sojer Henkel 2010 conducted survey investigate quantitatively relationship developer characteristics one hand degree reuse OSS projects hand survey among 686 OSS developers identified number factors developers’ experience OSS projects affect reuse OSS projects Unlike studies one Mockus Haefliger et al mentioned study investigate actual reuse within OSS projects rather developers’ behavior opinions topic Heinemann et al 2011 studied reuse sample 20 OSS projects written Java programming language using clone detection
::::
Table 1 Overview previous studies reuse OSS
Authors | Study objective | Method sample
Mockus et al 2007 | identify quantify largescale reuse OSS | Survey 38700 projects 132 MLOC
Haefliger et al 2008 | code reuse supported OSS | Multiple case study 15 projects indepth analysis 6 projects 6MLOC
Sojer Henkel 2010 | important code reuse OSS projects perceived benefits issues impediments code reuse code reuse affected characteristics developers | Webbased survey 686 responses
Heinemann et al 2011 | OSS projects reuse much blackboxwhitebox | Empirical study 20 OSS Java projects 33 MLOC
techniques complemented manual inspection study investigated whether OSS projects reuse extent reuse happens whitebox blackbox found reuse common OSS Java projects studied particular blackbox reuse previously found Haefliger et al 2008 must noted measurements also counted reuse Java standard libraries
Component Characterization Components defined characterized different categories depending relationships components Lungu et al 2006 distinguish four types Java packages
Silent package dependency relations package packages
Consumer package dependency relation package packages package depends consumes functionality packages
Provider package dependency packages package package provides functionality packages
Hybrid package package consumer provider time consumes provides functionality packages respectively
Though Lungu et al refer Java packages argue main mechanism decomposition modularization system written Java argue four types listed used characterize components directories containing source code files defined previous subsection provider component provides services components therefore become dependent upon provider Likewise consumer relies functionality provided components therefore dependent upon Incidentally Java packages fact represented directories source code file system
Evolution Program Classification continuous pressure systems evolve order prevent becoming obsolete Lehman 1978 Lehman 1980 stated 
number “laws evolution” presents classification programs three classes P E relates programs evolve three program types briefly summarized SPrograms Lehman 1980 described SPrograms “programs whose function formally defined derivable specification” programs solve specific problem completely defined specification problem “directs controls programmer creation program defines desired solution” Lehman 1980 Changes may course made program instance improve resource usage improve maintainability However changes must change mapping input output changes made due changed specification different program solves new problem Typical examples Stype programs library routines implement mathematical operations instance sine cosine functions PPrograms PPrograms programs implement solution problem welldefined whose implementation must limited approximation achieve practicality problem statement PPrograms “is model abstraction realworld situation containing uncertainties unknown arbitrary criteria continuous variables” Lehman 1980 Whereas correctness SProgram depends specification value validity PPrograms dependent solution acquired realworld environment environment world program used changing PPrograms must also change Examples suggested Lehman program implementing game chess well weather prediction EPrograms defining characteristic third class programs EPrograms installation program changes nature problem solving Lehman 1980 stated “Once program completed begins used questions correctness appropriateness satisfaction arise … inevitably lead additional pressure change” words environment world program originally conceived changing due introduction program stated abstract terms introduction solution program problem changes nature problem leads need continuous change Etype programs Lehman mentions examples types programs operating systems airtraffic control Lehman 1980 Architecture Decay Architectural Recovery empirical analysis FFmpeg components reported revealed several changes 
components connections core system changes revealed least one case decay components internally structured externally connected components Therefore work also related study architectures relates components mutual relationships Bass et al 2003 widely accepted system’s architecture different views IEEE 2000 well known 41 view model architecture Kruchten 1995 defines logical development process physical views plus usecase view outlined study considers components directories containing source code files would presented development view One related aspect also considered present study structural characteristics decay time components become less cohesive connections infringe original design constraints One important aspect architectures components modularity Parnas 1972 division system modules components helps separation functionality responsibilities various modules Reusability quality attribute directly related component’s system’s examining intercomponent couplings Bass et al 2003 may provide valuable insights help assess reusability component system analysis coupling cohesion objectoriented systems also shown good degree modularity achieved observing “loose coupling high cohesion” principle components Fenton 1991 Macro Buxton 1987 Troy Zweben 1981 systems evolve time engineering literature firmly established architectures associated code suffer decay Eick et al 2001 Perry Wolf 1992 speak architectural erosion architectural drift former occurs result violating conceptual architecture latter due insensitivity stakeholders architecture may lead obscuration architecture turn may cause violation architecture result systems progressive tendency lose original structure makes difficult understand maintain Schmerl et al 2006 Among common discrepancies original degraded structures phenomenon highly coupled lowly cohesive modules already known since 1972 Parnas 1972 established topic research Architectural recovery one recognized countermeasures decay Dueñas et al 1998 Several 
earlier works focused architectural recovery proprietary Dueñas et al 1998 closed academic AbiAntoun et al 2007 COTSbased systems Avgeriou Guelfi 2005 OSS Bowman et al 1999 Godfrey Lee 2000 Tran et al 2000 studies systems selected specific state evolution internal structures analyzed discrepancies conceptual concrete architectures Tran et al 2000 Researchers proposed various approaches address issue proposing frameworks eg Sartipi et al 2000 methodologies eg Krikhaar et al 1999 guidelines concrete advice developers eg Tran et al 2000 Architectural recovery provides insights concrete architecture turn may help developers integrators instance certain architectural styles Clements et al 2010 may identified provide valuable insights system’s quality attributes Bass et al 2003 Harrison Avgeriou 2011 Recovery important well ensure maintainability product conceptual architecture respected resulting concrete architecture may become spaghetti architecture obstacle making necessary changes system context reuse research particular components defined may identified reused systems ie OSS projects RESEARCH DESIGN study presented paper quantitative descriptive case study Yin 2003 Easterbrook et al 2008 pointed exists confusion engineering literature constitutes case study distinguishing case study “worked example” case study “empirical method” Case studies also conducted different contexts instance industry “in vivo” researchlaboratory setting “in vitro” study empirical “in vitro” case study one OSS namely FFmpeg study presents description analysis system following classification Glass et al 2002 research approach therefore classified “descriptive” remainder section proceeds follows First provide information FFmpeg Second introduce research questions guided research Third present definitions operationalize research section concludes discussion data collection analysis procedures Selection Description FFmpeg System paper presents case study reuse buildlevel components FFmpeg 
selected example reuse several reasons long history evolution multimedia player grown refined several buildlevel components throughout life cycle components appear like “E” type systems instead traditional “S” “P” types lower propensity evolution Several core developers collaborating also MPlayer httpwwwmplayerhqhu one commonly used multimedia players across OSS communities Eventually libavcodec component incorporated among others FFmpeg main development trunk MPlayer increasing FFmpeg’s visibility widespread usage components currently reused different platforms architectures static linking dynamic linking Static linking involves inclusion source code files precompiled libraries compiletime dynamic linking involves inclusion shared binary library runtime Finally staticlinking reuse FFmpeg components presents two opposite scenarios either blackbox reuse strategy “update propagation” issues reported latest version compiled particular version FFmpeg components Orsila et al 2008 whitebox reuse strategy mentioned FFmpeg system successfully become highly visible OSS partly due components libavcodec particular integrated large number OSS projects multimedia domain terms global system’s design FFmpeg yet provide clear description either internal design architecture decoupled components connectors Nonetheless visualizing source tree composition de Jonge 2002 folders containing source code files appear semantically rich line definitions buildlevel components de Jonge 2005 source tree composition de Jonge 2002 first column Table 2 summarizes folders currently contain source code subfolders within FFmpeg shown components act containers subfolders apart source files shown columns two three respectively Typically subfolders role specifyingrestricting functionalities main folder particular areas eg libavutil folder divided various supported architectures Intel x86 ARM PPC etc mentioned Lungu et al 2006 refer structural “pattern” Archipelago fourth column describes main 
functionalities component observed directory provides build configuration files subfolders contained following definition buildlevel components de Jonge 2005 fifth column Table 2 lists month component first detected repository Apart miscellaneous tools component currently reused OSS components multimedia projects development libraries example libavutil component currently redistributed libavutildev package Table 2 shows main components system originated different dates older ones eg libavcodec typically articulated several directories multiple files libavcodec component created relatively early history system 082001 grown 220000 source lines code SLOC alone visible timeline Figure 2 components coalesced since component appears modularized around specific “function” according “Description” column Table 2 become identifiable hence reusable systems fact repackaged distinct OSS projects httpwwwlibavorg
Table 2
Component name | Folder count | File count | Description | First detected
libavcodec | 12 | 625 | Extensive audiovideo codec library | 082001
libpostproc | 1 | 5 | Library containing video postprocessing routines | 102001
libavformat | 1 | 205 | Audiovideo container mux demux library | 122002
libavutil | 8 | 70 | Shared routines helper library | 082005
libswscale | 6 | 20 | Video scaling library | 082006
tools | 1 | 4 | Miscellaneous utilities | 072007
libavdevice | 1 | 16 | Device handling library | 122007
libavfilter | 1 | 11 | Video filtering library | 022008
Research Questions research guided three research questions
RQ1 size FFmpeg components evolve Rationale first interested components FFmpeg behave terms size become available limit growth components affecting ability reused properly
RQ2 architecture FFmpeg components evolve Rationale interested understanding various FFmpeg components relate one another terms coupling cohesion consider measures representation architecture
RQ3 FFmpeg components evolve separated FFmpeg eg whitebox reuse Rationale mentioned FFmpeg components reused far blackbox whitebox scenario OSS components 
particularly suitable whitebox reuse due availability source code number FFmpeg components fact reused using whitebox reuse approach Since scenario copy component made maintained new hosting component likely evolve separately original host ie FFmpeg Therefore interesting study FFmpeg components evolve reused whitebox components Definitions Operationalization section introduces number definitions relevant research presented paper paper use terminology definitions provided related previous studies previous section already discussed interpretation term component summarize consider directory source code file system containing several source code files buildlevel component de Jonge 2005 subsequently used units composition Others used word “module” eg Clements et al 2010 order measure evolution components architectural evolution use number measurements well established engineering measurement literature namely coupling cohesion Coupling divided outbound coupling fanout inbound coupling fanin Furthermore considered concept “connection” states whether two components related • Coupling Coupling measure degree interdependence modules Fenton 1991 several types coupling common coupling modules reference global data area control coupling control data passed modules etc extensive classification types coupling presented Lethbridge Laganière 2001 p 323 study define coupling union “routine call” coupling “inclusionimport” coupling Routine call coupling refers function calls component component B Inclusionimport coupling refers dependencies expressed using include directive C preprocessor used Doxygen tool httpwwwdoxygenorg extract information Since empirical study based definition buildlevel components two conversions made filetofile functionstofunctions couplings “lifted” Krikhaar 1999 p 38 p 85 foldertofolder couplings also done Tran Holt 1999 graphically illustrated Figure 3 stronger coupling link folder B found many elements within call elements folder B Since behavior 
buildlevel components studied couplings subfolders component also redirected component alone hence coupling A → B/C C subfolder B reduced A → B graphically illustrated Figure 4
• Outbound coupling fanout component percentage couplings directed elements elements components requests services component large fanout “controlling” many components provides indication poor design since component probably performing one function
• Inbound coupling fanin component percentage couplings directed components “provision services” component high fanin likely perform oftenneeded tasks invoked many components regarded acceptable design behavior
• Cohesion component sum couplings percentage elements files functions
• Connection distilling couplings defined one could say Boolean manner whether two folders linked connection disregarding strength link overall number connections FFmpeg recorded monthly Figure 5 connections folder counted encapsulation principle twoway connection counted since interested folders involved connection
Data Collection Analysis source code repository SVN FFmpeg parsed monthly resulting 100 temporal points tree structures extracted points monthly extraction raw data achieved downloading repository first day month example retrieving snapshot 022008 following command issued
svn -r {20080201} checkout svn://svn.ffmpeg.org/ffmpeg/trunk
one hand number source folders yet buildlevel components corresponding tree recorded Figure 5 hand order produce accurate description tree structure suggested Tran et al 2000 month’s data parsed using Doxygen aim extracting common coupling among elements ie source files headers source functions systems Doxygen generates socalled dot files process dot files represents file class cluster files couplings towards system order generate dot files keep available process Doxygen configuration file httpmastodonuelacukIJOSSP2012Doxygenbasetxt contains two commands HAVE_DOT = YES DOT_CLEANUP = NO Various scripts applied obtain summary function calls
httpmastodonuelacukIJOSSP2012ffmpeg20080201summaryALLFUNCTIONCALLStxt dependencies include relationships information summary files atomic level functions files order define interrelationships components relations lifted Krikhaar 1999 level buildlevel components ie folders contain mentioned analysis size growth performed using SLOCCount tool Wheeler nd buildlevel component summarized Table 2 study relative change terms contained SLOC along lifecycle undertaken addition study architectural connections performed analyzing temporally number couplings actually involved elements component per definition cohesion number couplings consisted links components per definition inbound outbound couplings respectively Previous studies present recovered architectures used “boxandline” box arrow diagrams eg Bowman et al 1999 use UML package diagrams rather component diagrams graphically visualize buildlevel components defined previous section RESULTS DISCUSSION section provides results empirical investigation addressing three research questions identified previous section First size growth FFmpeg components presented Table 2 followed presentation analysis architectural evolution components section concludes discussion deployment libavcodec OSS projects Size Growth FFmpeg Components general result two different evolutionary patterns observed clustered two graphs Figure 6 Figure 7 measures relative highest values recorded presented percentages Yaxis top graph three components libavcodec libavutil libavformat blue yellow red respectively show linear growth general trend relative maximum size achieved following components referred Etype components hand components FFmpeg Table 2 show traditional evolution typical library packages referred either “Stype” “Ptype” systems presented background section Size Growth EType Components Considering top diagram Figure 6 libavcodec component started mediumsized component 18 KSLOCs currently size reached 220 KSLOCs increase 1100 Also libavformat 
component moved comparable pattern growth 250 increase smaller size overall 14 50 KSLOC Although reusable resources often regarded “Stype” “Ptype” systems since evolutionary patterns manifest reluctance growth typical behavior libraries two components achieve “Etype” evolutionary pattern even heavily reused several projects studied cases appear driven mostly adaptive maintenance Swanson 1976 since new audio video formats constantly added refined among functions components Using metaphor botany components appear grow “fruits” main “plant” “trunk” version control system Furthermore components behave “climacteric” fruits bananas meaning ripen parent plant cases must picked order ripen component needs separated parent order allow mature evolve FFmpeg components achieved evolution even separated belonged ie FFmpeg similarly climacteric fruits Size Growth PType Components bottom diagram Figure 7 details relative growth remaining components Figures 6 7 show remaining components show traditional librarystyle type evolution Maintenance activities components likely corrective perfective nature Swanson 1976 components libpostproc libswscale appear hardly changing even though formed several years main Figure 2 Libavdevice created already 80 current state libavfilter contrast although achieving larger growth since created small stage 600 SLOC doubled 1400 SLOCs resources effectively librarytype systems reuse simplified relative stability characteristics meaning type problem solve Using metaphor shown components “fruits” following behavior unlikely ripen picked Outside main trunk development components remain unchanged even incorporated OSS projects Architectural Evolution FFmpeg Components observations related growth size used cluster components based coupling patterns mentioned 100 monthly checkouts FFmpeg system analyzed order extract common couplings element functions files common couplings converted lifted connections components observed also growth size Etype components 
present steadily increasing growth couplings compared stable Stype Ptype components following section study whether former also display modularized growth pattern resulting stable defined behavior Coupling Patterns EType Components Figures 8 10 present visualization three Etype components identified component four trends displayed overall amount common couplings amount couplings directed towards elements cohesion amount outbound couplings fanout amount inbound couplings fanin seen trends also measured relative highest values recorded trend present results percentages Yaxis component continuous growth trend regarding number couplings affecting libavutil component one sudden discontinuity growth later explained common trend also visible libavcodec libavformat components strong cohesion factor maintains 75 threshold throughout evolution words two components 75 total number couplings consistently internal elements cohesion libavutil hand degrades becomes low revealing high fanin restructuring around one fifth lifecycle June 2006 component becomes provider Lungu et al 2006 fully providing services components 90 overall amount couplings – around 3500 – either towards elements serving calls components observing three components part common larger system changes one component become relevant components well example general trend libavcodec intertwined two components ie libavutil libavformat following ways overall cohesion decreases time interval overall couplings ie blue trend added therefore attribute decayed parallel cohesion decay fanout libavcodec top Figure 5 abruptly increases topping 17 latest studied point closer inspection larger fanout eg requests services increasingly directed towards libavutil component around period middle Figure 5 experiences sudden increase fanin ie provision services Also fanin libavcodec decreases first part evolution libavcodec served numerous requests libavformat component throughout evolution links converted connections libavutil 
instead decreasing fanin libavcodec Performing similar analysis libavformat becomes clear fanout degrades becoming gradually larger reason increasingly stronger link elements libavcodec libavutil form intercomponent dependencies form architectural decay Eick et al 2001
Figure 8 Coupling patterns Etype components libavcodec
Figure 9 Coupling patterns Etype components libavutil
reproduced latest available data point Figure 11 libavformat libavcodec depend heavily libavutil 1093 1748 overall couplings respectively furthermore two components also intertwined 523 calls libavformat served libavcodec Figure 11 shows couplings displayed components amongst instance 68 couplings libavformat 4051 couplings couplings ie cohesion 18 1093 libavutil 9 libavcodec Ninetyfive per cent libavformat’s couplings found within three components remaining 5 couplings components comparing results plots Figures 8 10 especially one representing libavcodec component becomes clear architecture decayed earliest points libavcodec represented excellent component cohesion made 90 couplings fanin 10 couplings fanout recorded essentially libavcodec need services components latest available point instead Figure 11 shows component decayed needs libavutil 16 couplings fanout increased 18 overall couplings graph Figure 11 shows another result representing fact typical tradeoffs encapsulation decomposition several common files accessed libavformat libavcodec “relocated” Tran Holt 1999 recently third location libavutil acts provider Lungu et al 2006 turns negative effect reusability trying reuse functionality libavcodec necessary include also contents libavutil since large amount calls issued libavformat towards libavutil Even worse trying reuse functionality libavformat necessary include also functionality libavutil libavcodec since three components heavily intertwined Coupling Patterns PType Components characteristics Etype components described summarized follows High cohesion Fanout certain threshold
Clear defined behavior component eg “provider” achieved libavutil component second cluster components identified “S” “Ptype” revealed several discrepancies results observed previously list key results summarized also observed growth components number couplings affecting second cluster components reveals difference one libswscale libavdevice libavfilter even two libpostproc orders magnitude respect Etype components Slowly growing trends number couplings observed libavdevice libavfilter cohesion remains stable hand high fanout consistently observed values 07 05 respectively Observing closely dependencies directed towards three Etype components defined suggests components yet properly designed may also due relatively young age potential reuse subsumed inclusion FFmpeg libraries well summarize second type components classified slowly growing less cohesive connected components system acceptable reusable candidates resolving interconnections components could prove difficult Deployment libavcodec OSS Projects Although identified “Etype” components three components libavcodec libavformat libavutil shown highly reusable based coupling patterns size growth attributes interesting seems contradict expectation Etype less reusable due need continuously evolve order observe components actually reused deployed hosting systems section summarizes study deployment libavcodec component four OSS projects avifile httpavifilesourceforgenet avidemux httpfixounetfreefravidemux MPlayer xine Freitas Roitzsch Melanson Mattern Langauf Petteno et al 2002 selection projects deployment study based current reuse components hosts copy libavcodec component code repositories therefore implementing whitebox reuse strategy resource words projects maintain copy libavcodec component issue investigate whether hosting projects maintain internal characteristics original libavcodec hosted FFmpeg order coupling attributes folder extracted OSS number connected folders counted together total number couplings 
results shown Figure 12 diagram Figure 12 represents hosting libavcodec copy presents degree cohesion reentrant arrow specific fanin fanout inwards outwards arrows respectively number connections ie distinct source folders responsible fanin fanout displayed number multi module diagram upperleft upperright corners following observations made total amount couplings copy always lower original FFmpeg copy means whole FFmpeg reused specific resources copy ratio fanin/fanout approximately 2:1 xine copy reversed due fact apparently xine host copy libavformat component graph connections libavcodec libavutil libavcodec libavformat specifically detailed fanin libavformat alone typically order magnitude remaining fanin fanout towards libavutil typically accounts much larger ratio confirmation presence consistent dependency libavcodec libavutil therefore must reused together avidemux moved necessary dependencies libavutil within libavcodec component therefore buildlevel component libavutil detectable THREATS VALIDITY aware limitations study discussed Threats may occur respect construct validity reliability external validity Since seek establish causal relationships discuss threats internal validity Construct Validity Construct validity concerned establishing correct operational measures concepts studied Yin 2003 used coupling cohesion measures represent intersoftware component connections measures widely used within engineering literature relation module interconnectivity interpreted term “component” “buildlevel” component previously done studies eg de Jonge 2005 Furthermore buildlevel components presented Table 2 though probably accurate automatically assigned could subcomponents larger component eg composed libavutil libavcodec Reliability Reliability level operational aspects study data collection analysis procedures repeatable results Yin 2003 p 34 time study FFmpeg hosted Subversion repository parsed monthly discussed research design section Guba 1981 states inquiry
affected “instrumental drift decay” may produce effects instability order guard established audit trail data extraction process recommended practice establish reliability Guba 1981 snapshot example given research design section made publicly available httpmastodonuelacukIJOSSP2012ffmpeg20080201targz generated dot files represent individual files classes clusters files contain couplings modules system also publicly available httpmastodonuelacukIJOSSP2012ffmpeg20080201dotstar External Validity External validity concerned extent results study generalized study focused one case study FFmpeg written mostly C programming language Performing similar study system written instance objectoriented language eg C Java results could quite different However outlined introduction section goal present generalizations based results Rather aim paper document successful case OSS reuse OSS projects CONCLUSION FUTURE WORK section presents conclusion study followed directions future work Conclusion Empirical studies reusability OSS resources proceed two strands first provide mechanisms select best candidate component act building block new system second document successful cases reuse OSS components deployed OSS projects paper contributes second strand empirically analyzing FFmpeg whose components currently widely reused several multimedia OSS applications empirical study performed data last eight years development studied monthly intervals determine extract characteristics size evolutionary growth coupling patterns order identify understand attributes made components successful case OSS reusable resources studied characteristics four OSS projects selected among ones implementing whitebox reuse FFmpeg components deployment reuse components studied perspective interaction hosting systems case study FFmpeg number findings obtained First found several buildlevel components make good start selection reusable components coalesce grow become available various points life cycle currently 
available building blocks OSS projects use Second possible classify using Lehman’s SPE program type categories least two types components one set presents characteristics evolutionary Etype systems sustained growth throughout set albeit recent formation mostly unchanged therefore manifesting typical attributes libraries two clusters compared study connections components first set showed components either clearly defined behavior excellent cohesion elements also found three components become increasingly mutually connected results formation one single supercomponent second set appeared less stable accounts large fanout suggests poor design immaturity components One reusable resources found within FFmpeg ie libavcodec analyzed deployed four OSS systems reused using whitebox approach cohesion pattern appeared similar original copy libavcodec emerged clarity currently reuse facilitated libavformat libavutil components reused Given projects reusing libavcodec library “dynamically” linking ie black box reuse code change made libavcodec library propagation issue Orsila et al 2008 means linking projects need adapt code long new version libavcodec released hand projects hosting copy library ie white box reuse face less propagation issue since changes pushed onto original version libavcodec affect copies Future Work work several open strands follow first would interesting replicate study systems currently widely reused particular necessary start defining distinguishing reuse whole systems “as libraries” zlib reuse components within larger projects component libavcodec within FFmpeg first case whole reused asis seems likely subset functions reused latter implications interesting researchers practitioners try extract automatically libraries comply reusability principles avoid reusing whole systems second research direction needs addressed evolution reusable resources needs address following questions libraries need remain mostly unchanged reusable main issues forking reusable 
libraries avoid effects “cascade updates” respect OSS developers interested parties produce strategy upgrade resources resources rely heavily external libraries Thirdly example components available different times FFmpeg shows evolving projects might able produce similar response OSS communities signaling presence reusable libraries could benefit projects apart Finally presence many available OSS projects implementing similar applications eg example 100 projects implementing “email client” analyzed detect much code duplication code cloning components reuse visible projects ACKNOWLEDGMENTS authors would like thank Dr Daniel German clarification potential conflicts licenses FFmpeg Thomas Knowles insightful discussions Nicola Sabbi insider knowledge MPlayer system thank anonymous reviewers constructive feedback improved paper work part supported Science Foundation Ireland grant 10CEI1855 Lero—The Irish Engineering Research Centre wwwleroie paper revised version Capiluppi Boldyreff C Stol K 2011 Successful Reuse Components Report Open Source Perspective Hissam Russo B de Mendonça Neto G Kon F Eds Open Source Systems Grounding Research Springer Advances Information Communication Technology AICT vol 365 pp 159176 REFERENCES AbiAntoun Aldrich J Coelho W 2007 case study reengineering enforce architectural control flow data sharing Journal Systems 802 240–264 doi101016jjss200610036 Avgeriou P Guelfi N 2005 Resolving architectural mismatches COTS architectural reconciliation X Franch Port Eds Proceedings 4th International Conference COTSBased Systems LNCS 3412 pp 248257 Ayala C Sørensen C Conradi R Franch X Li J 2007 Open source collaboration fostering offtheshelf components selection Feller J Fitzgerald B Scacchi W Sillitti Eds Open source development adoption innovation New York NY Springer doi10100797803877248672 Basili V R Rombach H 1991 Support comprehensive reuse IEEE Engineering Journal 65 303–316 Bass L Clements P Kazman R 2003 architecture practice 2nd ed Reading 
AddisonWesley Bowman Holt R C Brewster N V 1999 Linux case study extracted architecture Proceedings 21st International Conference Engineering pp 555563 Capiluppi Boldyreff C 2008 Identifying improving reusability based coupling patterns H Mei Ed Proceedings 10th International Conference Reuse High Confidence Reuse Large Systems LNCS 5030 pp 282293 Capiluppi Knowles 2009 engineering practice Design architectures FLOSS systems Proceedings 5th IFIP WG 213 International Conference Advances Information Communication Technology Vol 299 pp 3446 Clements P Bachmann F Bass L Garlan Ivers J Little R …Stafford J 2010 Documenting architectures Views beyond 2nd ed Reading AddisonWesley de Jonge 2002 Source tree composition C Gacek Ed Proceedings 7th International Conference Reuse Methods Techniques Tools LNCS 2319 pp1732 de Jonge 2005 Buildlevel components IEEE Transactions Engineering 317 588–600 doi101109TSE200577 Dueñas J C de Oliveira W L de la Puente J 1998 Architecture recovery evolution Proceedings 2nd Euromicro Conference Maintenance Reengineering pp 113119 Easterbrook Singer J Storey Damian 2008 Selecting empirical methods engineering research Shull F Singer J Sjøberg K Eds Guide advanced empirical engineering pp 285–311 New York NY Springer doi101007978184800044511 Eick G Graves L Karr F Marron J Mockus 2001 code decay Assessing evidence change management data IEEE Transactions Engineering 271 1–12 doi10110932895984 Fenton N E 1991 metrics rigorous approach London UK Chapman Hall Fitzgerald B 2006 transformation open source Management Information Systems Quarterly 303 587–598 Freitas Roitzsch Melanson Mattern Langauf Petteno …Lee 2002 Xine multimedia engine Retrieved httpwwwxineprojectorghome German GonzálezBarahona J 2009 empirical study reuse licensed GNU general public license Proceedings 5th IFIP WG 213 International Conference Open Source EcoSystems Diverse Communities Interacting pp 185198 German GonzalezBarahona J Robles G 2007 model understand building running 
interdependencies Proceedings 14th Working Conference Reverse Engineering pp 140149 German Hassan E 2009 License integration patterns Addressing license mismatches componentbased development Proceedings 31st IEEE International Conference Engineering pp 188198 Glass R L Vessey Ramesh V 2002 Research engineering analysis literature Information Technology 448 491–506 doi101016S0950584902000496 Godfrey W Lee E H 2000 Secrets monster Extracting Mozilla’s architecture Proceedings 2nd Symposium Constructing Engineering Tools pp 1523 Guba E 1981 Criteria assessing trustworthiness naturalistic inquiries Educational Communication Technology 29 75–92 Haefliger von Krogh G Spaeth 2008 Code reuse open source Management Science 541 180–193 doi101287mnsc10700748 Harrison N B Avgeriou P 2011 Patternbased architecture reviews IEEE 286 66–71 doi101109MS2010156 Hauge Ø Ayala C Conradi R 2010 Adoption open source softwareintensive organizations systematic literature review Information Technology 5211 1133–1154 doi101016jinfsof201005008 Hauge Ø Østerlie Sørensen CF Gerea 2009 May 18 empirical study selection open source Preliminary results Proceedings 2nd ICSE Workshop Emerging Trends FreeLibreOpen Source Research Development Vancouver BC Canada pp 4247 Hauge Ø Sørensen CF Røsdal 2007 Surveying industrial roles open source development Feller J Fitzgerald B Scacchi W Sillitti Eds Open source development adoption innovation pp 259–264 New York NY Springer doi101007978038772486725 Heinemann L Deissenboeck F Gleirscher Hummel B Irbeck 2011 extent nature reuse open source Java projects K Schmid Ed Proceedings 12th International Conference Reuse Top Productivity Reuse LNCS 6727 pp 207222 IEEE 2000 IEEE Std 14712000 IEEE recommended practice architectural description softwareintensive systems Piscataway NJ IEEE Krikhaar R 1999 architecture reconstruction Unpublished doctoral dissertation University Amsterdam Amsterdam Netherlands Krikhaar R Postma Sellink Stroucken Verhoef C 1999 twophase 
process architecture improvement Proceedings IEEE International Conference Maintenance pp 371380 Kruchten P B 1995 41 view model architecture IEEE 125 42–50 doi10110952469759 Lang B Abramatic JF GonzálezBarahona J Gómez F P Pedersen K 2005 Free proprietary COTSbased development X Franch Port Eds Proceedings 4th International Conference CompositionBased Systems LNCS 3412 p 2 Lehman 1978 Programs cities students limits growth Programming Methodology 4262 Lehman 1980 Programs life cycles laws evolution Proceedings IEEE 689 1060–1076 doi101109PROC198011805 Lethbridge C Laganière R 2001 Objectoriented engineering Practical development using UML Java 2nd ed London UK McGrawHill Li J Conradi R Bunse C Torchiano Slyngstad P N Morisio 2009 Development offtheshelf components 10 facts IEEE 262 80–87 doi101109MS200933 Lungu Lanza Gîrba 2006 Package patterns visual architecture recovery Proceedings 10th European Conference Maintenance Reengineering Macro Buxton J 1987 craft engineering Reading AddisonWesley Mockus 2007 Largescale code reuse open source Proceedings First International Workshop Emerging Trends FLOSS Research Development Orsila H Geldenhuys J Ruokonen Hamouda 2008 Update propagation practices highly reusable open source components Proceedings IFIP 20th World Computer Congress Open Source Vol 275 pp 159170 Parnas L 1972 criteria used decomposing systems modules Communications ACM 1512 1053–1058 doi101145361598361623 Perry E Wolf L 1992 Foundations study architectures ACM SIGSOFT Engineering Notes 174 Runeson P Höst 2009 Guidelines conducting reporting case study research engineering Empirical Engineering 142 131–164 Sametinger J 1997 engineering reusable components Berlin Germany SpringerVerlag Sartipi K Kontogiannis K Mavaddat F 2000 pattern matching framework architecture recovery restructuring Proceedings 8th International Workshop Program Comprehension pp 3747 Schmerl B Aldrich J Garlan Kazman R Yan H 2006 Discovering architectures running systems IEEE 
Transactions Engineering 327 454–466 doi101109TSE200666 Senyard Michlmayr 2004 successful free Proceedings 11th AsiaPacific Engineering Conference pp 8491 Sojer Henkel J 2010 Code reuse open source development Quantitative evidence drivers impediments Journal Association Information Systems 1112 868–901 Sommerville 2004 engineering International Computer Science Series 7th ed Reading AddisonWesley SourceForge 2011 Email client Retrieved httpsourceforgenetdirectoryqemail20client Swanson E B 1976 dimensions maintenance Proceedings 2nd International Conference Engineering pp 492497 Szyperski C 2002 Component Beyond objectoriented programming 2nd ed Reading AddisonWesley Torchiano Morisio 2004 Overlooked aspects COTSbased development IEEE 212 88–93 doi101109MS20041270770 Tran J B Godfrey W Lee E H Holt R C 2000 Architectural repair open source Proceedings 8th International Workshop Program Comprehension pp 4859 Tran J B Holt R C 1999 Forward reverse repair architecture Proceedings Conference Centre Advanced Studies Collaborative Research Troy Zweben H 1981 Measuring quality structured designs Journal Systems 22 113–120 doi1010160164121281900315 Ven K Mannaert H 2008 Challenges strategies use open source independent vendors Information Technology 50910 991–1002 doi101016jinfsof200709001 Wheeler nd SLOCCount Retrieved httpwwwdwheelercomsloccount Wikipedia nd Lamp bundle Retrieved httpenwikipediaorgwikiLAMPsoftwarebundle Yin R K 2003 Case study research Design methods 3rd ed Thousand Oaks CA Sage ENDNOTES
::::
1 course full structural evaluation 128 projects performed arguing features reused among projects
::::
2 list OSS commercial projects integrating libavcodec given maintained httpffmpegorgprojectshtml
::::
3 term “connection” intended cover term “dependency” packages distribution since paper analyses internal architecture components Andrea Capiluppi Lecturer Engineering University Brunel since May 2012 Senior Lecturer University East London February 2009 April 2012 Senior Lecturer University Lincoln UK three years January 2006 February 2009 gained PhD Politecnico di Torino Italy May 2005 held Researcher position Consultant position Open University UK November 2003 Visiting Researcher GSyC group University Rey Juan Carlos de Madrid Spain one partners proposal publications include 50 papers published leading international conferences journals mostly devoted Open Source topic consultant several industrial companies published works results FLOSS research disseminated commercial sites taken part one packages CALIBRE €15 million panEuropean EU research focused use FLOSS industry KlaasJan Stol researcher Lero Irish Engineering Research Centre worked since 2008 holds PhD Engineering University Limerick Ireland MSc Engineering University Groningen Netherlands research interests Open Source OSS development methods including OSS development practices architecture componentbased development reuse empirical engineering Cornelia Boldyreff Associate Dean Research Enterprise School Architecture Computing Engineering University East London gained PhD Engineering University Durham 2004 moved University Lincoln become first Professor Engineering university cofounded directed Centre Research Open Source 25 years experience engineering research published extensively research field Fellow British Computer Society founding committee member BCSWomen Specialist Group actively campaigning women SET throughout career
::::
Changes free open source licenses managerial interventions variations attractiveness Carlos Denner dos Santos Jr Abstract license adopted open source associated success terms attractiveness maintenance active ecosystem users bug reporters developers sponsors cannot done derivatives terms improvement market distribution depends legal terms specified knowing licensing effect scientific publications experience managers became able act strategically loosening restrictions associated source code due sponsor interests example contrary tightening restrictions guarantee source code openness adhering “forever free” strategy managers behaved strategically like changing projects license paper know types changes legal allowances managers made importantly whether managerial interventions associated variations intervened attractiveness ie related numbers web hits downloads members paper accomplishes two goals demonstrates 1 managers free open source projects change distribution rights source code change group licenses adopted 2 variations attractiveness associated strategic choice licensing schema reach conclusions unique dataset open source projects changed license assembled comparative form analyzing intervened projects monthly periods different licenses Based sample 3500 active projects 44 months obtained FLOSSmole repository Sourceforgenet data 756 projects changed source code distribution allowances restrictions identified analyzed dataset projects’ type changes assembled enable descriptive exploratory analysis types license interventions observed period almost four years anchored projects’ attractiveness 35 types interventions detected results indicate variations attractiveness license intervention symmetric change license schema B beneficial attractiveness change B necessarily prejudicial interesting findings discussed detail general results reported support current literature knowledge restrictions imposed license source code distribution associated market success 
visavis attractiveness also suggest stateofthescience superficial terms known differences attractiveness observed complexity results indicates free managers licensing schema seen right one choice carefully made considering strategic goals perceived relevant stakeholders application production conclusions create awareness several limitations current knowledge discussed along guidelines understand deeper future research endeavors Keywords Open source Attractiveness license Intellectual property GPL Free Governance people management Information technology Open source Correspondence carlosdennerunbbr Department Management PPGAADM University Brasilia UnB Brasília Brazil © Authors 2017 Open Access article distributed terms Creative Commons Attribution 40 International License httpcreativecommonsorglicensesby40 permits unrestricted use distribution reproduction medium provided give appropriate credit original authors source provide link Creative Commons license indicate changes made 1 Introduction collective production legal issues Society creations become increasingly complex body knowledge grew information retrieval technologies evolved Innovating competing global scale activity individual alone Searching partners peers collaborate projects crucial task fields notably science engineering public policy management 1–3 Experts noticed expressed notion saying modern inventors organizations individuals production processes best dealt open public fashion opposed proprietary private economic model firm production 3–5 change course raises concerns rights collective goods properties regulated managed prevent disincentives entrepreneurship cooperation thus maintain labor market active sustainable 6–8 digitalization world stimulated trend working collectivities decreasing costs searching collaborators using communication technologies coordinate production activities asynchronicity production activities web led many investigators developers engage geographically distributed 
projects development 9 10 least last 20 years phenomenon “collective production” particularly prominent development free open source free short reshaping information technology industry became strategic player Nowadays hundreds thousands free projects online representing computer supported cooperative work opportunity generating active growing ecosystem users contributors capable joint development unprecedented scale 11 12 Free projects FSP reflect intention founder original owner property rights share costs continuous improvement user base expansion visibility growth 13–15 ability attract peers cocreate founder understood attractiveness 12 Richard Stallman Linus Torvalds among first famous ones publicize type intention bringing forth GNU operating system Linux incredibly successful alone impacted industry deeply Unsurprisingly inspired Linux case many organizations created FSP deliberate organizational strategy known open sourcing alternative classic outsourcing possibility 11 successful FSP involve active communities structured networks evolution public resourceful communication channel users developers sponsors Nevertheless terms success achieved small fraction total number FSP making investment releasing intellectual property public assembling proper infrastructure risky worth managerial consideration failed attempt wastes organization’s limited resources 12–16 scenario uncertainty competition whether attention users developers obtained knowledge effectively create manage FSP suit better demands interests stakeholders sponsor codeveloper useful timely Founders managers take account stakeholders demands interests expect translate increasing adoption intention contribute ie people reporting developers fixing bugs One central issues literature open source affecting intention adopt contribute attractiveness license terms legal specifications released regulate improvement distribution 6 7 16–18 influence license choice discussed many grounds legal 6 strategic 3 8 
sociological 7 standpoints main effects summarized related people’s motivation getting involved community stakeholders believe private property derivative public one legal restriction found scare corporations’ investments away obliged always free open eg licensed GPL 20 duality effects creates tension interests cannot met forcing FSP managers choose strategic path “pick side” terms licensing distribution rights major concern terms application source code allowed modified redistributed Free modified result modification distributed sold hardware example source code embedded kept proprietary depending license chosen According previous studies intellectual property policy delineated chosen license schema power drive people organizations away adopting contributing FSP operates governance mechanism thereby impacting attractiveness consequently production activities 6–8 12 17–19 nutshell license believed influence FSP’s attractiveness production activities thereby success strategic effect becomes known FSP founders managers assuming rationality towards attempt successful expectation act practice change licenses affect attractiveness created paper represents methodological advance comparison previous studies verifies theoreticallyderived expectation relationship license attractiveness performing longitudinal study large sample observed natura wide time frame methodological approach specifically developed towards answers following research questions 1 intellectual property interventions license changes occur practice 2 different licensing schemas chosen managers associated FSP attractiveness questions answered sampling strategy designed identify projects changed licenses followed statistical analysis various types license interventions FSP managers decided make changing thereby legal restrictions thereby attractiveness Nevertheless besides methodological improvement literature found paper also contributes sense previous empirical studies considered open source one type 
license even though many projects one paper incorporates methodological procedures improves classic way classifying licenses based Lerner Tirole’s work realistic empiricallybased schema Furthermore unique dataset assembled produce paper released open free charge along publication another form contribution future research endeavors Additional file 1 scientific basis grounding theoretical expectations spelled next stated details foundation followed methods section describing specific steps followed obtain sample results discussed conclusions 11 Theoretical foundations definitions related work 111 Free open source projects general projects endeavors toward goals writing paper developing source code freely publicly available online use modification license specifying attached may classified free open source 7 8 11 12 Free projects FSP object interest study position key players industry Several become widely known GNULinux operating system R statistical package Apache web server communities maintaining systems large active professional producing first class applications domains receiving sponsorship companies IBM Google However beyond highclass applications FSP become successful never attracting external users contributors generate network peers producing useful uptodate public freely available 12–14 12 role attractiveness One way understand FSP successful others study attractiveness 12 “magnetism stickiness” informally stated Attractiveness common cause many visitors website receives many users number downloads many contributors possesses FSP attractiveness concept considered responsible lack flow market resources basically time money Higher attractiveness leads intention adopt download contribute become member motivating justifying production activities investments towards improve quality generate innovation via “more eyeballs effect” 12 19 20 FSP attractiveness vital role perspective evident important understand influences associated attractiveness variations 13 
choice license FSP success choice license impacts FSP success defines scope business distribution derivatives perhaps preventing source code hijacking impacting reuse “citation” incentive sure influencing stakeholders’ perception control utility technology People organizations take license terms consideration deciding whether adopt use free later worthy contributing reusing source code 7 8 16 21 Figure 1 depicts thesis causal chain intellectual property choice attractiveness qualityproject success summary based literature review study grounded 8 12 Fig 1 read left right FSP managers select license defines restrictions applied source code redistribution affects flow market resources visits website visitors downloads intention use membership intention contribute consequence increase attractiveness people thus interested quality bugs reported fixed new features requested developed influencing directly longterm success Accordingly causal chain expected “disturbed” managerial interventionchange license interests relevant stakeholders sponsors volunteers etc might met anymore explore empirically hypothesis based done previous research 8 12 21 22 study focuses four types legal restrictions may applied free open source code first relates whether source code “restrictive” requiring derivative works released license case redistribution 19 second whether “highly restrictive” besides restrictive forbids source code even mingled compilation different license 19 third whether code may relicensed meaning “any distributor right grant license … directly third parties” 7 p 88 fourth whether licensed Academic Free License since written correct problems important licenses MIT BSD 7 understudied Methodologically speaking projects licenses classified basis including cases would license Therefore schema might restriction one group stakeholders students example restriction corporations methodological choice reflects reality open source projects accurately downside complex results 
demonstrate basic sampling strategy idea guided research look projects undergone change legal terms lifecycle verify possible associationsvariations main indicators attractiveness projects approach aims uncover whether FSP managers change legal restrictions projects lifecycle research question 1RQ1 evaluate whether success FSP associated legal terms change beforeandafter statistical analysis managerial intellectual property intervention IPI attractiveness research question 2RQ2 intents together addressed previous research methodological approach
::::
2 Methods data sampling statistical analyses obtain data capable answering questions whether FSP managers performed changes schema licensing years RQ1 whether changes associated attractiveness RQ2 search internet secondary data free projects made options popped University Notre Dame based seemingly straightforward one chosen FLOSSmole 23 Data obtained released FLOSSmole projects largest free repository available online 6 time data collection efforts organized database inspection covering 44 months activities database filtered contain projects changed listed licenses years covered obtained dataset filtered dataset equal zero projects first research question paper would “no FSP managers changed license schema despite known effect attractiveness found previous research” empirical answer yes FSP managers made interventions aka IPI hundreds times research sample obtaining working sample data organization process performed classifying various licenses projects many one license given point categories described right Fig 1 shown information audience enduser developer example date creation etc also kept sample description data numbers web hits downloads members gathered monthly allow comparisons indicators attractiveness anchored type licensing schema intervention choice specific indicators aligned previous research 12 attractiveness first directly addressed specialized literature details data preparation procedure described sampling filtering procedures adopted specifically designed detect changes license terms adopted FSP managers explore IPI associated FSP attractiveness variations ideal methodological situation random selection projects undergo license change possible due impossibility people’s experiment alternatively control confounding effects projects listing categories audiences changed period covered study selected Also missing data number members also removed sample indicates “orphan” working sample 756 FSP monthly data covering period 44 months October2005 
June2009 1 month missing FLOSSmole July 2008 monthly data license collected classification based legal restrictions covered paper explained classification set forth based previous research always treated licenses restrictions 1 compatibility mingling different compilation referred “highly restrictive” 2 whether improvement must released free well yes referred “restrictive” 3 whether might relicensed third party different license originally chosen referred “relicensable” However empirical fact projects one license challenges classification considers simply based one licenses Free projects choose schemas licensing example “highly restrictive” stamp nonpayers “relicensable” option pays classification adopted takes account obtain accurate however complex picture projects licensing schema listed projects’ licenses considered duallicensed might indeed “Restrictive Highly Restrictive Relicensable” something first sight appear contradictory classification performed per month changes schema managerial interventions detected flagged analysis
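The per-month classification and change detection described above can be sketched as follows. This is only an illustrative sketch, not the author's procedure. The schema letters follow the category labels reported in Table 1. The names `LicenseTerms`, `KNOWN_LICENSES`, `classify_schema` and `detect_interventions` are hypothetical, the flags assigned to each license are inferred from the examples named in Table 1, and the rule for combining several licenses (a restriction or allowance counts if any listed license carries it) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LicenseTerms:
    restrictive: bool          # derivatives must keep the same license
    highly_restrictive: bool   # code may not be mingled with other licenses
    relicensable: bool         # distributors may grant a different license
    is_afl: bool = False       # Academic Free License is tracked separately


# Hypothetical lookup table; flags are inferred from the Table 1 examples.
KNOWN_LICENSES: Dict[str, LicenseTerms] = {
    "Public Domain": LicenseTerms(False, False, True),
    "MIT": LicenseTerms(False, False, True),
    "Apache": LicenseTerms(False, False, True),
    "AFL": LicenseTerms(False, False, True, is_afl=True),
    "LGPL": LicenseTerms(True, False, False),
    "MPL": LicenseTerms(True, False, True),
    "GPL": LicenseTerms(True, True, False),
}


def classify_schema(licenses: List[str]) -> str:
    """Map the licenses a project lists in one month to a schema letter A-G."""
    if not licenses or any(name not in KNOWN_LICENSES for name in licenses):
        return "A"  # none / other
    terms = [KNOWN_LICENSES[name] for name in licenses]
    if all(t.is_afl for t in terms):
        return "C"  # assumption: C only when every listed license is AFL
    restrictive = any(t.restrictive for t in terms)
    highly = any(t.highly_restrictive for t in terms)
    relicensable = any(t.relicensable for t in terms)
    if not restrictive:
        return "B" if relicensable else "A"
    if highly:
        return "G" if relicensable else "F"
    return "E" if relicensable else "D"


def detect_interventions(monthly_schemas: List[str]) -> List[Tuple[int, str, str]]:
    """Return (month_index, from_schema, to_schema) for each month-to-month change."""
    changes = []
    for month in range(1, len(monthly_schemas)):
        previous, current = monthly_schemas[month - 1], monthly_schemas[month]
        if previous != current:
            changes.append((month, previous, current))
    return changes
```

Under these assumptions a project listing both GPL and Apache falls into schema G, which matches the dual-licensing example given for that category.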
::::
3 Results Findings descriptive statistics towards RQ1 Table 1 summarizes interventions detected along labels given see column “description” number occurrences type change legal terms displayed table cells table represents detailed answer RQ1 One see example GPL involved managerial interventions 715 times endstate 298 times sum column F beginning state change 417 times sum row F description column one see GPL restrictive highly restrictive derivative work redistributed must GPL well source code mingled compilation must GPL well “viral” license GPL cannot relicensed different license GPL thus restrictive highly restrictive nonrelicensable GPL motivates managerial interventions probably due popularity mixed feelings community adoption loved believe “free forever” much primarily guided competitive motivations GPL leadership followed duallicensing strategy FSP managers decide release code different licenses depending interest profile user eg whether individual forprofit organization interventions ranking number occurrences found Table 1’s column data related new license type chosen adopted rows data license type abandoned “from” “to” indicated first cell second row Additionally monthly data Web hits visitors downloads intention install use number members intention contribute reporting bugs features besides type development stage gathered Table 2 contains FromTo Description Count license type interventions sample B C E F G Sum Ranking None “other” 0 22 2 13 3 47 1 88 5 B NonRestrictive Relicensable eg Public Domain MIT 8 0 7 20 16 31 45 127 4 C Academic Free LicenseAFL NonRestrictive Relicensable 2 5 0 0 7 0 14 7 Restrictive NonRelicensable eg GNU Lesser General Public LicenseLGPL 6 34 0 0 21 67 6 134 3 E Restrictive Relicensable eg Mozilla Public LicenseMPL 3 19 0 12 0 7 8 49 6 F Restrictive Highly Restrictive NonRelicensable eg GNU General Public LicenseGPL 36 81 3 137 5 0 155 417 1 G Restrictive Highly Restrictive Relicensable eg dual licensed GPL Apache 0 32 0 6 6 
139 0 183 2 Sum 55 193 12 188 51 298 215 1012 Rank 53 74 61 2 Source author’s descriptive statistics numerical variables Table 3 frequency projects particular type license versus development status first month dataset October 2005 calculate “attractiveness” latent construct correlation matrix previous study 12 used principal component analysis 24 linear combination three indicators attractiveness identified maximize explained variance first principal component extracted operationally defined 0.63 × log(webhits) + 0.64 × log(downloads) + 0.43 × log(members) explains 65 sample variance first component extracted used calculate new variable named attractiveness result multiplied sum projects logtransformed web hits downloads number members given month measure attractiveness expresses ability attract market resources environment competes projects Attractiveness thus common cause website visits downloads membership numbers Data organized statistically analyzed R Table 2 one see sample 1 projects founded early 1999 2 average approximately 378 downloads October 2005 least one four different licenses listed point Table 3 depicts different picture showing 1 48 projects 363 licensed GPL restrictive highlyrestrictive nonrelicensable 95 beta stage 2 11 756 projects license specified 3 7 projects license development status file October2005 distribution projects sample demonstrates wide variability various stages lifecycle reducing limitations nonexperimental nature study potential sampling biases
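The attractiveness measure defined above is a weighted sum of log-transformed monthly counts, with weights 0.63 (web hits), 0.64 (downloads) and 0.43 (members). A minimal sketch follows. The function names are hypothetical. Two points are assumptions not stated in the text: the natural logarithm is used, and a count of zero contributes 0 to the sum instead of an undefined log(0).

```python
import math

# Component weights as reported for the first principal component.
WEIGHTS = {"webhits": 0.63, "downloads": 0.64, "members": 0.43}


def safe_log(count: float) -> float:
    """Natural log of a count; assumption: a zero count contributes 0."""
    return math.log(count) if count > 0 else 0.0


def attractiveness(webhits: float, downloads: float, members: float) -> float:
    """Weighted sum of log-transformed web hits, downloads and members for one month."""
    return (
        WEIGHTS["webhits"] * safe_log(webhits)
        + WEIGHTS["downloads"] * safe_log(downloads)
        + WEIGHTS["members"] * safe_log(members)
    )
```

With this handling of zeros, a project with no web hits, no downloads and a single member scores 0, which is at least consistent with the minimum attractiveness of 0 listed in Table 2.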
::::
4 Results Findings preparing answer RQ2 explore IPI associations attractiveness variations obtain statistical evidence variation FSP classified according type intervention subject every month working sample organized analyzed following fashion allow statistical comparisons reasonable sample sizes dataset reorganized display seven licensing schemas G columns attractiveness rows new dataset cell represents attractiveness specific month broken licensing schema various columns analytical strategy treating licensing
::::
Table 2 Variable Minimum Maximum Mean Std deviation Registered 11041999 3132009 1082003 – nlicenses200510 0 4 110 039 attractiveness200510 0 1612 54694 333 downloads200510 0 34514 37807 194156 webhits200510 0 836740 926723 487277 members200510 1 55 338 522 Source Author’s
::::
Table 3 Type licensedevelopment status Alpha Beta Mature None Planning Prealpha Stable Total 12 18 2 7 7 33 86 16 24 03 09 09 09 44 114 B 25 36 4 2 15 11 33 126 33 48 05 03 20 15 44 167 C 1 1 0 2 1 0 3 8 01 01 0 03 01 0 04 11 22 33 2 5 15 13 38 128 29 44 03 07 20 17 50 169 E 4 1 0 0 1 1 3 6 25 05 13 0 01 01 04 08 33 F 84 95 9 8 36 38 93 363 111 126 12 11 48 50 123 480 G 4 7 0 0 2 3 4 2 05 09 0 0 03 04 05 26 TOTAL 152 200 17 25 77 75 210 756 201 265 22 33 102 99 278 100 Source Author’s schema specific change schema increased sample size immensely permitted statistical mean comparisons attractiveness RQ2 required classic ttest robust violations assumptions large samples performed using SPSS descriptive statistics variable variable new dataset shown Table 4 possible see smallest sample size 265 means 33264 monthprojects available 756 projects times 44 months 265 monthprojects could flagged C type schema
::::
5 Results Findings revisiting RQ1 towards RQ2 license schema licensing imposes restrictions allowances application adopter source code contributor creator derivative work example company customizes GPL application distributes market obliged make source code redistributed improved public license choice strategic decision social economic impacts block interests people related users developers relevant stakeholders major decision like expected occur often managers avoid status quo changes harm expectations turn people’s attention away actual work eg politics disputes tendency change strategic matters known organizational literature structural inertia 25 conformance organizational inertia thousands free projects obtained FLOSSmole Sourceforgenet analyzed research 756 decided change license type period 44 months covered research October2005 June2009 missing July2008 Nevertheless already shown Table 1 756 projects changed licenses done 1012 times considerable number validates theoretical expectation managerial action changes legal restrictions towards meeting stakeholders’ demands expectations success Previous research stated license affects probability success accordingly FSP managers indeed attempted changes legal restrictions terms specific results leaving projects exposed legally unattended managerial decision license specified detected ways projects left “none” choice 88 times surprisingly changed current state license one license 55 times see Table 1 type license fact found projects license specified every month covered research FSP license “none” Acategory created less average attractiveness restrictiverelicensable duallicensed projects often attractiveness GPL Fschema Let us move one step analyze data numerically initially explore statistical associations attractiveness license ratios mean attractiveness afterbefore interventions computed considering projects given change licensing schema summarized Table 5 calculating ratios summed projects specific license 
attractiveness component calculated standardization sum attractiveness projects state license change type change Projects aggregated afterwards one ratio calculated dividing mean attractiveness change mean attractiveness change interpret results Table 5 example one see ratio 094 first row indicates projects changing type license B experienced lower levels attractiveness intervention moving away status license going status “public domain” license B average detrimental attractiveness specifically reduction B C E F G Ab – 094 107 106 114 109 087 Bcdf 096 – 097 102 103 098 101 Cdf 092 093 – – – 105 – Dbeg 098 105 – – – 096 103 Edg 070 086 – 091 – 089 089 Fbc 089 100 200 098 106 – 101 Gde – 085 – 098 088 089 – Superscript letters indicate asymmetric effect interventions going one license another similar effect eg positive leave C go F leave F go C Source Author’s Table 4 Descriptive statistics mean comparisons licensing schema License Type Sample Size Minimum Maximum Mean Std Dev Aattractiveness 2134 044 1460 69037 281252 Battractiveness 5322 000 1612 64007 275651 Cattractiveness 265 000 1112 64196 209862 Dattractiveness 5522 030 1446 67547 231416 Eattractiveness 1073 030 1597 74004 274449 Fattractiveness 9849 000 1683 66443 288175 Gattractiveness 1865 044 1801 76265 308157 Source Author’s 6 However strategic move detected 22 times sample see Table 1 imposing limitation robust statistical analysis variation attractiveness limitation overcome later analysis ttests described methods section Moving ahead exploratory results interpretation associations odd managerial action moving away license specified one type change “A” target attractiveness variations average attractiveness ratio projects undergone type change found always detrimental attractiveness column Table 5 demonstrating stakeholders like uncertainty associated license looking interventions none choice target Table 5 noteworthy every time change made average attractiveness decreased number smaller one indicates 
attractiveness ratio afterbefore change average pushed Additionally went none restrictive relicensable choice → E change associated average change 14 attractiveness distinct perspective interestingly intervention none nonrestrictive relicensable eg MIT restrictive highly restrictive relicensable ie dual licensed led attractiveness reduction see → B → G Table 5 moment one wonder actual reasons findings casespecific manner general theoretical interpretation relevant stakeholders’ interests harmed due license change affecting consequent attractiveness Together findings related managerial decision license specified probably interpreted several ways sign welcoming market unregulated easier suffer litigation consider managerial change license specified always detrimental However another perspective projects license still considered attractive suggesting possibility regular user take license account Perhaps explanations valid complementary attractiveness measure adopted research groups effects developers users together downloads membership numbers future research sort Attractiveness cause variables common likely one first principal component extracted example explains 64 variance 36 due attractiveness measure Future studies dig line inquiry studying indicators separately well Back results interpretation focusing popular choice GPL generically restrictive licensing ie restrictive highly restrictive nonrelicensable – Fschema found beneficial projects abandon scheme source code regulation concerning attractiveness increase Overall positive variation change terms attractiveness detected strategic move detrimental FSP attractiveness projects went “none” restrictive nonrelicensable normally LGPL option see changes involving F Table 5 support results become GPL good FSP attractiveness initial state absence license option Academic Free License C LGPL one strategic interventions detected 47 7 67 times respectively Table 1 taken together findings suggest good avoid GPL better adopt 
compared license LGPL challenging explanation findings type change intervention GPL AFL F → C opposite C → F positive means good change GPL Academic Free License also positive change GPL coming Academic Free License suggests change might good depending whether change aligned FSP stakeholders’ demands lack symmetry effects interventions better observed looking matrix shown Table 5 superscript letters pattern findings dealt details later section Analyzing interventions together 35 types observed sample 13 positive attractiveness 21 negative one neutral total 1012 intellectual property interventions found average one per taking initial state involves F Table 5 account common managerial intervention F detected 417 times consistent positive impact attractiveness least common origin C 14 associated negative change attractiveness largest negative impact occurs abandonment E 15 found 49 times mixed results apparent visual inspection Table 5’s coloring scheme suggests interventions types licenses always come good always impact although exploratory statistical attractiveness exception F B reinforces importance carefully strategically think decision impacts seem irrelevant regarding associated changes attractiveness Moreover every intervention targeted originated E G impacted attractiveness negatively Also although changing C B change type license terms restrictions analyzed research impact attractiveness suggesting stakeholders prefer AFL MIT instance makes sense AFL designed improve MIT reason include separately study However actual reasons finding object future research suggests licensing scheme quantitative research captures Finally going G B led reduction 15 attractiveness duallicense option G represents signals projects’ stakeholders suitable wider audience intellectual property model accommodate interests various groups market flexible generic strategy Moving away management model appears push attractiveness always mentioned focused strategy
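The after/before ratios of Table 5 and the asymmetry pattern flagged by its superscript letters can be sketched as below. This is a simplified illustration, not the author's computation: it skips the standardization step mentioned in the text, the function names are hypothetical, and the records in the usage note are invented toy values, not data from the study. A pair is treated as asymmetric when moving in either direction between two schemas has the same sign of effect, as in the F to C and C to F example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, float, float]  # (from_schema, to_schema, before, after)


def intervention_ratios(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Mean attractiveness after a change divided by the mean before it, per change type."""
    before = defaultdict(list)
    after = defaultdict(list)
    for src, dst, attr_before, attr_after in records:
        before[(src, dst)].append(attr_before)
        after[(src, dst)].append(attr_after)
    ratios = {}
    for key, values in before.items():
        base = mean(values)
        if base != 0:  # a ratio is undefined when the baseline mean is zero
            ratios[key] = mean(after[key]) / base
    return ratios


def asymmetric_pairs(ratios: Dict[Tuple[str, str], float]) -> List[Tuple[str, str]]:
    """Schema pairs where X->Y and Y->X both raise or both lower attractiveness."""
    found = []
    for (src, dst), ratio in ratios.items():
        reverse = ratios.get((dst, src))
        if reverse is None or src >= dst:
            continue  # report each unordered pair once
        if (ratio > 1 and reverse > 1) or (ratio < 1 and reverse < 1):
            found.append((src, dst))
    return sorted(found)
```

For example, with toy records `[("F", "C", 2.0, 4.0), ("F", "C", 4.0, 8.0), ("C", "F", 5.0, 5.5), ("A", "B", 10.0, 9.0)]`, the F to C ratio is 2.0, the C to F ratio is above 1, and the pair C/F is reported as asymmetric.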
::::
6 Results Findings asymmetry effects statistical answer RQ2 lack symmetry effect interesting deserves consideration None types licensing schemas analyzed research escapes licensing schemas asymmetric effects least one type license contradictory type license B symmetric effects E G least contradictory scheme opposite effect attractiveness B involved see superscript letters Table 5 finding suggests match licensing scheme projects’ specific stakeholders might exist direction effect given license would simply reversed depending whether source destiny intervention suitability one license schema likely rely context adoption momentary demands stakeholders thus combination license treated ideal general specific according stakeholders’ expectations projectbyproject basis towards statistically based answer RQ2 results reported analyzed reorganized dataset mean monthprojects attractiveness per licensing schema subjected analysis see Table 4 descriptive statistics getting mean difference comparisons ttests values mean attractiveness time considered results taken together signal less restrictive licenses attractive average dual license beats academic unrestricted schema eg MIT turn attractive GPL highly restrictive choice conclusion attractiveness varies according license schema consistently course analysis basic statistical terms clear variations attractiveness indicators associated licensing schema chosen FSP manager ttests performed give confidence answer RQ2 explained mean statistical comparisons monthly data aggregated increase sample size explained mean differences pair licensing schema calculated along standard deviation differences subsequent confidence intervals statistical significance determination results presented Table 6 considers mean difference significant 005 type error Bonferroni correction procedure applied marked effect size pair licensing schema based Cohen’s marked According results shown Table 6 one see 11 21 pairs statistically significantly different 
using conventional statistical procedure control inflated alpha context multiple comparisons Bonferroni 11 4 effect sizes small medium significant according Cohen’s famous suggested interpretations higher 02 signals licensing schema indeed associated average numbers web hits downloads members attract differences absolute numbers effect sizes schemas peaks CG pair 135 mean difference favor dual license schema moves away AFL license option rest results pair licensing schemas found Table 6 Overall statistical results analysis variations attractiveness taken together allow solid answer second research question posed paper whether intellectual property intervention managerial change licensing schema licensing schema indeed associated variations attractiveness level many cases meaningful effect size next section general conclusions discussed based answers found research questions presenting directions future research guidelines free open source managers
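The multiple-comparison setup described above can be checked with a short sketch. Seven licensing schemas give 21 unordered pairs, the Bonferroni-adjusted threshold is 0.05 divided by 21, and the effect size is the mean of the paired differences divided by their standard deviation, as stated in the note to Table 6. The function names are hypothetical, and the numbers in the usage note assume that the C-G row of Table 6 reads as a mean difference of 1.35 and a standard deviation of 3.79 (decimal points appear to have been lost in this text).

```python
from itertools import combinations
from typing import List, Sequence, Tuple

SCHEMAS = ["A", "B", "C", "D", "E", "F", "G"]


def schema_pairs(schemas: Sequence[str] = SCHEMAS) -> List[Tuple[str, str]]:
    """All unordered pairs of licensing schemas to compare."""
    return list(combinations(schemas, 2))


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under the Bonferroni correction."""
    return alpha / n_comparisons


def cohens_d_paired(mean_diff: float, sd_diff: float) -> float:
    """Effect size for paired differences: |mean difference| / SD of differences."""
    return abs(mean_diff) / sd_diff
```

Here `bonferroni_alpha(0.05, len(schema_pairs()))` is about 0.00238, in line with the threshold of roughly 0.0023 noted under Table 6, and `cohens_d_paired(1.35, 3.79)` rounds to 0.36, matching the effect size listed for the C-G pair.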
::::
7 Conclusions implications research practice research focused intellectual property rights interventions free open source projects FSP licensing schema changes regulate distribution allowances source code hypothesis managerial interventions would affect stakeholders’ perceptions value thus variations FSP attractiveness managerial intervention could observed validate theoretical expectation data thousands FSP almost 4 years filtered identify sample 756 projects changed types licenses allowing empirical study various managerial interventions detected period 44 months variations cataloged organized allow comparisons attractiveness changes grouped intervention type finding far missing free literature Moreover reorganization original datasets allowed comparisons projects’ attractiveness verify whether licensing schema adopted FSP managers associated performance concerning attraction developers users visitors represented linear combination numbers members downloads web hits classification schema licenses adopted FSP managers developed paper also represents step forward literature reality adoption various licenses apparent contradictory allowances source code GPL public domain license example captured previous research result complex accurate classification course pros cons general conclusion results indicate legal terms specified license indeed associated attractiveness aggregated measure line previous research led expectation various business models possible open source expressed licensing schemas related success regarding attraction users developers 10 12 26 However moving beyond previously published literature findings suggest specifics generic hypothesis well understood yet found changes rights distribution fully understood cannot treated solely generically interventions vary attractiveness variations associated beneficial depending much known published literature free research first point providing thus ground future casequalitative studies follow lead explore 
specific reasons license intervention consequent increase reduction attractiveness based stakeholders’ perceptions projects
::::
Table 6 Statistical tests attractiveness mean differences one MINUS another licensing schema Paired differences attractiveness 99 confidence interval df pvalue Cohen’s Mean Std Deviation Std Error Mean Lower Upper Pair 1 – B 022 397 010 002 047 233 1735 002 006 Pair 2 C 090 387 027 019 161 328 200 000 023 Pair 3 – 004 374 009 019 027 050 1751 062 001 Pair 4 – E 031 408 014 066 004 226 895 002 008 Pair 5 – F 003 391 009 021 027 034 1749 074 001 Pair 6 G 067 422 011 095 039 624 1553 000 016 Pair 7 B – C 039 344 025 024 103 161 195 011 011 Pair 8 B 038 357 005 051 024 695 4356 000 011 Pair 9 B E 057 384 013 091 022 426 829 000 015 Pair 10 B F 032 395 006 048 017 544 4419 000 008 Pair 11 B G 090 419 011 118 062 822 1458 000 022 Pair 12 C – 006 344 025 072 059 026 186 080 002 Pair 13 C E 090 360 024 153 028 377 224 000 025 Pair 14 C – F 009 323 023 050 069 040 197 069 003 Pair 15 C G 135 379 025 201 068 528 220 000 036 Pair 16 E 060 367 013 093 027 470 817 000 016 Pair 17 – F 007 366 005 007 021 128 4614 020 002 Pair 18 G 084 395 010 110 057 817 1490 000 021 Pair 19 E F 065 385 013 031 099 488 836 000 017 Pair 20 E – G 031 407 013 065 004 229 927 002 008 Pair 21 F G 077 416 011 105 050 720 1504 000 019 indicates significance 005 Bonferroni correction 00023 00521 Cohen’s calculated mean divided std deviation Superscript letter means effect size small medium Source Authors managers stakeholders’ perceptions considered future research endeavors future line inquiry based casequalitative studies would able shed light asymmetric effects detected sample well Quite often intervention one license another opposite effect change another one analyzed viceversa comparison possible Probably FSP stakeholders expectations related occasional change might occur license terms free intention adopt contribute means depending current license anchor effects changing one license might different specific interests stakeholders also matter eg hardware production service sale Managers take 
account considering license change FSP managers aware success projects linked choice license fewer market resources – attention users labor developers – might flow direction depending means managers must understand relevant stakeholders application want source code attempt meet expectations carefully considering change licensing direct negotiation stakeholders avoid unwanted consequences research indicates silver bullet concerning right licensing schema business model signaling general hypothesis explored needs elaboration Academically speaking contingent type theory explain license schema impacts attractiveness based context perhaps stakeholderbased needs developed help guide future researchers direction moment possible highlight general strategy multiple licenses appears superior specific license schema perhaps accommodates stakeholders’ conflicting interests better would explain noticeable trend adopt “various licenses” strategy demonstrates important improve classification schema previously adopted literature conclusion intellectual property interventions always beneficial free almost invariably associated attractiveness variations Accordingly FSP managers aware importance carefully select change type license FSP continuously succeed result growing market interest application source code Nevertheless intervention decision occur unaware specific consideration stakeholders’ intentions future Nevertheless methodologically speaking future research must persist pursuing licenseattractiveness relationship analyzing longitudinal type data advanced inferential statistical techniques structural equation modeling explore understand causal relationships better even rigorously ttests Bonferroni procedure applied basic reliable choice problem hand analytical improvements possible welcome collective scientific communication towards knowledge accumulation Another downside research sample restricted Sourceforgecom projects Nowadays many free repositories could considered 
Nevertheless findings reported likely constant across repositories hypothesis future research verify well Finally measures attractiveness adopted another point improvement performed future research number web hits downloads members utilized various measures possible example one could use market share alternative survey methods evaluate attractiveness subjectively Moreover attractiveness probably consequence many things besides license chosen manager factors considered future research paper endogeneity issue dealt sampling procedure identified projects various kinds level maturity thereby controlling effects Additionally results discussed appear complex seem accurate representation FSP reality fully understood future research use dataset made available along paper different analytical theoretical approaches shed light projects behaviour time 8 Endnotes 1httpwwwgnuorggnuinitialannouncementenhtml 2httpwwwnberorgpapersw9363 3httpdlacmorgcitationcfmid2597116 4httpflossmoleorg 5httpwww3ndeduossDatadatahtml 6httpthestatsgeekcom20130928thettestandrobustnesstononnormality 7httpnrsharvardeduurn3HULInstRepos11718205
::::
9 Additional file Additional file 1 Dataset raw data used research CSV 1489 kb Abbreviations AFL Academic Free License FSP Free open source projects GPL General Public License IPI Intellectual property interventions MIT Massachusetts Institute Technology license Acknowledgements appreciate comments guidance provided Professors Julio Singer statistics USP Fabio Kon computer science USP contributions initial stages research incredibly helpful also thank Center Technology Development CDT University Brasilia UnB technical help provided work Raphael Saigg previous version paper presented CSCW 2011 Funding thank FAPESP 2009020462 funding Authors’ contributions sole author Ethics approval consent participate need secondary public data used Competing interests authors declare competing interests Publisher’s Note Springer Nature remains neutral regard jurisdictional claims published maps institutional affiliations Received 2 March 2016 Accepted 12 July 2017 Published online 07 August 2017
::::
Exploratory MixedMethods Study General Data Protection Regulation GDPR Compliance OpenSource Lucas Franke lfrankevtedu Virginia Tech Blacksburg Virginia USA Huayu Liang huayu98vtedu Virginia Tech Blacksburg Virginia USA Sahar Farzanehpour saharfarzavtedu Virginia Tech Blacksburg Virginia USA Aaron Brantly abrantlyvtedu Virginia Tech Blacksburg Virginia USA James C Davis davisjampurdueedu Purdue University West Lafayette Indiana USA Chris Brown dcbrownvtedu Virginia Tech Blacksburg Virginia USA ABSTRACT Background Governments worldwide considering data privacy regulations laws European Union’s General Data Protection Regulation GDPR require developers meet privacyrelated requirements interacting users’ data Prior research describes impact laws development commercial Although opensource commonly integrated regulated thus must engineered adapted compliance know laws impact opensource development Aims Understanding data privacy laws affect opensource development focused European Union’s GDPR prominent law specifically investigated GDPR compliance activities influence OSS developer activity RQ1 OSS developers perceive fulfilling GDPR requirements RQ2 challenging GDPR requirements implement RQ3 OSS developers assess GDPR compliance RQ4 Method distributed online survey explore perceptions GDPR implementations opensource developers N56 augment analysis conducted repository mining study analyze development metrics pull requests N31462 submitted opensource GitHub repositories Results results suggest GDPR policies complicate opensource development processes introduce challenges developers primarily regarding management users’ data implementation costs time assessments compliance Moreover observed negative perceptions GDPR opensource developers significant increases development activity particular metrics related coding reviewing activity GitHub pull requests PRs related GDPR compliance Conclusions findings provide future research directions implications improving data privacy 
policies motivating need policyrelated resources automated tools support data privacy regulation implementation compliance efforts opensource
::::
1 INTRODUCTION products collect increasing amount data users enhance user experiences personalized machine learningenabled 53 application behaviors 33 marketing 79 practices may benefit users also threaten wellbeing example 2013 Facebook allowed political research firm Cambridge Analytica access data 87 million Facebook users 62 Cambridge Analytica used data influence US elections 114 115 protect citizens 100 governments worldwide developing data privacy regulations 105 goal constrain citizens’ personal data collected processed stored saved target specific industries eg United States’s Health Insurance Portability Accountability Act HIPAA places requirements healthcare organizations handling medical data 7 Others cover personal data regardless context eg European Union’s General Data Protection Regulation GDPR grants rights EU citizens affects entities handle data 12 penalties noncompliance data privacy laws regulations may severe 18 46 example GDPR corporations fined millions billions euros 80 organizations store manipulate data electronically ensuring legal compliance important engineering task Data privacy regulations create challenging requirements entail technical legal expertise developers must implement required features obtaining consent users data collection ensure organizations’ products compliant However developers may limited legal knowledge 81 109 receive minimal training 21 55 lead coarse solutions exiting affected market 88 — hundreds websites simply banned European users GDPR went effect 97 103 Researchers explored impact data privacy regulations businesses 72 73 88 users 22 32 68 observable product properties website cookies 67 database performance 92 However limited study laws affect development process existing studies commercial development 20 29 lack knowledge effects GDPR opensource OSS development goal work describe impact data privacy regulation compliance opensource study first topic2 therefore adopt exploratory methodology provide initial 
characterization identify phenomena interest study
study draws two data sources collected two phases first phase examined qualitative data developers’ experiences GDPR implementations OSS collected via survey N56 investigate impact GDPR OSS second phase collected analyzed developers’ activities opensource projects GitHub examining metrics sentiments 31462 pull requests divided 15731 GDPR nonGDPR pull requests PRs
results show GDPR compliance negatively impacts opensource development, incurring complaints developers significantly increasing coding reviewing activities PRs addition despite benefits data privacy regulations users find developers mostly negative perceptions GDPR reporting challenges implementing verifying policy compliance also find interactions legal experts hinder development processes yet developers rarely consult legal teams, often relying ad hoc methods verify GDPR compliance
sum contributions
• survey OSS developers understand developers’ experiences GDPR compliance challenges implementing assessing data privacy regulations
• empirically analyze impact GDPRrelated implementations development activity metrics
• use natural language processing NLP techniques evaluate perceptions GDPR compliance discussions OSS repositories
Significance work contributes exploratory analysis impact GDPR compliance opensource identifies interesting phenomena research, in particular opportunities support policy implementation verification also provide recommendations policymakers developers improve data privacy regulations implementation
2This paper extension preliminary work presented poster 44
::::
2 BACKGROUND 21 Regulatory Compliance 211 General requirements divided two categories functional nonfunctional 96 Functional requirements pertain inputoutput characteristics ie functions computes Nonfunctional requirements cover everything else resource constraints deployment conditions development process One major class nonfunctional requirement compliance applicable standards regulations requirements typically developed enforced perindustry basis acknowledgment industry’s risks best practices 54 Complying standards regulations part engineering work many years standards apply manufacturing process eg ISO 9001 quality standard 11 Others generic development eg ISOIECIEEE 90003 10 Still others contextualized risk profile usage context eg ISO 26262 13 IEC 61508 9 describe standards safetycritical systems 54 US HIPAA law Health Insurance Portability Accountability Act describes privacy standards handling medical data 7 US FERPA law Family Education Rights Privacy Act describes privacy standards handling educational data 5 Although regulations new eg FERPA dates 1974 HIPAA 1996 IEC 61508 1998 engineering teams still struggle comply 34 40 43 75 212 OpenSource study focuses GDPR compliance opensource reader may surprised regulatory compliance factor opensource development opensource licenses MIT 3 Apache 8 GNU GPL 6 disclaim legal responsibility example MIT license common license GitHub 27 states “the provided ‘as is’ without warrantyauthors liable claim damages liability” However users developers opensource may desire regulatory compliance note three examples 1 majority opensource developed commercial use 47 may require standards regulatory compliance 108 2 Users opensource components supply chains 52 83 may request compliance requirements web cookies developers may service requests 3 Users may extend opensource undertake compliance analysis 99 Standards IEC 61508–Part 3 include provisions 60 Opensource longer minor player commercial engineering Multiple estimates 
suggest opensource components comprise majority many applications 47 82 2023 survey ∼1700 codebases across 17 industries Synopsys found opensource 96 codebases reported average contribution 75 code codebase 101 therefore important understand opensource development considers nonfunctional requirements regulatory compliance 22 Privacy Regulations Especially GDPR 221 Consumer Privacy Laws §21 discussed standards regulatory requirements affect products based industry Recently new kind regulation begun affect consumer privacy laws prominent example law European Union’s General Data Protection Regulation EU GDPR enacted 2016 enforceable beginning 2018 Examples United States include California Consumer Privacy Act CCPA enacted 2018 Virginia Consumer Data Protection Act CDPA enacted 2021 Similar legislation considered 100 governments 59 105 222 General Data Protection Regulation GDPR General Data Protection Regulation GDPR 12 protects personal data European Union EU citizens regardless whether data collection processing based EU law implications entities interact personal data EU citizens divided data subjects data controllers data processors 45 Data subjects individuals whose personal data collected Data controllers entities —organization company individual otherwise — control responsible personal data Data processors entities process data data controllers GDPR grants data subjects rights personal data providing guidelines requirements data controllers processors understand properly handle data GDPR compliance complex engineers consequential organizations Data controllers processors commonly use eg controller’s mobile app transmits data backend service processors subsequently access update database teams must determine appropriate data policies update systems comply validate eg incorporating cookie consent notices websites provide users informed consent 106 Anticipating lengthy compliance process EU enacted GDPR 2016 made enforceable 2018 allowing two years corporations 
prepare 1 Companies US UK alone invested 9 billion GDPR compliance 110 December 2022 many use manual compliance methods compliant 14 Noncompliance costly thousands distinct fines imposed noncompliant data controllers processors exceeding €25 billion 15 Although GDPR compliance affects processes data EU citizens opensource components comprise majority many applications process data 47 82 101 best knowledge prior research impacts GDPR compliance opensource
::::
3 METHODOLOGY
31 Data Availability Research Questions
§2 described range privacyrelated standards regulations noted little study effect requirements opensource engineering practice address gap need data Table 1 estimates availability engineering data associated requirements two common metrics number posts Stack Overflow number pull requests GitHub
Table 1
Privacy Law   Year   Stack Overflow   GitHubPRs
GDPR          2016   2058             64 K
HIPAA         1996   725              5 K
CCPA          2018   96               1 K
FERPA         1974   35               254
CDPA          2021   7                19
PIPEDA        2000   5                31
Based data scoped study EU’s GDPR opensource hosted GitHub currently popular hosting platform OSS answer four research questions
RQ1 GDPR compliance influence development activity OSS projects
RQ2 OSS developers perceive fulfilling GDPR requirements
RQ3 GDPR concepts OSS developers find challenging implement
RQ4 OSS developers assess GDPR compliance
analyzed data quantitative qualitative sources surveying opensource developers mining OSS repositories GitHub present obtained analyzed data source next integrate data answering RQ1 RQ2 use survey data alone answer RQ3 RQ4
32 Data Source 1 Developer Survey
explore impact implementing GDPR policies OSS development distributed online survey opensource developers data informed answers RQs used fourstep approach motivated framework analysis methodology 90 policy research collect analyze data second phase experiment overview process presented Table 2 Institutional Review Board IRB provided oversight
321 Step 1 Pilot Study Data Familiarization
formulate initial thematic framework qualitative analysis conducted semistructured pilot interviews OSS developers n 3 prior work explored perceptions GDPR compliance OSS pilot interviews gave us insight developers’ perceptions experiences implementing GDPR concepts context opensource development Two subjects contributed PRs dataset third personal contact wide range opensource development experience 1 year 20 years Interviews transcribed using Otterai coded two researchers inform survey Thematic
analysis pilot interviews provided insight informed survey questions participants highlighted challenges implementing GDPR requirements opensource One participant worked large corporation outlined differences GDPR compliance company OSS namely 1 approaches used assess whether compliance implemented correctly 2 access legal teams two participants discussed impact GDPR noting privacy benefits well challenges OSS developers face implementing GDPR requirements assessing compliance findings informed survey 322 Step 2 Survey Design survey consisted openended short answer questions seeking details GDPR implementation experiences context opensource development used pilot study interview results identify topics focus survey Based interviews asked perceived impact GDPR data privacy difficult concepts implement assess GDPR compliance survey instrument supplemental material 323 Step 3 Participant Recruitment distributed survey three rounds first round emailed sample 98 developers authored commented GDPRrelated pull requests publicly available email addresses received 5 responses ie 5 response rate second round made broader calls participation Twitter Reddit received 44 responses 2 indicated experience implementing GDPR compliance survey respondents rounds entered drawing two 100 Amazon gift cards months undertook third round redistributing survey additional 235 GitHub users GDPR implementation experience authored GDPRrelated pull requests dataset offered individual compensation 10 gift card encourage participation received 9 responses 4 response rate total data 56 survey participants 14 direct GitHub contacts 42 Twitter Reddit Table 2 Overview sample questions pilot interview study survey designanalysis framework analysis approach used Data Source 2 final column notes interrater agreement score themes using kappa score prior reaching agreement Interview Question Codes Survey Question Codes kappa meaningful impact believe GDPR data security privacy data privacy rights users 
data collection impact believe GDPR similar data privacy regulations data security privacy data privacy data processing data collection insufficient information data breach fines 0736 GDPR concepts find difficult frustrating implement None data minimization embedded content GDPR concepts find difficult frustrating implement privacy design data minimization cost data processing user experience data management security risks None lawfulness dispute resolution time right erasure 0929 specifically seek legal consultation GDPRrelated issues affect development process YesNo effect negative effect time specifically seek legal consultation GDPRrelated issues affect development process YesNo NA effect positive effect negative effect cost time data storage data processing 0514 development projects frequently consult legal team impact development processes assess GDPR compliance projects Yes legal consultation privacy design data minimization development projects consulted legal team assess GDPR compliance projects Yes legal consultation accountability system online resources selfassessment data management none NA 0668 — — implementing GDPR concepts compliance impacted development process way yesnomaybe Please explain positive impact logging privacy design negative impact cost data management security impact 0860 participants median approximately 5 years OSS development experience avg 59 6 years general industry experience avg 77 Participants reported contributing variety OSS projects Mozilla Wordpress Fedora Moodle Ansible Flask Django Kubernetes PostGreSQL OpenCV GitLab Microsoft Cognitive Toolkit 324 Step 4 Data Analysis analyze survey results used open coding approach Two researchers independently performed manual inspection responses–highlighting keywords categorizing responses based predefined themes derived pilot study new themes arose coders discussed agreed upon adding new theme coders came together merge individual results Finally used Cohen’s kappa kappa calculate 
interrater agreement see Table 2
33 Data Source 2 GDPR PRs GitHub
collected data concerning GDPR compliance analyzing pull requests GitHub repositories Pull requests mechanism GitHub allow developers collaborate opensource repositories involving code contributions developers reviewed merged source code 48
331 GDPR nonGDPR PRs
used GitHub REST API search GDPRrelated pull requests, pull requests returned GitHub API’s default search query string “GDPR” Manual inspection suggested results typically Englishlanguage PRs related GDPR data privacy regulatory compliance Using method collected GDPRrelated PRs created April 2016 GDPR adopted European Parliament January 2024 removed content submitted users “bot” username 16 designated bot type according GitHub API avoid PRs generated automated systems resulted 15731 GDPRrelated pull requests across 6513 unique GitHub repositories comparison also collected random sample 15731 pull requests created repositories April 2016 mention “GDPR” call nonGDPRrelated pull requests studied repositories median 14 stars avg 1635 11 forks avg 416 727 commits avg 8997 172 PRs avg 1425 15 contributors avg 59 suggesting popular active repositories distribution PRs across repositories GDPRrelated nonGDPRrelated datasets summarized Table 3
3httpsdocsgithubcomengraphqlreferenceobjectsbot
Table 3 Distribution PRs Datasets
Dataset   min   50ile   75ile   90ile   max
GDPR      1     1       2       3       956
nonGDPR   1     2       10      34      203
332 Measuring Development Activity
analyze GDPR’s impacts collected development activity metrics 49 per pull request
• Comments total number comments
• Active time amount time PR remained active merged closed
• Commits total number commits
• Additions number lines code added
• Deletions number lines code removed
• Changed files total number modified files
• Status outcome PR merged closed open
selected metrics analyze development activity specifically derive coding code review tasks pull requests compared distributions metrics GDPRrelated nonGDPRrelated PRs using MannWhitney U test compare nonparametric ordinal data datasets 76 control multiple comparisons dataset calculate adjusted pvalues using BenjaminiHochberg correction 30 measure effect size r significant results using Cohen’s 39
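The comparison procedure described above (a Mann-Whitney U test per metric, Benjamini-Hochberg adjustment across metrics, and an effect size r) can be sketched as follows. This is a minimal standard-library illustration, not the authors' analysis script. It assumes the two-sided normal approximation with tie correction, assumes the common convention r = |Z| / sqrt(N) for the effect size, and uses made-up sample values; all function names are hypothetical. In practice a library routine such as `scipy.stats.mannwhitneyu` would normally replace the hand-written test.

```python
import math
from statistics import NormalDist


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U (normal approximation, tie-corrected).

    Returns (U for sample x, p-value, effect size r = |Z| / sqrt(N)).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = rank_with_ties(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every observation identical: no evidence of a difference
        return u1, 1.0, 0.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p, abs(z) / math.sqrt(n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up per-PR metric values (GDPR sample, non-GDPR sample), for illustration only.
metrics = {
    "comments": ([1, 3, 2, 5, 4], [1, 1, 2, 1, 3]),
    "commits": ([2, 4, 3, 6, 2], [1, 2, 1, 1, 2]),
}
results = {name: mann_whitney_u(g, non_g) for name, (g, non_g) in metrics.items()}
adjusted = benjamini_hochberg([p for _, p, _ in results.values()])
for (name, (u, p, r)), p_adj in zip(results.items(), adjusted):
    print(f"{name}: U={u:.1f} p={p:.4f} p_adj={p_adj:.4f} r={r:.2f}")
```

Adjusting the p-values across all metrics of a dataset, rather than testing each at 0.05 in isolation, is what keeps the false discovery rate controlled when several metrics are compared at once.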
::::
333 Measuring Developer Perception augment survey results applied sentiment analysis—a technique automatically infer sentiment natural language—on title body commit messages review comments discussion comments pull requests datasets examine developer perceptions GDPR compliance Prior studies similarly inferred developer sentiment emotion GitHub activity including PR discussion comments 87 review comments 57 commit messages 50 bodies 84 technique sometimes negative results engineering contexts 64 use exploratory work proxy obtain preliminary insights developers’ sentiments regarding GDPR compliance OSS followed standard NLP preprocessing steps 69 1 removed botgenerated content using process described Section 331 2 removed nonsentiment material hyperlinks mentions “username” 3 tokenized text using Natural Language Toolkit NLTK tokenize library 4 converted tokens lowercase removed punctuation 5 removed stopwords “but” “or” nltkcorpus library 6 lemmatized text ie reducing words base form eg “mice” becomes “mouse” 23 using WordNetLemmatizer nltkstem library 7 normalize data removing meaningless tokens SHA hash values commits nonstandard English words words contain numerical values ie “3d” 98 preprocessing data left 15731 titles 14515 bodies 15217 commit messages 4922 review comments 4862 discussion comments across GDPRrelated pull requests compared nonGDPRrelated PRs 15731 titles 13718 bodies 15652 commit messages 3427 review comments 3165 discussion comments perform sentiment analysis use three stateoftheart models LiuHu 56 VADER 58 SentiArt 63 fed preprocessed textual data model provided compound sentiment scores use ttest statistically analyze sentiment across datasets Moreover aim assess impact GDPR developer sentiment time accomplish divided GDPR nonGDPR PRs 3month segments based creation date PR performed sentiment analysis binned data observe whether developer sentiments manifest OSS interactions lifecycle GDPR regulation — initial adoption 2016 enforcement 2018 
present combined preprocessed textual elements title body commit messages review comments discussion comments observe overall trends PR communications compare nonGDPR data baseline sentiment developer communications projects studied
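The text-cleaning steps and the 3-month binning described above can be sketched as follows. This is a simplified standard-library stand-in, not the authors' pipeline: the study uses NLTK for tokenization, stopword removal, and lemmatization, whereas this sketch uses regular expressions and a tiny illustrative stopword set and omits lemmatization. It also assumes that "3-month segments" means calendar quarters, and it treats the compound sentiment scores as given inputs rather than computing them. All names and sample values are hypothetical.

```python
import re
from collections import defaultdict
from datetime import date
from statistics import mean

# Illustrative subset only; the study draws its stopwords from NLTK.
STOPWORDS = {"but", "or", "and", "the", "a", "an", "to", "of", "is", "this"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w[\w-]*")
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def preprocess(text):
    """Strip links and mentions, lowercase, drop stopwords and digit-bearing tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    # Tokens containing digits cover most commit hashes and words such as "3d";
    # an all-letter hash would slip through this simple filter.
    return [t for t in tokens
            if t not in STOPWORDS and not any(c.isdigit() for c in t)]


def quarter_key(created):
    """Map a date to its (year, quarter) bin."""
    return (created.year, (created.month - 1) // 3 + 1)


def mean_score_by_quarter(records):
    """Average compound sentiment per quarter.

    records: iterable of (creation date, compound score) pairs, where the
    score would come from a sentiment model and is simply supplied here.
    """
    bins = defaultdict(list)
    for created, score in records:
        bins[quarter_key(created)].append(score)
    return {key: mean(scores) for key, scores in sorted(bins.items())}


sample = "Add GDPR consent banner, see https://example.com @alice commit a1b2c3d"
print(preprocess(sample))
print(mean_score_by_quarter([
    (date(2016, 4, 14), 0.2),
    (date(2018, 5, 25), 0.5),
    (date(2018, 6, 1), -0.5),
]))
```

Running the same binning on the non-GDPR sample gives the baseline series against which the GDPR series can be read quarter by quarter.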
::::
4 RESULTS interested understanding impact GDPR implementations opensource analyzing development activity developer perceptions including challenges implementation assessment compliance work answer research questions using multiple sources—analyzing GitHub repositories surveying opensource developers RQ1 RQ2 report views survey GitHub measurements RQ3 RQ4 use data survey
::::
41 RQ1 Development Activity question RQ1 GDPR compliance influence development activity OSS projects
::::
411 Survey surveyed 56 OSS developers understand impact GDPR implementations development activity participants n 41 73 responded “Yes” question regarding impact implementing GDPR concepts development processes indicating data privacy compliance effects opensource development asked elaborate 23 developers provided examples development impacts related GDPR Data Management 11 participants mentioned GDPR requirements related data management impact development activity notably increasing development efforts instance responses indicated handling personal data P17 anonymization P19 managing data controllers P21 data recipients P23 implementing functionality limit collection personal data P26 monitoring data subjects EU P28 impacted development processes P53 also added “we separate clear way sensitive data data” exemplifying effort needed implement compliant data processing OSS Time Costs Five participants mentioned GDPR compliance increases development time costs OSS example regarding time respondents said “it slow development cycle” P54 “we lost complete year ready” P56 costs participants said “budgets soared” P5 “costs production go cost consequence data breach” P46 Design Three participants also noted effects GDPR compliance design structure products example P54 responded “we check whether comply GDPR every time draft new design” P55 added “the design systems incorporates concept needing remove PII fact” P21 explained GDPR compliance reduced quality application’s design—replying “the principle minimum scope observed”—indicating potential unnecessarily extended scopes variables code 36 Organization Three participant responses embodied negative effects data privacy regulations organization stating GDPR “major impact” requiring “an overhaul management program priorities” P1 P45 highlighted “making sure follow privacy design” challenging GDPR compliance OSS development One participant also mentioned additional steps verify implementations affected development stating need 
make additional review GDPR consultants functionality related users’ data P53 Benefits One participant mentioned benefits development team processes regarding implementation GDPR concepts stating helped highlight things considered ensuring logging functionality access restrictions place P1 However majority responses indicate GDPR compliance often increases development efforts incurs negative impacts opensource developers
::::
412 Pull Request Metrics observe impact GDPR compliance OSS compared metrics GDPR nonGDPR related PRs Table 4 presents results Using MannWhitney U test found statistically significant differences GDPR nonGDPR PRs number comments active time number commits lines code added lines code deleted number modified files also calculate effect size results indicates incorporating changes related GDPR major impact development work leading increased discussions developers longer review times code commits higher code churn observed significant differences exist pull request metrics GDPR nonGDPR PRs calculated effect sizes small 71 indicating low practical differences groups Yet findings support survey results opensource developers purporting GDPR compliance efforts affect OSS development Finding 1 Developers report implementing GDPR compliance negatively affects development processes—citing cost time data management concerns Finding 2 PRs related GDPR compliance significantly development activity coding comments additions deletions files changed review comments active time tasks
::::
Table 4 GDPR G vs NonGDPR nonG GitHub Activity Metrics
Characteristic      Type   Median   pvalue   U      r
Comments            G      1        00001    14E8   009
Comments            nonG   1
Active time days    G      41805    00001    14E8   014
Active time days    nonG   178
Commits             G      2        00001    14E8   004
Commits             nonG   1
Additions           G      57       00001    15E8   005
Additions           nonG   19
Deletions           G      7        00001    13E8   005
Deletions           nonG   4
Changed files       G      4        00001    14E8   003
Changed files       nonG   2
denotes statistically significant results pvalue 005
::::
42 RQ2 GDPR Perceptions question RQ2 OSS developers perceive fulfilling GDPR requirements
::::
421 Survey asked participants perceptions impact GDPR regulations privacy participants responded question n 25 negative opinions GDPR Three participants neutral eg NA P4 summarize positive negative perceptions next Negative Perceptions Despite utility data privacy regulations 22 participants reported negative perceptions GDPR responses primarily focused three issues cost organizations enforcement costs respondents noted implementing GDPR requirements expensive burdensome Participants said compliance costly many companies P16 expensive P24 cost protection go cost consequence data breachGDPR isn’t worth time P46 P55 also highlights general major costs companies sizes regarding GDPR implementations organizations participants reported negative impact GDPR companies organizations mentioned GDPR compliance weakens small mediumsized enterprises P15 threatens innovation P18 fails meaningfully integrate role privacyenhancing innovation consumer education data protection P23 order safer risky useful functionality removed P52 P46 added GDPR lot headachejobs lawyers expense people trying solve real problems enforcement one subject said large gap GDPR enforcement among member states P17 another observed trendis increase number times amount fines P18 Similarly P49 described GDPR big hammer unsure necessarily increased security privacy point Positive Perceptions Eight participants positive perceptions GDPR generally stating GDPR enhances data privacy users example participants said risk incurring paying hefty fines made companies take privacy security proactively P30 GDPR brings awareness importance privacy P45 data integrity ensured P47 customers delete data quite easily P54 Participants also appreciated increased accountability corporations safeguarding users’ data—for example one participant stated GDPR data protection usually considered afterthought outright joke Nowadays companies least consider wrong violating data protection laws rather accident noone even thought P50 
responses reflect intentions GDPR — safeguard rights users data online
::::
422 Sentiment Analysis investigated sentiment developers implementing GDPR concepts analyzing PR titles commit messages review comments discussion comments bodies overall results Table 5 anticipated higher percentage negative comments GDPRrelated pull requests However find evidence GDPRrelated PRs less favorable sentiments developers fact found often positive sentiments nonGDPRrelated PRs with two three models LiuHu VADER indicating statistically significant difference GDPR nonGDPR sentiment speculate two explanations First nonGDPRrelated PRs represent broad range code contributions could address number issues Second limited capabilities sentiment analyzer example two negative commit messages nonGDPR pull requests said “obsolete” “fatal” common terms art maintenance tasks 89 113 eg “fix fatal error” also observed variation beginning end dataset collection period significant variation sentiment time see Figure 1
Table 5 GDPR G vs NonGDPR nonG Sentiment Analysis
Test       Type   Mean   Variance   pvalue
LiuHu      G      043    027        p 00001 405 r 022
LiuHu      nonG   004    028
VADER      G      044    004        p 00001 647 r 002
VADER      nonG   021    001
SentiArt   G      039    001        p 01399 110 r 001
SentiArt   nonG   036    0002
denotes statistically significant results pvalue 005
Figure 1 Longitudinal GDPR G NonGDPR nonG Sentiment Analysis Data grouped GDPR nonGDPR data 3month segments used 3 sentiment models model GDPR data plotted color filled marker nonGDPR data color hollow marker general trend sentiment GDPR data moderately positive positive nonGDPR data
Nonetheless manual inspection negatively scored content showed OSS developers expressing frustration GDPR compliance instance one title commit message described GDPRrelated changes “avoid lawsuits mentioning cookies thing” 91 Another title states adding “just enough EULA end user license agreement get banned” 31 Similar frustrations shared PR body “GDPR stuff” adding changes “display annoying cookies banner” 104 Discussion comments “will conflict GDPR” 24 also highlight OSS developers’ confusion GDPR 
requirements Finding 3 Despite nominal advantages developers negative perceptions GDPR implementation Finding 4 found developers express negative sentiments GDPR compliance PR discussions Finding 5 Sentiment related GDPR compliance appears stable time 43 RQ3 Implementation Challenges question RQ3 GDPR concepts OSS developers find challenging implement survey data observed three common challenges data management data protection vague requirements Data Management 11 developers responded processing storing users’ data according GDPR requirements challenging concept implement example participants mentioned challenges implementing “data protection” P24 handling “personal data” P34 “exchange documents containing personal data” P32 “improper storage” 30 user data “knowing info cannot accessed saved” P49 particular four participants mentioned users’ right erasure—or obligation data controllers delete users’ data upon request “without undue delay” 4—as complicated requirement implement example P53 responded “it’s always easy enough implement data processing way it’s anonymized user would like data erased able continue processing results based user data anonymous way”—describing complexity requirement Data Protection Five participants mentioned security factors challenge GDPR compliance instance participants concerned “data protection” “other security concerns” P24 “leaks” P27 fact entities “the ability steal data” P28 P55 noted challenges handling securing data “central databases data may relied many loosely connected applications systems” responses highlight difficulties implementing mechanisms safeguard users’ data Vague Requirements 10 survey respondents highlighted lack clear requirements biggest challenge GDPR compliance OSS example one participant mentioned GDPR “is pretty vague” lack “standard format” P54 Another described confusion knowing “how long data retained” “what Personalsic Identifiable Information”—adding “lack clarity regulationssic leads confusion” P52 
Moreover P48 highlighted lack company understanding GDPR requirements makes compliance difficult Beyond clear categories also received wide range responses including “lawfulness dispute resolution” P47 conflict “individual privacy public’s right know” P21 “rush regulate” P28 P27 mentioned challenges user experiences stating “users endure invasive popups” P1 noted challenges evolve lifetime stating “At beginning privacy design default middle end data minimization transparency” main challenges Based challenges implementation participants described difficulties limiting functionality—eg “knowing interacting EU citizens” P49 “more 1000 news websites European Union gone dark” P15 Meanwhile P17 mentioned difficulties implementing GDPR requirements dataintensive programming domains “many GDPR’s requirements essentially incompatible big data artificial intelligence blockchain machine learning” challenges motivate new resources help developers overcome problems related GDPR implementation compliance Finding 6 management protection user data vague requirements key challenges opensource developers face implementing GDPR requirements
::::
44 RQ4 Compliance Assessment question RQ4 OSS developers assess GDPR compliance found three kinds responses related compliance assessment consulting legal counsel referencing compliance resources selfassessment Compliance Legal Counsel survey results 15 OSS developers reported consulting legal teams GDPR compliance also interested exploring impact seeking legal counsel GDPR compliance OSS development processes Seven participants experience seeking legal consultations noted positive impact development activity P6 P13 P14 P45 P53 P55 P56 Participants noted benefits seeking legal experts stating importance “consulting lawyers team seat table” P45 “clarifies requirements prevents misinterpretations” P55 allowed GDPR compliance “implemented rather easily” P56 However participants n 9 experience seeking legal counsel lamented impact stating decreased development productivity “it slows things code reviewed objectives revised” “it impacted approach SDLC” P1 “it’s bit headache” P24 “it slowed us downwas mostly box ticking exercise” P51 “it interrupted development required” P49 Respondents also bemoaned costs working legal teams stating “for global open source legal advice would extremely expensive” P52 “opensource projects can’t afford even sustain maintainers even speaking legal teamLegal teams consulted corps want kill project” P47 P54 also noted legal experts found difficulties vagueness GDPR compliance replying “legal team struggles interpret comply GDPR lot backandforth change design many times” sum legal experts provide valuable insight data privacy regulations compliance developers often find interactions negatively impact development processes Compliance Resources assess GDPR compliance three participants mentioned variety resources One participant described formal training regulatory compliance “special training GDPR within company” P16 Another participant responded team uses “accountability system” P24 assess compliance Finally P15 noted using online resources 
help highlighted ineffectiveness stating “many articles Internet GDPR incomplete even wrong” Selfassessment developers mentioned largely responsible evaluating “legality” P18 “integrity confidentiality” P23 processing storage user data system P24 responded developers “consider whether really need data collect” P38 advised “get consent order” P53 noted impact development teams stating GDPR implementations “took us significant amount time due several rounds architecture review” P18 added “really good way” evaluate compliance Finding 7 Developers often consult legal experts validate GDPR compliance relying resources compliance training accountability systems online resources selfassessed data management Finding 8 Participants experience interacting legal teams provided mixed perceptions feeling provided valuable insight hindered development processes
::::
5 DISCUSSION FUTURE WORK results demonstrate GDPRrelated code changes major impact OSS development significantly increasing development activity regards number lines code added number commits included PRs–indicating increased effort code contributions code review activities developers §412 found GDPR compliance provides wide range challenges OSS development §43 developers often assess compliance without help legal policy experts §44 findings posit implementing GDPR compliance challenging activity OSS developers recognize many stakeholders involved adhering data privacy legislation instance policymakers also play role data privacy compliance 112 Data privacy regulations GDPR beneficial protecting rights data users online However noticed developers complaining providing privacy people–holding negative perceptions GDPR policy general implementation end provide guidelines enhance data privacy regulations development processes reduce negative effects policy compliance OSS
::::
51 Improving Data Privacy Regulations 511 Provide Clear Requirements found developers struggled implement GDPR concepts §43 Moreover respondents reported consulting legal experts provide insight policies assess compliance projects §44 Thus development teams forced evaluate system Yet participants complained understanding compliance difficult due ambiguity GDPR concepts instance “the procedure obtaining user consent information provided unclear” P25 Prior work suggests ambiguity main challenge requirements engineering 28 incomplete requirements increase development costs probability failure 38 improve program specifications researchers explored variety techniques instance Wang et al explored using natural language processing automatically detect ambiguous terminology requirements 111 Similar techniques could applied regulations GDPR notify policymakers unclear language clarify requirements engineers Another way improve clarity requirements involve developers policymaking process Verdon argues good policy must “understandable audience” 109 p 48 yet results show developers confused GDPR requirements Prior work shows collaboration policy makers practitioners improves policies domains public health 37 education 61 Thus developers incorporated policymaking process provide input impact implementing complying policies concerning development data privacy regulations 512 Policy Resources survey results show OSS developers face challenges implementing GDPRrelated changes §43 Participants also found legal consultations negatively affect development processes §44 report existing resources largely ineffective primarily relying selfassessment within development team one participant mentioned receiving formal training GDPR compliance P16 end OSS developers largely resort implementing evaluating compliance efforts “insufficient information” P26 Prior work also outlines issues developers security policies noting lack understanding programmers 109 Based findings posit OSS development 
benefit novel resources educate developers policies implementation support compliance policymakers provide resources guides online forums provide information data privacyrelated concepts accessible manner guidelines also reduce effects GDPR compliance code review tasks providing specialized expertise correct understanding reviewers 85 Yet limited online developer communities focused seeking help data privacy policy implementation Popular programmingrelated QA websites eg Stack Overflow frequently used developers ask questions seek information online 86—and used discussions data privacy policy implementation see Table 1 However developers way verify correctness responses also become obsolete time Zhang et al recommend automated tools identify outdated information responses development concepts API libraries programming languages 116 similar approach used keep responses regarding GDPR compliance uptodate accurate 52 Improving Development Processes 521 Privacy Design Participants reported challenges implementing GDPR compliance §43 negative effects development practices §411 Moreover GitHub analysis found GDPRrelated changes necessitated significantly time effort ie comments commits etc developers implement review PRs see Table 4 However compliance required organizations avoid “paying hefty fines” P30 Researchers investigated techniques streamline incorporation privacy development processes instance Privacy Design PBD development approach make privacy “default mode operation” 35 P50 mentioned cultivating “a privacyrespecting mindset long GDPR came about” avoided negative impacts development processes made effort required “quite minimal” However numerous participants noted burden implementing GDPR requirements one survey participant particular P1 highlighting prioritizing privacy development processes “requires overhaul” Additionally PBD benefit GDPR compliance efforts Kurtz et al note scarcity research area note particular challenges PBD GDPR implementations ensuring 
third party libraries also adhere privacy principles 70 PBD effective new projects starting scratch 102 yet may illequipped existing projects complying new changing data privacy regulations Anthonysamy et al outline limitations current privacy requirements solve present issues may differ regulations policies future 25 work needed explore tools processes support data privacy mature projects One solution could partial gradual approach compliance instance programming languages eg Typescript support gradual typing selectively check type errors code 93 Similarly research formal methods explored supporting gradual verification programs 26 Thus gradually introducing privacy OSS help reduce efforts related GDPR compliance opposed overhauling development processes prioritize privacy 522 Automated Tools found GDPR compliance major impact OSS development significantly increasing coding reviewing tasks PRs GitHub repositories see Table 4 Developers responded survey also indicated impact GDPR compliance source code noting data privacy regulations always need P4 violate principle minimum scope P21 indicates difficulty developers validate projects GDPR one participant responding “no good way” assess compliance P18 findings point increased burden effort OSS developers implement review GDPR requirements comply data privacy regulations avoid penalties noncompliance eg losing market share end posit automated tools reduce burden GDPR implementation efforts One participant mentioned using tool “accountability system” P24 help assess compliance—however provide details system findings RQ1 §41 show GDPRrelated pull requests significantly coding involved consisting commits lines code added code contributions well requiring significantly comments time reviewing processes Thus systems support data privacy implementation tools review policyrelevant code needed streamline compliance Ferrara colleagues present static analysis techniques support GDPR compliance 42 tools support review processes 
assessing implementation changes Prior work suggests static analysis tools reduce time effort code reviews 94 Future systems could also provide automated feedback developers reviewers data privacy regulation compliance instance using NLP techniques 17 rulebased machine learning approaches 51 automatically summarize requirements verify compliance 53 Directions Based results observe several avenues future work First plan investigate data sources explore GDPR compliance opensource projects example plan mine relevant queries Stack Overflow gain insight challenges information needs developers implementing GDPR policies also examine answers observe developers respond instance online discussions developers regarding policies often use disclaimers acronyms “IANAL” “NAL” indicate “I lawyer” offering advice answering questions related legal frameworks Without legal expertise anticipate difficult OSS developers offer guidance seek help complying data privacy regulations—motivating need novel approaches support regulation adherence compliance assessment Moreover aim engage policymakers understand perspectives data privacy policies challenges developers face implementing collect qualitative insights politicians individuals authority develop policies explore methods support implementation privacy laws Finally aim extend work investigate impact broader technologyrelated policies opensource development practices—for instance investigating impact alternative data privacy regulations ie CCPA CDPA well legal frameworks impact development maintenance current imminent legislation regarding artificial intelligence governance
::::
6 RELATED WORK note two lines related work characterizations stakeholder perspectives data privacy regulations technical methodological approaches regulatory compliance Stakeholder perspectives Research investigated perspectives GDPR stakeholders data privacy regulation compliance Sirur colleagues examined organizational perceptions feasibility implementing GDPR concepts finding larger organizations confident ability comply smaller companies struggled breadth ambiguity GDPR requirements 95 Earp et al surveyed users show Internet privacy protection goals policies online websites meet users’ expectations privacy 41 Similarly Strycharz et al surveyed consumers uncover frustrations negative attitudes related GDPR 100 work focuses perceptions developers responsible implementing code changes comply data privacy regulations perspective engineers regulatory stakeholders van Dijk colleagues provide overview transition privacy policies selfimposed guidelines developers legal frameworks legislation 107 Alhazmi interviewed developers uncover barriers adopting GDPR principles—finding lack familiarity precedented techniques useful help resources prioritization employers paper also found developers generally prioritize privacy features projects focusing instead functional requirements prevent compliance 20 Similarly researchers interviewed senior engineers understand challenges implementing general privacy guidelines indicating frustration legal interactions nontechnical aspects requirements 29 Finally Klymenko et al interviewed technical legal professionals investigate measures data privacy compliance GDPR implementation—noting lack understanding need interdisciplinary solutions 66 papers take similar approaches research ultimately goals questions distinct since specifically interested perspective opensource developers Implementing verifying GDPR compliance Prior work explored approaches implement verify GDPR compliance instance Martín et al recommend Privacy Design methods 
tools GDPR compliance 78 Shastri colleagues introduce GDPRBench tool assess GDPR compliance databases 92 Li et al investigated automated GDPR compliance part continuous integration workflows 74 AlSlais conducted literature review develop taxonomy privacy implementation approaches guide GDPR compliance 19 Finally Mahindrakar et al proposed use blockchain technologies validate personal data compliance 77 Rather proposing new engineering methods measures tools related GDPR work takes empirical perspective understand current practices
::::
7 THREATS VALIDITY discuss three types threats validity Construct mining OSS repositories defined construct “GDPRrelated pull requests” based presence string “GDPR” PRs may incorrectly refer GDPR false positives others may perform GDPRrelevant changes without using acronym false negatives also biased towards Englishspeakers acronym differs languages mitigate nonEnglish GDPRrelated PRs polluting nonGDPRrelated dataset manually inspected PR titles various iterations GDPR languages including “RGPD” French Spanish Italian “DSGVO” German “AVG” Dutch However included GDPRrelated dataset since focus PRs English analysis used offtheshelf NLP techniques assess sentiment inheriting biases methods eg misinterpreted connotations homonyms “mock” addition parametric models sentiment analysis based defined dictionary values cannot detect certain aspects human communication sarcasm Prior work also suggests sentiment analysis tools inaccurate engineering contexts 64 However use gain preliminary insights developers’ perceptions GDPR compliance OSS Internal perceive internal threats study provides characterizations rather causeeffect measurements External several threats generalizability findings inherit standard perils mining opensource 65 focus opensource available GitHub omits code hosting platforms GitLab may used different populations developers doubt results generalize commercial since development organizations directly face consequences GDPR noncompliance consider effect GDPR prominent privacy law hence available data regulations may different effects Specifically conjecture differences engineering impact general data privacy regulations GDPR CCPA industryspecific data privacy regulations HIPAA FERPA general regulations may necessarily ambiguous
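The construct described above can be sketched as a small keyword filter. This is an illustrative approximation only, assuming a case-insensitive match on the string "GDPR" (the source does not state case handling) and a hypothetical "manual_review" label for pull requests that mention only a non-English acronym. The function and label names are made up for this sketch.

```python
import re

# Assumption: match "GDPR" regardless of case; the source only says the
# construct is based on the presence of the string.
GDPR_PATTERN = re.compile(r"gdpr", re.IGNORECASE)

# Non-English acronyms listed in the text. They are matched case-sensitively
# and on word boundaries because lowercase "avg" commonly means "average".
NON_ENGLISH_ACRONYMS = ("RGPD", "DSGVO", "AVG")
NON_ENGLISH_PATTERN = re.compile(r"\b(?:" + "|".join(NON_ENGLISH_ACRONYMS) + r")\b")


def classify_pr(title, body=""):
    """Label a pull request as 'gdpr', 'manual_review', or 'non_gdpr'."""
    text = f"{title}\n{body}"
    if GDPR_PATTERN.search(text):
        return "gdpr"
    if NON_ENGLISH_PATTERN.search(text):
        # Candidate non-English GDPR change: flag it so a person can keep it
        # out of the non-GDPR comparison set after inspection.
        return "manual_review"
    return "non_gdpr"


if __name__ == "__main__":
    # Hypothetical pull request titles, for illustration only.
    for title in (
        "Add GDPR cookie banner",
        "Ajout du bandeau RGPD",
        "Fix fatal error in parser",
        "Compute avg latency per request",
    ):
        print(f"{classify_pr(title):14s} {title}")
```

The sketch shows both failure modes named in the text: a PR that mentions "GDPR" in passing is still labeled "gdpr" (false positive), and a privacy change that never uses the acronym falls through to "non_gdpr" (false negative).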
::::
8 CONCLUSIONS Data privacy regulations introduced prevent data controllers misusing users’ information protect individuals adhere regulations developers charged complex task understanding policies making modifications source code applications implement privacyrelated requirements work examines impact data privacy regulations development processes investigating code contributions developer perceptions GDPR compliance opensource results show complying data privacy regulations significantly impacts development activities GitHub evoking negative perceptions frustrations developers findings provide implications developers policymakers support implementation data privacy regulations protect rights human users digital environments
::::
9 DATA AVAILABILITY uploaded survey datasets data collection analysis scripts supplementary materials 2 IRB protocol allow us share individual survey responses
::::
10 ACKNOWLEDGMENTS Brown Brantly acknowledge support Virginia Commonwealth Cyber Initiative CCI REFERENCES 1 n httpsedpseuropaeudataprotectiondataprotectionlegislationhistorygeneraldataprotectionregulationen 2 n httpsanonymous4opensciencerGDPROSSImpactD77B 3 n MIT License httpsopensourceorglicensesMIT Accessed July 2023 4 n Right erasure ‘right forgotten’ httpsgdprinfoeuart17gdpr 5 1974 Family Educational Rights Privacy Act 1974 20 USC § 1232g 34 CFR Part 99 httpswww2edgovpolicygenguidepcoerpaindexhtml 6 1991 GNU General Public License version 2 Free Foundation httpswwwgnuorglicensesoldlicensesgpl20enhtml 7 1996 Health Insurance Portability Accountability Act 1996 Pub L 104191 110 Stat 1936 httpswwwgovinfogovcontentpkgPLAW104publ191pdfPLAW104publ191pdf 8 2004 Apache License Version 20 Apache Foundation httpswwwapacheorglicensesLICENSE20 9 2010 IEC 6150812010 Functional safety electricalelectronicprogrammable electronic safetyrelated systems – Part 1 General requirements International Electrotechnical Commission httpswebstoreiecchpublication5512 10 2014 ISO 900032014 engineering – Guidelines application ISO 90012015 computer International Organization Standardization httpswwwisoorgstandard59149html 11 2015 ISO 90012015 Quality management systems – Requirements International Organization Standardization httpswwwisoorgstandard62085h 12 2016 Regulation EU 2016679 European Parliament Council 27 April 2016 protection natural persons regard processing personal data free movement data repealing Directive 9546EC General Data Protection Regulation Official Journal European Union httpseurlexeuropaeulegalcontentENTXTuriCELEX 9546EC 13 2018 ISO 2626212018 Road vehicles – Functional safety – Part 1 Vocabulary International Organization Standardization httpswwwisoorgstandard68383html 14 2023 5th State CCPA GDPR Privacy Rights Compliance Research Report – Q4 2022 Cytrio httpscytriocomwpcontentuploads2023025thStateofCCPAGDPRComplianceReportFNL2pdf 15 2023 GDPR Enforcement 
Tracker – list GDPR fines Enforcement Tracker httpswwwenforcementtrackercom 16 Ahmad Abdellatif Mairieli Wessel Igor Steinmacher et al 2022 BotHunter approach detect bots GitHub Proceedings 19th International Conference Mining Repositories 6–17 17 AbdelJaouad Aberkane Geert Poels Seppe Vanden Broucke 2021 Exploring automated gdprcompliance requirements engineering systematic mapping study IEEE Access 9 2021 66542–66559 18 Saeed Akhlaghpour Farkhondeh Hassandoust et al 2021 Learning enforcement cases manage gdpr risks MIS Quarterly Executive 20 3 2021 19 Yaqoob AlSlais 2020 Privacy Engineering Methodologies survey 2020 International Conference Innovation Intelligence Informatics Computing Technologies 3ICT 1–6 httpsdoiorg1011093ICT5114620209311949 20 Abdulrahman Alhazmi Nalin Asanka Arachchilage 2021 I’m ears listening developers putting GDPR principles development practice Personal Ubiquitous Computing 25 5 2021 879–892 21 Reni Allan 2007 Reskilling compliance Inf Professional 4 1 2007 20–23 22 Fernando Almeida José Augusto Monteiro 2021 Exploring effects GDPR user experience Journal information systems engineering management 6 3 2021 23 Murugan Anandarajan Chelsey Hill Thomas Nolan 2019 Text preprocessing Practical text analytics Maximizing value text data 2019 45–59 24 Maythee Anegboonlap 2018 conflict GDPR httpsgithubcomReferaCandywoocommercereferacandypull24discussionr238153546 Github repository ReferaCandywoocommercereferacandy 25 Pauline Anthonysamy Awais Rashid Ruzanna Chitchyan 2017 Privacy requirements present future IEEEACM International Conference Engineering Engineering Society ICSESEIS IEEE 13–22 26 Johannes Bader Jonathan Aldrich Éric Tanter 2018 Gradual program verification Verification Model Checking Abstract Interpretation VMCAI Springer 25–46 27 Ben Balter 2015 Open source license usage Githubcom Github Blog httpsgithubblog20150309opensourcelicenseusageongithubcom 28 Muneera Bano 2015 Addressing 
challenges requirements ambiguity review empirical literature 2015 IEEE Fifth International Workshop Empirical Requirements Engineering EmpiRE IEEE 21–24 29 Kathrin Bednar Sarah Spiekermann Marc Langheinrich 2019 Engineering Privacy Design engineers ready live challenge Information Society 35 3 2019 122–142 30 Yoav Benjamini Yosef Hochberg 1995 Controlling False Discovery Rate Practical Powerful Approach Multiple Testing Journal Royal Statistical Society Series B Methodological 57 1 1995 289–300 httpwwwjstororgstable2346181 31 Ani Betts 2021 enough EULA get banned httpsgithubcomanaisbetssirenepull37 Github repository anaisbetssirene 32 Alex Bowyer Jack Holt Josephine Go Jefferies Rob Wilson David Kirk Jan David Smeddinck 2022 HumanGDPR interaction Practical experiences accessing personal data Proceedings 2022 chi conference human factors computing systems 1–19 33 Randolph E Bucklin Catarina Sismeiro 2009 Click Internet insight Advances clickstream data analysis marketing Journal Interactive marketing 23 1 2009 35–48 34 Noel Carroll Ita Richardson 2016 Softwareasamedical device demystifying connected health regulations Journal Systems Information Technology 18 2 2016 186–215 35 Ann Cavoukian 2009 Privacy design 2009 36 David Chisnall 2012 Go programming language phrasebook Addison Wesley 37 Bernard CK Choi Tikki Pang Vivian Lin et al 2005 scientists policy makers work together Journal Epidemiology Community Health 59 8 2005 632–637 38 Tom Clancy 1995 chaos report Standish Group 1995 39 Jacob Cohen 2013 Statistical power analysis behavioral sciences Routledge 40 Jose Luis de La Vara Markus Borg Krzysztof Wnuk Leon Moonen 2016 industrial survey impact evidence change impact analysis practice IEEE Transactions Engineering 42 12 2016 1095–1117 41 JB Earp AI Anton L AimanSmith WH Stufflebeam 2005 Examining Internet privacy policies within context user privacy values IEEE Transactions Engineering Management 52 2 2005 227–237 42 Pietro Ferrara Nicola Fausto Spoto et al 2018 
Static analysis GDPR compliance CEUR Workshop Proceedings CEUR Workshop Proceedings 1–10 43 Aaron J Fischer Brandon K Schultz Melissa CollierMeek et al 2018 critical review videoconferencing support school consultation International Journal School Educational Psychology 6 1 2018 12–22 44 Lucas Franke Huayu Liang Aaron Brantly James C Davis Chris Brown 2024 First Look General Data Protection Regulation GDPR Open Source Proceedings 2024 IEEEACM 46th International Conference Engineering Companion Proceedings Lisbon Portugal ICSE Companion ’24 Association Computing Machinery New York NY USA 268–269 httpsdoiorg10114536394783643077 45 GDPR 2018 Art 4 GDPR Definitions httpsgdpreuarticle4definitions 46 GDPR 2018 Art 83 GDPR General conditions imposing administrative fines httpsgdpreuarticle83conditionsforimposingadministrativefines 47 Github 2022 Octoverse 2022 state open source httpsoctoversegithubcom 48 Github 2023 Creating pull request httpshelpgithubcomenarticlescreatingapullrequest Github Help 49 Georgios Gousios Andy Zaidman 2014 dataset pullbased development research Conference Mining Repositories 368–371 50 Emitza Guzman David Azócar Yang Li 2014 Sentiment analysis commit comments Github empirical study Mining Repositories MSR 51 Rajaa El Hamdani et al 2021 combined rulebased machine learning approach automated GDPR compliance checking Eighteenth International Conference Artificial Intelligence Law 40–49 52 Nikolay Harutyunyan 2020 Managing open source supply chainwhy Computer 53 6 2020 77–81 53 Paul Hitlin Rainie Lee Kenneth Olmstead 2019 Facebook Algorithms Personal Data Pew Research Center httpswwwpewresearchorginternet20190116facebookalgorithmsandpersonaldata 54 Chris Hobbs 2019 Embedded development safetycritical systems CRC Press 55 Sebastian Holst 2017 GDPR liability development new law LinkedIn 2017 httpswwwlinkedincompulsegdprliabilitysoftwaredevelopmentnewlawsebastianholst 56 Minqing Hu Bing Liu 2004 Mining opinion features customer reviews AAAI 
Vol 4 755–760 57 Syed Fatiul Huq Ali Zafar Sadiq Kazi Sakib 2019 Understanding effect developer sentiment fixinducing changes exploratory study github pull requests 2019 26th AsiaPacific Engineering Conference APSEC IEEE 514–521 58 Clayton Hutto Eric Gilbert 2014 Vader parsimonious rulebased model sentiment analysis social media text Proceedings international AAAI conference web social media Vol 8 216–225 59 International Association Privacy Professionals Accessed 2023 Global Comprehensive Privacy Law Mapping Chart httpsiapporgresourcesarticleglobalcomprehensiveprivacylawmappingchart 60 International Electrotechnical Commission 2010 Functional safety electrical electronicprogrammable electrical safetyrelated systems Part 3 requirements httpswebstoreiecchpublication9277 61 Chongtao Jia Mihai Stănescu Elham Marin 2019 researchers facilitate utilisation research policymakers practitioners education Research Papers Education 34 4 2019 483–498 62 Jim Isaak Mina J Hanna 2018 User Data Privacy Facebook Cambridge Analytica Privacy Protection Computer 51 8 2018 56–59 63 Arthur Jacobs 2019 Sentiment analysis words fiction characters perspective computational neuropoetics Frontiers Robotics AI 6 2019 53 64 Robbert Jongeling Proshanta Sarkar Subhajit Datta Alexander Serebrenik 2017 negative results using sentiment analysis tools engineering research Empirical Engineering 22 2017 2543–2584 65 Eirini Kalliamvakou Georgios Gousios Kelly Blincoe Leif Singer Daniel German Daniela Damian 2014 promises perils mining github 11th Working Conference Mining Repositories MSR 92–101 66 Oleksandra Klymenko Oleksandr Kosenkov Stephen Meisenbacher Parisa Elahidoost Daniel Mendez Florian Matthes 2022 Understanding implementation technical measures process data privacy compliance qualitative study Proceedings 16th ACMIEEE International Symposium Empirical Engineering Measurement 261–271 67 Michael Kretschmer Jan Pennekamp Klaus Wehrle 2021 Cookie banners privacy policies measuring impact gdpr web ACM 
Transactions Web TWEB 15 4 2021 1–42 68 Oksana Kulyk Nina Gerber Annika Hilt et al 2020 gdpr hype affected users’ reaction cookie disclaimers Journal Cybersecurity 1 88–95 69 Aman Kumar Manish Khare Saurabh Tiwari 2022 Sensitivity Analysis Developers’ Comments GitHub Repository Study International Conference Advanced Computational Intelligence ICACI IEEE 91–98 70 Christian Kurtz Martin Semmann Tilo Bohman 2018 Privacy design comply GDPR review thirdparty data processors 2018 71 Daniël Lakens 2013 Calculating reporting effect sizes facilitate cumulative science practical primer ttests ANOVAs Frontiers psychology 4 2013 6267 72 Roslyn Layton Silvia ElalufCalderwood 2019 social economic analysis impact GDPR security privacy practices 2019 12th CMI Conference Cybersecurity Privacy CMI IEEE 1–6 73 Thomas W MacFarland Jan Yates Thomas W MacFarland Jan Yates 2016 Mann–whitney u test Introduction nonparametric statistics biological sciences using R 2016 103–132 74 Abhishek Mahindrakar Karuna Pande Joshi 2020 Automating GDPR Compliance using Policy Integrated Blockchain IEEE Intl Conf Big Data Security Cloud BigDataSecurity IEEE Intl Conf High Performance Smart Computing HPSC IEEE Intl Conf Intelligent Data Security IDS 86–93 httpsdoiorg101109BigDataSecurityHPSCIDS49724202000026 75 MH Lloyd PJ Reeve 2009 IEC 61508 IEC 61511 assessmentssome lessons learned 2009 76 Thomas W MacFarland Jan Yates Thomas W MacFarland Jan Yates 2018 2023–2026 impact GDPR global technology development Journal Global Information Technology Management 22 1 2019 77 Ze Shi Li Colin Werner Neil Ernst 2019 Continuous Requirements Example Using GDPR 2019 IEEE 27th International Requirements Engineering Conference Workshops REW 144–149 httpsdoiorg101109REW201900031 78 MH Lloyd PJ Reeve 2009 IEC 61508 IEC 61511 assessmentssome lessons learned 2009 79 Thomas W MacFarland Jan Yates Thomas W MacFarland Jan Yates 2016 Mann–whitney u test Introduction nonparametric statistics biological sciences using R 2016 
103–132 80 Abhishek Mahindrakar Karuna Pande Joshi 2020 Automating GDPR Compliance using Policy Integrated Blockchain IEEE Intl Conf Big Data Security Cloud BigDataSecurity IEEE Intl Conf High Performance Smart Computing HPSC IEEE Intl Conf Intelligent Data Security IDS 86–93 httpsdoiorg101109BigDataSecurityHPSCIDS49724202000026 81 YodSamuel Martad Anu Kung 2018 Methods tools GDPR compliance privacy data protection engineering IEEE European Symposium Security Privacy–Workshops IEEE 108–111 82 J Valdez Mendia J J FloresCuatle 2022 Toward customer hyperpersonalization experience — Datadriven approach Cogent Business Management 9 1 2022 2041384 httpsdoiorg1010802331197520222041384 83 Dan Milmo Lisa O’Carroll 2023 Facebook owner Meta fined €12bn mishandling user information Guardian httpswwwtheguardiancomtechnology2023may22facebookfinedmishandlinguserinformationirelandeumeta 84 Rene Moquin Robin L Wakefield 2016 roles awareness sanctions ethics compliance Journal Computer Info Sys 56 3 2016 85 Frank Nagle James Dana Jennifer Hoffman Steven Randazoo Xanou Zhou 2022 Census II Free Open Source Software—Application Libraries Linux Foundation Harvard Laboratory Innovation Science LISH Open Source Security Foundation OpenSSF 80 2022 86 Chinenye Okafor et al 2022 Sok Analysis supply chain security establishing secure design properties ACM SCORED Workshop 15–24 87 Kangil Park Bonita Sharif 2021 Assessing perceived sentiment pull requests emojis evidence tool developer eye movements 2021 IEEEACM Sixth International Workshop Emotion Awareness Engineering SEmotion IEEE 1–6 88 Luca Pascarella Davide Spadini et al 2018 Information needs contemporary code review Proc ACM HumanComputer Interaction CSCW 2018 89 Cole Peterson Jonathan Saddler Natalie Halavick Bonita Sharif 2019 gazebased exploratory study information seeking behavior developers stack overflow CI 1–6 90 Daniel Pletea Bogdan Vasilescu Alexander Serebrenik 2014 Security emotion sentiment analysis security discussions 
github Proceedings 11th working conference mining repositories 348–351 91 Supreeth Shastri et al 2020 Understanding benchmarking impact GDPR database systems VLDB 13 7 2020 1064–1077 92 Jeremy Sirk Walid Tabu 2007 Gradual typing objects European Conference Objectoriented Programming Springer 2–27 93 Devarshi Singh et al 2017 Evaluating static analysis tools reduce code review effort 2017 IEEE symposium visual languages humancentric computing VLHCC IEEE 191–105 94 Sean Sirur Jason RC Nurse Helena Webb 2018 Yet Understanding Challenges Faced Complying General Data Protection Regulation GDPR 2nd International Workshop Multimedia Security MMSec Springer 1–16 95 Ian Sommerville 2011 Engineering 9E Pearson Education India 96 Jeff South 2018 1000 US news sites still unavailable Europe two months GDPR took effect Nieman Lab httpswwwniemanlaborg201808morethan1000usnewssitesarestillunavailableineuropetwomonthsaftergdprtookeffect 97 Richard Sproat Alan W Black Stanley Chen Shankar Kumar Mari Ostendorf Christina Richards 2001 Normalization nonstandard words Computer speech language 15 3 2001 287–333 98 David Stokes 2012 21 Validation regulatory compliance freeopen source Open Source Life Science Research Lee Harland Mark Forster Eds Woodhead Publishing 481–504 99 Joanna Stryczew Jef Audouls Natali Helberger 2020 Data protection data frustration Individual perceptions attitudes towards GDPR Eur Data Prot L Rev 6 2020 407 100 Synopsys 2023 Open Source Security Risk Analysis Report httpswwwpwccomusenservicesconsultinglibrarygdprreadinesshtml 101 Aurelia TamòLarrieux Aurelia TamòLarrieux 2018 Privacy Design Internet Things Startup Scenario Designing Privacy Legal Framework Data Protection Design Default Internet Things 2018 203–226 102 Neil Thurman 2020 Many EU visitors shut US sites response GDPR never came back Reuters Institute Study Journalism httpsreutersinstitutepoliticsoxacuknewsmanyeuvisitorsshutoutussites 103 Serj Tubin 2023 GDPR stuff 
httpsgithubcom2beensserjtubinvuepull71 GitHub repository 2beensserjtubinvue 104 UNCTAD 2021 Data Protection Privacy Legislation Worldwide United Nations Conference Trade Development 2021 httpsunctadorgpagedataprotectionandprivacylegislationworldwide 105 Christine Utz Martin Degeling Sascha Fahl et al 2019 Un informed consent Studying GDPR consent notices field ACM SIGSAC Conference Computer Communications Security CCS 973–990 106 N van Dijk Tanas K Rommetveit C Raab 2018 Right engineering redesign privacy Personal Data Protection International Review Law Computers Technology 32 2–3 Apr 2018 230–256 httpsdoiorg1010801360006920141575022 107 Ana Vazão Leonel Santos Maria Beatriz Piedade Carlos Rabadao 2019 SIEM open source solutions comparative study 2019 14th Iberian Conference Information Systems Technologies CISTE IEEE 1–5 108 Denis Verdon 2006 Security policies developer IEEE Security Privacy 4 4 2006 42–49 109 Branka Vuleta 2023 10 unbelievable GDPR statistics 2023 httpslegaljobsiobloggdprstatistics 110 Yue Wang Irene L Manotas Gutièrrez Kristina Winbladh Hui Fang 2013 Automatic detection ambiguous terminology requirements 18th International Conference Applications Natural Language Information Retrieval NAACLHLT Association Computational Linguistics 75–85 111 R Kent Weaver 2015 Getting people behave Research lessons policy makers Public Administration Review 75 6 2015 806–816 112 Krzysztof Wnuk Tony Gorschek Showary Zahda 2013 Obsolete requirements Information Technology 55 6 2013 921–940 114 Christopher Wylie 2019 Helped Hack Democracy New York Magazine httpsnymagcomintelligencer201910bookexcerptmindfckbychristopherwyliehtml 115 Christopher Wylie 2019 Made Steve Bannon’s Psychological Warfare Tool Meet Cambridge Analytica Whistleblower New York Magazine httpsnymagcomintelligencer201910bookexcerptmindfckbychristopherwyliehtml 116 Haoxiang Zhang Shaowei Wang TseHsun Chen Ying Zou Ahmed E Hassan 2019 empirical study obsolete answers stack overflow IEEE Transactions 
Engineering 47 4 2019 850–862
::::
“They Ever Guide” Open Source Community Uses Roadmaps Coordinate Effort DANIEL KLUG CHRISTOPHER BOGART JAMES HERBSLEB Carnegie Mellon University USA Unlike commercial development open source OSS projects generally managers direct control developers spend time yet projects large diverse sets contributors need exists focus steer development particular direction coordinated way especially important “infrastructure” projects critical libraries programming languages many people depend projects taken approach borrowing planning tools originated commercial development despite fact techniques designed different contexts eg strong topdown control profit motives Little research done understand practices adapted new context paper examine Rust project’s use roadmaps important OSS infrastructure adapted inherently topdown tool freewheeling world OSS find Rust’s roadmaps built part summarizing motivated developers prefer work ways description motivated labor available directive community move particular direction allow community avoid wasting time unpopular proposals revealing little help building encouraging work popular features making visible amount consensus features Roadmaps generate collective focus without limiting full scope developers work roadmap issues consume proportionally effort issues constitute minority work done ie issues pull requests made central peripheral participants also create transparency among beyond community central contributors’ plans allow rational decisionmaking providing way evidence community needs linked decisionmaking CCS Concepts • Humancentered computing → Open source • Social professional topics → Sustainability Additional Key Words Phrases collaboration common pool resources open source Rust language ACM Reference Format Daniel Klug Christopher Bogart James Herbsleb 2021 “They Ever Guide” Open Source Community Uses Roadmaps Coordinate Effort Proc ACM HumComput Interact 5 CSCW1 Article 158 April 2021 28 pages httpsdoiorg1011453449232
::::
1 INTRODUCTION Open source OSS come fulfill infrastructure role economy Eghbal 26 highlights OSS projects MySQL Ruby OSS industrial projects depend heavily nonprofit OSS projects fulfill infrastructural role needs careful coordination among maintainers users infrastructure 68 work behalf different companies foundations perhaps volunteers Good coordination especially important infrastructure projects since definition essential underpinning many projects poorlyconsidered changes damage stakeholders would merely incidental dependency projects could simply swap alternative Coordination work selforganizing systems27 poses difficult important problem CSCW infrastructure projects ensure maintained future preserve values users depend Unlike commercial development OSS “developer community” 78 projects manager direct control features attributes developers choose spend time yet projects still need somehow coordinate stabilize make visible development priorities Open source projects governance governance models generally dictate features added example even highly orchestrated work Linux kernel multiple coordination processes driven open source norm contributors selfselect tasks 75 Much preexisting work CSCW focused tensions infrastructure contributors’ work infrastructure priorities often driven primary work infrastructure intended support example scientific written academic collaborators shortterm paper deadlines lead people focus needed new features longterm maintainability 68 hand infrastructure development offer contributors new opportunities leading realign priorities 11 perhaps helping build consensus Researchers identified broad spectrum ways OSS communities organize coordinate development avoid tragedyofthecommons problems 50 cases preexisting social networks among contributors drive much work done 46 OSS projects taken approach borrowing planning tools originated commercial development milestones issue tracking eg Scala 1 beta testing eg PostgreSQL 2 roadmaps eg Rust 
despite fact techniques originally conceived different contexts ie strong topdown control profit motives executives managers make final decisions goals timelines rankandfile developers responsible carrying plans Developers open source communities contrast often free choose tasks bottomup power may impact planning tools work opensource world Investigating diverse OSS projects attempt shape collaboration stable visible way requires considering bottomup forces work developers’ motivation whether contribute users’ motivation choose support influence development factors make one survive another fails 98 well topdown techniques leadership employs projects despite relative lack power OSS leaders communities 75 research investigate consensus around community’s direction constructed maintained evaluated approach considering roadmaps originally topdown technique industry adapted reconfigured work OSS Roadmaps understood layout existing plans make future decisions usually visualization steps 97 intended open later revision 79 investigate effect choice roadmapping method coordination community could chosen process deciding use roadmaps first place rather carried particular method choose immediate effects one iteration look OSS roadmap’s creation applied community evaluates impact addressing following research questions RQ1 functions roadmap serve open source community RQ2 open source community use roadmap fulfill functions results show although roadmap appears superficially edict leaders specifying resources applied fact reflects consensus among active developers wish apply efforts power derives core developers’ ability accept reject changes reassures wouldbe contributor productive developers already motivated collaborate stick roadmaprelated topics roadmapbuilding process helps developers reach consensus community members use roadmap throughout year rhetorical resource cut digressions signal intention cooperate community goals
1 httpsgithubcomscalascalamilestones 2 httpswwwpostgresqlorgdeveloperbeta
::::
2 BACKGROUND research questions address apparent mismatch idea volunteers coming together work motivates roadmaps plans surface appear telling people Prior research partly explained open source collaborators set directions literature roadmapping corporate settings appears reveal little roadmapping applies volunteer projects section describe prior research areas
::::
21 Problem Coordinating Developer Effort Open Source recent years use OSS become pervasive 35 among developers resulting great economic value OSS 20 34 however largely invisible public Although OSS often critical infrastructure 26 managed differently traditional infrastructure users freely distribute access adapt modify redistribute source code community use Analyses OSS projects various social organizational perspectives shown managing requires taking account developers’ distinct motivations contributing 5 15 38 benefits rewards contributing 13 44 54 preferred levels involvement 4 62 building managing social capital 66 80 networking 60 76 77 differing communication interaction strategies 6 19 33 varying motivations characteristics raise question OSS communities coordinate agree work towards common goals define “coordination” many individuals deciding work together effectively choose tasks amount collective progress mutually agreeable direction opposed working crosspurposes OSS contributors maintainers often work distributed decentralized way little hierarchy institutional structure 22 99 likely engage projects tasks based personal interests 5 Coordinating organizing work OSS projects therefore involves matching demand effort desired features known bugs take time specialized skills fix supply effort volunteers paid developers motivations priorities
::::
211 Supply Demand Development Work Like OSS requires maintenance “to correct faults improve performance attributes adapt changed environment” 48 Unfulfilled demand maintenance may render regular obsolete infrastructure ramifications insufficient maintenance magnified projects users rely infrastructure thus demand development effort greater coming large dependent pool projects users Prior research shows demand maintenance work issue fixes testing documentation may depend many factors example size user base particular feature 56 extent upstream interdependent projects 12 Research managing OSS requirements 73 103 shows demand discovered analyzed prioritized validated within discussions issue requests Popular projects need help triaging userreported issues 2 104 Infrastructures typically also need coordinators 65 ensure individual projects features needed infrastructurewide release Skilled volunteers motivated factors strength identification community 38 internal eg selfuse external eg reputation motivations 36 45 desire learn 102 105 longterm “hobbyist” status developers become deeply involved play critical role longterm viability 74 Developers hired industry also play increasing role OSS development 28 Firms likely pay developers participate way sharing cost innovation creating demand complementary products services establishing technology de facto standard attracting improvements complements products 100 However industry support OSS projects carries risk discouraging volunteers mitigated transparency decisionmaking 101 negotiation governance membership ownership control production 58 212 Matches Mismatches Effort Supply Demand Participants OSS infrastructure generally free contribute anywhere individual decisions bring emergent allocation effort across projects besides decisionmaking individual participants unclear mechanism influences participants apply effort greatest need contrast clear commercial firms participating markets forces supply demand determine price 
strong signal guiding allocation resources 9 Economists Dalle David 21 puzzled “how absence directly discernible market links producing entities ‘customers’ output mix OSS sector industry determined Yet date question appear attracted significant research attention” unable find research addresses issue years since study requirements management points difficulties discovering articulating implementing needed features even development effort plentiful 103 lack development effort documented highly publicized Heartbleed bug 23 aware systematic studies undersupply recognize address total research seems support conclusion currently general mechanism closing gap demand supply effort except perceptions decisionmaking individual developers Yet infrastructure effort mismatch difficult participants see 68 213 Organizing Allocating Work Open Source Projects OSS leaders face tradeoffs openness fostering productive collaboration Decisionmaking behind closed doors cause conflict discourages volunteers since may feel preferences considered 40 much visibility disagreements among leadership also lead uncertainty among volunteers decisions may firm contributions may end used 82 often partial control limited means enforcement OSS leaders may rely social factors technical reputation community traditions promote vision project’s direction 55 64 Publishing schedules roadmaps help get developers identify take responsibility community goals 64 Leaders may develop formal policies guidelines collaboration give structure developers’ work 40 may assert authority reject additions given release 55 Prior research identified implicit ways core members influence newcomers peripheral members adopt cultural norms practices Hemetsberger Reinhardt 37 describe number mechanisms core members opensource projects KDE use enculturate peripheral members example project’s manifesto3 may discourage nonlikeminded contributors KDE’s leaders enforce norms mailing list discussions code review processes Crowston 
Shamshurin 18 showed core members successful Apache incubator projects communicative unsuccessful projects likely use pronouns way suggested inclusiveness peripheral community Gallivan 32 however argues rigorous control standardization measurability “McDonaldization” helps open source projects achieve common objectives virtual distributed environments trust relationships difficult form particular despite potentially many mutual trust relationships opensource communities control onedirectional relationship core periphery 22 Roadmaps Commercial OSS Development Roadmaps plans use resources time often created iterative reflective processes 61 intended open changes 79 goal lay existing plans future decisions visualize steps 79 97 may revised based results 41 commercial contexts developer resources needs coordinated explicitly management roadmaps tool create implement manage alignment company strategies product lifecycles audiences 24 30 96 Product Management SPM roadmaps communicative tool knowledge sharing 81 consensusreaching individual interpretation goals people involved development processes 47 example product roadmaps present features manage product stages 49 96 select assign requirements 25 connect teams ensure success product within larger time frame 30 96 create roadmaps information audiences characteristics needs usually collected beforehand 7 communicative tool roadmaps describe achieved way meet business objectives 57 Many OSS projects generate roadmap documents including large OSS communities React 67 Facebook Libra 84 Scala 85 QT 95 well industryproduced OSS AWS CloudFormation 14 industrial coalitions like Open Service Broker 3 roadmaps appear varying roles communities seem multiple versions maintained revisited others onetime descriptions envisioned future features However difficult casual observer tell importance roadmap documents play research choose Rust Language community particular example examine use roadmaps
::::
3 CASE STUDY ROADMAPS RUST LANGUAGE Based theoretical propositions selected Rust programming language singlecase study appropriate popularity openness popularity infrastructure means many users may pressure participants make implement good choices features priorities openness means rich variety data Rust compiler community’s working decisionmaking processes available blogs forums GitHub repositories Thus opportunity study great detail community making implementing consequential choices together constitutes Yin 106 calls “revelatory case” provides “an opportunity observe analyze phenomenon previously inaccessible social science inquiry”
3 httpsmanifestokdeorg
Rust programming language growing role popular important part infrastructure 59 many individuals subteams outside organizations stake future Rust promoted suitable infrastructural code performance reliability important web browser engines hardware devices limited resources reason used numerous big tech companies 70 Facebook Mozilla Rust community organized teams 69 work groups large active social community variety blogs chat rooms forums GitHub discussion threads inperson conferences meetings worldwide Rust community adopted evolved roadmapping process adding purposes roadmap serves time release version 10 2015 Rust core team initiated process organize prioritize future work define future goals areas Rust citing need sequence feature additions avoid later rework prioritize features would solve many problems benefit many users 51 2016 Rust team refined RFC request comments process RFCs documents proposing significant changes 93 overarching roadmap process added define initiatives rallying points concrete goals fixed time frames clear commitments individuals process involves building consensus community projectwide goals proposing goals community discussion RFC finally advertising publishing agreed upon goals yearly roadmap Rust core team 69 released first Rust roadmap February 2017 94 create roadmap core team
gathered priorities Rust community survey 92 commercial user survey companies using Rust 91 2018 roadmap addition annual survey 83 core team asked Rust community blog post ideas Rust next year 87 Rust community submitted 100 blog posts suggestions roadmap core team collected incorporated suggestions RFC discussion review 71 released roadmap March 2018 88 2019 roadmap followed similar process 86 building 73 community blog posts 72 survey 90 RFC discussion core team created roadmap released March 2019 89 Unlike previous years 2019 roadmap explicitly organized around Rust’s team structure made explicit mention teams roadmaps process thus evolved four years thoughtfully sequence development prioritize worst problems users elicit broad survey deep narrative blog post input community devolve planning separate teams form teamspecific roadmaps finally ensure chosen initiatives needed actually supported people willing commit working 31 2018 Rust roadmap 2018 roadmap available httpsgithubcomrustlangrfcsblobmastertext2314roadmap2018md lays four major goals shipping ‘Rust 2018’ edition language creating documentation support intermediatelevel Rust programmers encouraging global spread Rust adding internationalization support links local Rust groups finally strengthening compiler’s work teams leadership document goes identify several concrete things need done support areas 2018 compiler release Roadmap’s first goal focuses support four identified use cases language network services WebAssembly ie use web browsers command line applications use embedded devices document also specifies rough schedule year starting design work February March 2019 focusing RFCs “buckling down” April July focusing development work “Fun” August November focused forwardlooking exploratory features “Reflection” December document ends brief discussion ‘rationale drawbacks alternatives’ 32 Rust documents Rust publishes great many documents defining product community governance Documents somewhat standard 
open source projects available project’s GitHub site httpsgithubcomrustlangrust include “READMEmd” telling users Rust install copyright license files positioning work legally “CONTRIBUTINGmd” “CODEOFCONDUCTmd” files laying high level community norms developers contribute expected interact “RELEASESmd” describing change history high level Beyond provides wealth deeper information including “The Rust Programming Language”4 teaches language “Guide Rustc Development”5 teaching compiler works going great depth contribution norms governance September 2019 beyond compiler Rust community 147 GitHub repositories organizational umbrella including collection RFCs discussions around httpsgithubcomrustlangrfcs annual roadmaps found among RFCs repositories hold auxiliary tools bots websites documents
::::
4 METHODOLOGY Understanding communities work often complex research matter requires large data collection research benefits Rust community open communicative produce lots publicly accessible artifacts document community related activities Therefore high volume data available researchers community builds maintains evaluates consensus direction 41 Data Collection analyze functions roadmap serves Rust community use fulfill functions collected following publicly available data produced Rust community 411 Yearly Rust Roadmaps focused communitywide 2018 Rust roadmap collected official roadmap document 88 Rust community introduced first roadmap 2017 analyzing 2018 roadmap allows look past following years’ roadmap include community’s reflection roadmap used collected 97 100 blog posts 71 3 longer retrievable submitted Rust community members process creating 2018 roadmap written response core team’s call goals directions Rust 2018 412 Direct records Rust compiler work Rust community uses RFC process find consensus proposed substantial changes language standard libraries also Community standards
4 httpsdocrustlangorgbookindexhtml 5 httpsrustcdevguiderustlangorg
Fig 1 gathered engineering artifacts GitHub comments blog posts email interviews analyzed engineering artifacts set preroadmap blog posts roadmaprelevant content analyzed GitHub comments chats blog posts interview text qualitative coding statistically analyzed Likertscale answers email interviews describe functions mechanisms roadmap drawing three types analysis
Issues PRs pull requests ie proposed specific edits code often linked RFCs show actual coding work contributors happens Rust contributors allocate time effort Comments RFCs issues PRs involve discussions among contributors teams scraped code discussion contents GitHub repositories associated Rust compiler Jan 1 2018 Dec 31 2018 time frame 2018 roadmap data allowed us analyze much kind work coding work discussion work people core peripheral people adhered topics
called roadmap 413 Records argumentation discussion understand participants used roadmap resource argumentation year affect decisions priorities collected excerpts across several communication channels used Rust community Table 1 people explicitly mentioned roadmap ie explicit mentions word “roadmap” “road map” Compiler work extracted roadmap mentions corpus RFC issue PR discussions described excluding mentions roadmap’s RFC2314 httpsgithubcomrustlangrfcspull2314 Posts Rust blogs forums participants Rust many OSS communities maintain personal official community blogs post updates goals ideas critical thoughts gather samples participants explicitly using existence content roadmap resource argumentation direction searched roadmap mentions posts main publicly accessible Rust blogs Rust Blog httpsblogrustlangorg Inside Rust Blog httpsblogrustlangorginsiderust Read Rust httpsreadrustnet Week Rust httpsthisweekinrustorg Rust Internals forum httpsinternalsrustlangorg Jan 1 2018 Apr 23rd 2019 time period extended past end year specifically include posts advocating content 2019 roadmap since might contain reflections 2018 roadmap content 2019 call roadmap blog posts explicitly asked Rust contributors reflect Rust 2018 86 Online team meetings OSS community Rust contributors characteristically distributed world meetings mainly held online Rust compiler team holds weekly meetings collaborative chat Zulip httpsrustlangzulipchatcom update manage monitor plan work working groups throughout larger community
Table 1 total number collected data excerpts data mention roadmap
Source | RFC issue PR comments GitHub | Blog forum posts | Blog posts reflecting roadmap | Messages Zulip chat threads | Total
Data collected | 135234 | 3394 | 73 | 58901 | 197602
Mentions roadmap | 59 | 110 | 28 | 144 | 341
Zulip conversations semipublic members need create free account log participate thus setting low barrier read contribute discussions Anticipating team members contributors might use online meetings discuss matters related
roadmaps roadmap processes searched roadmap mentions Rust team meetings held Zulip starting Jan 1 2018 extending months beyond end 2018 Apr 23rd 2019 also include reflection 2018 roadmap happened early 2019 textual data collected GitHub comments online meetings blog post identified total 118 participants name username made least one comment multiple comments regarding roadmaps anonymized participants P001 P002 P118 chronologically appearance different data sources Five participants core team members 28 members teams 85 nonteam members five identified working group members see Fig 2 414 Email Interviews addition data mining conducted short emailed structured interviews Rust contributors contextualize findings two research questions generated sample community members stratified level community involvement find highly involved members collected list Rust team members blog post authors 2018 roadmap 99 people time sampling lessinvolved members chose random sample size committers compiler listed emails Github profiles later data cleaning people multiple invalid emails ended list 190 candidates mailed interview candidates 39 people responded 205 response rate 24 identified belonging Rust team 15 said see Fig 2 email interviews conducted anonymously could match participants existing list participants Rust forums Therefore interview participants anonymized numbered separately PS001 PS002 PS039 interview questions asked Rust contributors experience opinions Rust roadmaps year questions shown Appendix 42 Data Description Analysis case study includes data collected GitHub reconstruct allocation effort code work textual data several Rust community sources analyze communicative aspects creating using roadmap documents discussing work effort related roadmap topics answers structured email interviews Rust community members triangulate results obtained collected textual data analyze Rust community creates uses evaluates roadmaps decided follow mixedmethod approach quantitative 
qualitative methods could sufficiently address research questions 16 simultaneously used quantitative qualitative data collection methods followed convergent approach separately analyze data sets combine results interpretation Following methodological approach goal generate complete deep understanding 17 roadmaps used discuss allocate effort used quantitative technique estimate proportion work done year relevant communitywide 2018 roadmap developed roadmap topic heuristic determining whether given piece text relevant topics mentioned roadmap purpose heuristic give us objective way saying whether unit discussion coding part roadmap secondarily part roadmap pertained heuristic starts handwritten regular expressions built around topics found 2018 roadmap identifies text applying regular expressions also making inferences topics “related” items example inferring issue claims track RFC probably addresses topic RFC output list issues pull requests RFCs commits tagged “in roadmap” “not roadmap” algorithm described detail Appendix B applied heuristic create two datasets identify ideas roadmap came applied heuristic 97 retrievable blog posts answered 2018 Rust call roadmap blogs generating mapping 2018 roadmap topics blogs core team drew preparing roadmap also identified whether post written member Rust team estimate influence roadmap work done throughout 2018 applied heuristic Rust issues Rust PRs creating data set consisting one record per PR issue tagged possibly empty set roadmap topics context discussion issue PR type contributor Rust team member two measures work effort discussion work coding work Discussion work operationalized number characters English text PR issue discussion threads removing code snippets coding work operationalized lines code added removed Rust commits associated PRs datasets distinguish individual participants “team” vs “nonteam” defined scraping membership Rust teams Figure 2 project’s governance page Fig 2 classify Rust community members 
“team” 191 people “nonteam” participants whether contributing code effort depending whether listed team httpswwwrustlangorggovernance January 3 2019
Although organizational literature often refers “core” “peripheral” members avoid confusion use word “core” 9person team Rust governance page identified “core team” “team” refer 191 members teams including core “nonteam” larger community periphery supplemental check sources ideas roadmap manually inspected ten commits 2018 Rust roadmap document GitHub Rust RFC summarized changes looking introductions new topics none found small relatively informal effort since cursory check showed little substantial change RFC made discussion
6 httpswwwrustlangorggovernance January 2019 retrieved httpswebarchiveorgweb20190103220022httpswwwrustlangorggovernance
Table 2 Examples applying codes excerpts sorting categories
Excerpt | Code | Category
“a key step successful WG going forming roadmap ” | point need roadmap | creating roadmap
“it’s kind change that’s targeted roadmap year” | rejecting RFC | using roadmaps decline allocating effort
Complementing quantitative technique also created dataset handcoded roadmap mentions work team meetings blogs Table 1 shows amount data collected source extracted 341 excerpts mentioned roadmap road map collected data tracking excerpt author source case study textual data collected GitHub comments online meetings blog posts followed qualitative content analysis approach 42 52 characterize people said roadmaps excerpts sampled Rust online artifacts decided use qualitative content analysis case study method rooted social research linked particular science concepts 43 makes useful approach study documents artifacts across various data sources 8 Content analysis profitable mixedmethod research comprises quantitative qualitative methodology qualitative content analysis particular allows researcher extract manifest latent information different textual data 10 used datadriven open coding approach across collected excerpts
textbased data sources GitHub comments online meetings blog posts 52 performed inductive coding created preliminary codes construct coding scheme processing qualitative data Open codes data sources combined larger categories total generated 91 codes see Table 2 code examples sorted eight categories see Table 3 Throughout open coding process research team ensured common shared agreement generated applied codes coding varying textual data set GitHub comments online meetings blog posts based consistent use codes one researcher subsequent review generated applied codes second researcher process little disagreement found cases two researchers met review discuss refine disagreed upon codes relation data source research question coded excerpt relates discussion refinement disagreements solved codes mutually validated way ensuring validity qualitative research agreement established approach CSCW community 53 matches inductive coding approach qualitative case study across varying textual data sources addition analyzing blog posts online meetings structured email interviews served collect additional data triangulate results observed 29 Interview questions asked roadmaps influence decisionmaking helpful roadmaps community roadmaps match personal work priorities see Appendix analyzed numeric responses shown Table 4 identify themes textual responses one researcher grouped responses question categories another researcher reviewed challenged categorizations
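The roadmap topic heuristic described in this section (hand-written regular expressions plus inference over "related" items, such as an issue that tracks an RFC) can be illustrated with a minimal sketch. The topic patterns, the `Item` class, and the function names below are hypothetical and chosen only for illustration; the study's actual expressions and algorithm are those of its Appendix B, which is not reproduced here.

```python
import re
from dataclasses import dataclass, field

# Hypothetical topic patterns, for illustration only. They are NOT the
# hand-written expressions used in the study (see its Appendix B).
TOPIC_PATTERNS = {
    "webassembly": re.compile(r"\b(wasm|web\s*assembly)\b", re.IGNORECASE),
    "async": re.compile(r"\basync\b|\bawait\b|\bfutures?\b", re.IGNORECASE),
    "compiler-performance": re.compile(
        r"\bcompile[- ]?times?\b|\bcompiler performance\b", re.IGNORECASE
    ),
}


@dataclass
class Item:
    """An issue, pull request, RFC, or commit (hypothetical record type)."""
    ident: str
    text: str
    related: list = field(default_factory=list)  # ids of items this one tracks
    topics: set = field(default_factory=set)


def tag_direct(item):
    """Step 1: tag an item with every topic whose pattern matches its text."""
    for topic, pattern in TOPIC_PATTERNS.items():
        if pattern.search(item.text):
            item.topics.add(topic)


def propagate(items):
    """Step 2: an item inherits the topics of the items it claims to track.

    Repeats until nothing changes; topic sets only grow, so this terminates.
    """
    by_id = {item.ident: item for item in items}
    changed = True
    while changed:
        changed = False
        for item in items:
            for related_id in item.related:
                other = by_id.get(related_id)
                if other is not None and not other.topics <= item.topics:
                    item.topics |= other.topics
                    changed = True


def tag_all(items):
    """Return an 'in roadmap' / 'not roadmap' label for each item id."""
    for item in items:
        tag_direct(item)
    propagate(items)
    return {i.ident: "in roadmap" if i.topics else "not roadmap" for i in items}


items = [
    Item("rfc-1", "Stabilize async/await syntax"),
    Item("issue-2", "Tracking issue for the RFC", related=["rfc-1"]),
    Item("issue-3", "Fix typo in docs"),
]
labels = tag_all(items)
for ident, label in labels.items():
    print(ident, label)
```

In this toy input the tracking issue matches no pattern itself and is labeled only through the RFC it tracks, which is the kind of inference the heuristic is described as making.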
::::
5 RESULTS following two subsections answer research questions roadmap accomplish Rust community RQ1 RQ2
Table 3 Number excerpts number codes applied per category
Category | Num excerpts | Num codes applied
Creating roadmap | 134 | 17
Using roadmap decline allocating effort | 33 | 13
Pointing effort roadmap topics | 26 | 12
Executing roadmap | 81 | 28
Asking roadmap | 11 | 4
Linking roadmap documents | 28 | 2
Praising use roadmap | 13 | 6
Criticizing use roadmap | 15 | 9
total | 341 | 91
Table 4 Summary responses email interview Q13 asked textual explanations accompanied Likertstyle question fivepoint scale 3 would neutral answer 5 means roadmap high influence respondent’s activities helpfulness alignment respondent’s priorities team nonteam differ ttest p005 Questions given Appendix
Question | Likert answers mean overall | Likert answers mean Team | Likert answers mean NonTeam | Text answers count Team | Text answers count NonTeam
Q1 influence 1-5 scale | 2.8 | 3.2 | 2.3 | 10 | 6
Q2 helpful 1-5 scale | 4.1 | 4.2 | 4.0 | 11 | 7
Q3 priorities 1-5 scale | 3.5 | 3.7 | 3.1 | 11 | 3
Q4 improve text | - | - | - | 15 | 4
Q5 years numeric | 37 | 38 | 35 | - | -
Q6 team yesno | 24 yes 15 no
51 Functions Roadmap Building using roadmap appeared serve neither extreme forcing team members’ agenda wider community letting broader user community choose direction Rather allowed team members others identify areas consensus around goals keep focus goals year 511 Reaching consensus purpose among team members Rust team put call beginning 2018 asking community submit “blogposts reflecting Rust 2017 proposing goals directions Rust 2018” analysis posts eventual 2018 roadmap suggests Rust team indeed succeeded soliciting input people outside team structure 18 97 retrievable blog posts collected authored people listed team members alumni However blog posts responding solicitation seem major source novel ideas outside central community resulting roadmap document synthesis shared ideas many sources 23 30 roadmap topics could find blog posts mentioned team non team blog posts three topics mentioned team members four non team members single blog post contained 12 topics
suggesting roadmap really synthesis many perspectives simply codification existing consensus RFCstyle process accepting roadmap core team created elicit completely new ideas community rather discussion consisted mostly clarification acceptance roadmap changed little core team proposed Jan 29 2018 adoption March 5th Discussion 51 general comments 20 comments linked lines document led little change time Besides typos formatting clarifications main substantive change rewording strongly emphasize compiler performance short process appear generate innovative new directions rather consolidation ideas already support previously gathered together
Table 5 table quantifies two types effort discussion code contribution applied Rust community broken roadmaprelatedness type effort “total” figures show discussion coding nonroadmap items however Bytes per issue lines per PR figures show effort per item roadmap items
Roadmaprelatedness | Total issue text ÷ issues | Bytes per issue | Total lines code ÷ PRs | lines per PR
Roadmap | 316 MB ÷ 2899 | 109150 | 246K ÷ 680 | 362.2
Non Roadmap | 788 MB ÷ 9092 | 86622 | 923K ÷ 3320 | 277.9
512 Focusing work year Analysis effort expended Rust community 2018 demonstrates 2018 roadmap neither followed religiously ignored completely Rather represented community focus sense initiatives attracted proportionally coding discussion per issue issues roadmap Table 5 quantifies two types effort applied Rust community broken type effort contributing discussion GitHub threads writing code Fig 3 Volume discussion left coding right broken team n191nonteam n2392 members roadmap nonroadmap issues Left figure measures discussion megabytes right figure measures lines code pull requests thousands lines code Nonroadmap matters dominated volume discussion code Nonteam members discussion team members somewhat coding overall Note team members work per person vastly nonteam members roadmap issues involve work per item nonroadmap issues see Table 5 Roadmap matters constitute minority work receive outsize
attention Rust compiler project’s issue PR threads Rust community generated 121457 comments across 11991 different discussion threads 2018 discussing proposed ongoing development Rust compiler hottest threads ie Rust issues pull requests bytes discussion likely roadmap topics – 6 top 10 largest issue threads roadmap topics overall 2899 11991 24 issues related directly roadmap measured heuristic chi2 6989 p0082 words roadmap topics community focus long tail smaller efforts actually constituted discussion Discussion roadmaprelated issues constituted average 278 text per issue nonroadmap issues Roadmap issues included text nonroadmap issues p0092 2tailed ttest logtransformed byte counts issue discussions Although 211 lines code added deleted roadmaprelated focus relationship applied 170 PRs worked associated roadmap roadmap PRs substantial changes averaging 304 lines code per PR p0001 2tailed ttest logtransformed lines code per pull request Thus although majority issues discussed code changes proposed envisioned roadmap ones roadmap consume proportionally effort per issue especially frequent contributors roadmap appears serve focus attention still allowing great deal work outside boundaries everything community agrees requires consensusbuilding needs roadmap priorities bug fixing obvious asked whether followed roadmap personally twelve interviewees PS002 PS006 PS007 PS008 PS016 PS018 PS020 PS021 PS023 PS027 PS036 PS039 replied Rust roadmaps set common direction community emphasized common focus “I think give clear focus point year community wants work next see accomplished goals next ones be” –PS008 email interview others emphasized open nonprescriptive attitude “Ostensibly called roadmaps helpful sense set general priorities course lot things outside roadmap worked cannot command volunteers otherwise” –PS016 email interview Another said “Roadmaps independent actual work invest ever guide” PS021 email interview 513 Prioritizing work core Team members pay heed roadmap 
priorities nonteam members Although roadmap pitched description general community priorities evidence people team nonteam perceive roadmap especially relevant activities team developers less important binding nonteam participants Four 16 people answered interview question roadmaps influenced decisions work indicated roadmap applied highlyinvolved people One respondent claimed fairly low 25 influence roadmap said “I started contributing learning experience roadmaps didn’t influence start contributing influence contribute I’m involved” PS023 email interview Another claimed high influence 55 roadmap said “I’m core team work subteams roadmap directly related work do” PS039 email interview amounts text code generated participants support idea team members likely pay attention roadmap issues 87 108 team members contributed code 2018 81 added comment least one roadmaprelated issue 39 nonteam contributors 1065 2757 difference proportions significant chi2 7598 p00001 Still bulk work regardless role nonroadmap matters 341 text team members wrote issue comments roadmaprelated issues 69MB 203MB Figure 3 217 lines code wrote roadmaprelated PRs Contributors teams similar proportion roadmap work 274 issue comment text 200 code lines written roadmaprelevant seems teams’ proportionally greater preference work roadmap issues individual issue level result vastly greater proportion roadmap work done volume might explained example team members “touching” many issues bulk work Although developers particular issues prefer work others especially team members took cues roadmap setting priorities interviews people gave equivocal answers question whether Rust roadmaps influence decision work contribute average choice 28 15 scale slightly closer “not all” scale’s midpoint 3 People said team rated higher 32 nonteam respondents 23 ttest p01 Four people elaborated question said felt roadmap mostly relevant important issues addressed team developers Two specifically indicated presence feature 
roadmap gave developers confidence work feature knowing change wanted work would taken others community One said “only contribute driveby ie oneoff edit without much community engagement itch needs scratching roadmaps influence see chance scratching actually result usable changes language” PS001 email interview short roadmap provides encouragement work certain issues certain people developers feel constrained work roadmap initiatives Influence individual priorities roadmap ran ways among interviewees People rated agreement roadmap’s priorities slightly positively average 35 15 scale team members significantly higher 37 nonteam members 31 ttest p05 14 chose elaborate causality ran ways two said priorities matched roadmap’s helped write three said happened agree priorities hand five said pursued roadmap initiatives didn’t priorities three said disagreed priorities valued importance shared goal getting way One person said roadmap priorities vague resolve disagreements relevant working team 514 Creating external visibility saw roadmap also serving communicate intentions Rust community outside community make community’s trajectory predictable first proposing roadmap process author proposal listed among goals “Advertise goals published roadmap” “Celebrate achievements informative publicitybomb” 1 interviews four 14 people PS001 PS005 PS006 PS038 answered question roadmaps valuable indicated helped communicate vision intentions outside One said roadmap helps users plan giving “ sense unstable features OK use that’s planning switch stable reasonable time frame” PS001 email interview Another respondent found helpful way judge plans use language “I consider Rust still young language yet finalized depending direction goes could dealbreaker me” PS038 email interview 515 Building sense group identity online team meetings Zulip largest number roadmap mentions concerned creating roadmaps majority mentions 63 91144 single participant P008 core team member championed roadmap 
formation strengthening Rust’s team structure 2018 P008’s rhetorical use roadmap included emphasizing need start separate roadmap eg subproject suggesting collecting roadmap topics existing roadmaps P008 emphasized benefits roadmaps successful collaborative work “a key step successful WG going forming roadmap ” –P008 core team member online meeting structuring work processes “I think encouraging people outline roadmap specific steps good idea” –P008 core team member online meeting reaching bigger shared goals argued example creating roadmaps worth effort put “it’s worth taking time make roadmap” –P008 core team member online meeting work time needed create roadmaps non team members also mentioned need roadmaps organize work effort “We need open issues first kind roadmap” –P040 non team member online meeting overall less committed making decisions create manage roadmaps “not sure want wait collect appropriate toolsubteam roadmaps publish one collectively” –P038 non team member online meeting online meetings non team members rather make comments show mostly strong support roadmap creation reaction suggestions made core team members “I think roadmap definitely good idea something get working groups working towards goal could helpful keeping active” –P045 non team member online meeting praise effort made team members create apply roadmaps “I applaud can’t agree everything ” –P048 non team member online meeting Team members understood roadmaps useful planning tool ongoing future work manage working groups attract contributors presenting work areas goals Roadmaps functioned manifest topics working groups focus certain time team members gently pushed towards creating roadmaps example suggesting new group begin lightweight alternative complex communitywide process “I’m imagining long ‘roadmaps’ bullets” –P008 core team member online meeting team members’ effort contributors working groups start roadmaps illustrates need goal organize manifest work written form especially 
core team tries manage larger general goals distributed Rust community 516 Summary Rust community’s team members began diverse set priorities individuals roadmap process way team members decide consensus focus attention commit applying things year gave way define strongly group knowing shared purpose evidence gave peripheral participants way assert identity group ingroup members gently channel outside contributions away distracting alternate paths Although process explicitly listened input outside community Rust team membership practice bring significant new ideas outsiders conversation 52 Mechanisms Roadmap roadmap written never referred might simply gather dust bear relation subsequent activity Rust community however appears take roadmap seriously written Individuals used gauge whether ideas likely supported others strengthen formation teams discuss argue encourage discourage proposed efforts reflect progress 521 Assembling work groups Although roadmap creation phase helps whole community build consensus overall goals developers also use find form collaborations particular tasks blog posts team non team members alike mentioned personal roadmaps way inform work activities promote plans action example referred detailed goals roadmaps There’s bit detail roadmap –P091 non team member blog post pointed roadmap goals work groups Embedded one four target domains Rust 2018 Roadmap –P084 non team member blog post one issue comment contributor motivated others contribute ideas roadmap call blog posts influence Rust roadmap Please write Rust 2019 blog post express concern think enough us influence roadmap –P021 team member issue comment Core team members early 2018 pushed creation formal working groups domains defined focus roadmap blog posts team members emphasized work effort would aimed domain working groups primary focus year’s domain working groups kicked 2018 Roadmap –P076 core team member blog post team leaders advertised community allocate resources domain working 
groups Blog posts time announced new working groups domain argued reorganizing existing working groups better meet roadmap goals devtools team reorganised continue scale support goals roadmap –P077 team member blog post Conversely although roadmaps promoted complete list things work also serve prewarn developers things might work would likely attract much support collaboration RFC issue PR comments team members used roadmap refer overall direction Rust take Even without definite future goals mere existence roadmap process served reject proposals matching potential goals included explanations right time right trend details roadmap still play seems like clear expansion insufficiently strong motivation –P008 core team member RFC comment right perspective don’t think major rework enums currently aligns well current priorities priorities likely set upcoming roadmap –P008 core team member RFC comment 522 Discouraging nonroadmap RFCs basis rejecting proposals Team membership appears affect people talk roadmap Roadmap mentions team members RFC issue PR comments intended point contributors roadmap topics away RFC proposal I’d like draw attention 2018 roadmap –P012 core team member RFC comment However team members often still valued developers’ ideas motivated future work example presented prospect feature could make upcoming roadmap could interesting thing consider next year’s roadmap –P002 team member RFC comment roadmap gave justification team members especially core team members dismiss proposals fit well community’s vision Rust would take much significant effort away current efforts comments GitHub roadmap mostly mentioned argument discussions team members decline proposed RFCs seem fit roadmap goals it’s kind change that’s targeted roadmap year –P002 team member RFC comment argumentative strategy seems go perception roadmap mere guideline instead posing roadmap goals delimiting boundaries work effort allocated comments gave additional explanations declining RFCs 
relation roadmap example roadmap treated strict work plan proposals possible threat achieving roadmap goals pretty worried delay hard time delivering roadmap year –P007 team member issue comment Team members also used roadmap reinforce something perceived true insufficient reason end RFC issue discussions example proposal generate enough community interest hasn’t lot activity RFC also doesn’t particularly fit roadmap –P008 core team member RFC comment also defined adequacy RFC discussions roadmap goals also don’t think RFC high enough priority Rust roadmap devote lot attention reaching consensus –P018 core team member RFC comment words features match roadmap worth effort find consensus within community Although nonteam members rarely used roadmap argue features one contributor mentioned roadmap speak issue Finally ‘abstract type’s close roadmap –P011 non team member RFC comment Beyond role consolidating consensus created roadmap also used argumentative resource encouraging work shared goals discouraging work even extended discussion work risks becoming distraction 523 Reason promote particular issues PRs found issue PR comments non team members mostly mentioned roadmap referring supporting emphasizing roadmap goals issue discussions asking clarification status roadmap goals often argued favor features related roadmap Using build systems thanin addition Cargo explicitly goal 2018 roadmap –P028 non team member issue comment often mentioned roadmap strong reference argue working implementing features sometimes even reference previous roadmap topics Cargo able integrate larger build systems think 2017 roadmap –P009 non team member RFC comment discussing work effort issues PRs non team members also pointed roadmap goals others Note haven’t seen yet macros 20 apparently slated stable later year according proposed roadmap –P021 team member issue comment 524 Shared basis later reflection Rust roadmap process promises retrospective reflection end year 1 part Rust core team 
asked people reflect 2018’s roadmap posing ideas 2019 Roadmap reflections within posts mostly evaluated progress roadmap’s particular initiatives example posters praised progress WebAssembly 2018 really cool year WASM Rust –P116 team member blog post reflecting futures asyncawait lot progress made Futures asyncawait 2018 –P110 team member blog post reflecting People also criticized lack progress unfinished tooling Tooling large part goal Rust 2018 one gets lucky tooling around editor IDE support “just work” many times doesn’t –P071 non team member blog post reflecting missing libraries posts commented features claiming changes made actual benefit users mistimed Reflections process relatively rare Developers mentioned community collaborative work processes yet improved planned community still needed better manage exhaustion time spent topics general many key contributors rustc put enormous amount pressure get changes shipped deadline –P086 non team member blog post reflecting Moving 2019 efforts reflecting 2018 waned blog posts mentioning roadmaps mostly highlighted work group achievements developments Rust package manager cargo WebAssembly goals stabilization growth increased productivity Rust teams seems consistent 2019 roadmap’s shift emphasis towards teamspecific roadmaps email interviews 19 people PS002 PS005 PS006 PS007 PS008 PS013 PS016 PS018 PS021 PS023 PS025 PS028 PS029 PS030 PS032 PS035 PS036 PS037 PS039 responded question roadmaps could improved two people teams suggestions seemed aimed reinforcing roadmap’s role commitment achieve goals common suggestion 7 respondents PS006 PS007 PS008 PS028 PS030 PS032 PS035 better reflection process cases end year preparation next roadmap One respondent said “It’d nice retrospective examines much work year kept plan give summary language advanced desired direction” PS007 email interview Seven respondents satisfied process PS002 PS036 PS037 said opinion PS005 PS013 PS023 PS029 rest ideas improvements suggestions less 
ambitious goals specificconcrete goals better estimation effort levels two nonteam members responded question one called stakeholder involvement saying “Figuring low threshold way bringing library stakeholders projects minimal time commitment paramount” PS018 email interview 525 Summary intention process creating roadmap gave community opportunity shared artifact around talk balance priorities define boundaries shared purpose forming teams year effect community members used online discourse justification discouraging offtopic work justification encouraging ontopic work also tipped balance individual decisionmaking work allocation providing evidence ontopic efforts would supported community members Afterwards served standard evaluate progress year
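The comparison of proportions reported in Section 513 can be checked with a short worked example. This sketch assumes the flattened figures read as 87 of 108 team members and 1065 of 2757 nonteam contributors commenting on at least one roadmap-related issue, and that "chi2 7598" is 75.98 with its decimal point lost in extraction. It uses a plain Pearson chi-square without continuity correction; the source does not say which variant the authors used, so that choice is also an assumption.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Counts as read from the flattened text (an interpretation, see lead-in).
team_yes, team_total = 87, 108
nonteam_yes, nonteam_total = 1065, 2757

team_share = team_yes / team_total
nonteam_share = nonteam_yes / nonteam_total

chi2 = chi_square_2x2(
    team_yes, team_total - team_yes,
    nonteam_yes, nonteam_total - nonteam_yes,
)

print(f"team share:    {team_share:.1%}")
print(f"nonteam share: {nonteam_share:.1%}")
print(f"chi-square:    {chi2:.2f}")
```

Under these assumptions the shares round to the 81 and 39 percent given in the text, and the statistic lands within a few hundredths of the reported value, which supports reading the counts this way.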
::::
6 DISCUSSION Rust’s roadmap process strikes balance openness new ideas people unifying around common goals popular programming language many potential contributors could welcomed encouraged help mentioned Subsection 213 eliciting help peripheries community requires balance welcoming openness predictable direction Rust’s process seems strike balance creating ceremony around transition openness direction welcome input building roadmap visibly commit one direction roadmap released Although new ideas outsiders appear enter roadmap process enumerated summarized listened fact new ideas outsiders nonzero chance heeded may well important encouraging participation infinitesimal nonzero chance winning lottery effective encouraging broad participation Another advantage transparent roadmap creation process confers legitimacy governing process 31 document visible grounding process might trusted date one individual’s interpretation community’s goals even intentions sponsoring organization like Mozilla contrast offering prospective contributors ability gain knowledge trust community’s true intentions Rust might allowing quickly gain sense belongingness community wellstudied motivator contribution 38 fact observed nonteam participants encouraging others work PRs relevant roadmap suggests may visibly signalling commitment community demonstrating familiarity roadmap individual contributors trust planned work done others known timeframe “divide conquer” approaches coordination may become viable Howison Crowston 39 found concurrent development dependent contributions rare open source studying open source projects performed complex multiperson tasks Howison Crowston observed developers either immediately adding contributions necessary supporting code already place deferring contributions hopes someday support would become available observe pattern multiperson interdependent work one developer proceeded feature trusting another developer would writing supporting code time hypothesize 
cowork may common projects provide trustable signal others’ intentions Searching examples Rust would fruitful future work Team members particularly core team play important role curating suggestions articulating common vision core team influences consensus built maintained roadmap process Framing community survey questions requests preroadmap blog posts choosing among answers build coherent set initiatives Using visibility respect argue vision publicly blog posts RFC issue discussions forums team meetings Holding voting privileges RFCs merge rights PRs mentioned earlier accepted RFCs align roadmap roadmap sometimes used way frame rejection RFCs usually problematic reasons roadmap allows core team members take role similar manager seen example P008’s strategy steering team contributor effort using roadmap agreed upon validation
::::
7 IMPLICATIONS PROJECTS case study useful providing deep example process played real world provide experiences projects learn projects considering roadmapping need consider applies context may want consider roadmapping process struggling balance diverging priorities wants strengthen sense shared direction Based observation single case suggest following guidance Actively solicit input larger community developers well core team saw case overlap ideas helpful identifying areas consensus already exist letting harboring ideas lacking consensus unlikely significant effort aggregate applied ideas Adopt nonzero number ideas community seems likely order keep larger community engaged interested ideas beyond core team make roadmap evaluation process open fair form governance fairness openness convey sense legitimacy around decisionmaking enhance likelihood community accept act roadmap Don’t expect – even – development work discussion focus roadmap items Nevertheless significant progress items made especially frequent contributors Reflecting community’s progress roadmap process roadmap constructed helpful creating future versions caution next section however paper describes Rust’s experience building roadmap process particular needs clear process would need different community building different different developers different users
::::
8 THREATS VALIDITY results rely part detailed qualitative analysis Qualitative studies mostly aim generalizability providing “a rich contextualized understanding human experience intensive study particular cases” 63 looked Rust community case study example OSS communities use roadmaps organizational tools manage allocate work effort shared work goals Interviewees may representative entire community although response rate fairly high long tail contributors may selfselection bias especially among lowvolume contributors know typical Rust OSS communities regard roadmap speculate findings might apply beyond Rust identified specific list roadmap topics classified issues PRs RFCs according topics using heuristic described Appendix B may undercount work roadmap boundaries topics welldefined since features interact work nonroadmap feature may needed interacts roadmap feature viceversa However relied titles labels assigned community mapping roadmap topics labels many cases great deal face validity attempt tease effectiveness roadmaps coordination mechanism compared ways governing focus understanding community constructed used roadmaps Future work could address questions effectiveness example comparing quality productivity community satisfaction roadmap adoption
::::
9 CONCLUSIONS work set understand functions roadmaps Rust community used fulfill functions qualitatively examined creation management reflection consensus roadmap process estimated proportions roadmaprelated work done throughout planned year shown roadmap’s purposes included building legitimizing consensus focusing prioritizing collective attention particularly team members building group identity creating external visibility community’s plans community accomplishes purposes assembling work groups around roadmap’s structure using roadmap goals justification directing people towards roadmaprelated work using roadmap ground reflection end year planning next year power roadmap influence contributors’ choices year comes fact comprises exactly initiatives collaborators willing help transparent process provides evidence willingness developers deciding contribute effort roadmapped year instead strictly constraining activity roadmap rather functioned nudge contributors work collectively agreed upon topics case focus would wander individually motivated topics way roadmap enables community guide areas mutual interest rather commanding effort shared goals thus guides community without need exert hierarchical power provides useful prediction future development people working dependent projects REFERENCES 1 Brian Anderson 2016 Feature northstar httpsgithubcombrsonrfcsblobnorthstartext0000northstarmd Last accessed 13 January 2020 2 John Anvik Lyndon Hiew Gail C Murphy 2006 Fix Bug Proc International Conference Engineering Shanghai China ICSE ’06 ACM New York NY USA 361–370 3 Open Service Broker API 2019 Roadmap Release Planning httpsgithubcomopenservicebrokerapiservicebrokerprojects1 Last accessed 13 January 2020 4 Barcomb Kaufmann Riehle K Stol B Fitzgerald 2018 Uncovering Periphery Qualitative Survey Episodic Volunteering FreeLibre Open Source Communities IEEE Trans Eng 2018 1–1 5 Hoda Baytiyeh Jay Pfaffman 2010 Open source community altruists Comput Human Behav 26 6 Nov 2010 
1345–1354 6 Stefan Kambiz Behfar Ekaterina Turkina Thierry BurgerHelmchen 2018 Knowledge management OSS communities Relationship dense sparse network structures Int J Inf Manage 38 1 Feb 2018 167–174 7 Willem Bekkers Inge van de Weerd Marco Spruit Sjaak Brinkkemper 2010 Framework Process Improvement Product Management Systems Services Process Improvement Springer Berlin Heidelberg 1–12 8 Mariette Bengtsson 2016 plan perform qualitative study using content analysis NursingPlus Open 2 2016 8–14 9 Yochai Benkler 2002 Coase’s Penguin Linux “The Nature Firm” Yale Law J 2002 369–446 10 Bruce Lawrence Berg Howard Lune Howard Lune 2004 Qualitative research methods social sciences Vol 5 Pearson Boston 11 Matthew J Bietz Eric P Baumer Charlotte P Lee 2010 Synergizing Cyberinfrastructure Development Comput Support Coop Work 19 34 July 2010 245–281 12 Christopher Bogart Christian Kästner James Herbsleb Ferdian Thung 2016 Break API Cost Negotiation Community Values Three Ecosystems Proc International Symposium Foundations Engineering Seattle WA USA FSE 2016 ACM New York NY USA 109–120 13 Yuanfeng Cai Dan Zhu 2016 Reputation open source community Antecedents impacts Decis Support Syst 91 Nov 2016 103–112 14 AWS Cloudformation 2018 CloudFormation Public Coverage Roadmap httpsgithubcomawscloudformationawscloudformationcoverageroadmap Last accessed 13 January 2020 15 J Coelho Valente L L Silva Hora 2018 Engage FLOSS Answers Core Developers Intl Workshop Cooperative Human Aspects Engineering CHASE 114–121 16 John W Creswell Vicki L Plano Clark 2017 Designing conducting mixed methods research Sage publications 17 John W Creswell Cheryl N Poth 2016 Qualitative inquiry research design Choosing among five approaches Sage publications 18 Kevin Crowston Ivan Shamshurin 2016 CorePeriphery Communication success freelibre open source projects IFIP Advances Information Communication Technology 472 2016 45–56 httpsdoiorg1010079783319392257 19 Laura Dabbish Colleen Stuart Jason Tsay Jim 
Herbsleb 2012 Social Coding GitHub Transparency Collaboration Open Repository Proc Conference Computer Supported Cooperative Work Seattle Washington USA CSCW ’12 ACM New York NY USA 1277–1286 20 Carlo Daffara 2012 Estimating economic contribution open source European economy First Openforum Academy Conference Proceedings booksgooglecom 21 JeanMichel Dalle Paul David Others 2003 allocation development resources ‘open source’ production mode SIEPRProject NOSTRA Working Paper 15th February Accepted publication Joe Feller Brian Fitzgerald Scott Hissam Karim Lakhani eds Making Sense Bazaar forthcoming MIT Press 2004 2003 22 Premkumar Devanbu Pallavi Kudigrama Cindy RubioGonzález Bogdan Vasilescu 2017 Timezone Timeofday Variance GitHub Teams Empirical Method Study Proc International Workshop Analytics Paderborn Germany SWAN 2017 ACM New York NY USA 19–22 23 Zakir Durumeric Frank Li James Kasten Johanna Amann Jethro Beekman Mathias Payer Nicolas Weaver David Adrian Vern Paxson Michael Bailey J Alex Halderman 2014 Matter Heartbleed Proc Internet Measurement Conference Vancouver BC Canada IMC ’14 Association Computing Machinery New York NY USA 475–488 httpsdoiorg10114526637162663755 24 Christof Ebert 2007 impacts product management J Syst Softw 80 6 June 2007 850–861 25 Christof Ebert Sjaak Brinkkemper 2014 product management–An industry evaluation J Syst Softw 95 2014 10–18 26 Nadia Eghbal 2016 Roads Bridges unseen labor behind digital infrastructure Technical Report Ford Foundation 27 Anna Filippova Hichang Cho 2016 Effects Antecedents Conflict Free Open Source Development Proc Conf Computer Supported Cooperative Work Social Computing CSCW 2016 705–716 28 Brian Fitzgerald 2006 Transformation Open Source MIS Quarterly 30 3 2006 587–598 29 Uwe Flick 2018 introduction qualitative research Sage Publications Limited 30 Samuel Fricker 2012 product management People Springer 53–81 31 Archon Fung 2006 Varieties Participation Complex Governance Public Administration Review 66 s1 
2006 66–75 32 Michael J Gallivan 2001 Striking balance trust control virtual organization content analysis open source case studies Information Systems Journal 11 4 2001 277–304 httpsdoiorg101046j13652575200100108x 33 Mohammad Gharehyazie Daryl Posnett Bogdan Vasilescu Vladimir Filkov 2015 Developer initiation social interactions OSS case study Apache Foundation Empirical Engineering 20 5 Oct 2015 1318–1353 34 Shane Greenstein Frank Nagle 2014 Digital dark matter economic contribution Apache Research Policy 43 4 May 2014 623–631 35 Gordon Haff 2018 Open Source Ate Understand Open Source Movement Much Apress 36 Hars Shaosong Ou 2001 Working free Motivations participating open source projects Proc Hawaii International Conference System Sciences 9 pp– 37 Andrea Hemetsberger Christian Reinhardt 2009 Collective development opensource communities activity theoretical perspective successful online collaboration Organization Studies 30 9 2009 987–1008 httpsdoiorg1011770170840609339241 38 Guido Hertel Sven Niedner Stefanie Herrmann 2003 Motivation developers Open Source projects Internetbased survey contributors Linux kernel Research Policy 32 7 July 2003 1159–1177 39 James Howison Kevin Crowston 2014 Collaboration open superposition theory open source way Miss Q 38 1 2014 29–50 40 Chris Jensen Walt Scacchi 2010 Governance open source development projects comparative multilevel analysis IFIP International Conference Open Source Systems Springer 130–142 41 HansBernd Kittlaus Samuel Fricker 2017 Product Management ISPMACompliant Study Guide Handbook Springer 42 Florian Kohlbacher 2006 use qualitative content analysis case study research Forum Qualitative SozialforschungForum Qualitative Social Research Vol 7 Institut für Qualitative Forschung 1–30 43 Klaus Krippendorff 2018 Content analysis introduction methodology Sage publications 44 Sandeep Krishnamurthy Shaosong Ou Arvind K Tripathi 2014 Acceptance monetary rewards open source development Research Policy 43 4 2014 632–644 
45 K Lakhani 2005 Hackers Understanding Motivation Effort FreeOpen Source Projects Perspectives Free Open Source 2005 3–21 46 Charlotte P Lee Paul Dourish Gloria Mark 2006 human infrastructure cyberinfrastructure Comput Support Coop Work 2006 483–492 47 Jung Hoon Lee HyungIl Kim Robert Phaal 2012 analysis factors improving technology roadmap credibility communications theory assessment roadmapping processes Technol Forecast Soc Change 79 2 Feb 2012 263–280 48 Lehman J F Ramil P Wernick E Perry W Turski 1997 Metrics laws evolutionthe nineties view Proceedings Fourth International Metrics Symposium IEEE 20–32 49 Andrey Maglyas Uolevi Nikula Kari Smolander 2013 roles product managers empirical investigation J Syst Softw 86 12 Dec 2013 3071–3090 50 Lynne Markus 2007 governance freeopen source projects Monolithic multidimensional configurational Journal Management Governance 11 2 2007 151–163 51 Niko Matsakis 2015 Priorities 10 httpsinternalsrustlangorgtprioritiesafter101901 Last accessed 13 January 2020 52 Philipp Mayring 2004 Qualitative content analysis companion qualitative research 1 2004 159–176 53 Nora McDonald Sarita Schoenebeck Andrea Forte 2019 Reliability interrater reliability qualitative research Norms guidelines CSCW HCI practice Proceedings ACM HumanComputer Interaction 3 CSCW 2019 1–23 54 Rebeca MéndezDurón 2013 allocation quality intellectual assets affect reputation open source projects Information Management 50 7 Nov 2013 357–368 55 Martin Michlmayr Francis Hunt David Probert 2007 Release management free projects Practices problems IFIP Int Fed Inf Process 234 December 2006 2007 295–300 56 Mockus Weiss Ping Zhang 2003 Understanding predicting effort projects 25th International Conference Engineering 2003 Proceedings IEEE 274–284 57 Jürgen Münch Stefan Trieflinger Dominic Lang 2019 Product roadmap–from vision reality systematic literature review 2019 IEEE International Conference Engineering Technology Innovation ICEITMC IEEE 1–8 58 Siobhán O’Mahony 
Beth Bechky 2008 Boundary organizations Enabling collaboration among unexpected allies Administrative science quarterly 53 3 2008 422–459 59 Stack Overflow 2019 Loved Dreaded Wanted Languages httpsinsightsstackoverflowcomsurvey2019technologymostloveddreadedandwantedlanguages Last accessed 13 January 2020 60 Gang Peng Yun Wan Peter Woodlock 2013 Network ties success open source development Journal Strategic Information Systems 22 4 Dec 2013 269–281 61 Robert Phaal Gerrit Muller 2009 architectural framework roadmapping Towards visual strategy Technol Forecast Soc Change 76 1 Jan 2009 39–49 62 Gustavo Pinto Luiz Felipe Dias Igor Steinmacher 2018 Gets Patch Accepted First Comparing Contributions Employees Volunteers Proceedings 11th International Workshop Cooperative Human Aspects Engineering Gothenburg Sweden CHASE ’18 ACM New York NY USA 110–113 63 Denise F Polit Cheryl Tatano Beck 2010 Generalization quantitative qualitative research Myths strategies International journal nursing studies 47 11 2010 1451–1458 64 Germán PooCaamaño Eric Knauss Leif Singer Daniel German 2017 Herding cats FOSS ecosystem tale communication coordination release management Journal Internet Services Applications 8 1 2017 65 Germán PooCaamaño Leif Singer Eric Knauss Daniel German 2016 Herding cats case study release management open collaboration ecosystem IFIP Adv Inf Commun Technol 472 2016 147–162 66 Huilian Sophie Qiu Alexander Nolte Anita Brown Alexander Serebrenik Bogdan Vasilescu 2019 Going Farther Together Impact Social Capital Sustained Participation Open Source 67 Hector Ramos 2018 Open Source Roadmap httpsfacebookgithubioreactnativeblog20181101ossroadmap Last accessed 13 January 2020 68 David Ribes Thomas Finholt 2009 long infrastructure Articulating tensions development Journal Association Information Systems JAIS 2009 69 Rust 2019 Governance httpswwwrustlangorggovernance Last accessed 13 January 2020 70 Rust 2019 Production users httpswwwrustlangorgproductionusers Last accessed 13 
January 2020 71 Read Rust 2018 Rust 2018 Hopes dreams Rust 2018 httpsreadrustnetrust2018 Last accessed 13 January 2020 72 Read Rust 2019 Rust 2019 Ideas community Rust 2019 next edition httpsreadrustnetrust2019 Last accessed 13 January 2020 73 W Scacchi 2002 Understanding requirements developing open source systems IEEE Proceedings 149 1 Feb 2002 24–39 74 Sonali K Shah 2006 Motivation Governance Viability Hybrid Forms Open Source Development Manage Sci 52 7 July 2006 1000–1014 75 Maha Shaikh Ola Henfridsson 2017 Governing open source coordination processes Information Organization 27 2 2017 116–135 76 Cuihua Shen Peter Monge 2011 connects social network analysis online open source community First Monday 16 6 June 2011 77 Param Vir Singh Yong Tan Vijay Mookerjee 2011 Network Effects Influence Structural Capital Open Source Success MIS Quarterly 35 4 2011 813–829 78 Matthias Stürmer 2013 Four types open source communities httpsopensourcecombusiness136fourtypesorganizationalstructureswithinopensourcecommunities Accessed 202015 79 Tanja Suomalainen Outi Salo Pekka Abrahamsson Jouni Similä 2011 product roadmapping volatile business environment Journal Systems 84 6 958–975 80 Yong Tan Vijay Mookerjee Param Singh 2007 Social capital structural holes team composition Collaborative networks open source community Proc International Conference Information Systems 2007 155 81 Antony Tang Taco de Boer Hans van Vliet 2011 Building roadmaps knowledge sharing perspective Proc International Workshop SHAring Reusing Architectural Knowledge 13–20 82 Niels C Taubert 2008 Balancing requirements decision action Decisionmaking implementation freeopen source projects Science Technology Innovation Studies 4 1 2008 69–88 83 Jonathan Taylor 2017 Rust 2017 Survey Results httpsblogrustlangorg20170905Rust2017SurveyResultshtml Last accessed 13 January 2020 84 Libra Engineering Team 2019 Libra Core Roadmap 2 httpsdeveloperslibraorgblog20191217libracoreroadmap2 Last accessed 13 January 2020 85 
Scala Team 2017 Scala 213 Roadmap httpswwwscalalangorgnewsroadmap213html Last accessed 13 January 2020 86 Rust Core Team 2018 call Rust 2019 Roadmap blog posts httpsblogrustlangorg20181206callforrust2019roadmapblogpostshtml Last accessed 13 January 2020 87 Rust Core Team 2018 New Year’s Rust Call Community Blogposts httpsblogrustlangorg20180103newyearsrustacallforcommunityblogpostshtml Last accessed 13 January 2020 88 Rust Core Team 2018 Rust’s 2018 roadmap httpsblogrustlangorg20180312roadmaphtml Last accessed 13 January 2020 89 Rust Core Team 2019 Rust’s 2019 Roadmap httpsblogrustlangorg20190423roadmaphtml Last accessed 13 January 2020 90 Rust Survey Team 2018 Rust Survey 2018 Results httpsblogrustlangorg20181127Rustsurvey2018html Last accessed 13 January 2020 91 Jonathan Turner 2016 2016 Rust Commercial User Survey Results httpsinternalsrustlangorgt2016rustcommercialusersurveyresults4317 Last accessed 13 January 2020 92 Jonathan Turner 2016 State Rust Survey 2016 httpsblogrustlangorg20160630StateofRustSurvey2016html Last accessed 13 January 2020 93 Aaron Turon 2016 Refining Rust’s RFCs httpaturongithubioblog20160705rfcrefinement Last accessed 13 January 2020 94 Aaron Turon 2017 Rust’s 2017 Roadmap httpsblogrustlangorg20170206roadmaphtml Last accessed 13 January 2020 95 Tuukka Turunen 2018 QT Roadmap 2018 httpswwwqtioblog20180222qtroadmap2018 Last accessed 13 January 2020 96 van de Weerd Brinkkemper R Nieuwenhuis J Versendaal L Bijlsma 2006 Towards Reference Framework Product Management International Requirements Engineering Conference RE’06 319–322 97 Konstantin Vishnevskiy Oleg Karasev Dirk Meissner 2015 Integrated roadmaps corporate foresight tools innovation management case Russian companies Technol Forecast Soc Change 90 Jan 2015 433–443 98 Georg Von Krogh Stefan Haefliger Sebastian Spaeth Martin W Wallin 2012 Carrots rainbows Motivation social practice open source development MIS Quarterly 2012 649–676 99 Kangning Wei Kevin Crowston U Yeliz Eseryel Robert 
Heckman 2017 Roles politeness behavior communitybased freelibre open source development Information Management 54 5 July 2017 573–582 100 Joel West Scott Gallagher 2006 Challenges open innovation paradox firm investment opensource RD Management 36 3 2006 319–331 101 Joel West Siobhán O’Mahony 2008 Role Participation Architecture Growing Sponsored Open Source Communities Industry Innovation 15 2 April 2008 145–168 102 ChorngGuang Wu James H Gerlach Clifford E Young 2007 empirical analysis open source developers’ motivations continuance intentions Information Management 44 3 2007 253–262 103 Xuan Xiao Aron Lindberg Sean Hansen Kalle Lyytinen 2018 “Computing” Requirements Open Source Distributed Cognitive Approach Journal Association Information Systems 19 12 2018 1217–1252 104 J Xie Zhou Mockus 2013 Impact Triage Study Mozilla Gnome International Symposium Empirical Engineering Measurement IEEE 247–250 105 Yunwen Ye Kouichi Kishida 2003 Toward Understanding Motivation Open Source Developers Proc International Conference Engineering Portland Oregon ICSE ’03 IEEE Computer Society Washington DC USA 419–429 106 Robert K Yin 2017 Case study research applications Design methods Sage publications EMAIL INTERVIEW QUESTIONS • Q1 much Rust roadmaps influence decision work contribute Rust influence 1 2 3 4 5 lot influence Explain optional • Q2 opinion helpful roadmaps Rust community helpful 1 2 3 4 5 helpful explain way helpful unhelpful optional • Q3 much Rust roadmaps eg working groups projects match priorities Rust represent priorities 1 2 3 4 5 Represent priorities well Explain optional • Q4 could use roadmaps Rust improved future • Q5 many years involved Rust • Q6 official Rust team working group Yes B ROADMAP TOPIC HEURISTICS began manually extracting list topics 2018 roadmap assign topics particular issues PRs RFCs used following method • Two researchers independently compiled list topics document identifying bullet points lists text appeared identify specific features 
One researcher’s list strictly longer 36 items other’s 23 items two discussed additional topics included two resulting 34 topics • Using generated list one researcher generated list proposed search keywords topic using acronyms distinctive terms word sequences found part roadmap researcher judged would high selectivity distinguishing text topic general Rust discussion final list shown Table B • Labels short strings used GitHub tag issues RFCs pull requests assigned roadmap topics applying keywords labels’ descriptions shown httpsgithubcomrustlangrustlabels example label Anet assigned topic “network services” matched search term “networking” researchers checked list labels descriptions agreed matched topics • mapping used assign topics issues PRs RFCs rust excluding socalled Rollup PRs issue PR RFC assigned topic tagged label mapped keyword • Topics also assigned RFCs tracking issues subset issues formally tied certain RFCs search terms matched item’s title • spread activation RFCs issues issues PRs RFCs PRs issues issue inherits topic RFC RFC lists issue official tracking issue PR inherits topic issue PR mentions issue ID initial description done recursively • assign commit topic part nonRollup PR topic eventually merged main thread omitted commits multiple parents avoid double counting merges commits commits 100 files avoid commits mass moves files • Discussion effort operationalized characters text header commentthread RFC discussion issue PR excluding code embedded comments delimited triple backticks • Coding effort operationalized lines code deleted plus lines code added • Team contributors operationalized anyone member one teams listed Rust’s governance page beginning 2018 Also note development happened outside repositories example rustlangcargo repository capture aspects development affect main compiler Table 6 Search terms identifying 2018 roadmap topics labels text left middle columns used search terms within descriptions labels right column shows labels 
matched 2018 Topic Search Terms Labels add edition flag rustfix edition rustfix 2018 lint rustfix Aasyncawait AsyncAwaitTriaged AsyncAwaitFocus AsyncAwaitOnDeck Fasyncawait asyncawait async await asyncawait build system integration cargo custom registries Cargo registry Cargo registries Aregistry CargoXargo integration cargo xargo CLI apps CLI app CLI application command line app command line application Clippy Clippy rustup Clippy 10 Clippy 1 0 Alint compiler optimizations optimization optimisation optimize optimise Aoptimization ALLVM Amiropt compiler parallelization parallelization parallelisation compilerdriven code autocomplete RLS completion RLS completion RLS const generics Aconstgenerics Fconstgenerics custom allocator custom allocator Aallocators custom test frameworks custom test framework Fcustomtestframeworks embedded device embedded WGembedded GATs generic associated type associated type constructor Fgenericassociatedtypes generator Agenerators Fgenerators improve compiler error error message Adiagnostics Fonunimplemented message incremental compilation incremental compilation Aincremental Aincrcomp WGcompilerincr internationalization internationalization internationalisation macros 20 hygiene macro hygiene macro 20 macro 2 0 hygiene Ahygiene Amacros20 MIRonly rlibs MIR rlib modules revamp modules Amodules network services networking Anet nonlexical lifetimes NLL non lexical lifetime nonlexical lifetime ANLL NLLcomplete NLLdiagnostics NLLfixedbyNLL NLLperformant NLLpolonius NLLreference NLLsound public dependencies cargo libstd cargo std cargo xargo cargo revise cargo profiles cargo profile Aprofile RLS 10 RLS Alanguageserver Arls rustdoc RLSbased edition RLS rustdoc rustfmt rustfmt Ship drop ergonomics RFCs ergonomics rfc ergonomics initiative Ergonomics Initiative SIMD Asimd Fsimdffi stabilize impl Trait impl Trait Aimpltrait Fimpltraitinbindings Ftypealiasimpltrait tokio web assembly webassembly wasm web assembly Owasm 
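Two of the heuristics in Appendix B lend themselves to a short illustration: the recursive inheritance of topics (an issue inherits from an RFC that lists it as tracking issue, a PR inherits from an issue it mentions) and the discussion-effort measure (characters of text, excluding code delimited by triple backticks). The following is only a minimal sketch under stated assumptions, not the authors' tooling: the function names `propagate_topics` and `discussion_effort`, the item identifiers, and the dictionary-based input format are all hypothetical, and whitespace is assumed to count as text.

```python
import re
from collections import defaultdict, deque

# Triple-backtick delimiter, built programmatically so this listing stays readable.
FENCE = "`" * 3
_CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def propagate_topics(direct_topics, inherits_from):
    """Spread topics along inheritance links until nothing changes.

    direct_topics: item id -> set of topics assigned via labels or title keywords.
    inherits_from: item id -> list of item ids it inherits topics from
                   (hypothetical encoding of "RFC lists issue as tracking issue"
                   and "PR mentions issue ID in its initial description").
    """
    # Invert the links so we can walk from a source to the items inheriting from it.
    heirs = defaultdict(list)
    for item, sources in inherits_from.items():
        for source in sources:
            heirs[source].append(item)

    topics = defaultdict(set)
    for item, assigned in direct_topics.items():
        topics[item] |= set(assigned)

    queue = deque(direct_topics)
    while queue:
        source = queue.popleft()
        for heir in heirs[source]:
            new_topics = topics[source] - topics[heir]
            if new_topics:  # only revisit an item when it gained a topic
                topics[heir] |= new_topics
                queue.append(heir)
    return dict(topics)


def discussion_effort(text):
    """Count characters of a comment after dropping triple-backtick code blocks."""
    return len(_CODE_BLOCK.sub("", text))


if __name__ == "__main__":
    result = propagate_topics(
        {"rfc-1": {"const generics"}},
        {"issue-10": ["rfc-1"], "pr-20": ["issue-10"]},
    )
    print(sorted(result["pr-20"]))
    print(discussion_effort("see " + FENCE + "fn main() {}" + FENCE + " ok"))
```

Because an item is re-queued only when it gains a new topic, the loop terminates even if the links contain cycles.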
Received June 2020 revised October 2020 accepted December 2020
::::
Extent Nature Reuse Open Source Java Projects Lars Heinemann Florian Deissenboeck Mario Gleirscher Benjamin Hummel Maximilian Irlbeck Institut für Informatik Technische Universität München Germany heinemandeissenbgleirschhummelbirlbeckintumde Abstract Code repositories Internet provide tremendous amount freely available open source code reused building new argued reuse bring gain productivity construction demanded market However knowledge extent reuse projects sparse remedy report empirical study reuse 20 open source Java projects total 33 MLOC study investigates 1 whether open source projects reuse third party code 2 much whitebox blackbox reuse occurs answer questions utilize static dependency analysis quantifying blackbox reuse code clone detection detecting whitebox reuse corpus 61 MLOC reusable Java libraries results indicate reuse common among open source Java projects blackbox reuse predominant form reuse
::::
1 Introduction reuse involves use existing artifacts construction new 9 Reuse multiple positive effects competitiveness development organization reusing mature components overall quality resulting product increased Moreover development costs well time market reduced 7 11 Finally maintenance costs reduced since maintenance tasks concerning reused parts “outsourced” organizations even stated alternatives reuse capable providing gain productivity quality projects demanded industry 15 Today practitioners researchers alike fret failure reuse form components subindustry imagined McIlroy 40 years ago 13 Newer approaches product lines 2 development product specific modeling languages code generation 8 typically focus reuse within single product family single development organization However reuse existing third party code is, from observation, a common practice almost projects significant size repositories Internet provide tremendous amount freely reusable source code frameworks libraries many recurring problems Popular examples frameworks web applications provided Apache Foundation Eclipse platform development rich client applications Due ubiquitous availability development Internet become interesting reuse repository projects 3 6 Search engines like Google Code Search httpwwwgooglecomcodesearch provide powerful search capabilities direct access millions source code files written multitude programming languages Open source repositories like Sourceforge httpsourceforgenet currently hosts almost quarter million projects offer possibility open source projects conveniently share code worldwide audience Research problem Despite widely recognized importance reuse proven positive effects quality productivity time market remains largely unknown extent current projects make use extensive reuse opportunities provided code repositories Internet Literature scarce much reuse occurs projects also unclear much code reused blackbox whitebox fashion consider lack empirical 
knowledge extent nature reuse practice problematic argue solid basis data required order assess success reuse Contribution paper extends empirical knowledge extent nature code reuse open source projects Concretely present quantitative data reuse 20 open source projects acquired different types static analysis techniques data describes reuse rate relation whitebox blackbox reuse provided data helps substantiate academical discussion success failure reuse supports practitioners providing benchmark reuse 20 successful open source projects
::::
2 Terms section briefly introduces fundamental terms study based reuse paper use rather simple notion reuse reuse considered utilization code developed third parties besides functionality provided operating system programming platform distinguish two reuse strategies namely blackbox whitebox reuse definitions strategies follow notions 17 Whitebox reuse consider reuse code whitebox type incorporated files source form ie internals reused code exposed developers implies code may potentially modified reuse rate whitebox reuse defined ratio amount reused lines code total amount lines code incl reused source code Blackbox reuse consider reuse code blackbox type incorporated binary form ie internals reused code hidden developers maintainers implies code reused ie without modifications blackbox reuse reuse rate given ratio size reused binary code size binary code whole system incl reused binary code
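Both reuse rates defined above share the same shape: the size of the reused part divided by the size of the whole system including the reused part (lines of code for whitebox reuse, byte code size for blackbox reuse). The snippet below is a small sketch of that ratio, with a hypothetical function name `reuse_rate`. The worked example uses the JEdit figures that the paper reports in Tables 1 and 3, assuming that the LOC value in Table 1 already includes the copied code.

```python
def reuse_rate(reused_size, total_size_incl_reused):
    """Ratio of reused code to the whole system, where the total includes the reused part.

    Sizes are lines of code for whitebox reuse or bytes of byte code for blackbox reuse.
    """
    if total_size_incl_reused <= 0:
        raise ValueError("total size must be positive")
    if not 0 <= reused_size <= total_size_incl_reused:
        raise ValueError("reused size must lie between 0 and the total size")
    return reused_size / total_size_incl_reused


if __name__ == "__main__":
    # JEdit: 7261 LOC found by clone detection plus 9333 LOC found by manual inspection,
    # out of 176672 LOC in total (figures taken from the paper's tables).
    jedit_reused_loc = 7261 + 9333
    jedit_total_loc = 176672
    print(f"{100 * reuse_rate(jedit_reused_loc, jedit_total_loc):.2f}")  # prints 9.39
```

Under that assumption the computed value matches the overall whitebox percentage the paper lists for JEdit.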
::::
3 Methodology section describes empirical study performed analyze extent nature reuse open source projects
::::
31 Study Design use GoalQuestionMetric template 20 defining study analyze open source projects purpose understanding state practice reuse respect extent nature viewpoint developers maintainers context Java open source achieve investigate following three research questions RQ 1 open source projects reuse first question study asks whether open source projects reuse according definition RQ 2 much whitebox reuse occurs projects reuse existing ask much code reused whitebox fashion defined Section 2 use metrics number copied lines code external sources well reuse rate whitebox reuse RQ 3 much blackbox reuse occurs ask much code reused blackbox fashion according definition question use metrics aggregated byte code size reused classes external libraries reuse rate blackbox reuse Although covered definition reuse separately measure numbers blackbox reuse Java API since one could argue also form reuse
::::
32 Study Objects section describes selected projects analyzed study preprocessed advance reuse analyses
Table 1 20 studied Java applications
System | Version | Description | LOC | Size KB
AzureusVuze | 4504 | P2P File Sharing Client | 786865 | 22761
Buddi | 3403 | Budgeting Program | 27690 | 1149
DavMail | 3851480 | Mail Gateway | 29545 | 932
DrJava | stable20100913r5387 | Java Programming Env | 160256 | 6199
FreeMind | 090 RC 9 | Mind Mapper | 71133 | 2352
HSQLDB | 1813 | Relational Database Engine | 144394 | 2032
iReportDesigner | 375 | Visual Reporting Tool | 338819 | 10783
JabRef | 26 | BibTeX Reference Manager | 109373 | 3598
JEdit | 432 | Text Editor | 176672 | 4010
MediathekView | 220 | Media Center Management | 23789 | 933
Mobile Atlas Creator | 18 beta 2 | Atlas Creation Tool | 36701 | 1259
OpenProj | 14 | Management | 151910 | 3885
PDF Split Merge | 006 | PDF Manipulation Tool | 411 | 17
RODIN | 20 RC 1 | Service Development | 273080 | 8834
soapUI | 36 | Web Service Testing Tool | 238375 | 9712
SQuirreL SQL Client | Snapshot201009181811 | Graphical SQL Client | 328156 | 10918
subsonic | 41 | Webbased Music Streamer | 30641 | 1050
Sweet Home 3D | 26 | Interior Design Application | 77336 | 3498
TVBrowser | 30 RC 1 | TV Guide | 187216 | 6064
YouTube Downloader | 19 | Video Download Utility | 2969 | 99
Overall | | | 3195331 | 100085
Selection Process chose 20 projects open source repository Sourceforge study objects Sourceforge largest repository open source applications Internet currently hosts 240000 projects 26 million users3 used following procedure selecting study objects4 searched Java projects development status ProductionStable sorted resulting list descending number weekly downloads stepped list beginning top selected standalone application purely implemented Java based Java SE Platform source download 20 study objects selected procedure among 50 downloaded projects Thereby obtained set successful projects terms user acceptance application domains projects diverse included accounting file sharing email development visualization size downloaded packages zipped files broad variety ranging 40 KB 53 MB Table 1 shows overview 
information study objects LOC column denotes total number lines Java source files downloaded preprocessed source package described Size column shows bytecode sizes study objects Preprocessing deleted test code projects following set simple heuristics eg folders named testtests cases remove code compilable one omitted code referenced commercial library also added missing libraries downloaded separately order make source code compilable either obtained libraries binary package library’s website latter case chose latest version library
3 httpsourceforgenetabout
4 selection performed October 5th 2010
Table 2 22 libraries used potential sources whitebox reuse
Library | Description | Version | LOC
ANTLR | Parser Generator | 32 | 66864
Apache Ant | Build Support | 181 | 251315
Apache Commons | Utility Methods | 5Oct2010 | 1221669
log4j | Logging | 1216 | 68612
ASM | ByteCode Analysis | 33 | 3710
Batik | SVG Rendering Manipulation | 17 | 366507
BCEL | ByteCode Analysis | 52 | 48166
Eclipse | Rich Platform Framework | 35 | 1404122
HSQLDB | Database | 1813 | 157935
Jaxen | XML Parsing | 113 | 48451
JCommon | Utility Methods | 1016 | 67807
JDOM | XML Parsing | 111 | 32575
Berkeley DB Java Edition | Database | 40103 | 367715
JFreeChart | Chart Rendering | 1013 | 313268
JGraphT | Graph Algorithms Layout | 081 | 41887
JUNG | Graph Algorithms Layout | 201 | 67024
Jython | Scripting Language | 251 | 252062
Lucene | Text Indexing | 302 | 274270
Spring Framework | J2EE Framework | 303 | 619334
SVNKit | Subversion Access | 134 | 178953
Velocity Engine | Template Engine | 164 | 70804
XercesJ | XML Parsing | 290 | 226389
Overall | | | 6149439
33 Study Implementation Execution section details study implemented executed study objects automated analyses implemented Java top open source quality analysis framework ConQAT5 provides, among others, clone detection algorithms basis functionality static code analysis Detecting WhiteBox Reuse whitebox reuse involves copying external source code project’s code sources reuse limited libraries available compile time virtually span existing Java source code best approximation existing Java 
source code probably provided indices large code search engines Google Code Search Koders Unfortunately access engines typically limited allow search large amounts code 3 MLOC study objects Consequently considered selection commonly used Java libraries frameworks potential sources whitebox reuse selected 22 libraries commonly reused based experience development projects systems analyzed earlier studies libraries listed Table 2 comprise 6 MLOC sake presentation treated Apache Commons single library although consists 39 individual libraries developed versioned independently holds Eclipse chose selection plugins find potentially copied code used clone detection algorithm presented 5 find duplications selected libraries study objects computed clones consisting least 15 statements normalization formatting identifiers type2 clones allowed us also find partially copied files files fully identical due independent evolution keeping rate false positives low clones reported tool also inspected manually remove remaining false positives complemented clone detection approach manual inspection source code study objects size study objects allows shallow inspection based names files directories correspond Java packages scanned directory trees projects files residing separate source folders packages significantly different package names used files found way inspected source identified based header comments web search course step find large scale reuse multiple files copied original package names preserved typically different project’s package names However inspection limited 22 selected libraries potentially find reused code well
5 httpwwwconqatorg
Detecting BlackBox Reuse primary way blackbox reuse Java programs inclusion libraries Technically Java Archive Files JAR zipped files containing byte code Java types Ideally one would measure reuse rate based source code libraries However obtaining source code libraries errorprone many projects document exact version used libraries 
certain cases source code libraries available avoid problems prevent measurement inaccuracies performed analysis blackbox reuse directly Java byte code stored JAR files JAR files standard way packaging reusable functionality Java JAR files directly reused merely represent container Java types classes interfaces enumerations annotations referenced types Hence type main entity reuse Java blackbox reuse analysis determines types libraries referenced types code dependencies defined Java Constant Pool 12 part Java class file holds information referenced types References method calls type usages induced eg local variables inheritance analysis transitively traverses dependency graph ie also types indirectly referenced reused types included resulting set reused types
6 addition JAR files Java provides package concept resembles logical modularization concept Packages however cannot directly reused
analysis approach ensures contrast counting whole library reused code subset actually referenced considered rationale incorporate large library use small fraction quantify blackbox reuse analysis measures size reused types computing aggregated byte code size blackbox analysis based BCEL library httpjakartaapacheorgbcel provides byte code processing functionality analysis lead overestimation reuse always include whole types although specific methods type may actually reused Moreover method may reference certain types method could unreachable hand approach lead underestimation reuse implementations interfaces considered reused unless discovered another path dependency search Details regarding potential error found section discusses threats validity Section 6 Although reuse Java API covered definition reuse also measured reuse Java API since potential variations reuse rates Java API worthwhile investigate Since every Java class inherits java.lang.Object thereby transitively references significant part Java API classes even trivial Java program exhibits, according analysis, a 
certain amount blackbox reuse determine baseline performed analysis artificial minimal Java program consists empty main method baseline blackbox reuse Java API consisted 2082 types accounted 5 MB byte code investigated reason rather large baseline found Object reference Class turn references ClassLoader SecurityManager classes belong core functionality running Java applications referenced parts include Reflection API Collection API Due special role Java API captured numbers blackbox reuse Java API separately blackbox reuse analyses performed Sun Java Runtime Environment Linux 64 Bit version 16020
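The transitive traversal described above can be illustrated with a minimal Python sketch. It is not the study's BCEL-based implementation: the reference graph, type names, and byte sizes below are hypothetical toy data, and the sketch assumes the per-type references have already been extracted from the class files.

```python
from collections import deque


def reused_types(own_types, references):
    """Return all non-project types reachable from the project's own types.

    references maps a type name to the type names it refers to
    (hypothetical stand-in for the entries of a class file's constant pool).
    """
    own = set(own_types)
    seen = set(own)
    queue = deque(own)
    while queue:
        current = queue.popleft()
        for target in references.get(current, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # Only types outside the project itself count as reused.
    return seen - own


def reused_bytes(own_types, references, sizes):
    """Aggregate the byte code size of all transitively reused types."""
    return sum(sizes[t] for t in reused_types(own_types, references))


# Hypothetical toy data: lib.Unused is never reached, so it is not counted.
references = {
    "app.Main": ["lib.Parser"],
    "lib.Parser": ["lib.Token"],
    "lib.Token": [],
    "lib.Unused": ["lib.Token"],
}
sizes = {"app.Main": 900, "lib.Parser": 4000, "lib.Token": 1500, "lib.Unused": 7000}
```

Because whole types are counted once they are referenced, the sketch shares the overestimation noted in the text: a type is included even if only one of its methods is used.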
::::
4 Results section contains results study order research questions 41 RQ 1 Open Source Projects Reuse reuse analyses revealed 18 20 projects reuse third parties ie analyzed projects 90 reuse code HSQLDB YouTube Downloader projects reuse, neither blackbox whitebox, was found 42 RQ 2 Much WhiteBox Reuse Occurs attempt answer question combination automatic techniques clone detection manual inspections clone detection code study objects libraries Table 2 reported 337 clone classes ie groups clones 791 clone instances together numbers include clones study object one libraries clones within study objects libraries considered HSQLDB set study objects libraries used discarded clones two Manual inspection clones led observation typically clones file pairs nearly completely covered clones unit reuse far found fileclass level single methods sets methods copied copied files completely identical changes caused either minor modifications files copying study objects likely due different versions libraries used differences files minor counted entire file copied major part covered clones manual inspection study objects found entire libraries copied four study objects libraries either less wellknown GNU ritopt longer available individual microstar XML parser released individual rather extracted another OSM JMapViewer could found clone detection algorithm corresponding libraries part original set results duplicated code found clone detection code found manual inspection summarized Table 3 last column gives overall amount whitebox reused code relative project’s size
Table 3
System                  Clone Detection (LOC)   Manual Inspection (LOC)   Overall Percent
AzureusVuze             1040                    57086                     7.39
Buddi                   -                       -                         -
DavMail                 -                       -                         -
DrJava                  -                       -                         -
FreeMind                -                       -                         -
HSQLDB                  -                       -                         -
iReportDesigner         298                     -                         0.09
JabRef                  -                       7725                      7.06
JEdit                   7261                    9333                      9.39
MediathekView           -                       -                         -
Mobile Atlas Creator    -                       2577                      7.02
OpenProj                87                      -                         0.06
PDF Split Merge         -                       -                         -
RODIN                   382                     -                         0.14
soapUI                  2120                    -                         0.89
SQuirreL SQL Client     -                       -                         -
subsonic                -                       -                         -
Sweet Home 3D           -                       -                         -
TVBrowser               513                     -                         0.27
YouTube Downloader      -                       -                         -
Overall                 11701                   76721                     n/a
LOC 11 20 study objects whitebox reuse whatsoever could proven another 5 reuse 1 However also 4 projects whitebox reuse range 7 10 overall LOC numbers shown last row indicate amount code results copying entire libraries outnumbers far code reused selective copypaste 43 RQ 3 Much BlackBox Reuse Occurs Figure 1 illustrates absolute bytecode size distributions code reused parts libraries 3rd party Java API ordered descending total amount bytecode horizontal line indicates baseline usage Java API reuse third party libraries ranged 0 MB 422 MB amount reuse Java API similar among analyzed projects ranged 129 MB 166 MB median 24 MB third party libraries 133 MB Java API iReportDesigner reused functionality blackbox fashion libraries Java API smallest extent blackbox reuse YouTube Downloader Figure 2 based data shows relative distributions bytecode size projects ordered descending total amount relative reuse relative reuse third party libraries 0 617 median 118 relative amount reused code Java API ranged 230 993 median 730 Overall third party Java API combined relative amount reused code ranged 413 999 median 854 iReportDesigner highest blackbox reuse rate YouTube Downloader used code Java API relative code size 19 20 projects amount reused code larger amount code overall amount reused code sample projects 34 stemmed third party libraries 66 Java API Figure 3 illustrates relative byte code size distributions code third party libraries ie without considering Java API reused library projects ordered descending reuse rate relative amount reused library code ranged 0 989 median 451 9 20 projects amount reused code third party libraries larger amount code
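The relative reuse rates reported above can be sketched as simple shares of total byte code. The formula is an assumption, not a definition quoted from the text: it treats each rate as the reused byte code divided by the sum of own and reused byte code, which matches the statement that a rate above one half means more reused code than own code. The function name and parameters are hypothetical.

```python
def relative_reuse(own_bytes, third_party_bytes, java_api_bytes=0):
    """Return reuse shares relative to own code plus reused code.

    Assumption: rate = reused / (own + reused). Passing java_api_bytes=0
    gives the variant that ignores the Java API.
    """
    total = own_bytes + third_party_bytes + java_api_bytes
    if total == 0:
        raise ValueError("project has no byte code")
    return {
        "third_party": third_party_bytes / total,
        "java_api": java_api_bytes / total,
        "combined": (third_party_bytes + java_api_bytes) / total,
    }
```

Under this reading, a combined share above 0.5 corresponds to a project whose reused code is larger than its own code.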
::::
5 Discussion data presented previous sections lead interesting insights current state open source Java development also open new questions part study setup discuss following sections 51 Extent Reuse study reveals reuse common among open source Java projects blackbox reuse predominant form None 20 projects analyzed less 40 blackbox reuse including Java API Even considering Java API median reuse rate still 40 4 projects 10 threshold Contrary whitebox reuse found half projects never exceeds 10 code difference probably explained increased maintenance efforts commonly associated whitebox reuse described Jacobson et al 7 Mili et al 14 detailed results RQ 2 also revealed larger parts consisting multiple files mostly copied either originating library longer maintained files never released individual library cases project’s developers would maintain reused code case removes major criticism whitebox reuse also seems amount reused third party libraries seldom exceeds amount code reused Java API projects case iReportDesigner RODIN soapUI first two built upon NetBeans respectively Eclipse provide rich platforms top Java API Based data obvious early visions reusable components connected small amounts glue code would lead reuse rates beyond 90 realistic today hand reuse rates found high enough significant impact development effort would expect reuse also fostered open source movement huge contribution rich set applications available today 52 Influence Size Reuse Rate amount reuse ranges significantly different projects PDF Split Merge thin wrapper around existing libraries also large projects relatively small reuse rates eg less 10 Azureus without counting Java API Motivated study Lee Litecky 10 investigated possible correlation code size reuse rate data set study based survey domain commercial Ada development 73 samples found negative influence size rate reuse reuse rate without Java API third party code found Spearman correlation coefficient 005 size project’s code twotailed 
pvalue 083 Thus infer dependence values use overall reuse rate including Java API Spearman coefficient 093 pvalue 00001 indicates significant strong negative correlation confirms results 10 size typically reduces reuse rate 53 Types Reused Functionality interesting investigate kind functionality actually reused Therefore tried categorize reused libraries different groups common functionality Consequently analyzed purpose reused library divided seven categories eg Networking TextXML Rich Client Platforms GraphicsUI determine extent certain type functionality reused employed blackbox reuse detection algorithm presented Section 33 calculate amount bytecode library reused inside observed predominant type reused functionality nearly projects reusing functionality belonging one category believe significant insight report except reuse seems diverse among categories concentrated single purpose
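The rank correlation used in the size versus reuse rate discussion can be sketched in plain Python. This is a generic Spearman coefficient with average ranks for ties, computed as the Pearson correlation of the ranks; it is not the authors' tooling, it does not compute a p-value, and any input data supplied to it here would be hypothetical rather than the study's measurements.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A coefficient near -1 means that larger projects consistently have lower reuse rates, and a coefficient near 0 means no monotonic relationship between size and reuse rate.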
::::
6 Threats Validity section discusses potential threats internal external validity results presented paper 61 Internal Validity amount reuse measured fundamentally depends definition reuse techniques used measure discuss possible flaws lead overestimation actual reuse underestimation otherwise threaten results Overestimation reuse measurement whitebox reuse used results clone detection could contain false positives Thus reported clones indicate actual reuse mitigate manually inspected clones found Additionally automatically manually found duplicates known whether code copied study objects rather However findings manually verified example checking header comments ensured code actually copied library study object estimation blackbox reuse based static references bytecode consider class completely reused referenced may case example method holding reference another class might never called Another possibility would use dynamic analysis execution traces determine amount reused functionality However approach disadvantage finite subset execution traces could considered leading potentially large underestimation reuse Underestimation reuse application clone detection limited fixed set libraries Thus copied code could missed source taken included comparison set Additionally detector might miss actual clones low recall due weak normalization settings address chose settings yield higher recall cost precision manual inspection study objects’ code whitebox reuse inherently incomplete due large amounts code obvious copied parts could found static analysis used determine blackbox reuse misses certain dependencies method calls performed via Java’s reflection mechanism classes loaded based configuration information Additionally analysis penetrate boundaries created Java interfaces actual implementations used runtime dependencies might included reuse estimate mitigate one could search implementing class include first match dependency search result set However preliminary experiments 
showed approach leads large overestimation example command line program references interface also implemented UI class could lead us false conclusion program reuses UI code many forms reuse covered approach One example reusable generators uses code generator generate source code models would detected form reuse approach Moreover many ways components interact besides use dependencies source code Examples interprocess communication web services utilize services via SOAP calls integration database via SQL interface 62 External Validity tried use comprehensible way sampling study objects clear extent representative class open source Java programs First choice Sourceforge source study objects could bias selection certain kind open source developers could prefer repositories Google Code Second selected projects 50 downloaded ones could bias results scope study open source Java programs transferability results programming languages commercially developed unclear Especially programming language expected huge impact reuse availability open source commercial reusable code heavily depends language used
::::
7 Related Work reuse research field extensive body literature overview different reuse approaches found survey Krueger 9 following focus empirical work aims quantifying extent reuse real projects 18 Sojer et al investigate usage existing open source code development new open source conducting survey among 686 open source developers analyze degree code reuse respect developer characteristics report reuse plays important role open source development study reveals mean 30 implemented functionality projects survey participants based reused code Since Sojer et al use survey analyze extent code reuse results may subject inaccurate estimates respondents approach analyzes source code projects therefore avoids potential inaccuracy results confirmed study since also report reuse common open source projects Haefliger et al 4 analyzed code reuse within six open source projects performing interviews developers well inspecting source code code modification comments mailing lists web pages study revealed sample projects reuse Moreover authors found far dominant form reuse within sample blackbox reuse sample 6 MLOC 55 components total account 169 MLOC reused 6 MLOC 38 kLOC reused whitebox fashion developers also confirmed form reuse occurs infrequently small quantities study related however granularity blackbox analysis different treated whole components reusable entities measured fraction library actually used Since use code repository commit comments identifying whitebox reuse results sensitive regards accuracy comments contrast method utilizes clone detection therefore dependent correct commit comments study confirms finding blackbox far predominant form reuse 16 Mockus investigates largescale code reuse open source projects identifying components reused among several projects approach looks directories projects share certain fraction files equal names investigates much files reused among sample projects identify type components reused studied projects 50 files used one 
Libraries reused blackbox fashion considered approach Mockus’ work quantifies often code entities reused work quantifies fraction reused code compared code within projects Moreover reused entities smaller group files considered However results line findings regarding observation code reuse commonly practiced open source projects 10 Lee et al report empirical study investigates organizations employ reuse technologies different criteria influence reuse rate organizations using Ada technologies surveyed 500 Ada professionals ACM Special Interest Group Ada onepage questionnaire authors determine amount reuse survey Therefore results may inaccurate due subjective judgement respondents approach mitigates risk analyzing source code 19 von Krogh et al report exploratory study analyzes knowledge reuse open source authors surveyed developers 15 open source projects find whether knowledge reused among projects identify conceptual categories reuse analyze commit comments code repository identify accredited lines code direct form knowledge reuse study reveals considered projects reuse components observation reuse common open source development therefore confirmed study Like Haefliger et al Krogh et al rely commit comments code repository already mentioned potential drawbacks Basili et al 1 investigated influence reuse productivity quality objectoriented systems Within study determine reuse rate 8 projects developed students size ranging 5 kSLOCs 14 kSLOCs report reuse rates similar range results analyzed rather small programs written students context study contrast analyzed open source projects
::::
8 Conclusions Future Work reuse often called holy grail engineering certainly found form reusable components simply need plugged together However study shows reuse common almost open source Java projects also significant amounts reused analyzed 20 projects 9 projects reuse rates 50—even reuse Java API considered Reassuringly reuse rates great extent realized blackbox reuse copypasting source code conclude world opensource Java development high reuse rates theoretical option achieved practice Especially availability reusable functionality necessary prerequisite reuse occur wellestablished Java platform next step plan extend studies programming ecosystems development models particular interested extent nature reuse projects implemented legacy languages like COBOL PL1 one hand currently hyped languages like Python Scala hand Moreover future studies include commercial systems investigate extent opensource development model promotes reuse Acknowledgment authors want thank Elmar Juergens inspiring discussions helpful comments paper References Basili V Briand L Melo W reuse influences productivity objectoriented systems Communications ACM 3910 116 1996 Clements P Northrop LM Product Lines Practices Patterns 6th edn AddisonWesley Reading 2007 Frakes W Kang K reuse research Status future IEEE Transactions Engineering 317 529–536 2005 Haefliger Von Krogh G Spaeth Code Reuse Open Source Management Science 541 180–193 2008 Hummel B Juergens E Heinemann L Conradt IndexBased Code Clone Detection Incremental Distributed Scalable ICSM 2010 2010 Hummel Atkinson C Using web reuse repository Morisio ed ICSR 2006 LNCS vol 4039 pp 298–311 Springer Heidelberg 2006 Jacobson Griss Jonsson P reuse architecture process organization business success AddisonWesley Reading 1997 Kelly Tolvanen JP DomainSpecific Modeling Wiley Chichester 2008 Krueger C reuse ACM Comput Surv 242 131–183 1992 Lee N Litecky C empirical study reuse special attention Ada IEEE Transactions Engineering 239 537–549 1997 
Lim W Effects reuse quality productivity economics IEEE 115 23–30 2002 Lindholm Yellin F Java virtual machine specification AddisonWesley Longman Publishing Co Inc Boston 1999 McIlroy Buxton J Naur P Randell B Mass produced components Engineering Concepts Techniques pp 88–98 1969 Mili H Mili Yacoub Addy E ReuseBased Engineering Techniques Organizations Controls Wiley Interscience Hoboken 2001 Mili H Mili F Mili Reusing Issues research directions IEEE Transactions Engineering 216 528–562 1995 Mockus Largescale code reuse open source FLOSS 2007 2007 Ravichandran Rothenberger reuse strategies component markets Communications ACM 468 109–114 2003 Sojer Henkel J Code Reuse Open Source Development Quantitative Evidence Drivers Impediments JAIS appear 2011 von Krogh G Spaeth Haefliger Knowledge Reuse Open Source Exploratory Study 15 Open Source Projects HICSS 2005 2005 Wohlin C Runeson P Höst Experimentation engineering introduction Kluwer Academic Dordrecht 2000
::::