From:
Date: Tue, Sep 6, 2016 at 9:26 PM
Subject: CSCW 2017 notification - #516
To: snehanarayan@gmail.com
Cc: papers2017@cscw.acm.org

Dear Sneha Narayan -

We are pleased to inform you that your paper:

516 - The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

has been accepted to CSCW 2017. Congratulations! This year we received 530 submissions, of which 183 have been accepted for presentation at the conference. We are writing to provide your second round reviews, and to give you important information related to submitting your camera-ready paper and presenting it at the conference.

First, your reviews are provided below. Please read these carefully and make sure your final submission of the camera-ready paper is as good as possible. In many cases reviewers have suggestions or requests that will improve your paper.

Your next step is to prepare your camera-ready paper, which must be submitted into the PCS system by October 31, 2016. You will also be contacted by Sheridan Publishing, or directly by us, with specific information about producing an appropriate PDF, choosing among ACM copyright and license options, etc. Please pay special attention to the citation format used by CSCW (e.g., author’s first name spelled out first, but sorted by family name). All papers must be submitted in camera-ready form to be included in the conference program.

Please note that an author of each paper must register for the conference and attend it to present the paper. Papers without a registered presenter will be removed from the proceedings. Registration will open in the Fall, with the greatest discounts available until the early registration deadline of January 11, 2017. Please be sure that at least one author registers by that date. Also, please let us know if the presenting author is someone other than the contact author for this paper so we can appropriately reach that person with any needed information. Finally, if you are coming from a country where a visa is required to visit the US, please be sure to start the process of getting that visa early.

Soon after October 31st we will post a presentation schedule on the CSCW 2017 website so you can plan for your presentation time. All papers will be presented in slots of just over 20 minutes, so you should plan on a talk of 15-17 minutes with 3-5 minutes for questions.

Finally, if your work involves an innovative system that would be appropriate to demonstrate, we'd like to encourage you to submit a demonstration to CSCW 2017 as well (deadline: November 4). Details at https://cscw.acm.org/2017/submit/demos.php.

Again, congratulations! Thank you for submitting your work to CSCW 2017 and we look forward to seeing you in Portland!

Louise Barkhuus
Marcos Borges
Wendy Kellogg
CSCW 2017 Papers Chairs

------------------------ Submission 516, Review 4 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users
Reviewer: AC
Expertise: 2 (Passing Knowledge)
First Round Overall Recommendation: 3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

The paper presents the design and evaluation of a gamified tool for socializing and retaining new Wikipedia editors. Contribution criteria include (1) a description and rationale for the system; (2) system novelty and rationale for how it leads to learning; and (3) a methodologically sound evaluation.
First Round Review from AC (if needed)

Coordinator's First-Round Report to Authors

The paper presents the design and evaluation of a gamified tool for socializing and retaining new Wikipedia editors. The study found that users liked—but did not learn from—the system. The focus on improving the experience of newcomers in Wikipedia is relevant and important. Reviewers describe the study as well motivated and exceptionally well-written. Read R3’s comments on the writing quality and congratulate yourself!

The reviewers, however, have many concerns about the paper—each focusing on a different aspect of the work. The concerns the reviewers note /may/ be addressable during the revise and resubmit period, but it will be an exceptionally herculean effort. Also, please keep in mind that there is no guarantee of acceptance even after making changes. So, it is at the authors’ discretion whether to proceed with revisions or withdraw the paper.

There is a split amongst the reviewers as to whether the failure of the tool is interesting or not. R1 raises concerns that the failure of the tool could be predicted from existing literature, suggesting little rationale for doing the work in the first place. R2 asks whether there is something fundamentally different about people who continue to contribute to Wikipedia, and as such whether the system holds value in practice. R3, on the other hand, sees much value in the work's systems contribution as well as the real-world evaluation. R3's review has some suggestions of alternative framings that may make the contribution more valuable.

In the treatment of related work, many improvements are needed. R1 notes that the discussion of the well-known concept of legitimacy/authenticity in learning environments is missing. R2 also points to missing literature about Wikipedian experience.

R2 and R3 raise a number of methodological questions about the paper. R2 suggests the distribution of participants across the timeline may bias the results. R3, on the other hand, sees opportunity here, suggesting additional statistical analysis related to longevity and power users. Both R2 and R3 question the methodological choice and contribution of measuring perceptions of learning rather than actual learning. Overall, this points to a need for at the very least justifying the methodological choices and at the most carrying out additional statistical analyses.

In summary, there is quite a bit of work to be done. I wish the authors the best of luck, should they choose to continue in the review process.

Requested Revisions

REQUIRED:
- Provide justification for why the study was worth carrying out, in response to R1 and R2’s concerns. R3’s review may have some insight into alternative framings.
- State the research questions more explicitly, as per R2’s recommendation.
- Address R1 and R2’s concerns about missing literature.
- Ensure that the narrative around Wikipedia is clear to readers who do not have an in-depth background in production/editing details.
- Improve the clarity of the results by using percentages or another baseline that allows comparison between numbers, as per R2’s review.
- Provide justification for measuring perceptions of learning versus actual learning.
- Provide a robust discussion of why the results are meaningful for researchers and/or practitioners.

OPTIONAL, RECOMMENDED:
- Consider carrying out additional statistical analyses as recommended by R3.
- Provide a short justification for use of English language Wikipedia, as per R2’s review.
Formatting and Reference Issues

Author Response

Most or all of my comments were addressed.

Final Rating of Revision: 4 (Probably Accept)

The Review of Revision

The authors addressed the majority of reviewer concerns during the revision period. The explanation of study purpose and specific research questions is much clearer. The authors filled the gaps in the related work. Wikipedia-specific jargon was reduced. The data were reanalyzed to address reviewer concerns about temporal aspects of the data.

Where the paper still struggles a bit is in clearly articulating its contribution. As such, the reviewers are largely positive, but also have significant reservations about the work. In what is otherwise a solid research study, the authors have struggled with clearly articulating how the work contributes to - and challenges - existing research knowledge. Ultimately, I think this is an issue of the paper just needing a bit more tweaking to the narrative. The reviewers all note ways in which the paper does provide a counterpoint to existing research. In particular, R1 notes that the findings of the paper are different than what one would expect given Halfaker's work on newcomer enculturation. This previous work indicates that interventions to boost confidence should have had a positive effect. But in this study, it didn't happen. So although the authors may not have called out this issue as explicitly as they could, it does appear that the work has findings that contrast with prior research on Wikipedia newcomer enculturation. In addition, there are (perhaps understated) implications for gamification research, which R3 describes as "fascinating."

To sum up, the paper is right on the borderline for acceptance at a top-tier conference, and I would lean toward accepting it. It's not a perfect paper, but I am very encouraged by the amount of thought and conversation this work has raised amongst the reviewers. I'm also encouraged by how the work has cross-cutting implications across three research areas: learning, online peer production, and gamification. So there is potential for rather broad appeal.

Coordinator's Final Report to Authors (meta-review)

Congratulations on acceptance! The paper was discussed at the program committee meeting and was positively received. While no changes are mandatory, we do recommend taking a look at the second round reviews, which have some additional suggestions for improvement.

Remaining Formatting and Reference Issues

Report completed: Completed

------------------------ Submission 516, Review 1 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users
Expertise: 4 (Expert)
First Round Overall Recommendation: 3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

In this paper, the authors present the design and two-pronged evaluation of a tutorial for new Wikipedia editors that uses elements of gamification like missions and badges to help coach new editors and help them learn best practices and social norms of Wikipedia. The outcome is that users like the system but, based on behavioral measures, they don't actually learn from it. Learning interventions are a classic kind of research problem, and the paper should include robust measures of learning, as well as a good description of the designed intervention itself, why the design is expected to lead to learning, and a clear description of the study.
First Round Review

This is a reasonably well motivated study with connections to appropriate literature, and the writing is engaging and understandable. The problem of enculturating newcomers into projects like Wikipedia is well documented, and this paper investigates a potential intervention with an admirably well-planned study. Designing learning interventions is really difficult and I commend the authors on a well-executed effort.

Still, I am ambivalent about the paper because I would have predicted these outcomes based on the literature alone. In the discussion, the authors note that one mismatch between Wikipedia and the tutorial as designed involves the “gradual peripheral participation” of newcomers as they take on the identity of “Wikipedian.” They suggest that maybe speeding up this process is unnatural. I would argue that the most important concept from the literature on learning is missing from this discussion, and that’s “legitimacy” (also sometimes referred to in the education and learning literature as “authenticity”). The authors explain that by doing tasks in a pretend version of Wikipedia, they make it a safe space for newcomers to practice, yet performing “canned” tasks in a pretend system is the opposite of offering a legitimate form of participation. I immediately wonder, why not use what we know from the literature to create low-risk missions that newcomers can complete while legitimately contributing to the encyclopedia? Risk taking is a fundamental characteristic of games that makes them engaging; it certainly seems like it would play a role in people’s motivation in a scenario like this. Rather than eliminating risk, the literature on legitimate peripheral participation would suggest that finding the right degree of risk is required to facilitate progressive entree into a set of shared practices. I am disappointed by the missed opportunity here; the outcome mainly seems to verify that what the literature tells us shouldn’t work in fact doesn’t work.

Yet the paper isn’t bad, and the study is carefully crafted and reported. With some extension and reflection, I think the discussion could help point future research in a more fruitful direction. There are millions of pages written on the challenges of designing learning interventions that change people’s behavior, yet this paper ends on a painfully obvious note. It’s true that usability isn’t all it takes, but what can we learn from TWA about the design of systems to facilitate enculturation into a community of practice? What can we take away from this that might inform more successful tutorial systems in the future?

Author Response

Most or all of my comments were addressed.

Final Rating of Revision: 4 (Probably Accept)

The Review of Revision

The revised version of the paper has addressed many of the issues raised by me and other reviewers and, in some cases, even presents new analyses to address critiques. My initial response to the study was that, knowing all that I do about the literature on learning, I don't see why an intervention like TWA should be expected to work. The authors have addressed that problem by including citations to Guzdial and Tew that suggest inauthentic learning experiences can be effective sometimes. It's fine to point out exceptional cases - it doesn't change the fact that generally the literature would lead us to expect such interventions not to work. And it didn't.
The fact that the authors thought it would work and can rationalize it post-hoc isn't an argument that the results are surprising. What might have been more compelling was the observation that confidence was bolstered and therefore should lead noobs to overcome the problems outlined in Halfaker's "don't bite the newbies" paper - yet it didn't. The findings seem to suggest that Wikipedians are, indeed, born and not made. Either that or this simply isn't a good way of enculturating newcomers. Maybe people didn't find the tasks to be authentic in the context of an imagined community? Maybe people still think Wikipedians are mean and scary even if their confidence has been bolstered? Maybe gamification isn't a great way to engage would-be encyclopedia writers.

I have raised my score to a 4 because the authors have done a good job in writing about a negative result, but I clearly also have some serious reservations. The findings suggest that Wikipedia needs to work on welcoming people, but we already knew that from the "don't bite the newbies" paper. In the end, I feel that although the paper presents a negative finding well, the many alternative explanations don't provide a satisfying narrative that leaves a reader with answers or theoretical insight. The phenomenon that learners report they *like* or enjoy a learning intervention that has no impact on their behavior or learning is, unfortunately, all too familiar. (It's almost remarkable that the intervention didn't even elicit a short-term Hawthorne Effect.)

Remaining Formatting and Reference Issues

------------------------ Submission 516, Review 2 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users
Expertise: 4 (Expert)
First Round Overall Recommendation: 2 (Probably NOT acceptable)

Contribution and Criteria for Evaluation

This paper's contribution is the design and evaluation of a structured introduction to a peer production community (English Wikipedia) called "The Wikipedia Adventure". TWA's design is rooted in theories of gamification, and its utility is evaluated through a user survey and an invitation-based field experiment. The paper reports on the survey respondents' satisfaction with TWA, and how their experiment results reveal some of the challenges of effecting lasting changes to contributor patterns in peer production communities. These findings are then discussed in relation to cultural factors in Wikipedia, issues of self-selection and voluntary participation, and the limitations of gamification.

When evaluating a paper that describes the design of a system, the two main criteria are that the system and/or its development setting is/are novel, and that the way the system is evaluated is methodologically sound.

First Round Review

As mentioned in the contribution section, this paper's contribution is the design and evaluation of a structured introduction to a peer production community based on gamification, called "The Wikipedia Adventure". This is a great idea and sounds like a useful addition to Wikipedia. The paper is written in a way that makes it easy to read, and provides the reader with a good introduction to how TWA's design is rooted in theories of gamification, thus applying these principles in what appears to be a novel setting. The paper also does a good job of discussing the findings, organizing them in a way that is easy to follow and touching on important points (e.g. cultural factors, and the limitations of self-selection and gamification).
The overall ideas and approach taken in this paper are sound, and they are in line with the criteria described previously. Unfortunately, there are two major issues and several minor ones that need to be resolved before this paper is ready for publication. The first major issue is that the methodology used to evaluate performance in the invitation-based experiment measures contribution in a skewed manner and does not establish why that is appropriate. Secondly, the paper fails to consider arguments put forth by Panciera et al.'s "Wikipedians are Born, Not Made" paper. This review will expand on both of these major issues below. Further below will be notes and comments with suggestions for improvement for specific sections of the paper, some of which are rather substantial as well.

1: Evaluating TWA effectiveness by number of contributions
----------------------------------------------------------
A major part of the paper is the evaluation of TWA's effect on subsequent contributions. To evaluate this an invitation-based field experiment is used, and the paper does a great job of justifying why that is appropriate in this setting. The experiment runs from February 2014 and continues for three months. Exact dates are not given, so let us assume that it ran until the end of April 2014. User contributions are then measured until the end of May 2014. There are two problems with this approach that the paper fails to address properly.

One is the issue of right-truncation found in the data. Contributors who joined in early February 2014 would have about four months to make edits, whereas those who joined in late April would only have about a month. The model does contain a control variable for number of days in the experiment, but why is that appropriate in this context? If we examine other work in the same domain, they tend to either use a much longer time period (e.g. the Teahouse paper, citation 23, which uses 6-9 months) or ensure that the time period is fixed (e.g. Kittur et al. "Herding the Cats: The Influence of Groups in Coordinating Peer Production", WikiSym 2009; or Zhu et al. "Effectiveness of Shared Leadership in Online Communities", CSCW 2012).

Related to the right-truncation problem is the fact that the paper also fails to discuss and justify what a reasonable timespan for measuring the effect of TWA is, and that this choice will have an effect on the number of contributions measured. It might for instance be that TWA instead has an effect on how long it takes before a user drops out of the system. If we assume that TWA has an effect on contributions, what timespan is needed to measure that effect? The paper assumes that a month is adequate to discover it, whereas one might suspect that it is only measurable over a longer period of time. If it is the case that a short period of time is appropriate (for instance because these users are likely to drop out after a certain amount of time) the paper needs to properly establish that, either by measuring it or referring to previous work.

2: Wikipedians Are Born, Not Made
---------------------------------
In their GROUP 2009 paper "Wikipedians Are Born, Not Made: A Study of Power Editors on Wikipedia", Panciera et al. show data that argues that those contributors who are going to stick around behave in a way that is different from the very beginning. In followup work published in 2010 they find similar differences in another peer production community (Panciera et al. "Lurking? cyclopaths?: a quantitative lifecycle analysis of user behavior in a geowiki." CHI 2010).
These two papers and the argument they put forth are relevant because they question who TWA is designed for. In the related work a reference to Bryant et al.'s "Becoming Wikipedian" is made, thereby suggesting that TWA is designed to teach someone how to be a Wikipedian. As Panciera et al.'s paper argues along the lines of these contributors already being Wikipedians, should TWA be designed to instead help these contributors stay productive? If Wikipedians are born, not made, then one could also question whether these contributors are at all going to use TWA. Maybe they ignore TWA because they are already productive and do not need it? Since the paper never makes any references to these papers and discusses issues related to this (e.g. "is the Teahouse more effective since it allows them to get answers when they need help?"), this whole topic area is left hanging.

---

Below follow comments/notes for each section of the paper.

Introduction:
* An overall issue here is that there are few citations to sources. For instance a claim is made that "newly created accounts are the primary source of spam and vandalism on Wikipedia". Consider adding a "[citation needed]" after that.
* When citing multiple papers it is preferable that they are in order, e.g. "[14, 23, 17]" should be "[14, 17, 23]" (page 1). This minor issue also occurs elsewhere in the paper.
* "Unlike prior systems, TWA creates a structured experience that guides newcomers through critical pieces of Wikipedia knowledge..." Do we know that there are no other prior systems that offer a similar experience? It might be that there are none within the Wikipedia domain, but what about outside it? That sentence is making a rather bold claim.
* After reading the introduction, what is the reader expected to remember as the main findings in this paper? At the end of the introduction the following sentence is found: "The study underscores the importance of conducting multiple types of evaluations of social systems." Is that the main contribution? What about the implications for gamified structured introductions to peer production?

Background:
* "...women reported that they found that contributing to Wikipedia involved a high level of conflict and that they lacked confidence in their expertise [8]. This suggests that more effective onboarding tools could help incorporate newcomers." This is an important side of Wikipedia, but how does TWA's design help mitigate this issue? Are there design elements in TWA that aim to boost confidence in one's expertise?
* At the end of the introduction we find the following two questions: "Would a gamified tutorial produce a positive, enjoyable educational experience for new Wikipedians? Would playing the tutorial impact newcomer participation patterns?" These are the paper's _research questions_! It would be very helpful to the reader if they were displayed more clearly, e.g. as separate items. They should not be hidden.

System Design:
* "...it does not depend on the availability, helpfulness, or intervention of existing Wikipedia editors..." The underlying argument here is that scalability is preferable to personal interaction when socializing newcomers (in peer production communities). Why is that the better solution? As discussed previously, TWA might be designed for contributors who are not going to stick around; why are those the right audience for it? Is the goal to provide _everyone_ with a scalable impersonal introduction, or is it better to provide _some_ (typically based on self-selection) with a personal introduction (e.g. the Teahouse)?
Game-like elements (subsection of System Design):
* In "Missions" a distinction is made between "basic" and "advanced" editing techniques. It appears to be somewhat arbitrary; why is adding sources advanced editing, but watchlists are not?
* Your readers might not know what watchlists are; take care to write for a general audience, as not everyone knows a lot about how Wikipedia works behind the scenes.

Study 1: User Survey:
* This paper doesn't discuss any other language editions of Wikipedia besides the English one, and makes the assumption that "Wikipedia" equals the English edition. Adding a mention that Wikipedia exists in multiple languages and explaining why English was chosen as the language where TWA was launched would be very helpful.
* The paper aims to measure "educational effectiveness". Why is a survey the appropriate way to measure that? Based on the description of the survey, it seems that it never asks specific questions to test whether TWA's users learned specific things, in other words whether the education was successful. Later when describing the results the phrase "learning to edit Wikipedia" is used; isn't that the _key_ learning goal of TWA? Yet the survey asks Likert-scale questions. In other words, you're measuring whether TWA users are under the impression that they learned something, not whether they actually did.
* Figure 4 uses counts. While it shows that none of the questions had responses from all participants, it makes comparisons between questions with different response rates very difficult. Using percentages would allow for direct comparisons, and would make the references to the figure in the text easier to follow along with. The text refers to four questions with a certain percentage of responses, but leaves the math to the reader.
* The survey leaves many questions unanswered, some of which the paper might want to address. Were any negative questions asked? Were there any control questions, such as a similar question worded slightly differently to allow for comparison between responses? As it is, this survey comes across as a set of positive statements about TWA that respondents agreed to. Given that respondents self-select and no attempts to contact users who didn't go through TWA appear to have been made, it is likely there is a bias in the responses, and that bias should be discussed.

Study 2: Field Experiment:
* The description of how accounts were selected to be included is rather confusing. First it describes 1,967 accounts that met the same criteria as for the user survey; however, 10,000 individuals ("accounts"?) were invited to the beta. Why is one an order of magnitude larger than the other? Then in the second paragraph of "Methods" it describes the selection criteria, that at least one contribution would have to be made after getting invited. This would perhaps be much less confusing if the criteria were first explained, particularly how the experiment and control groups were set up, and then how many accounts were identified.
* "This is a larger proportion of users than took up the invitation in Study 1, which may be due to changes in the invitation text." Earlier in the paper, Study 1 refers to a "beta", whereas this appears not to be one. If this is the case, this is an important difference between the two that should be made clear to the reader.
* "we measure the overall contributions as the total number of edits made by each account from the time of inclusion in the study until May 31, 2014." When exactly is "time of inclusion", is that when they got the invite? What about when they completed one (or all) TWA mission(s)? The concern here is that all contributions are measured, whereas the experiment sets up a pre/post-scenario. Later on the paper refers to "subsequent contributions", indicating that contributions after a certain point in time was measured. This quickly becomes rather confusing, spelling out clearly what points in a user's account history is used (e.g. "we measure contributions at four points in time: when the user registered their account, the time of invitation, when they first started using TWA, and the end of the experiment") would be very helpful. * Why is a six-edit radius chosen when measuring word persistence? Halfaker et al. make no claim about what the radius should be in the referenced work, and Ekstrand et al suggest a 15 edit radius in a related paper (Ekstrand and Riedl "rv you're dumb: identifying discarded work in Wiki article history." WikiSym 2009) The six-edit radius also comes with an issue that is unadressed: how long does it take for an edit made by a contributor in the study to reach that six-edit radius? If it hasn't been reached at the end of the study period, that edit has to be discarded as its quality is unknown. In a related paper, Farzan and Kraut instead chose to use percentage of words that survived as a measure of quality (Farzan and Kraut "Wikipedia classroom experiment: bidirectional benefits of students' engagement in online production communities" CHI 2013) * Tables 1, 2, 3, and 4, as well as figure 6 should be brought closer together so it's easier to follow along. Table 1 occurs before the text that refers to it, and table 4 is two pages further along. Putting all tables and figure 6 on the same page might be a good solution. * Table 3 refers to users "reached" a mission. It is confusing how 181 users reached the final mission but did not complete it, yet in the text it seems these 181 users actually did. * The post-hoc power analysis is very useful! Discussion: * "The new editors in our study may have had unpleasant experiences during their initial time on Wikipedia..." It appears that the survey asked no questions about this, yet is it not a very important issue related to TWA's success? * In "Limitations of gamification" the following sentence is found: "...our study is among the first that compares levels of participation in a task among individuals who were introduced to gamified learning first to those that were not." This is an _important_ finding, it shouldn't be hidden back here but instead be up front in the introduction! Author Response Most or all of my comments were addressed. Final Rating of Revision 5 (Definitely Accept) The Review of Revision First of all, this reviewer would like to congratulate the authors on the herculean effort that's been put into improving this paper and the quality that has resulted from it. The attached revision document is also of high quality, carefully considering the comments from the reviewers and arguing well for why some of our suggestions were not implemented. After carefully reading the revised version, my final recommendation for this paper changes to a 'Definitely Accept'. There are several reasons for why this paper ought to be included in the conference proceedings: 1: It is a well-written paper. 
1: It is a well-written paper.
This was implied in some of my previous comments, as well as R3's applause. The revised version is no different from the initial one in this respect: the changes keep with the clear writing style, and the content changes have further improved the paper.

2: The literature on newcomer interventions in peer production communities (such as Wikipedia) is sparse.
Much work has studied what happens to newcomers in these communities and proposed solutions, but large-scale interventions are few and far between. This paper therefore starts filling that gap in the literature.

3: The design of TWA is well-founded.
As the paper argues: "The design of our system was informed by previous empirical, theoretical, and systems work and our system performed well according to the types of survey self-report measures used to evaluate the usability of many social computing systems." (revised version, page 12) There are strong reasons to believe that this intervention _should_ work, partly due to the positive responses in the survey, as well as previous research on gamification. In other words, the fact that it didn't work is arguably a noteworthy result in and of itself.

4: Figuring out why these types of interventions fail and/or what types of interventions succeed is probably on the order of a lifetime's worth of research work.
As the authors argue in the revision notes, documenting these failures is important. We don't need to document _all_ of them, but given that this one describes an intervention with a reasonably solid foundation in previous work that indicated it was likely to succeed, it should be a sufficiently interesting example that the community will benefit from having documented.

5: Connected with the last sentence in point #4 (it's a sufficiently interesting example), this paper's negative result can initiate a discussion around and motivate future research in this space, in order to uncover what factors lead to a successful intervention as well as to further document failures.

Those things being said, there's still a bit of room for improvement in this paper. Here are some final notes and suggestions:

Introduction:
In the introduction, the phrase "Social computing systems that aggregate voluntary contributions" is used, while in the conclusion the phrase "peer production" is used. Consistent terminology usage is useful.
Awkward phrasing: "(a limited resources)"
Awkward phrasing: "…how new users perceive to the system’s design and tone."

Background:
In the section "Why Gamify Becoming a Wikipedian?" a reference to Kriplean et al.'s barnstar work is made: "…badge-like social awards which confer external recognition of their achievements [31]." Something that wasn't mentioned in the previous review is that Kriplean et al. found that barnstars take a rather long time to be awarded (see their footnote 2, page 3): median edits is around 1,200 and median tenure is around a year. This suggests that barnstars are a rather slow process, creating somewhat of a broken feedback loop. TWA's faster achievements, and perhaps other solutions such as "WikiLove" and "thanks", can be seen as important improvements since they close the loop much faster. The lack of (positive) feedback on wikiwork is maybe one of the reasons newcomers don't stick around. (While that's not necessarily a suggestion for changes to this paper, it's perhaps something worth keeping in mind for future work.)
In the same section, a reference to GettingStarted is made. This reference points out some of GS' features, which seem to be very similar to some of what TWA does. The claims made in the introduction about TWA's novelty in creating a structured experience for newcomers are therefore maybe a bit strong?

System Design: The Wikipedia Adventure:
In the section "Game-like elements", subsection "Missions", a reference to setting up a user page is made. It might be useful to explain to the reader why creating a user page is important (e.g. that non-existent user/user talk pages are a flag to patrollers, or that it can signal stronger commitment to the community).

Study 1: User Survey:
The study and the results in Figure 5 are referred to in the text as measuring “user confidence” and “user engagement”. The two rightmost questions don't fit into that; they instead measure the participants' perception of whether TWA would be useful for other newcomers in the community. In addition to lacking a reference in the text, an issue with this evaluation is that it only asks one specific group to evaluate the system. There's no survey of experienced contributors, for instance whether they perceive TWA participants to be "better" contributors. Lastly, there is also the issue of whether a newcomer to the Wikipedia community is able to properly consider whether TWA is a good way to introduce newcomers, partly because they might not know what's missing.
In the results section the following claim is made: "These findings provide validation of our choice to gamify the tutorial." That conclusion doesn't appear to be supported. The survey questions don't appear to poke at whether the gamification elements of the tutorial were the reason for the positive responses. It instead appears that we have a general evaluation of the perceived utility of TWA.

Study 2: Field Experiment:
Table 2 in the results is still somewhat confusing. Using "Attrition" doesn't seem to work well either, since there appear to be 181 participants who either started or completed mission seven. Maybe it's just that there's no mission eight, so the categorization scheme becomes difficult? Either way, some way of clarifying what happened to these 181 participants would be helpful.
Figure 6 might benefit greatly from the Y axis being log-scale for the edit counts, given that the distributions are so skewed. Not sure there's much usefulness in a box plot if there's no box.

Discussion:
"The null findings in these models indicate that the people who played the game and went on to contribute extensively would have done so anyway." So are Wikipedians born, not made, then? :-P

Remaining Formatting and Reference Issues

Looks like the 15th page can be spared if the references are shortened, for instance by removing "ACM, New York, NY, USA" from all the ACM references, shortening the proceedings names, etc. Given that there isn't strictly a page limit, it's arguably not that necessary, but perhaps worth considering.

------------------------ Submission 516, Review 3 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users
Reviewer: AC-Reviewer
Expertise: 4 (Expert)
First Round Overall Recommendation: 3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation
This paper presents the results of a deployment of a gamification-based system designed to retain new editors in Wikipedia. It is a negative results paper: the authors claim that they have conclusive evidence that the system did not work (although I have suggested a few additional lines of inquiry below that might problematize this assertion). The committee will have to have a discussion about how to evaluate this paper, and likely negative results papers more generally.

First Round Review

This paper presents the results of a deployment of a gamification-based system designed to retain new editors in Wikipedia. It is a negative results paper: the authors claim that they have conclusive evidence that the system did not work (although I have suggested a few additional lines of inquiry below that might problematize this assertion).

The paper is very well-written and has some large positives. It also is a negative results paper, and the committee will have to decide how to handle this. In general, I’m strongly sympathetic to arguments to include more negative results papers in our proceedings, but I’m quite unclear on the details of how to do so (e.g. what defines a top-quality negative results paper?). I’m hopeful that this paper can instigate a broader discussion on this topic at the PC meeting. All of that said, this paper also has a number of idiosyncratic limitations that make it perhaps not the best trial balloon for negative results papers. Below, I outline what I believe to be the paper’s positives and then describe these limitations in more detail, phrased as both critiques and questions. Overall, my recommendation is to invite the authors to revise and resubmit. If this occurs, I’ll want to see the below critiques addressed and the below questions answered (both through direct answers in the response to reviewers and through clarifications and changes to the paper). I’m hopeful that, through the R&R process, this paper can become an ideal negative results trial balloon.

Important positives:
* The authors built a system to solve a real-life problem and did a real-life, relatively large-scale deployment. Awesome!
* The paper is easily in the top 5% in terms of writing quality. This is true both at the sentence level and at the narrative level. As a person who has to review lots of papers, this was a breath of fresh air.
* The design of the game is quite well-thought-out, save a few relatively arbitrary decisions. I was particularly compelled by the use of gamification techniques that are also present in “real Wikipedia” (e.g. barnstar-like rewards).

Critiques:

CRITIQUE #1 – Excessive import placed on trivial self-report data: It is well-known that self-report data from participants is inferior to observations of actual behavior, and that self-report data can be quite unreliable more generally. As such, in my view, it is not a contribution to show that self-report data didn’t end up panning out in the behavioral results. In the next draft of this paper, I would like to see the authors address this issue. This might mean framing this paper as a full-on negative results paper, but lighter weight adaptations might be possible.

Open questions:

QUESTION #1: As noted above, this paper is a negative results paper at its core, and we’ll have to have a broad discussion about this at the PC meeting, assuming the paper makes it this far. In the event that this occurs, can the authors provide a more robust argument as to why these negative results are important for other researchers and practitioners?
The paper attempts to argue that one contribution that comes out of its negative results is to distrust self-report data, but this is well-known (see below). The other negative results argument in the paper is that these results add to growing evidence of long-term gamification failures. I find this argument much more compelling. In other words, by expanding on this argument, the authors may be able to address this question. That said, regardless of how this question is addressed in the second draft, I’d like to see it done both through changes to the paper and through discussion in the response to reviewers.

QUESTION #2 – Is there a possibility that the statistical framework employed is not appropriate for this particular study? The authors utilize a two-level statistical approach that I haven’t seen before in the CSCW/CHI literature. I enjoyed thinking about this approach, and the authors did a relatively good job explaining it. That said, I’m currently not convinced that it was the appropriate framework for this study. Here’s my reasoning: (1) The goal here is to introduce a treatment that ultimately will produce strong new members of the Wikipedia community at a higher rate than the control. (2) Let’s say the game produces 3 such members out of 100 new editors and the control produces 1, which looks like it might be the case. Let’s also say that this pattern additionally persists over a large n. (3) If this is true, why do we care about the potentially moderating effect of the invitations? The authors argue that new editors who responded to the invitation to play the game might just be new editors who are engaged and, critically, would have been power editors whether or not the game existed. However, barring a random fluke, shouldn’t these future power editors also have been in the control group? If I’m right here, I’m thinking the invitation doesn’t matter and a more traditional statistical analysis (or at least one targeted at identifying rare events) is appropriate. I could be wrong, but I want the authors to respond to this question, both through feedback to reviewers and clarifications in the paper. As an important side note, if we agree that this framework is the right way to go in the end, the authors should puff their chests more about this by claiming it as a contribution (assuming it hasn’t been used at CSCW before).

QUESTION #3 – Are the outcome variables considered here the best outcome variables? Are some critical variables missing? The authors seem focused on the average effects across the entire control and treatment groups (the two treatment groups, to be specific). However, would it not also be reasonable to consider the metric I describe above: the % of new editors that go on to be power editors? Since power editors end up contributing most of the edits anyway *over the long term*, to me this seems like the way to go (i.e. if this group of editors were followed for years, statistically significant differences would begin to emerge). If the authors agree, the authors need to reanalyze their data with this metric in mind. Another related outcome variable that might be useful to analyze is how long the new editors in each group remained active editors in the community (i.e. survival analysis). Because the data is quite old, this should be an easy new analysis to run, and longevity has been a variable of interest in a number of peer production studies. A rough sketch of the kind of analyses I have in mind follows.
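To be concrete (and purely as illustration - this is a minimal sketch using hypothetical column names and a hypothetical per-account table, not the authors' actual data or pipeline), the two analyses could look something like the following in Python, using scipy and the lifelines package:

    import pandas as pd
    from scipy.stats import fisher_exact
    from lifelines import KaplanMeierFitter

    # Hypothetical per-account table: one row per invited account.
    #   condition:    "treatment" (invited to TWA) or "control"
    #   edits_window: edits made in a fixed observation window after inclusion
    #   days_active:  days between the account's first and last observed edit
    #   still_active: 1 if the account edited in the final month of observation
    users = pd.read_csv("twa_accounts.csv")

    # (a) Proportion of accounts that become power editors
    #     (here, arbitrarily, >= 100 edits in the window).
    users["power_editor"] = users["edits_window"] >= 100
    table = pd.crosstab(users["condition"], users["power_editor"])
    odds_ratio, p = fisher_exact(table)  # better suited to rare outcomes than a test of means
    print(table)
    print(f"Fisher exact test: OR = {odds_ratio:.2f}, p = {p:.3f}")

    # (b) Kaplan-Meier survival curves of editing longevity, one per condition;
    #     accounts still active at the end of observation are treated as censored.
    kmf = KaplanMeierFitter()
    for condition, group in users.groupby("condition"):
        kmf.fit(group["days_active"],
                event_observed=1 - group["still_active"],
                label=condition)
        kmf.plot_survival_function()

Neither of these is meant as a prescription; they are simply the cheapest versions of the power-editor and longevity analyses described above.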
In their second draft and the feedback to reviewers, I would like to see the authors discuss either new analyses related to power users or why they did not consider this outcome variable. I would also like to see the same for survival analysis.

QUESTION #4: Is there a path towards positive results? As noted above, I believe some discussion around this paper and negative results papers more generally will have to happen at the PC meeting. However, I think there are some missed opportunities here for positive results and that the authors were too quick to settle for negative results. This is likely an important factor to consider when deciding whether to accept a negative results paper. Most notably, there are several well-motivated, unexplored avenues that could lead to positive results that would have a much larger impact than the negative results presented here:
* As noted above, examining additional outcome variables is important, most notably # of power editors and longevity.
* Does the game work if folks are forced to play it prior to editing Wikipedia, as would be the case in most other institutionalized socialization contexts? This is not just a hypothetical: this game could be used in all Wikipedia Education Project classes and related endeavors.

Author Response

Some of my comments were addressed.

Final Rating of Revision: 3 (Borderline)

The Review of Revision

After reviewing the change log and the new draft, I remain on the fence about this paper. Below, I outline what I believe to be the key discussion points about this paper in preparation for a likely conversation with my fellow reviewers and at the PC meeting. First, though, I outline some important positives to keep in mind as we have this discussion:

POSITIVES
• This paper is a canonical systems paper, and one that has a strong evaluation. The effort involved in putting together this paper is probably 2-3x that of the average quant/qual paper.
• I think the implications of these findings for gamification research are very interesting, especially because they replicate and extend what has been found in prior meta-analyses. (This receives too little attention in the paper, though.)
• The paper is very well written.
• The change log is by far the most detailed of any I have encountered thus far this year, although I think the authors were a little stuck in their ways in terms of actually making changes.

Discussion point #1: What makes a good negative results paper? Does this paper meet these criteria?

The revision of this paper doubles down on the “negative results as contribution” message. This means that we as a committee have to define the conditions for a high-quality negative results paper, and do so in a way that won’t lead to moral hazards down the road. The authors did not do a good job in their change log arguing why this paper is a good negative results paper, instead making standard arguments about the importance of negative results (without much recognition of the challenges associated with evaluating them). Most if not all reviewers should have already been aware of the argumentation in the change log.

As far as I can tell, this paper implicitly and explicitly states that the following is required for a good negative results paper:
(1) A sample size that gives us relative confidence that moderate growth of the experiment won’t lead to important effects in the end (e.g. we might see significant results, but not ones of a meaningful size).
(2) A discussion section that helps to interpret the negative results so that this paper can lead to some generalizable findings.
(3) The usual array of well-executed methods, excellent communication of results, etc.

Upon significant reflection, I tend to agree that these criteria do help to turn a negative result into something that can be useful outside of the specific experiment (a pre-condition to the acceptance of any paper, in my view). However, I think we need to reflect on this more as a community. Critically, we also need to decide if this paper meets the above criteria, which is the subject of the next two sections of this review. Overall, I think #1 and #3 are spot on with this paper, but #2 is weaker.

METHODS

I don’t think the authors understood my concerns about their statistical approach. I do not take any issue with the two-level design. I take issue with the interpretation. The authors themselves argue that the invitation is an ecologically valid way to test the system, and to me, that means that – at least to some degree – the invitation is *part of the system*. After all, it would be necessary for its real-world deployment. This makes the first level of the results quite important, although I agree that the second level contributes to important understanding as well. The original paper was written as a social science paper would be and tried to control away the invitation. My point is that the invitation can in many ways be considered the first interaction with the system, a point with which I think the authors agree. This is one way in which this study differs from how this technique is often employed in the social sciences.

The good/bad news is that it looks like, with the new results, the point is mostly moot: regardless of whether you consider the invitation as part of the system or not, there is no effect (and if there were an effect, it would be that the system made things worse; note: this is assuming I’m understanding things correctly, as the authors use the terms ‘control’ and ‘treatment’ without specifying whether they refer to the first or second level). The only way in which this point is still an important one is that the interpretation is somewhat strained in this new draft with regard to this issue. I would confront it head-on in any future drafts. This would involve presenting two interpretations of the system: one that includes the invitation (which is ecologically valid in terms of how it would actually be deployed) [level one] and one that does not [level two].

INTERPRETATIONS OF RESULTS

I think the paper falls a bit short in interpreting what the results mean (criterion #2 for a good negative results paper, as per above). The key takeaway seems to be: designing gamified systems to support newcomers in Wikipedia is hard. I’m not sure that’s good enough. The implications for the gamification literature continue to fascinate me (this is a big plus for the paper in my book), but the authors write in their change log that this is not the focus of the paper, and this is reflected in the new draft. My thinking is that if some of the most interesting implications of this paper are in this space, why not make it the focus of the paper? I realize it’s not in the WP domain, but WP has a great deal of value as a test bed for social computing generally, not just for studying WP.

Remaining Formatting and Reference Issues

Report completed