From:
Date: Tue, Sep 6, 2016 at 9:26 PM
Subject: CSCW 2017 notification - #516
To: snehanarayan@gmail.com
Cc: papers2017@cscw.acm.org

Dear Sneha Narayan -

We are pleased to inform you that your paper:

516 - The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

has been accepted to CSCW 2017. Congratulations! This year we received 530 submissions, of which 183 have been accepted for presentation at the conference. We are writing to provide your second round reviews, and to give you important information related to submitting your camera-ready paper and presenting it at the conference.

First, your reviews are provided below. Please read these carefully and make sure your final submission of the camera-ready paper is as good as possible. In many cases reviewers have suggestions or requests that will improve your paper.

Your next step is to prepare your camera-ready paper, which must be submitted into the PCS system by October 31, 2016. You will also be contacted by Sheridan Publishing, or directly by us, with specific information about producing an appropriate PDF, choosing among ACM copyright and license options, etc. Please pay special attention to the citation format used by CSCW (e.g., author’s first name spelled out first, but sorted by family name). All papers must be submitted in camera-ready form to be included in the conference program.

Please note that an author of each paper must register for the conference and attend it to present the paper. Papers without a registered presenter will be removed from the proceedings. Registration will open in the Fall, with the greatest discounts available until the early registration deadline of January 11, 2017. Please be sure that at least one author registers by that date. Also, please let us know if the presenting author is someone other than the contact author for this paper so we can appropriately reach that person with any needed information. Finally, if you are coming from a country where a visa is required to visit the US, please be sure to start the process of getting that visa early.

Soon after October 31st we will post a presentation schedule on the CSCW 2017 website so you can plan for your presentation time. All papers will be presented in slots of just over 20 minutes, so you should plan on a talk of 15-17 minutes with 3-5 minutes for questions.

Finally, if your work involves an innovative system that would be appropriate to demonstrate, we'd like to encourage you to submit a demonstration to CSCW 2017 as well (deadline: November 4). Details at https://cscw.acm.org/2017/submit/demos.php.

Again, congratulations! Thank you for submitting your work to CSCW 2017 and we look forward to seeing you in Portland!

Louise Barkhuus
Marcos Borges
Wendy Kellogg
CSCW 2017 Papers Chairs

------------------------ Submission 516, Review 4 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users
Reviewer: AC
Expertise: 2 (Passing Knowledge)
First Round Overall Recommendation: 3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

The paper presents the design and evaluation of a gamified tool for socializing and retaining new Wikipedia editors. Contribution criteria include (1) a description and rationale for the system; (2) system novelty and rationale for how it leads to learning; and (3) a methodologically sound evaluation.
First Round Review from AC (if needed)

Coordinator's First-Round Report to Authors

The paper presents the design and evaluation of a gamified tool for socializing and retaining new Wikipedia editors. The study found that users liked—but did not learn from—the system. The focus on improving the experience of newcomers in Wikipedia is relevant and important. Reviewers describe the study as well motivated and exceptionally well-written. Read R3’s comments on the writing quality and congratulate yourself!

The reviewers, however, have many concerns about the paper—each focusing on a different aspect of the work. The concerns the reviewers note /may/ be addressable during the revise and resubmit period, but it will be an exceptionally herculean effort. Also, please keep in mind that there is no guarantee of acceptance even after making changes. So, it is at the authors’ discretion whether to proceed with revisions or withdraw the paper.

There is a split amongst the reviewers as to whether the failure of the tool is interesting or not. R1 raises concerns that the failure of the tool could be predicted from existing literature, suggesting little rationale for doing the work in the first place. R2 asks whether there is something fundamentally different about people who continue to contribute to Wikipedia, and as such whether the system holds value in practice. R3, on the other hand, sees much value in the work's systems contribution as well as the real-world evaluation. R3's review has some suggestions of alternative framings that may make the contribution more valuable.

In the treatment of related work, many improvements are needed. R1 notes that the discussion of the well-known concept of legitimacy/authenticity in learning environments is missing. R2 also points to missing literature about Wikipedian experience.

R2 and R3 raise a number of methodological questions about the paper. R2 suggests the distribution of participants across the timeline may bias the results. R3, on the other hand, sees opportunity here, suggesting additional statistical analysis related to longevity and power users. Both R2 and R3 question the methodological choice and contribution of measuring perceptions of learning rather than actual learning. Overall, this points to a need for at the very least justifying the methodological choices and at the most carrying out additional statistical analyses.

In summary, there is quite a bit of work to be done. I wish the authors the best of luck, should they choose to continue in the review process.

Requested Revisions

REQUIRED:
- Provide justification for why the study was worth carrying out, in response to R1 and R2’s concerns. R3’s review may have some insight into alternative framings.
- State the research questions more explicitly, as per R2’s recommendation.
- Address R1 and R2’s concerns about missing literature.
- Ensure that the narrative around Wikipedia is clear to readers who do not have an in-depth background in production/editing details.
- Improve the clarity of the results by using percentages or another baseline that allows comparison between numbers, as per R2’s review.
- Provide justification for measuring perceptions of learning versus actual learning.
- Provide a robust discussion of why the results are meaningful for researchers and/or practitioners.

OPTIONAL, RECOMMENDED:
- Consider carrying out additional statistical analyses as recommended by R3.
- Provide a short justification for use of English language Wikipedia, as per R2’s review.
Formatting and Reference Issues

Author Response

Most or all of my comments were addressed.

Final Rating of Revision: 4 (Probably Accept)

The Review of Revision

The authors addressed the majority of reviewer concerns during the revision period. The explanation of study purpose and specific research questions is much clearer. The authors filled the gaps in the related work. Wikipedia-specific jargon was reduced. The data were reanalyzed to address reviewer concerns about temporal aspects of the data.

Where the paper still struggles a bit is in clearly articulating its contribution. As such, the reviewers are largely positive, but also have significant reservations about the work. In what is otherwise a solid research study, the authors have struggled with clearly articulating how the work contributes to - and challenges - existing research knowledge. Ultimately, I think this is an issue of the paper just needing a bit more tweaking to the narrative. The reviewers all note ways in which the paper does provide a counterpoint to existing research. In particular, R1 notes that the findings of the paper are different than what one would expect given Halfaker's work on newcomer enculturation. This previous work indicates that interventions to boost confidence should have had a positive effect. But in this study, it didn't happen. So although the authors may not have called out this issue as explicitly as they could, it does appear that the work has findings that contrast with prior research on Wikipedia newcomer enculturation. In addition, there are (perhaps understated) implications for gamification research, which R3 describes as "fascinating."

To sum up, the paper is right on the borderline for acceptance at a top-tier conference, and I would lean toward accepting it. It's not a perfect paper, but I am very encouraged by the amount of thought and conversation this work has raised amongst the reviewers. I'm also encouraged by how the work has cross-cutting implications across three research areas: learning, online peer production, and gamification. So there is potential for rather broad appeal.

Coordinator's Final Report to Authors (meta-review)

Congratulations on acceptance! The paper was discussed at the program committee meeting and was positively received. While no changes are mandatory, we do recommend taking a look at the second round reviews, which have some additional suggestions for improvement.

Remaining Formatting and Reference Issues

Report completed: Completed

------------------------ Submission 516, Review 1 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users
Expertise: 4 (Expert)
First Round Overall Recommendation: 3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

In this paper, the authors present the design and two-pronged evaluation of a tutorial for new Wikipedia editors that uses elements of gamification like missions and badges to help coach new editors and help them learn best practices and social norms of Wikipedia. The outcome is that users like the system but, based on behavioral measures, they don't actually learn from it. Learning interventions are a classic kind of research problem, and the paper should include robust measures of learning, as well as a good description of the designed intervention itself, why the design is expected to lead to learning, and a clear description of the study.
First Round Review

This is a reasonably well motivated study with connections to appropriate literature, and the writing is engaging and understandable. The problem of enculturating newcomers into projects like Wikipedia is well documented, and this paper investigates a potential intervention with an admirably well-planned study. Designing learning interventions is really difficult and I commend the authors on a well-executed effort.

Still, I am ambivalent about the paper because I would have predicted these outcomes based on the literature alone. In the discussion, the authors note that one mismatch between Wikipedia and the tutorial as designed involves the “gradual peripheral participation” of newcomers as they take on the identity of “Wikipedian.” They suggest that maybe speeding up this process is unnatural. I would argue that the most important concept from the literature on learning is missing from this discussion, and that’s “legitimacy” (also sometimes referred to in the education and learning literature as “authenticity”). The authors explain that by doing tasks in a pretend version of Wikipedia, they make it a safe space for newcomers to practice, yet performing “canned” tasks in a pretend system is the opposite of offering a legitimate form of participation. I immediately wonder, why not use what we know from the literature to create low-risk missions that newcomers can complete while legitimately contributing to the encyclopedia? Risk taking is a fundamental characteristic of games that makes them engaging; it certainly seems like it would play a role in people’s motivation in a scenario like this. Rather than eliminating risk, the literature on legitimate peripheral participation would suggest that finding the right degree of risk is required to facilitate progressive entree into a set of shared practices. I am disappointed by the missed opportunity here; the outcome mainly seems to verify that what the literature tells us shouldn’t work in fact doesn’t work.

Yet the paper isn’t bad, and the study is carefully crafted and reported. With some extension and reflection, I think the discussion could help point future research in a more fruitful direction. There are millions of pages written on the challenges of designing learning interventions that change people’s behavior, yet this paper ends on a painfully obvious note. It’s true that usability isn’t all it takes, but what can we learn from TWA about the design of systems to facilitate enculturation into a community of practice? What can we take away from this that might inform more successful tutorial systems in the future?

Author Response

Most or all of my comments were addressed.

Final Rating of Revision: 4 (Probably Accept)

The Review of Revision

The revised version of the paper has addressed many of the issues raised by me and other reviewers and, in some cases, even presents new analyses to address critiques. My initial response to the study was that, knowing all that I do about the literature on learning, I don't see why an intervention like TWA should be expected to work. The authors have addressed that problem by including citations to Guzdial and Tew that suggest inauthentic learning experiences can be effective sometimes. It's fine to point out exceptional cases - it doesn't change the fact that generally the literature would lead us to expect such interventions not to work. And it didn't.
The fact that the authors thought it would work and can rationalize it post-hoc isn't an argument that the results are surprising. What might have been more compelling was the observation that confidence was bolstered and therefore should lead noobs to overcome the problems outlined in Halfaker's "don't bite the newbies" paper - yet it didn't. The findings seem to suggest that Wikipedians are, indeed, born and not made. Either that or this simply isn't a good way of enculturating newcomers. Maybe people didn't find the tasks to be authentic in the context of an imagined community? Maybe people still think Wikipedians are mean and scary even if their confidence has been bolstered? Maybe gamification isn't a great way to engage would-be encyclopedia writers.

I have raised my score to a 4 because the authors have done a good job in writing about a negative result, but I clearly also have some serious reservations. The findings suggest that Wikipedia needs to work on welcoming people, but we already knew that from the "don't bite the newbies" paper. In the end, I feel that although the paper presents a negative finding well, the many alternative explanations don't provide a satisfying narrative that leaves a reader with answers or theoretical insight. The phenomenon that learners report they *like* or enjoy a learning intervention that has no impact on their behavior or learning is, unfortunately, all too familiar. (It's almost remarkable that the intervention didn't even elicit a short-term Hawthorne Effect.)

Remaining Formatting and Reference Issues

------------------------ Submission 516, Review 2 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users
Expertise: 4 (Expert)
First Round Overall Recommendation: 2 (Probably NOT acceptable)

Contribution and Criteria for Evaluation

This paper's contribution is the design and evaluation of a structured introduction to a peer production community (English Wikipedia) called "The Wikipedia Adventure". TWA's design is rooted in theories of gamification, and its utility is evaluated through a user survey and an invitation-based field experiment. The paper reports on the survey respondents' satisfaction with TWA, and how their experiment results reveal some of the challenges of effecting lasting changes to contributor patterns in peer production communities. These findings are then discussed in relation to cultural factors in Wikipedia, issues of self-selection and voluntary participation, and the limitations of gamification.

When evaluating a paper that describes the design of a system, the two main criteria are that the system and/or its development setting is/are novel, and that the way the system is evaluated is methodologically sound.

First Round Review

As mentioned in the contribution section, this paper's contribution is the design and evaluation of a structured introduction to a peer production community based on gamification, called "The Wikipedia Adventure". This is a great idea and sounds like a useful addition to Wikipedia. The paper is written in a way that makes it easy to read, and provides the reader with a good introduction to how TWA's design is rooted in theories of gamification, thus applying these principles in what appears to be a novel setting. The paper also does a good job of discussing the findings, organizing them in a way that is easy to follow and touching on important points (e.g. cultural factors, and the limitations of self-selection and gamification).
The overall ideas and approach taken in this paper are sound, and they are in line with the criteria described previously. Unfortunately, there are two major issues and several minor ones that need to be resolved before this paper is ready for publication. The first major issue is that the methodology used to evaluate performance in the invitation-based experiment measures contribution in a skewed manner and does not establish why that is appropriate. Secondly, the paper fails to consider arguments put forth by Panciera et al.'s "Wikipedians are Born, Not Made" paper. This review will expand on both of these major issues below. Further below will be notes and comments with suggestions for improvement for specific sections of the paper, some of which are rather substantial as well.

1: Evaluating TWA effectiveness by number of contributions
----------------------------------------------------------
A major part of the paper is the evaluation of TWA's effect on subsequent contributions. To evaluate this an invitation-based field experiment is used, and the paper does a great job of justifying why that is appropriate in this setting. The experiment runs from February 2014 and continues for three months. Exact dates are not given, so let us assume that it ran until the end of April 2014. User contributions are then measured until the end of May 2014. There are two problems with this approach that the paper fails to address properly.

One is the issue of right-truncation found in the data. Contributors who joined in early February 2014 would have about four months to make edits, whereas those who joined in late April would only have about a month. The model does contain a control variable for number of days in the experiment, but why is that appropriate in this context? If we examine other work in the same domain, they tend to either use a much longer time period (e.g. the Teahouse paper, citation 23, which uses 6-9 months) or ensure that the time period is fixed (e.g. Kittur et al. "Herding the Cats: The Influence of Groups in Coordinating Peer Production", WikiSym 2009; or Zhu et al. "Effectiveness of Shared Leadership in Online Communities", CSCW 2012).

Related to the right-truncation problem is the fact that the paper also fails to discuss and justify what a reasonable timespan for measuring the effect of TWA is, and that this choice will have an effect on the number of contributions measured. It might for instance be that TWA instead has an effect on how long it takes before a user drops out of the system. If we assume that TWA has an effect on contributions, what timespan is needed to measure that effect? The paper assumes that a month is adequate to discover it, whereas one might suspect that it is only measurable over a longer period of time. If it is the case that a short period of time is appropriate (for instance because these users are likely to drop out after a certain amount of time) the paper needs to properly establish that, either by measuring it or referring to previous work.

2: Wikipedians Are Born, Not Made
---------------------------------
In their GROUP 2009 paper "Wikipedians Are Born, Not Made: A Study of Power Editors on Wikipedia", Panciera et al. show data that argues that those contributors who are going to stick around behave in a way that is different from the very beginning. In followup work published in 2010 they find similar differences in another peer production community (Panciera et al. "Lurking? cyclopaths?: a quantitative lifecycle analysis of user behavior in a geowiki." CHI 2010).
These two papers and the argument they put forth are relevant because they question who TWA is designed for. In the related work a reference to Bryant et al.'s "Becoming Wikipedian" is made, thereby suggesting that TWA is designed to teach someone how to be a Wikipedian. As Panciera et al.'s paper argues along the lines of these contributors already being Wikipedians, should TWA be designed to instead help these contributors stay productive? If Wikipedians are born, not made, then one could also question whether these contributors are at all going to use TWA. Maybe they ignore TWA because they are already productive and do not need it? Since the paper never makes any references to these papers and discusses issues related to this (e.g. "is the Teahouse more effective since it allows them to get answers when they need help?"), this whole topic area is left hanging.

---

Below follow comments/notes for each section of the paper.

Introduction:
* An overall issue here is that there are few citations to sources. For instance a claim is made that "newly created accounts are the primary source of spam and vandalism on Wikipedia". Consider adding a "[citation needed]" after that.
* When citing multiple papers it is preferable that they are in order, e.g. "[14, 23, 17]" should be "[14, 17, 23]" (page 1). This minor issue also occurs elsewhere in the paper.
* "Unlike prior systems, TWA creates a structured experience that guides newcomers through critical pieces of Wikipedia knowledge..." Do we know that there are no other prior systems that offer a similar experience? It might be that there are none within the Wikipedia domain, but what about outside it? That sentence is making a rather bold claim.
* After reading the introduction, what is the reader expected to remember as the main findings in this paper? At the end of the introduction the following sentence is found: "The study underscores the importance of conducting multiple types of evaluations of social systems." Is that the main contribution? What about the implications for gamified structured introductions to peer production?

Background:
* "...women reported that they found that contributing to Wikipedia involved a high level of conflict and that they lacked confidence in their expertise [8]. This suggests that more effective onboarding tools could help incorporate newcomers." This is an important side of Wikipedia, but how does TWA's design help mitigate this issue? Are there design elements in TWA that aim to boost confidence in one's expertise?
* At the end of the introduction we find the following two questions: "Would a gamified tutorial produce a positive, enjoyable educational experience for new Wikipedians? Would playing the tutorial impact newcomer participation patterns?" These are the paper's _research questions_! It would be very helpful to the reader if they were displayed more clearly, e.g. as separate items. They should not be hidden.

System Design:
* "...it does not depend on the availability, helpfulness, or intervention of existing Wikipedia editors..." The underlying argument here is that scalability is preferable to personal interaction when socializing newcomers (in peer production communities). Why is that the better solution? As discussed previously, TWA might be designed for contributors who are not going to stick around; why are those the right audience for it? Is the goal to provide _everyone_ with a scalable impersonal introduction, or is it better to provide _some_ (typically based on self-selection) with a personal introduction (e.g. the Teahouse)?
Game-like elements (subsection of System Design):
* In "Missions" a distinction is made between "basic" and "advanced" editing techniques. It appears to be somewhat arbitrary; why is adding sources advanced editing, but watchlists are not?
* Your readers might not know what watchlists are; take care to write for a general audience, as not everyone knows a lot about how Wikipedia works behind the scenes.

Study 1: User Survey:
* This paper doesn't discuss any other language editions of Wikipedia besides the English one, and makes the assumption that "Wikipedia" equals the English edition. Adding a mention that Wikipedia exists in multiple languages and explaining why English was chosen as the language where TWA was launched would be very helpful.
* The paper aims to measure "educational effectiveness". Why is a survey the appropriate way to measure that? Based on the description of the survey, it seems that it never asks specific questions to test whether TWA's users learned specific things, in other words whether the education was successful. Later when describing the results the phrase "learning to edit Wikipedia" is used; isn't that the _key_ learning goal of TWA? Yet the survey asks Likert-scale questions. In other words, you're measuring whether TWA users are under the impression that they learned something, not whether they actually did.
* Figure 4 uses counts. While it shows that none of the questions had responses from all participants, it makes comparisons between questions with different response rates very difficult. Using percentages would allow for direct comparisons, and would make the references to the figure in the text easier to follow along with. The text refers to four questions with a certain percentage of responses, but leaves the math to the reader.
* The survey leaves many questions unanswered, some of which the paper might want to address. Were any negative questions asked? Were there any control questions, such as a similar question worded slightly differently to allow for comparison between responses? As it is, this survey comes across as a set of positive statements about TWA that respondents agreed to. Given that respondents self-select and no attempts to contact users who didn't go through TWA appear to have been made, it is likely there is a bias in the responses, and that bias should be discussed.

Study 2: Field Experiment:
* The description of how accounts were selected to be included is rather confusing. First it describes 1,967 accounts that met the same criteria as for the user survey; however, 10,000 individuals ("accounts"?) were invited to the beta. Why is one an order of magnitude larger than the other? Then in the second paragraph of "Methods" it describes the selection criteria, that at least one contribution would have to be made after getting invited. This would perhaps be much less confusing if the criteria were first explained, particularly how the experiment and control groups were set up, and then how many accounts were identified.
* "This is a larger proportion of users than took up the invitation in Study 1, which may be due to changes in the invitation text." Earlier in the paper, Study 1 refers to a "beta", whereas this appears not to be one. If this is the case, this is an important difference between the two that should be made clear to the reader.
* "we measure the overall contributions as the total number of edits made by each account from the time of inclusion in the study until May 31, 2014." When exactly is "time of inclusion", is that when they got the invite? What about when they completed one (or all) TWA mission(s)? The concern here is that all contributions are measured, whereas the experiment sets up a pre/post-scenario. Later on the paper refers to "subsequent contributions", indicating that contributions after a certain point in time was measured. This quickly becomes rather confusing, spelling out clearly what points in a user's account history is used (e.g. "we measure contributions at four points in time: when the user registered their account, the time of invitation, when they first started using TWA, and the end of the experiment") would be very helpful. * Why is a six-edit radius chosen when measuring word persistence? Halfaker et al. make no claim about what the radius should be in the referenced work, and Ekstrand et al suggest a 15 edit radius in a related paper (Ekstrand and Riedl "rv you're dumb: identifying discarded work in Wiki article history." WikiSym 2009) The six-edit radius also comes with an issue that is unadressed: how long does it take for an edit made by a contributor in the study to reach that six-edit radius? If it hasn't been reached at the end of the study period, that edit has to be discarded as its quality is unknown. In a related paper, Farzan and Kraut instead chose to use percentage of words that survived as a measure of quality (Farzan and Kraut "Wikipedia classroom experiment: bidirectional benefits of students' engagement in online production communities" CHI 2013) * Tables 1, 2, 3, and 4, as well as figure 6 should be brought closer together so it's easier to follow along. Table 1 occurs before the text that refers to it, and table 4 is two pages further along. Putting all tables and figure 6 on the same page might be a good solution. * Table 3 refers to users "reached" a mission. It is confusing how 181 users reached the final mission but did not complete it, yet in the text it seems these 181 users actually did. * The post-hoc power analysis is very useful! Discussion: * "The new editors in our study may have had unpleasant experiences during their initial time on Wikipedia..." It appears that the survey asked no questions about this, yet is it not a very important issue related to TWA's success? * In "Limitations of gamification" the following sentence is found: "...our study is among the first that compares levels of participation in a task among individuals who were introduced to gamified learning first to those that were not." This is an _important_ finding, it shouldn't be hidden back here but instead be up front in the introduction! Author Response Most or all of my comments were addressed. Final Rating of Revision 5 (Definitely Accept) The Review of Revision First of all, this reviewer would like to congratulate the authors on the herculean effort that's been put into improving this paper and the quality that has resulted from it. The attached revision document is also of high quality, carefully considering the comments from the reviewers and arguing well for why some of our suggestions were not implemented. After carefully reading the revised version, my final recommendation for this paper changes to a 'Definitely Accept'. There are several reasons for why this paper ought to be included in the conference proceedings: 1: It is a well-written paper. 
1: It is a well-written paper.
This was implied in some of my previous comments, as well as R3's applause. The revised version is no different from the initial one in this respect: the changes keep with the clear writing style, and the content changes have further improved the paper.

2: The literature on newcomer interventions in peer production communities (such as Wikipedia) is sparse.
Much work has studied what happens to newcomers in these communities and proposed solutions, but large-scale interventions are few and far between. This paper therefore starts filling that gap in the literature.

3: The design of TWA is well-founded.
As the paper argues: "The design of our system was informed by previous empirical, theoretical, and systems work and our system performed well according to the types of survey self-report measures used to evaluate the usability of many social computing systems." (revised version, page 12) There are strong reasons to believe that this intervention _should_ work, partly due to the positive responses in the survey, as well as previous research on gamification. In other words, the fact that it didn't work is arguably a noteworthy result in and of itself.

4: Figuring out why these types of interventions fail and/or what types of interventions succeed is probably on the order of a lifetime's worth of research work.
As the authors argue in the revision notes, documenting these failures is important. We don't need to document _all_ of them, but given that this one describes an intervention with a reasonably solid foundation in previous work that indicated it was likely to succeed, it should be a sufficiently interesting example that the community will benefit from having documented.

5: Connected with the last sentence in point #4 (it's a sufficiently interesting example), this paper's negative result can initiate a discussion around and motivate future research in this space, in order to uncover what factors lead to a successful intervention as well as to further document failures.

Those things being said, there's still a bit of room for improvement in this paper. Here are some final notes and suggestions:

Introduction:
In the introduction, the phrase "Social computing systems that aggregate voluntary contributions" is used, while in the conclusion the phrase "peer production" is used. Consistent terminology usage is useful.
Awkward phrasing: "(a limited resources)"
Awkward phrasing: "…how new users perceive to the system’s design and tone."

Background:
In the section "Why Gamify Becoming a Wikipedian?" a reference to Kriplean et al.'s barnstar work is made: "…badge-like social awards which confer external recognition of their achievements [31]." Something that wasn't mentioned in the previous review is that Kriplean et al. found that barnstars take a rather long time to be awarded (see their footnote 2, page 3): median edits is around 1,200 and median tenure is around a year. This suggests that barnstars are a rather slow process, creating somewhat of a broken feedback loop. TWA's faster achievements, and perhaps other solutions such as "WikiLove" and "thanks", can be seen as important improvements since they close the loop much faster. The lack of (positive) feedback on wikiwork is maybe one of the reasons newcomers don't stick around. (While that's not necessarily a suggestion for changes to this paper, it's perhaps something worth keeping in mind for future work.)
In the same section, a reference to GettingStarted is made. This reference points out some of GS' features, which seem to be very similar to some of what TWA does. The claims made in the introduction about TWA's novelty in creating a structured experience for newcomers are therefore maybe a bit strong?

System Design: The Wikipedia Adventure:
In the section "Game-like elements", subsection "Missions", a reference to setting up a user page is made. It might be useful to explain to the reader why creating a user page is important (e.g. that non-existent user/user talk pages are a flag to patrollers, or that it can signal stronger commitment to the community).

Study 1: User Survey:
The study and the results in Figure 5 are referred to in the text as measuring “user confidence” and “user engagement”. The two rightmost questions don't fit into that; they instead measure the participants' perception of whether TWA would be useful for other newcomers in the community. In addition to lacking a reference in the text, an issue with this evaluation is that it only asks one specific group to evaluate the system. There's no survey of experienced contributors, for instance whether they perceive TWA participants to be "better" contributors. Lastly, there is also the issue of whether a newcomer to the Wikipedia community is able to properly consider whether TWA is a good way to introduce newcomers, partly because they might not know what's missing.
In the results section the following claim is made: "These findings provide validation of our choice to gamify the tutorial." That conclusion doesn't appear to be supported. The survey questions don't appear to poke at whether the gamification elements of the tutorial were the reason for the positive responses. It instead appears that we have a general evaluation of the perceived utility of TWA.

Study 2: Field Experiment:
Table 2 in the results is still somewhat confusing. Using "Attrition" doesn't seem to work well either, since there appear to be 181 participants who either started or completed mission seven. Maybe it's just that there's no mission eight, so the categorization scheme becomes difficult? Either way, some way of clarifying what happened to these 181 participants would be helpful.
Figure 6 might benefit greatly from the Y axis being log-scale for the edit counts, given that the distributions are so skewed. Not sure there's much usefulness in a box plot if there's no box.

Discussion:
"The null findings in these models indicate that the people who played the game and went on to contribute extensively would have done so anyway." So are Wikipedians born, not made, then? :-P

Remaining Formatting and Reference Issues

Looks like the 15th page can be spared if the references are shortened, for instance by removing "ACM, New York, NY, USA" from all the ACM references, shortening the proceedings names, etc. Given that there isn't strictly a page limit, it's arguably not that necessary, but perhaps worth considering.

------------------------ Submission 516, Review 3 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users
Reviewer: AC-Reviewer
Expertise: 4 (Expert)
First Round Overall Recommendation: 3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation
This paper presents the results of a deployment of a gamification-based system designed to retain new editors in Wikipedia. It is a negative results paper: the authors claim that they have conclusive evidence that the system did not work (although I have suggested a few additional lines of inquiry below that might problematize this assertion). The committee will have to have a discussion about how to evaluate this paper, and likely negative results papers more generally.

First Round Review

This paper presents the results of a deployment of a gamification-based system designed to retain new editors in Wikipedia. It is a negative results paper: the authors claim that they have conclusive evidence that the system did not work (although I have suggested a few additional lines of inquiry below that might problematize this assertion).

The paper is very well-written and has some large positives. It also is a negative results paper, and the committee will have to decide how to handle this. In general, I’m strongly sympathetic to arguments to include more negative results papers in our proceedings, but I’m quite unclear on the details of how to do so (e.g. what defines a top-quality negative results paper?). I’m hopeful that this paper can instigate a broader discussion on this topic at the PC meeting. All of that said, this paper also has a number of idiosyncratic limitations that make it perhaps not the best trial balloon for negative results papers. Below, I outline what I believe to be the paper’s positives and then describe these limitations in more detail, phrased as both critiques and questions. Overall, my recommendation is to invite the authors to revise and resubmit. If this occurs, I’ll want to see the below critiques addressed and the below questions answered (both through direct answers in the response to reviewers and through clarifications and changes to the paper). I’m hopeful that, through the R&R process, this paper can become an ideal negative results trial balloon.

Important positives:
* The authors built a system to solve a real-life problem and did a real-life, relatively large-scale deployment. Awesome!
* The paper is easily in the top 5% in terms of writing quality. This is true both at the sentence level and at the narrative level. As a person who has to review lots of papers, this was a breath of fresh air.
* The design of the game is quite well-thought-out, save a few relatively arbitrary decisions. I was particularly compelled by the use of gamification techniques that are also present in “real Wikipedia” (e.g. barnstar-like rewards).

Critiques:

CRITIQUE #1 – Excessive import placed on trivial self-report data: It is well-known that self-report data from participants is inferior to observations of actual behavior, and that self-report data can be quite unreliable more generally. As such, in my view, it is not a contribution to show that self-report data didn’t end up panning out in the behavioral results. In the next draft of this paper, I would like to see the authors address this issue. This might mean framing this paper as a full-on negative results paper, but lighter weight adaptations might be possible.

Open questions:

QUESTION #1: As noted above, this paper is a negative results paper at its core, and we’ll have to have a broad discussion about this at the PC meeting, assuming the paper makes it this far. In the event that this occurs, can the authors provide a more robust argument as to why these negative results are important for other researchers and practitioners?
The paper attempts to argue that one contribution that comes out of its negative results is to distrust self-report data, but this is well-known (see below). The other negative results argument in the paper is that these results add to growing evidence of long-term gamification failures. I find this argument much more compelling. In other words, by expanding on this argument, the authors may be able to address this question. That said, regardless of how this question is addressed in the second draft, I’d like to see it done both through changes to the paper and through discussion in the response to reviewers.

QUESTION #2 – Is there a possibility that the statistical framework employed is not appropriate for this particular study? The authors utilize a two-level statistical approach that I haven’t seen before in the CSCW/CHI literature. I enjoyed thinking about this approach, and the authors did a relatively good job explaining it. That said, I’m currently not convinced that it was the appropriate framework for this study. Here’s my reasoning: (1) The goal here is to introduce a treatment that ultimately will produce strong new members of the Wikipedia community at a higher rate than the control. (2) Let’s say the game produces 3 such members out of 100 new editors and the control produces 1, which looks like it might be the case. Let’s also say that this pattern additionally persists over a large n. (3) If this is true, why do we care about the potentially moderating effect of the invitations? The authors argue that new editors who responded to the invitation to play the game might just be new editors who are engaged and, critically, would have been power editors whether or not the game existed. However, barring a random fluke, shouldn’t these future power editors also have been in the control group? If I’m right here, I’m thinking the invitation doesn’t matter and a more traditional statistical analysis (or at least one targeted at identifying rare events) is appropriate. I could be wrong, but I want the authors to respond to this question, both through feedback to reviewers and clarifications in the paper. As an important side note, if we agree that this framework is the right way to go in the end, the authors should puff their chests more about this by claiming it as a contribution (assuming it hasn’t been used at CSCW before).

QUESTION #3 – Are the outcome variables considered here the best outcome variables? Are some critical variables missing? The authors seem focused on the average effects across the entire control and treatment groups (the two treatment groups, to be specific). However, would it not also be reasonable to consider the metric I describe above: the % of new editors that go on to be power editors? Since power editors end up contributing most of the edits anyway *over the long term*, to me this seems like the way to go (i.e. if this group of editors were followed for years, statistically significant differences would begin to emerge). If the authors agree, the authors need to reanalyze their data with this metric in mind. Another related outcome variable that might be useful to analyze is how long the new editors in each group remained active editors in the community (i.e. survival analysis). Because the data is quite old, this should be an easy new analysis to run, and longevity has been a variable of interest in a number of peer production studies. A rough sketch of the kind of analyses I have in mind follows.
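To be concrete (and purely as illustration - this is a minimal sketch using hypothetical column names and a hypothetical per-account table, not the authors' actual data or pipeline), the two analyses could look something like the following in Python, using scipy and the lifelines package:

    import pandas as pd
    from scipy.stats import fisher_exact
    from lifelines import KaplanMeierFitter

    # Hypothetical per-account table: one row per invited account.
    #   condition:    "treatment" (invited to TWA) or "control"
    #   edits_window: edits made in a fixed observation window after inclusion
    #   days_active:  days between the account's first and last observed edit
    #   still_active: 1 if the account edited in the final month of observation
    users = pd.read_csv("twa_accounts.csv")

    # (a) Proportion of accounts that become power editors
    #     (here, arbitrarily, >= 100 edits in the window).
    users["power_editor"] = users["edits_window"] >= 100
    table = pd.crosstab(users["condition"], users["power_editor"])
    odds_ratio, p = fisher_exact(table)  # better suited to rare outcomes than a test of means
    print(table)
    print(f"Fisher exact test: OR = {odds_ratio:.2f}, p = {p:.3f}")

    # (b) Kaplan-Meier survival curves of editing longevity, one per condition;
    #     accounts still active at the end of observation are treated as censored.
    kmf = KaplanMeierFitter()
    for condition, group in users.groupby("condition"):
        kmf.fit(group["days_active"],
                event_observed=1 - group["still_active"],
                label=condition)
        kmf.plot_survival_function()

Neither of these is meant as a prescription; they are simply the cheapest versions of the power-editor and longevity analyses described above.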
In their second draft and the feedback to reviewers, I would like to see the authors discuss either new analyses related to power users or why they did not consider this outcome variable. I would also like to see the same for survival analysis.

QUESTION #4: Is there a path towards positive results? As noted above, I believe some discussion around this paper and negative results papers more generally will have to happen at the PC meeting. However, I think there are some missed opportunities here for positive results and that the authors were too quick to settle for negative results. This is likely an important factor to consider when deciding whether to accept a negative results paper. Most notably, there are several well-motivated, unexplored avenues that could lead to positive results that would have a much larger impact than the negative results presented here:
* As noted above, examining additional outcome variables is important, most notably # of power editors and longevity.
* Does the game work if folks are forced to play it prior to editing Wikipedia, as would be the case in most other institutionalized socialization contexts? This is not just a hypothetical: this game could be used in all Wikipedia Education Project classes and related endeavors.

Author Response

Some of my comments were addressed.

Final Rating of Revision: 3 (Borderline)

The Review of Revision

After reviewing the change log and the new draft, I remain on the fence about this paper. Below, I outline what I believe to be the key discussion points about this paper in preparation for a likely conversation with my fellow reviewers and at the PC meeting. First, though, I outline some important positives to keep in mind as we have this discussion:

POSITIVES
• This paper is a canonical systems paper, and one that has a strong evaluation. The effort involved in putting together this paper is probably 2-3x that of the average quant/qual paper.
• I think the implications of these findings for gamification research are very interesting, especially because they replicate and extend what has been found in prior meta-analyses. (This receives too little attention in the paper, though.)
• The paper is very well written.
• The change log is by far the most detailed of any I have encountered thus far this year, although I think the authors were a little stuck in their ways in terms of actually making changes.

Discussion point #1: What makes a good negative results paper? Does this paper meet these criteria?

The revision of this paper doubles down on the “negative results as contribution” message. This means that we as a committee have to define the conditions for a high-quality negative results paper, and do so in a way that won’t lead to moral hazards down the road. The authors did not do a good job in their change log arguing why this paper is a good negative results paper, instead making standard arguments about the importance of negative results (without much recognition of the challenges associated with evaluating them). Most if not all reviewers should have already been aware of the argumentation in the change log.

As far as I can tell, this paper implicitly and explicitly states that the following is required for a good negative results paper:
(1) A sample size that gives us relative confidence that moderate growth of the experiment won’t lead to important effects in the end (e.g. we might see significant results, but not ones of a meaningful size).
(2) A discussion section that helps to interpret the negative results so that this paper can lead to some generalizable findings.
(3) The usual array of well-executed methods, excellent communication of results, etc.

Upon significant reflection, I tend to agree that these criteria do help to turn a negative result into something that can be useful outside of the specific experiment (a pre-condition to the acceptance of any paper, in my view). However, I think we need to reflect on this more as a community. Critically, we also need to decide if this paper meets the above criteria, which is the subject of the next two sections of this review. Overall, I think #1 and #3 are spot on with this paper, but #2 is weaker.

METHODS

I don’t think the authors understood my concerns about their statistical approach. I do not take any issue with the two-level design. I take issue with the interpretation. The authors themselves argue that the invitation is an ecologically valid way to test the system, and to me, that means that – at least to some degree – the invitation is *part of the system*. After all, it would be necessary for its real-world deployment. This makes the first level of the results quite important, although I agree that the second level contributes to important understanding as well. The original paper was written as a social science paper would be and tried to control away the invitation. My point is that the invitation can in many ways be considered the first interaction with the system, a point with which I think the authors agree. This is one way in which this study differs from how this technique is often employed in the social sciences.

The good/bad news is that it looks like, with the new results, the point is mostly moot: regardless of whether you consider the invitation as part of the system or not, there is no effect (and if there were an effect, it would be that the system made things worse; note: this is assuming I’m understanding things correctly, as the authors use the terms ‘control’ and ‘treatment’ without specifying whether they refer to the first or second level). The only way in which this point is still an important one is that the interpretation is somewhat strained in this new draft with regard to this issue. I would confront it head-on in any future drafts. This would involve presenting two interpretations of the system: one that includes the invitation (which is ecologically valid in terms of how it would actually be deployed) [level one] and one that does not [level two].

INTERPRETATIONS OF RESULTS

I think the paper falls a bit short in interpreting what the results mean (criterion #2 for a good negative results paper, as per above). The key takeaway seems to be: designing gamified systems to support newcomers in Wikipedia is hard. I’m not sure that’s good enough. The implications for the gamification literature continue to fascinate me (this is a big plus for the paper in my book), but the authors write in their change log that this is not the focus of the paper, and this is reflected in the new draft. My thinking is that if some of the most interesting implications of this paper are in this space, why not make it the focus of the paper? I realize it’s not in the WP domain, but WP has a great deal of value as a test bed for social computing generally, not just for studying WP.

Remaining Formatting and Reference Issues

Report completed