From: <papers2017@cscw.acm.org>
Date: Tue, Jul 12, 2016 at 11:15 PM
Subject: CSCW 2017 notification - #516
To: snehanarayan@gmail.com
Cc: papers2017@cscw.acm.org

Dear Sneha Narayan -

Congratulations!

Your paper:

516 - The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

is one of the 52% of CSCW 2017 submissions invited to revise and resubmit. There were 530 total submissions to CSCW 2017, a similar number to last year. The reviewers for this submission believe that it has the potential to be revised within four weeks -- revisions are due August 9, 2016 -- to become a contribution to what will be an exceptional conference.

The program committee expects all authors to take advantage of this four-week revision period to improve their submissions by addressing reviewers' comments (below). Some submissions need only minor revisions, while others will require considerable work over the next four weeks to result in an acceptable submission, and will not succeed without significant effort. Your reviews, especially the summary report from the Coordinator, should make clear what you should do. You can gauge your prospects from your reviews and the summary report: overall scores of 4 and 5 indicate the reviewers are very confident your paper will be acceptable within four weeks with small edits. Overall scores of 3 and 4 indicate you have some work to do. Scores of 3 and below indicate that some reviewers have serious reservations, though other reviewers see promise.

The same reviewers will read and evaluate your revised submission (though additional reviewers may be added for papers where the reviewers are divided). You need not satisfy every reviewer or make every suggested change, but your revision will need to convince most of the reviewers that it is now ready for publication. For some papers the reviewers have requested a lot of work, and you might feel that it is too much to achieve in a four-week period. If you have the time to reach that goal: great! If not, that is okay; you are free to withdraw your submission. Please decide whether or not the key points made by reviewers can be adequately addressed in the time provided, given other demands on your time. If you choose to withdraw your paper, please notify us explicitly at papers2017@cscw.acm.org. Papers that are revised and re-submitted in the next round will receive revised reviews.

Your revision must be accompanied by a separate "Summary of Changes" document (in PDF format) that lists the reviewers' comments and your responses, even for comments that did not lead to changes in the manuscript (in which case you might explain why you chose not to make certain suggested changes). This could be a set of bullet points, a table, or numbered points by which reviewers' comments are summarized along with your changes. This is not a rebuttal, but rather a description of changes made, or of reasons you could not or chose not to take the reviewers' advice. To become acceptable, your submission must be revised, and your document describing the changes will greatly help reviewers see what you have or have not changed, along with your reasons for doing so.

Just to be clear, you must submit a revised paper and summary of changes by the deadline. Any paper where a revision and summary are not submitted will be considered withdrawn.

Example summaries from past years' papers can be found at http://bit.ly/16U8BGM.

Please submit your revision and the response document at your "Submissions in Progress" page at https://precisionconference.com/~cscw17a/ by 11:59 PM PDT, August 9, 2016.

CSCW 2017 will be a great conference, and we sincerely hope you are part of it! If you have any issues or questions, please let us know. And thanks again for submitting.

Sincerely,
Louise Barkhuus, Marcos Borges, Wendy A. Kellogg
CSCW 2017 Co-chairs

------------------------ Submission 516, Review 4 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

Reviewer: AC

Expertise

2 (Passing Knowledge)

First Round Overall Recommendation

3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

The paper presents the design and evaluation of a gamified tool for socializing and retaining new Wikipedia editors. Contribution criteria include (1) a description and rationale for the system; (2) system novelty and rationale for how it leads to learning; and (3) a methodologically sound evaluation.

First Round Review (if needed)

Coordinator's First-Round Report to Authors

The paper presents the design and evaluation of a gamified tool for socializing and retaining new Wikipedia editors. The study found that users liked, but did not learn from, the system.

The focus on improving the experience of newcomers in Wikipedia is relevant and important. Reviewers describe the study as well motivated and exceptionally well-written. Read R3's comments on the writing quality and congratulate yourself!

The reviewers, however, have many concerns about the paper, each focusing on a different aspect of the work. The concerns the reviewers note /may/ be addressable during the revise and resubmit period, but doing so will require an exceptionally herculean effort. Also, please keep in mind that there is no guarantee of acceptance even after making changes. So it is at the authors' discretion whether to proceed with revisions or withdraw the paper.

The reviewers are split as to whether the failure of the tool is interesting or not. R1 raises concerns that the failure of the tool could be predicted from existing literature, suggesting little rationale for doing the work in the first place. R2 asks whether there is something fundamentally different about people who continue to contribute to Wikipedia, and as such whether the system holds value in practice. R3, on the other hand, sees much value in the systems contribution of the work as well as the real-world evaluation. R3's review has some suggestions of alternative framings that may make the contribution more valuable.

The treatment of related work needs many improvements. R1 notes that the discussion of the well-known concept of legitimacy/authenticity in learning environments is missing. R2 also points to missing literature about Wikipedian experience.

R2 and R3 raise a number of methodological questions about the paper. R2 suggests the distribution of participants across the timeline may bias the results. R3, on the other hand, sees opportunity here, suggesting additional statistical analysis related to longevity and power users. Both R2 and R3 question the methodological choice and contribution of measuring perceptions of learning rather than actual learning. Overall, this points to a need for, at the very least, justifying the methodological choices and, at most, carrying out additional statistical analyses.

In summary, there is quite a bit of work to be done. I wish the authors the best of luck, should they choose to continue in the review process.

Requested Revisions

REQUIRED:
- Provide justification for why the study was worth carrying out, in response to R1 and R2's concerns. R3's review may have some insight into alternative framings.
- State the research questions more explicitly, as per R2's recommendation.
- Address R1 and R2's concerns about missing literature.
- Ensure that the narrative around Wikipedia is clear to readers who do not have an in-depth background in production/editing details.
- Improve the clarity of the results by using percentages or another baseline that allows comparison between numbers, as per R2's review.
- Provide justification for measuring perceptions of learning versus actual learning.
- Provide a robust discussion of why the results are meaningful for researchers and/or practitioners.

OPTIONAL, RECOMMENDED:
- Consider carrying out additional statistical analyses as recommended by R3.
- Provide a short justification for use of English-language Wikipedia, as per R2's review.

Formatting and Reference Issues

------------------------ Submission 516, Review 1 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

Expertise

4 (Expert)

First Round Overall Recommendation

3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

In this paper, the authors present the design and two-pronged evaluation of a tutorial for new Wikipedia editors that uses elements of gamification like missions and badges to help coach new editors and help them learn best practices and social norms of Wikipedia. The outcome is that users like the system but, based on behavioral measures, they don't actually learn from it. Learning interventions are a classic kind of research problem, and the paper should include robust measures of learning, as well as a good description of the designed intervention itself, why the design is expected to lead to learning, and a clear description of the study.

Assessment of the Paper

This is a reasonably well-motivated study with connections to appropriate literature, and the writing is engaging and understandable. The problem of enculturating newcomers into projects like Wikipedia is well documented, and this paper investigates a potential intervention with an admirably well-planned study. Designing learning interventions is really difficult, and I commend the authors on a well-executed effort.

Still, I am ambivalent about the paper because I would have predicted these outcomes based on the literature alone. In the discussion, the authors note that one mismatch between Wikipedia and the tutorial as designed involves the "gradual peripheral participation" of newcomers as they take on the identity of "Wikipedian." They suggest that maybe speeding up this process is unnatural. I would argue that the most important concept from the literature on learning is missing from this discussion, and that's "legitimacy" (also sometimes referred to in the education and learning literature as "authenticity"). The authors explain that by doing tasks in a pretend version of Wikipedia, they make it a safe space for newcomers to practice, yet performing "canned" tasks in a pretend system is the opposite of offering a legitimate form of participation. I immediately wonder: why not use what we know from the literature to create low-risk missions that newcomers can complete while legitimately contributing to the encyclopedia? Risk taking is a fundamental characteristic of games that makes them engaging; it certainly seems like it would play a role in people's motivation in a scenario like this. Rather than eliminating risk, the literature on legitimate peripheral participation would suggest that finding the right degree of risk is required to facilitate progressive entree into a set of shared practices.

I am disappointed by the missed opportunity here: the outcome mainly seems to verify that what we know shouldn't work based on the literature in fact doesn't work. Still, the paper isn't bad, and the study is carefully crafted and reported.

With some extension and reflection, I think the discussion could help point future research in a more fruitful direction. There are millions of pages written on the challenges of designing learning interventions that change people's behavior, yet this paper ends on a painfully obvious note. It's true that usability isn't all it takes, but what can we learn from TWA about the design of systems to facilitate enculturation into a community of practice? What can we take away from this that might inform more successful tutorial systems in the future?

Formatting and Reference Issues

------------------------ Submission 516, Review 2 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

Expertise

4 (Expert)

First Round Overall Recommendation

2 (Probably NOT acceptable)

Contribution and Criteria for Evaluation

This paper's contribution is the design and evaluation of a structured introduction to a peer production community (English Wikipedia) called "The Wikipedia Adventure". TWA's design is rooted in theories of gamification, and its utility is evaluated through a user survey and an invitation-based field experiment. The paper reports on the survey respondents' satisfaction with TWA, and how the experiment results reveal some of the challenges of effecting lasting changes to contributor patterns in peer production communities. These findings are then discussed in relation to cultural factors in Wikipedia, issues of self-selection and voluntary participation, and the limitations of gamification.

When evaluating a paper that describes the design of a system, the two main criteria are that the system and/or its development setting is/are novel, and that the way the system is evaluated is methodologically sound.

Assessment of the Paper

As mentioned in the contribution section, this paper's contribution is the design and evaluation of a structured introduction to a peer production community based on gamification, called "The Wikipedia Adventure". This is a great idea and sounds like a useful addition to Wikipedia. The paper is written in a way that makes it easy to read, and provides the reader with a good introduction to how TWA's design is rooted in theories of gamification, thus applying these principles in what appears to be a novel setting. The paper also does a good job of discussing the findings, organizing them in a way that is easy to follow and touching on important points (e.g. cultural factors, and the limitations of self-selection and gamification).

The overall ideas and approach taken in this paper are sound, and they are in line with the criteria described previously. Unfortunately, there are two major issues and several minor ones that need to be resolved before this paper is ready for publication. The first major issue is that the methodology used to evaluate performance in the invitation-based experiment measures contribution in a skewed manner and does not establish why that is appropriate. Secondly, the paper fails to consider arguments put forth by Panciera et al.'s "Wikipedians Are Born, Not Made" paper. This review will expand on both of these major issues below, followed by notes and comments with suggestions for improvement for specific sections of the paper, some of which are rather substantial as well.

1: Evaluating TWA effectiveness by number of contributions
----------------------------------------------------------

A major part of the paper is the evaluation of TWA's effect on subsequent contributions. To evaluate this, an invitation-based field experiment is used, and the paper does a great job of justifying why that is appropriate in this setting. The experiment ran for three months, starting in February 2014. Exact dates are not given, so let us assume that it ran until the end of April 2014. User contributions are then measured until the end of May 2014.

There are two problems with this approach that the paper fails to address properly. One is the issue of right-truncation found in the data. Contributors who joined in early February 2014 would have about four months to make edits, whereas those who joined in late April would only have about a month. The model does contain a control variable for the number of days in the experiment, but why is that appropriate in this context? If we examine other work in the same domain, it tends to either use a much longer time period (e.g. the Teahouse paper, citation 23, which uses 6-9 months) or ensure that the time period is fixed (e.g. Kittur et al. "Herding the Cats: The Influence of Groups in Coordinating Peer Production", WikiSym 2009; or Zhu et al. "Effectiveness of Shared Leadership in Online Communities", CSCW 2012). A sketch of the fixed-window alternative follows below.

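For concreteness, here is a minimal sketch of that fixed-window alternative in Python (the column names are hypothetical, since I do not have access to the data):

    from datetime import timedelta
    import pandas as pd

    WINDOW = timedelta(days=30)            # identical window for every account
    DATA_END = pd.Timestamp("2014-05-31")  # last day with complete data

    def fixed_window_edit_counts(users, edits):
        """users: one row per account with 'user_id' and 'included_at';
        edits: one row per edit with 'user_id' and 'timestamp'.
        Counts each account's edits within WINDOW of inclusion, dropping
        accounts whose window extends past the end of the data."""
        observable = users[users["included_at"] + WINDOW <= DATA_END]
        merged = edits.merge(observable, on="user_id")
        in_window = merged[
            (merged["timestamp"] >= merged["included_at"])
            & (merged["timestamp"] < merged["included_at"] + WINDOW)
        ]
        counts = in_window.groupby("user_id").size()
        # Accounts with zero edits in the window are still observations.
        return counts.reindex(observable["user_id"], fill_value=0)

This makes every account's observation period identical, at the cost of discarding the accounts that were included last.
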
Related to the right-truncation problem is the fact that the paper also fails to discuss and justify what a reasonable timespan for measuring the effect of TWA is, and how that choice affects the number of contributions measured. It might for instance be that TWA instead has an effect on how long it takes before a user drops out of the system. If we assume that TWA has an effect on contributions, what timespan is needed to measure that effect? The paper assumes that a month is adequate to discover it, whereas one might suspect that it is only measurable over a longer period of time. If it is the case that a short period of time is appropriate (for instance because these users are likely to drop out after a certain amount of time), the paper needs to properly establish that, either by measuring it or referring to previous work.

2: Wikipedians Are Born, Not Made
---------------------------------

In their GROUP 2009 paper "Wikipedians Are Born, Not Made: A Study of Power Editors on Wikipedia", Panciera et al. show data arguing that those contributors who are going to stick around behave differently from the very beginning. In followup work published in 2010, they find similar differences in another peer production community (Panciera et al. "Lurking? cyclopaths?: a quantitative lifecycle analysis of user behavior in a geowiki." CHI 2010).

These two papers and the argument they put forth are relevant because they question who TWA is designed for. In the related work, a reference to Bryant et al.'s "Becoming Wikipedian" is made, thereby suggesting that TWA is designed to teach someone how to be a Wikipedian. As Panciera et al.'s paper argues along the lines of these contributors already being Wikipedians, should TWA instead be designed to help these contributors stay productive?

If Wikipedians are born, not made, then one could also question whether these contributors are going to use TWA at all. Maybe they ignore TWA because they are already productive and do not need it? Since the paper never references these papers or discusses the issues they raise (e.g. "is the Teahouse more effective since it allows them to get answers when they need help?"), this whole topic area is left hanging.

---
Below follows comments/notes for each section of the paper.

Introduction:
* An overall issue here is that there are few citations to sources. For instance, a claim is made that "newly created accounts are the primary source of spam and vandalism on Wikipedia". Consider a "[citation needed]" added after that.
* When citing multiple papers it is preferable that they are in order, e.g. "[14, 23, 17]" should be "[14, 17, 23]" (page 1). This minor issue also occurs elsewhere in the paper.
* "Unlike prior systems, TWA creates a structured experience that guides newcomers through critical pieces of Wikipedia knowledge..." Do we know that there are no other prior systems that offer a similar experience? It might be that there are none within the Wikipedia domain, but what about outside it? That sentence is making a rather bold claim.
* After reading the introduction, what is the reader expected to remember as the main findings in this paper? At the end of the introduction the following sentence is found: "The study underscores the importance of conducting multiple types of evaluations of social systems." Is that the main contribution? What about the implications for gamified structured introductions to peer production?

Background:
* "...women reported that they found that contributing to Wikipedia involved a high level of conflict and that they lacked confidence in their expertise [8]. This suggests that more effective onboarding tools could help incorporate newcomers." This is an important side of Wikipedia, but how does TWA's design help mitigate this issue? Are there design elements in TWA that aim to boost confidence in one's expertise?
* At the end of the introduction we find the following two questions: "Would a gamified tutorial produce a positive, enjoyable educational experience for new Wikipedians? Would playing the tutorial impact newcomer participation patterns?" These are the paper's _research questions_! It would be very helpful to the reader if they were displayed more clearly, e.g. as separate items. They should not be hidden.

System Design:
* "...it does not depend on the availability, helpfulness, or intervention of existing Wikipedia editors..." The underlying argument here is that scalability is preferable to personal interaction when socializing newcomers (in peer production communities). Why is that the better solution? As discussed previously, TWA might be designed for contributors who are not going to stick around, so why are those the right audience for it? Is the goal to provide _everyone_ with a scalable, impersonal introduction, or is it better to provide _some_ (typically based on self-selection) with a personal introduction (e.g. the Teahouse)?

Game-like elements (subsection of System Design):
* In "Missions" a distinction is made between "basic" and "advanced" editing techniques. It appears to be somewhat arbitrary: why is adding sources advanced editing, but watchlists are not?
* Your readers might not know what watchlists are; take care to write for a general audience, as not everyone knows a lot about how Wikipedia works behind the scenes.

Study 1: User Survey:
* This paper doesn't discuss any other language editions of Wikipedia besides the English one, and makes the assumption that "Wikipedia" equals the English edition. Adding a mention that Wikipedia exists in multiple languages and explaining why English was chosen as the language where TWA was launched would be very helpful.
* The paper aims to measure "educational effectiveness". Why is a survey the appropriate way to measure that? Based on the description of the survey, it seems that it never asks specific questions to test whether TWA's users learned specific things, in other words whether the education was successful. Later, when describing the results, the phrase "learning to edit Wikipedia" is used; isn't that the _key_ learning goal of TWA? Yet the survey asks Likert-scale questions. In other words, you're measuring whether TWA users are under the impression that they learned something, not whether they actually did.
* Figure 4 uses counts. While it shows that none of the questions had responses from all participants, it makes comparisons between questions with different response rates very difficult. Using percentages would allow for direct comparisons, and would make the references to the figure in the text easier to follow along with. The text refers to four questions with a certain percentage of responses, but leaves the math to the reader. (See the sketch after this list.)
* The survey leaves many questions unanswered, some of which the paper might want to address. Were any negative questions asked? Were there any control questions, such as a similar question worded slightly differently to allow for comparison between responses? As it is, this survey comes across as a set of positive statements about TWA that respondents agreed to. Given that respondents self-select and no attempt to contact users who didn't go through TWA appears to have been made, it is likely there is a bias in the responses, and that bias should be discussed.

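To illustrate the percentages point above, a minimal sketch (the question labels and counts are hypothetical, not taken from the paper):

    import pandas as pd

    # Hypothetical per-question Likert response counts.
    counts = pd.DataFrame(
        {"disagree": [3, 4], "neutral": [10, 8],
         "agree": [40, 35], "strongly agree": [30, 22]},
        index=["TWA was fun", "I learned to edit"],
    )

    # Normalize each question by its own response total, so questions
    # with different response rates become directly comparable.
    percentages = counts.div(counts.sum(axis=1), axis=0) * 100
    print(percentages.round(1))
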
Study 2: Field Experiment:
* The description of how accounts were selected for inclusion is rather confusing. First it describes 1,967 accounts that met the same criteria as for the user survey, yet 10,000 individuals ("accounts"?) were invited to the beta. Why is one an order of magnitude larger than the other? Then, in the second paragraph of "Methods", it describes the selection criteria, that at least one contribution would have to be made after getting invited. This would perhaps be much less confusing if the criteria were explained first, particularly how the experiment and control groups were set up, and then how many accounts were identified.
* "This is a larger proportion of users than took up the invitation in Study 1, which may be due to changes in the invitation text." Earlier in the paper, Study 1 refers to a "beta", whereas this appears not to be one. If this is the case, it is an important difference between the two that should be made clear to the reader.
* "we measure the overall contributions as the total number of edits made by each account from the time of inclusion in the study until May 31, 2014." When exactly is "time of inclusion"? Is that when they got the invite? What about when they completed one (or all) TWA mission(s)? The concern here is that all contributions are measured, whereas the experiment sets up a pre/post scenario. Later on, the paper refers to "subsequent contributions", indicating that contributions after a certain point in time were measured. This quickly becomes rather confusing; spelling out clearly which points in a user's account history are used (e.g. "we measure contributions at four points in time: when the user registered their account, the time of invitation, when they first started using TWA, and the end of the experiment") would be very helpful.
* Why is a six-edit radius chosen when measuring word persistence? Halfaker et al. make no claim about what the radius should be in the referenced work, and Ekstrand and Riedl suggest a 15-edit radius in a related paper (Ekstrand and Riedl "rv you're dumb: identifying discarded work in Wiki article history." WikiSym 2009). The six-edit radius also comes with an issue that is unaddressed: how long does it take for an edit made by a contributor in the study to reach that six-edit radius? If it hasn't been reached by the end of the study period, that edit has to be discarded, as its quality is unknown. In a related paper, Farzan and Kraut instead chose to use the percentage of words that survived as a measure of quality (Farzan and Kraut "Wikipedia classroom experiment: bidirectional benefits of students' engagement in online production communities" CHI 2013).
* Tables 1, 2, 3, and 4, as well as figure 6, should be brought closer together so it's easier to follow along. Table 1 occurs before the text that refers to it, and table 4 is two pages further along. Putting all tables and figure 6 on the same page might be a good solution.
* Table 3 refers to users who "reached" a mission. It is confusing how 181 users reached the final mission but did not complete it, yet in the text it seems these 181 users actually did.
* The post-hoc power analysis is very useful! (A sketch of this kind of analysis follows after this list.)

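For reference, a minimal sketch of this kind of post-hoc power computation (the effect size and group size below are placeholders, not the paper's actual numbers):

    from statsmodels.stats.power import TTestIndPower

    # Substitute the observed effect size (Cohen's d) and the actual
    # per-group sample sizes from the experiment.
    power = TTestIndPower().solve_power(effect_size=0.1, nobs1=900,
                                        ratio=1.0, alpha=0.05)
    print(f"Power to detect d=0.1 with 900 per group: {power:.2f}")
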
Discussion:
* "The new editors in our study may have had unpleasant experiences during their initial time on Wikipedia..." It appears that the survey asked no questions about this, yet is it not a very important issue related to TWA's success?
* In "Limitations of gamification" the following sentence is found: "...our study is among the first that compares levels of participation in a task among individuals who were introduced to gamified learning first to those that were not." This is an _important_ finding; it shouldn't be hidden back here but instead be up front in the introduction!

Formatting and Reference Issues

------------------------ Submission 516, Review 3 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

Reviewer: AC-Reviewer

Expertise

4 (Expert)

First Round Overall Recommendation

3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

This paper presents the results of a deployment of a gamification-based system designed to retain new editors in Wikipedia. It is a negative results paper: the authors claim that they have conclusive evidence that the system did not work (although I have suggested a few additional lines of inquiry below that might problematize this assertion).

The committee will have to have a discussion about how to evaluate this paper, and likely negative results papers more generally.

Assessment of the Paper

This paper presents the results of a deployment of a gamification-based system designed to retain new editors in Wikipedia. It is a negative results paper: the authors claim that they have conclusive evidence that the system did not work (although I have suggested a few additional lines of inquiry below that might problematize this assertion).

The paper is very well-written and has some large positives. It is also a negative results paper, and the committee will have to decide how to handle this. In general, I'm strongly sympathetic to arguments to include more negative results papers in our proceedings, but I'm quite unclear on the details of how to do so (e.g. what defines a top-quality negative results paper?). I'm hopeful that this paper can instigate a broader discussion on this topic at the PC meeting.

All of that said, this paper also has a number of idiosyncratic limitations that make it perhaps not the best trial balloon for negative results papers. Below, I outline what I believe to be the paper's positives and then describe these limitations in more detail, phrased as both critiques and questions.

Overall, my recommendation is to invite the authors to revise and resubmit. If this occurs, I'll want to see the below critiques addressed and the below questions answered (both through direct answers in the response to reviewers and through clarifications and changes to the paper). I'm hopeful that, through the R&R process, this paper can become an ideal negative results trial balloon.

Important positives:

* The authors built a system to solve a real-life problem and did a real-life, relatively large-scale deployment. Awesome!
* The paper is easily in the 95th percentile in terms of writing quality. This is true both at the sentence level and at the narrative level. As a person who has to review lots of papers, this was a breath of fresh air.
* The design of the game is quite well-thought-out, save a few relatively arbitrary decisions. I was particularly compelled by the use of gamification techniques that are also present in "real Wikipedia" (e.g. barnstar-like rewards).

Critiques:

CRITIQUE #1 – Excessive import placed on trivial self-report data: It is well-known that self-report data from participants is inferior to observations of actual behavior, and that self-report data can be quite unreliable more generally. As such, in my view, it is not a contribution to show that self-report data didn't end up panning out in the behavioral results.

In the next draft of this paper, I would like to see the authors address this issue. This might mean framing this paper as a full-on negative results paper, but lighter-weight adaptations might be possible.

Open questions:

QUESTION #1: As noted above, this paper is a negative results paper at its core, and we'll have to have a broad discussion about this at the PC meeting, assuming the paper makes it this far. In the event that this occurs, can the authors provide a more robust argument as to why these negative results are important for other researchers and practitioners?

The paper attempts to argue that one contribution that comes out of its negative results is to distrust self-report data, but this is well-known (see above). The other negative results argument in the paper is that these results add to growing evidence of long-term gamification failures. I find this argument much more compelling. In other words, by expanding on this argument, the authors may be able to address this question.

That said, regardless of how this question is addressed in the second draft, I'd like to see it done both through changes to the paper and through discussion in the response to reviewers.

QUESTION #2 – Is there a possibility that the statistical framework employed is not appropriate for this particular study?

The authors utilize a two-level statistical approach that I haven't seen before in the CSCW/CHI literature. I enjoyed thinking about this approach, and the authors did a relatively good job explaining it. That said, I'm currently not convinced that it was the appropriate framework for this study. Here's my reasoning:

(1) The goal here is to introduce a treatment that ultimately will produce strong new members of the Wikipedia community at a higher rate than the control.
(2) Let's say the game produces 3 such members out of 100 new editors and the control produces 1, which looks like it might be the case. Let's also say that this pattern additionally persists over a large n.
(3) If this is true, why do we care about the potentially moderating effect of the invitations?

The authors argue that new editors who responded to the invitation to play the game might just be new editors who are engaged and, critically, would have been power editors whether or not the game existed. However, barring a random fluke, shouldn't these future power editors also have been in the control group? If I'm right here, I'm thinking the invitation doesn't matter and a more traditional statistical analysis (or at least one targeted at identifying rare events) is appropriate; a sketch follows below.

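To illustrate what such a rare-events analysis could look like, here is a minimal sketch using the hypothetical 3-in-100 versus 1-in-100 rates from point (2) above (these are not the paper's actual numbers):

    from scipy.stats import fisher_exact

    # Rows: treatment and control; columns: [power editors, everyone else].
    table = [[3, 97],   # 3 of 100 invited new editors became power editors
             [1, 99]]   # 1 of 100 control editors did
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
    # At n = 100 per group this difference is not significant, which is
    # why the pattern would need to persist over a large n.
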
I could be wrong, but I want the authors to respond to this question, both through feedback to reviewers and clarifications in the paper.

As an important side note, if we agree that this framework is the right way to go in the end, the authors should puff their chests more about this by claiming it as a contribution (assuming it hasn't been used at CSCW before).

QUESTION #3 – Are the outcome variables considered here the best outcome variables? Are some critical variables missing?

The authors seem focused on the average effects across the entire control and treatment groups (the two treatment groups, to be specific). However, would it not also be reasonable to consider the metric I describe above: the % of new editors who go on to be power editors? Since power editors end up contributing most of the edits anyway *over the long term*, to me this seems like the way to go (i.e. if this group of editors were followed for years, statistically significant differences would begin to emerge). If the authors agree, they need to reanalyze their data with this metric in mind.

Another related outcome variable that might be useful to analyze is how long the new editors in each group remained active editors in the community (i.e. survival analysis). Because the data is quite old, this should be an easy new analysis to run, and longevity has been a variable of interest in a number of peer production studies.

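For concreteness, a minimal sketch of the survival analysis I have in mind (using the lifelines package; the column names are hypothetical):

    from lifelines import KaplanMeierFitter
    from lifelines.statistics import logrank_test

    def compare_retention(df):
        """df: one row per new editor with 'days_active' (time from first
        edit to last observed edit), 'dropped_out' (True if the editor
        stopped editing before the end of the observation window), and
        'group' ('treatment' or 'control')."""
        treat = df[df["group"] == "treatment"]
        ctrl = df[df["group"] == "control"]
        kmf = KaplanMeierFitter()
        kmf.fit(treat["days_active"], event_observed=treat["dropped_out"])
        # kmf.survival_function_ now gives the treatment retention curve;
        # refit on ctrl for the control curve, then compare the groups:
        result = logrank_test(
            treat["days_active"], ctrl["days_active"],
            event_observed_A=treat["dropped_out"],
            event_observed_B=ctrl["dropped_out"],
        )
        return result.p_value
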
In their second draft and the feedback to reviewers, I would like to see the authors discuss either new analyses related to power users or why they did not consider this outcome variable. I would also like to see the same for survival analysis.

QUESTION #4: Is there a path towards positive results?

As noted above, I believe some discussion around this paper and negative results papers more generally will have to happen at the PC meeting. However, I think there are some missed opportunities here for positive results, and that the authors were too quick to settle for negative results. This is likely an important factor to consider when deciding whether to accept a negative results paper.

Most notably, there are several well-motivated, unexplored avenues that could lead to positive results that would have a much larger impact than the negative results presented here:

* As noted above, examining additional outcome variables is important, most notably # of power editors and longevity.
* Does the game work if folks are forced to play it prior to editing Wikipedia, as would be the case in most other institutionalized socialization contexts? This is not just a hypothetical: this game could be used in all Wikipedia Education Project classes and related endeavors.

Formatting and Reference Issues