
added material for TWA 2017

This commit is contained in:
Benjamin Mako Hill 2019-06-11 17:54:32 -07:00
parent 3e6d27447e
commit 4fc082a8ac
6 changed files with 3907 additions and 0 deletions


@@ -0,0 +1,7 @@
Material for paper:
Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron
Shaw. 2017. “The Wikipedia Adventure: Field Evaluation of an Interactive
Tutorial for New Users.” In Proceedings of the 20th ACM Conference on
Computer-Supported Cooperative Work & Social Computing (CSCW 17). New York,
New York: ACM. https://doi.org/10.1145/2998181.2998307

File diff suppressed because it is too large


@@ -0,0 +1,718 @@
From: <papers2017@cscw.acm.org>
Date: Tue, Jul 12, 2016 at 11:15 PM
Subject: CSCW 2017 notification - #516
To: snehanarayan@gmail.com
Cc: papers2017@cscw.acm.org
Dear Sneha Narayan -
Congratulations!
Your paper:
516 - The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
is one of the 52% of CSCW 2017 submissions invited to revise and resubmit.
There were 530 total submissions to CSCW 2017, a similar number to last
year. The reviewers for this submission believe that it has the potential
to be revised within four weeks -- revisions are due August 9, 2016 -- to
become a contribution to what will be an exceptional conference.
The program committee expects all authors to take advantage of this four
week revision period to improve their submissions by addressing reviewers'
comments (below). Some submissions need only minor revisions, while others
will require considerable work over the next four weeks to result in an
acceptable submission, and will not succeed without significant effort.
Your reviews, especially the summary report from the Coordinator, should
make clear what you should do. You can gauge your prospects from your
reviews and the summary report: overall scores of 4s and 5s indicate the
reviewers are very confident your paper will be acceptable within four
weeks with small edits. Overall scores of 3 and 4 indicate you have some
work to do. Scores of 3 and below indicate that some reviewers have serious
reservations, though other reviewers see promise.
The same reviewers will read and evaluate your revised submission (though
additional reviewers may be added for papers where the reviewers are
divided). You need not satisfy every reviewer or make every suggested
change, but your revision will need to convince most of the reviewers that
it is now ready for publication. For some papers the reviewers have
requested a lot of work, and you might feel that it is too much to achieve
in a four-week period. If you have the time to reach that goal: great! If
not, that is okay; you are free to withdraw your submission. Please decide
whether or not the key points made by reviewers can be adequately addressed
in the time provided, given other demands on your time. If you choose to
withdraw your paper, please notify us explicitly at papers2017@cscw.acm.org.
Papers that are revised and re-submitted in the next round will receive
revised reviews.
Your revision must be accompanied by a separate "Summary of Changes"
document (in PDF format) that lists the reviewers' comments and your
responses, even for comments that did not lead to changes in the manuscript
(in which case you might explain why you chose not to make certain
suggested changes). This could be a set of bullet points, a table, or
numbered points by which reviewers' comments are summarized along with your
changes. This is not a rebuttal, but rather a description of changes made,
or of reasons you could not or chose not to take the reviewers' advice. To
become acceptable, your submission must be revised, and your document
describing the changes will greatly help reviewers see what you have or
have not changed, along with your reasons for doing so.
Just to be clear, you must submit a revised paper and summary of changes by
the deadline. Any paper where a revision and summary are not submitted
will be considered to be withdrawn.
Example summaries from past years' papers can be found at
http://bit.ly/16U8BGM.
Please submit your revision and the response document at your "Submissions
in Progress" page at https://precisionconference.com/~cscw17a/ by 11:59 PM
PDT, August 9, 2016.
CSCW 2017 will be a great conference, and we sincerely hope you are part of
it! If you have any issues or questions, please let us know. And thanks
again for submitting.
Sincerely,
Louise Barkhuus, Marcos Borges, Wendy A. Kellogg
CSCW 2017 Co-chairs
------------------------ Submission 516, Review 4 ------------------------
Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
Reviewer: AC
Expertise
2 (Passing Knowledge)
First Round Overall Recommendation
3 (Maybe acceptable (with significant modifications))
Contribution and Criteria for Evaluation
The paper presents the design and evaluation of a gamified tool for
socializing and retaining new Wikipedia editors. Contribution criteria
include (1) a description and rationale for the system; (2) system
novelty and rationale for how it leads to learning; and (3) a
methodologically sound evaluation.
First Round Review (if needed)
Coordinator's First-Round Report to Authors
The paper presents the design and evaluation of a gamified tool for
socializing and retaining new Wikipedia editors. The study found that
users liked—but did not learn from—the system.
The focus on improving the experience of newcomers in Wikipedia is
relevant and important. Reviewers describe the study as well motivated
and exceptionally well-written. Read R3's comments on the writing
quality and congratulate yourself!
The reviewers, however, have many concerns about the paper—each
focusing on a different aspect of the work. The concerns the reviewers
note /may/ be addressable during the revise and resubmit period, but it
will require an exceptionally herculean effort. Also, please keep in mind
that there is no guarantee of acceptance even after making changes. So, it
is at the authors' discretion whether or not to proceed with revisions or
withdraw the paper.
There is a split amongst the reviewers as to whether the failure of the
tool is interesting or not. R1 raises concerns that the failure of the
tool could be predicted from existing literature, suggesting little
rationale for doing the work in the first place. R2 asks whether there
is something fundamentally different about people who continue to
contribute to Wikipedia, and as such whether the system holds value in
practice. R3, on the other hand, sees much value in the system's
contribution as well as in the real-world evaluation. R3's
review has some suggestions of alternative framings that may make the
contribution more valuable.
In treatment of related work, many improvements are needed. R1 notes that
the discussion of the well-known concept of legitimacy/authenticity in
learning environments is missing. R2 also points to missing literature
about Wikipedian experience.
R2 and R3 raise a number of methodological questions about the paper. R2
suggests the distribution of participants across the timeline may bias
the results. R3, on the other hand, sees opportunity here, suggesting
additional statistical analysis related to longevity and power users.
Both R2 and R3 question the methodological choice and contribution of
measuring perceptions of learning rather than actual learning. Overall,
this points to a need for at the very least justifying the methodological
choices and at the most carrying out additional statistical analyses.
In summary, there is quite a bit of work to be done. I wish the authors
the best of luck, should they choose to continue in the review process.
Requested Revisions
REQUIRED:
- Provide justification for why the study was worth carrying out, in
response to R1 and R2's concerns. R3's review may have some insight
into alternative framings.
- State the research questions more explicitly, as per R2's
recommendation.
- Address R1 and R2's concerns about missing literature.
- Ensure that the narrative around Wikipedia is clear to readers who do
not have an in-depth background in production/editing details.
- Improve the clarity of the results by using percentages or another
baseline that allows comparison between numbers, as per R2's review.
- Provide justification for measuring perceptions of learning versus
actual learning.
- Provide a robust discussion of why the results are meaningful for
researchers and/or practitioners.
OPTIONAL, RECOMMENDED
- Consider carrying out additional statistical analyses as recommended
by R3.
- Provide a short justification for use of English-language Wikipedia,
as per R2's review.
Formatting and Reference Issues
------------------------ Submission 516, Review 1 ------------------------
Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
Expertise
4 (Expert)
First Round Overall Recommendation
3 (Maybe acceptable (with significant modifications))
Contribution and Criteria for Evaluation
In this paper, the authors present the design and two-pronged evaluation
of a tutorial for new Wikipedia editors that uses elements of
gamification like missions and badges to help coach new editors and help
them learn best practices and social norms of Wikipedia. The outcome is
that users like the system but, based on behavioral measures, they don't
actually learn from it. Learning interventions are a classic kind of
research problem, and the paper should include robust measures of
learning, as well as a good description of the designed intervention
itself, why the design is expected to lead to learning, and a clear
description of the study.
Assessment of the Paper
This is a reasonably well motivated study with connections to appropriate
literature and the writing is engaging and understandable. The problem of
enculturating newcomers into projects like Wikipedia is well documented
and this paper investigates a potential intervention with an admirably
well-planned study. Designing learning interventions is really difficult
and I commend the authors on a well-executed effort.
Still, I am ambivalent about the paper because I would have predicted
these outcomes based on the literature alone. In the discussion, the
authors note that one mismatch between Wikipedia and the tutorial as
designed involves the “gradual peripheral participation” of newcomers
as they take on the identity of “Wikipedian.” They suggest that maybe
speeding up this process is unnatural. I would argue that the most
important concept from the literature on learning is missing from this
discussion, and that's “legitimacy” (also sometimes referred to in
education and learning literature as “authenticity”). The authors
explain that by doing tasks in a pretend version of Wikipedia, they make
it a safe space for newcomers to practice, yet performing “canned”
tasks in a pretend system is the opposite of offering a legitimate form
of participation. I immediately wonder, why not use what we know from the
literature to create low-risk missions that newcomers can complete while
legitimately contributing to the encyclopedia? Risk taking is a
fundamental characteristic of games that makes them engaging; it
certainly seems like it would play a role in people's motivation in a
scenario like this. Rather than eliminating risk, the literature on
legitimate peripheral participation would suggest that finding the right
degree of risk is required to facilitate progressive entree into a set of
shared practices.
I am disappointed by the missed opportunity here; the outcome mainly
seems to verify that what we know, based on the literature, shouldn't
work in fact doesn't work. Still, the paper isn't bad, and
the study is carefully crafted and reported.
With some extension and reflection, I think the discussion could help
point future research in a more fruitful direction. There are millions of
pages written on the challenges of designing learning interventions that
change people's behavior, yet this paper ends on a painfully obvious note.
It's true that usability isn't all it takes, but what can we learn
from TWA about the design of systems to facilitate
enculturation into a community of practice? What can we take away from
this that might inform more successful tutorial systems in the future?
Formatting and Reference Issues
------------------------ Submission 516, Review 2 ------------------------
Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
Expertise
4 (Expert)
First Round Overall Recommendation
2 (Probably NOT acceptable)
Contribution and Criteria for Evaluation
This paper's contribution is the design and evaluation of a structured
introduction to a peer production community (English Wikipedia) called
"The Wikipedia Adventure". TWA's design is rooted in theories of
gamification, and its utility is evaluated through a user survey and an
invitation-based field experiment. The paper reports on the survey
respondents' satisfaction with TWA, and how their experiment results
reveal some of the challenges of effecting lasting changes to contributor
patterns in peer production communities. These findings are then
discussed in relation to cultural factors in Wikipedia, issues of
self-selection and voluntary participation, and the limitations of
gamification.
When evaluating a paper that describes the design of a system, the two
main criteria are that the system and/or its development setting is/are
novel, and that the way the system is evaluated is methodologically
sound.
Assessment of the Paper
As mentioned in the contribution section, this paper's contribution is
the design and evaluation of a structured introduction to a peer
production community based on gamification, called "The Wikipedia
Adventure". This is a great idea and sounds like a useful addition to
Wikipedia. The paper is written in a way that makes it easy to read, and
provides the reader with a good introduction to how TWA's design is
rooted in theories of gamification, thus applying these principles in
what appears to be a novel setting. The paper also does a good job of
discussing the findings, organizing them in a way that is easy to follow
and touching on important points (e.g. cultural factors, and the
limitations of self-selection and gamification).
The overall ideas and approach taken in this paper are sound, and they are
in line with the criteria described previously. Unfortunately, there are
two major issues and several minor ones that need to be resolved before
this paper is ready for publication. The first major issue is that the
methodology used to evaluate performance in the invitation-based
experiment measures contribution in a skewed manner and does not
establish why that is appropriate. Secondly, the paper fails to consider
arguments put forth by Panciera et al's "Wikipedians are Born, Not Made"
paper. This review will expand on both of these major issues below.
Further below will be notes and comments with suggestions for improvement
for specific sections of the paper, some of which are rather substantial
as well.
1: Evaluating TWA effectiveness by number of contributions
----------------------------------------------------------
A major part of the paper is the evaluation of TWA's effect on subsequent
contributions. To evaluate this an invitation-based field experiment is
used, and the paper does a great job of justifying why that is
appropriate in this setting. The experiment runs from February 2014 for
three months. Exact dates are not given, so let us assume that
it ran until the end of April 2014. User contributions are then measured
until the end of May 2014.
There are two problems with this approach that the paper fails to address
properly. One is the issue of right-truncation found in the data.
Contributors who joined in early February 2014 would have about four
months to make edits, whereas those who joined in late April would only
have about a month. The model does contain a control variable for number
of days in the experiment, but why is that appropriate in this context?
If we examine other work in the same domain, they tend to either use a
much longer time period (e.g. the Teahouse paper, citation 23, which uses
6-9 months) or ensure that the time period is fixed (e.g. Kittur et al.
"Herding the Cats: The Influence of Groups in Coordinating Peer
Production", WikiSym 2009; or Zhu et al. "Effectiveness of Shared
Leadership in Online Communities", CSCW 2012).
Related to the right-truncation problem is the fact that the paper also
fails to discuss and justify what a reasonable timespan for measuring the
effect of TWA is, and that it will have an effect on the number of
contributions made. It might for instance be that TWA instead has an
effect on how long it takes before a user drops out of the system. If we
assume that TWA has an effect on contributions, what timespan is needed
to measure that effect? The paper assumes that a month is adequate to
discover it, whereas one might suspect that it is only measurable over a
longer period of time. If it is the case that a short period of time is
appropriate (for instance because these users are likely to drop out
after a certain amount of time) the paper needs to properly establish
that, either by measuring it or referring to previous work.
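To make the right-truncation concern concrete: one common alternative to a
plain "days in the experiment" covariate is to model an edit *rate* by
including the observation window as an exposure offset. The sketch below is
illustrative only; the column names and toy data are hypothetical, and this
is not the model the paper actually fit.

    # Illustrative sketch only: hypothetical column names and toy data,
    # not the paper's model or data.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "edits":         [0, 3, 12, 1, 0, 7],         # edits after inclusion
        "treated":       [1, 1, 1, 0, 0, 0],          # invited to TWA or not
        "days_observed": [110, 45, 90, 30, 120, 60],  # inclusion to May 31
    })

    # A Poisson GLM with a log-exposure offset models an edit *rate*, so an
    # account observed for one month and one observed for four months are
    # put on a comparable footing.
    fit = smf.glm("edits ~ treated", data=df,
                  family=sm.families.Poisson(),
                  exposure=df["days_observed"]).fit()
    print(fit.summary())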
2: Wikipedians Are Born, Not Made
---------------------------------
In their GROUP 2009 paper "Wikipedians Are Born, Not Made: A Study of
Power Editors on Wikipedia", Panciera et al. show data that argues that
those contributors who are going to stick around behave in a way that is
different from the very beginning. In followup work published in 2010
they find similar differences in another peer production community.
(Panciera et al. "Lurking? cyclopaths?: a quantitative lifecycle analysis
of user behavior in a geowiki." CHI 2010)
These two papers and the argument they put forth are relevant because
they question who TWA is designed for. In the related work a reference to
Bryant et al's "Becoming Wikipedian" is made, thereby suggesting that TWA
is designed to teach someone how to be a Wikipedian. As Panciera et al's
paper argues along the lines of these contributors already being
Wikipedians, should TWA be designed to instead help these contributors
stay productive?
If Wikipedians are born, not made, then one could also question whether
these contributors are at all going to use TWA. Maybe they ignore TWA
because they are already productive and do not need it? Since the paper
never makes any references to these papers or discusses issues related
to this (e.g. "is the Teahouse more effective since it allows them to get
answers when they need help?"), this whole topic area is left hanging.
---
Below follows comments/notes for each section of the paper.
Introduction:
* An overall issue here is that there are few citations to sources. For
instance a claim is made that "newly created accounts are the primary
source of spam and vandalism on Wikipedia". Consider adding a "[citation
needed]" after that.
* When citing multiple papers it is preferable that they are in order,
e.g "[14, 23, 17]" should be "[14, 17, 23]" (page 1). This minor issue
also occurs elsewhere in the paper.
* "Unlike prior systems, TWA creates a structured experience that guides
newcomers through critical pieces of Wikipedia knowledge..." Do we know
that there are no other prior systems that offer a similar experience? It
might be that there are none within the Wikipedia domain, but what about
outside it? That sentence is making a rather bold claim.
* After reading the introduction, what is the reader expected to remember
as the main findings in this paper? At the end of the introduction the
following sentence is found: "The study underscores the importance of
conducting multiple types of evaluations of social systems." Is that the
main contribution? What about the implications for gamified structured
introductions to peer production?
Background:
* "...women reported that they found that contributing to Wikipedia
involved a high level of conflict and that they lacked confidence in
their expertise [8]. This suggests that more effective onboarding tools
could help incorporate newcomers." This is an important side of
Wikipedia, but how does TWA's design help mitigate this issue? Are there
design elements in TWA that aim to boost confidence in one's expertise?
* At the end of the introduction we find the following two questions:
"Would a gamified tutorial produce a positive, enjoyable educational
experience for new Wikipedians? Would playing the tutorial impact
newcomer participation patterns?" These are the paper's _research
questions_! It would be very helpful to the reader if they were displayed
more clearly, e.g. as separate items. They should not be hidden.
System Design:
* "...it does not depend on the availability, helpfulness, or
intervention of existing Wikipedia editors..." The underlying argument
here is that scalability is preferable to personal interaction when
socializing newcomers (in peer production communities). Why is that the
better solution? As discussed previously, TWA might be designed for
contributors who are not going to stick around, why are those the right
audience for it? Is the goal to provide _everyone_ with a scalable
impersonale introduction, or is it better to provide _some_ (typically
based on self-selection) with a personal introduction (e.g. the
Teahouse)?
Game-like elements (subsection of System Design):
* In "Missions" a distinction is made between "basic" and "advanced"
editing techniques. It appears to be somewhat arbitrary: why is adding
sources advanced editing, but watchlists are not?
* Your readers might not know what watchlists are; take care to write for
a general audience, as not everyone knows a lot about how Wikipedia works
behind the scenes.
Study 1: User Survey:
* This paper doesn't discuss any other language editions of Wikipedia
besides the English one, and makes the assumption that "Wikipedia" equals
the English edition. Adding a mention that Wikipedia exists in multiple
languages and explaining why English was chosen as the language where
TWA was launched would be very helpful.
* The paper aims to measure "educational effectiveness". Why is a survey
the appropriate way to measure that? Based on the description of the
survey, it seems that it never asks specific questions to test whether
TWA's users learned specific things, in other words whether the education
was successful. Later, when describing the results, the phrase "learning to
edit Wikipedia" is used; isn't that the _key_ learning goal of TWA? Yet
the survey asks Likert-scale questions. In other words, you're measuring
whether TWA users are under the impression that they learned something,
not whether they actually did.
* Figure 4 uses counts. While it shows that none of the questions had
responses from all participants, it makes comparisons between questions
with different response rates very difficult. Using percentages would
allow for direct comparisons, and makes the references to the figure in
the text easier to follow along with. The text refers to four questions
with a certain percentage of responses, but leaves the math to the
reader.
* The survey leaves many questions unanswered, some of which the paper
might want to address. Were any negative questions asked? Were there any
control questions, such as a similar question worded slightly differently
to allow for comparison between responses? As it is, this survey comes
across as a set of positive statements about TWA that respondents agreed
to. Given that respondents self-select and no attempts to contact users
who didn't go through TWA appear to have been made, it is likely there
is a bias in the responses, and that bias should be discussed.
Study 2: Field Experiment:
* The description of how accounts were selected to be included is rather
confusing. First it describes 1,967 accounts that met the same criteria
as for the user survey; however, 10,000 individuals ("accounts"?) were
invited to the beta. Why is one an order of magnitude larger than the
other? Then in the second paragraph of "Methods" it describes the
selection criteria, that at least one contribution would have to be made
after getting invited. This would perhaps be much less confusing if the
criteria were first explained, particularly how the experiment and
control groups were set up, and then how many accounts were identified.
* "This is a larger proportion of users than took up the invitation in
Study 1, which may be due to changes in the invitation text." Earlier in
the paper study 1 refers to a "beta", whereas this one appears not to be. If
this is the case, this is an important difference between the two that
should be made clear to the reader.
* "we measure the overall contributions as the total number of edits made
by each account from the time of inclusion in the study until May 31,
2014." When exactly is "time of inclusion", is that when they got the
invite? What about when they completed one (or all) TWA mission(s)? The
concern here is that all contributions are measured, whereas the
experiment sets up a pre/post-scenario. Later on the paper refers to
"subsequent contributions", indicating that contributions after a certain
point in time was measured. This quickly becomes rather confusing,
spelling out clearly what points in a user's account history is used
(e.g. "we measure contributions at four points in time: when the user
registered their account, the time of invitation, when they first started
using TWA, and the end of the experiment") would be very helpful.
* Why is a six-edit radius chosen when measuring word persistence?
Halfaker et al. make no claim about what the radius should be in the
referenced work, and Ekstrand et al. suggest a 15-edit radius in a related
paper (Ekstrand and Riedl "rv you're dumb: identifying discarded work in
Wiki article history." WikiSym 2009). The six-edit radius also comes with
an issue that is unaddressed: how long does it take for an edit made by a
contributor in the study to reach that six-edit radius? If it hasn't been
reached at the end of the study period, that edit has to be discarded as
its quality is unknown. In a related paper, Farzan and Kraut instead
chose to use percentage of words that survived as a measure of quality
(Farzan and Kraut "Wikipedia classroom experiment: bidirectional benefits
of students' engagement in online production communities" CHI 2013). A toy
sketch of this kind of word-persistence measure appears after this list.
* Tables 1, 2, 3, and 4, as well as figure 6 should be brought closer
together so it's easier to follow along. Table 1 occurs before the text
that refers to it, and table 4 is two pages further along. Putting all
tables and figure 6 on the same page might be a good solution.
* Table 3 refers to users who "reached" a mission. It is confusing how 181
users reached the final mission but did not complete it, yet in the text
it seems these 181 users actually did.
* The post-hoc power analysis is very useful!
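Regarding the word-persistence point above, here is a toy illustration of
the general idea of measuring whether words added by an edit survive a
k-edit radius. It is a deliberate oversimplification (whitespace tokens,
no real diffing) and is not the measure used in the paper or in the cited
work.

    # Toy sketch of "word persistence within a k-edit radius": of the words
    # an edit adds, what fraction still appears k revisions later? A crude
    # bag-of-words simplification, not the paper's implementation.
    def added_words(before, after):
        """Words present after the edit that were not present before it."""
        return set(after.split()) - set(before.split())

    def persistence(revisions, edit_index, radius=6):
        """Fraction of words added by revisions[edit_index] that are still
        present `radius` edits later (or at the last available revision)."""
        added = added_words(revisions[edit_index - 1], revisions[edit_index])
        if not added:
            return 1.0
        later = revisions[min(edit_index + radius, len(revisions) - 1)]
        return len(added & set(later.split())) / len(added)

    revisions = [
        "wikipedia is an encyclopedia",
        "wikipedia is a free encyclopedia",           # adds "a" and "free"
        "wikipedia is a free online encyclopedia",
        "wikipedia is a collaborative encyclopedia",  # "free" is removed
    ]
    print(persistence(revisions, edit_index=1, radius=2))  # -> 0.5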
Discussion:
* "The new editors in our study may have had unpleasant experiences
during their initial time on Wikipedia..." It appears that the survey
asked no questions about this, yet is it not a very important issue
related to TWA's success?
* In "Limitations of gamification" the following sentence is found:
"...our study is among the first that compares levels of participation in
a task among individuals who were introduced to gamified learning first
to those that were not." This is an _important_ finding; it shouldn't be
hidden back here, but should instead be up front in the introduction!
Formatting and Reference Issues
------------------------ Submission 516, Review 3 ------------------------
Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
Reviewer: AC-Reviewer
Expertise
4 (Expert)
First Round Overall Recommendation
3 (Maybe acceptable (with significant modifications))
Contribution and Criteria for Evaluation
This paper presents the results of a deployment of a gamification-based
system designed to retain new editors in Wikipedia. It is a negative
results paper: the authors claim that they have conclusive evidence that
the system did not work (although I have suggested a few additional lines
of inquiry below that might problematize this assertion).
The committee will have to have a discussion about how to evaluate this
paper, and likely negative results papers more generally.
Assessment of the Paper
This paper presents the results of a deployment of a gamification-based
system designed to retain new editors in Wikipedia. It is a negative
results paper: the authors claim that they have conclusive evidence that
the system did not work (although I have suggested a few additional lines
of inquiry below that might problematize this assertion).
The paper is very well-written and has some large positives. It also is a
negative results paper, and the committee will have to decide how to
handle this. In general, I'm strongly sympathetic to arguments to
include more negative results papers in our proceedings, but I'm quite
unclear on the details of how to do so (e.g. what defines a top-quality
negative results paper?). I'm hopeful that this paper can instigate a
broader discussion on this topic at the PC meeting.
All of that said, this paper also has a number of idiosyncratic
limitations that make it perhaps not the best trial balloon for negative
results papers. Below, I outline what I believe to be the paper's
positives and then describe these limitations in more detail, phrased as
both critiques and questions.
Overall, my recommendation is to invite the authors to revise and
resubmit. If this occurs, I'll want to see the below critiques
addressed and the below questions answered (both through direct answers
in the response to reviewers and through clarifications and changes to
the paper). I'm hopeful that, through the R&R process, this paper
can become an ideal negative results trial balloon.
Important positives:
* The authors built a system to solve a real-life problem and did a
real-life, relatively large-scale deployment. Awesome!
* The paper is easily in the top 95% in terms of writing quality. This is
true both at the sentence level and at the narrative level. As a person
who has to review lots of papers, this was a breath of fresh air.
* The design of the game is quite well-thought-out, save a few relatively
arbitrary decisions. I was particularly compelled by the use of
gamification techniques that are also present in “real Wikipedia”
(e.g. barnstar-like rewards).
Critiques:
CRITIQUE #1 (Excessive import placed on trivial self-report data): It
is well-known that self-report data from participants is inferior to
observations of actual behavior, and that self-report data can be quite
unreliable more generally. As such, in my view, it is not a contribution
to show that self-report data didn't end up panning out in the
behavioral results.
In the next draft of this paper, I would like to see the authors address
this issue. This might mean framing this paper as a full-on negative
results paper, but lighter weight adaptations might be possible.
Open questions:
QUESTION #1: As noted above, this paper is a negative results paper at
its core, and we'll have to have a broad discussion about this at the
PC meeting, assuming the paper makes it this far. In the event that this
occurs, can the authors provide a more robust argument as to why these
negative results are important for other researchers and practitioners?
The paper attempts to argue that one contribution that comes out of its
negative results is to distrust self-report data, but this is well-known
(see below). The other negative results argument in the paper is that
these results add to growing evidence of long-term gamification
failures. I find this argument much more compelling. In other words, by
expanding on this argument, the authors may be able to address this
question.
That said, regardless of how this question is addressed in the second
draft, I'd like to see it done both through changes to the paper and
through discussion in the response to reviewers.
QUESTION #2: Is there a possibility that the statistical framework
employed is not appropriate for this particular study?
The authors utilize a two-level statistical approach that I haven't
seen before in the CSCW/CHI literature. I enjoyed thinking about this
approach, and the authors did a relatively good job explaining it. That
said, I'm currently not convinced that it was the appropriate framework
for this study. Here's my reasoning:
(1) The goal here is to introduce a treatment that ultimately will
produce strong new members of the Wikipedia community at a higher rate
than the control.
(2) Let's say the game produces 3 such members out of 100 new editors
and the control produces 1, which looks like it might be the case.
Let's also say that this pattern additionally persists over a large n.
(3) If this is true, why do we care about the potentially moderating
effect of the invitations?
The authors argue that new editors that responded to the invitation to
play the game might just be new editors who are engaged and, critically,
would have been power editors whether or not the game existed. However,
barring a random fluke, shouldn't these future power editors also have
been in the control group? If I'm right here, I'm thinking the
invitation doesn't matter and a more traditional statistical analysis
(or at least one targeted at identifying rare events) is appropriate.
I could be wrong, but I want the authors to respond to this question,
both through feedback to reviewers and clarifications in the paper.
As an important side note, if we agree that this framework is the right
way to go in the end, the authors should puff their chests more about
this by claiming it as a contribution (assuming it hasn't been used at
CSCW before).
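To ground the rare-events point in Question #2, one simple way to
operationalize "a more traditional statistical analysis" of the reviewer's
hypothetical 3-in-100 versus 1-in-100 scenario is a test on two
proportions, such as Fisher's exact test. The numbers below come from the
reviewer's hypothetical, not from the paper's results.

    # Reviewer's hypothetical: 3 of 100 treated newcomers vs. 1 of 100
    # controls go on to become "power editors". Fisher's exact test is one
    # simple, rare-event-friendly comparison of the two proportions; with n
    # this small the difference is unlikely to be significant, which is the
    # sample-size point behind the reviewer's "large n" caveat.
    from scipy.stats import fisher_exact

    table = [[3, 97],   # treatment: power editors, non-power editors
             [1, 99]]   # control
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(odds_ratio, p_value)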
QUESTION #3: Are the outcome variables considered here the best
outcome variables? Are some critical variables missing?
The authors seem focused on the average effects across the entire control
and treatment groups (the two treatment groups, to be specific). However,
would it not also be reasonable to consider the metric I describe above:
the % of new editors that go on to be power editors? Since power editors
end up contributing most of the edits anyway *over the long term*, to me
this seems like the way to go (i.e. if this group of editors were
followed for years, statistically significant differences would begin to
emerge). If the authors agree, the authors need to reanalyze their data
with this metric in mind.
Another related outcome variable that might be useful to analyze is how
long the new editors in each group remained active editors in the
community (i.e. survival analysis). Because the data is quite old, this
should be an easy new analysis to run, and longevity has been a variable
of interest in a number of peer production studies.
In their second draft and the feedback to reviewers, I would like to see
the authors discuss either new analyses related to power users or why they
did not consider this outcome variable. I would also like to see the same
for survival analysis.
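For the survival-analysis suggestion, a minimal sketch of what such an
analysis could look like is given below, using the lifelines package with
made-up durations; the variable names and data are hypothetical and this is
not an analysis from the paper.

    # Minimal Kaplan-Meier / log-rank sketch for "how long did newcomers in
    # each group remain active?" -- hypothetical toy data, not the paper's.
    import pandas as pd
    from lifelines import KaplanMeierFitter
    from lifelines.statistics import logrank_test

    df = pd.DataFrame({
        "days_active": [5, 40, 90, 12, 3, 75, 30, 8],  # days until last edit
        "dropped_out": [1, 1, 0, 1, 1, 0, 1, 1],       # 0 = still active (censored)
        "treated":     [1, 1, 1, 1, 0, 0, 0, 0],       # invited to TWA or not
    })

    kmf = KaplanMeierFitter()
    for label, group in df.groupby("treated"):
        kmf.fit(group["days_active"], event_observed=group["dropped_out"],
                label=f"treated={label}")
        print(kmf.median_survival_time_)

    # Log-rank test compares the two retention curves.
    t, c = df[df.treated == 1], df[df.treated == 0]
    result = logrank_test(t["days_active"], c["days_active"],
                          event_observed_A=t["dropped_out"],
                          event_observed_B=c["dropped_out"])
    print(result.p_value)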
QUESTION #4: Is there a path towards positive results?
As noted above, I believe some discussion around this paper and negative
results papers more generally will have to happen at the PC meeting.
However, I think there are some missed opportunities here for positive
results and that the authors were too quick to settle for negative
results. This is likely an important factor to consider when deciding
whether to accept a negative results paper.
Most notably, there are several well-motivated, unexplored avenues that
could lead to positive results that would have a much larger impact than
the negative results presented here:
* As noted above, examining additional outcome variables is important,
most notably # of power editors and longevity.
* Does the game work if folks are forced to play it prior to editing
Wikipedia, as would be the case in most other institutionalized
socialization contexts? This is not just a hypothetical: this game could
be used in all Wikipedia Education Project classes and related endeavors.
Formatting and Reference Issues

File diff suppressed because it is too large

File diff suppressed because it is too large