From: <papers2017@cscw.acm.org>
Date: Tue, Jul 12, 2016 at 11:15 PM
Subject: CSCW 2017 notification - #516
To: snehanarayan@gmail.com
Cc: papers2017@cscw.acm.org

Dear Sneha Narayan -

Congratulations!

Your paper:

516 - The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

is one of the 52% of CSCW 2017 submissions invited to revise and resubmit. There were 530 total submissions to CSCW 2017, a similar number to last year. The reviewers for this submission believe that it has the potential to be revised within four weeks -- revisions are due August 9, 2016 -- to become a contribution to what will be an exceptional conference.

The program committee expects all authors to take advantage of this four-week revision period to improve their submissions by addressing reviewers' comments (below). Some submissions need only minor revisions, while others will require considerable work over the next four weeks to result in an acceptable submission, and will not succeed without significant effort. Your reviews, especially the summary report from the Coordinator, should make clear what you should do. You can gauge your prospects from your reviews and the summary report: overall scores of 4 and 5 indicate the reviewers are very confident your paper will be acceptable within four weeks with small edits. Overall scores of 3 and 4 indicate you have some work to do. Scores of 3 and below indicate that some reviewers have serious reservations, though other reviewers see promise.

The same reviewers will read and evaluate your revised submission (though additional reviewers may be added for papers where the reviewers are divided). You need not satisfy every reviewer or make every suggested change, but your revision will need to convince most of the reviewers that it is now ready for publication. For some papers the reviewers have requested a lot of work, and you might feel that it is too much to achieve in a four-week period. If you have the time to reach that goal: great! If not, that is okay; you are free to withdraw your submission. Please decide whether or not the key points made by reviewers can be adequately addressed in the time provided, given other demands on your time. If you choose to withdraw your paper, please notify us explicitly at papers2017@cscw.acm.org. Papers that are revised and re-submitted in the next round will receive revised reviews.

Your revision must be accompanied by a separate "Summary of Changes" document (in PDF format) that lists the reviewers' comments and your responses, even for comments that did not lead to changes in the manuscript (in which case you might explain why you chose not to make certain suggested changes). This could be a set of bullet points, a table, or numbered points by which reviewers' comments are summarized along with your changes. This is not a rebuttal, but rather a description of changes made, or of reasons you could not or chose not to take the reviewers' advice. To become acceptable, your submission must be revised, and your document describing the changes will greatly help reviewers see what you have or have not changed, along with your reasons for doing so.

Just to be clear, you must submit a revised paper and summary of changes by the deadline. Any paper where a revision and summary are not submitted will be considered withdrawn.

Example summaries from past years' papers can be found at http://bit.ly/16U8BGM.

Please submit your revision and the response document at your "Submissions in Progress" page at https://precisionconference.com/~cscw17a/ by 11:59 PM PDT, August 9, 2016.

CSCW 2017 will be a great conference, and we sincerely hope you are part of it! If you have any issues or questions, please let us know. And thanks again for submitting.

Sincerely,
Louise Barkhuus, Marcos Borges, Wendy A. Kellogg
CSCW 2017 Co-chairs

------------------------ Submission 516, Review 4 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

Reviewer: AC

Expertise

2 (Passing Knowledge)

First Round Overall Recommendation

3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

The paper presents the design and evaluation of a gamified tool for socializing and retaining new Wikipedia editors. Contribution criteria include (1) a description and rationale for the system; (2) system novelty and rationale for how it leads to learning; and (3) a methodologically sound evaluation.

First Round Review (if needed)

Coordinator's First-Round Report to Authors

The paper presents the design and evaluation of a gamified tool for socializing and retaining new Wikipedia editors. The study found that users liked, but did not learn from, the system.

The focus on improving the experience of newcomers in Wikipedia is relevant and important. Reviewers describe the study as well motivated and exceptionally well-written. Read R3's comments on the writing quality and congratulate yourself!

The reviewers, however, have many concerns about the paper, each focusing on a different aspect of the work. The concerns the reviewers note /may/ be addressable during the revise and resubmit period, but doing so will require an exceptionally herculean effort. Also, please keep in mind that there is no guarantee of acceptance even after making changes. So it is at the authors' discretion whether to proceed with revisions or withdraw the paper.

The reviewers are split as to whether the failure of the tool is interesting or not. R1 raises concerns that the failure of the tool could be predicted from existing literature, suggesting little rationale for doing the work in the first place. R2 asks whether there is something fundamentally different about people who continue to contribute to Wikipedia, and as such whether the system holds value in practice. R3, on the other hand, sees much value in the systems contribution of the work as well as the real-world evaluation. R3's review has some suggestions of alternative framings that may make the contribution more valuable.

The treatment of related work needs many improvements. R1 notes that the discussion of the well-known concept of legitimacy/authenticity in learning environments is missing. R2 also points to missing literature about Wikipedian experience.

R2 and R3 raise a number of methodological questions about the paper. R2 suggests the distribution of participants across the timeline may bias the results. R3, on the other hand, sees opportunity here, suggesting additional statistical analysis related to longevity and power users. Both R2 and R3 question the methodological choice and contribution of measuring perceptions of learning rather than actual learning. Overall, this points to a need for, at the very least, justifying the methodological choices and, at most, carrying out additional statistical analyses.

In summary, there is quite a bit of work to be done. I wish the authors the best of luck, should they choose to continue in the review process.

Requested Revisions

REQUIRED:
- Provide justification for why the study was worth carrying out, in response to R1 and R2's concerns. R3's review may have some insight into alternative framings.
- State the research questions more explicitly, as per R2's recommendation.
- Address R1 and R2's concerns about missing literature.
- Ensure that the narrative around Wikipedia is clear to readers who do not have an in-depth background in production/editing details.
- Improve the clarity of the results by using percentages or another baseline that allows comparison between numbers, as per R2's review.
- Provide justification for measuring perceptions of learning versus actual learning.
- Provide a robust discussion of why the results are meaningful for researchers and/or practitioners.

OPTIONAL, RECOMMENDED:
- Consider carrying out additional statistical analyses as recommended by R3.
- Provide a short justification for use of English-language Wikipedia, as per R2's review.

Formatting and Reference Issues

------------------------ Submission 516, Review 1 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

Expertise

4 (Expert)

First Round Overall Recommendation

3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

In this paper, the authors present the design and two-pronged evaluation of a tutorial for new Wikipedia editors that uses elements of gamification like missions and badges to help coach new editors and help them learn best practices and social norms of Wikipedia. The outcome is that users like the system but, based on behavioral measures, they don't actually learn from it. Learning interventions are a classic kind of research problem, and the paper should include robust measures of learning, as well as a good description of the designed intervention itself, why the design is expected to lead to learning, and a clear description of the study.

Assessment of the Paper

This is a reasonably well-motivated study with connections to appropriate literature, and the writing is engaging and understandable. The problem of enculturating newcomers into projects like Wikipedia is well documented, and this paper investigates a potential intervention with an admirably well-planned study. Designing learning interventions is really difficult, and I commend the authors on a well-executed effort.

Still, I am ambivalent about the paper because I would have predicted these outcomes based on the literature alone. In the discussion, the authors note that one mismatch between Wikipedia and the tutorial as designed involves the "gradual peripheral participation" of newcomers as they take on the identity of "Wikipedian." They suggest that maybe speeding up this process is unnatural. I would argue that the most important concept from the literature on learning is missing from this discussion, and that's "legitimacy" (also sometimes referred to in the education and learning literature as "authenticity"). The authors explain that by doing tasks in a pretend version of Wikipedia, they make it a safe space for newcomers to practice, yet performing "canned" tasks in a pretend system is the opposite of offering a legitimate form of participation. I immediately wonder: why not use what we know from the literature to create low-risk missions that newcomers can complete while legitimately contributing to the encyclopedia? Risk taking is a fundamental characteristic of games that makes them engaging; it certainly seems like it would play a role in people's motivation in a scenario like this. Rather than eliminating risk, the literature on legitimate peripheral participation would suggest that finding the right degree of risk is required to facilitate progressive entree into a set of shared practices.

I am disappointed by the missed opportunity here: the outcome mainly seems to verify that what we know shouldn't work based on the literature in fact doesn't work. Still, the paper isn't bad, and the study is carefully crafted and reported.

With some extension and reflection, I think the discussion could help point future research in a more fruitful direction. There are millions of pages written on the challenges of designing learning interventions that change people's behavior, yet this paper ends on a painfully obvious note. It's true that usability isn't all it takes, but what can we learn from TWA about the design of systems to facilitate enculturation into a community of practice? What can we take away from this that might inform more successful tutorial systems in the future?

Formatting and Reference Issues

------------------------ Submission 516, Review 2 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

Expertise

4 (Expert)

First Round Overall Recommendation

2 (Probably NOT acceptable)

Contribution and Criteria for Evaluation

This paper's contribution is the design and evaluation of a structured introduction to a peer production community (English Wikipedia) called "The Wikipedia Adventure". TWA's design is rooted in theories of gamification, and its utility is evaluated through a user survey and an invitation-based field experiment. The paper reports on the survey respondents' satisfaction with TWA, and how the experiment results reveal some of the challenges of effecting lasting changes to contributor patterns in peer production communities. These findings are then discussed in relation to cultural factors in Wikipedia, issues of self-selection and voluntary participation, and the limitations of gamification.

When evaluating a paper that describes the design of a system, the two main criteria are that the system and/or its development setting is/are novel, and that the way the system is evaluated is methodologically sound.

Assessment of the Paper

As mentioned in the contribution section, this paper's contribution is the design and evaluation of a structured introduction to a peer production community based on gamification, called "The Wikipedia Adventure". This is a great idea and sounds like a useful addition to Wikipedia. The paper is written in a way that makes it easy to read, and provides the reader with a good introduction to how TWA's design is rooted in theories of gamification, thus applying these principles in what appears to be a novel setting. The paper also does a good job of discussing the findings, organizing them in a way that is easy to follow and touching on important points (e.g. cultural factors, and the limitations of self-selection and gamification).

The overall ideas and approach taken in this paper are sound, and they are in line with the criteria described previously. Unfortunately, there are two major issues and several minor ones that need to be resolved before this paper is ready for publication. The first major issue is that the methodology used to evaluate performance in the invitation-based experiment measures contribution in a skewed manner and does not establish why that is appropriate. Secondly, the paper fails to consider arguments put forth by Panciera et al.'s "Wikipedians Are Born, Not Made" paper. This review will expand on both of these major issues below, followed by notes and comments with suggestions for improvement for specific sections of the paper, some of which are rather substantial as well.

1: Evaluating TWA effectiveness by number of contributions
----------------------------------------------------------

A major part of the paper is the evaluation of TWA's effect on subsequent contributions. To evaluate this, an invitation-based field experiment is used, and the paper does a great job of justifying why that is appropriate in this setting. The experiment ran for three months, starting in February 2014. Exact dates are not given, so let us assume that it ran until the end of April 2014. User contributions are then measured until the end of May 2014.

There are two problems with this approach that the paper fails to address properly. One is the issue of right-truncation found in the data. Contributors who joined in early February 2014 would have about four months to make edits, whereas those who joined in late April would only have about a month. The model does contain a control variable for the number of days in the experiment, but why is that appropriate in this context? If we examine other work in the same domain, it tends to either use a much longer time period (e.g. the Teahouse paper, citation 23, which uses 6-9 months) or ensure that the time period is fixed (e.g. Kittur et al. "Herding the Cats: The Influence of Groups in Coordinating Peer Production", WikiSym 2009; or Zhu et al. "Effectiveness of Shared Leadership in Online Communities", CSCW 2012). A sketch of the fixed-window alternative follows below.

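For concreteness, here is a minimal sketch of that fixed-window alternative in Python (the column names are hypothetical, since I do not have access to the data):

    from datetime import timedelta
    import pandas as pd

    WINDOW = timedelta(days=30)            # identical window for every account
    DATA_END = pd.Timestamp("2014-05-31")  # last day with complete data

    def fixed_window_edit_counts(users, edits):
        """users: one row per account with 'user_id' and 'included_at';
        edits: one row per edit with 'user_id' and 'timestamp'.
        Counts each account's edits within WINDOW of inclusion, dropping
        accounts whose window extends past the end of the data."""
        observable = users[users["included_at"] + WINDOW <= DATA_END]
        merged = edits.merge(observable, on="user_id")
        in_window = merged[
            (merged["timestamp"] >= merged["included_at"])
            & (merged["timestamp"] < merged["included_at"] + WINDOW)
        ]
        counts = in_window.groupby("user_id").size()
        # Accounts with zero edits in the window are still observations.
        return counts.reindex(observable["user_id"], fill_value=0)

This makes every account's observation period identical, at the cost of discarding the accounts that were included last.
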
Related to the right-truncation problem is the fact that the paper also fails to discuss and justify what a reasonable timespan for measuring the effect of TWA is, and how that choice affects the number of contributions measured. It might for instance be that TWA instead has an effect on how long it takes before a user drops out of the system. If we assume that TWA has an effect on contributions, what timespan is needed to measure that effect? The paper assumes that a month is adequate to discover it, whereas one might suspect that it is only measurable over a longer period of time. If it is the case that a short period of time is appropriate (for instance because these users are likely to drop out after a certain amount of time), the paper needs to properly establish that, either by measuring it or referring to previous work.

2: Wikipedians Are Born, Not Made
---------------------------------

In their GROUP 2009 paper "Wikipedians Are Born, Not Made: A Study of Power Editors on Wikipedia", Panciera et al. show data arguing that those contributors who are going to stick around behave differently from the very beginning. In followup work published in 2010, they find similar differences in another peer production community (Panciera et al. "Lurking? cyclopaths?: a quantitative lifecycle analysis of user behavior in a geowiki." CHI 2010).

These two papers and the argument they put forth are relevant because they question who TWA is designed for. In the related work, a reference to Bryant et al.'s "Becoming Wikipedian" is made, thereby suggesting that TWA is designed to teach someone how to be a Wikipedian. As Panciera et al.'s paper argues along the lines of these contributors already being Wikipedians, should TWA instead be designed to help these contributors stay productive?

If Wikipedians are born, not made, then one could also question whether these contributors are going to use TWA at all. Maybe they ignore TWA because they are already productive and do not need it? Since the paper never references these papers or discusses the issues they raise (e.g. "is the Teahouse more effective since it allows them to get answers when they need help?"), this whole topic area is left hanging.

---
Below follows comments/notes for each section of the paper.

Introduction:
* An overall issue here is that there are few citations to sources. For instance, a claim is made that "newly created accounts are the primary source of spam and vandalism on Wikipedia". Consider a "[citation needed]" added after that.
* When citing multiple papers it is preferable that they are in order, e.g. "[14, 23, 17]" should be "[14, 17, 23]" (page 1). This minor issue also occurs elsewhere in the paper.
* "Unlike prior systems, TWA creates a structured experience that guides newcomers through critical pieces of Wikipedia knowledge..." Do we know that there are no other prior systems that offer a similar experience? It might be that there are none within the Wikipedia domain, but what about outside it? That sentence is making a rather bold claim.
* After reading the introduction, what is the reader expected to remember as the main findings in this paper? At the end of the introduction the following sentence is found: "The study underscores the importance of conducting multiple types of evaluations of social systems." Is that the main contribution? What about the implications for gamified structured introductions to peer production?

Background:
* "...women reported that they found that contributing to Wikipedia involved a high level of conflict and that they lacked confidence in their expertise [8]. This suggests that more effective onboarding tools could help incorporate newcomers." This is an important side of Wikipedia, but how does TWA's design help mitigate this issue? Are there design elements in TWA that aim to boost confidence in one's expertise?
* At the end of the introduction we find the following two questions: "Would a gamified tutorial produce a positive, enjoyable educational experience for new Wikipedians? Would playing the tutorial impact newcomer participation patterns?" These are the paper's _research questions_! It would be very helpful to the reader if they were displayed more clearly, e.g. as separate items. They should not be hidden.

System Design:
* "...it does not depend on the availability, helpfulness, or intervention of existing Wikipedia editors..." The underlying argument here is that scalability is preferable to personal interaction when socializing newcomers (in peer production communities). Why is that the better solution? As discussed previously, TWA might be designed for contributors who are not going to stick around, so why are those the right audience for it? Is the goal to provide _everyone_ with a scalable, impersonal introduction, or is it better to provide _some_ (typically based on self-selection) with a personal introduction (e.g. the Teahouse)?

Game-like elements (subsection of System Design):
* In "Missions" a distinction is made between "basic" and "advanced" editing techniques. It appears to be somewhat arbitrary: why is adding sources advanced editing, but watchlists are not?
* Your readers might not know what watchlists are; take care to write for a general audience, as not everyone knows a lot about how Wikipedia works behind the scenes.

Study 1: User Survey:
* This paper doesn't discuss any other language editions of Wikipedia besides the English one, and makes the assumption that "Wikipedia" equals the English edition. Adding a mention that Wikipedia exists in multiple languages and explaining why English was chosen as the language where TWA was launched would be very helpful.
* The paper aims to measure "educational effectiveness". Why is a survey the appropriate way to measure that? Based on the description of the survey, it seems that it never asks specific questions to test whether TWA's users learned specific things, in other words whether the education was successful. Later, when describing the results, the phrase "learning to edit Wikipedia" is used; isn't that the _key_ learning goal of TWA? Yet the survey asks Likert-scale questions. In other words, you're measuring whether TWA users are under the impression that they learned something, not whether they actually did.
* Figure 4 uses counts. While it shows that none of the questions had responses from all participants, it makes comparisons between questions with different response rates very difficult. Using percentages would allow for direct comparisons, and would make the references to the figure in the text easier to follow along with. The text refers to four questions with a certain percentage of responses, but leaves the math to the reader. (See the sketch after this list.)
* The survey leaves many questions unanswered, some of which the paper might want to address. Were any negative questions asked? Were there any control questions, such as a similar question worded slightly differently to allow for comparison between responses? As it is, this survey comes across as a set of positive statements about TWA that respondents agreed to. Given that respondents self-select and no attempt to contact users who didn't go through TWA appears to have been made, it is likely there is a bias in the responses, and that bias should be discussed.

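To illustrate the percentages point above, a minimal sketch (the question labels and counts are hypothetical, not taken from the paper):

    import pandas as pd

    # Hypothetical per-question Likert response counts.
    counts = pd.DataFrame(
        {"disagree": [3, 4], "neutral": [10, 8],
         "agree": [40, 35], "strongly agree": [30, 22]},
        index=["TWA was fun", "I learned to edit"],
    )

    # Normalize each question by its own response total, so questions
    # with different response rates become directly comparable.
    percentages = counts.div(counts.sum(axis=1), axis=0) * 100
    print(percentages.round(1))
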
Study 2: Field Experiment:
* The description of how accounts were selected for inclusion is rather confusing. First it describes 1,967 accounts that met the same criteria as for the user survey, yet 10,000 individuals ("accounts"?) were invited to the beta. Why is one an order of magnitude larger than the other? Then, in the second paragraph of "Methods", it describes the selection criteria, that at least one contribution would have to be made after getting invited. This would perhaps be much less confusing if the criteria were explained first, particularly how the experiment and control groups were set up, and then how many accounts were identified.
* "This is a larger proportion of users than took up the invitation in Study 1, which may be due to changes in the invitation text." Earlier in the paper, Study 1 refers to a "beta", whereas this appears not to be one. If this is the case, it is an important difference between the two that should be made clear to the reader.
* "we measure the overall contributions as the total number of edits made by each account from the time of inclusion in the study until May 31, 2014." When exactly is "time of inclusion"? Is that when they got the invite? What about when they completed one (or all) TWA mission(s)? The concern here is that all contributions are measured, whereas the experiment sets up a pre/post scenario. Later on, the paper refers to "subsequent contributions", indicating that contributions after a certain point in time were measured. This quickly becomes rather confusing; spelling out clearly which points in a user's account history are used (e.g. "we measure contributions at four points in time: when the user registered their account, the time of invitation, when they first started using TWA, and the end of the experiment") would be very helpful.
* Why is a six-edit radius chosen when measuring word persistence? Halfaker et al. make no claim about what the radius should be in the referenced work, and Ekstrand and Riedl suggest a 15-edit radius in a related paper (Ekstrand and Riedl "rv you're dumb: identifying discarded work in Wiki article history." WikiSym 2009). The six-edit radius also comes with an issue that is unaddressed: how long does it take for an edit made by a contributor in the study to reach that six-edit radius? If it hasn't been reached by the end of the study period, that edit has to be discarded, as its quality is unknown. In a related paper, Farzan and Kraut instead chose to use the percentage of words that survived as a measure of quality (Farzan and Kraut "Wikipedia classroom experiment: bidirectional benefits of students' engagement in online production communities" CHI 2013).
* Tables 1, 2, 3, and 4, as well as figure 6, should be brought closer together so it's easier to follow along. Table 1 occurs before the text that refers to it, and table 4 is two pages further along. Putting all tables and figure 6 on the same page might be a good solution.
* Table 3 refers to users who "reached" a mission. It is confusing how 181 users reached the final mission but did not complete it, yet in the text it seems these 181 users actually did.
* The post-hoc power analysis is very useful! (A sketch of this kind of analysis follows after this list.)

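For reference, a minimal sketch of this kind of post-hoc power computation (the effect size and group size below are placeholders, not the paper's actual numbers):

    from statsmodels.stats.power import TTestIndPower

    # Substitute the observed effect size (Cohen's d) and the actual
    # per-group sample sizes from the experiment.
    power = TTestIndPower().solve_power(effect_size=0.1, nobs1=900,
                                        ratio=1.0, alpha=0.05)
    print(f"Power to detect d=0.1 with 900 per group: {power:.2f}")
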
Discussion:
* "The new editors in our study may have had unpleasant experiences during their initial time on Wikipedia..." It appears that the survey asked no questions about this, yet is it not a very important issue related to TWA's success?
* In "Limitations of gamification" the following sentence is found: "...our study is among the first that compares levels of participation in a task among individuals who were introduced to gamified learning first to those that were not." This is an _important_ finding; it shouldn't be hidden back here but instead be up front in the introduction!

Formatting and Reference Issues

------------------------ Submission 516, Review 3 ------------------------

Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

Reviewer: AC-Reviewer

Expertise

4 (Expert)

First Round Overall Recommendation

3 (Maybe acceptable (with significant modifications))

Contribution and Criteria for Evaluation

This paper presents the results of a deployment of a gamification-based system designed to retain new editors in Wikipedia. It is a negative results paper: the authors claim that they have conclusive evidence that the system did not work (although I have suggested a few additional lines of inquiry below that might problematize this assertion).

The committee will have to have a discussion about how to evaluate this paper, and likely negative results papers more generally.

Assessment of the Paper

This paper presents the results of a deployment of a gamification-based system designed to retain new editors in Wikipedia. It is a negative results paper: the authors claim that they have conclusive evidence that the system did not work (although I have suggested a few additional lines of inquiry below that might problematize this assertion).

The paper is very well-written and has some large positives. It is also a negative results paper, and the committee will have to decide how to handle this. In general, I'm strongly sympathetic to arguments to include more negative results papers in our proceedings, but I'm quite unclear on the details of how to do so (e.g. what defines a top-quality negative results paper?). I'm hopeful that this paper can instigate a broader discussion on this topic at the PC meeting.

All of that said, this paper also has a number of idiosyncratic limitations that make it perhaps not the best trial balloon for negative results papers. Below, I outline what I believe to be the paper's positives and then describe these limitations in more detail, phrased as both critiques and questions.

Overall, my recommendation is to invite the authors to revise and resubmit. If this occurs, I'll want to see the below critiques addressed and the below questions answered (both through direct answers in the response to reviewers and through clarifications and changes to the paper). I'm hopeful that, through the R&R process, this paper can become an ideal negative results trial balloon.

Important positives:

* The authors built a system to solve a real-life problem and did a real-life, relatively large-scale deployment. Awesome!
* The paper is easily in the 95th percentile in terms of writing quality. This is true both at the sentence level and at the narrative level. As a person who has to review lots of papers, this was a breath of fresh air.
* The design of the game is quite well-thought-out, save a few relatively arbitrary decisions. I was particularly compelled by the use of gamification techniques that are also present in "real Wikipedia" (e.g. barnstar-like rewards).

Critiques:

CRITIQUE #1 – Excessive import placed on trivial self-report data: It is well-known that self-report data from participants is inferior to observations of actual behavior, and that self-report data can be quite unreliable more generally. As such, in my view, it is not a contribution to show that self-report data didn't end up panning out in the behavioral results.

In the next draft of this paper, I would like to see the authors address this issue. This might mean framing this paper as a full-on negative results paper, but lighter-weight adaptations might be possible.

Open questions:

QUESTION #1: As noted above, this paper is a negative results paper at its core, and we'll have to have a broad discussion about this at the PC meeting, assuming the paper makes it this far. In the event that this occurs, can the authors provide a more robust argument as to why these negative results are important for other researchers and practitioners?

The paper attempts to argue that one contribution that comes out of its negative results is to distrust self-report data, but this is well-known (see above). The other negative results argument in the paper is that these results add to growing evidence of long-term gamification failures. I find this argument much more compelling. In other words, by expanding on this argument, the authors may be able to address this question.

That said, regardless of how this question is addressed in the second draft, I'd like to see it done both through changes to the paper and through discussion in the response to reviewers.

QUESTION #2 – Is there a possibility that the statistical framework employed is not appropriate for this particular study?

The authors utilize a two-level statistical approach that I haven't seen before in the CSCW/CHI literature. I enjoyed thinking about this approach, and the authors did a relatively good job explaining it. That said, I'm currently not convinced that it was the appropriate framework for this study. Here's my reasoning:

(1) The goal here is to introduce a treatment that ultimately will produce strong new members of the Wikipedia community at a higher rate than the control.
(2) Let's say the game produces 3 such members out of 100 new editors and the control produces 1, which looks like it might be the case. Let's also say that this pattern additionally persists over a large n.
(3) If this is true, why do we care about the potentially moderating effect of the invitations?

The authors argue that new editors who responded to the invitation to play the game might just be new editors who are engaged and, critically, would have been power editors whether or not the game existed. However, barring a random fluke, shouldn't these future power editors also have been in the control group? If I'm right here, I'm thinking the invitation doesn't matter and a more traditional statistical analysis (or at least one targeted at identifying rare events) is appropriate; a sketch follows below.

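To illustrate what such a rare-events analysis could look like, here is a minimal sketch using the hypothetical 3-in-100 versus 1-in-100 rates from point (2) above (these are not the paper's actual numbers):

    from scipy.stats import fisher_exact

    # Rows: treatment and control; columns: [power editors, everyone else].
    table = [[3, 97],   # 3 of 100 invited new editors became power editors
             [1, 99]]   # 1 of 100 control editors did
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
    # At n = 100 per group this difference is not significant, which is
    # why the pattern would need to persist over a large n.
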
I could be wrong, but I want the authors to respond to this question, both through feedback to reviewers and clarifications in the paper.

As an important side note, if we agree that this framework is the right way to go in the end, the authors should puff their chests more about this by claiming it as a contribution (assuming it hasn't been used at CSCW before).

QUESTION #3 – Are the outcome variables considered here the best outcome variables? Are some critical variables missing?

The authors seem focused on the average effects across the entire control and treatment groups (the two treatment groups, to be specific). However, would it not also be reasonable to consider the metric I describe above: the % of new editors who go on to be power editors? Since power editors end up contributing most of the edits anyway *over the long term*, to me this seems like the way to go (i.e. if this group of editors were followed for years, statistically significant differences would begin to emerge). If the authors agree, they need to reanalyze their data with this metric in mind.

Another related outcome variable that might be useful to analyze is how long the new editors in each group remained active editors in the community (i.e. survival analysis). Because the data is quite old, this should be an easy new analysis to run, and longevity has been a variable of interest in a number of peer production studies.

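For concreteness, a minimal sketch of the survival analysis I have in mind (using the lifelines package; the column names are hypothetical):

    from lifelines import KaplanMeierFitter
    from lifelines.statistics import logrank_test

    def compare_retention(df):
        """df: one row per new editor with 'days_active' (time from first
        edit to last observed edit), 'dropped_out' (True if the editor
        stopped editing before the end of the observation window), and
        'group' ('treatment' or 'control')."""
        treat = df[df["group"] == "treatment"]
        ctrl = df[df["group"] == "control"]
        kmf = KaplanMeierFitter()
        kmf.fit(treat["days_active"], event_observed=treat["dropped_out"])
        # kmf.survival_function_ now gives the treatment retention curve;
        # refit on ctrl for the control curve, then compare the groups:
        result = logrank_test(
            treat["days_active"], ctrl["days_active"],
            event_observed_A=treat["dropped_out"],
            event_observed_B=ctrl["dropped_out"],
        )
        return result.p_value
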
In their second draft and the feedback to reviewers, I would like to see the authors discuss either new analyses related to power users or why they did not consider this outcome variable. I would also like to see the same for survival analysis.

QUESTION #4: Is there a path towards positive results?

As noted above, I believe some discussion around this paper and negative results papers more generally will have to happen at the PC meeting. However, I think there are some missed opportunities here for positive results, and that the authors were too quick to settle for negative results. This is likely an important factor to consider when deciding whether to accept a negative results paper.

Most notably, there are several well-motivated, unexplored avenues that could lead to positive results that would have a much larger impact than the negative results presented here:

* As noted above, examining additional outcome variables is important, most notably # of power editors and longevity.
* Does the game work if folks are forced to play it prior to editing Wikipedia, as would be the case in most other institutionalized socialization contexts? This is not just a hypothetical: this game could be used in all Wikipedia Education Project classes and related endeavors.

Formatting and Reference Issues