
added material for TWA 2017

This commit is contained in:
Benjamin Mako Hill 2019-06-11 17:54:32 -07:00
parent 3e6d27447e
commit 4fc082a8ac
6 changed files with 3907 additions and 0 deletions


@@ -0,0 +1,7 @@
Material for paper:
Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron
Shaw. 2017. “The Wikipedia Adventure: Field Evaluation of an Interactive
Tutorial for New Users.” In Proceedings of the 20th ACM Conference on
Computer-Supported Cooperative Work & Social Computing (CSCW 17). New York,
New York: ACM. https://doi.org/10.1145/2998181.2998307

File diff suppressed because it is too large


@@ -0,0 +1,718 @@
From: <papers2017@cscw.acm.org>
Date: Tue, Jul 12, 2016 at 11:15 PM
Subject: CSCW 2017 notification - #516
To: snehanarayan@gmail.com
Cc: papers2017@cscw.acm.org
Dear Sneha Narayan -
Congratulations!
Your paper:
516 - The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
is one of the 52% of CSCW 2017 submissions invited to revise and resubmit.
There were 530 total submissions to CSCW 2017, a similar number to last
year. The reviewers for this submission believe that it has the potential
to be revised within four weeks -- revisions are due August 9, 2016 -- to
become a contribution to what will be an exceptional conference.
The program committee expects all authors to take advantage of this four
week revision period to improve their submissions by addressing reviewers'
comments (below). Some submissions need only minor revisions, while others
will require considerable work over the next four weeks to result in an
acceptable submission, and will not succeed without significant effort.
Your reviews, especially the summary report from the Coordinator, should
make clear what you should do. You can gauge your prospects from your
reviews and the summary report: overall scores of 4s and 5s indicate the
reviewers are very confident your paper will be acceptable within four
weeks with small edits. Overall scores of 3 and 4 indicate you have some
work to do. Scores of 3 and below indicate that some reviewers have serious
reservations, though other reviewers see promise.
The same reviewers will read and evaluate your revised submission (though
additional reviewers may be added for papers where the reviewers are
divided). You need not satisfy every reviewer or make every suggested
change, but your revision will need to convince most of the reviewers that
it is now ready for publication. For some papers the reviewers have
requested a lot of work, and you might feel that it is too much to achieve
in a four-week period. If you have the time to reach that goal: great! If
not, that is okay; you are free to withdraw your submission. Please decide
whether or not the key points made by reviewers can be adequately addressed
in the time provided, given other demands on your time. If you choose to
withdraw your paper, please notify us explicitly at papers2017@cscw.acm.org.
Papers that are revised and re-submitted in the next round will receive
revised reviews.
Your revision must be accompanied by a separate "Summary of Changes"
document (in PDF format) that lists the reviewers' comments and your
responses, even for comments that did not lead to changes in the manuscript
(in which case you might explain why you chose not to make certain
suggested changes). This could be a set of bullet points, a table, or
numbered points by which reviewers' comments are summarized along with your
changes. This is not a rebuttal, but rather a description of changes made,
or of reasons you could not or chose not to take the reviewers' advice. To
become acceptable, your submission must be revised, and your document
describing the changes will greatly help reviewers see what you have or
have not changed, along with your reasons for doing so.
Just to be clear, you must submit a revised paper and summary of changes by
the deadline. Any paper where a revision and summary are not submitted
will be considered to be withdrawn.
Example summaries from past years' papers can be found at
http://bit.ly/16U8BGM.
Please submit your revision and the response document at your "Submissions
in Progress" page at https://precisionconference.com/~cscw17a/ by 11:59 PM
PDT, August 9, 2016.
CSCW 2017 will be a great conference, and we sincerely hope you are part of
it! If you have any issues or questions, please let us know. And thanks
again for submitting.
Sincerely,
Louise Barkhuus, Marcos Borges, Wendy A. Kellogg
CSCW 2017 Co-chairs
------------------------ Submission 516, Review 4 ------------------------
Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
Reviewer: AC
Expertise
2 (Passing Knowledge)
First Round Overall Recommendation
3 (Maybe acceptable (with significant modifications))
Contribution and Criteria for Evaluation
The paper presents the design and evaluation of a gamified tool for
socializing and retaining new Wikipedia editors. Contribution criteria
include (1) a description and rationale for the system; (2) system
novelty and rationale for how it leads to learning; and (3) a
methodologically sound evaluation.
First Round Review (if needed)
Coordinator's First-Round Report to Authors
The paper presents the design and evaluation of a gamified tool for
socializing and retaining new Wikipedia editors. The study found that
users liked—but did not learn from—the system.
The focus on improving the experience of newcomers in Wikipedia is
relevant and important. Reviewers describe the study as well motivated
and exceptionally well-written. Read R3's comments on the writing
quality and congratulate yourself!
The reviewers, however, have many concerns about the paper—each
focusing on a different aspect of the work. The concerns the reviewers
note /may/ be addressable during the revise and resubmit period, but it
will require an exceptionally herculean effort. Also, please keep in mind
that there is no guarantee of acceptance even after making changes. So, it
is at the authors' discretion whether or not to proceed with revisions or
withdraw the paper.
There is a split amongst the reviewers as to whether the failure of the
tool is interesting or not. R1 raises concerns that the failure of the
tool could be predicted from existing literature, suggesting little
rationale for doing the work in the first place. R2 asks whether there
is something fundamentally different about people who continue to
contribute to Wikipedia, and as such whether the system holds value in
practice. R3, on the other hand, sees much value in the system's
contribution as well as in the real-world evaluation. R3's
review has some suggestions of alternative framings that may make the
contribution more valuable.
In treatment of related work, many improvements are needed. R1 notes that
the discussion of the well-known concept of legitimacy/authenticity in
learning environments is missing. R2 also points to missing literature
about Wikipedian experience.
R2 and R3 raise a number of methodological questions about the paper. R2
suggests the distribution of participants across the timeline may bias
the results. R3, on the other hand, sees opportunity here, suggesting
additional statistical analysis related to longevity and power users.
Both R2 and R3 question the methodological choice and contribution of
measuring perceptions of learning rather than actual learning. Overall,
this points to a need for at the very least justifying the methodological
choices and at the most carrying out additional statistical analyses.
In summary, there is quite a bit of work to be done. I wish the authors
the best of luck, should they choose to continue in the review process.
Requested Revisions
REQUIRED:
- Provide justification for why the study was worth carrying out, in
response to R1 and R2's concerns. R3's review may have some insight
into alternative framings.
- State the research questions more explicitly, as per R2's
recommendation.
- Address R1 and R2's concerns about missing literature.
- Ensure that the narrative around Wikipedia is clear to readers who do
not have an in-depth background in production/editing details.
- Improve the clarity of the results by using percentages or another
baseline that allows comparison between numbers, as per R2's review.
- Provide justification for measuring perceptions of learning versus
actual learning.
- Provide a robust discussion of why the results are meaningful for
researchers and/or practitioners.
OPTIONAL, RECOMMENDED
- Consider carrying out additional statistical analyses as recommended
by R3.
- Provide a short justification for use of English-language Wikipedia,
as per R2's review.
Formatting and Reference Issues
------------------------ Submission 516, Review 1 ------------------------
Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
Expertise
4 (Expert)
First Round Overall Recommendation
3 (Maybe acceptable (with significant modifications))
Contribution and Criteria for Evaluation
In this paper, the authors present the design and two-pronged evaluation
of a tutorial for new Wikipedia editors that uses elements of
gamification like missions and badges to help coach new editors and help
them learn best practices and social norms of Wikipedia. The outcome is
that users like the system but, based on behavioral measures, they don't
actually learn from it. Learning interventions are a classic kind of
research problem, and the paper should include robust measures of
learning, as well as a good description of the designed intervention
itself, why the design is expected to lead to learning, and a clear
description of the study.
Assessment of the Paper
This is a reasonably well motivated study with connections to appropriate
literature and the writing is engaging and understandable. The problem of
enculturating newcomers into projects like Wikipedia is well documented
and this paper investigates a potential intervention with an admirably
well-planned study. Designing learning interventions is really difficult
and I commend the authors on a well-executed effort.
Still, I am ambivalent about the paper because I would have predicted
these outcomes based on the literature alone. In the discussion, the
authors note that one mismatch between Wikipedia and the tutorial as
designed involves the “gradual peripheral participation” of newcomers
as they take on the identity of “Wikipedian.” They suggest that maybe
speeding up this process is unnatural. I would argue that the most
important concept from the literature on learning is missing from this
discussion, and that's “legitimacy” (also sometimes referred to in
education and learning literature as “authenticity”). The authors
explain that by doing tasks in a pretend version of Wikipedia, they make
it a safe space for newcomers to practice, yet performing “canned”
tasks in a pretend system is the opposite of offering a legitimate form
of participation. I immediately wonder, why not use what we know from the
literature to create low-risk missions that newcomers can complete while
legitimately contributing to the encyclopedia? Risk taking is a
fundamental characteristic of games that makes them engaging; it
certainly seems like it would play a role in people's motivation in a
scenario like this. Rather than eliminating risk, the literature on
legitimate peripheral participation would suggest that finding the right
degree of risk is required to facilitate progressive entree into a set of
shared practices.
I am disappointed by the missed opportunity here; the outcome mainly
seems to verify that what we know, based on the literature, shouldn't
work in fact doesn't work. Still, the paper isn't bad, and
the study is carefully crafted and reported.
With some extension and reflection, I think the discussion could help
point future research in a more fruitful direction. There are millions of
pages written on the challenges of designing learning interventions that
change people's behavior, yet this paper ends on a painfully obvious note.
It's true that usability isn't all it takes, but what can we learn
from TWA about the design of systems to facilitate
enculturation into a community of practice? What can we take away from
this that might inform more successful tutorial systems in the future?
Formatting and Reference Issues
------------------------ Submission 516, Review 2 ------------------------
Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
Expertise
4 (Expert)
First Round Overall Recommendation
2 (Probably NOT acceptable)
Contribution and Criteria for Evaluation
This paper's contribution is the design and evaluation of a structured
introduction to a peer production community (English Wikipedia) called
"The Wikipedia Adventure". TWA's design is rooted in theories of
gamification, and its utility is evaluated through a user survey and an
invitation-based field experiment. The paper reports on the survey
respondents' satisfaction with TWA, and how their experiment results
reveal some of the challenges of effecting lasting changes to contributor
patterns in peer production communities. These findings are then
discussed in relation to cultural factors in Wikipedia, issues of
self-selection and voluntary participation, and the limitations of
gamification.
When evaluating a paper that describes the design of a system, the two
main criteria are that the system and/or its development setting is/are
novel, and that the way the system is evaluated is methodologically
sound.
Assessment of the Paper
As mentioned in the contribution section, this paper's contribution is
the design and evaluation of a structured introduction to a peer
production community based on gamification, called "The Wikipedia
Adventure". This is a great idea and sounds like a useful addition to
Wikipedia. The paper is written in a way that makes it easy to read, and
provides the reader with a good introduction to how TWA's design is
rooted in theories of gamification, thus applying these principles in
what appears to be a novel setting. The paper also does a good job of
discussing the findings, organizing them in a way that is easy to follow
and touching on important points (e.g. cultural factors, and the
limitations of self-selection and gamification).
The overall ideas and approach taken in this paper are sound, and they are
in line with the criteria described previously. Unfortunately, there are
two major issues and several minor ones that need to be resolved before
this paper is ready for publication. The first major issue is that the
methodology used to evaluate performance in the invitation-based
experiment measures contribution in a skewed manner and does not
establish why that is appropriate. Secondly, the paper fails to consider
arguments put forth by Panciera et al's "Wikipedians are Born, Not Made"
paper. This review will expand on both of these major issues below.
Further below will be notes and comments with suggestions for improvement
for specific sections of the paper, some of which are rather substantial
as well.
1: Evaluating TWA effectiveness by number of contributions
----------------------------------------------------------
A major part of the paper is the evaluation of TWA's effect on subsequent
contributions. To evaluate this an invitation-based field experiment is
used, and the paper does a great job of justifying why that is
appropriate in this setting. The experiment runs from February 2014 for
three months. Exact dates are not given, so let us assume that
it ran until the end of April 2014. User contributions are then measured
until the end of May 2014.
There are two problems with this approach that the paper fails to address
properly. One is the issue of right-truncation found in the data.
Contributors who joined in early February 2014 would have about four
months to make edits, whereas those who joined in late April would only
have about a month. The model does contain a control variable for number
of days in the experiment, but why is that appropriate in this context?
If we examine other work in the same domain, they tend to either use a
much longer time period (e.g. the Teahouse paper, citation 23, which uses
6-9 months) or ensure that the time period is fixed (e.g. Kittur et al.
"Herding the Cats: The Influence of Groups in Coordinating Peer
Production", WikiSym 2009; or Zhu et al. "Effectiveness of Shared
Leadership in Online Communities", CSCW 2012).
Related to the right-truncation problem is the fact that the paper also
fails to discuss and justify what a reasonable timespan for measuring the
effect of TWA is, and that it will have an effect on the number of
contributions made. It might for instance be that TWA instead has an
effect on how long it takes before a user drops out of the system. If we
assume that TWA has an effect on contributions, what timespan is needed
to measure that effect? The paper assumes that a month is adequate to
discover it, whereas one might suspect that it is only measurable over a
longer period of time. If it is the case that a short period of time is
appropriate (for instance because these users are likely to drop out
after a certain amount of time) the paper needs to properly establish
that, either by measuring it or referring to previous work.
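To make the right-truncation concern concrete: one common alternative to a
plain "days in the experiment" covariate is to model an edit *rate* by
including the observation window as an exposure offset. The sketch below is
illustrative only; the column names and toy data are hypothetical, and this
is not the model the paper actually fit.

    # Illustrative sketch only: hypothetical column names and toy data,
    # not the paper's model or data.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "edits":         [0, 3, 12, 1, 0, 7],         # edits after inclusion
        "treated":       [1, 1, 1, 0, 0, 0],          # invited to TWA or not
        "days_observed": [110, 45, 90, 30, 120, 60],  # inclusion to May 31
    })

    # A Poisson GLM with a log-exposure offset models an edit *rate*, so an
    # account observed for one month and one observed for four months are
    # put on a comparable footing.
    fit = smf.glm("edits ~ treated", data=df,
                  family=sm.families.Poisson(),
                  exposure=df["days_observed"]).fit()
    print(fit.summary())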
2: Wikipedians Are Born, Not Made
---------------------------------
In their GROUP 2009 paper "Wikipedians Are Born, Not Made: A Study of
Power Editors on Wikipedia", Panciera et al. show data that argues that
those contributors who are going to stick around behave in a way that is
different from the very beginning. In followup work published in 2010
they find similar differences in another peer production community.
(Panciera et al. "Lurking? cyclopaths?: a quantitative lifecycle analysis
of user behavior in a geowiki." CHI 2010)
These two papers and the argument they put forth are relevant because
they question who TWA is designed for. In the related work a reference to
Bryant et al's "Becoming Wikipedian" is made, thereby suggesting that TWA
is designed to teach someone how to be a Wikipedian. As Panciera et al's
paper argues along the lines of these contributors already being
Wikipedians, should TWA be designed to instead help these contributors
stay productive?
If Wikipedians are born, not made, then one could also question whether
these contributors are at all going to use TWA. Maybe they ignore TWA
because they are already productive and do not need it? Since the paper
never makes any references to these papers or discusses issues related
to this (e.g. "is the Teahouse more effective since it allows them to get
answers when they need help?"), this whole topic area is left hanging.
---
Below follows comments/notes for each section of the paper.
Introduction:
* An overall issue here is that there are few citations to sources. For
instance a claim is made that "newly created accounts are the primary
source of spam and vandalism on Wikipedia". Consider adding a "[citation
needed]" after that.
* When citing multiple papers it is preferable that they are in order,
e.g "[14, 23, 17]" should be "[14, 17, 23]" (page 1). This minor issue
also occurs elsewhere in the paper.
* "Unlike prior systems, TWA creates a structured experience that guides
newcomers through critical pieces of Wikipedia knowledge..." Do we know
that there are no other prior systems that offer a similar experience? It
might be that there are none within the Wikipedia domain, but what about
outside it? That sentence is making a rather bold claim.
* After reading the introduction, what is the reader expected to remember
as the main findings in this paper? At the end of the introduction the
following sentence is found: "The study underscores the importance of
conducting multiple types of evaluations of social systems." Is that the
main contribution? What about the implications for gamified structured
introductions to peer production?
Background:
* "...women reported that they found that contributing to Wikipedia
involved a high level of conflict and that they lacked confidence in
their expertise [8]. This suggests that more effective onboarding tools
could help incorporate newcomers." This is an important side of
Wikipedia, but how does TWA's design help mitigate this issue? Are there
design elements in TWA that aim to boost confidence in one's expertise?
* At the end of the introduction we find the following two questions:
"Would a gamified tutorial produce a positive, enjoyable educational
experience for new Wikipedians? Would playing the tutorial impact
newcomer participation patterns?" These are the paper's _research
questions_! It would be very helpful to the reader if they were displayed
more clearly, e.g. as separate items. They should not be hidden.
System Design:
* "...it does not depend on the availability, helpfulness, or
intervention of existing Wikipedia editors..." The underlying argument
here is that scalability is preferable to personal interaction when
socializing newcomers (in peer production communities). Why is that the
better solution? As discussed previously, TWA might be designed for
contributors who are not going to stick around, why are those the right
audience for it? Is the goal to provide _everyone_ with a scalable
impersonale introduction, or is it better to provide _some_ (typically
based on self-selection) with a personal introduction (e.g. the
Teahouse)?
Game-like elements (subsection of System Design):
* In "Missions" a distinction is made between "basic" and "advanced"
editing techniques. It appears to be somewhat arbitrary: why is adding
sources advanced editing, but watchlists are not?
* Your readers might not know what watchlists are; take care to write for
a general audience, as not everyone knows a lot about how Wikipedia works
behind the scenes.
Study 1: User Survey:
* This paper doesn't discuss any other language editions of Wikipedia
besides the English one, and makes the assumption that "Wikipedia" equals
the English edition. Adding a mention that Wikipedia exists in multiple
languages and explaining why English was chosen as the language where
TWA was launched would be very helpful.
* The paper aims to measure "educational effectiveness". Why is a survey
the appropriate way to measure that? Based on the description of the
survey, it seems that it never asks specific questions to test whether
TWA's users learned specific things, in other words whether the education
was successful. Later, when describing the results, the phrase "learning to
edit Wikipedia" is used; isn't that the _key_ learning goal of TWA? Yet
the survey asks Likert-scale questions. In other words, you're measuring
whether TWA users are under the impression that they learned something,
not whether they actually did.
* Figure 4 uses counts. While it shows that none of the questions had
responses from all participants, it makes comparisons between questions
with different response rates very difficult. Using percentages would
allow for direct comparisons, and makes the references to the figure in
the text easier to follow along with. The text refers to four questions
with a certain percentage of responses, but leaves the math to the
reader.
* The survey leaves many questions unanswered, some of which the paper
might want to address. Were any negative questions asked? Were there any
control questions, such as a similar question worded slightly differently
to allow for comparison between responses? As it is, this survey comes
across as a set of positive statements about TWA that respondents agreed
to. Given that respondents self-select and no attempts to contact users
who didn't go through TWA appear to have been made, it is likely there
is a bias in the responses, and that bias should be discussed.
Study 2: Field Experiment:
* The description of how accounts were selected to be included is rather
confusing. First it describes 1,967 accounts that met the same criteria
as for the user survey; however, 10,000 individuals ("accounts"?) were
invited to the beta. Why is one an order of magnitude larger than the
other? Then in the second paragraph of "Methods" it describes the
selection criteria, that at least one contribution would have to be made
after getting invited. This would perhaps be much less confusing if the
criteria were first explained, particularly how the experiment and
control groups were set up, and then how many accounts were identified.
* "This is a larger proportion of users than took up the invitation in
Study 1, which may be due to changes in the invitation text." Earlier in
the paper study 1 refers to a "beta", whereas this one appears not to be. If
this is the case, this is an important difference between the two that
should be made clear to the reader.
* "we measure the overall contributions as the total number of edits made
by each account from the time of inclusion in the study until May 31,
2014." When exactly is "time of inclusion", is that when they got the
invite? What about when they completed one (or all) TWA mission(s)? The
concern here is that all contributions are measured, whereas the
experiment sets up a pre/post-scenario. Later on the paper refers to
"subsequent contributions", indicating that contributions after a certain
point in time was measured. This quickly becomes rather confusing,
spelling out clearly what points in a user's account history is used
(e.g. "we measure contributions at four points in time: when the user
registered their account, the time of invitation, when they first started
using TWA, and the end of the experiment") would be very helpful.
* Why is a six-edit radius chosen when measuring word persistence?
Halfaker et al. make no claim about what the radius should be in the
referenced work, and Ekstrand et al. suggest a 15-edit radius in a related
paper (Ekstrand and Riedl "rv you're dumb: identifying discarded work in
Wiki article history." WikiSym 2009). The six-edit radius also comes with
an issue that is unaddressed: how long does it take for an edit made by a
contributor in the study to reach that six-edit radius? If it hasn't been
reached at the end of the study period, that edit has to be discarded as
its quality is unknown. In a related paper, Farzan and Kraut instead
chose to use percentage of words that survived as a measure of quality
(Farzan and Kraut "Wikipedia classroom experiment: bidirectional benefits
of students' engagement in online production communities" CHI 2013). A toy
sketch of this kind of word-persistence measure appears after this list.
* Tables 1, 2, 3, and 4, as well as figure 6 should be brought closer
together so it's easier to follow along. Table 1 occurs before the text
that refers to it, and table 4 is two pages further along. Putting all
tables and figure 6 on the same page might be a good solution.
* Table 3 refers to users who "reached" a mission. It is confusing how 181
users reached the final mission but did not complete it, yet in the text
it seems these 181 users actually did.
* The post-hoc power analysis is very useful!
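Regarding the word-persistence point above, here is a toy illustration of
the general idea of measuring whether words added by an edit survive a
k-edit radius. It is a deliberate oversimplification (whitespace tokens,
no real diffing) and is not the measure used in the paper or in the cited
work.

    # Toy sketch of "word persistence within a k-edit radius": of the words
    # an edit adds, what fraction still appears k revisions later? A crude
    # bag-of-words simplification, not the paper's implementation.
    def added_words(before, after):
        """Words present after the edit that were not present before it."""
        return set(after.split()) - set(before.split())

    def persistence(revisions, edit_index, radius=6):
        """Fraction of words added by revisions[edit_index] that are still
        present `radius` edits later (or at the last available revision)."""
        added = added_words(revisions[edit_index - 1], revisions[edit_index])
        if not added:
            return 1.0
        later = revisions[min(edit_index + radius, len(revisions) - 1)]
        return len(added & set(later.split())) / len(added)

    revisions = [
        "wikipedia is an encyclopedia",
        "wikipedia is a free encyclopedia",           # adds "a" and "free"
        "wikipedia is a free online encyclopedia",
        "wikipedia is a collaborative encyclopedia",  # "free" is removed
    ]
    print(persistence(revisions, edit_index=1, radius=2))  # -> 0.5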
Discussion:
* "The new editors in our study may have had unpleasant experiences
during their initial time on Wikipedia..." It appears that the survey
asked no questions about this, yet is it not a very important issue
related to TWA's success?
* In "Limitations of gamification" the following sentence is found:
"...our study is among the first that compares levels of participation in
a task among individuals who were introduced to gamified learning first
to those that were not." This is an _important_ finding; it shouldn't be
hidden back here, but should instead be up front in the introduction!
Formatting and Reference Issues
------------------------ Submission 516, Review 3 ------------------------
Title: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial
for New Users
Reviewer: AC-Reviewer
Expertise
4 (Expert)
First Round Overall Recommendation
3 (Maybe acceptable (with significant modifications))
Contribution and Criteria for Evaluation
This paper presents the results of a deployment of a gamification-based
system designed to retain new editors in Wikipedia. It is a negative
results paper: the authors claim that they have conclusive evidence that
the system did not work (although I have suggested a few additional lines
of inquiry below that might problematize this assertion).
The committee will have to have a discussion about how to evaluate this
paper, and likely negative results papers more generally.
Assessment of the Paper
This paper presents the results of a deployment of a gamification-based
system designed to retain new editors in Wikipedia. It is a negative
results paper: the authors claim that they have conclusive evidence that
the system did not work (although I have suggested a few additional lines
of inquiry below that might problematize this assertion).
The paper is very well-written and has some large positives. It also is a
negative results paper, and the committee will have to decide how to
handle this. In general, I'm strongly sympathetic to arguments to
include more negative results papers in our proceedings, but I'm quite
unclear on the details of how to do so (e.g. what defines a top-quality
negative results paper?). I'm hopeful that this paper can instigate a
broader discussion on this topic at the PC meeting.
All of that said, this paper also has a number of idiosyncratic
limitations that make it perhaps not the best trial balloon for negative
results papers. Below, I outline what I believe to be the paper's
positives and then describe these limitations in more detail, phrased as
both critiques and questions.
Overall, my recommendation is to invite the authors to revise and
resubmit. If this occurs, I'll want to see the below critiques
addressed and the below questions answered (both through direct answers
in the response to reviewers and through clarifications and changes to
the paper). I'm hopeful that, through the R&R process, this paper
can become an ideal negative results trial balloon.
Important positives:
* The authors built a system to solve a real-life problem and did a
real-life, relatively large-scale deployment. Awesome!
* The paper is easily in the top 95% in terms of writing quality. This is
true both at the sentence level and at the narrative level. As a person
who has to review lots of papers, this was a breath of fresh air.
* The design of the game is quite well-thought-out, save a few relatively
arbitrary decisions. I was particularly compelled by the use of
gamification techniques that are also present in “real Wikipedia”
(e.g. barnstar-like rewards).
Critiques:
CRITIQUE #1 (Excessive import placed on trivial self-report data): It
is well-known that self-report data from participants is inferior to
observations of actual behavior, and that self-report data can be quite
unreliable more generally. As such, in my view, it is not a contribution
to show that self-report data didn't end up panning out in the
behavioral results.
In the next draft of this paper, I would like to see the authors address
this issue. This might mean framing this paper as a full-on negative
results paper, but lighter weight adaptations might be possible.
Open questions:
QUESTION #1: As noted above, this paper is a negative results paper at
its core, and we'll have to have a broad discussion about this at the
PC meeting, assuming the paper makes it this far. In the event that this
occurs, can the authors provide a more robust argument as to why these
negative results are important for other researchers and practitioners?
The paper attempts to argue that one contribution that comes out of its
negative results is to distrust self-report data, but this is well-known
(see below). The other negative results argument in the paper is that
these results add to growing evidence of long-term gamification
failures. I find this argument much more compelling. In other words, by
expanding on this argument, the authors may be able to address this
question.
That said, regardless of how this question is addressed in the second
draft, I'd like to see it done both through changes to the paper and
through discussion in the response to reviewers.
QUESTION #2: Is there a possibility that the statistical framework
employed is not appropriate for this particular study?
The authors utilize a two-level statistical approach that I haven't
seen before in the CSCW/CHI literature. I enjoyed thinking about this
approach, and the authors did a relatively good job explaining it. That
said, I'm currently not convinced that it was the appropriate framework
for this study. Here's my reasoning:
(1) The goal here is to introduce a treatment that ultimately will
produce strong new members of the Wikipedia community at a higher rate
than the control.
(2) Let's say the game produces 3 such members out of 100 new editors
and the control produces 1, which looks like it might be the case.
Let's also say that this pattern additionally persists over a large n.
(3) If this is true, why do we care about the potentially moderating
effect of the invitations?
The authors argue that new editors that responded to the invitation to
play the game might just be new editors who are engaged and, critically,
would have been power editors whether or not the game existed. However,
barring a random fluke, shouldn't these future power editors also have
been in the control group? If I'm right here, I'm thinking the
invitation doesn't matter and a more traditional statistical analysis
(or at least one targeted at identifying rare events) is appropriate.
I could be wrong, but I want the authors to respond to this question,
both through feedback to reviewers and clarifications in the paper.
As an important side note, if we agree that this framework is the right
way to go in the end, the authors should puff their chests more about
this by claiming it as a contribution (assuming it hasn't been used at
CSCW before).
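To ground the rare-events point in Question #2, one simple way to
operationalize "a more traditional statistical analysis" of the reviewer's
hypothetical 3-in-100 versus 1-in-100 scenario is a test on two
proportions, such as Fisher's exact test. The numbers below come from the
reviewer's hypothetical, not from the paper's results.

    # Reviewer's hypothetical: 3 of 100 treated newcomers vs. 1 of 100
    # controls go on to become "power editors". Fisher's exact test is one
    # simple, rare-event-friendly comparison of the two proportions; with n
    # this small the difference is unlikely to be significant, which is the
    # sample-size point behind the reviewer's "large n" caveat.
    from scipy.stats import fisher_exact

    table = [[3, 97],   # treatment: power editors, non-power editors
             [1, 99]]   # control
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(odds_ratio, p_value)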
QUESTION #3: Are the outcome variables considered here the best
outcome variables? Are some critical variables missing?
The authors seem focused on the average effects across the entire control
and treatment groups (the two treatment groups, to be specific). However,
would it not also be reasonable to consider the metric I describe above:
the % of new editors that go on to be power editors? Since power editors
end up contributing most of the edits anyway *over the long term*, to me
this seems like the way to go (i.e. if this group of editors were
followed for years, statistically significant differences would begin to
emerge). If the authors agree, the authors need to reanalyze their data
with this metric in mind.
Another related outcome variable that might be useful to analyze is how
long the new editors in each group remained active editors in the
community (i.e. survival analysis). Because the data is quite old, this
should be an easy new analysis to run, and longevity has been a variable
of interest in a number of peer production studies.
In their second draft and the feedback to reviewers, I would like to see
the authors discuss either new analyses related to power users or why they
did not consider this outcome variable. I would also like to see the same
for survival analysis.
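For the survival-analysis suggestion, a minimal sketch of what such an
analysis could look like is given below, using the lifelines package with
made-up durations; the variable names and data are hypothetical and this is
not an analysis from the paper.

    # Minimal Kaplan-Meier / log-rank sketch for "how long did newcomers in
    # each group remain active?" -- hypothetical toy data, not the paper's.
    import pandas as pd
    from lifelines import KaplanMeierFitter
    from lifelines.statistics import logrank_test

    df = pd.DataFrame({
        "days_active": [5, 40, 90, 12, 3, 75, 30, 8],  # days until last edit
        "dropped_out": [1, 1, 0, 1, 1, 0, 1, 1],       # 0 = still active (censored)
        "treated":     [1, 1, 1, 1, 0, 0, 0, 0],       # invited to TWA or not
    })

    kmf = KaplanMeierFitter()
    for label, group in df.groupby("treated"):
        kmf.fit(group["days_active"], event_observed=group["dropped_out"],
                label=f"treated={label}")
        print(kmf.median_survival_time_)

    # Log-rank test compares the two retention curves.
    t, c = df[df.treated == 1], df[df.treated == 0]
    result = logrank_test(t["days_active"], c["days_active"],
                          event_observed_A=t["dropped_out"],
                          event_observed_B=c["dropped_out"])
    print(result.p_value)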
QUESTION #4: Is there a path towards positive results?
As noted above, I believe some discussion around this paper and negative
results papers more generally will have to happen at the PC meeting.
However, I think there are some missed opportunities here for positive
results and that the authors were too quick to settle for negative
results. This is likely an important factor to consider when deciding
whether to accept a negative results paper.
Most notably, there are several well-motivated, unexplored avenues that
could lead to positive results that would have a much larger impact than
the negative results presented here:
* As noted above, examining additional outcome variables is important,
most notably # of power editors and longevity.
* Does the game work if folks are forced to play it prior to editing
Wikipedia, as would be the case in most other institutionalized
socialization contexts? This is not just a hypothetical: this game could
be used in all Wikipedia Education Project classes and related endeavors.
Formatting and Reference Issues

File diff suppressed because it is too large

File diff suppressed because it is too large