% --- Scraped file-viewer metadata (not part of the manuscript); commented out so the file compiles ---
% 1
% 0
% Files
% ml_measurement_error_overleaf/#article.Rtex#
% 2023-02-23 18:17:52 +00:00
%
% 1043 lines
% 125 KiB
% Plaintext
\documentclass[floatsintext, draftfirst, man]{apa7}
<<init,echo=FALSE>>=
# Setup chunk: load packages, set knitr defaults, define helper formatters,
# and pull in the project's analysis resources.
library(knitr)
library(ggplot2)
library(data.table)
knitr::opts_chunk$set(fig.show = 'hold')
# Format an integer with comma thousands separators, e.g. 1234567 -> "1,234,567".
f <- function(x) formatC(x, format = "d", big.mark = ',')
# Render a proportion as a LaTeX-escaped percentage, e.g. 0.5 -> "50\%".
format.percent <- function(x) paste0(f(x * 100), "\\%")
theme_set(theme_bw())
source('resources/functions.R')
source('resources/variables.R')
source('resources/real_data_example.R')
@
\usepackage{epstopdf}% To incorporate .eps illustrations using PDFLaTeX, etc.
\usepackage{subcaption}% Support for small, `sub' figures and tables
\usepackage{tikz}
\usetikzlibrary{positioning, shapes, arrows, shadows}
\def \parrotpdf {\includegraphics[]{parrot.pdf}}
\DeclareUnicodeCharacter{1F99C}{\parrotpdf}
\usepackage{tabularx}
\usepackage[utf8]{inputenc}
\usepackage{wrapfig}
\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{listings}
\usepackage{xcolor}
%New colors defined below
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}
%Code listing style named "mystyle"
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour}, commentstyle=\color{codegreen},
keywordstyle=\color{magenta},
numberstyle=\tiny\color{codegray},
stringstyle=\color{codepurple},
basicstyle=\ttfamily\footnotesize,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
numbers=left,
numbersep=5pt,
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2
}
% \usepackage[garamond]{mathdesign}
% \usepackage[letterpaper,left=1in,right=1in,top=1in,bottom=1in]{geometry}
% packages i use in essentially every document
\usepackage{graphicx}
\usepackage{enumerate}
% packages i use in many documents but leave off by default
\usepackage{amsmath}%}, amsthm, amssymb}
\DeclareMathOperator*{\argmin}{arg\,min} % thin space, limits underneath in displays
\DeclareMathOperator*{\argmax}{arg\,max} % thin space, limits underneath in displays
\usepackage{subcaption}
% import and customize urls
% \usepackage[usenames,dvipsnames]{color}
% \usepackage[breaklinks]{hyperref}
\hypersetup{colorlinks=true, linkcolor=black, citecolor=black, filecolor=blue,
urlcolor=blue, unicode=true}
% add bibliographic stuff
\usepackage[american]{babel}
\usepackage{csquotes}
\usepackage[natbib=true, style=apa, sortcites=true, backend=biber]{biblatex}
\addbibresource{Bibliography.bib}
\DeclareLanguageMapping{american}{american-apa}
\defbibheading{secbib}[\bibname]{%
\section*{#1}%
\markboth{#1}{#1}%
\baselineskip 14.2pt%
\prebibhook}
\def\citepos#1{\citeauthor{#1}'s (\citeyear{#1})}
\def\citespos#1{\citeauthor{#1}' (\citeyear{#1})}
\newcommand\TODO[1]{\textsc{\color{red} #1}}
% I've gotten advice to make this as general as possible to attract the widest possible audience.
\title{Automated Content Misclassification Causes Bias in Regression. Can We Fix It? Yes We Can!}
\shorttitle{Automated Content Misclassification}
\authorsnames[1,2,3]{Nathan TeBlunthuis, Valerie Hase, Chung-hong Chan}
\authorsaffiliations{{{Department of Communication Studies, Northwestern University}, {School of Information, University of Michigan}}, {LMU Munich}, {GESIS - Leibniz-Institut für Sozialwissenschaften}}
\leftheader{TeBlunthuis, Hase \& Chan}
\keywords{
Content Analysis; Machine Learning; Classification Error; Attenuation Bias; Simulation; Computational Methods; Big Data; AI;
}
\abstract{
Automated classifiers have become widely popular measurement devices in communication science. These classifiers, often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video.
Even the most accurate non-trivial automated classifiers make errors that cause biased inferences in downstream statistical findings—unless analyses account for these errors.
As we show in a systematic literature review of SML applications,
communication scholars rarely acknowledge this important problem of ``ignoring misclassification in automated content analysis''.
In principle, existing statistical methods that use ``gold standard'' validation data, such as that created by human annotators, can account for misclassification and produce correct statistical results.
We introduce and test such methods, including a new method we design and implement in the R package \texttt{misclassificationmodels}, via Monte-Carlo simulations designed to reveal each method's limitations. Based on these results, we provide recommendations for addressing misclassification errors via statistical correction methods. In sum, automated classifiers, even those below common accuracy standards, can be useful for measurement with careful study design and appropriate correction methods.
}
\begin{document}
\maketitle
%\section{Introduction}
\emph{Automated classifiers} (ACs) based on supervised machine learning (SML) have rapidly gained popularity
as part of the \emph{automated content analysis} toolkit in communication science \citep{baden_three_2022}. With these measurement devices, researchers can categorize large samples of text, images, video or other types of data into predefined categories \citep{scharkow_content_2017}. In communication science, studies for instance use ACs to automatically classify topics \citep{vermeer_online_2020} or frames \citep{opperhuizen_framing_2019} in news articles or social media posts.
% TODO: restore citation to fortuna_toxic_2020 below
However, there is increasing concern about the validity of automated content analysis \citep{baden_three_2022, grimmer_text_2013}. As we demonstrate using the Perspective toxicity classifier, even very accurate ACs make \emph{misclassifications}
which can lead to incorrect statistical findings—unless correctly modeled \citep{scharkow_how_2017, fong_machine_2021}. Research areas where ACs have the greatest potential—e.g., content moderation, social media bots, affective polarization, or radicalization—are haunted by the specter of methodological questions related to misclassification \citep{baden_three_2022, rauchfleisch_false_2020}: How accurate must an AC be to usefully measure a variable? When—if ever—should an AC built for one context be used in another \citep{gonzalez-bailon_signals_2015, hede_toxicity_2021}? How do biases that an AC learns from training data affect findings of downstream analyses \citep{millimet_accounting_2022}? Knowing that high classification accuracy limits the risks of misleading inference, careful researchers might use only those ACs having excellent predictive performance. Yet, important social scientific concepts such as news tone \citep{van_atteveldt_validity_2021} %even ones as seemingly straightforward as sentiment \citep{van_atteveldt_validity_2021}, toxicity \citep{fortuna_toxic_2020}
or civility \citep{hede_toxicity_2021}
%and institutional frameworks \citep{rice_machine_2021}
can be challenging to classify with high performance.
Despite these concerns, a systematic literature review of \emph{N} = 48 studies employing SML-based text classification to study substantial empirical questions shows that the problem of \emph{ignoring misclassification} is widespread. This review demonstrates a troubling lack of attention to the threats ACs introduce—and virtually no mitigation of such threats. In the current state of affairs, ACs are unlikely to be useful for studying nuanced concepts. Researchers will either draw misleading conclusions from inaccurate ACs or avoid ACs in favor of costly methods such as manually coding large samples \citep{van_atteveldt_validity_2021}.
Our primary contribution is to \emph{introduce and test statistical methods for addressing misclassification} with the goal of rescuing ACs from this dismal state \citep{carroll_measurement_2006, buonaccorsi_measurement_2010, yi_handbook_2021}. We consider some recently proposed methods including \citet{fong_machine_2021}'s generalized method of moments (GMM) calibration method, \citet{zhang_how_2021}'s pseudo-likelihood models, and \citet{blackwell_multiple_2012}'s application of imputation methods. To overcome limitations of the methods above, we develop our own specialized implementation of a general likelihood modeling framework drawn from the statistical literature on measurement error \citep{carroll_measurement_2006}, which we implement via the experimental R package \texttt{misclassificationmodels}.
We test the error correction methods using Monte Carlo simulations of four prototypical situations representative of those identified by our systematic review: Using ACs to measure either (1) a dependent or (2) an independent variable where the classifier makes misclassifications that are either (a) easy to correct or (b) more difficult (e.g., when an AC is biased and misclassifications and covariates are correlated).
The more difficult cases are important.
As the real-data example we provide in the next section demonstrates, even modest biases in very accurate ACs can cause misleading statistical findings.
% Such biases can easily result when classifier errors affect human behavior, such as that of social media moderators \maskparencite{teblunthuis_effects_2021}. Studies using classifiers from APIs that are also used in sociotechnical systems therefore be particularly prone to to differential error, which can cause misleading statistics even when classification accuracy is high.
% Our Supplementary Materials present numerous extensions of these scenarios. We show that none of the existing error correction methods are effective in all scenarios.
%— multiple imputation fails in scenario 2; GMM calibration fails in scenario 1b and is not designed for scenario 2; and the pseudo-likelihood method fails in scenario 1 and in scenario 2b. When correctly applied, our likelihood modeling is the only correction method recovering the true parameters in all scenarios. %We provide our implementation as an R package.
% , and our approach based on maximum likelihood methods \citep{carroll_measurement_2006} .
%By doing so, we follow a handful of recent studies in which social scientists have used samples of human-labeled \emph{validation data} to account for misclassification by automated classifiers.
% This paragraph is likely to get cut, but its useful so that we have a working outline: In what follows, we begin with an overview of automated content analysis to describe how AC-based measures can affect downstream analyses and how these errors thus threaten progress in automated text classification often used in the field of Computational Social Science (CSS). We substantiate our claims via a systematic literature review of \emph{N}=49 empirical studies employing SML for classification (see \nameref{appendix:lit.review} for details).
% Although the methods above are all effective in bivariate least squares regression when an AC is used to measure a covariate, validation data are error-free, and measurement error is \emph{nondifferential} (conditionally independent of the outcome given other covariates),
% these methods all have limitations in more general cases. Below, we present simulated scenarios in which each of these methods fail to recover the true parameters.
% so long as the coders' errors are conditionally independent given observable variables.
% In our discussion section, we provide detailed recommendations based on our literature review and our simulations.
According to our simulations, even biased classifiers with low predictive performance can be useful in conjunction with appropriate validation data.
As a result, we are optimistic about the potential of ACs for communication science and beyond if researchers statistically correct for misclassification.
Current practices of ``validating'' ACs by publishing misclassification rates are important but provide no safeguard against statistical distortions.
In sum, this paper makes a methodological contribution by introducing the often-ignored problem of ``ignoring misclassification in automated content analysis'' by testing approaches to address this problem via Monte Carlo simulations and introducing a new method for error correction.
The required assumptions for error correction methods are no more difficult than those already commonly adopted in traditional content analyses—and much more reasonable than the current default approach.
This method can succeed where others fail, is easily applied by experienced regression modelers, and is straightforward to extend.
Profoundly, our contributions suggest automated content analysis will progress not through ever more accurate classifiers, but through rigorous human-coding and error modeling.
\section{Illustrating the Problem of Bias through Misclassification in the Perspective API}
There is no perfect AC. All non-trivial ACs make errors.
This inevitable misclassification causes bias in statistical inference, and in the estimation of regression models in particular \citep{carroll_measurement_2006, scharkow_how_2017}.
This bias can lead researchers to make both type-I (false discovery) and type-II errors (failure to reject the null) in hypothesis tests. Here, we illustrate the problem of bias as a consequence of AC-based misclassification for a common example in communication research: detecting and understanding harmful social media content. In recent years, communication researchers have increasingly employed automated tools, and the Perspective toxicity classifier in particular \citep{cjadams_jigsaw_2019}, to detect toxicity in online content \citep[e.g.,][]{hopp_social_2019, kim_distorting_2021, salminen_topic-driven_2020}. To illustrate biases this AC and others like it may introduce, we compare toxicity scores for comments created by manual content analysis to automated classifications made by Perspective.
To do so, we use the Civil Comments dataset, which was released in 2019 by Jigsaw, the Alphabet corporation subsidiary which develops Perspective. This dataset has \Sexpr{f(dv.example[['n.annotated.comments']])} comments in English made on independent news sites that were all manually coded for ``toxicity'' and for whether they disclose each of several aspects of personal identity including race and ethnicity.
We then also obtained AC-based variables for the toxicity of comments from the Perspective API in November 2022. Perspective's toxicity classifier performs very well in this dataset, with an accuracy of \Sexpr{format.percent(iv.example[['civil_comments_accuracies']][['toxicity_acc']])} and an F1 score of \Sexpr{round(iv.example[['civil_comments_f1s']][['toxicity_f1']],2)}. Nevertheless, if we treat the human annotations as the ground-truth, the classifier is modestly biased. For instance, it disproportionately misclassifies comments that disclose a racial or ethnic identity as toxic (Pearson's $\rho=\Sexpr{round(dv.example[['civil_comments_cortab']]['toxicity_error','race_disclosed'],2)}$).
As a result of these misclassifications, regression analyses of the Civil Comments dataset using Perspective to measure toxicity can produce different results than those using human annotations.
In our first example, we consider the logistic regression model predicting whether a comment contains \emph{racial or ethnic identity disclosure} using \emph{number of likes}, \emph{toxicity} and the interaction of these two independent variables as covariates. Although this is a toy example constructed to illustrate a statistical problem, it is a realistic investigation of how disclosing aspects of one's identity on social media relates to the normative reception of one's behavior.
\begin{figure}[htbp!]
\centering
\begin{subfigure}{\linewidth}
<<real.data.example.iv,echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='pdf', fig.asp=0.3,cache=F>>=
# Example 1 figure: bias when automatic toxicity classifications are a covariate.
# plot.civilcomments.iv.example() is defined in resources/real_data_example.R
# (sourced in the init chunk); presumably returns a ggplot object -- confirm there.
# NOTE: fixed misspelled chunk option `result` -> `results` (knitr's option name).
p <- plot.civilcomments.iv.example()
print(p)
@
\subcaption{\emph{Example 1} illustrates bias when automatic classifications are a covariate in logistic regression.\label{fig:real.data.example.iv}}
\end{subfigure}
\begin{subfigure}{\linewidth}
<<real.data.example.dv,echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='pdf', fig.asp=0.3,cache=F>>=
# Example 2 figure: bias when automatic toxicity classifications are the outcome.
# plot.civilcomments.dv.example() is defined in resources/real_data_example.R
# (sourced in the init chunk); presumably returns a ggplot object -- confirm there.
# NOTE: fixed misspelled chunk option `result` -> `results` (knitr's option name).
p <- plot.civilcomments.dv.example()
print(p)
@
\subcaption{\emph{Example 2} illustrates bias in regression when automatic classifications are the outcome in logistic regression. \label{fig:real.data.example.dv}}
\end{subfigure}
\caption{Misclassification by Perspective causes bias in regression analyses as shown in the annotated Civil Comments dataset.
Figure \ref{fig:real.data.example.iv} compares a model using automatic toxicity classifications to a model using human toxicity annotations and shows that the 95\% confidence interval of the coefficient for likes contains 0.
In Figure \ref{fig:real.data.example.dv}, a model predicting automatic toxicity classifications for toxicity detects a negative correlation between likes and toxicity that is not found when human annotations are used instead. A \Sexpr{format.percent(iv.sample.prop)} random sample of \Sexpr{f(iv.sample.count)} annotations does not provide sufficient statistical power to distinguish the false discovery from 0.
In both examples, a random \Sexpr{format.percent(iv.sample.prop)} sample of \Sexpr{f(iv.sample.count)} annotations does not provide sufficient statistical power to distinguish the coefficient for likes from 0. Yet the methods we introduce can use this sample to model the misclassifications and obtain results close to those using the full dataset of annotations.
\label{fig:real.data.example}
}
\end{figure}
As shown in Figure \ref{fig:real.data.example}, a researcher using Perspective's automatic toxicity classifications could draw different conclusions than if she had instead used the human annotations. Specifically, evidence using the AC could lead her to reject her hypothesized direct relationship between likes and identity disclosure and to instead conclude that the correlation between likes and disclosure is entirely mediated by toxicity.
This is because the coefficient for likes is statistically indistinguishable from 0 and the coefficient for the interaction between likes and toxicity is positive and well-estimated. However, using the human annotations, she would have instead found a subtle positive direct relationship between likes and identity disclosure.
Obtaining such a large number of high-quality human annotations is impractical for all but the most well-resourced research teams. The direct relationship between likes and identity disclosure is so subtle that even a random sample of \Sexpr{format.percent(iv.sample.prop)} of annotations lacks sufficient statistical power to detect it.
However, our method can use this sample of annotations to correct the bias introduced by Perspective's misclassifications while preserving enough statistical power to detect the direct relationship between likes and identity disclosure at the 95\% confidence level with estimates similar to those in the model using all \Sexpr{f(dv.example[['n.annotated.comments']])} annotations.
This first example demonstrates that misclassification errors, even from a very accurate model in a large dataset, can mislead a researcher into rejecting a hypothesis of a nonzero effect.
Our second example shows that the problem of misclassification bias can also lead to false discovery by driving detection of a nonzero relationship.
For simplicity, our second example uses the same variables as the first. Only this time \emph{toxicity} is the outcome predicted by a logistic regression model with covariates a comment's number of \emph{likes}, \emph{racial or ethnic identity disclosure}, and the interaction of these two variables.
As shown in Figure \ref{fig:real.data.example.dv}, using Perspective's automatic classifications to measure toxicity results in a small negative coefficient for likes, but there is no detectable relationship in the dataset of annotations. The model using a \Sexpr{format.percent(dv.sample.prop) } sample of \Sexpr{f(dv.sample.count)} annotations cannot rule out such a weak relationship (the estimated effect using the AC is in the 95\% confidence interval), but our error correction method using this sample and Perspective's automatic classifications together can do so.
These examples show that misclassification can produce misleading statistical findings, even with a very accurate and modestly biased automatic classifier. If we consider hypothesis tests of non-zero coefficients, automatic classifications instead of human annotations caused both type-I and type-II errors in our examples. Although the effect sizes in these cases are rather subtle and would not be detectable in smaller datasets, such small effects commonly found using large datasets can easily result from subtle biases in observational study designs \citep{kaplan_big_2014}. Such small effect sizes may not appear practically or theoretically important, but note that the consequences of bias from automatic classification for coefficients in these examples (i.e., the interaction term in the first example and \emph{identity disclosure} in the second) are larger.
Of course, with a less accurate or more biased AC, misclassification will be even more prone to cause type-I and type-II errors in large effect sizes.
Importantly, these errors are correctable using human annotations. Although this example required \Sexpr{iv.sample.count} annotations, a large number representing considerable effort, to consistently do so, this is a small fraction of the entire dataset.
Additional details on these examples are available in Appendix \ref{appendix:perspective}.
We have now illustrated the problem. Next we will discuss it in greater depth.
\subsection{Problem I: Misclassification can cause anti-conservative bias}
A large dataset does not reduce such inferential bias \citep{carroll_measurement_2006, van_smeden_reflection_2020}. It is often believed—incorrectly—that misclassification causes only conservative bias (i.e., bias towards 0) because this is true in the simplest cases of least squares regression—when measurement error in the only covariate is classical or when measurement error in the outcome is unbiased
\citep{carroll_measurement_2006, loken_measurement_2017, van_smeden_reflection_2020}.\footnote{Measurement error is \emph{classical} when $W = X + \xi$ because the variance of an AC's predictions is greater than the variance of the true value \citep{carroll_measurement_2006}. If nondifferential measurement error is not classical then it is called Berkson, and we would write $X = W + \xi$ instead of $W = X + \xi$. In general, Berkson measurement error is easier to deal with than classical error. It is hard to imagine how an AC would have Berkson errors (the predictions would have to have lower variance than the training data), so, following prior work, we do not consider Berkson errors \citep{fong_machine_2021, zhang_how_2021}.} As a result, researchers interested in a hypothesis of a statistically significant relationship may not consider misclassification an important threat to validity \citep{loken_measurement_2017}. However, there are at least two compelling reasons that misclassification is a serious concern.
First, the inferential bias that misclassification causes is not necessarily conservative \citep{carroll_measurement_2006, loken_measurement_2017, van_smeden_reflection_2020}. In logistic regression or other nonlinear models, random measurement error can cause bias away from 0.
Moreover, differential measurement error (i.e., error not conditionally independent of the outcome given the other covariates) can bias inference in any direction and lead to wildly misleading conclusions. Researchers can check the assumption of nondifferential measurement error via graphical and statistical conditional independence tests \citep{carroll_measurement_2006, fong_machine_2021}.
For example, \citet{fong_machine_2021} suggest using Sargan's J-test of the null hypothesis that the product of the AC's predictions and regression residuals have an expected value of 0.
Users of ACs should be especially conscious of differential measurement error due to the nonlinear behavior of many ACs \citep{breiman_statistical_2001}.
For instance, ACs designed in one context and applied in another are likely to cause differential measurement error. The Perspective API used to classify toxic content, for example, was developed for social media comments, but performs much worse when applied to news data \citep{hede_toxicity_2021}.
Differential measurement error is also likely to arise when an AC used for measurement shapes behavior in the sociotechnical system under study. For example, the Perspective API is used for moderation in many forums \citep{hede_toxicity_2021} and the ORES API is used by Wikipedia moderators \citep{teblunthuis_effects_2021}.
Therefore, its predictions may have causal effects on outcomes related to moderation which cause differential error in regression models using these ACs as covariates.
\subsection{Problem II: Systematic Biases in Specific Research Areas}
%TODO: uncomment citation below
The second reason that misclassification is a concern is that it may systematically contaminate the literature in a research area. If certain ACs become standard measurement devices within a research area, such as the LIWC dictionary to measure sentiment \citep{boukes_whats_2020},
%\citep{dobbrick_enhancing_2021}
Google's Perspective API used to measure toxicity \citep{hosseini_deceiving_2017} or Botometer used to classify social media bots \citep[see, for a critical discussion][]{rauchfleisch_false_2020}, such research areas may become confused by systematic biases. For example, \citet{scharkow_how_2017} argue that media's ``minimal effects'' on political opinions and behavior may be an artifact of how many study designs in this area have common sources of measurement error that created systematic bias towards 0. Conversely, if researchers selectively report statistically significant hypothesis tests, measurement error can introduce an upward bias in the magnitude of reported effect sizes and contribute to a replication crisis \citep{loken_measurement_2017}.
% First, we note that when the anticipated effect size is large enough, traditional content analysis of a random sample has the advantage over the considerable complexity of automated content analysis.
% ACs should be used when costs prohibit traditional content analysis of sample size sufficient to detect anticipated effect sizes, but where collective a relatively small sample of validation data is tractable.
% When the data used to train an AC is not representative of the study population, as is the case with commercial APIs or other black-box classifiers, this increases the risk of differential measurement error, which can introduce extremely misleading forms of statistical bias. Even this form of error can be addressed.
% Therefore, we recommend reporting (and preregistering) at least two aforementioned corrective methods in addition to uncorrected estimates. When machine learning classification is used for an independent variable, we recommend multiple imputation because it is robust to differential error and it simple to implement. However, our simulations show that multiple imputation does not work well when machine learning classification is used for the dependent variable. Greater care may be required if measurement error may be differential, because specifying the error model may open many degrees of research freedom and plausible error moe
\section{Misclassification in Automated Content Analysis: Reviewing Reporting and Error Correction Practices}
% In traditional content analysis, humans use their judgement to classify messages, and automated content analysis uses computers as an instrument to
% % can be defined either as a research approach or as an instrument.
% In this paper, automated content analysis is defined as a research approach, which is a sub-type of content analysis for
% In contrast to manual content analysis, the difference is that the instrument used to code messages shifts from human judgment to computer algorithms \citep{scharkow2017content}. These computer algorithms, which can also be confusingly defined as ``automated content analysis" in the instrumental sense, are called automated coding techniques (versus manual coding techniques) in this paper.
% Social scientists have long recognized that measurement error can be an important methodological concern, but this concern has often been neglected \citep{schwartz_neglected_1985}.
% There have been several papers outlining what automated coding techniques are in the "toolbox" of communication researchers (key papers are \citep{scharkow2017content} and \citep{boumans:2015:tst}).
% Unsupervised and supervised machine learning procedures are deployed for coding.
% There has been discussion on the best practices for deploying unsupervised machine learning for communication research \citep{maier:2018:ALT}.
% This paper is going to focus only on classification.
% Researchers have raised concerns about validity issues of the approach \citep{scharkow2017content}. And by definition, the coding made by this technique is an imperfect surrogate of manual coding \citep{boumans:2015:tst}. When machine-classified surrogates are used in regression analyses for ``making replicable and valid inferences from texts", measurement errors are introduced \citep{fong_machine_2021}. A formal mathematical definition of these measurement errors is available later.
% In the next section, all communication research studies with SML are reviewed to show how researchers deals with these measurement errors.
% Furthermore, human classifiers also make errors and none of the prior methods consider how errors in the validation data can bias statistical results \citep{geis_statistical_2021, song_validations_2020, bachl_correcting_2017, scharkow_how_2017}.
% Changeme to bring back citations after ICA
Misclassification is a long-standing concern in
the content analysis literature which has extensively studied difficulties in human-labeling through the framework of intercoder reliability \citep{krippendorff_reliability_2004}.
%, hayes_answering_2007, gwet_computing_2008}.
The increasing use of metrics such as Krippendorff's $\alpha$
%and Gwet's AC \citep{gwet_computing_2008, krippendorff_reliability_2004},
demonstrates transparency efforts in reporting imperfect manual annotations \citep{lovejoy_assessing_2014}. Moreover, \citet{bachl_correcting_2017} introduced methods for correcting proportion estimates using data from multiple independent human coders.
Despite this awareness of threats posed by manual misclassification, our review below demonstrates that misclassification by ACs is often downplayed.
Content analysis focuses on ``\emph{making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use}'' \citep[p. 24, emphasis in original]{krippendorff_content_2018}. Automated content analysis, where computers are used as measurement devices, has gained traction in communication science \citep{baden_three_2022, junger_unboxing_2022} \maskparencite{hase_computational_2022}.
One common automated content analysis method is supervised machine learning (SML) \citep{scharkow_content_2017}.\footnote{Automated content analysis includes a range of other methods both for assigning content to predefined categories (e.g., dictionaries) and for assigning content to unknown categories (e.g., topic modeling) \citep{grimmer_text_2013}. Here, we focus on SML-based ACs. However, our arguments extend to other deductive approaches introducing misclassifications such as dictionary-based classification.} In essence, the procedure is to train an algorithm—e.g., a naïve Bayes classifier, decision tree, or artificial neural network—on manually coded material as the training set. The trained classifier is then used to predict categories in new, as of yet unseen data. Automatic classifiers enable researchers to inexpensively measure categorical variables in large data sets of digitized media. This promises to be useful for study designs requiring large samples such as to infer effect sizes smaller than would be possible using a sample size that humans could feasibly classify.
But are scholars aware that misclassification by ACs poses threats to the validity of downstream analyses? Although such issues in the context of manual content analysis have attracted much debate \citep{bachl_correcting_2017}, this is less true for misclassification by newly popular automatic classifiers.
To understand how social scientists, including communication scholars, use SML-based classifiers to construct variables and engage with the problem of misclassification, we conducted a systematic literature review (see Appendix \ref{appendix:lit.review} in our Supplement for details\footnote{Anonymized link for review: \url{https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e}}). Our review builds on studies identified by recent reviews on automated content analysis, including SML \citep{baden_three_2022, hase_computational_2022, junger_unboxing_2022, song_validations_2020}. Our goal in our review is not to comprehensively review all SML studies
%\footnote{In fact, our review likely underestimates the use of the method, as we focused on text-based SML methods in the social science domain employed for empirical analyses.}
but to provide a picture of common practices, with an eye toward awareness of misclassification and its statistical implications.
We identified a total of 48 empirical studies published between 2013 and 2021—more than half of which were published in communication journals—which employed SML-based text classification to create 146 variables. Studies used SML-based text classification to perform tasks such as identifying frames \citep{opperhuizen_framing_2019} or topics \citep{vermeer_online_2020}. They often employed SML-based ACs to create dichotomous (50\%) or other categorical (22.9\%) variables\footnote{Metric variables were also created in 35.4\% of studies, mostly via the non-parametric method by Hopkins and King \citeyear{hopkins_method_2010} estimating proportions instead of classifying documents, something we do not focus on.}. Although 89.6\% of empirical studies used SML-based ACs to report descriptive statistics,
%— from the prevalence of topics in online news \citep{vermeer_online_2020} to incivility in social media posts \citep{su_uncivil_2018} —,
many also employed automated classification for downstream statistical analyses by using ACs as dependent (43.8\%) and independent (39.6\%) variables in multiple regression models. These regression analyses tend to be reported in higher-status journals compared to papers only reporting proportions.
Given the rising popularity of SML-based text classification, our review indicates a worrying \emph{lack of transparency when reporting SML-based text classification}, similar to that reported in previous studies \citep{reiss_reporting_2022}: A large share of studies do not report important methodological decisions related to the sampling and sizes of training and test sets or to intercoder reliability (see Appendix \ref{appendix:lit.review}). This lack of transparency concerning model validation not only limits the degree to which researchers can evaluate studies, but also makes replicating such analyses to correct for misclassification nearly impossible. Most importantly, our review finds that \emph{studies almost never reflected upon or corrected for misclassification in their automated content analyses}. According to our review, only 18.8\% of studies discussed in any way the possibility that an AC misclassified texts. Only a single article reported using error correction methods.
\subsection{Is Transparency about Misclassification Enough?}
%TODO Uncomment below
Commonly recommended practices in automated content analyses address the threats of misclassification through \emph{transparency} in the form of reporting metrics such as precision, recall, F1 and AUC scores computed using human-classified validation data \citep{grimmer_text_2013}.
%, pilny_using_2019}.
These metrics are intended to promote confidence in inferences resulting from the use of ACs by demonstrating high predictiveness. However, our literature review indicates that they are not always included in reporting, at least when it comes to SML-based text classifications.
%Moreover, such metrics can limit the potential impact of measurement error if they dissuade researchers from using inaccurate classifiers.
Moreover, high predictiveness according to these metrics may be less protective from measurement error than it seems.
Algorithms and models for building effective automated classifiers were developed in the culture of algorithmic modeling associated with fields like computer science and management \citep{breiman_statistical_2001}.
As a paradigm, SML takes the opposite position on the bias-variance tradeoff from conventional statistics. Its methods achieve high predictiveness by throwing unbiased inference to the wind and pursuing prediction at all costs \citep{breiman_statistical_2001}.
On their own, predictiveness metrics provide no guarantees about the accuracy of downstream statistical inferences.
In fact, steps made in the interest of predictiveness may increase inferential bias.
As a growing body of scholarship critical of the hasty adoption of SML in criminal justice, healthcare, content moderation, and employment has demonstrated, machine learning models boasting high performance often have biases. These result from the use of non-representative training datasets and spurious correlations that neither reflect causal mechanisms nor generalize in different (sub)populations \citep{bender_dangers_2021}.
% \citep{obermeyer_dissecting_2019, kleinberg_algorithmic_2018, bender_dangers_2021, wallach_big_2019, noble_algorithms_2018}.
For example, \citet{hede_toxicity_2021} show that, when applied to news datasets, the Perspective API overestimates incivility in topics such as racial identity, violence and sex. These automatic classifications will likely introduce differential measurement error to a regression model of an outcome related to such topics.
If ACs used in communication science also have such biases, these biases may flow downstream, by way of differential or systematic measurement error, into statistical inferences.
The good news is that human-classified validation data can do more than benchmark predictive performance to increase transparency about measurement errors. With an appropriate model, validation data can effectively correct biases in statistical inferences.
%yi_handbook_2021,buonaccorsi_measurement_2010
\section{Correcting for Misclassification}
Statisticians have extensively studied problems that measurement errors can cause for statistical inferences and proposed statistical methods to correct them \citep[see][]{carroll_measurement_2006, fuller_measurement_1987}.
We therefore narrow our focus to methods that are particularly appropriate to dealing with misclassifications by ACs: \citet{fong_machine_2021}'s GMM calibration method, \citet{zhang_how_2021}'s pseudo-likelihood model, and approaches that promise greater generality—multiple imputation \citep{blackwell_multiple_2012} and likelihood modeling \citep{carroll_measurement_2006}.
%Measurement error is a vast and deep subject in statistics. We recommend \citet{carroll_measurement_2006} as a graduate-level textbook on the subject.
In the interest of clarity, we introduce some notation in this section. Say $X$ is the covariate that is automatically classified, and $X^*$ is a sample of validation data. The automatic classifications are $W$, $Z$ is a second covariate, and $Y$ is the outcome.
To illustrate, consider an idealized example study from social media research: whether someone breaks a rule on a social media site and how long it takes for them to be banned.
This study might analyze the regression model $Y = B_0 + B_1 X + B_2 Z + \varepsilon$ where $Y$ is the (log-scaled) time until an account is banned, $X$ is whether the account broke a rule, and $Z$ is a covariate related to the account's reputation, such as the number of posts. Humans can observe whether an account breaks a rule, but human classifications are expensive and only available in a relatively small sample $X^*$. In contrast, an SML model can make automatic classifications $W$ for the entire dataset. But how do we correct for errors introduced by such ACs?
\emph{Regression calibration} uses observable variables, including the automatic classifications $W$ and other variables measured without error $Z$, to approximate the true value of a covariate $X$ \citep{carroll_measurement_2006}. \citet{fong_machine_2021} propose a regression calibration procedure designed for supervised machine learning that we refer to as \emph{GMM calibration} or abbreviate as GMM.\footnote{\citet{fong_machine_2021} describe their method within an instrumental variable framework, but it is equivalent to regression calibration and regression calibration is the standard term in measurement error literature.} For their calibration model, \citet{fong_machine_2021} use 2-stage least squares (2SLS), regressing observable covariates $Z$ and AC predictions $W$ onto the validation data and then use the resulting model to approximate the covariate $\hat{X}$.
Next, \citet{fong_machine_2021} use the generalized method of moments (gmm) to combine the estimate based on the approximated covariate $\hat{X}$ and the estimate using the validation data $X^*$. This method makes efficient use of validation data and provides an asymptotic theory for deriving confidence intervals. The GMM method's assumptions do not include strong assumptions about the distribution of the outcome $Y$, but are still violated by differential error \citep{fong_machine_2021}. GMM, like other regression calibration techniques, is not designed to correct for misclassification in the outcome.
\emph{Multiple imputation} (MI) treats measurement error as a missing data problem because the true value of $X$ is observed in the validation data $X^*$ and missing otherwise \citep{blackwell_multiple_2012}. For example, the regression calibration step in \citet{fong_machine_2021}'s GMM method uses least squares regression to impute unobserved values of the covariate $X$. Indeed, \citet{carroll_measurement_2006} describe regression calibration when validation data are available as ``simply a poor person's imputation methodology'' (p. 70).
Like regression calibration, multiple imputation uses a model to infer likely values of possibly misclassified variables. The difference is that multiple imputation samples several (hence \emph{multiple} imputation) entire datasets filling in the missing data from the predictive probability distribution of the covariate $X$ conditional on the other variables $\{X,Y,Z\}$, then runs a statistical analysis on each of these sampled datasets and pools the results of each of these analyses \citep{blackwell_multiple_2012}. Note that $Y$ is included among the imputing variables, giving the MI approach the potential to address differential error. \citet{blackwell_multiple_2012} claim that their MI method works with differential measurement error (so long as the bias in the measurement error can be modeled) and when measurement error is in the outcome or in a covariate.
\emph{Maximum likelihood methods} (MLE) can effectively deal with measurement error in ACs by maximizing a likelihood that correctly specifies an \emph{error model} of the probability of the automatic classifications conditional on the true value and the outcome \citep{carroll_measurement_2006}.
In contrast to the GMM and the MI approach, which predict values of the mismeasured variable, the MLE method accounts for all possible values of the variable by ``integrating them out'' of the likelihood.
``Integrating out'' means adding both possible values of a binary variable to the likelihood, weighted by the likelihood of the error model.
MLE methods have two advantages in the context of ACs. First, they are quite general and can be applied to any model with a convex likelihood including generalized linear models (GLMs) and generalized additive models (GAMs).
Second, assuming the model is correctly specified, MLE estimators are fully consistent whereas regression calibration estimators are only approximately consistent \citep{carroll_measurement_2006}. Practically, this means that MLE methods can have greater statistical efficiency and require less validation data to make precise estimates.
The MLE approach is conceptually different from the GMM one. The GMM approach first imputes likely values and then runs the main analysis on imputed values. By contrast, MLE approaches estimate—all in one step—the main analysis using the full dataset and the error model estimated using only the validation data \citep{carroll_measurement_2006}.
The MLE approach is applicable both when the automatically classified variable is a covariate and when it is the outcome.
\emph{``Pseudo-likelihood''} methods (PL)—even if not always explicitly labeled this way—are another approach. \citet{zhang_how_2021} proposes a method that approximates the error model using quantities from the AC's confusion matrix—the positive and negative predictive values in the case of a mismeasured covariate and the AC's false positive and false negative rates in the case of a mismeasured outcome. Because quantities from the confusion matrix are neither data nor model parameters, \citet{zhang_how_2021}'s method is technically a ``pseudo-likelihood'' method. A clear benefit of this idea is that it only requires summary quantities derived from validation data. It can thus be applied when validation data are unavailable. We will discuss likelihood methods in greater depth in the presentation of our MLE framework below.
Statisticians have studied other methods for correcting measurement error that we do not test in our simulations including simulation extrapolation, Bayesian estimation, and score function methods. As we argue in Appendix \ref{appendix:other.methods} of our Supplement, these approaches are not advantageous for correcting misclassification when validation data is available.
\subsection{Proposing a Likelihood Modeling Approach to Correct Misclassification}
% This section basically translates Carroll et al. for a technically advanced 1st year graduate student.
We now elaborate on our likelihood modeling approach
by applying \citet{carroll_measurement_2006}'s presentation of the general statistical theory of likelihood modeling for measurement error correction to the context of binary classification when validation data is available.
The idea is to use an \emph{error model} of the conditional probability of the automatic classifications given the true classifications and other variables on which automatic classifications depend.
In other words, the error model estimates the conditional probability mass function of the automatic classifications.
% When a variable is measured with error, this error introduces uncertainty. The overall idea of correcting an analysis with a mismeasured variable through likelihood modeling is to use
Including the error model in the likelihood effectively accounts for uncertainty of the true classifications and, assuming the error model gives consistent estimates of the conditional probability of the automatic classifications given the true values, is sufficient to obtain consistent estimates using MLE \citep{carroll_measurement_2006}. The MLE approach is particularly well-suited to misclassification by ACs because it can be quite straightforward to fit the error model when the mismeasured variable is discrete.
\subsubsection{When an Automatic Classifier Predicts a Covariate}
Say we want to fit the linear regression model $Y=B_0 + B_1 X + B_2 Z + \varepsilon$ and an AC makes classifications $W$ that predict the discrete covariate $X$—for instance, whether a message by a social media account broke a rule according to an AC, to then explain the time until the account is banned.
Maximizing $\mathcal{L}(\Theta|Y,W)$, the likelihood of parameters $\Theta$ given data $W$ and $Y$, can jointly fit the regression model of $Y$ having parameters $\Theta_Y= \{B_0, B_1, B_2\}$ and an error model of $W$ because $P(Y,W|\Theta)$,
the joint probability of $Y$ and $W$, can be factored into the product of three terms: $P(Y|\Theta_Y)$, the regression model we want to fit, $P(W|X,Y)$, the error model, and $P(X|Z)$, a model for the probability of $X$.
Therefore, calculating these three conditional probabilities is sufficient to calculate the joint probability of the outcome and automatic classifications and obtain a consistent estimate despite misclassification.
For instance, we can assume that the probability of $W$ follows a logistic regression model of $Y$, $X$ and $Z$ and that the probability of $X$ follows a logistic regression model of $Z$. In this case, the likelihood model below is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\} = \{\{B_0, B_1, B_2\}, \{\alpha_0, \alpha_1, \alpha_2\}, \{\gamma_0, \gamma_1\}\}$.
\begin{align}
\mathcal{L}(\Theta | Y, W) &= \prod_{i=0}^{N}\sum_{x} {P(Y_i| X_i, Z_i, \Theta_Y)P(W_i|X_i, Y_i, Z_i, \Theta_W)P(X_i|Z_i, \Theta_X)} \label{eq:covariate.reg.general}\\
P(Y_i| X_i, Z_i, \Theta_Y) &= \phi(B_0 + B_1 X_i + B_2 Z_i) \\
P(W_i| X_i, Y_i, Z_i, \Theta_W) &= \frac{1}{1 + e^{-(\alpha_0 + \alpha_1 Y_i + \alpha_2 X_i)}} \label{eq:covariate.logisticreg.w} \\
P(X_i | Z_i, \Theta_X) &= \frac{1}{1 + e^{-(\gamma_0 + \gamma_1 Z_i)}}
\end{align}
\noindent $\phi$ is the normal probability distribution function. Note that Equation \ref{eq:covariate.reg.general} models differential error taking the form of a linear relationship between $W$ and $Y$. When error is nondifferential, the dependence between $W$ and $Y$ can be removed from Equations \ref{eq:covariate.reg.general} and \ref{eq:covariate.logisticreg.w}.
Calculating the three conditional probabilities in practice requires specifying models on which validity of the method depends.
This framework is very general and a wide range of probability models, such as generalized additive models (GAMs) or Gaussian process classification, may be used to estimate $P(W|X,Y)$ and $P(X|Z)$ \citep{williams_bayesian_1998}.
For simplicity, we proceed with a focus on linear regression for the probability of $Y$ and logistic regression for the probability of $W$ and the probability of $X$.
\subsubsection{When an Automatic Classifier Predicts the Outcome}
We now turn to the case when an AC makes classifications $W$ that predict the discrete-valued outcome $Y$—for example, using an automatic classifier that predicts whether social media users break rules in order to test hypotheses about why they do so.
This case is simpler than the case above where an automatic classifier is used to measure a covariate $X$ because there is no need to specify a model for the probability of $X$.
If we assume that the probability of $Y$ follows a logistic regression model of $X$ and $Z$, and allow $W$ to be biased and directly depend on $X$ and $Z$, then maximizing the following likelihood is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W\} = \{\{B_0, B_1, B_2\},\{\alpha_0, \alpha_1, \alpha_2, \alpha_3\}\}$.
\begin{align}
\mathcal{L}(\Theta|Y,W) &= \prod_{i=0}^{N} {\sum_{x}{P(Y_i | X_i, Z_i, \Theta_Y)P(W_i|X_i, Z_i, Y_i, \Theta_W)}} \label{eq:depvar.general}\\
P(Y_i| X_i, Z_i, \Theta_Y) &= \frac{1}{1 + e^{-(B_0 + B_1 X_i + B_2 Z_i)}} \\
P(W_i | Y_i, X_i, Z_i, \Theta_W) &= \frac{1}{1 + e^{-(\alpha_0 + \alpha_1 Y_i + \alpha_2 X_i + \alpha_3 Z_i)}} \label{eq:depvar.w}
\end{align}
If the AC's errors are conditionally independent of $X$ and $Z$ given the model for $W$ then the dependence of $W$ on $X$ and $Z$ can be omitted from equations \ref{eq:depvar.general} and \ref{eq:depvar.w}.
Additional details are available in Appendix \ref{appendix:derivation} of the Supplement.
% TODO: bring back once appendix is ready.
% as we demonstrate in Appendix \ref{appendix:lit.review} .
\section{Simulation Design}
% \TODO{Create a table summarizing the simulations and the parameters.}
In this section, we present four Monte Carlo simulations (\emph{Simulations 1a}, \emph{1b}, \emph{2a}, and \emph{2b}) to evaluate existing methods (GMM, MI, PL) as well as our approach (MLE) for correcting statistical inference when a variable is measured by an error-prone AC. We first describe the set-up of our Monte Carlo simulations before delving into the four prototypical scenarios we identified via our literature review and therefore simulated.
\subsection{Parameters of the Monte Carlo simulations}
Monte Carlo simulations are a common tool for evaluating statistical methods, including (automated) content analysis \citep[e.g.][]{song_validations_2020,bachl_correcting_2017,geis_statistical_2021, fong_machine_2021,zhang_how_2021}.
A Monte Carlo simulation defines a model of study design in terms of a data generating process from which datasets are repeatedly sampled. Running an analysis on each sampled dataset provides an empirical distribution of the results the analysis would obtain over study replications. The method affords exploration of finite-sample performance, robustness to assumption violations, comparison across several methods, and ease of interpretability \citep{mooney_monte_1997}.
For each prototypical scenario, we ran up to six analyses. Four of these test error correction methods: \emph{GMM calibration} (GMM) \citep{fong_machine_2021}, \emph{multiple imputation} (MI) \citep{blackwell_multiple_2012}, \emph{Zhang's pseudo-likelihood model} (PL) \citep{zhang_how_2021}, and our \emph{likelihood modeling} (MLE) approach. GMM is not designed for the case when an automatically classified variable is the outcome, so we omit this method in \emph{Simulations 2a} and \emph{2b}. We compare error correction methods to two other approaches: the \emph{feasible} estimator in which researchers abstain from using ACs by using only perfectly accurate manually annotated validation data (i.e., cases where manual coders agree on codes)
%and illustrates the motivation for using an AC in these scenarios—validation alone provide insufficient statistical power for a sufficiently precise hypothesis test.
and the \emph{naïve} estimator, representative of common practice, where researchers use AC- based classifications $W$ as a stand-ins for $X$.
We repeat each simulation with different amounts of automatically classified data (ranging from \Sexpr{min(N.sizes)} to \Sexpr{max(N.sizes)} observations) and human labeled data (ranging from \Sexpr{min(m.sizes)} to \Sexpr{max(m.sizes)}
observations).
The \emph{naïve} estimator thus fits Model \ref{mod:measerr.ols}, in which the classification error $\xi$ enters the regression:
\begin{equation}
Y= B_0^* + B_1^*W + B_2^*Z + \varepsilon^* = B_0^* + B_1^*(X + \xi) + B_2^*Z + \varepsilon^*
\label{mod:measerr.ols}
\end{equation}
We evaluate each analytical approach in terms of \emph{consistency}, whether the estimates of parameters $\hat{B_X}$ and $\hat{B_Z}$ have expected values nearly equal to the true values $B_X$ and $B_Z$; \emph{efficiency}, how precisely the parameters are estimated and how precision improves with additional automatically classified or human labeled data; and \emph{uncertainty quantification}, how well the 95\% confidence intervals provided by each method approximate the confidence interval of parameter estimates across Monte Carlo simulations.
%These simulations are designed to verify that error correction methods from prior work are effective in ideal scenarios and to create the simplest possible cases where these methods are inconsistent. Showing how prior methods fail is instructive for understanding how our MLE approach does better both in these artificial simulations and in practical projects.
We use the \texttt{predictionError} R package \citep{fong_machine_2021} for the GMM method, the \texttt{Amelia} R package for the MI approach, and the \texttt{optim()} R function for implementing \citet{zhang_how_2021}'s PL approach and our approach.
\subsection{Four Prototypical Scenarios}
We simulate regression models with two covariates ($X$ and $Z$). This sufficiently constrains our study's scope but is general enough to be applied in a wide range of research studies.
%Simulating studies with two covariates lets us study how measurement error in one covariate can cause bias in coefficient estimates of other covariates.
Whether the methods we evaluate below are effective or not depends on the conditional dependence structure among the covariates, the outcome $Y$, and the model predictions $W$.
This structure determines whether covariate measurement error is differential and whether outcome measurement error is systematic \citep{carroll_measurement_2006}.
We illustrate our simulated scenarios using Bayesian networks to represent the conditional dependence structure of the variables in Figure \ref{bayesnets}
\citep{pearl_fusion_1986}.
%In these figures, an edge between two variables indicates that they have a direct relationship. Two nodes that are not neighbors are statistically independent given the variables between them on the graph. For example, in Figure \ref{fig:simulation.1a}, the automatic classifications $W$ are conditionally independent of $Y$ given $X$ because all paths between $W$ and $Y$ contain $X$. This indicates that the model $Y=B_0 +B_1 W+ B_2 Z$ (the \emph{naïve estimator}) has non-differential error because the automatic classifications $W$ are conditionally independent of $Y$ given $X$. However, in Figure \ref{fig:simulation.1b}, there is an edge between $W$ and $Y$ to indicate that $W$ is not conditionally independent of $Y$ given other variables. Therefore, the naïve estimator has differential error.
We first simulate two cases when an AC is used to measure a covariate with and without differential error. Then, we simulate two cases where an AC is used to measure the outcome either making errors that are correlated with predictors or not.
\input{bayesnets.tex}
\subsection{Measurement Error in a Covariate (\emph{Simulations 1a} and \emph{1b})}
We consider studies with the goal of testing hypotheses about the coefficients $B_1$ and $B_2$ in the least squares regression (Model \ref{mod:true.ols}).
\begin{equation}
Y=B_0 + B_1 X + B_2 Z + \varepsilon
\label{mod:true.ols}
\end{equation}
In this example, $Y$ is a continuous variable, $X$ is a binary variable measured with an AC, and $Z$ is a normally distributed variable with mean 0 and standard deviation \Sexpr{sim1.z.sd} measured without error.
For example, $Y$ could be the time until an account on an online forum is banned, $X$ whether a message breaks one of the forum's rules, and $Z$ the account's reputation score. $X$ and $Z$ are negatively correlated because high-reputation accounts may be less likely to break rules.
%$Z$ can indicate if the message is in German or English, the two possible languages in the hypothetical study.
Say that human content coders can observe $X$ perfectly, but each observation is so expensive that observing $X$ for a large sample is infeasible.
%Instead, the human coders can measure $X$ without error for a subsample of size $m << N$.
To scale up content analysis, an SML-based AC makes predictions $W$ of $X$—for instance predicting if any of the messages from that social media user break the rules.
Both scenarios have a normally distributed outcome $Y$ and two binary-valued covariates $X$ and $Z$, which are balanced ($P(X)=P(Z)=0.5$) and correlated (Pearson's $\rho=\Sexpr{round(sim1a.cor.xz,2)}$). Simulating balanced covariates serves simplicity so that accuracy is adequate to quantify the predictive performance of our simulated classifier. Simulating correlated covariates is helpful to study how misclassification in one variable affects parameter inference in other covariates.
To represent a research study design where automated classification is needed to obtain sufficient statistical power, $Z$ and $X$ can explain only \Sexpr{format.percent(sim1.R2)} of variance in $Y$.
% TODO, bring back when these simulations are in the appendix.
%Additional simulations in appendix \ref{appendix:sim1.imbalanced} show results for variations of \emph{Simulation 1} with imbalanced covariates explaining a range of variances, different classifier accuracies, heteroskedastic misclassifications and deviance from normality in the an outcome $Y$.
In \emph{Simulation 1a}, visualized in Figure \ref{fig:simulation.1a}, we simulate an AC with \Sexpr{format.percent(sim1a.acc)} accuracy to reflect a situation where $X$ may be difficult to predict, but an automated classifier, represented as a logistic regression model having linear predictor $W^*$, provides a useful signal. The \emph{naïve estimator} has classical and nondifferential measurement error because $W=X+\xi$, where $\xi$ is normally distributed with mean $0$ and $\xi$ is conditionally independent of $Y$ given $X$ and $Z$ ($P(\xi| Y,X,Z) = P(\xi|X,Z)$).
%For simplicity, the AC's errors $\xi$ are independent of all other variables. In Appendix F, we demonstrate that the methods we study perform similarly when $\xi$ is heteroskedastic, correlated with $X$ or $Z$. Note that heteroskedasticity does not imply differential error. Suppose, for example, that AC's accuracy predicting rule violations $W$ depends on language $Z$. As a result, $\xi$ and $Z$ are correlated, and since time-till-ban $Y$ and repuation $Z$ are also correlated, $\xi$ is in turn correlated with $Y$. Despite this, the error in Model \ref{mod:measerr.ols} remains nondifferential, because $Y$ is conditionally independent of $\xi$ given $Z$ and $X$.
% Measuring $X$ is expensive, perhaps requiring trained human annotators, but an automated classifier can predict $X$ with We choose this level of accuracy to reflect a situation where $X$ may be difficult to predict
% The classifier, perhaps a proprietary API, has unobservable features $K$. The classifier's predictions $W=X + \xi$ are unbiased—the errors $\xi$ are not correlated with $Y$,$X$ or $Z$. Figure \ref{fig:simulation.1} shows a Bayesian network representing \emph{Simulation 1}'s conditional dependencies of $Z$, $Y$, $K$, $Z$ and $W$ as a directed acyclic graph (DAG).
% \emph{Simulation 2} extends \emph{Simulation 1} by making the automated classifier classification errors $\xi$ that are correlated with $Y$ even after accounting for $Z$ and $x$.
In \emph{Simulation 1b} visualized in Figure \ref{fig:simulation.1b}, the AC's predictions directly depend on the outcome $Y$, so we can test error correction methods in the presence of differential error.
We create this dependence by simulating an AC with $\Sexpr{format.percent(sim1b.acc)}$ accuracy that makes predictions $W$ that are negatively correlated with the residuals of the linear regression of $X$ and $Z$ on $Y$ (Pearson's $\rho=\Sexpr{round(sim1b.cor.resid.w_pred,2)}$). As a result, this AC makes fewer false-positives and more false-negatives at greater levels of $Y$. Although the false-negative rate of the AC is \Sexpr{format.percent(sim1b.fnr)} overall, when $Y \leq 0$ the false-negative rate is only \Sexpr{format.percent(sim1b.fnr.y0)}, but when $Y \geq 0$ it rises to \Sexpr{format.percent(sim1b.fnr.y1)}.
%Figure \ref{fig:simulation.1b} shows a Bayesian network representing conditional dependencies of $Z$, $Y$, $Z$ and $W$ in \emph{Simulation 1b}.
These simulations are prototypical of an AC that influences behavior in a system under study such as if community moderators use ACs to identify rule-breakers and correct their behavior. False negatives may cause delays in moderation increasing $Y$ (time-until-ban), while false-positives could draw moderator scrutiny and cause them to issue speedy bans.
This mechanism is not mediated by observable variables such as reputation ($Z$) or the true rule-breaking ($X$). Therefore, Model \ref{mod:measerr.ols} has differential error.
\subsection{Measurement Error in the Outcome (Simulation 2a and 2b)}
We then simulate using an AC to measure the dependent variable $Y$, a binary covariate $X$, and a continuous covariate $Z$. For example, $Y$ describes whether a message is rule-breaking, $X$ whether the user leaving the message has been warned by moderators, and $Z$ a reputation score. The goal is to estimate $B_1$ and $B_2$ in the following logistic regression model:
\begin{equation}
P(y) = \frac{1}{1 + e^{-(B_0 + B_1 x + B_2 z)}}
\label{mod:measerr.logit}
\end{equation}
\noindent As was true for $X$ in \emph{Simulation 1}, human coders can observe $Y$, but at considerable expense, and an AC makes predictions $W = Y + \xi$.
\emph{Simulation 2a} (visualized in Figure \ref{fig:simulation.2a}) and \emph{Simulation 2b} (visualized in Figure \ref{fig:simulation.2b}) implement these scenarios. Here, $X$ and $Z$ are balanced ($P(X)=P(Z)=0.5$) and correlated (Pearson's $\rho=\Sexpr{round(sim2a.cor.xz,2)}$).
As in \emph{Simulation 1} we simulate scenarios where an AC is of practical use to estimate subtle relationships. In \emph{Simulation 1} we chose the variance of the normally distributed outcome given our chosen coefficients $B_X$ and $B_Z$, but this is not appropriate for \emph{Simulation 2}'s logistic regression so we choose, somewhat arbitrarily, $B_X=\Sexpr{sim2.Bx}$ and $B_Z=\Sexpr{sim2.Bz}$.
Again, we simulate ACs with moderate predictive performance.
The AC in \emph{Simulation 2a} is \Sexpr{format.percent(sim2a.AC.acc)} accurate and the AC in \emph{Simulation 2b} is \Sexpr{format.percent(sim2b.AC.acc)} accurate. In \emph{Simulation 2a}, the predictions $W$ are unbiased because classification errors $\xi$ have mean $0$ and are independent of covariates $X$ and $Z$. However, in \emph{Simulation 2b} the predictions are biased because their errors $\xi$ are correlated with $Z$ (Pearson's $\rho = \Sexpr{round(sim2b.error.cor.z,2)}$).
One way such a correlation might obtain in our example of online moderation is if community members are adept at skirting the rules without violating them. Such members are both likely to be warned by moderators and also to leave messages misclassified as rule-breaking.
\section{Simulation Results}
We visualize the consistency, efficiency, and the accuracy of uncertainty quantification of each method in each prototypical scenario.
%Our main results are presented as plots visualizing the consistency (i.e., does the method, on average, recover the true parameter?), efficiency (i.e., how precise are estimates and does precision improve as sample size increases?), and the accuracy of uncertainty quantification of each method in each scenario.
For example, Figure \ref{fig:sim1a.x} visualizes results for \emph{Simulation 1a}. Its subplots each show a simulation with a given total sample size (No. observations) and validation sample size (No. validation data).
To understand how each plot visualizes the consistency of estimators, see for instance the leftmost column in the bottom-left subplot illustrating performance of the naïve estimator using AC classifications $W$ to stand in for the true variable $X$. The center of the black circle locates the expected value of the point estimate over our \Sexpr{n.simulations} simulations. For the naïve estimator in Figure \ref{fig:sim1a.x}, the circle is far below the dashed line which shows the true value of $B_X$, indicating that misclassification causes a dramatic bias toward 0 and that the estimator is inconsistent.
To assess efficiency, we mark with black lines the region in which the point estimate falls in 95\% of the simulations.
These black lines in the bottom-left subplot of Figure \ref{fig:sim1a.x} for example show that the feasible estimator, which uses only perfectly accurate validation data, is consistent but less precise than the estimates from correction methods that use both automatic classifications and human-labeled data.
The accuracy of the method's uncertainty quantification can be seen by comparing the gray lines, which show for each method the expected value of its approximate 95\% confidence intervals over the \Sexpr{n.simulations} simulations for each method, to the neighboring black lines.
The \emph{PL} column in the bottom-left subplot of Figure \ref{fig:sim1a.x} shows that the method's 95\% confidence interval is biased towards 0 when the number of human labels is low. This result is expected because the method does not account for uncertainty in misclassification probabilities estimated using the sample of true classifications.
Now that we have explained how to interpret our plots, we will unpack them for each simulated scenario.
\subsection{Simulation 1a: When Misclassifications Are Independent of the Outcome}
\begin{figure}
<<example1.x,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
# Figure for Simulation 1a (nondifferential error in covariate X):
# assemble the grid of subplots comparing estimators of B_X.
# plot.simulation.iv() is a project helper (resources/functions.R);
# presumably plot.df.example.1 holds pooled simulation results -- confirm
# in resources/real_data_example.R / variables.R.
p <- plot.simulation.iv(plot.df.example.1, iv='x')
# Render the assembled grob into the knitr figure device.
grid.draw(p)
@
\caption{Estimates of $B_X$ in multivariate regression with $X$ measured using machine learning and model accuracy independent of $X$, $Y$, and $Z$. All methods, except the pseudo-likelihood method, obtain precise and accurate estimates given sufficient validation data. \label{fig:sim1a.x}}
\end{figure}
As visualized in Figure \ref{fig:sim1a.x}, the naïve estimator is severely biased in its estimation of $B_X$ in \emph{Simulation 1a}.
Fortunately, error correction methods including our MLE method as well as the GMM and MI approach produce consistent estimates and acceptably accurate confidence intervals.
Notably, the PL method is inconsistent and considerable bias remains when the number of human classifications is much less than the total number of observations. The most likely source of this inconsistency is that $P(X=x)$ is missing from the pseudo-likelihood as can be seen by comparing Equation \ref{eq:mle.covariate.chainrule.4} in our Supplement to Equations 24-28 from \citet{zhang_how_2021}. The bottom row of Figure
\ref{fig:sim1a.x} shows that the precision of MLE and GMM estimates increases in larger datasets.
However, this is not true for multiple imputation (MI).
Therefore, GMM calibration and MLE appear to use automatic classifications more efficiently than MI does.
%It is important to correct misclassification error even when an AC is only used as a statistical control \citep[for example]{weld_adjusting_2022}, because when a covariate $Z$ is correlated with $X$, misclassifications of $X$ cause bias in the \emph{naïve} estimates of $B_Z$, the regression coefficient of $Z$ on $Y$. As Figure \ref{fig:sim1a.z} in Appendix \ref{appendix:main.sim.plots} shows, methods that effectively correct estimates of $X$ in \emph{Simulation 1a} also correct estimates of $B_Z$.
In brief, when misclassifications cause nondifferential error, our simulations provide evidence that MLE and GMM calibration are both effective, efficient and provide accurate uncertainty quantification. These two methods complement each other since they have different assumptions and advantages. In theory, MLE depends on correctly specifying the likelihood and its robustness to incorrect specifications is difficult to analyze \citep{carroll_measurement_2006}. GMM calibration depends on an exclusion restriction instead of such distributional assumptions \citep{fong_machine_2021}.
As discussed above, MLE's advantages over GMM calibration come from the relative ease with which it can be extended to more complex statistical models such as generalized linear models (GLMs) and generalized additive models (GAMs).
Therefore, in cases similar to \emph{Simulation 1a} we recommend using both GMM and an appropriately specified MLE model.
\subsection{Simulation 1b: When Misclassifications Depend on the Outcome}
Differential error can give rise to dramatic bias that is more difficult to correct using measurement error methods.
As Figure \ref{fig:sim1b.x} shows, the naïve estimator is opposite in sign to the true parameter in \emph{Simulation 1b}.
Of the four methods we test, only the MLE and the MI approach provide consistent estimates. This is expected because these are the only two methods using the outcome $Y$ to adjust for errors in classifications. The bottom row of Figure \ref{fig:sim1b.x} shows how the precision of the MI and MLE estimates increases with additional unlabeled data. As with \emph{Simulation 1a}, MLE uses this data more efficiently than MI does. However, due to the low accuracy and bias of the AC, additional unlabeled data improves precision less than one might expect. Both methods provide acceptably accurate confidence intervals. Figure \ref{fig:sim1b.z} in the Supplement shows that as in \emph{Simulation 1a}, effective correction for misclassifications of $X$ is required to consistently estimate $B_Z$, the coefficient of $Z$ on $Y$. Looking at results from methods that do not correct differential error is useful for understanding their limitations. When few true values of $X$ are known, GMM is nearly as bad as the naïve estimator, and PL is also visibly biased. Both improve when a greater proportion of the entire dataset is labeled because they combine their AC-based estimates with the feasible estimator.
\begin{figure}
<<example2.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
# Figure for Simulation 1b (differential error in covariate X):
# same subplot grid as Simulation 1a, but using the differential-error
# simulation results (presumably in plot.df.example.2 -- confirm in the
# sourced resources/ scripts).
p <- plot.simulation.iv(plot.df.example.2, iv='x')
# Render the assembled grob into the knitr figure device.
grid.draw(p)
@
\caption{Estimates of $B_X$ in multivariate regression with $X$ measured using machine learning, where model accuracy is correlated with $X$ and $Y$. Only multiple imputation and our MLE model with a full specification of the error model obtain consistent estimates of $B_X$. \label{fig:sim1b.x}}
\end{figure}
In sum, our simulations suggest that the MLE method is the superior choice when misclassifications are not conditionally independent of the outcome given observed covariates. Although MI estimations are consistent, the method's practicality is limited by its inefficiency.
\subsection{Simulation 2a: When Random Misclassifications Affect the Outcome}
\begin{figure}
<<example3.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
#plot.df <-
# Figure for Simulation 2a (random misclassification of the outcome Y):
# plot.simulation.dv() is the dependent-variable analogue of
# plot.simulation.iv(); here it plots estimates of the coefficient for 'z'.
p <- plot.simulation.dv(plot.df.example.3,'z')
# Render the assembled grob into the knitr figure device.
grid.draw(p)
@
\caption{Estimates of $B_Z$ in \emph{Simulation 2a}, multivariate regression with $Y$ measured using an imperfect automatic classifier. Only our MLE model obtains consistent estimates.\label{fig:sim2a.x}}
\end{figure}
Ignoring misclassification in dependent variables also introduces bias as evidenced by the naïve estimator's inaccuracy illustrated in Figure \ref{fig:sim2a.x}. Our MLE method is able to correct this error and provide consistent estimates.
It is puzzling that the MI estimator is inconsistent and does not improve with more human-labeled data.
%Note that the GMM estimator is not designed to correct misclassifications in the outcome.
The PL approach is also inconsistent, especially when the validation dataset is small compared to the entire dataset, but it is closer to recovering the true parameter than the MI or naïve estimators.
Based on Figure \ref{fig:sim2a.x}, it is clear that the precision of the MLE estimator improves with the addition of unlabeled data to a greater extent than the PL estimator. The PL estimator provides only modest improvements in precision compared to the feasible estimator.
When the amount of human-labeled data is low, inaccuracies in the 95\% confidence intervals of both the MLE and PL become visible. As before, PL's inaccurate confidence intervals are due to its use of finite-sample estimates of the automatic classification probabilities.
%In both cases, the poor finite-sample properties of the fischer-information quadratic approximation contribute to this inaccuracy. In Appendix \ref{appendix:sim1.profile}, we show that the MLE method's inaccuracy vanishes when using the profile-likelihood method instead.
In brief, our simulations suggest that MLE is the best of the methods we tested when misclassifications affect the dependent variable. It is the only consistent option and more efficient than the PL method, which is almost consistent.
\subsection{Simulation 2b: When Misclassifications Affecting the Outcome Are Biased}
\begin{figure}
<<example.4.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
#plot.df <-
# Figure for Simulation 2b (outcome misclassification correlated with a
# covariate): same dependent-variable plot helper as Simulation 2a, using
# the biased-error simulation results (presumably plot.df.example.4).
p <- plot.simulation.dv(plot.df.example.4,'z')
# Render the assembled grob into the knitr figure device.
grid.draw(p)
@
\caption{Estimates of $B_Z$ in \emph{Simulation 2b}, multivariate regression with $Y$ measured using an automatic classifier that makes errors correlated with a covariate $X$. Only our MLE model with a full specification of the error model obtains consistent estimates. \label{fig:sim2b.x}}
\end{figure}
In \emph{Simulation 2b}, misclassifications in the outcome are correlated with a covariate $X$. As shown in Figure \ref{fig:sim2b.x}, this type of misclassification can cause dramatic bias in the naïve estimator.
Similar to \emph{Simulation 2a}, MI is inconsistent; however, PL is also inconsistent because it does not account for $X$ in its measurement error model.
As in \emph{Simulation 1b}, our MLE method obtains consistent estimates, but only does much better than the feasible estimator when the dataset is large.
A figure in Appendix \ref{appendix:main.sim.plots} of the Supplement shows that the precision of estimates for the coefficient of $X$ improves with additional data to a greater extent; this imprecision is thus mainly in estimating the coefficient of $Z$, the variable correlated with misclassification.
Therefore, our simulations suggest that MLE is the best method when misclassifications in the outcome are correlated with a covariate.
\section{Transparency Is Not Enough. We Can Fix It!: Recommendations for Automated Content Analyses}
``Validate, Validate, Validate'' \citep{grimmer_text_2013} is one of the guiding mantras for automated content analysis. It reminds us that ACs can produce misleading results and of the importance of steps to ascertain their validity, for instance by making misclassification rates transparent.
\citet[p.5]{grimmer_text_2013} write that
``when categories are known [...], scholars must demonstrate that the supervised methods are able to reliably replicate human coding.''
This suggests that quantifying an AC's predictive performance by comparing human-labeled validation data to automatic classifications sufficiently establishes an AC's validity and thereby the validity of downstream analyses.
Like \citet{grimmer_text_2013}, we are deeply concerned that computational methods may produce invalid evidence. In this sense, their validation mantra animates this paper. But transparency about misclassification rates via metrics such as precision or recall leaves unanswered an important question: Is comparing automated classifications to external ground truth sufficient to claim validity? Or is there something else we can do and should do? We think there is: Using statistical methods to not only quantify but also correct for misclassification. Our study provides several recommendations in this regard.
%Similar to recent work in communication science \citep{mahl_noise_2022, stoll_supervised_2020}, our goal is not only to \textit{highlight} and \textit{quantify} common pitfalls in automated content analysis applications of ACs but to also \textit{propose} constructive guidelines on the road ahead.
\subsubsection{Construct Validation Data before Building an AC}
Analyzing human-coded data for validation is often done \textit{post facto}, e.g., to calculate predictiveness metrics after an AC is built. We propose instead to collect and use manually annotated validation data \textit{ante facto}.
Practically speaking, the main reason to use an AC is feasibility, i.e., avoiding the need to label large data sets manually.
For example, a large dataset may be necessary to study a small effect and manually labeling such a dataset may be more expensive than building an AC.
In this way, ACs can be seen as a cost-saving procedure that trades the expense of manual labeling for the threats to validity posed by misclassification.
However, building an AC can also be very expensive because of the considerable costs of human annotation, software development, and computational resources needed to train ACs. Due to this often unpredictable effort, we caution researchers against building an AC unless doing so is necessary to obtain useful evidence. Instead, validation data should be used \textit{ante facto}, with researchers beginning with preliminary analysis of human-coded data from which they can discern if an AC is necessary.
In our simulations, the ``feasible estimate'' is less precise but consistent in all cases. So if fortune shines and this estimate sufficiently answers one's research question, the costs of building the AC are avoided.
If feasible estimation fails to provide convincing evidence, for example by not rejecting a null hypothesis, the human-labeled data is not wasted. It can be reused to validate the AC and account for misclassification in downstream analysis.
%One potential problem of this \textit{ante facto} approach is that conducting two statistical tests of the same hypothesis increases the chances of false discover. A simple solution to this is to adjust the significance threshold $\alpha$ for drawing conclusions from the feasible estimate. %We recommend p < .01. %That said, it might useful use an AC in a preliminary analysis, prior to collecting validation data when an AC such as one available from an API, is available for reuse and confusion matrix quantities necessary for the pseudo-likelihood (PL) method are published. Although (PL) is inconsistent when used for a covariate, this can be corrected if the true rate of $X$ can be estimated.
%Caution is still warranted because ACs can perform quite differently from one dataset to another so we recommend collecting validation representative of your study's dataset and using another appropriate method for published studies.
\subsubsection{Use Validation Data to Evaluate Differential Error}
% Let's suppose an AC is used to the feasible estimator is insufficiently informative
%There are many guides on how to train and validate ACs \citep[e.g.][]{grimmer_text_2013,van_atteveldt_validity_2021}. However, they mostly refer to performance metrics such as the F1-score or Area under the Curve (AUC). The problem with this approach is that such criteria make misclassifications transparent but do not provide information on how misclassification will affect downstream analyses and how to correct for such effects.
%One reason for this is that such criterion do not account for differential error or for correlation between misclassifications in the outcome and a regression covariate—both of which can give rise to extremely misleading statistics.
As we argued and demonstrated in our simulations, biases introduced by misclassification may not be trivial to adjust. Here, knowing whether an AC makes differential misclassifications is particularly important for downstream analyses: It determines which correction method might work best.
Fortunately, human coded data can be used to investigate differential misclassification.
For example, ``algorithmic audits'' \citep[e.g.][]{rauchfleisch_false_2020, kleinberg_algorithmic_2018} evaluate the performance of ACs across different subgroups in the data, for example when using ACs for corpora of different languages or data from different social media platforms. Differential misclassification can be ruled out if the performance is the same across all analytically relevant subgroups and other variables.
We strongly recommend using such methods to test for differential misclassification and design the measurement error model within our MLE framework. Evidence that the model effectively corrects differential error can be provided by tests of conditional independence between the automatic classifications $W$ and the outcome $Y$ given a chosen model of $P(W|Y,X,Z)$, the conditional probability of the automatic classifications given the outcome and covariates.
\subsubsection{Correct for Misclassification Errors (Twice) Instead of Being Naïve}
Across our simulations, we showed that the naïve estimator is biased. Testing different error correction methods, we found that these generate different levels of consistency, efficiency, and accuracy in uncertainty quantification. That said, our proposed MLE method should be considered as a versatile method because it is the only method capable of producing consistent estimates in prototypical situations studied here. We recommend the MLE method as the first ``go-to'' method. The method requires specifying the error model, but this can be known if one follows our second recommendation. We developed the \textbf{misclassificationmodels} R package to facilitate adoption of our MLE method (see Appendix \ref{appendix:misclassificationmodels} in our Supplement).
We recommend comparing our MLE approach to another error correction method. Consistency between two correction methods shows that results are robust independent of the choice of correction method. If the AC is used to predict the dependent variable, PL might be a reasonable choice. For cases of AC-predicted covariates, GMM calibration is a good choice if error is nondifferential. Otherwise, MI can be considered.
The range of viable choices in error correction motivates our next recommendation.
\subsubsection{Provide a Full Account of Methodological Decisions and Robustness Checks}
Finally, we add our voices to those
recommending that researchers report methodological decisions so others can understand and replicate their design \citep{pipal_if_2022, reiss_reporting_2022}. These decisions include but are not limited to choices concerning test and training data (e.g., size, sampling, split in cross-validation procedures, balance), manual annotations (size of manually annotated data, number of coders, intercoder values, size of data coded for intercoder testing), and the classifier itself (choice of algorithm or ensemble, different accuracy metrics). They extend to reporting different error correction methods as proposed by our third recommendation.
In our review, we found that reporting such decisions is not yet common, at least in the context of SML-based text classification.
When correcting for misclassification, uncorrected results will often provide a lower-bound on effect sizes; corrected analyses will provide more accurate but less conservative results.
Therefore, both corrected and uncorrected estimates should be presented as part of making potential multiverses of findings transparent.
% we
% To report instead of hiding methodological decisions and related uncertainty that may emerge in generated results,
We realize that researchers might need to cut methodological information, especially for empirical studies, to conform to either word limits or reviewers. If word limitations are the problem, this information could be reported in appendices.
% Here, the field might consider adopting ---or adapting--- machine learning reporting standards such as DOME (Computational Biology) and PRIME (Diagnostic medicine).
\section{Conclusion and Limitations}
We introduced the often-ignored problem of misclassification in automated content analysis, a topic often discussed in the context of manual content analysis \citep{scharkow_how_2017}, but that we believe has not attracted enough attention within the computational social science community. In a systematic review of SML applications, we show that scholars rarely acknowledge this problem. We therefore discuss a range of statistical methods that use manually annotated validation data as a ``gold standard'' to account for misclassification and produce correct statistical results, including a new MLE method we design. Using Monte-Carlo simulations, we show that our method provides consistent estimates, especially in less trivial situations involving differential error. Based on these results, we provide four recommendations for the future of automated content analysis: Researchers should (1) construct manually annotated validation data before running ACs to see whether using human-labeled data is sufficient, (2) use validation data to test for differential error and choose error correction methods (3) correct for misclassifications via more than one error correction method, and (4) be transparent about the methodological decisions involved in SML-based classifications and error correction.
Our study has several limitations. First, the simulations and methods we introduce focus on misclassification by automated tools. They provisionally assume that human coders do not make errors.
This assumption can be reasonable if intercoder reliability is very high but this may not always be the case.
%Alternatively, validation data can be treated as a gold standard if the goal is measuring \emph{how a person categorizes content}, as opposed to the more common approach of measuring presumably objective content categories. That said, the prevailing approaches in content analysis use human coders to measure a latent category who are prone to misclassification.
Thus, it may be important to account for measurement error by human classifiers and by automatic classifiers simultaneously. In theory, it is possible to extend our MLE approach in order to do so \citep{carroll_measurement_2006}.
However, because the true values of content categories are never observed, accounting for automatic and human misclassification at once requires latent variable methods that bear considerable additional complexity and assumptions \citep{pepe_insights_2007}. We leave the integration of such methods into our MLE framework for future work. Second, the simulations we present do not consider a number of factors that may influence the performance and robustness of the methods we test including classifier accuracy, heteroskedasticity, and violations of distributional assumptions. We are working to investigate such factors by extending our simulations. We simulated datasets with balanced covariates, but classifiers are often used to measure rare occurrences. Imbalanced covariates will require greater sample sizes of validation data to correct misclassification bias.
In such cases, validation data may be collected more efficiently using approaches that provide balanced, but unrepresentative samples.
Such non-representative sampling requires correction methods to account for the probability that a datapoint will be sampled, but we have not evaluated whether the correction methods can do so.
\setcounter{biburlnumpenalty}{9001}
\printbibliography[title = {References}]
\clearpage
\appendix
\section{Perspective API Example}\label{appendix:perspective}
The civil comments dataset represented the human-coded variables we analyzed as proportions of annotators who labeled a comment as ``toxic'' or as disclosing each of several aspects of personal identity including race and ethnicity.
For the purposes of this exercise, we convert the annotation proportions into indicators of the majority view. The dataset also includes counts of ``reactions'' (e.g., `funny', `like', `sad') to each comment.
Our maximum-likelihood based error correction technique in this example requires specifying models for the Perspective's scores and, in the case where these scores are used as a covariate, a model for the human annotations. In our first example, where toxicity was used as a covariate, we used the \emph{human annotations}, \emph{identity disclosure}, and the interaction of these two variables in the model for scores. We omitted \emph{likes} from this model because they are virtually uncorrelated with misclassifications (Pearson's $\rho=\Sexpr{iv.example[['civil_comments_cortab']]['toxicity_error','likes']}$). Our model for the human annotations is an intercept-only model.
In our second example, where toxicity is the outcome, we use the fully interacted model of the \emph{human annotations}, \emph{identity disclosure}, and \emph{likes} in our model for the human annotations because all three variables are correlated with the Perspective scores.
\section{Systematic Literature Review} \label{appendix:lit.review}
To understand scholarly awareness of measurement errors, we conducted a systematic literature review of common practices in SML-based text classification.
\subsection{Identification of Relevant Studies}
To identify relevant studies, we relied on four recent reviews on the use of AC with a focus on communication science \citep{baden_three_2022, hase_computational_2022, junger_unboxing_2022, song_validations_2020}. We contacted authors of respective studies who, thankfully, either already published their data in an open-science approach or shared their data with us when asked.
Based on their reviews, we collected \emph{N} = 110 studies that, according to their analyses, included some type of SML (for an overview, see Figure \ref{fig:FigureA1}).
\begin{figure}
\centering
\includegraphics{measurement_flow.pdf}
\caption{Identifying relevant studies for the literature review}
\label{fig:FigureA1}
\end{figure}
We first removed 8 duplicate studies identified by several reviews. Two coders then coded the remaining \emph{N} = 102 studies of our preliminary sample for relevance. After an intercoder test (\emph{N} = 10, $\alpha$ = .89), coders sorted studies into one of four categories: Similar to previous reviews \citep{hase_computational_2022}, we only included studies either focusing on methodologically advancing SML-based ACs (Code = 1) or applying the method in empirical studies (Code = 2). In contrast, we removed studies that did not include any SML approach (Code = 3) or only used SML-based text classification for data cleaning, not data analysis (Code = 4)—for instance to sort out topically irrelevant articles.
Subsequently, \emph{N} = 69 studies remained in our sample of relevant articles. Out of these, only empirical studies (\emph{N} = 48) were coded in further detail. We explicitly excluded methodological studies when analyzing common practices within SML-based text classification since these will likely include far more robustness and validity tests than commonly employed in empirical settings.
\subsection{Manual Coding of Relevant Empirical Studies}
For the remaining \emph{N} = 48 empirical studies, we created a range of variables (for an overview, see Table \ref{tab:TableA1}). Based on data from the Social Sciences Citation Index (SSCI), we identified whether studies were published in journals classified as belonging to \emph{Communication} and their \emph{Impact} according to their H index. In addition, two coders manually coded...
\begin{itemize}
\item the type of variables created via SML-based ACS using the variables \emph{Dichotomous} (0 = No, 1 = Yes), \emph{Categorical} (0 = No, 1 = Yes), \emph{Ordinal} (0 = No, 1 = Yes), \emph{Metric} (0 = No, 1 = Yes),
\item whether variables were used in descriptive or multivariate analyses using the variables \emph{Descriptive} (0 = No, 1 = Yes), \emph{Independent} (0 = No, 1 = Yes), \emph{Dependent} (0 = No, 1 = Yes),
\item how classifiers were trained and validated via manually annotated data using the variables \emph{Size Training Data} (Open String), \emph{Size Test Data} (Open String), \emph{Size Data Intercoder Test} (Open String), \emph{Intercoder Reliability} (Open String), \emph{Accuracy of Classifier} (Open String),
\item and whether articles mentioned and/or corrected for misclassifications using the variables \emph{Error Mentioned} (0 = No, 1 = Yes) and \emph{Error Corrected} (0 = No, 1 = Yes).
\end{itemize}
\begin{table}
\caption{Variables Coded for Relevant Empirical Studies}
\label{tab:TableA1}
\begin{tabular}{l l l l} \toprule
Category & Variable & Krippendorff's $\alpha$ & \% or \emph{M} (\emph{SD}) \\ \midrule
Type of Journal & \emph{Communication} & n.a. & 55.1\% \\
& \emph{Impact} & n.a. & \emph{M = 3.69} \\
Type of Variable & \emph{Dichotomous} & 0.86 & 50\% \\
& \emph{Categorical} & 1 & 22.9\% \\
& \emph{Ordinal} & 0.85 & 10.4\% \\
& \emph{Metric} & 1 & 35.4\% \\
Use of Variable & \emph{Descriptive} & 0.89 & 89.6\% \\
& \emph{Independent} & 1 & 43.8\% \\
& \emph{Dependent} & 1 & 39.6\% \\
Information on Classifier & \emph{Size Training Data} & 0.95 & 66.7\% \\
& \emph{Size Test Data} & 0.79 & 52.1\% \\
& \emph{Size Data Intercoder Test} & 1 & 43.8\% \\
& \emph{Intercoder Reliability} & 0.8 & 56.2\% \\
& \emph{Accuracy of Classifier} & 0.77 & 85.4\% \\
Measurement Error & \emph{Error Mentioned} & 1 & 18.8\% \\
& \emph{Error Corrected} & 1 & 2.1\% \\ \bottomrule
\end{tabular}
\end{table}
\subsection{Results}
Overall, more than half of all studies were published in communication journals (\emph{Communication}: 55.1\%). Across domains, SML-based ACs were most often used to create dichotomous measurements (\emph{Dichotomous}: 50\%), followed by variables on a metric (\emph{Metric}: 35.4\%), categorical (\emph{Categorical}: 22.9\%), or ordinal scale (\emph{Ordinal}: 10.4\%). Almost all studies used SML-based classifications to report descriptive statistics on created variables (\emph{Descriptive}: 89.6\%). However, many also used these in downstream analyses, either as dependent variables (\emph{Dependent}: 39.6\%) or independent variables (\emph{Independent}: 43.8\%) in multivariate models. When regressing the use of multivariate models for each variable on the status of journals in which respective studies were published (\emph{Impact}) via a mixed model where variables are nested in studies and journals, we find that both correlate: The use of multivariate modeling is more widespread in high-impact journals (\emph{B} = 13.525, \emph{p} < .001).
Overall, we found a persistent lack of transparency in reporting important information: Only slightly more than half of all studies included information on, for instance, the size of training or test sets (\emph{Size Training Data}: 66.7\%, \emph{Size Test Data}: 52.1\%). Even fewer included information on the size of manually annotated data for intercoder testing (\emph{Size Data Intercoder Test}: 43.8\%) or respective reliability values (\emph{Intercoder Reliability}: 56.2\%). Lastly, not all studies reported how well their classifier performed by using metrics such as precision, recall, or F1-scores (\emph{Accuracy of Classifier}: 85.4\%).
Lastly, we also found that few studies mentioned the issue of misclassification or measurement errors (\emph{Error Mentioned}: 18.8\%), with only a single study correcting for such errors (\emph{Error Corrected}: 2.1\%).
\section{Other methods not tested}
\label{appendix:other.methods}
Simulation extrapolation (SIMEX) uses a simulation of the process generating measurement error to model how measurement error affects an analysis and ultimately to approximate an analysis with no measurement error \citep{carroll_measurement_2006}. SIMEX is a very powerful and general method that can be used without validation data, but may be more complicated than necessary to correct measurement error from ACs when validation data are available. Likelihood methods are easy to apply to classification errors so SIMEX seems unnecessary \citep{carroll_measurement_2006}.
Score function methods derive estimating equations for models without measurement error and then solve them either exactly or using numerical integration \citep{carroll_measurement_2006, yi_handbook_2021}.
The main advantage that score function methods may have over likelihood-based methods is that they do not require distributional assumptions about the mismeasured covariates. This advantage has limited use in the context of ACs because binary classifications must follow Bernoulli distributions.
We also do not consider Bayesian methods (aside from the Amelia implementation of multiple imputation) because we expect these to have similar limitations to the maximum likelihood methods we consider. Bayesian methods may have other advantages resulting from posterior inference, and may generalize to a wide range of applications, but specifying prior distributions introduces additional methodological complexity and posterior inference is computationally intensive, making Bayesian methods less convenient for Monte Carlo simulation.
\section{Deriving the maximum likelihood approach}
\label{appendix:derivation}
\subsection{When an AC measures a covariate}
To show why $\mathcal{L}(\Theta|Y,W)$ can be factored, we follow \citet{carroll_measurement_2006} and begin by observing the following fact from basic probability theory.
\begin{align}
P(Y,W) &= \sum_{x}{P(Y,W,X=x)}
\label{eq:mle.covariate.chainrule.1}\\
&= \sum_{x}{P(Y|W,X=x)P(W,X=x)}
\label{eq:mle.covariate.chainrule.2}\\
&= \sum_{x}{P(Y,X=x)P(W|Y,X=x)} \label{eq:mle.covariate.chainrule.3} \\
&= \sum_{x}{P(Y|X=x)P(W|Y,X=x)P(X=x)} \label{eq:mle.covariate.chainrule.4}
\end{align}
\noindent
Equation \ref{eq:mle.covariate.chainrule.1} integrates $X$ out of the joint probability of $Y$ and $W$ by summing over its possible values $x$. If $X$ is binary, this means adding the probability given $x=1$ to the probability given $x=0$. When $X$ is observed, say $x=0$, then $P(X=0)=1$ and $P(X=1)=0$. As a result, only the true value of $X$ contributes to the likelihood. However, when $X$ is unobserved, all of its possible values contribute. In this way, integrating out $X$ allows us to include data where $X$ is not observed to the likelihood.
Equation \ref{eq:mle.covariate.chainrule.2} uses the chain rule of probability to factor the joint probability $P(Y,W)$ of $Y$ and $W$ into $P(Y|W,X)$, the conditional probability of $Y$ given $W$ and $X$, and $P(W,X=x)$, the joint probability of $W$ and $X$. This lets us see how maximizing $\mathcal{L}(\Theta|Y,W)$, the joint likelihood of $\Theta$ given $Y$ and $W$, accounts for the uncertainty of the automatic classifications. For each possible value $x$ of $X$, it weights the model of the outcome $Y$ by the probability that $x$ is the true value and that the AC outputs $W$.
Equation \ref{eq:mle.covariate.chainrule.3} shows a different way to factor the joint probability $P(Y,W)$ so that $W$ is not in the model of $Y$. Since $X$ and $W$ are correlated, including $W$ in the model for $Y$ would bias the estimation of $B_1$. By including $Y$ in the model for $W$, Equation \ref{eq:mle.covariate.chainrule.3} can account for differential measurement error.
Equation \ref{eq:mle.covariate.chainrule.4} factors $P(Y,X=x)$ the joint probability of $Y$ and $X$ into $P(Y|X=x)$, the conditional probability of $Y$ given $X$, $P(W|X=x,Y)$, the conditional probability of $W$ given $X$ and $Y$, and $P(X=x)$ the probability of $X$. This shows that fitting a model $Y$ given $X$, in this framework, such as the regression model $Y = B_0 + B_1 X + B_2 Z$ requires including $X$. Without validation data, $P(X=x)$ is difficult to calculate without strong assumptions \citep{carroll_measurement_2006}, but $P(X=x)$ can easily be estimated using a sample of validation data.
%Our appendix includes supplementary simulations that explore how robust our method to model mispecification.
Equations \ref{eq:mle.covariate.chainrule.1}--\ref{eq:mle.covariate.chainrule.4} demonstrate the generality of this method because the conditional probabilities may be calculated using a wide range of probability models. For simplicity, we proceed with a focus on linear regression for the probability of $Y$ and logistic regression for the probability of $W$ and the probability of $X$. However, more flexible probability models such as generalized additive models (GAMs) or Gaussian process classification may be useful for modeling nonlinear conditional probability functions \citep{williams_bayesian_1998}.
\subsection{When an AC measures the outcome}
Again, we will maximize $\mathcal{L}(\Theta|Y,W)$, the joint likelihood of the parameters $\Theta$ given the outcome $Y$ and the automatic classifications $W$, which measure the dependent variable $Y$ \citep{carroll_measurement_2006}.
Therefore, we use the law of total probability to integrate out $Y$ and the chain rule of probability to factor the joint probability into $P(Y)$, the probability of $Y$, and $P(W|Y)$ as the conditional probability of $W$ given $Y$.
\begin{align}
P(Y,W) &= \sum_{y}{P(Y=y,W)} \\
  &= \sum_{y}{P(Y=y)P(W|Y=y)}
\end{align}
As above, the conditional probability of $W$ given $Y$ must be calculated using a model. The range of possible models is vast and analysts must choose a model that accurately describes the conditional dependence of $W$ on $Y$.
We implement these methods in \texttt{R} using the \texttt{optim} library for maximum likelihood estimation. Our implementation supports models specified using \texttt{R}'s formula syntax and can fit linear and logistic regression models when an AC measures a covariate and logistic regression models when an AC measures the outcome. Our implementation provides two methods for approximating confidence intervals: the Fisher information quadratic approximation, and the profile likelihood method provided in the \texttt{R} package \texttt{bbmle}. The Fisher approximation usually works well in simple models fit to large samples and is fast enough for practical use for the large number of simulations we present. However, the profile likelihood method provides more accurate confidence intervals \citep{carroll_measurement_2006}.
\section{misclassificationmodels: The R package} \label{appendix:misclassificationmodels}
The package provides a function to conduct regression analysis that also corrects for misclassification in the proxy using the information in validation data. The function is very similar to \texttt{glm()} but with two changes:
\begin{itemize}
\item The formula interface has been extended with the double-pipe operator to denote a proxy variable. For example, \texttt{x || w} indicates that \texttt{w} is the proxy of the ground truth \texttt{x}.
\item The validation data must be provided.
\end{itemize}
The following code listing shows a typical correction scenario:
\lstset{style=mystyle}
\begin{lstlisting}[language=R, caption=A demo of misclassificationmodels]
library(misclassificationmodels)
## research_data contains the following columns: y, w, z
## val_data contains the following columns: y, w, x, z
# w is a proxy of x
res <- glm_fixit(formula = y ~ x || w + z,
data = research_data,
data2 = val_data)
summary(res)
\end{lstlisting}
% For more information about the package, please refer to our online appendix.
\section{Additional plots from Simulations 1 and 2}
\label{appendix:main.sim.plots}
\begin{figure}
<<example1.g,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
# Render the B_Z estimate panel for simulation 1a via the shared plotting helper.
sim1a.z.panel <- plot.simulation.iv(plot.df.example.1, iv = 'z')
grid.draw(sim1a.z.panel)
@
\caption{Estimates of $B_Z$ in \emph{simulation 1a}, multivariate regression with $X$ measured using machine learning and model accuracy independent of $X$, $Y$, and $Z$. All methods obtain precise and accurate estimates given sufficient validation data.}
\end{figure}
\begin{figure}
<<example2.g, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
# Render the B_Z estimate panel for the differential-error covariate scenario.
sim1b.z.panel <- plot.simulation.iv(plot.df.example.2, iv = 'z')
grid.draw(sim1b.z.panel)
@
\caption{Estimates of $B_Z$ in multivariate regression with $X$ measured using machine learning and model accuracy correlated with $X$ and $Y$ and error is differential. Only multiple imputation and our MLE model with a full specification of the error model obtain consistent estimates of $B_X$.\label{fig:sim1b.z}}
\end{figure}
\begin{figure}
<<example3.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
# Render the B_Z estimate panel for simulation 2a (AC-measured outcome).
sim2a.z.panel <- plot.simulation.dv(plot.df.example.3, 'z')
grid.draw(sim2a.z.panel)
@
\caption{Estimates of $B_Z$ in \emph{simulation 2a}, multivariate regression with $Y$ measured using an AC that makes errors. Only our MLE model with a full specification of the error model obtains consistent estimates.}
\end{figure}
\begin{figure}
<<example.4.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
# Render the B_X estimate panel for simulation 2b (differential error in Y).
sim2b.x.panel <- plot.simulation.dv(plot.df.example.4, 'x')
grid.draw(sim2b.x.panel)
@
\caption{Estimates of $B_X$ in \emph{simulation 2b} multivariate regression with $Y$ measured using machine learning, model accuracy correlated with $Z$ and $Y$ and differential error. Only our MLE model with a full specification of the error model obtains consistent estimates.}
\end{figure}
% \section{Additional simulations}
% \subsection{Heteroskedasktic but nondifferential misclassifications}\label{appendix:sim1.hetero}
% \subsection{Imbalanced covariates}
% \label{appendix:sim1.imbalanced}
\end{document}
\subsection{Profile likelihood improves uncertainty quantification}
\label{appendix:sim1.profile}
\section{Four prototypical scenarios}
We must clearly distinguish four types of measurement error that arise in this context.
The first type occurs when a covariate is measured with error and this error can be made statistically independent of the outcome by conditioning on other covariates. In this case the error is called nondifferential.
The second type, differential error occurs when a covariate is measured with error that is systematically correlated with the outcome, even after accounting for the other covariates \citep{carroll_measurement_2006}.
These two types of error apply when an AC is used to measure a covariate.
When an AC is used to measure an outcome, errors can be random---uncorrelated with the covariates---or they can be systematic---correlated with a covariate.
Nondifferential measurement error and random error in the outcome are relatively straightforward to correct. We will argue below that differential measurement error can be avoided when an AC is carefully designed. Yet the risk of differential measurement error is considerable in such cases as multilingual text classification, because the ease of classification may systematically vary in relation to the outcome and covariates, or when a model trained in one context is applied in another.
Research using ACs based on supervised machine learning may be particularly prone to differential and systematic measurement error, as problems of bias and generalizability have drawn growing attention in the field of machine learning more generally.
%Statistical theory and simulations have shown that all these methods are effective (though some are more efficient) when ``ground-truth'' observations are unproblematic and when classifiers only make random, but not systematic, errors. We contribute by testing these methods in more difficult cases likely to arise in text-as-data studies.
%
% All prior methods for correcting measurement error using validation data presume that the validation data is error-free. However, the methodological content analysis literature has extensively studied the difficulties in human-labeling theoretically and substantively significant content categories through the lens of inter-coder reliability. We contribute novel methods that account for both inter-coder reliability and machine classification error.
Our Monte Carlo simulations show that different error-correction methods fail in different cases and that none is always the best. For example, methods that can correct for differential error will be inefficient when none is present. In addition, \citet{fong_machine_2021}'s method-of-moments estimator exchanges distributional assumptions for an exclusion restriction and fails in different cases from methods based on parametric models, such as ours.
\subsection{Our Contributions}
\begin{itemize}
\item Introduce this methodological problem to Communication Research; argue that this is not too far from ignoring disagreement in manual codings
\item Document the prevalence of automated content analysis to show the importance of the problem.
\item Summarize available statistical methods for adjusting for measurement error and bias.
\item Evaluate these methods in realistic scenarios to show when they work and when they do not.
\item Recommend best practices for applied automated content analysis.
\item Chart directions for future research to advance methods for automated content analysis.
\end{itemize}
\section{Background}
\subsection{Methods used to correct measurement error in simulation scenarios}
We'll compare the performance of these methods in terms of:
\begin{itemize}
\item Consistency: Does the method recover the true parameter on average?
\item Efficiency: How precise are the estimates? Does precision improve with sample size?
\item Robustness: Does the method work when parametric assumptions are violated?
\end{itemize}
We'll run simulations that vary along these dimensions:
\begin{itemize}
\item Explained variance (function of $B_X$, $B_Z$, and $\varepsilon$)
\item Predictor accuracy (we'll always have balanced classes).
\item Interrater reliability
\item Data type of measured variable: binary / likert
\item Distribution of other variable: normal, lognormal, binary
\item Unlabeled sample size
\item Labeled sample size
\end{itemize}
\subsection{Explanation of Bayesian Networks / Causal Dags for representing scenarios}
In this section we present the design of our simulation studies. So far I have designed the following three scenarios (though I have some work to do to polish them and fix bugs):
\subsection{Definition of MLE Models}
We model example 1 and 2,
\section{Discussion}
\citet{fong_machine_2021} argue, and we agree, that a carefully designed AC can avoid forms of measurement error that are more difficult to deal with. However, tailoring an AC from scratch requires considerable effort and expense compared to reusing an AC developed for common purposes, as the wide popularity that classifiers like LIWC and Perspective enjoy demonstrates. Our recommended approaches of GMM calibration, multiple imputation and likelihood modeling can all be conceived as fine-tuning steps that transform general purpose classifiers into tailored classifiers capable of providing reliable inferences.
A natural response to the above extended meditation on measurement error in the context of automatic classifiers is to question the purpose of using ACs at all. It seems strange to think that by using a model's predictions of a variable to build another model predicting that same variable we can solve the problems introduced by the first model. Indeed, the more complex modeling strategies we propose are only necessary to correct the shortcomings of an AC. We envision ACs such as commercial APIs, widely used dictionaries, or ACs that are generalized to new contexts as likely to have such shortcomings, because such ACs may provide information about a variable that would be difficult to obtain otherwise.
Even though machine learning algorithms such as random forests might obtain greater performance at automatic classification, this comes at the expense of bias that may be difficult to model using validation data \citep{breiman_statistical_2001}.
Instead of tailoring an AC for a research study, using predictive features directly to infer missing validation data using multiple imputation, or to model the probability of a variable in the likelihood modeling framework, may be simpler and more likely to result in valid inferences.
% A common strategy is to use a machine learning classifier $g(\mathbf{K})$ (e.g., the Perspective API) to obtain Often, researchers use the $N^*$ observations of $\mathbf{x}$ to build $\hat{\mathbf{w}}=g(\mathbf{Z})$. Other times they may use a different ``black-box'' model $g(\mathbf{Z})$ that is perhaps trained on a larger dataset different from that used to estimate $B$.
% Although it is often claimed that this bias is a conservative ``attenuation'' of estimates toward zero, this is only necessarily the case of ordinary linear regression with 2 variables when the bias is uncorrelated with $\mathbf{x}$ and $\mathbf{y}$ \citep{carroll_measurement_2006}. What's more, in conditions likely to occur in social scientific research, such as when the explained variance of the regression model is very low, the estimate of $\hat{B}^*$ can be \emph{more precise} than that of $\hat{B}$. As a result, the measurement error of a machine learning classifier is not always conservative but can result in false discovery \citep{carroll_measurement_2006}.
Note that specific forms of statistical bias are of particular concern for scientific measurement and although these may often be related to biases against social groups \cite[][e.g.]{obermeyer_dissecting_2019}, these notions of bias are not equivalent \cite{kleinberg_algorithmic_2018}. Introduce multi-lingual text classification as an example.
(attenuation bias / correlation dilution), but this bias towards zero defeats the purpose of automated content analysis in the first place!
\subsection{Rationale}
\begin{itemize}
\item Automated content analysis is all the rage. Tons of people are doing it, but they all have the same problem: their models are inaccurate. They don't know if the model is accurate enough to trust their inferences.
\item Social scientists often adopt performance criteria and standards for machine learning predictors used in computer science. These criteria do not tell how well a predictor works as a measurement device for a given scientific study.
\item In general, prediction errors result in biased estimates of regression coefficients. In simple models with optimistic assumptions this bias will be conservative (attenuation bias / correlation dilution), but this bias towards zero defeats the purpose of automated content analysis in the first place!
\item In more general scenarios (e.g., GLMs, differential error, multivariate regression), prediction errors can create bias that is not conservative.
\item Statisticians have studied measurement error for a long time, and have developed several methods, but the settings they consider most often lack features of automated content analysis. Specifically:
\begin{itemize}
\item The availability of (potentially inaccurate) validation data. (Most methods are designed for \emph{sensors} where the distribution of the error can be known, but error can be assumed to be nondifferential).
\item Differential error—the amount of noise is not independent of observations.
\item The possibility of bias in addition to noise.
\end{itemize}
\item Conducting simulations to evaluate existing methods including regression calibration, the extension of regression calibration by Fong and Taylor (2021) \cite{fong_machine_2021}, multiple imputation, and simulation extrapolation.
\item These issues become even more important, and also more complex in important research designs such as those involving multiple languages.
\subsection{Imperfect human-coded validation data}
All approaches stated above depend on the human-coded validation data $X^*$. Most often, ACs are also trained on human-coded material. The content analysis literature has long documented how unreliable human coding can be, and manual content analysis papers routinely report intercoder reliability as a result \citep{krippendorff_content_2018}. Intercoder reliability metrics typically assume that human coders are interchangeable and the only source of disagreement is ``coder idiosyncrasies'' \citep{krippendorff_reliability_2004}. A previous Monte Carlo simulation operationalizes these ``coder idiosyncrasies'' as a fixed probability that a coder makes a random guess independent of the coder and of the material \citep{geis_statistical_2021}. In this work, we accept this ``interchangeable coders making random errors'' (ICMRE) assumption. Under this optimistic assumption, only ``coder idiosyncrasies'' cause misclassification error in the validation data.
\citet{song_validations_2020}'s Monte Carlo simulation demonstrates that human-coded $X^*$ with lower intercoder reliability generates more biased estimates of the classification accuracy of the AC. So even if manual annotation errors arise only as described by the ICMRE assumption, they may bias results. None of the above correction approaches account for the imperfect human coding of $X^*$, although \citet{zhang_how_2021} identifies the omission of this as a weakness of the proposed approach. Even in the context of manual content analysis, these ``coder idiosyncrasies'' are not routinely adjusted for (although methods are available, e.g. \citet{bachl_correcting_2017}).
An advantage of our proposed method over prior approaches is that it automatically accounts for imperfection of human coding under the ICMRE assumption because the random errors in validation data are independent from the AC errors.
Precision of estimates can be improved using more than one independent coder. With two coders, for example, two sets of validation data are generated, $X^*_{1}$ and $X^*_{2}$. We then list-wise delete all data where $X^*_{1} \neq X^*_{2}$. If the ICMRE assumption holds, the deleted data, where two coders disagree, can only be due to ``coder idiosyncrasies''. As coders are assumed to be interchangeable, the probability of two interchangeable coders both making the same misclassification error is much less than the probability that one makes a misclassification error. Using such ``labeled-only, coherent-only'' (LOCO) data improves the precision of consistent estimates in our simulation.
\subsection{Measurement error in validation data}
The simulations above assume that validation data is perfectly accurate. This is obviously unrealistic because validation data, such as that obtained from human classifiers, normally has inaccuracies.
To evaluate the robustness of correction methods to imperfect validation data, we extend our scenarios with nondifferential error with simulated validation data that is misclassified \Sexpr{format.percent(med.loco.accuracy)} of the time at random.
\subsubsection{Recommendation II: Employ at Least Two Manual Coders, not One}
Independent of whether researchers use manually annotated data for the feasible approach or AC, principles of manual content analysis, including justifying one's sample size, still apply.
%\citep[for details]{krippendorff_content_2018}.
%TODO uncomment below after ICA
Arguably, the most important problem in traditional content
analysis is whether human coders are capable of reliably classifying content into the categories under study. With multiple human coders labelling the same data, metrics such as Krippendorff's $\alpha$
%and Gwet's $AC$
can quantify ``intercoder reliability'' in terms of how often coders agree and disagree \citep{krippendorff_reliability_2004}.
These metrics all assume that disagreements are due to
``coder idiosyncrasies'' that are independent of the data \citep{krippendorff_reliability_2004}.
We recommend that such metrics also be used to establish intercoder reliability in all of the human-labeled data, not only a smaller subset for intercoder testing.
Moreover, the gold standard data is also reused in later steps, and those steps can be influenced by these ``coder idiosyncrasies'' \citep{song_validations_2020}.
We recommend that the gold standard data be manually coded by two coders, not one. This allows the calculation of interrater reliability, a more accurate validation of the AC's performance, and better correction. Additional independent coders would eliminate even more of these ``coder idiosyncrasies'' than two coders.
However, the gains from introducing additional coders are diminishing so using more than two coders may not be cost effective.
\end{itemize}
\section{Accounting for errors in the validation data}
In this section, we extend \emph{Simulation 1b} and \emph{Simulation 2b} with validation data coded by two independent coders who make random errors.
\begin{figure}
<<example.5.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
# Render the B_Z panel for the two-coder (interrater reliability) covariate scenario.
irr.z.panel <- plot.simulation.irr(plot.df.example.5, 'z')
grid.draw(irr.z.panel)
@
\caption{Estimates of $B_Z$ in multivariate regression with $X$ measured using machine learning, with validation data collected by 2 independent coders that make random errors.}
\end{figure}
\begin{figure}
<<example.5.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.625,cache=F>>=
# Render the B_X panel for the two-coder (interrater reliability) covariate scenario.
irr.x.panel <- plot.simulation.irr(plot.df.example.5, 'x')
grid.draw(irr.x.panel)
@
\caption{Estimates of $B_X$ in multivariate regression with $X$ measured using machine learning, with validation data collected by 2 independent coders that make random errors.}
\end{figure}
\begin{figure}
<<example.6.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.625,cache=F>>=
# Render the B_Z panel for the two-coder scenario with an AC-measured outcome.
irr.dv.z.panel <- plot.simulation.irr.dv(plot.df.example.6, 'z')
grid.draw(irr.dv.z.panel)
@
\caption{Estimates of $B_Z$ in multivariate regression with $Y$ measured using machine learning, with validation data collected by 2 independent coders that make random errors.}
\end{figure}
\begin{figure}
<<example.6.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.625,cache=F>>=
# Render the B_X panel for the two-coder scenario with an AC-measured outcome.
irr.dv.x.panel <- plot.simulation.irr.dv(plot.df.example.6, 'x')
grid.draw(irr.dv.x.panel)
@
\caption{Estimates of $B_X$ in multivariate regression with $Y$ measured using machine learning, with validation data collected by 2 independent coders that make random errors.}
\end{figure}