
\documentclass[floatsintext, draftfirst, man]{apa7}
<<init,echo=FALSE>>=
library(knitr)
library(ggplot2)
library(data.table)
knitr::opts_chunk$set(fig.show='hold')
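# Format a number as an integer with thousands separators (e.g., 1234567 becomes "1,234,567")
f <- function (x) {formatC(x, format="d", big.mark=',')}
# Format a proportion as a whole-number percentage with a LaTeX-escaped percent sign
format.percent <- function(x) {paste(f(x*100),"\\%",sep='')}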

theme_set(theme_bw())
source('resources/functions.R')
source('resources/variables.R')
source('resources/real_data_example.R')
@


\usepackage{epstopdf}% To incorporate .eps illustrations using PDFLaTeX, etc.
\usepackage{subcaption}% Support for small, `sub' figures and tables
\usepackage{tikz}
\usetikzlibrary{positioning, shapes, arrows, shadows, arrows.meta}

\def \parrotpdf {\includegraphics[]{parrot.pdf}}
\DeclareUnicodeCharacter{1F99C}{\parrotpdf}
\usepackage{tabularx}
\usepackage[utf8]{inputenc}
\usepackage{wrapfig}
\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{listings}
\usepackage{xcolor}

%New colors defined below
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}

\setcounter{secnumdepth}{3}

\lstdefinestyle{mystyle}{
  backgroundcolor=\color{backcolour}, commentstyle=\color{codegreen},
  keywordstyle=\color{magenta},
  numberstyle=\tiny\color{codegray},
  stringstyle=\color{codepurple},
  basicstyle=\ttfamily\footnotesize,
  breakatwhitespace=false,
  breaklines=true,
  captionpos=b,
  keepspaces=true,
  numbers=left,
  numbersep=5pt,
  showspaces=false,
  showstringspaces=false,
  showtabs=false,
  tabsize=2
}
\usepackage{graphicx}
\usepackage{enumerate}

\usepackage{amsmath}%}, amsthm, amssymb}
\DeclareMathOperator*{\argmin}{arg\,min} % thin space, limits underneath in displays
\DeclareMathOperator*{\argmax}{arg\,max} % thin space, limits underneath in displays


\hypersetup{colorlinks=true, linkcolor=black, citecolor=black, filecolor=blue,
     urlcolor=blue, unicode=true}


\usepackage[american]{babel}
\usepackage{csquotes}
\usepackage[natbib=true, style=apa, sortcites=true, backend=biber]{biblatex}
\addbibresource{Bibliography.bib}
\DeclareLanguageMapping{american}{american-apa}

\defbibheading{secbib}[\bibname]{
  \section*{#1}
  \markboth{#1}{#1}
  \baselineskip 14.2pt
  \prebibhook}

\def\citepos#1{\citeauthor{#1}'s (\citeyear{#1})}
\def\citespos#1{\citeauthor{#1}' (\citeyear{#1})}
\newcommand\TODO[1]{\textsc{\color{red} #1}}

\title{Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!}

\shorttitle{Can We Fix It? Yes We Can!}

\keywords{
Automated Content Analysis; Machine Learning; Classification Error; Attenuation Bias; Simulation; Computational Methods; Big Data; AI
}

\abstract{

Automated classifiers (ACs), often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video, and have become widely popular measurement devices in communication science and related fields.
Despite this popularity, even highly accurate classifiers make errors that cause misclassification bias and misleading results in downstream analyses—unless such analyses account for these errors.
As we show in a systematic literature review of SML applications,
communication scholars largely ignore misclassification bias.
In principle, existing statistical methods can use ``gold standard'' validation data, such as that created by human annotators, to correct misclassification bias and produce consistent estimates.
We introduce and test such methods, including a new method we design and implement in the R package \texttt{misclassificationmodels}, via Monte Carlo simulations designed to reveal each method's limitations. We also release these simulations. Based on our results, we recommend our new error correction method because it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods.
}


\begin{document}
\maketitle


\emph{Automated classifiers} (ACs) based on supervised machine learning (SML) have rapidly gained popularity
as part of the \emph{automated content analysis} toolkit in communication science \citep{baden_three_2022}. With ACs, researchers can categorize large samples of text, images, video, or other types of data into predefined categories \citep{scharkow_thematic_2013}. For instance, studies use SML-based classifiers to study frames \citep{burscher_teaching_2014}, tonality \citep{van_atteveldt_validity_2021},
or civility \citep{hede_toxicity_2021} in news media texts or social media posts.


However, there is increasing concern about the validity of automated content analysis for studying theories and concepts from communication science \citep{baden_three_2022, hase_computational_2022}. We add to this debate by analyzing \emph{misclassification bias}, that is, how misclassifications by ACs distort statistical findings unless they are correctly modeled \citep{fong_machine_2021}. Research areas where ACs have the greatest potential (e.g., content moderation, social media bots, affective polarization, or radicalization) are haunted by the specter of methodological questions related to misclassification bias \citep{rauchfleisch_false_2020}: How accurate must an AC be to measure a variable? Can an AC built for one context be used in another \citep{burscher_using_2015,hede_toxicity_2021}? Is comparing automated classifications to some external ground truth sufficient to claim validity? How do biases in AC-based measurements affect downstream statistical analyses \citep{millimet_accounting_2022}?


Our study begins with a demonstration of misclassification bias in a real-world example based on the Perspective toxicity classifier.
Next, we provide a systematic literature review of $N = 48$ studies employing SML-based text classification.
Although communication scholars have long scrutinized related questions about manual content analysis for which they have recently proposed statistical corrections  \citep{bachl_correcting_2017, geis_statistical_2021}, misclassification bias in automated content analysis is largely ignored.
Our review demonstrates a troubling lack of attention to the threats ACs introduce and virtually no mitigation of such threats. As a result, in the current state of affairs, researchers are likely to either draw misleading conclusions from inaccurate ACs or avoid ACs in favor of costly methods such as manually coding large samples \citep{van_atteveldt_validity_2021}.

Our primary contribution, an effort to rescue ACs from this dismal state, is to \emph{introduce and test methods for correcting misclassification bias} \citep{carroll_measurement_2006, buonaccorsi_measurement_2010, yi_handbook_2021}. We consider three recently proposed methods: \citepos{fong_machine_2021} generalized method of moments calibration method, \citepos{zhang_how_2021} pseudo-likelihood models, and \citepos{blackwell_unified_2017-1} application of imputation methods. To overcome these methods' limitations, we draw a general likelihood modeling framework from the statistical literature on measurement error \citep{carroll_measurement_2006} and tailor it to the problem of misclassification bias. Our novel implementation is the experimental R package \texttt{misclassificationmodels}.

We test these four error correction methods and compare them against ignoring misclassification (the naïve approach) and refraining from automated content analysis by only using manual coding (the feasible approach). We use Monte Carlo simulations to model four prototypical situations identified by our review: using ACs to measure either (1) an independent or (2) a dependent variable where the classifier makes misclassifications that are either (a) easy to correct (when an AC is unbiased and misclassifications are uncorrelated with covariates, i.e., \emph{nonsystematic misclassification}) or (b) more difficult to correct (when an AC is biased and misclassifications are correlated with covariates, i.e., \emph{systematic misclassification}).

According to our simulations, even biased classifiers without high predictive performance can be useful in conjunction with appropriate validation data and error correction methods.
As a result, we are optimistic about the potential of ACs and automated content analysis for communication science and related fields—if researchers correct for misclassification.
Current practices of ``validating'' ACs by making misclassification rates transparent via metrics such as the F1 score, however, provide little safeguard against misclassification bias.

In sum, we make a methodological contribution by introducing the often-ignored problem of misclassification bias in automated content analysis, testing error correction methods to address this problem via Monte Carlo simulations, and introducing a new method for error correction.
More fundamentally, we conclude that automated content analysis will progress not only, or even primarily, by building more accurate classifiers but by rigorous human annotation and statistical error modeling.

\section{Why Misclassification is a Problem: an Example Based on the Perspective API}

There is no perfect AC. All ACs make errors.
This inevitable misclassification causes bias in statistical inference \citep{carroll_measurement_2006, scharkow_how_2017}, leading researchers to make both type-I errors (false discovery) and type-II errors (failure to reject the null) in hypothesis tests. To illustrate the problematic consequences of this misclassification bias, we focus on real-world data and a specific research area in communication research: detecting and understanding harmful social media content. Communication researchers often employ automated tools such as the Perspective toxicity classifier \citep{cjadams_jigsaw_2019} to detect toxicity in online content \citep[e.g.,][]{hopp_social_2019, kim_distorting_2021, votta_going_2023}.
As shown next, however, relying on toxicity scores created by ACs such as the Perspective API as (in-)dependent variables produces different results than using measurements created via manual annotation.

To illustrate this, we use the Civil Comments dataset released in 2019 by Jigsaw, the Alphabet corporation subsidiary behind the Perspective API. Methodological details on the data and our example are available in Appendix \ref{appendix:perspective}. The dataset has \Sexpr{f(dv.example[['n.annotated.comments']])} English-language comments made on independent news sites. It also includes manual annotations of each comment concerning its toxicity (\emph{toxicity}), whether it discloses aspects of personal identity like race or ethnicity \emph{(identity disclosure)}, and the number of likes it received \emph{(number of likes)}.


In addition to manual annotations of each comment, we obtained AC-based toxicity classifications from the Perspective API in November 2022. Perspective's toxicity classifier performs very well, with an accuracy of \Sexpr{format.percent(iv.example[['civil_comments_accuracies']][['toxicity_acc']])} and an F1 score of \Sexpr{round(iv.example[['civil_comments_f1s']][['toxicity_f1']],2)}. Nevertheless, if we treat human annotations as the ground truth, the classifier makes systematic misclassifications: it is modestly biased and disproportionately misclassifies comments disclosing racial or ethnic identity as toxic (Pearson's $\rho=\Sexpr{round(dv.example[['civil_comments_cortab']]['toxicity_error','race_disclosed'],2)}$).

First, let us consider \emph{misclassification in an independent variable}. As an example, we use a logistic regression model to predict whether a comment contains \emph{identity disclosure} using \emph{number of likes}, \emph{toxicity}, and their interaction as independent variables. Although this is a toy example, it resembles a realistic investigation of how disclosing aspects of one's identity online relates to normative reception of one's behavior.
\begin{figure}[htbp!]
\centering
\begin{subfigure}{\linewidth}
<<real.data.example.iv,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.asp=0.3,cache=F>>=
p <- plot.civilcomments.iv.example()
print(p)
@
\subcaption{\emph{Example 1}: Misclassification in an independent variable.\label{fig:real.data.example.iv}}
\end{subfigure}

\begin{subfigure}{\linewidth}
<<real.data.example.dv,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.asp=0.3,cache=F>>=
p <- plot.civilcomments.dv.example()
print(p)
@
\subcaption{\emph{Example 2}: Misclassification in a dependent variable. \label{fig:real.data.example.dv}}

\end{subfigure}

\caption{Bias through Misclassification: a Real-World Example Using the Perspective API and the Civil Comments Dataset.\label{fig:real.data.example}}
\end{figure}
As shown in Figure \ref{fig:real.data.example.iv}, relying on AC-based toxicity classifications may lead researchers to reject a hypothesized direct relationship between likes and identity disclosure. Instead, the model suggests that their correlation is entirely mediated by toxicity.
In contrast, using human annotations would lead researchers to conclude that there is a subtle positive direct relationship between likes and identity disclosure.
This demonstrates that even a very accurate AC can introduce type-II errors, i.e., researchers failing to reject a null hypothesis due to misclassification.

Second, let us consider \emph{misclassification in a dependent variable}. We now predict the \emph{toxicity} of a comment with \emph{number of likes}, \emph{identity disclosure} in a comment, and their interaction as independent variables.
As shown in Figure \ref{fig:real.data.example.dv}, using Perspective's classification of toxicity results in a small negative direct effect of likes. However, there is no detectable relationship when using manual annotations. As such, misclassification can also lead to type-I error, i.e., false discovery of a nonzero relationship.

\section{Why Transparency about Misclassification Is Not Enough}

Although the Perspective API is certainly accurate enough to be useful to content moderators, the example above demonstrates that this does not imply usefulness for social science  \citep{grimmer_machine_2021-1}.
Machine learning takes the opposite position on the bias-variance trade-off from conventional statistics and achieves high predictiveness at the cost of more biased inference \citep{breiman_statistical_2001}. As a growing body of scholarship critical of the hasty adoption of machine learning in criminal justice, healthcare, or content moderation demonstrates,
ACs boasting high performance often have biases related to social categories \citep{barocas_fairness_2019}. Such biases often result from non-representative training data and spurious correlations that neither reflect causal mechanisms nor generalize to different populations \citep{bender_dangers_2021}.
Much of this critique targets unjust consequences of these biases to individuals. Our example shows that these biases can also contaminate scientific studies using ACs as measurement devices. Even very accurate ACs can cause both type-I and type-II errors, which become more likely when  classifiers are less accurate or more biased, or when effect sizes are small.

We argue that current common practices to address such limitations are insufficient. These practices assert validity by reporting classifier performance on manually annotated data quantified via metrics like accuracy, precision, recall, or the F1 score \citep{hase_computational_2022, baden_three_2022, song_validations_2020}.
These steps promote confidence in results by making misclassification transparent, but our example indicates bias can flow downstream into statistical inferences, despite high predictiveness.
Instead of relying only on transparency rituals to ward off misclassification bias, researchers can and should use validation data to correct it.

These claims may surprise because of the widespread misconception that misclassification causes only conservative bias (i.e., bias towards null effects). This belief persists because it holds for bivariate least squares regression when misclassifications are nonsystematic
\citep{carroll_measurement_2006, loken_measurement_2017, van_smeden_reflection_2020}.\footnote{Measurement error is \emph{classical} when it is nonsystematic and the variance of an AC's predictions is greater than the variance of the true value \citep{carroll_measurement_2006}.
Measurement error in an independent variable is called ``differential'' if it is not conditionally independent of the dependent variable given the other independent variables.
Measurement error in an independent variable can be nondifferential and not classical when the variance of the misclassified variable is less than the variance of the true value. This is called Berkson error and in general is easier to deal with than classical error. It is hard to imagine how an AC would have Berkson errors as predictions would then have lower variance than the training data. Following prior work, we thus do not consider Berkson errors \citep{fong_machine_2021, zhang_how_2021}. We call measurement error in the dependent variable \emph{systematic} when it is correlated
with an independent variable. We use this more general term to simplify our discussions that pertain equally to misclassified independent and dependent variables.}  As a result, researchers interested in a hypothesis of a statistically significant relationship may not consider misclassification an important threat to validity \citep{loken_measurement_2017}.
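
A short derivation, under simplifying assumptions, illustrates why the bias is conservative in the bivariate case. Suppose a true binary variable $X$ relates linearly to an outcome $Y$, so that $E[Y \mid X] = B_0 + B_1 X$, and an AC produces binary classifications $W$ whose errors are nondifferential, so that $E[Y \mid X, W] = E[Y \mid X]$. The slope from regressing $Y$ on $W$ alone is then the difference in conditional means:
\begin{align*}
E[Y \mid W = 1] - E[Y \mid W = 0] &= B_1 \left[ P(X = 1 \mid W = 1) - P(X = 1 \mid W = 0) \right] \\
&= B_1 \left( \mathrm{PPV} + \mathrm{NPV} - 1 \right),
\end{align*}
where $\mathrm{PPV} = P(X = 1 \mid W = 1)$ and $\mathrm{NPV} = P(X = 0 \mid W = 0)$ denote the classifier's positive and negative predictive values (abbreviations introduced here only for illustration). For a hypothetical classifier with $\mathrm{PPV} = \mathrm{NPV} = 0.9$, the uncorrected slope is $0.8 B_1$: the estimate shrinks toward zero but keeps its sign.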

However, as shown in our example, misclassification bias can be anti-conservative \citep{carroll_measurement_2006, loken_measurement_2017, van_smeden_reflection_2020}. First, in regression models with more than one independent variable, or in nonlinear models such as the logistic regression we used in our example, even nonsystematic misclassification can cause bias away from 0.
Second, systematic misclassification can bias inference in any direction.


ACs designed in one context and applied in another are likely to commit systematic misclassification. For example, the Perspective API used to classify toxic content was developed for social media comments but performs much worse when applied to news data \citep{hede_toxicity_2021}. Systematic misclassification may also arise when an AC used for measurement shapes behavior in a sociotechnical system under study. As examples, the Perspective API is used for online forum moderation \citep{hede_toxicity_2021}, as is the ORES API for Wikipedia moderators \citep{teblunthuis_effects_2021}.
Misclassifications from such classifiers can be systematic because they have causal effects on outcomes related to moderation.


If ACs become standard measurement devices, for instance
Google's Perspective API for measuring toxicity \citep[see critically][]{hosseini_deceiving_2017} or Botometer for classifying social media bots \citep[see critically][]{rauchfleisch_false_2020}, entire research areas may be subject to systematic biases.
Even if misclassification bias is usually conservative, it can slow progress in a research area.   Consider how \citet{scharkow_how_2017} argue that media's ``minimal effects'' on political opinions and behavior in linkage studies may be an artifact of measurement errors in both manual content analyses and self-reported media use in surveys.  Conversely, if researchers selectively report statistically significant hypothesis tests, misclassification can introduce an upward bias in the magnitude of reported effect sizes and contribute to a replication crisis \citep{loken_measurement_2017}.

\section{Quantifying the Problem: Error Correction Methods in SML-based Text Classification}

To understand how social scientists, including communication scholars, engage with the problem of misclassification in automated content analysis,
we conducted a systematic literature review of studies using supervised machine learning (SML) for text classification (see Appendix \ref{appendix:lit.review} in our Supplement for details).\footnote{Automated content analysis includes a range of methods both for assigning content to predefined categories (e.g., dictionaries) and for assigning content to unknown categories (e.g., topic modeling) \citep{grimmer_text_2013}. While we focus on SML, our arguments extend to other approaches such as dictionary-based classification and even beyond the specific context of text classification.}
Our sample consists of studies identified by similar reviews on automated content analysis \citep{baden_three_2022, hase_computational_2022, junger_unboxing_2022, song_validations_2020}. Our goal is not to comprehensively review all SML studies
but to provide a picture of common practices, with an eye toward awareness of misclassification and its statistical implications.

We identified a total of 48 empirical studies published between 2013 and 2021, more than half of which were published in communication journals. Studies used SML-based text classification for purposes such as measuring frames \citep{opperhuizen_framing_2019} or topics \citep{vermeer_online_2020}. They often employed SML-based ACs to create dichotomous (50\%) or other categorical variables (23\%).\footnote{Metric variables were created in 35\% of studies, mostly via the non-parametric method by \citet{hopkins_method_2010}.} Of these empirical studies, many used SML-based ACs as independent variables (44\%) or dependent variables (40\%) in multivariate analyses, and 90\% reported univariate statistics such as proportions.


Overall, our review reveals a \emph{lack of transparency when reporting SML-based text classification}, similar to that previously reported \citep{reiss_reporting_2022}: A large share of studies do not report important methodological decisions related to sampling and sizes of training and test sets (see Appendix \ref{appendix:lit.review}). This lack of transparency concerning model validation not only limits the degree to which researchers can evaluate studies, but also makes replicating such analyses to correct misclassification bias nearly impossible. Most important, our review finds that \emph{studies almost never reflected upon or corrected misclassification bias}. According to our review, 85\% of studies reported metrics such as recall or precision, but only 19\% of studies explicitly stated that an AC misclassified texts, which may introduce measurement error. Only a single article reported using error correction methods.  To address the clear need for methods for understanding misclassification bias and correcting it, we now introduce and discuss existing methods to do so.


\section{Addressing the Problem: Existing Approaches for Correcting Misclassification}
Statisticians have extensively studied measurement error (including misclassification), the problems it causes for statistical inference, and methods for correcting these problems \citep[see][]{carroll_measurement_2006, fuller_measurement_1987}.
We narrow our focus to three existing methods recently proposed for dealing with misclassification bias in the context of automated content analysis: \citepos{fong_machine_2021} GMM calibration method, multiple imputation \citep{blackwell_unified_2017-1}, and \citepos{zhang_how_2021} pseudo-likelihood model.\footnote{Statisticians have studied other methods including simulation extrapolation, Bayesian estimation, and score function methods. As we argue in Appendix \ref{appendix:other.methods}, these error correction methods are not advantageous when manually annotated data is available, as is often the case with ACs.}


In the interest of clarity, we introduce some notation. Say we want to estimate a regression model $Y = B_0 + B_1 X + B_2 Z + \varepsilon$ where $X$ is an independent variable for which a small sample of manually annotated data $X^*$ and automated classifications $W$ are observed. Fully observed are $Z$, a second independent variable and $Y$,  the dependent variable.
To illustrate, in our first real-world example, $X$ is toxicity, $X^*$ are the Civil Comments annotations, $W$ are the Perspective API's toxicity classifications, $Z$ are likes, and $Y$ is identity disclosure.
Say the sample of annotated data $X^*$ is too small to convincingly test a hypothesis, but collecting additional annotations is too expensive.
In contrast, an AC can make classifications $W$ for the entire dataset but introduces misclassification bias. How can we correct this bias in an automated content analysis?

\emph{Regression calibration} uses observable variables, including automated classifications $W$ and other variables measured without error $Z$, to approximate the true value of $X$ \citep{carroll_measurement_2006}. \citet{fong_machine_2021} propose a regression calibration procedure designed for SML that we refer to as \emph{GMM calibration} or GMM.\footnote{\citet{fong_machine_2021} describe their method within an instrumental variable framework, but it is equivalent to regression calibration, the standard term in measurement error literature.} For their calibration model, \citet{fong_machine_2021} use 2-stage least squares (2SLS). They regress the manually annotated data $X^*$ on the observed variables $Z$ and AC predictions $W$ and then use the resulting model to approximate $X$ as $\hat{X}$. They then use the generalized method of moments (GMM) to combine estimates based on the approximated independent variable $\hat{X}$ and estimates based on the manually annotated data $X^*$. This method makes efficient use of manually annotated data and provides an asymptotic theory for deriving confidence intervals. The GMM approach does not make strong assumptions about the distribution of the outcome $Y$, but can be invalidated by systematic misclassification \citep{fong_machine_2021}. GMM, like other regression calibration techniques, is not designed to correct for misclassification in the outcome.

\emph{Multiple imputation} (MI) treats misclassification as a missing data problem. It understands the true value of $X$ to be observed in manually annotated data $X^*$ and missing otherwise \citep{blackwell_unified_2017-1}.
Like regression calibration, multiple imputation uses a model to infer likely values of possibly misclassified variables. The difference is that multiple imputation samples several (hence \emph{multiple} imputation) entire datasets, filling in the missing data from the predictive probability distribution of $X$ conditional on other variables $\{W,Y,Z\}$, then runs a statistical analysis on each of these sampled datasets and pools the results of these analyses \citep{blackwell_unified_2017-1}. Note that $Y$ is included among the imputing variables, giving the MI approach the potential to address \emph{differential error}, which arises when systematic misclassification makes automatic classifications conditionally dependent on the outcome given other independent variables.
 \citet{blackwell_unified_2017-1} claim that the MI method is relatively robust when it comes to small violations of the assumption of nondifferential error. Moreover, in theory, the MI approach can be used for correcting misclassifications both in independent and dependent variables.

\emph{``Pseudo-likelihood''} methods (PL), even if not always explicitly labeled this way, are another approach for correcting misclassification bias. \citet{zhang_how_2021} proposes a method that approximates the error model using quantities from the AC's confusion matrix: the positive and negative predictive values in the case of a mismeasured independent variable and the AC's false positive and false negative rates in the case of a mismeasured dependent variable.  Because quantities from the confusion matrix are neither data nor model parameters, \citepos{zhang_how_2021} method is technically a ``pseudo-likelihood'' method. A clear benefit is that this method only requires summary quantities derived from manually annotated data, for instance via a confusion matrix.
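
For reference, these confusion-matrix quantities can be written in the notation introduced above, assuming binary variables coded 0/1 (the abbreviations PPV, NPV, FPR, and FNR are introduced here only for illustration):
\begin{align*}
\mathrm{PPV} &= P(X = 1 \mid W = 1), & \mathrm{NPV} &= P(X = 0 \mid W = 0), \\
\mathrm{FPR} &= P(W = 1 \mid Y = 0), & \mathrm{FNR} &= P(W = 0 \mid Y = 1).
\end{align*}
The first pair conditions on the automated classification and applies to a mismeasured independent variable $X$; the second pair conditions on the true value and applies to a mismeasured dependent variable $Y$.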

\subsection{Proposing Maximum Likelihood Adjustment for Misclassification}

We now elaborate on \emph{Maximum Likelihood Adjustment} (MLA), a new method we propose for correcting misclassification bias. Our method tailors \citepos{carroll_measurement_2006} presentation of the general statistical theory of likelihood modeling for measurement error correction to the context of automated content analysis.\footnote{In particular, see Chapter 8 (especially example 8.4) and Chapter 15 (especially 15.4.2).}  The MLA approach deals with misclassification bias by maximizing a likelihood that correctly specifies an \emph{error model} of the probability of the automated classifications conditional on the true value and the outcome \citep{carroll_measurement_2006}.
In contrast to the GMM and the MI approach, which predict values of the misclassified variable, the MLA method accounts for all possible values of the variable by ``integrating them out'' of the likelihood.
``Integrating out'' means summing the joint likelihood over all possible values of the variable, with each value weighted by its likelihood under the error model.
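
As a minimal illustration, assume a binary variable $X$ that is unobserved for a given observation, and anticipate the factorization described below for a misclassified independent variable (model parameters are suppressed for brevity). Integrating out $X$ then amounts to a two-term sum:
\begin{align*}
P(Y, W \mid Z) ={}& P(Y \mid X = 0, Z)\, P(W \mid X = 0, Y, Z)\, P(X = 0 \mid Z) \\
&+ P(Y \mid X = 1, Z)\, P(W \mid X = 1, Y, Z)\, P(X = 1 \mid Z).
\end{align*}
Neither candidate value of $X$ is treated as known; each contributes its joint probability with the observed $Y$ and $W$.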

MLA methods have four advantages in the context of ACs that reflect the benefits of integrating out partially observed discrete variables. First, they are  general in that they can be applied to any model with a convex likelihood including generalized linear models (GLMs) and generalized additive models (GAMs).
Second, assuming the model is correctly specified, MLA estimators are fully consistent whereas regression calibration estimators are only approximately consistent \citep{carroll_measurement_2006}.  Practically, this means that MLA methods can have greater statistical efficiency and require less manually annotated data to make precise estimates.
Third, the MLA approach can correct misclassification in both dependent and independent variables.
Fourth, and most important, MLA can be effective when misclassification is systematic.

\subsubsection{When an Automated Classifier Predicts an Independent Variable}

In general, if we want to estimate a model $P(Y|\Theta_Y, X, Z)$ for $Y$ given $X$ and $Z$ with parameters $\Theta_Y$, we can use AC classifications $W$ predicting $X$ to gain statistical power without introducing misclassification bias by maximizing $\mathcal{L}(\Theta|Y,W)$, the likelihood of the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\}$ in a joint model of $Y$ and $W$  \citep{carroll_measurement_2006}.
The joint probability of $Y$ and $W$ can be factored into the product of three terms: $P(Y|X,Z,\Theta_Y)$, the model with parameters $\Theta_Y$ we want to estimate; $P(W|X,Y,Z, \Theta_W)$, a model for $W$ having parameters $\Theta_W$; and $P(X|Z, \Theta_X)$, a model for $X$ having parameters $\Theta_X$.
Calculating these three conditional probabilities is sufficient to calculate the joint probability of the dependent variable and automated classifications and thereby obtain a consistent estimate despite misclassification. $P(W|X,Y,Z, \Theta_W)$ is called the \emph{error model} and $P(X|Z, \Theta_X)$ is called the \emph{exposure model} \citep{carroll_measurement_2006}.

To illustrate, consider the regression model  $Y=B_0 + B_1 X + B_2 Z + \varepsilon$  and automated classifications $W$ of the independent variable $X$.
We can assume that the probability of $W$ follows a logistic regression model of $Y$, $X$, and $Z$ and that the probability of $X$ follows a logistic regression model of $Z$. In this case, the likelihood model below is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\} = \{\{B_0, B_1, B_2\}, \{\alpha_0, \alpha_1, \alpha_2, \alpha_3\}, \{\gamma_0, \gamma_1\}\}$.

\begin{align}
    \mathcal{L}(\Theta | Y, W) &= \prod_{i=1}^{N}\sum_{x} {P(Y_i| X_i = x, Z_i, \Theta_Y)P(W_i|X_i = x, Y_i, Z_i, \Theta_W)P(X_i = x|Z_i, \Theta_X)} \label{eq:covariate.reg.general}\\
    P(Y_i| X_i, Z_i, \Theta_Y) &= \phi(B_0 + B_1 X_i + B_2 Z_i) \\
    P(W_i| X_i, Y_i, Z_i, \Theta_W) &= \frac{1}{1 + e^{-(\alpha_0 + \alpha_1 Y_i + \alpha_2 X_i + \alpha_3 Z_i)}} \label{eq:covariate.logisticreg.w} \\
    P(X_i| Z_i, \Theta_X) &= \frac{1}{1 + e^{-(\gamma_0 + \gamma_1 Z_i)}}
\end{align}


\noindent where $\phi$ is the normal probability density function.  Note that Equation \ref{eq:covariate.reg.general} models differential error (i.e., $Y$ is not independent of $W$ conditional on $X$ and $Z$) via a linear relationship between $W$ and $Y$.  When error is nondifferential, the dependence between $W$ and $Y$ can be removed from Equations \ref{eq:covariate.reg.general} and \ref{eq:covariate.logisticreg.w}.
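
As a sketch of how the sum in Equation \ref{eq:covariate.reg.general} is evaluated, assume that $X$ is binary (as in our simulations) and that observation $i$ has no manual annotation, so that $X_i$ is unobserved. Writing $p_i = P(X_i = 1 | Z_i, \Theta_X)$, the contribution of observation $i$ to the likelihood is then a two-component mixture:
\begin{align*}
    &\sum_{x \in \{0,1\}} P(Y_i | X_i = x, Z_i, \Theta_Y)\, P(W_i | X_i = x, Y_i, Z_i, \Theta_W)\, P(X_i = x | Z_i, \Theta_X) \\
    &\quad = (1 - p_i)\, P(Y_i | X_i = 0, Z_i, \Theta_Y)\, P(W_i | X_i = 0, Y_i, Z_i, \Theta_W) \\
    &\qquad + p_i\, P(Y_i | X_i = 1, Z_i, \Theta_Y)\, P(W_i | X_i = 1, Y_i, Z_i, \Theta_W).
\end{align*}
For a manually annotated observation, $X_i$ is known and, under this reading, only the term for the observed value of $x$ enters the product.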

Estimating the three conditional probabilities in practice requires specifying models on which the validity of the method depends.
This framework is very general and a wide range of probability models, such as generalized additive models (GAMs) or Gaussian process classification, may be used to estimate $P(W| X, Y, Z, \Theta_W)$ and $P(X|Z,\Theta_X)$ \citep{williams_bayesian_1998}.

\subsubsection{When an Automated Classifier Predicts a Dependent Variable}

We now turn to the case when an AC makes classifications $W$ that predict a discrete dependent variable $Y$.
In our second real-data example, $W$ is the Perspective API's toxicity classifications and $Y$ is the true value of toxicity.
This case is simpler than the case above where an AC is used to measure an independent variable $X$ because there is no need to specify a model for the probability of $X$.
If we assume that the probability of $Y$ follows a logistic regression model of $X$ and $Z$ and allow $W$ to be biased and to directly depend on $X$ and $Z$, then maximizing the following likelihood is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W\} = \{\{B_0, B_1, B_2\},\{\alpha_0, \alpha_1, \alpha_2, \alpha_3\}\}$.

\begin{align}
    \mathcal{L}(\Theta|Y,W) &= \prod_{i=0}^{N} {\sum_{y}{P(Y_i = y | X_i, Z_i, \Theta_Y)P(W_i|X_i, Z_i, Y_i = y, \Theta_W)}} \label{eq:depvar.general}\\
    P(Y_i| X_i, Z_i, \Theta_Y) &= \frac{1}{1 + e^{-(B_0 + B_1 X_i + B_2 Z_i)}} \\
    P(W_i | Y_i, X_i, Z_i, \Theta_W) &= \frac{1}{1 + e^{-(\alpha_0 + \alpha_1 Y_i + \alpha_2 X_i + \alpha_3 Z_i)}} \label{eq:depvar.w}
\end{align}

If the AC's errors are conditionally independent of $X$ and $Z$ given $Y$, the dependence of $W$ on $X$ and $Z$ can be omitted from Equations \ref{eq:depvar.general} and \ref{eq:depvar.w}.

Additional details on the likelihood modeling approach are available in Appendix \ref{appendix:derivation} of the Supplement.


\section{Evaluating Misclassification Models: Monte Carlo Simulations}


We now present four Monte Carlo simulations (\emph{Simulations 1a}, \emph{1b}, \emph{2a}, and \emph{2b}) with which we evaluate existing methods (GMM, MI, PL) and our approach (MLA) for correcting misclassification bias.

Monte Carlo simulations are a tool for evaluating statistical methods, including (automated) content analysis \citep[e.g.,][]{song_validations_2020,bachl_correcting_2017,geis_statistical_2021, fong_machine_2021,zhang_how_2021}.
They are defined by a data generating process from which datasets are repeatedly sampled. Repeating an analysis for each of these datasets provides an empirical distribution of results the analysis would obtain over study replications. Monte Carlo simulation affords exploration of finite-sample performance, robustness to assumption violations, comparison across several methods, and ease of interpretability \citep{mooney_monte_1997}.
Such simulations allow exploration of how results depend on assumptions about the data-generating process and analytical choices and are thus an important tool for designing studies that account for misclassification.


\subsection{Parameters of the Monte Carlo Simulations}

In our simulations, we tested four error correction methods: \emph{GMM calibration} (GMM) \citep{fong_machine_2021}, \emph{multiple imputation} (MI) \citep{blackwell_unified_2017-1}, \emph{Zhang's pseudo-likelihood model} (PL) \citep{zhang_how_2021}, and our \emph{maximum likelihood adjustment} approach (MLA). We use the \texttt{predictionError} R package \citep{fong_machine_2021} for the GMM method, the \texttt{Amelia} R package for the MI approach, and our own implementation of \citet{zhang_how_2021}'s PL approach in R.
We develop our MLA approach in the R package \texttt{misclassificationmodels}.
For PL and MLA, we quantify uncertainty using the Fisher information quadratic approximation.\footnote{The code for reproducing our simulations and our experimental R package is available here: \url{https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e}.}
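
As a generic reminder of what this approximation involves, and assuming for this sketch that the observed rather than the expected information is used, let $\ell(\Theta)$ denote the log of the (pseudo-)likelihood and $\hat{\Theta}$ its maximizer. The covariance matrix of $\hat{\Theta}$ is approximated by the inverse of the negative Hessian of $\ell$ at $\hat{\Theta}$, and an approximate 95\% confidence interval for the $j$-th parameter $\theta_j$ follows from the corresponding diagonal element:
\begin{equation*}
    \widehat{\mathrm{Var}}(\hat{\Theta}) \approx \left[ -\left. \frac{\partial^2 \ell(\Theta)}{\partial \Theta \, \partial \Theta^{\top}} \right|_{\Theta = \hat{\Theta}} \right]^{-1},
    \qquad
    \hat{\theta}_j \pm 1.96 \sqrt{\widehat{\mathrm{Var}}(\hat{\Theta})_{jj}} .
\end{equation*}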

In addition, we compare these error correction methods to two common approaches in communication science: the \emph{feasible} estimator (i.e., conventional content analysis that uses only manually annotated data and not ACs) and the \emph{naïve} estimator (i.e., using AC-based classifications $W$ as stand-ins for $X$, thereby ignoring misclassification).
According to our systematic review, the \emph{naïve} approach reflects standard practice in studies employing SML for text classification.

We evaluate each of the six analytical approaches in terms of \emph{consistency} (whether the estimates of parameters $\hat{B_X}$ and $\hat{B_Z}$ have expected values nearly equal to the true values $B_X$ and $B_Z$), \emph{efficiency} (how precisely the parameters are estimated and how precision improves with additional data), and \emph{uncertainty quantification} (how well the 95\% confidence intervals approximate the range including 95\% of parameter estimates across simulations).
To evaluate efficiency, we repeat each simulation with different amounts of total observations, i.e., unlabeled data to be classified by an AC (ranging from \Sexpr{min(N.sizes)} to \Sexpr{max(N.sizes)} observations), and manually annotated observations (ranging from \Sexpr{min(m.sizes)} to \Sexpr{max(m.sizes)}
observations). Since our review indicated that ACs are most often used to create binary variables, we restrict our simulations to misclassifications related to a binary (in-)dependent variable.
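
One way to make these criteria concrete, using notation introduced only for this sketch, is to let $\hat{B}^{(k)}$ denote the estimate of a coefficient $B$ obtained from the $k$-th of $K$ simulated datasets and $[l^{(k)}, u^{(k)}]$ the corresponding 95\% confidence interval. Consistency can then be judged by how close the average estimate is to $B$, efficiency by the width of the interval between the empirical 2.5th and 97.5th percentiles of the $\hat{B}^{(k)}$, and uncertainty quantification by how closely the average confidence limits track those percentiles:
\begin{equation*}
    \bar{B} = \frac{1}{K}\sum_{k=1}^{K} \hat{B}^{(k)}, \qquad
    \bar{l} = \frac{1}{K}\sum_{k=1}^{K} l^{(k)}, \qquad
    \bar{u} = \frac{1}{K}\sum_{k=1}^{K} u^{(k)} .
\end{equation*}
The choice of the 2.5th and 97.5th percentiles as the central 95\% region is an assumption of this sketch.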

\subsection{Four Prototypical Scenarios for Our Monte Carlo Simulations}

We simulate regression models with two independent variables ($X$ and $Z$). This sufficiently constrains our study's scope but the scenario is general enough to be applied in a wide range of research studies.
Whether the methods we evaluate below are effective or not depends on the conditional dependence structure among independent variables, the dependent variable $Y$, and automated classifications $W$.
This structure determines if adjustment for systematic misclassification is required \citep{carroll_measurement_2006}.
In Figure \ref{bayesnets}, we illustrate our scenarios via Bayesian networks representing the conditional dependence structure of variables  \citep{pearl_fusion_1986}:
We first simulate two cases where an AC measures an independent variable without (\emph{Simulation 1a}) and with differential error (\emph{Simulation 1b}). Then, we simulate using an AC to measure the dependent variable, either one with misclassifications that are uncorrelated (\emph{Simulation 2a}) or correlated with an independent variable (\emph{Simulation 2b}). GMM is not designed to correct misclassifications in dependent variables, so we omit this method in \emph{Simulations 2a} and \emph{2b}.

\input{bayesnets.tex}

\subsubsection{Misclassification in an Independent Variable (\emph{Simulations 1a} and \emph{1b})}

We first consider studies with the goal of testing hypotheses about the coefficients $B_1$ and $B_2$ in a least squares regression:
\begin{equation}
Y=B_0 + B_1 X + B_2 Z + \varepsilon
    \label{mod:true.ols}
\end{equation}

In this simulated example, $Y$ is a continuous variable, $X$ is a binary variable measured with an AC, and $Z$ is a normally distributed variable with mean 0 and standard deviation \Sexpr{sim1.z.sd} measured without error.
Both simulations have a normally distributed dependent variable $Y$ and two binary independent variables $X$ and $Z$, which are balanced ($P(X)=P(Z)=0.5$) and correlated (Pearson's $\rho=\Sexpr{round(sim2a.cor.xz,2)}$).
To represent a study design where an AC is needed to obtain sufficient statistical power, $Z$ and $X$ can explain only \Sexpr{format.percent(sim1.R2)} of the variance in $Y$.
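
One way to obtain such a design, sketched under the assumption that $\varepsilon$ has variance $\sigma^2$ and is independent of $X$ and $Z$, is to fix the coefficients first and then solve for the error variance. The share of variance explained in Equation \ref{mod:true.ols} is
\begin{equation*}
    R^2 = \frac{\mathrm{Var}(B_1 X + B_2 Z)}{\mathrm{Var}(B_1 X + B_2 Z) + \sigma^2},
    \qquad \text{so that} \qquad
    \sigma^2 = \mathrm{Var}(B_1 X + B_2 Z)\,\frac{1 - R^2}{R^2},
\end{equation*}
where $\mathrm{Var}(B_1 X + B_2 Z) = B_1^2 \mathrm{Var}(X) + B_2^2 \mathrm{Var}(Z) + 2 B_1 B_2 \mathrm{Cov}(X, Z)$. A small target $R^2$ therefore implies a large $\sigma^2$ relative to the variance of the linear predictor.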

In \emph{Simulation 1a} (Figure \ref{fig:simulation.1a}), we simulate an AC with \Sexpr{format.percent(sim1a.acc)} accuracy.\footnote{Classifier accuracy varies between our simulations because it is difficult to jointly specify classifier accuracy and the required correlations among variables and due to random variation between simulation runs. We report the median accuracy over simulation runs.}  This reflects a situation where $X$ may be difficult to predict, but the AC, represented as a logistic regression model having linear predictor $W^*$, provides a useful signal.
We simulate nondifferential misclassification because $W=X+\xi$, $\xi$ is normally distributed with mean $0$, and $\xi$ and $W$ are conditionally independent of $Y$ given $X$ and $Z$.

Our real-data example included a case in which the Perspective API disproportionately misclassified comments as toxic if they disclosed aspects of identity, which resulted in differential misclassification.
In \emph{Simulation 1b} (Figure \ref{fig:simulation.1b}), we test how error correction methods can handle such differential error by making AC predictions similarly depend on the dependent variable $Y$.
This simulated AC has $\Sexpr{format.percent(sim1b.acc)}$ accuracy and makes predictions $W$ that are negatively correlated with the residuals of the linear regression of $Y$ on $X$ and $Z$ (Pearson's $\rho=\Sexpr{round(sim1b.cor.resid.w_pred,2)}$). As a result, this AC makes fewer false positives and more false negatives at greater levels of $Y$.

\subsubsection{Measurement Error in a Dependent Variable (\emph{Simulations 2a} and \emph{2b})}

We then simulate using an AC to measure the dependent variable $Y$ which we aim to explain given a binary independent variable $X$ and a continuous independent variable $Z$. The goal is to estimate $B_1$ and $B_2$ in the following logistic regression model:

\begin{equation}
    P(Y) = \frac{1}{1 + e^{-(B_0 + B_1 X + B_2 Z)}}
    \label{mod:measerr.logit}
\end{equation}

In \emph{Simulation 2a} (see Figure \ref{fig:simulation.2a}) and \emph{Simulation 2b} (see Figure \ref{fig:simulation.2b}) $X$ and $Z$ are, again, balanced ($P(X)=P(Z)=0.5$) and correlated
 (Pearson's $\rho=\Sexpr{round(sim2a.cor.xz,2)}$).
In \emph{Simulation 1}, we chose the variance of the normally distributed outcome given our chosen coefficients $B_X$ and $B_Z$, but this is not appropriate for \emph{Simulation 2}'s logistic regression. We therefore choose, somewhat arbitrarily, $B_X=\Sexpr{sim2.Bx}$ and $B_Z=\Sexpr{sim2.Bz}$. We again simulate ACs with moderate predictive performance.
The AC in \emph{Simulation 2a} is \Sexpr{format.percent(sim2a.AC.acc)} accurate and the AC in \emph{Simulation 2b} is \Sexpr{format.percent(sim2b.AC.acc)} accurate. In \emph{Simulation 2a}, the misclassifications are nonsystematic as $\xi$ has mean $0$ and is independent of $X$ and $Z$.  However, in \emph{Simulation 2b}  the misclassifications $\xi$ are systematic and correlated with $Z$ (Pearson's $\rho = \Sexpr{round(sim2b.error.cor.z,2)}$).


\section{Simulation Results}

For each method, we visualize the consistency, efficiency, and the accuracy of uncertainty quantification of estimates across prototypical scenarios.

For example, Figure \ref{fig:sim1a.x} visualizes results for \emph{Simulation 1a}. Each subplot shows a simulation with a given total sample size (No. observations) and a given sample of manually annotated observations (No. manually annotated observations).
To assess a method's consistency, we locate the expected value of the point estimate across simulations with the center of the black circle. As an example, see the leftmost column in the bottom-left subplot of Figure \ref{fig:sim1a.x}. For the naïve estimator, the circle is far below the dashed line indicating the true value of $B_X$. Here, ignoring misclassification causes bias toward 0 and the estimator is inconsistent. To assess a method's efficiency, we mark the region in which the point estimate falls in 95\% of the simulations with black lines.
The black lines in the bottom-left subplot of Figure \ref{fig:sim1a.x} for example show that the feasible estimator, which uses only manually annotated data, is consistent but less precise than estimates from error correction methods. To assess each  method's uncertainty quantification, compare the gray lines,  which show the expected value of a method's approximate 95\% confidence intervals across simulations, to the corresponding black lines.
 The \emph{PL} column in the bottom-left subplot of Figure \ref{fig:sim1a.x} for instance shows that the method's  95\% confidence interval is biased towards 0 when the number of manually annotated observations is smaller.  This is to be expected because the PL estimator does not account for uncertainty in misclassification probabilities estimated using the sample of manually annotated observations.


\subsection{\emph{Simulation 1a:} Nonsystematic Misclassification of an Independent Variable}

Figure \ref{fig:sim1a.x} illustrates \emph{Simulation 1a}. Here, the naïve estimator is severely biased in its estimation of $B_X$.
Fortunately, error correction methods (GMM, MI, MLA) produce consistent estimates and acceptably accurate confidence intervals.
Notably, the PL method is inconsistent and considerable bias remains when the sample of annotations is much smaller than the entire dataset.  This is likely due to $P(X=x)$ missing from the PL estimation.\footnote{Compare Equation \ref{eq:mle.covariate.chainrule.4} in Appendix \ref{appendix:derivation} to Equations 24-28 from \citet{zhang_how_2021}.} Figure
\ref{fig:sim1a.x} also shows that MLA and GMM estimates become more precise in larger datasets.
As \citet{fong_machine_2021} also observed, this precision improvement is less pronounced for MI estimates, indicating that
GMM and MLA use automated classifications more efficiently than MI.
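
The direction of the naïve estimator's bias can be illustrated in a simplified setting that omits $Z$, an assumption made only for this sketch. If $Y = B_0 + B_1 X + \varepsilon$, both $X$ and $W$ are binary, and error is nondifferential so that $E[Y|X,W] = E[Y|X]$, then $E[Y|W=w] = B_0 + B_1 P(X=1|W=w)$. The slope $\tilde{B}_1$ obtained by regressing $Y$ on $W$ instead of $X$ is therefore
\begin{equation*}
    \tilde{B}_1 = E[Y|W=1] - E[Y|W=0] = B_1 \left[ P(X=1|W=1) - P(X=1|W=0) \right].
\end{equation*}
The bracketed factor equals the sum of the classifier's positive and negative predictive values minus one, and it lies between 0 and 1 for a classifier that is better than chance, so $\tilde{B}_1$ is pulled toward 0. With hypothetical predictive values of $0.7$ each, for example, the factor is $0.7 - 0.3 = 0.4$ and the naïve slope would recover only 40\% of $B_1$.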

\begin{figure}[htbp!]
<<example1.x,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
p <- plot.simulation.iv(plot.df.example.1, iv='x')
grid.draw(p)
@
\caption{Simulation 1a: Nonsystematic misclassification of an independent variable. Error correction methods, except for PL, obtain precise and accurate estimates given sufficient manually annotated data. \label{fig:sim1a.x}}
\end{figure}

In brief, when misclassifications cause nondifferential error, MLA and GMM are effective, efficient, and provide accurate uncertainty quantification.  They complement each other due to different assumptions: MLA depends on correctly specifying the likelihood but its robustness to incorrect specifications is difficult to analyze \citep{carroll_measurement_2006}. The GMM approach depends on the exclusion restriction instead of distributional assumptions \citep{fong_machine_2021}.
MLA's advantage over GMM comes from the relative ease with which it can be extended to, for instance, generalized linear models (GLMs) or generalized additive models (GAMs).
In cases similar to \emph{Simulation 1a}, we therefore recommend both GMM and MLA to correct for misclassification.

\subsection{\emph{Simulation 1b:} Systematic Misclassification of an Independent Variable}


\begin{figure}[htbp!]
<<example2.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
p <- plot.simulation.iv(plot.df.example.2, iv='x')
grid.draw(p)
@
\caption{\emph{Simulation 1b:} Systematic misclassification of an independent variable. Only the MLA approach obtains consistent estimates of $B_X$. \label{fig:sim1b.x}}
\end{figure}

Figure \ref{fig:sim1b.x} illustrates \emph{Simulation 1b}. Here, systematic misclassification gives rise to differential error and creates more extreme misclassification bias that is more difficult to correct.
As Figure \ref{fig:sim1b.x} shows, the naïve estimator is opposite in sign to the true parameter.
Of the four methods we test, only the MLA and the MI approaches provide consistent estimates. This is expected because they use $Y$ to adjust for misclassifications. The bottom row of Figure \ref{fig:sim1b.x} shows how the precision of the MI and MLA estimates increases with additional observations.  As in \emph{Simulation 1a}, MLA uses this data more efficiently than MI does. However, due to the low accuracy and bias of the AC, additional unlabeled data improves precision less than one might expect. Both methods provide acceptably accurate confidence intervals. Figure \ref{fig:sim1b.z} in Appendix \ref{appendix:main.sim.plots} shows that, as in \emph{Simulation 1a}, effective correction for misclassifications of $X$ is required to consistently estimate $B_Z$, the coefficient of $Z$ on $Y$.  Inspecting results from methods that do not correct for differential error is useful for understanding their limitations. When few annotations of $X$ are observed, GMM is nearly as bad as the naïve estimator. PL is also visibly biased. Both improve when a greater proportion of the data is labeled since they combine AC-based estimates with the feasible estimator.
In sum, our simulations suggest that the MLA approach is superior in conditions of differential error.  Although estimations by the MI approach are consistent, the method's practicality is limited by its inefficiency.

\subsection{\emph{Simulation 2a:} Nonsystematic Misclassification of a Dependent Variable}

 \begin{figure}[htbp!]
<<example3.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
#plot.df <-
p <- plot.simulation.dv(plot.df.example.3,'z')
grid.draw(p)
@
\caption{Simulation 2a: Nonsystematic misclassification of a dependent variable. Only the MLA approach obtains consistent estimates. \label{fig:sim2a.x}}
\end{figure}

Figure \ref{fig:sim2a.x} illustrates \emph{Simulation 2a}: nonsystematic misclassification of a dependent variable. This also introduces bias as evidenced by the naïve estimator's inaccuracy. Our MLA method
is able to correct this error and provide consistent estimates.
Surprisingly, the MI estimator is inconsistent and does not improve with more human-labeled data.
The PL approach is also inconsistent, especially when only few of all observations are annotated manually. It is closer to recovering the true parameter than the MI or the naïve estimator, but provides only modest improvements in precision compared to the feasible estimator.
It is clear that the precision of the MLA estimator improves with more observations to a greater extent than that of the PL estimator.
When the amount of human-labeled data is low, inaccuracies in the 95\%
 In brief, our simulations suggest that MLA is the best error correction method when random misclassifications affect the dependent variable. It is the only consistent option and more efficient than the PL method, which is almost consistent.

\subsection{\emph{Simulation 2b}: Systematic Misclassification of a Dependent Variable}

\begin{figure}
<<example.4.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=

p <- plot.simulation.dv(plot.df.example.4,'z')
grid.draw(p)
@
\caption{Simulation 2b: Systematic misclassification of a dependent variable. Only the MLA approach obtains consistent estimates. \label{fig:sim2b.x}}
\end{figure}

In \emph{Simulation 2b}, misclassifications of the dependent variable $Y$ are correlated with an independent variable $Z$. As shown in Figure \ref{fig:sim2b.x}, this causes dramatic bias in the naïve estimator.
Similar to \emph{Simulation 2a}, MI is inconsistent. PL is also inconsistent because it does not account for $Y$ when correcting for misclassifications.
As in \emph{Simulation 1b}, our MLA method obtains consistent estimates, but only does much better than the feasible estimator when the dataset is large.
Figure \ref{fig:sim2b.z} in Appendix \ref{appendix:main.sim.plots} shows that the precision of estimates for the coefficient for $X$ improves with additional data to a greater extent. As such, this imprecision is mainly in estimating the coefficient for $Z$, the variable correlated with misclassification.

Therefore, our simulations suggest that MLA is the best method when misclassifications in the dependent variable are correlated with an independent variable.

\section{Transparency about Misclassification Is Not Enough—We Have To Fix It! Recommendations for Automated Content Analysis}

``Validate, Validate, Validate'' \citep[p. 269]{grimmer_text_2013} is one of the guiding mantras for automated content analysis. It reminds us that ACs can produce misleading results and of the importance of steps to ascertain validity, for instance by making misclassification transparent.
Like \citet{grimmer_text_2013}, we are deeply concerned that computational methods may produce invalid evidence. In this sense, their validation mantra animates this paper. But transparency about misclassification rates via metrics such as precision or recall leaves unanswered an important question: Is comparing automated classifications to some external ground truth sufficient to claim that results are valid? Or is there something else we can do and should do?

We think there is: Using statistical methods to not only quantify but also correct for misclassification. Our study provides several recommendations in this regard, as summarized in Figure \ref{fig:FigureRecommendations}.

\begin{figure}[hbt!]
\centering
\input{flowchart_recommendations.tex}
     \caption{Recommendations for Automated Content Analysis Study Design}
     \label{fig:FigureRecommendations}
\end{figure}

\subsubsection{Step 1: Attempt Manual Content Analysis}

Manual content annotation is often done \textit{post facto}, for instance to calculate the predictiveness of an already existing AC such as Google's Perspective classifier. We propose to instead use manually annotated data \textit{ante facto}, i.e., before building or validating an AC.
Practically speaking, the main reason to use an AC is feasibility: to avoid the costs of manually coding a large dataset.
One may for example need a large dataset to study an effect one assumes to be small. Manually labeling such a dataset is expensive.
ACs are often seen as a cost-saving procedure, but scholars frequently fail to consider the threats to validity posed by misclassification.
Moreover, validating an existing AC or building a new AC is also expensive, for instance due to costs of computational resources or manual annotation of (perhaps smaller) test and training datasets.

We therefore caution researchers against preferring automated over manual content analysis unless doing so is necessary to obtain useful evidence. We agree with \citet{baden_three_2022} who argue that ``social science researchers may be well-advised to eschew the promises of computational tools and invest instead into carefully researcher-controlled, limited-scale manual studies'' (p. 11). In particular, we recommend using manually annotated data \textit{ante facto}: Researchers should begin by examining human-annotated data so as to discern if an AC is necessary. In our simulations, the feasible estimator is less precise but consistent in all cases. So if fortune shines and this estimate sufficiently answers one's research question, manual coding is sufficient. Here, scholars should rely on existing recommendations for descriptive and inferential statistics when using manual content analysis \citep{geis_statistical_2021, bachl_correcting_2017}. If, however, the feasible estimator fails to provide convincing evidence, for example by not rejecting the null, manually annotated data is not wasted. It can be reused to build an AC or correct misclassification bias.

\subsubsection{Step 2: Use Manually Annotated Data to Detect Systematic Misclassification}

As demonstrated in our simulations, knowing whether an AC makes systematic misclassifications is important: It determines which correction methods can work.
Fortunately, manually annotated data can be used to detect systematic misclassification.
For example, \citet{fong_machine_2021} suggest using Sargan's J-test of the null hypothesis that the product of the AC's predictions and regression residuals have an expected value of 0.
More generally, one can test if the data's conditional independence structures can be represented by Figures \ref{fig:simulation.1a} or \ref{fig:simulation.2a}. This can be done, for example, via likelihood ratio tests of $P(W|X,Z) = P(W|X,Y,Z)$ (if an AC measures an independent variable $X$) or of $P(W|Y) = P(W|Y,Z,X)$ (if an AC measures a dependent variable $Y$) or by visual inspection of plots relating misclassifications to other variables \citep{carroll_measurement_2006}.
 We strongly recommend using such methods to test for systematic misclassification and to design an appropriate correction.
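
As a sketch of such a likelihood ratio test for the case in which an AC measures $X$, assume the logistic error model of Equation \ref{eq:covariate.logisticreg.w} is fitted to the manually annotated observations. Nondifferential error then corresponds to the null hypothesis $\alpha_1 = 0$. Letting $\ell_1$ denote the maximized log-likelihood of the full error model and $\ell_0$ that of the restricted model without $Y$ (notation introduced only for this sketch), the test statistic is
\begin{equation*}
    \Lambda = 2\,(\ell_1 - \ell_0), \qquad \Lambda \sim \chi^2_1 \text{ approximately under the null hypothesis},
\end{equation*}
so that values of $\Lambda$ above $3.84$ reject nondifferential error at the 5\% level. Error models in which $Y$ enters through additional terms, such as interactions, would add one degree of freedom per restricted parameter.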

\subsubsection{Step 3: Correct for Misclassification Bias Instead of Being Naïve}

Across our simulations, we showed that the naïve estimator is biased. Testing different error correction methods, we found that these generate different levels of consistency, efficiency, and accuracy in uncertainty quantification. That said, our proposed MLA method should be considered as a versatile method because it is the only method capable of producing consistent estimates in all of the prototypical situations studied here. We recommend the MLA method as the first ``go-to'' method.  As shown in Appendix \ref{appendix:robustness}, this method requires specifying a valid error model to obtain consistent estimates. One should take care that the model does not omit relevant variables, including nonlinearities and interactions.
Our \texttt{misclassificationmodels} R package provides reasonable default error models and a user-friendly interface to facilitate adoption of our MLA method (see Appendix \ref{appendix:misclassificationmodels}).

When feasible, we recommend comparing the MLA approach to another error correction method. Consistency between two correction methods shows that results are robust independent of the correction method. If the AC is used to predict an independent variable, GMM is a good choice if error is nondifferential. Otherwise, MI can be considered.
Unfortunately, if the AC is used to predict a dependent variable, our simulations do not support a strong suggestion for a second method.
PL might be a reasonable choice with enough manually annotated data and nondifferential error.
This range of viable choices in error correction methods also motivates our next recommendation.

\subsubsection{Step 4: Provide a Full Account of  Methodological Decisions}

Finally, we add our voices to those
recommending that researchers report methodological decisions so others can understand and replicate their design \citep{pipal_if_2022, reiss_reporting_2022}, especially in the context of machine learning \citep{mitchell_model_2019}. These decisions include but are not limited to choices concerning test and training data (e.g., size, sampling, split in cross-validation procedures, balance), manual annotations (size, number of annotators, intercoder values, size of data annotated for intercoder testing), and the classifier itself (choice of algorithm or ensemble, different accuracy metrics). They extend to reporting different error correction methods as proposed by our third recommendation.
In our review, we found that reporting such decisions is not yet common, at least in the context of SML-based text classification.
When correcting for misclassification, uncorrected results will often provide a lower bound on effect sizes; corrected analyses will provide more accurate but less conservative results.
Therefore, both corrected and uncorrected estimates should be presented as part of making potential multiverses of findings transparent.


\section{Conclusion and Limitations}

Misclassification bias is an important threat to validity in studies that use automatic classifiers to measure statistical variables.
As we showed in an example with data from the Perspective API, widely used and very accurate automated classifiers can cause type-I and type-II errors.
As evidenced by our literature review, this problem has not attracted enough attention within communication science \citep[but see][]{bachl_correcting_2017} or in the broader computational social science community.
Although current best practices of reporting metrics of classifier performance on manually annotated validation data, for instance precision or recall, are important, they provide little protection from misclassification bias.
These practices use annotations to enact a transparency ritual to ward against misclassification bias, but annotations can do much more. With the right statistical model, they can correct misclassification bias.

We introduce maximum likelihood adjustment (MLA), a new method we designed to correct misclassification bias, and use Monte Carlo simulations to
evaluate it in comparison to other recently proposed error correction methods.
Our MLA method is the only one that is effective across a wide range of scenarios. It is also straightforward to use. Our implementation in the R package \texttt{misclassificationmodels} provides a familiar formula interface for regression models.
Remarkably, our simulations show that our method can use even an automated classifier below common accuracy standards to obtain consistent estimates.  Therefore, low accuracy is not necessarily a barrier to using an AC.

Based on these results, we provide four recommendations for the future of automated content analysis: Researchers should (1) attempt manual content analysis before building or validating ACs to see whether human-labeled data is sufficient, (2) use manually annotated data to test for systematic misclassification and choose appropriate error correction methods, (3) correct for misclassification via error correction methods, and (4) be transparent about the methodological decisions involved in AC-based classifications and error correction.

Our study has several limitations. First, the simulations and methods we introduce focus on misclassification by automated tools. They provisionally assume that human annotators do not make errors, especially not systematic ones.
This assumption can be reasonable if intercoder reliability is very high but, as with ACs, this may not always be the case.

 Thus, it may be important to account for measurement error by human coders  \citep{bachl_correcting_2017} and by automated classifiers simultaneously. In theory, it is possible to extend our MLA approach in order to do so \citep{carroll_measurement_2006}.
However, because the true values of content categories are never observed, accounting for automated and human misclassification at once requires latent variable methods that bear considerable additional complexity and assumptions \citep{pepe_insights_2007}. We leave the integration of such methods into our MLA framework for future work. In addition, our method requires an additional assumption that the error model is correct. As we argue in Appendix \ref{appendix:robustness} (section \ref{appendix:assumption}), this assumption is often acceptable.
Second, the simulations we present do not consider all possible factors that may influence the performance and robustness of error correction methods including classifier accuracy, heteroskedasticity, and violations of distributional assumptions.  We are working to investigate such factors, as shown in  Appendix \ref{appendix:robustness}, by extending our simulations.

\setcounter{biburlnumpenalty}{9001}
\printbibliography[title = {References}]

\clearpage
\appendix
\addcontentsline{toc}{section}{Appendices}
\stepcounter{section}
\section{Perspective API Example}
\label{appendix:perspective}

Our example relies on the publicly available Civil Comments dataset \citep{cjadams_jigsaw_2019}. The dataset contains around 2 million comments collected from independent English-language news sites between 2015 and 2017. We rely on a subset of \Sexpr{f(dv.example[['n.annotated.comments']])} comments which were manually annotated both for toxicity (\emph{toxicity}) and disclosure of identity (\emph{disclosure}) in a comment. The dataset also includes counts of likes each comment received (\emph{number of likes}).

Each comment was labeled by up to ten manual annotators (although selected comments were labeled by even more annotators). Originally, the dataset represents \emph{toxicity} and \emph{disclosure} as proportions of annotators who labeled a comment as toxic or as disclosing aspects of personal identity including race and ethnicity.
For our analysis, we converted these proportions into indicators of the majority view to transform both variables to a binary scale.
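
The conversion can be sketched in a few lines of \texttt{R}. This is a minimal illustration rather than our processing code: the toy values and column names are hypothetical, and treating exact ties (a proportion of 0.5) as 0 is an assumption of the sketch.

\begin{lstlisting}[language=R, style=mystyle]
# Hypothetical toy data: share of annotators who applied each label.
comments <- data.frame(
  toxicity   = c(0.10, 0.50, 0.80, 0.00),
  disclosure = c(0.60, 0.20, 0.50, 1.00)
)

# Majority view: 1 if more than half of the annotators applied the label.
# Exact ties (0.5) are coded as 0 here, which is an assumption.
comments$toxicity.binary   <- as.integer(comments$toxicity > 0.5)
comments$disclosure.binary <- as.integer(comments$disclosure > 0.5)
\end{lstlisting}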

Our MLA method works in this scenario, as shown in Figure \ref{fig:real.data.example.app} below.

\begin{figure}[htbp!]
\centering
\begin{subfigure}{\linewidth}
<<real.data.example.iv.full,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.asp=0.3,cache=F>>=
p <- plot.civilcomments.iv.example(include.models=c("Automatic Classification", "All Annotations", "Annotation Sample", "Error Correction"))
print(p)
@
\subcaption{\emph{Example 1}: Misclassification in an independent variable. \label{fig:real.data.example.iv.app}}
\end{subfigure}

\begin{subfigure}{\linewidth}
<<real.data.example.dv.full,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.asp=0.3,cache=F>>=
p <- plot.civilcomments.dv.example(include.models=c("Automatic Classification", "All Annotations", "Annotation Sample", "Error Correction"))
print(p)
@
\subcaption{\emph{Example 2}: Misclassification in a dependent variable. \label{fig:real.data.example.dv.app}}

\end{subfigure}
\caption{Real-data example including correction using MLA. \label{fig:real.data.example.app}}
\end{figure}

Our maximum-likelihood based error correction technique in this example requires specifying models for Perspective's scores and, in the case where these scores are used as a covariate, a model for the human annotations. In our first example, where toxicity was used as a covariate, we used the \emph{human annotations}, \emph{identity disclosure}, and the interaction of these two variables in the model for scores. We omitted \emph{likes} from this model because they are virtually uncorrelated with misclassifications (Pearson's $\rho=\Sexpr{iv.example[['civil_comments_cortab']]['toxicity_error','likes']}$). Our model for the human annotations is an intercept-only model.

In our second example, where toxicity is the outcome, we use the fully interacted model of the \emph{human annotations}, \emph{identity disclosure}, and \emph{likes} in our model for the scores because all three variables are correlated with the Perspective scores.

\section{Systematic Literature Review} \label{appendix:lit.review}

To inform our simulations, we reviewed studies using SML for text classification.

\subsection{Identification of Relevant Studies}
Our sample was drawn from four recent reviews on the use of AC within the context of communication science and the social sciences more broadly \citep{baden_three_2022, hase_computational_2022, junger_unboxing_2022, song_validations_2020}. The authors of these reviews had either already published their data openly or kindly shared them with us upon request.
From their reviews, we collected \emph{N} = 110 studies that included some type of SML (for an overview, see Figure \ref{fig:FigureA1}).

\begin{figure}
    \centering
    \includegraphics{measurement_flow.pdf}
    \caption{Identifying relevant studies for the literature review}
    \label{fig:FigureA1}
\end{figure}

We first removed 8 duplicate studies identified by several reviews. Two coders then coded the remaining \emph{N} = 102 studies of our preliminary sample for relevance. After an intercoder test (\emph{N} = 10, $\alpha$ = .89), we excluded studies that did not fulfill our inclusion criteria, namely studies that did not include any SML approach and studies that used SML only for data cleaning rather than data analysis, for instance to sort out topically irrelevant articles. Next, we removed studies focusing on methodologically advancing SML-based ACs since these studies often include far more robustness and validity tests than commonly employed in empirical settings. Subsequently, all relevant empirical studies (\emph{N} = 48) were coded in further detail.

\subsection{Manual Coding of Relevant Empirical Studies}
For manual coding, we created a range of variables (for an overview, see Table \ref{tab:TableA1}). Based on data from the Social Sciences Citation Index (SSCI), we identified whether studies were published in journals classified as belonging to \emph{Communication} and their \emph{Impact} according to their H index. In addition, two authors manually coded...
\begin{itemize}
 \item the type of variables created via SML-based ACs using the variables \emph{Dichotomous} (0 = No, 1 = Yes), \emph{Categorical} (0 = No, 1 = Yes), \emph{Ordinal} (0 = No, 1 = Yes), and \emph{Metric} (0 = No, 1 = Yes),
 \item whether variables were used in descriptive or multivariate analyses using the variables \emph{Descriptive} (0 = No, 1 = Yes), \emph{Independent} (0 = No, 1 = Yes), and \emph{Dependent} (0 = No, 1 = Yes),
 \item how classifiers were trained and validated via manually annotated data using the variables \emph{Size Training Data} (Open String), \emph{Size Test Data} (Open String), \emph{Size Data Intercoder Test} (Open String), \emph{Intercoder Reliability} (Open String), and \emph{Accuracy of Classifier} (Open String),
 \item whether articles mentioned and/or corrected for misclassifications using the variables \emph{Error Mentioned} (0 = No, 1 = Yes) and \emph{Error Corrected} (0 = No, 1 = Yes).
\end{itemize}

\begin{table}
  \caption{Variables Coded for Relevant Empirical Studies}
  \label{tab:TableA1}
  \begin{tabular}{l l l l}         \toprule
  Category               & Variable                      & Krippendorff's $\alpha$ & \% or \emph{M} (\emph{SD}) \\ \midrule
  Type of Journal        & \emph{Communication}          & n.a.                    & 67\%  \\
                         & \emph{Impact}                 & n.a.                    & \emph{M} = 4 \\
  Type of Variable       & \emph{Dichotomous}            & 0.86                    & 50\%  \\
                         & \emph{Categorical}            & 1                       & 23\% \\
                         & \emph{Ordinal}                & 0.85                    & 10\% \\
                         & \emph{Metric}                 & 1                       & 35\% \\
  Use of Variable        & \emph{Descriptive}            & 0.89                    & 90\% \\
                         & \emph{Independent}            & 1                       & 44\% \\
                         & \emph{Dependent}              & 1                       & 40\% \\
  Information on Classifier & \emph{Size Training Data}  & 0.95                    & 67\%  \\
                         & \emph{Size Test Data}      & 0.79                    & 52\%  \\
                         & \emph{Size Data Intercoder Test}  & 1      & 44\%  \\
                         & \emph{Intercoder Reliability}  & 0.8             & 56\%  \\
                         & \emph{Accuracy of Classifier}  & 0.77                   & 85\%  \\
  Measurement Error      & \emph{Error Mentioned}        & 1                       & 19\% \\
                         & \emph{Error Corrected}        & 1                       & 2\% \\ \bottomrule
  \end{tabular}
\end{table}

\subsection{Results}

SML-based ACs were most often used to create dichotomous measurements (\emph{Dichotomous}: 50\%), followed by variables on a metric (\emph{Metric}: 35\%), categorical (\emph{Categorical}: 23\%), or ordinal scale (\emph{Ordinal}: 10\%). Almost all studies used SML-based classifications to report descriptive statistics on created variables (\emph{Descriptive}: 90\%). However, many also used these in downstream analyses, either as dependent variables (\emph{Dependent}: 40\%) or independent variables (\emph{Independent}: 44\%) in statistical models.

Only slightly more than half of all studies included information on the size of training or test sets (\emph{Size Training Data}: 67\%, \emph{Size Test Data}: 52\%). Even fewer included information on the size of manually annotated data for intercoder testing (\emph{Size Data Intercoder Test}: 44\%) or respective reliability values (\emph{Intercoder Reliability}: 56\%). In addition, not all studies reported how well their classifier performed using metrics such as precision, recall, or F1 scores (\emph{Accuracy of Classifier}: 85\%). Lastly, few studies explicitly mentioned the issue of misclassification (\emph{Error Mentioned}: 19\%), with only a single study correcting for it (\emph{Error Corrected}: 2\%).

\section{Other Error Correction Methods}
\label{appendix:other.methods}
Statisticians have introduced a range of other error correction methods which we did not test in our simulations. Here, we briefly discuss three additional methods and explain why we did not include them in our simulations.

\emph{Simulation extrapolation} (SIMEX) simulates the process generating measurement error to model how measurement error affects an analysis and ultimately to approximate an analysis with no measurement error \citep{carroll_measurement_2006}. SIMEX is a very powerful and general method that can be used without manually annotated data, but may be more complicated than necessary to correct measurement error from ACs when manually annotated data is available. Likelihood methods are easy to apply to misclassification so SIMEX seems unnecessary \citep{carroll_measurement_2006}.

\emph{Score function methods} derive estimating equations for models without measurement error and then solve them either exactly or using numerical integration \citep{carroll_measurement_2006, yi_handbook_2021}.
The main advantage score function methods may have over likelihood-based methods is that they do not require distributional assumptions about mismeasured independent variables. This advantage has limited use in the context of ACs because binary classifications must follow Bernoulli distributions.

We also do not consider \emph{Bayesian methods} (aside from the Amelia implementation of the MI approach) because we expect these to have similar limitations to the maximum likelihood methods we consider. Bayesian methods may have other advantages resulting from posterior inference and may generalize to a wide range of applications. However, specifying prior distributions introduces additional methodological complexity and posterior inference is computationally intensive, making Bayesian methods less convenient for Monte Carlo simulations.


\section{Deriving the Maximum Likelihood Approach}
\label{appendix:derivation}
In the following, we derive our MLA approach for addressing misclassifications.
\subsection{When an AC Measures an Independent Variable}
To explain why the MLA approach is effective, we follow \citet{carroll_measurement_2006} and begin by observing the following fact from basic probability theory:

\begin{align}
    P(Y,W) &= \sum_{x}{P(Y,W,X=x)}
    \label{eq:mle.covariate.chainrule.1}\\
    &= \sum_{x}{P(Y|W,X=x)P(W,X=x)}
    \label{eq:mle.covariate.chainrule.2}\\
    &= \sum_{x}{P(Y,X=x)P(W|Y,X=x)}  \label{eq:mle.covariate.chainrule.3} \\
    &= \sum_{x}{P(Y|X=x)P(W|Y,X=x)P(X=x)} \label{eq:mle.covariate.chainrule.4}
\end{align}
\noindent
Equation \ref{eq:mle.covariate.chainrule.1} integrates $X$ out of the joint probability of $Y$ and $W$ by summing over its possible values $x$. If $X$ is binary, this means adding the probability given $x=1$ to the probability given $x=0$.  When $X$ is observed, say $x=0$, then $P(X=0)=1$ and $P(X=1)=0$. As a result, only the true value of $X$ contributes to the likelihood. However, when $X$ is unobserved, all of its possible values contribute. In this way, integrating out $X$ allows us to include data where $X$ is not observed in the likelihood.

Equation \ref{eq:mle.covariate.chainrule.2} uses the chain rule of probability to factor the joint probability $P(Y,W,X=x)$ into $P(Y|W,X=x)$, the conditional probability of $Y$ given $W$ and $X$, and $P(W,X=x)$, the joint probability of $W$ and $X$. This lets us see how maximizing $\mathcal{L}(\Theta|Y,W)$, the joint likelihood of $\Theta$ given $Y$ and $W$, accounts for the uncertainty of automated classifications. For each possible value $x$ of $X$, it weights the model of the outcome $Y$ by the probability that $x$ is the true value and that the AC outputs $W$.

Equation \ref{eq:mle.covariate.chainrule.3} shows a different way to factor the joint probability $P(Y,W)$ so that $W$ is not in the model of $Y$. Since $X$ and $W$ are correlated, if $W$ is in the model for $Y$, the estimation of $B_1$ will be biased.  By including $Y$ in the model for $W$, Equation \ref{eq:mle.covariate.chainrule.3} can account for differential measurement error.

Equation \ref{eq:mle.covariate.chainrule.4} factors $P(Y,X=x)$, the joint probability of $Y$ and $X$, into $P(Y|X=x)$, the conditional probability of $Y$ given $X$, and $P(X=x)$, the probability of $X$, while retaining $P(W|Y,X=x)$, the conditional probability of $W$ given $X$ and $Y$. This shows that fitting a model for $Y$ given $X$ in this framework, such as the regression model $Y = B_0 + B_1 X + B_2 Z$, requires including the exposure model for $P(X=x)$. Without validation data, $P(X=x)$ is difficult to calculate without strong assumptions \citep{carroll_measurement_2006}, but $P(X=x)$ can easily be estimated using a sample of validation data.

Equations \ref{eq:mle.covariate.chainrule.1}--\ref{eq:mle.covariate.chainrule.4} demonstrate the generality of this method because the conditional probabilities may be calculated using a wide range of probability models.
 For simplicity, we have focused on linear regression for the probability of $Y$ and logistic regression for the probability of $W$ and the probability of $X$. However, more flexible probability models such as generalized additive models (GAMs) or Gaussian process classification may be useful for modeling nonlinear conditional probability functions \citep{williams_bayesian_1998}.
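
To make the factorization in Equation \ref{eq:mle.covariate.chainrule.4} concrete, the following minimal sketch writes the corresponding log-likelihood in \texttt{R} for simulated data. It is an illustration under stated assumptions, not the implementation used in our simulations: the data-generating values, the parameter vector \texttt{theta}, and the helper names \texttt{log.joint} and \texttt{nll} are hypothetical, and the sketch additionally lets the exposure model depend on $Z$. Validation rows contribute the joint density at the observed $x$, whereas all other rows sum over $x \in \{0, 1\}$.

\begin{lstlisting}[language=R, style=mystyle]
set.seed(1)
n <- 2000                      # all observations
m <- 400                       # validation rows with X observed
z <- rnorm(n)
x <- rbinom(n, 1, plogis(0.5 * z))
y <- 1 + 2 * x - z + rnorm(n)
# Differential error: W depends on Y as well as on X.
w <- rbinom(n, 1, plogis(-1.5 + 3 * x + 0.3 * y))
x.obs <- x
x.obs[(m + 1):n] <- NA         # X is unobserved outside the validation rows

# log P(Y | X = xv, Z) + log P(W | Y, X = xv) + log P(X = xv | Z)
# theta: 1-3 outcome model, 4-6 error model, 7-8 exposure model, 9 log(sd)
log.joint <- function(theta, xv, y, w, z) {
  dnorm(y, theta[1] + theta[2] * xv + theta[3] * z, exp(theta[9]),
        log = TRUE) +
    dbinom(w, 1, plogis(theta[4] + theta[5] * xv + theta[6] * y),
           log = TRUE) +
    dbinom(xv, 1, plogis(theta[7] + theta[8] * z), log = TRUE)
}

nll <- function(theta) {
  val <- !is.na(x.obs)
  # Validation rows: only the observed value of X contributes.
  ll.val <- log.joint(theta, x.obs[val], y[val], w[val], z[val])
  # Other rows: sum over both values of X (log-sum-exp for stability).
  a <- log.joint(theta, 1, y[!val], w[!val], z[!val])
  b <- log.joint(theta, 0, y[!val], w[!val], z[!val])
  mx <- pmax(a, b)
  ll.unl <- mx + log(exp(a - mx) + exp(b - mx))
  -(sum(ll.val) + sum(ll.unl))
}

fit <- optim(rep(0, 9), nll, method = "BFGS", control = list(maxit = 500))
b1.hat <- fit$par[2]           # estimate of B_1, the coefficient of X
\end{lstlisting}

Because $Y$ enters the model for $W$, the sketch allows for differential error. Dropping \texttt{theta[6]} would impose nondifferential error instead.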


\subsection{When an AC Measures the Dependent Variable}

Again, we will maximize $\mathcal{L}(\Theta|Y,W)$, the joint likelihood of the parameters $\Theta$ given the outcome $Y$ and automated classifications $W$ measuring the dependent variable $Y$ \citep{carroll_measurement_2006}.
We again use the law of total probability to integrate out $Y$ and the chain rule of probability to factor the joint probability into $P(Y)$, the probability of $Y$, and $P(W|Y)$, the conditional probability of $W$ given $Y$.

\begin{align}
    P(Y,W) &= \sum_{y}{P(Y=y,W)} \\
        &= \sum_{y}{P(Y=y)P(W|Y=y)}
\end{align}

As above, the conditional probability of $W$ given $Y$ must be calculated using a model. The range of possible models is vast and analysts must choose a model that accurately describes the conditional dependence of $W$ on $Y$.
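
As a worked illustration with purely hypothetical numbers, suppose $Y$ is binary and, for an observation without manual annotation, the outcome model gives $P(Y=1)=0.3$ while the error model gives $P(W=1|Y=1)=0.8$ and $P(W=1|Y=0)=0.1$. Summing over both values of $y$, the contribution of an observed classification $W=1$ is
\begin{align*}
    \sum_{y}{P(Y=y)P(W=1|Y=y)} &= P(Y=1)P(W=1|Y=1) + P(Y=0)P(W=1|Y=0) \\
        &= 0.3 \times 0.8 + 0.7 \times 0.1 = 0.31.
\end{align*}
\noindent
If the same observation had instead been manually annotated as $Y=1$, only that value would contribute, giving $0.3 \times 0.8 = 0.24$.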

We implement these methods in \texttt{R} using the \texttt{optim} function for maximum likelihood estimation. Our implementation supports models specified using \texttt{R}'s formula syntax. It can fit linear and logistic regression models when an AC measures an independent variable and logistic regression models when an AC measures the dependent variable. Our implementation provides two methods for approximating confidence intervals: the Fisher information quadratic approximation and the profile likelihood method provided in the \texttt{R} package \texttt{bbmle}. The Fisher approximation usually works well in simple models fit to large samples and is fast enough for practical use for the large number of simulations we present. However, the profile likelihood method provides more accurate confidence intervals \citep{carroll_measurement_2006}.
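
The quadratic approximation can be sketched in base \texttt{R} as follows. This is a minimal illustration, not the code of our implementation: the simulated data and all object names are hypothetical, and a plain logistic regression stands in for the full misclassification likelihood. The standard errors are taken from the inverse of the Hessian of the negative log-likelihood, which \texttt{optim} returns when \texttt{hessian = TRUE}.

\begin{lstlisting}[language=R, style=mystyle]
set.seed(2)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1 * x))

# Negative log-likelihood of a logistic regression of y on x.
nll <- function(beta) {
  -sum(dbinom(y, 1, plogis(beta[1] + beta[2] * x), log = TRUE))
}

fit <- optim(c(0, 0), nll, method = "BFGS", hessian = TRUE)

# The Hessian of the negative log-likelihood at the optimum is the
# observed information; its inverse approximates the covariance matrix.
se <- sqrt(diag(solve(fit$hessian)))
ci <- cbind(lower = fit$par - qnorm(0.975) * se,
            upper = fit$par + qnorm(0.975) * se)
\end{lstlisting}

In this simple case the result can be checked against \texttt{glm(y \textasciitilde{} x, family = binomial)}, which should give nearly identical estimates and standard errors.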


\subsection{Comment on Model Assumptions}
\label{appendix:assumption}

How burdensome is the assumption that the error model be able to consistently estimate the conditional probability of $W$ given $Y$?  If this assumption were much more difficult than those already accepted by the model for $Y$ given $X$ and $Z$, one would fear that using the MLA correction method introduces greater validity threats than it removes. In particular, one may worry that unobserved variables $U$ are omitted from our model for $P(Y,W)$.  As demonstrated in Appendix \ref{appendix:robustness} (section \ref{appendix:misspec}), the MLA method is less effective when variables are omitted from the error model.

However, if we believe our outcome model for $P(Y|X,Z)$ is consistent, this threat is substantially reduced. If one can assume a model for $P(Y|X,Z)$, it is often reasonable to assume the variables needed to model $P(W|X,Y,Z)$ are observed.
Furthermore, since $W$ is an output from an automated classifier it depends only on the classifier's features, which are observable in principle. As a result, and as suggested by \citet{fong_machine_2021}, one should consider including all such features in the error model.

However, due to the highly nonlinear nature of machine learning classifiers, specifying the functional form of the error model may require care in practice. One option is to calibrate an AC to one's dataset and thereby obtain accurate estimates of its predicted probabilities.
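
One common way to calibrate, which we name here as an assumption rather than as part of our method, is Platt scaling: regressing the manual annotations on the classifier's raw scores in the validation data. The following minimal sketch uses simulated data, and all object and column names are hypothetical.

\begin{lstlisting}[language=R, style=mystyle]
set.seed(3)
n.val <- 500
# Hypothetical validation data: manual label x and a raw score in (0, 1).
x <- rbinom(n.val, 1, 0.3)
score <- plogis(rnorm(n.val, mean = ifelse(x == 1, 1.5, -1.5), sd = 1.5))
val <- data.frame(x = x, score = score)

# Platt scaling: logistic regression of the label on the logit of the score.
cal <- glm(x ~ qlogis(score), family = binomial, data = val)

# Calibrated probabilities that the true label is 1.
val$p.cal <- predict(cal, type = "response")
\end{lstlisting}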


\section{misclassificationmodels: The R package} \label{appendix:misclassificationmodels}

The package provides a function that fits regression models while correcting for misclassification using information from manually annotated data. The function is very similar to \textbf{glm()}, with two changes:

\begin{itemize}
\item The formula interface has been extended with the double-pipe operator to denote a proxy variable. For example, \textbf{x || w} indicates that \textit{w} is the proxy of the ground truth \textit{x}.
\item The manually annotated data must be provided via the argument \textit{data2}.
\end{itemize}

The following snippet shows a typical scenario, here for correcting misclassifications in an independent variable:
\lstset{style=mystyle}
\begin{lstlisting}[language=R, caption=A demo of misclassificationmodels]
library(misclassificationmodels)
## research_data contains the following columns: y, w, z
## val_data contains the following columns: y, w, x, z
# w is a proxy of x
res <- glm_fixit(formula = y ~ x || w + z,
                 data = research_data,
                 data2 = val_data)
summary(res)
\end{lstlisting}

For more information about the package,  please see here: \url{https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e}.


\section{Additional plots for Simulations 1 and 2}
\label{appendix:main.sim.plots}

Appendix \ref{appendix:main.sim.plots} includes additional plots for our main simulations across \emph{Simulations 1a--2b}. It visualizes estimates of $B_Z$, the coefficient of the second independent variable in our inferential model. Here, Figure \ref{fig:sim1a.z} visualizes estimates of $B_Z$ in \emph{Simulation 1a}, Figure \ref{fig:sim1b.z} visualizes estimates of $B_Z$ in \emph{Simulation 1b}, Figure \ref{fig:sim2a.z} visualizes estimates of $B_Z$ in \emph{Simulation 2a}, and Figure \ref{fig:sim2b.z} visualizes estimates of $B_Z$ in \emph{Simulation 2b}.

\begin{figure}[htbp!]
<<example1.g,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=

p <- plot.simulation.iv(plot.df.example.1,iv='z')

grid.draw(p)
@
\caption{Estimates of $B_Z$ in \emph{simulation 1a}, multivariate regression with $X$ measured using an AC and model accuracy independent of $X$, $Y$, and $Z$. All error correction methods obtain precise and accurate estimates of $B_Z$ given sufficient validation data.}
\label{fig:sim1a.z}
\end{figure}

\begin{figure}[htbp!]
<<example2.g, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
p <- plot.simulation.iv(plot.df.example.2, iv='z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ in multivariate regression with $X$ measured using an AC and differential error.  Only multiple imputation and our MLA approach obtain consistent estimates of $B_Z$.\label{fig:sim1b.z}}
\end{figure}

\begin{figure}[htbp!]
<<example3.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
#plot.df <-
p <- plot.simulation.dv(plot.df.example.3,'x')
grid.draw(p)
@
\caption{Estimates of $B_Z$ in \emph{simulation 2a}, multivariate regression with $Y$ measured using an AC and misclassifications being uncorrelated with independent variables. Only our MLA approach obtains consistent estimates of $B_Z$.}
\label{fig:sim2a.z}
\end{figure}

\begin{figure}[htbp!]
<<example.4.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
#plot.df <-
p <- plot.simulation.dv(plot.df.example.4,'x')
grid.draw(p)
@
\caption{Estimates of $B_X$ in \emph{simulation 2b}, multivariate regression with $Y$ measured using an AC and misclassifications being correlated with independent variables. Only our MLA approach obtains consistent estimates of $B_Z$. \label{fig:sim2b.z}}
\end{figure}


\section{Robustness Tests}\label{appendix:robustness}

Appendix \ref{appendix:robustness} discusses robustness tests for our simulations. In the following sections, we show what happens when  the error model is misspecified (section \ref{appendix:misspec}), when the accuracy of the classifier varies (section \ref{appendix:accuracy}), when the classified variable is not balanced but skewed  (section \ref{appendix:imbalanced}), and when the degree of systematic misclassification changes (section \ref{appendix:degreebias}).

%\clearpage
\subsection{Robustness Test I: Misspecification of the Error Correction Model}
\label{appendix:misspec}
In \emph{Simulation 1b} and \emph{2b}, the MLA method was able to correct systematic misclassification using the error models in equations \ref{eq:covariate.reg.general} and \ref{eq:depvar.general}.
However, this depends on the error model consistently estimating the conditional probability of automated classifications given the true value and the outcome.
If the misclassifications and the outcome are conditionally dependent on an omitted variable $Z$, this will not be possible.
Here, we demonstrate how misspecification of the error correction model affects results in the context of misclassification in an independent variable (see section \ref{appendix:misspec.iv}) and a dependent variable (see section \ref{appendix:misspec.dv}).

\subsubsection{Systematic Misclassification of an Independent Variable}
\label{appendix:misspec.iv}
Repeating \emph{Simulation 1b}, what happens when the error model is misspecified? Figure \ref{fig:iv.noz} visualizes effects on $B_X$ (upper panel) and $B_Z$ (lower panel). It shows that a misspecified MLA model is unable to fully correct misclassification bias: Although estimates of $B_X$ are close to the true value and estimates of $B_Z$ are better than those of the naïve estimator, $B_Z$ is still clearly biased.


\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<truth.intercept.only.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
source('resources/robustness_check_plots.R')
p <- plot.robustness.1('x')
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to the true value despite the misspecified error correction model.}
\label{fig:iv.noz.x}
\end{subfigure}

\begin{subfigure}{0.95\textwidth}
<<truth.intercept.only.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
source('resources/robustness_check_plots.R')
p <- plot.robustness.1('z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ are biased given a misspecified error correction model.}
\label{fig:iv.noz.z}
\end{subfigure}
\caption{Robustness Test I: Misspecification of the Error Correction Model, Simulation 1b}
\label{fig:iv.noz}
\end{figure}

\subsubsection{Systematic Misclassification of a Dependent Variable}
\label{appendix:misspec.dv}
Next, we repeat \emph{Simulation 2b} with a misspecified error correction model. Figure \ref{fig:dv.noz} shows that a misspecified error model is, again, unable to obtain consistent estimates of $B_Z$.

\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<dv.noz.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
source('resources/robustness_check_plots.R')
p <- plot.robustness.1.dv('x')
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to the true value despite the misspecified error correction model.}
\label{fig:dv.noz.x}
\end{subfigure}

\begin{subfigure}{0.95\textwidth}
<<misspec.y.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
source('resources/robustness_check_plots.R')
p <- plot.robustness.1.dv('z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ are biased given a misspecified error correction model.}
\label{fig:dv.noz.z}
\end{subfigure}
\caption{Robustness Test I: Misspecification of the Error Correction Model, Simulation 2b}
\label{fig:dv.noz}
\end{figure}

\clearpage

\subsection{Robustness Test II: Varying Accuracy of the Automated Classifier}
\label{appendix:accuracy}

<<load.robustness.2, echo=FALSE, message=FALSE, warning=FALSE, result='hide'>>=
source('resources/robustness_check_plots.R')
@

According to our literature review, the accuracy of reported classifiers varies strongly. But how does the performance of the classifier affect error correction methods and remaining bias in inferential modeling? To test this, we repeat \emph{Simulation 1a} (see section \ref{appendix:iv.predacc}) and \emph{Simulation 2a} (see section \ref{appendix:dv.predacc}) to show how varying accuracy of the AC affects estimates of the coefficients $B_X$ and $B_Z$. Here, we let classifier accuracy range
from \Sexpr{format.percent(min(robust_2_min_acc))} to \Sexpr{format.percent(max(robust_2_max_acc))}. We present results for a scenario with 5,000 classifications and 200 manual annotations.

\subsubsection{Varying Accuracy of an AC Predicting an Independent Variable}
\label{appendix:iv.predacc}
In Figure \ref{fig:iv.predacc}, we present results for \emph{Simulation 1a} where the independent variable is created via an AC.
As expected, a more accurate classifier causes less misclassification bias. All the error correction methods also provide more precise estimates when used with a more accurate classifier.

\subsubsection{Varying Accuracy of an AC Predicting a Dependent Variable}
\label{appendix:dv.predacc}
We then repeat these simulations for \emph{Simulation 2a}, where the dependent variable is created via an AC. As Figure \ref{fig:dv.predacc} shows, patterns are similar: error correction methods provide more precise estimates when used with a more accurate classifier.

\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<predacc.x.y, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.2.iv('x')
grid.draw(p)
@
\caption{Estimates of $B_X$ improve with higher accuracy of the AC.}
\end{subfigure}

\begin{subfigure}{0.95\textwidth}
<<predacc.x.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.2.iv('z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ improve with higher accuracy of the AC.}
\end{subfigure}
\caption{Robustness Test II: Varying Accuracy of the Automated Classifier, Simulation 1a}
\label{fig:iv.predacc}
\end{figure}

\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<predacc.x.y.dv, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.2.dv('x')
grid.draw(p)
@
\caption{Estimates of $B_X$ improve with higher accuracy of the AC.}
\end{subfigure}

\begin{subfigure}{0.95\textwidth}
<<predacc.x.z.dv, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.2.dv('z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ improve with higher accuracy of the AC.}
\end{subfigure}
\caption{Robustness Test II: Varying Accuracy of the Automated Classifier, Simulation 2a}
\label{fig:dv.predacc}
\end{figure}


\clearpage

\subsection{Robustness Test III: Misclassification in Imbalanced Variables}
\label{appendix:imbalanced}

For simplicity, our main simulations include balanced classified variables.  However, classifiers are often used to measure imbalanced variables, which can be more difficult to predict.  As a next robustness test, we therefore replicate \emph{Simulation 1a} (see section \ref{appendix:imbalanced.iv}) and \emph{Simulation 2a} (see section \ref{appendix:imbalanced.dv}) to analyze whether the MLA error correction method performs similarly well with imbalanced classified variables. We do so for the scenario with 5,000 classifications and 200 manual annotations.

\subsubsection{Imbalance in Classified Independent Variables}
\label{appendix:imbalanced.iv}

Replicating \emph{Simulation 1a}, Figure \ref{fig:iv.imbalanced} illustrates that our MLA method performs similarly well with imbalance in classified independent variables.
However, the quality of the methods' uncertainty quantification tends to degrade as imbalance increases, as seen by comparing the neighboring black and gray lines when the probability of $X$ is 0.95 in Figure \ref{fig:iv.imbalanced.bx}. This suggests that imbalanced data requires additional validation data for effective misclassification correction. Please note that the PL approach has a very large range of estimates and is thus excluded from Figure \ref{fig:iv.imbalanced} for readability.

\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<iv.imbalance.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=0.5,cache=F>>=
p <- plot.robustness.3.iv('x',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to true values given imbalance in $X$. \label{fig:iv.imbalanced.bx}}
\end{subfigure}

\begin{subfigure}{0.95\textwidth}
<<iv.imbalance.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.3.iv('z',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\caption{Estimates of $B_Z$ are close to true values given imbalance in $X$.}
\end{subfigure}
\caption{Robustness Test III: Misclassification in Imbalanced Variables, Simulation 1a}
\label{fig:iv.imbalanced}
\end{figure}

\subsubsection{Imbalance in Classified Dependent Variables}
\label{appendix:imbalanced.dv}
Replicating \emph{Simulation 2a}, Figure \ref{fig:dv.imbalanced} further illustrates that our MLA method performs similarly well with imbalance in classified dependent variables. The PL approach is, again, removed due to the large range of its estimates.
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<dv.imbalance.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=0.5,cache=F>>=
p <- plot.robustness.3.dv('x',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to true values given imbalance in $Y$.}
\end{subfigure}

\begin{subfigure}{0.95\textwidth}
<<dv.imbalance.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.3.dv('z',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\caption{Estimates of $B_Z$ are close to true values given imbalance in $Y$.}
\end{subfigure}
\caption{Robustness Test III: Misclassification in Imbalanced Variables, Simulation 2a}
\label{fig:dv.imbalanced}
\end{figure}
\clearpage

\subsection{Robustness Test IV: Different Degrees of Systematic Misclassification}
\label{appendix:degreebias}

Lastly, we explore what happens if misclassification is more or less systematic. To do so, we replicate \emph{Simulation 1b} (see section \ref{appendix:degreebias.iv}) and \emph{Simulation 2b} (see section \ref{appendix:degreebias.dv}) with 1000 classifications and 100 manual annotations.  We vary the amount of systematic misclassification in \emph{Simulation 1b} via the logistic regression coefficient of $Y$ on $W$ while keeping the overall classifier accuracy close to 0.73. In \emph{Simulation 2b}, we similarly use a range of values for the coefficient of $Z$ on $W$.
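
As a rough illustration of this design, the following R sketch varies the coefficient that ties classification errors to $Y$ and re-solves for an intercept so that overall accuracy stays near 0.73. It is not the code behind the figures: the function \texttt{simulate\_w}, the data-generating values, and the choice to model the probability of a correct classification as a logistic function of $Y$ are all assumptions made for illustration.

\begin{lstlisting}[language=R, style=mystyle]
# Hypothetical sketch: all names and parameter values are illustrative
# assumptions, not the simulation code used to produce the figures.
set.seed(1)
n <- 1000                 # number of classifications
x <- rbinom(n, 1, 0.5)    # true binary covariate (balance assumed)
y <- 0.3 * x + rnorm(n)   # outcome (coefficient assumed)

simulate_w <- function(x, y, slope, target_accuracy = 0.73) {
  # Solve for the intercept that keeps mean accuracy at the target
  # for the given slope of y.
  mean_gap <- function(intercept) {
    mean(plogis(intercept + slope * y)) - target_accuracy
  }
  intercept <- uniroot(mean_gap, interval = c(-20, 20))$root
  # Draw whether each classification is correct, then flip x where it is not.
  correct <- rbinom(length(x), 1, plogis(intercept + slope * y))
  ifelse(correct == 1, x, 1 - x)
}

# Larger slopes make errors depend more strongly on y.
for (slope in c(0, 0.5, 1, 2)) {
  w <- simulate_w(x, y, slope)
  cat(sprintf("slope = %.1f, accuracy = %.3f\n", slope, mean(w == x)))
}
\end{lstlisting}

Under these assumptions, a slope of zero yields errors that do not depend on $Y$, while larger slopes make errors depend more strongly on $Y$ at roughly the same overall accuracy.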

\subsubsection{Systematic Misclassification in an Independent Variable}
\label{appendix:degreebias.iv}

Replicating \emph{Simulation 1b}, Figure \ref{fig:iv.degreebias} shows that our MLA method performs well even for higher degrees of systematic misclassification in the independent variable.

\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<iv.bias.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=0.5,cache=F>>=
p <- plot.robustness.4.iv('x')
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to true values given different degrees of misclassification in $X$.}
\end{subfigure}

\begin{subfigure}{0.95\textwidth}
<<iv.bias.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.4.iv('z')
grid.draw(p)
@

\caption{Estimates of $B_Z$ are close to true values given different degrees of misclassification in $X$.}
\end{subfigure}
\caption{Robustness Test IV: Different Degrees of Systematic Misclassification, Simulation 1b}
\label{fig:iv.degreebias}
\end{figure}

\subsubsection{Systematic Misclassification in a Dependent Variable}
\label{appendix:degreebias.dv}
Replicating \emph{Simulation 2b}, Figure \ref{fig:dv.degreebias} leads to similar conclusions. With systematic misclassification in the dependent variable, the bias in the naïve estimator switches from negative to positive as systematic misclassification increases.

\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<dv.bias.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=0.5,cache=F>>=
p <- plot.robustness.4.dv('x')
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to true values given different degrees of misclassification in $Y$.}
\end{subfigure}

\begin{subfigure}{0.95\textwidth}
<<dv.bias.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.4.dv('z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ become inconsistent with increasing misclassification in $Y$.}
\end{subfigure}
\caption{Robustness Test IV: Different Degrees of Systematic Misclassification, Simulation 2b}
\label{fig:dv.degreebias}
\end{figure}

\end{document}