diff --git a/article.Rtex b/article.Rtex index aad212b..4e08e3b 100644 --- a/article.Rtex +++ b/article.Rtex @@ -98,7 +98,7 @@ source('resources/real_data_example.R') % I've gotten advice to make this as general as possible to attract the widest possible audience. \title{Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!} -\shorttitle{Automated Content Misclassification} +\shorttitle{Can We Fix It? Yes We Can!} \authorsnames[1,2,3]{Nathan TeBlunthuis, Valerie Hase, Chung-hong Chan} \authorsaffiliations{{{School of Information, University of Michigan},{Department of Communication Studies, Northwestern University}}, {Department of Media and Communication, LMU Munich}, {GESIS - Leibniz-Institut für Sozialwissenschaften}} @@ -111,13 +111,13 @@ Automated Content Analysis; Machine Learning; Classification Error; Attenuation \abstract{ -We show how automated classifiers (ACs), even biased ACs without high accuracy, can be statistically useful in communication research. -These classifiers, often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video, and have become widely popular measurement devices in communication science and related fields. +%We show how automated classifiers (ACs), even biased ACs without high accuracy, can be statistically useful in communication research. +Automated classifiers (ACs), often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video, and have become widely popular measurement devices in communication science and related fields. Despite this popularity, even highly accurate classifiers make errors that cause misclassification bias and misleading results in downstream analyses—unless such analyses account for these errors. 
As we show in a systematic literature review of SML applications, communication scholars largely ignore misclassification bias. In principle, existing statistical methods can use ``gold standard'' validation data, such as that created by human annotators, to correct misclassification bias and produce consistent estimates. -We introduce and test such methods, including a new method we design and implement in the R package \texttt{misclassificationmodels}, via Monte-Carlo simulations designed to reveal each method's limitations, which we also release. Based on our results, we recommend our method as it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods. +We introduce and test such methods, including a new method we design and implement in the R package \texttt{misclassificationmodels}, via Monte Carlo simulations designed to reveal each method's limitations, which we also release. Based on our results, we recommend our new error correction method as it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods. } % fix bug in apa7 package: https://tex.stackexchange.com/questions/645947/adding-appendices-in-toc-using-apa7-package @@ -127,8 +127,8 @@ We introduce and test such methods, including a new method we design and impleme \maketitle %\section{Introduction} -\tableofcontents -\clearpage +%\tableofcontents +%\clearpage \emph{Automated classifiers} (ACs) based on supervised machine learning (SML) have rapidly gained popularity as part of the \emph{automated content analysis} toolkit in communication science \citep{baden_three_2022}. 
With ACs, researchers can categorize large samples of text, images, video or other types of data into predefined categories \citep{scharkow_thematic_2013}. Studies for instance use SML-based classifiers to study frames \citep{burscher_teaching_2014}, tonality \citep{van_atteveldt_validity_2021}, %even ones as seemingly straightforward as sentiment \citep{van_atteveldt_validity_2021}, toxicity \citep{fortuna_toxic_2020} or civility \citep{hede_toxicity_2021} in news media texts or social media posts. @@ -144,7 +144,7 @@ Next, we provide a systematic literature review of \emph{N} = 48 studies employi Although communication scholars have long scrutinized related questions about manual content analysis for which they have recently proposed statistical corrections \citep{bachl_correcting_2017, geis_statistical_2021}, misclassification bias in automated content analysis is largely ignored. Our review demonstrates a troubling lack of attention to the threats ACs introduce and virtually no mitigation of such threats. As a result, in the current state of affairs, researchers are likely to either draw misleading conclusions from inaccurate ACs or avoid ACs in favor of costly methods such as manually coding large samples \citep{van_atteveldt_validity_2021}. -Our primary contribution, an effort rescue ACs from this dismal state, is to \emph{introduce and test methods for correcting misclassification bias} \citep{carroll_measurement_2006, buonaccorsi_measurement_2010, yi_handbook_2021}. We consider three recently proposed methods: \citet{fong_machine_2021}'s generalized method of moments calibration method, \citet{zhang_how_2021}'s pseudo-likelihood models, and \citet{blackwell_unified_2017-1}'s application of imputation methods. To overcome these methods' limitations, we draw a general likelihood modeling framework from the statistical literature on measurement error \citep{carroll_measurement_2006} and tailor it to the problem of misclassification bias. 
Our novel implementation is the experimental R package \texttt{misclassificationmodels}.\footnote{The code for the experimental package can be found here: \url{https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e}.} +Our primary contribution, an effort to rescue ACs from this dismal state, is to \emph{introduce and test methods for correcting misclassification bias} \citep{carroll_measurement_2006, buonaccorsi_measurement_2010, yi_handbook_2021}. We consider three recently proposed methods: \citet{fong_machine_2021}'s generalized method of moments calibration method, \citet{zhang_how_2021}'s pseudo-likelihood models, and \citet{blackwell_unified_2017-1}'s application of imputation methods. To overcome these methods' limitations, we draw a general likelihood modeling framework from the statistical literature on measurement error \citep{carroll_measurement_2006} and tailor it to the problem of misclassification bias. Our novel implementation is the experimental R package \texttt{misclassificationmodels}. We test these four error correction methods and compare them against ignoring misclassification (the naïve approach) and refraining from automated content analysis by only using manual coding (the feasible approach). We use Monte Carlo simulations to model four prototypical situations identified by our review: Using ACs to measure either (1) an independent or (2) a dependent variable where the classifier makes misclassifications that are either (a) easy to correct (when an AC is unbiased and misclassifications are uncorrelated with covariates i.e., \emph{nonsystematic misclassification}) or (b) more difficult (when an AC is biased and misclassifications are correlated with covariates i.e., \emph{systematic misclassification}). %The more difficult cases are important. 
@@ -170,7 +170,7 @@ Our primary contribution, an effort rescue ACs from this dismal state, is to \em % In our discussion section, we provide detailed recommendations based on our literature review and our simulations. According to our simulations, even biased classifiers without high predictive performance can be useful in conjunction with appropriate validation data and error correction methods. As a result, we are optimistic about the potential of ACs and automated content analysis for communication science and related fields—if researchers correct for misclassification. -Current practices of ``validating'' ACs by making misclassification rates transparent via metrics such as the F1 score, however, provide little safegaurd against misclassification bias. +Current practices of ``validating'' ACs by making misclassification rates transparent via metrics such as the F1 score, however, provide little safeguard against misclassification bias. In sum, we make a methodological contribution by introducing the often-ignored problem of misclassification bias in automated content analysis, testing error correction methods to address this problem via Monte Carlo simulations, and introducing a new method for error correction. %The required assumptions for error correction methods are no more difficult than those already commonly adopted in traditional content analyses—and much more reasonable than the current default approach. @@ -229,14 +229,14 @@ As shown in Figure \ref{fig:real.data.example.dv}, using Perspective's classific \section{Why Transparency about Misclassification Is Not Enough} -Although the Perspective API is no doubt accurate enough to be useful to content moderators, the example above demonstrates that this does not imply usefulness for social science \citep{grimmer_machine_2021-1}. 
-Machine learning takes the opposite position on the bias-variance trade-off than conventional statistics does and achieves high predictiveness at the cost of unbiased inference \citep{breiman_statistical_2001}. As a growing body of scholarship critical of the hasty adoption of machine learning in criminal justice, healthcare, or content moderation demonstrates, -ACs boasting high performance often have biases related to social categories \citep{barocas_fairness_2019}. Such biases in machine learning often result from non-representative training data and spurious correlations that neither reflect causal mechanisms nor generalize to different populations \citep{bender_dangers_2021}. +Although the Perspective API is certainly accurate enough to be useful to content moderators, the example above demonstrates that this does not imply usefulness for social science \citep{grimmer_machine_2021-1}. +Machine learning takes the opposite position on the bias-variance trade-off from that of conventional statistics and achieves high predictiveness at the cost of more biased inference \citep{breiman_statistical_2001}. As a growing body of scholarship critical of the hasty adoption of machine learning in criminal justice, healthcare, or content moderation demonstrates, +ACs boasting high performance often have biases related to social categories \citep{barocas_fairness_2019}. Such biases often result from non-representative training data and spurious correlations that neither reflect causal mechanisms nor generalize to different populations \citep{bender_dangers_2021}. Much of this critique targets unjust consequences of these biases to individuals. Our example shows that these biases can also contaminate scientific studies using ACs as measurement devices. Even very accurate ACs can cause both type-I and type-II errors, which become more likely when classifiers are less accurate or more biased, or when effect sizes are small. 
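To make this concrete, here is a minimal simulation (in Python rather than the paper's R tooling; the 90\% accuracy and effect size are illustrative assumptions, not figures from the paper) showing how substituting a fairly accurate classifier's output for the true variable attenuates a regression coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.5, n).astype(float)   # true binary variable
flip = rng.random(n) < 0.10                 # 10% nondifferential error rate
w = np.where(flip, 1 - x, x)                # automated classification, 90% accurate
y = 1.0 * x + rng.normal(0, 1, n)           # true effect B1 = 1

def ols_slope(pred, resp):
    """Slope from a simple OLS regression of resp on an intercept and pred."""
    design = np.column_stack([np.ones_like(pred), pred])
    return np.linalg.lstsq(design, resp, rcond=None)[0][1]

print(ols_slope(x, y))  # close to the true value 1.0
print(ols_slope(w, y))  # attenuated toward zero, about 0.8 here
```

Even at 90\% accuracy the naive estimate is off by roughly a fifth of the true effect, which is why transparency about accuracy alone does not protect downstream inferences.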
-We argue that current common practices to address such limitations are insufficient. These practices assert validity by reporting classifier performance on manually annotated data quantified as metrics including accuracy, precision, recall, or the F1 score \citep{hase_computational_2022, baden_three_2022, song_validations_2020}. +We argue that current common practices to address such limitations are insufficient. These practices assert validity by reporting classifier performance on manually annotated data quantified via metrics like accuracy, precision, recall, or the F1 score \citep{hase_computational_2022, baden_three_2022, song_validations_2020}. These steps promote confidence in results by making misclassification transparent, but our example indicates bias can flow downstream into statistical inferences, despite high predictiveness. -Instead of relying on transparency rituals to ward off misclassification bias, researchers can and should use validation data to understand and correct it. +Instead of relying on transparency rituals to ward off misclassification bias, researchers can and should use validation data to not only report but also correct it. % \citep{obermeyer_dissecting_2019, kleinberg_algorithmic_2018, bender_dangers_2021, wallach_big_2019, noble_algorithms_2018}. %For example, \citet{hede_toxicity_2021} show that, when applied to news datasets, the Perspecitve API overestimates incivility related to topics such as racial identity, violence, and sex. @@ -263,7 +263,7 @@ Misclassifications from such classifiers can be systematic because they have cau If ACs become standard measurement devices, for instance %the LIWC dictionary to measure sentiment \citep{boukes_whats_2020}, %\citep{dobbrick_enhancing_2021} -Google's Perspective API for measuring toxicity \citep[see critically][]{hosseini_deceiving_2017} or Botometer for classifying social media bots \citep[see critically][]{rauchfleisch_false_2020}, entire literatures may have systematic biases. 
+Google's Perspective API for measuring toxicity \citep[see critically][]{hosseini_deceiving_2017} or Botometer for classifying social media bots \citep[see critically][]{rauchfleisch_false_2020}, entire research areas may be subject to systematic biases. Even if misclassification bias is usually conservative, it can slow progress in a research area. Consider how \citet{scharkow_how_2017} argue that media's ``minimal effects'' on political opinions and behavior in linkage studies may be an artifact of measurement errors both in manual content analyses and self-reported media use in surveys. Conversely, if researchers selectively report statistically significant hypothesis tests, misclassification can introduce an upward bias in the magnitude of reported effect sizes and contribute to a replication crisis \citep{loken_measurement_2017}. @@ -303,7 +303,7 @@ Even if misclassification bias is usually conservative, it can slow progress in To understand how social scientists, including communication scholars, engage with the problem of misclassification in automated content analysis, %SML classifiers enable researchers to inexpensively measure categorical variables in large data sets. This promises to be useful for study designs requiring large samples such as to infer effect sizes smaller than would be possible using smaller samples humans could feasibly classify. %But are scholars aware that misclassification by ACs poses threats to the validity of downstream analyses? Although such issues in the context of manual content analysis have attracted much debate \citep{bachl_correcting_2017}, this is less true for misclassification by newly popular automatic classifiers. 
-we conducted a systematic literature review of studies using supervised machine learning (SML) for text classification (see Appendix \ref{appendix:lit.review} in our Supplement for details).\footnote{Automated content analysis includes a range of methods both for assigning content to predefined categories (e.g., dictionaries) and for assigning content to unknown categories (e.g., topic modeling) \citep{grimmer_text_2013, oehmer-pedrazzi_automated_2023}. While we focus on SML, our arguments extend to other approaches such as dictionary-based classification and even beyond the specific context of text classification.} +we conducted a systematic literature review of studies using supervised machine learning (SML) for text classification (see Appendix \ref{appendix:lit.review} in our Supplement for details).\footnote{Automated content analysis includes a range of methods both for assigning content to predefined categories (e.g., dictionaries) and for assigning content to unknown categories (e.g., topic modeling) \citep{grimmer_text_2013}. While we focus on SML, our arguments extend to other approaches such as dictionary-based classification and even beyond the specific context of text classification.} Our sample consists of studies identified by similar reviews on automated content analysis \citep{baden_three_2022, hase_computational_2022, junger_unboxing_2022, song_validations_2020}. Our goal is not to comprehensively review all SML studies %\footnote{In fact, our review likely underestimates the use of the method, as we focused on text-based SML methods in the social science domain employed for empirical analyses.} but to provide a picture of common practices, with an eye toward awareness of misclassification and its statistical implications. @@ -329,7 +329,7 @@ In contrast, an AC can make classifications $W$ for the entire dataset but intro \emph{Multiple imputation} (MI) treats misclassification as a missing data problem. 
It understands the true value of $X$ to be observed in manually annotated data $X^*$ and missing otherwise \citep{blackwell_unified_2017-1}. %For example, the regression calibration step in \citet{fong_machine_2021}'s GMM method uses least squares regression to impute unobserved values of the covariate $X$. Indeed, \citet{carroll_measurement_2006} describe regression calibration when validation data are available as ``simply a poor person's imputation methodology'' (pp. 70). -Like regression calibration, multiple imputation uses a model to infer likely values of possibly misclassified variables. The difference is that multiple imputation samples several (hence \emph{multiple} imputation) entire datasets filling in the missing data from the predictive probability distribution of $X$ conditional on other variables $\{W,Y,Z\}$, then runs a statistical analysis on each of these sampled datasets and pools the results of each of these analyses \citep{blackwell_unified_2017-1}. Note that $Y$ is included among the imputing variables, giving the MI approach the potential to address \emph{differential error,} when systematic misclassification makes automatic classifications conditionally dependent on the outcome given the other independent variables. +Like regression calibration, multiple imputation uses a model to infer likely values of possibly misclassified variables. The difference is that multiple imputation samples several (hence \emph{multiple} imputation) entire datasets filling in the missing data from the predictive probability distribution of $X$ conditional on other variables $\{W,Y,Z\}$, then runs a statistical analysis on each of these sampled datasets and pools the results of each of these analyses \citep{blackwell_unified_2017-1}. 
Note that $Y$ is included among the imputing variables, giving the MI approach the potential to address \emph{differential error,} when systematic misclassification makes automatic classifications conditionally dependent on the outcome given other independent variables. \citet{blackwell_unified_2017-1} claim that the MI method is relatively robust when it comes to small violations of the assumption of nondifferential error. Moreover, in theory, the MI approach can be used for correcting misclassifications both in independent and dependent variables. \emph{``Pseudo-likelihood''} methods (PL)—even if not always explicitly labeled this way—are another approach for correcting misclassification bias. \citet{zhang_how_2021} proposes a method that approximates the error model using quantities from the AC's confusion matrix—the positive and negative predictive values in the case of a mismeasured independent variable and the AC's false positive and false negative rates in the case of a mismeasured dependent variable. Because quantities from the confusion matrix are neither data nor model parameters, \citet{zhang_how_2021}'s method is technically a ``pseudo-likelihood'' method. A clear benefit is that this method only requires summary quantities derived from manually annotated data, for instance via a confusion matrix. %We will discuss likelihood methods in greater depth in the presentation of our MLA framework below. @@ -337,7 +337,7 @@ Like regression calibration, multiple imputation uses a model to infer likely va \subsection{Proposing Maximum Likelihood Adjustment for Misclassification} % This section basically translates Carroll et al. for a technically advanced 1st year graduate student. -We now elaborate on \emph{Maximum Likelihood Adjustement} (MLA), a new method we propose for correcting misclassification bias. 
Our method tailors \citet{carroll_measurement_2006}'s presentation of the general statistical theory of likelihood modeling for measurement error correction to context of automated content analysis.\footnote{In particular see Chapter 8 (especially example 8.4) and Chapter 15. (especially 15.4.2).} The MLA approach deals with misclassification bias by maximizing a likelihood that correctly specifies an \emph{error model} of the probability of the automated classifications conditional on the true value and the outcome \citep{carroll_measurement_2006}. +We now elaborate on \emph{Maximum Likelihood Adjustment} (MLA), a new method we propose for correcting misclassification bias. Our method tailors \citet{carroll_measurement_2006}'s presentation of the general statistical theory of likelihood modeling for measurement error correction to the context of automated content analysis.\footnote{In particular, see Chapter 8 (especially example 8.4) and Chapter 15 (especially 15.4.2).} The MLA approach deals with misclassification bias by maximizing a likelihood that correctly specifies an \emph{error model} of the probability of the automated classifications conditional on the true value and the outcome \citep{carroll_measurement_2006}. 
@@ -357,11 +357,11 @@ Fourth, and most important, MLA can be effective when misclassification is syste \subsubsection{When an Automated Classifier Predicts an Independent Variable} In general, if we want to estimate a model $P(Y|\Theta_Y, X, Z)$ for $Y$ given $X$ and $Z$ with parameters $\Theta_Y$, we can use AC classifications $W$ predicting $X$ to gain statistical power without introducing misclassification bias by maximizing ($\mathcal{L}(\Theta|Y,W)$), the likelihood of the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\}$ in a joint model of $Y$ and $W$ \citep{carroll_measurement_2006}. -The joint probability of $Y$ and $W$, can be factored into the product of three terms: $P(Y|X,Z,\Theta_Y)$, the model with parameters $\Theta_Y$ we want to estimate, $P(W|X,Y, \Theta_W)$, a model for $W$ having parameters $\Theta_W$, and $P(X|Z, \Theta_X)$, a model for $X$ having parameters $\Theta_X$. +The joint probability of $Y$ and $W$ can be factored into the product of three terms: $P(Y|X,Z,\Theta_Y)$, the model with parameters $\Theta_Y$ we want to estimate, $P(W|X,Y, \Theta_W)$, a model for $W$ having parameters $\Theta_W$, and $P(X|Z, \Theta_X)$, a model for $X$ having parameters $\Theta_X$. Calculating these three conditional probabilities is sufficient to calculate the joint probability of the dependent variable and automated classifications and thereby obtain a consistent estimate despite misclassification. $P(W|X,Y, \Theta_W)$ is called the \emph{error model} and $P(X|Z, \Theta_X)$ is called the \emph{exposure model} \citep{carroll_measurement_2006}. -To illustrate, the regression model $Y=B_0 + B_1 X + B_2 Z + \varepsilon$, predicts the discrete independent variable $X$. -We can assume that the probability of $W$ follows a logistic regression model of $Y$, $X$ and $Z$ and that the probability of $X$ follows a logistic regression model of $Z$. 
In this case, the likelihood model below is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\} = \{\{B_0, B_1, B_2\}, \{\alpha_0, \alpha_1, \alpha_2\}, \{\gamma_0, \gamma_1\}\}$. +To illustrate, consider the regression model $Y=B_0 + B_1 X + B_2 Z + \varepsilon$, where the discrete independent variable $X$ is predicted by the AC. +We can assume that the probability of $W$ follows a logistic regression model of $Y$, $X$, and $Z$ and that the probability of $X$ follows a logistic regression model of $Z$. In this case, the likelihood model below is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\} = \{\{B_0, B_1, B_2\}, \{\alpha_0, \alpha_1, \alpha_2\}, \{\gamma_0, \gamma_1\}\}$. \begin{align} \mathcal{L}(\Theta | Y, W) &= \prod_{i=0}^{N}\sum_{x} {P(Y_i| X_i, Z_i, \Theta_Y)P(W_i|X_i, Y_i, Z_i, \Theta_W)P(X_i|Z_i, \Theta_X)} \label{eq:covariate.reg.general}\\ @@ -373,7 +373,7 @@ We can assume that the probability of $W$ follows a logistic regression model of \noindent where $\phi$ is the normal probability density function. Note that Equation \ref{eq:covariate.reg.general} models differential error (i.e., $Y$ is not independent of $W$ conditional on $X$ and $Z$) via a linear relationship between $W$ and $Y$. When error is nondifferential, the dependence between $W$ and $Y$ can be removed from Equations \ref{eq:covariate.reg.general} and \ref{eq:covariate.logisticreg.w}. -Estimating the three conditional probabilities in practice requires specifying models on which validity of the method depends. +Estimating the three conditional probabilities in practice requires specifying models on which the validity of the method depends. This framework is very general and a wide range of probability models, such as generalized additive models (GAMs) or Gaussian process classification, may be used to estimate $P(W| X, Y, Z, \Theta_W)$ and $P(X|Z,\Theta_X)$ \citep{williams_bayesian_1998}. 
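The ``integrating out'' step in this likelihood can be sketched numerically. The following is a minimal, self-contained illustration (Python with NumPy/SciPy, not the paper's \texttt{misclassificationmodels} package; the data-generating values, the nondifferential error model with a single accuracy parameter, and the optimizer choice are all simplifying assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 4000
z = rng.normal(0, 1, n)
x = rng.binomial(1, expit(-0.5 + z))                # exposure model P(X|Z)
y = 0.2 + 1.0 * x - 0.5 * z + rng.normal(0, 1, n)   # outcome model, true B1 = 1
w = np.where(rng.random(n) < 0.85, x, 1 - x)        # 85%-accurate nondifferential AC

def nll(par):
    """Negative log-likelihood with the unobserved X integrated out."""
    b0, b1, b2, log_sig, g0, g1, logit_acc = par
    sig, acc = np.exp(log_sig), expit(logit_acc)
    lik = np.zeros(n)
    for xv in (0, 1):                               # sum over possible values of X
        f_y = norm.pdf(y, b0 + b1 * xv + b2 * z, sig)   # outcome model
        f_w = np.where(w == xv, acc, 1 - acc)           # error model
        p1 = expit(g0 + g1 * z)                         # exposure model
        f_x = p1 if xv == 1 else 1 - p1
        lik += f_y * f_w * f_x
    return -np.log(lik).sum()

start = np.array([0, 0, 0, 0, 0, 0, 1.5])  # start accuracy above 0.5 to fix labels
fit = minimize(nll, start, method="BFGS")
b1_hat = fit.x[1]
print(b1_hat)  # recovers B1 near 1 despite 15% misclassification
```

Despite never observing $X$ directly, maximizing this joint likelihood recovers the coefficient that the naive regression of $Y$ on $W$ would attenuate.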
\subsubsection{When an Automated Classifier Predicts a Dependent Variable} @@ -381,7 +381,6 @@ This framework is very general and a wide range of probability models, such as g We now turn to the case when an AC makes classifications $W$ that predict a discrete dependent variable $Y$. In our second real-data example, $W$ is the Perspective API's toxicity classifications and $Y$ is the true value of toxicity. This case is simpler than the case above where an AC is used to measure an independent variable $X$ because there is no need to specify a model for the probability of $X$. - If we assume that the probability of $Y$ follows a logistic regression model of $X$ and $Z$ and allow $W$ to be biased and to directly depend on $X$ and $Z$, then maximizing the following likelihood is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W\} = \{\{B_0, B_1, B_2\},\{\alpha_0, \alpha_1, \alpha_2, \alpha_3\}\}$. \begin{align} @@ -392,19 +391,18 @@ If we assume that the probability of $Y$ follows a logistic regression model of If the AC's errors are conditionally independent of $X$ and $Z$ given $W$, the dependence of $W$ on $X$ and $Z$ can be omitted from equations \ref{eq:depvar.general} and \ref{eq:depvar.w}. -Additional details on the likelihood modeling approach available in Appendix \ref{appendix:derivation} of the Supplement. +Additional details on the likelihood modeling approach are available in Appendix \ref{appendix:derivation} of the Supplement. -\section{Evaluating Misclassification Models: Monte-Carlo Simulations} +\section{Evaluating Misclassification Models: Monte Carlo Simulations} % \TODO{Create a table summarizing the simulations and the parameters.} We now present four Monte Carlo simulations (\emph{Simulations 1a}, \emph{1b}, \emph{2a}, and \emph{2b}) with which we evaluate existing methods (GMM, MI, PL) and our approach (MLA) for correcting misclassification bias. 
Monte Carlo simulations are a tool for evaluating statistical methods, including (automated) content analysis \citep[e.g.,][]{song_validations_2020,bachl_correcting_2017,geis_statistical_2021, fong_machine_2021,zhang_how_2021}. -They are defined by a data generating process from which datasets are repeatedly sampled. Repeating an analyses for each of these datasets provides an empirical distribution of results the analysis would obtain over study replications. Monte-carlo simulation affords exploration of finite-sample performance, robustness to assumption violations, comparison across several methods, and ease of interpretability \citep{mooney_monte_1997}. +They are defined by a data generating process from which datasets are repeatedly sampled. Repeating an analysis for each of these datasets provides an empirical distribution of results the analysis would obtain over study replications. Monte Carlo simulation affords exploration of finite-sample performance, robustness to assumption violations, comparison across several methods, and ease of interpretability \citep{mooney_monte_1997}. Such simulations allow exploration of how results depend on assumptions about the data-generating process and analytical choices and are thus an important tool for designing studies that account for misclassification. -Code for reproducing our simulations is available here: \url{https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e}.} @@ -412,11 +410,11 @@ Code for reproducing our simulations is available here: \url{https://osf.io/pyqf In our simulations, we tested four error correction methods: \emph{GMM calibration} (GMM) \citep{fong_machine_2021}, \emph{multiple imputation} (MI) \citep{blackwell_unified_2017-1}, \emph{Zhang's pseudo-likelihood model} (PL) \citep{zhang_how_2021}, and our \emph{maximum likelihood adjustment} approach (MLA). 
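The simulation logic can be illustrated with a toy replication loop (Python, with illustrative sample size, accuracy, and effect size; this is not the paper's actual simulation code) that estimates the naive estimator's consistency and confidence-interval coverage empirically:

```python
import numpy as np

rng = np.random.default_rng(2)

def one_replication(n=500, b1=1.0, acc=0.9):
    """Draw one dataset and fit the naive estimator (regress Y on W)."""
    x = rng.binomial(1, 0.5, n).astype(float)
    w = np.where(rng.random(n) < acc, x, 1 - x)   # misclassified version of X
    y = b1 * x + rng.normal(0, 1, n)
    design = np.column_stack([np.ones(n), w])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    se = np.sqrt(resid @ resid / (n - 2) / ((w - w.mean()) ** 2).sum())
    covered = abs(beta[1] - b1) < 1.96 * se       # does the 95% CI cover b1?
    return beta[1], covered

draws = [one_replication() for _ in range(1000)]
estimates = np.array([d[0] for d in draws])
coverage = np.mean([d[1] for d in draws])
print(estimates.mean())  # consistency: centered well below the true 1.0
print(coverage)          # uncertainty quantification: far below the nominal 95%
```

Repeating the analysis over many sampled datasets yields exactly the quantities the simulations report: the empirical bias of each estimator and how often its nominal confidence interval actually covers the truth.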
We use the \texttt{predictionError} R package \citep{fong_machine_2021} for the GMM method, the \texttt{Amelia} R package for the MI approach, and our own implementation of \citet{zhang_how_2021}'s PL approach in R. We develop our MLA approach in the R package \texttt{misclassificationmodels}. -For PL and MLA, we quantify uncertainty using the fisher information quadratic approximation. +For PL and MLA, we quantify uncertainty using the Fisher information quadratic approximation.\footnote{The code for reproducing our simulations and our experimental R package is available here: \url{https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e}.} In addition, we compare these error correction methods to two common approaches in communication science: the \emph{feasible} estimator (i.e., conventional content analysis that uses only manually annotated data and not ACs) %and illustrates the motivation for using an AC in these scenarios—validation alone provide insufficient statistical power for a sufficiently precise hypothesis test. -and the \emph{naïve} estimator (i.e., using AC-based classifications $W$ as stand-ins for $X$, thereby ignoring misclassifications). +and the \emph{naïve} estimator (i.e., using AC-based classifications $W$ as stand-ins for $X$, thereby ignoring misclassification). According to our systematic review, the \emph{naïve} approach reflects standard practice in studies employing SML for text classification. We evaluate each of the six analytical approaches in terms of \emph{consistency} (whether the estimates of parameters $\hat{B_X}$ and $\hat{B_Z}$ have expected values nearly equal to the true values $B_X$ and $B_Z$), \emph{efficiency} (how precisely the parameters are estimated and how precision improves with additional data), and \emph{uncertainty quantification} (how well the 95\% confidence intervals approximate the range including 95\% of parameter estimates across simulations). @@ -431,7 +429,7 @@ observations). 
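The Fisher information quadratic approximation amounts to reading standard errors off the curvature of the log-likelihood at its maximum. A minimal sketch for a Bernoulli proportion (Python, purely illustrative; the packages above handle the full regression case):

```python
import numpy as np

rng = np.random.default_rng(3)
obs = rng.binomial(1, 0.3, 2000)       # binary observations
k, n = obs.sum(), obs.size
p_hat = k / n                          # maximum likelihood estimate

# Observed Fisher information: second derivative of the negative
# log-likelihood -(k*log(p) + (n-k)*log(1-p)) evaluated at the MLE.
info = k / p_hat**2 + (n - k) / (1 - p_hat) ** 2
se = 1 / np.sqrt(info)                 # SE from the quadratic approximation
print(p_hat, se)                       # here se equals sqrt(p_hat*(1-p_hat)/n)
```

Sharper curvature (more information) yields smaller standard errors; the same inverse-Hessian logic applies to every parameter of the PL and MLA likelihoods.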
Since our review indicated that ACs are most often used to create %These simulations are designed to verify that error correction methods from prior work are effective in ideal scenarios and to create the simplest possible cases where these methods are inconsistent. Showing how prior methods fail is instructive for understanding how our MLA approach does better both in these artificial simulations and in practical projects. -\subsection{Four Prototypical Scenarios for our Monte Carlo Simulations} +\subsection{Four Prototypical Scenarios for Our Monte Carlo Simulations} We simulate regression models with two independent variables ($X$ and $Z$). This sufficiently constrains our study's scope but the scenario is general enough to be applied in a wide range of research studies. %Simulating studies with two covariates lets us study how measurement error in one covariate can cause bias in coefficient estimates of other covariates. @@ -450,10 +448,9 @@ We first consider studies with the goal of testing hypotheses about the coeffici Y=B_0 + B_1 X + B_2 Z + \varepsilon \label{mod:true.ols} \end{equation} -In our first real-data example, $Y$ was a discrete variable---whether a comment self-disclosed a racial or ethnic identity, $X$ was if a comment was toxic, and $Z$ was the number of likes. +%In our first real-data example, $Y$ was a discrete variable---whether a comment self-disclosed a racial or ethnic identity, $X$ was if a comment was toxic, and $Z$ was the number of likes. In this simulated example, $Y$ is continuous variable, $X$ is a binary variable measured with an AC, and $Z$ is a normally distributed variable with mean 0 and standard deviation \Sexpr{sim1.z.sd} measured without error. %The simulated example could represent a study of $Y$, the time until an social media account is banned, $X$ if the account posted a comment including toxicity, and $Z$ the account's reputation score. 
$X$ and $Z$ are negatively correlated because high-reputation accounts may be less likely to post comments including toxicity.
-
%$Z$ can indicate if the message is in German or English, the two possible languages in the hypothetical study.
%Say that human content coders can observe $X$ perfectly, but each observation is so expensive that observing $X$ for a large sample is infeasible.
%Instead, the human coders can measure $X$ without error for a subsample of size $m << N$.
@@ -475,9 +472,8 @@ We simulate nondifferential misclassification because $W=X+\xi$, $\xi$ is normal
% \emph{Simulation 2} extends \emph{Simulation 1} by making the automated classifier's classification errors $\xi$ correlated with $Y$ even after accounting for $Z$ and $X$.
-In our first real-data example, the Perspective API predicted comment toxicity, which was an independent variable of a regression model in which racial/ethnic identity disclosure was the dependent variable. The API disproportionately misclassified as toxic comments disclosing such identities which toxic which resulted in differential misclassification.
-
-In \emph{Simulation 1b} (Figure \ref{fig:simulation.1b}), we test how error correction methods can handle differential error by making AC predictions similarly depend on the dependent variable $Y$.
+In our real-data example, the Perspective API disproportionately misclassified comments as toxic when they disclosed aspects of personal identity, which resulted in differential misclassification.
+In \emph{Simulation 1b} (Figure \ref{fig:simulation.1b}), we test how error correction methods can handle such differential error by making AC predictions similarly depend on the dependent variable $Y$.
This simulated AC has $\Sexpr{format.percent(sim1b.acc)}$ accuracy and makes predictions $W$ that are negatively correlated with the residuals of the linear regression of $X$ and $Z$ on $Y$ (Pearson's $\rho=\Sexpr{round(sim1b.cor.resid.w_pred,2)}$).
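The distinction between nondifferential and differential misclassification can be illustrated with a small data-generating sketch in Python (rather than the paper's R); the flip rate and the logistic form of the false-negative probability are our invented assumptions, not the paper's simulation parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.binomial(1, 0.5, n)
z = rng.normal(0.0, 1.0, n)
y = 0.5 * x + 0.5 * z + rng.normal(0.0, 1.0, n)

# Nondifferential error: the flip probability ignores Y entirely.
w_nd = np.where(rng.random(n) < 0.1, 1 - x, x)

# Differential error: the false-negative probability rises with Y, so the
# classification errors carry information about the outcome.
p_fn = 1.0 / (1.0 + np.exp(-(y - 1.0)))
w_d = np.where((x == 1) & (rng.random(n) < p_fn), 0, x)

resid = y - 0.5 * x - 0.5 * z  # residuals of the true model
print(np.corrcoef(resid, w_nd - x)[0, 1])  # near zero
print(np.corrcoef(resid, w_d - x)[0, 1])   # clearly negative
```

In the nondifferential case the errors $W-X$ are uncorrelated with the outcome residuals; in the differential case higher values of $Y$ produce more false negatives, yielding exactly the kind of negative error--residual correlation that \emph{Simulation 1b} builds in.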
As a result, this AC makes fewer false positives and more false negatives at greater levels of $Y$.
%Although the false-negative rate of the AC is \Sexpr{format.percent(sim1b.fnr)} overall, when $Y<=0$ the false-negative rate is only \Sexpr{format.percent(sim1b.fnr.y0)}, but when $Y>=0$ it rises to \Sexpr{format.percent(sim1b.fnr.y1)}.
@@ -490,7 +486,7 @@ This simulated AC has $\Sexpr{format.percent(sim1b.acc)}$ accuracy and makes pre
\subsubsection{Measurement Error in a Dependent Variable (\textit{Simulation 2a} and \textit{2b})}
-We then simulate using an AC to measure the dependent variable $Y$, a binary independent variable $X$, and a continuous independent variable $Z$. The goal is to estimate $B_1$ and $B_2$ in the following logistic regression model:
+We then simulate using an AC to measure the dependent variable $Y$, which we aim to explain given a binary independent variable $X$ and a continuous independent variable $Z$. The goal is to estimate $B_1$ and $B_2$ in the following logistic regression model:
\begin{equation}
P(Y) = \frac{1}{1 + e^{-(B_0 + B_1 X + B_2 Z)}}
@@ -499,7 +495,7 @@ We then simulate using an AC to measure the dependent variable $Y$, a binary in
%As was true for $X$ in \emph{Simulation 1}, human coders can observe $Y$ but doing so may be costly. We may thus instead use an AC that makes predictions $W = Y + \xi$.
-\noindent In our second real-data example, $Y$ is if a comment contains toxicity, $X$ is if the comment discloses racial or ethnic identity, and $Z$ is the number of times the comment was ``liked''.
+%\noindent In our second real-data example, $Y$ is if a comment contains toxicity, $X$ is if the comment discloses racial or ethnic identity, and $Z$ is the number of times the comment was ``liked''.
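To see why a misclassified dependent variable biases the logistic model above, consider the following Python sketch (our own illustration, not the paper's code; the coefficient values, the 10\% flip rate, and the simple Newton--Raphson fitter are all invented for this example):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
B0, B1, B2 = 0.0, 0.5, 0.5  # assumed true coefficients

x = rng.binomial(1, 0.5, n)
z = rng.normal(0.0, 1.0, n)
p = 1.0 / (1.0 + np.exp(-(B0 + B1 * x + B2 * z)))
y = rng.binomial(1, p)                       # true labels
w = np.where(rng.random(n) < 0.1, 1 - y, y)  # AC labels with 10% flips

def logit_fit(covs, labels, iters=25):
    """Logistic regression via Newton-Raphson; adds an intercept column."""
    design = np.column_stack([np.ones(len(labels)), *covs])
    b = np.zeros(design.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(design @ b)))
        wgt = mu * (1.0 - mu)
        b += np.linalg.solve(design.T @ (design * wgt[:, None]),
                             design.T @ (labels - mu))
    return b

true_fit = logit_fit([x, z], y)    # fit on the true labels
naive_fit = logit_fit([x, z], w)   # fit on the AC labels

print(true_fit[1], naive_fit[1])   # misclassified Y attenuates the slope
```

Even though this hypothetical classifier is 90\% accurate, fitting the logistic model to $W$ instead of $Y$ pulls both slope estimates toward zero.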
In \emph{Simulation 2a} (see Figure \ref{fig:simulation.2a}) and \emph{Simulation 2b} (see Figure \ref{fig:simulation.2b}) $X$ and $Z$ are, again, balanced ($P(X)=P(Z)=0.5$) and correlated (Pearson's $\rho=\Sexpr{round(sim2a.cor.xz,2)}$).
@@ -680,7 +676,7 @@ As we showed in an example with data from the Perspective API, widely used and v
As evidenced by our literature review, this problem has not attracted enough attention within communication science \citep[but see][]{bachl_correcting_2017} or even in the broader computational social science community.
Therefore, although current best practices of reporting metrics of classifier performance on manually annotated validation data are important, they provide little protection from misclassification bias.
These practices use annotations to enact a transparency ritual to ward against misclassification bias, but annotations can do much more. With the right statistical model, they can correct misclassification bias.
-We introduce maximum likelihood adjustment, a new method we designed to correct misclassification bias and use monte-carlo simulations to
+We introduce maximum likelihood adjustment, a new method we designed to correct misclassification bias, and use Monte Carlo simulations to
evaluate it and compare it to other recently proposed methods. Our method is the only one that is effective across a wide range of scenarios. It is also straightforward to use. Our implementation in the R package \texttt{misclassificationmodels} provides a familiar formula interface for regression models.
@@ -710,26 +706,26 @@ Our example relies on the publicly available Civil Comments dataset \citep{cjada
Each comment was labeled by up to ten manual annotators (although selected comments were labeled by even more annotators).
Originally, the dataset represents \emph{toxicity} and \emph{disclosure} as proportions of annotators who labeled a comment as toxic or as disclosing aspects of personal identity including race and ethnicity.
For our analysis, we converted these proportions into indicators of the majority view to transform both variables to a binary scale.
-\begin{figure}[htbp!]
-\centering
-\begin{subfigure}{\linewidth}
-<>=
-p <- plot.civilcomments.iv.example(include.models=c("Automatic Classification", "All Annotations", "Annotation Sample", "Error Correction"))
-print(p)
-@
-\subcaption{\emph{Example 1}: Misclassification in an independent variable.\label{fig:real.data.example.iv.app}}
-\end{subfigure}
+%\begin{figure}[htbp!]
+%\centering
+%\begin{subfigure}{\linewidth}
+%<>=
+%p <- plot.civilcomments.iv.example(include.models=c("Automatic %Classification", "All Annotations", "Annotation Sample", "Error %Correction"))
+%print(p)
+%@
+%\subcaption{\emph{Example 1}: Misclassification in an independent %variable.\label{fig:real.data.example.iv.app}}
+%\end{subfigure}
-\begin{subfigure}{\linewidth}
-<>=
-p <- plot.civilcomments.dv.example(include.models=c("Automatic Classification", "All Annotations", "Annotation Sample", "Error Correction"))
-print(p)
-@
-\subcaption{\emph{Example 2}: Misclassification in a dependent variable. \label{fig:real.data.example.dv.app}}
+%\begin{subfigure}{\linewidth}
+%<>=
+%p <- plot.civilcomments.dv.example(include.models=c("Automatic %Classification", "All Annotations", "Annotation Sample", "Error %Correction"))
+%print(p)
+%@
+%\subcaption{\emph{Example 2}: Misclassification in a dependent variable. %\label{fig:real.data.example.dv.app}}
-\end{subfigure}
-\caption{Real-data example including correction using MLA.}
-\end{figure}
+%\end{subfigure}
+%\caption{Real-data example including correction using MLA.}
+%\end{figure}
% Our maximum-likelihood based error correction technique in this example requires specifying models for the Perspective's scores and, in the case where these scores are used as a covariate, a model for the human annotations.
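A minimal sketch of this binarization step may be useful. The proportions below are invented for illustration, and because the paper does not specify how exact 50/50 ties are handled, the strict-majority rule here is our assumption:

```python
import numpy as np

# Hypothetical Civil-Comments-style rows: the proportion of annotators who
# labeled each comment toxic / identity-disclosing (values invented).
toxicity_prop = np.array([0.0, 0.2, 0.6, 1.0])
disclosure_prop = np.array([0.1, 0.5, 0.7, 0.0])

# Majority view: a comment counts as toxic/disclosing if more than half of
# its annotators said so (the strict > 0.5 tie rule is our assumption).
toxic = (toxicity_prop > 0.5).astype(int)
disclosing = (disclosure_prop > 0.5).astype(int)

print(toxic.tolist())       # [0, 0, 1, 1]
print(disclosing.tolist())  # [0, 0, 1, 0]
```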
In our first example, where toxicity was used as a covariate, we used the \emph{human annotations}, \emph{identity disclosure}, and the interaction of these two variables in the model for scores. We omitted \emph{likes} from this model because they are virtually uncorrelated with misclassifications (Pearson's $\rho=\Sexpr{iv.example[['civil_comments_cortab']]['toxicity_error','likes']}$). Our model for the human annotations is an intercept-only model.
@@ -800,7 +796,7 @@ Statisticians have introduced a range of other error correction methods which we d
\emph{Score function methods} derive estimating equations for models without measurement error and then solve them either exactly or using numerical integration \citep{carroll_measurement_2006, yi_handbook_2021}.
The main advantage score function methods may have over likelihood-based methods is that they do not require distributional assumptions about mismeasured independent variables. This advantage has limited use in the context of ACs because binary classifications must follow Bernoulli distributions.
-We also do not consider \emph{Bayesian methods} (aside from the Amelia implementation of the MI approach) because we expect these to have similar limitations to the maximum likelihood methods we consider. Bayesian methods may have other advantages resulting from posterior inference and may generalize to a wide range of applications. However, specifying prior distributions introduces additional methodological complexity and posterior inference is computationally intensive, making Bayesian methods less convenient for Monte-Carlo simulation.
+We also do not consider \emph{Bayesian methods} (aside from the Amelia implementation of the MI approach) because we expect these to have similar limitations to the maximum likelihood methods we consider. Bayesian methods may have other advantages resulting from posterior inference and may generalize to a wide range of applications.
However, specifying prior distributions introduces additional methodological complexity, and posterior inference is computationally intensive, making Bayesian methods less convenient for Monte Carlo simulations.

\section{Deriving the Maximum Likelihood Approach}