Merge branch 'master' of https://git.overleaf.com/62a956eb9b9254783cc84c82 into osf
However, there is increasing concern about the validity of automated content analysis.
%Knowing that high classification accuracy limits the risks of misleading inference, careful researchers might use only ACs with excellent predictive performance.

Our study begins with a demonstration of misclassification bias in a real-world example based on the Perspective toxicity classifier.
Next, we provide a systematic literature review of $N = 48$ studies employing SML-based text classification.
Although communication scholars have long scrutinized related questions about manual content analysis, for which they have recently proposed statistical corrections \citep{bachl_correcting_2017, geis_statistical_2021}, misclassification bias in automated content analysis remains largely ignored.
Our review demonstrates a troubling lack of attention to the threats ACs introduce and virtually no mitigation of such threats. As a result, in the current state of affairs, researchers are likely to either draw misleading conclusions from inaccurate ACs or avoid ACs in favor of costly methods such as manually coding large samples \citep{van_atteveldt_validity_2021}.

\section{Why Misclassification Is a Problem: An Example Based on the Perspective API}

There is no perfect AC. All ACs make errors.
This inevitable misclassification causes bias in statistical inference \citep{carroll_measurement_2006, scharkow_how_2017}, leading researchers to make both type-I errors (false discovery) and type-II errors (failure to reject the null) in hypothesis tests. To illustrate the problematic consequences of this misclassification bias, we focus on real-world data and a specific research area in communication research: detecting and understanding harmful social media content. Communication researchers often employ automated tools such as the Perspective toxicity classifier \citep{cjadams_jigsaw_2019} to detect toxicity in online content \citep[e.g.,][]{hopp_social_2019, kim_distorting_2021, votta_going_2023}.
As shown next, however, relying on toxicity scores created by ACs such as the Perspective API as (in-)dependent variables produces different results than using measurements created via manual annotation.

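The mechanism behind such divergent results can be reproduced in a few lines of R. The sketch below is our own illustration, not part of the paper's analysis; the effect size, the 10\% error rate, and the variable names are arbitrary choices. Randomly misclassifying a binary independent variable shrinks its estimated coefficient toward zero:

```r
# Our own sketch (not the paper's analysis): nondifferential misclassification
# of a binary independent variable attenuates its regression coefficient.
set.seed(1)
n <- 1e5
x <- rbinom(n, 1, 0.5)            # true binary variable
y <- 1 + 2 * x + rnorm(n)         # true effect of x is 2
flip <- rbinom(n, 1, 0.1)         # a classifier that errs on 10% of cases
w <- ifelse(flip == 1, 1 - x, x)  # observed, misclassified version of x
b_true  <- coef(lm(y ~ x))["x"]   # ~2
b_naive <- coef(lm(y ~ w))["w"]   # ~1.6: attenuated toward zero
```

With balanced $X$ and symmetric 10\% errors, the naïve slope is deflated by roughly the factor $\mathrm{Cov}(W,X)/\mathrm{Var}(W) = 0.8$, which is why even a seemingly accurate classifier changes substantive conclusions.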
To illustrate this, we use the Civil Comments dataset released in 2019 by Jigsaw, the Alphabet corporation subsidiary behind the Perspective API. Methodological details on the data and our example are available in Appendix \ref{appendix:perspective}. The dataset has \Sexpr{f(dv.example[['n.annotated.comments']])} English-language comments made on independent news sites. It also includes manual annotations of each comment concerning its toxicity (\emph{toxicity}), whether it discloses aspects of personal identity like race or ethnicity \emph{(identity disclosure)}, and the number of likes it received \emph{(number of likes)}.
As shown in Figure \ref{fig:real.data.example.iv}, relying on AC-based toxicity scores would lead researchers to find no direct relationship between likes and identity disclosure.
%This is because the coefficient for likes is statistically indistinguishable from 0 and the coefficient for the interaction between likes and toxicity is positive and well-estimated.
In contrast, using human annotations would lead researchers to infer a subtle positive direct relationship between likes and identity disclosure. %Using a smaller sample of manually annotated data, as will often be more feasible due to limited resources, lacks sufficient statistical power to detect any such relationship.
%However, our method can use this sample of annotations to correct the bias introduced by Perspective's misclassifications while preserving enough statistical power to detect the direct relationship between likes and identity disclosure at the 95\% confidence level with estimates similar to those in the model using all \Sexpr{f(dv.example[['n.annotated.comments']])} annotations.
This demonstrates that even a very accurate AC can introduce type-II errors, i.e., researchers failing to reject a null hypothesis due to misclassification.

Second, let us consider \emph{misclassification in a dependent variable}. We now predict the \emph{toxicity} of a comment with \emph{number of likes}, \emph{identity disclosure} in a comment, and their interaction as independent variables.
As shown in Figure \ref{fig:real.data.example.dv}, using Perspective's classification of toxicity results in a small negative direct effect of likes. However, there is no detectable relationship when using manual annotations. As such, misclassification can also lead to type-I error, i.e., false discovery of a nonzero relationship.

%The model using a more feasible sample of \Sexpr{format.percent(dv.sample.prop)} of manual annotations cannot rule out such a weak relationship.
%(the estimated effect using the AC is in the 95\% confidence interval), but our error correction method using this sample and Perspective's automatic classifications together can do so.
If ACs become standard measurement devices, for instance
%the LIWC dictionary to measure sentiment \citep{boukes_whats_2020},
%\citep{dobbrick_enhancing_2021}
Google's Perspective API for measuring toxicity \citep[see critically][]{hosseini_deceiving_2017} or Botometer for classifying social media bots \citep[see critically][]{rauchfleisch_false_2020}, entire research areas may be subject to systematic biases.
Even if misclassification bias is usually conservative, it can slow progress in a research area. Consider how \citet{scharkow_how_2017} argue that media's ``minimal effects'' on political opinions and behavior in linkage studies may be an artifact of measurement errors in both manual content analyses and self-reported media use in surveys. Conversely, if researchers selectively report statistically significant hypothesis tests, misclassification can introduce an upward bias in the magnitude of reported effect sizes and contribute to a replication crisis \citep{loken_measurement_2017}.

% First, we note that when the anticipated effect size is large enough, traditional content analysis of a random sample has the advantage over the considerable complexity of automated content analysis.
Like regression calibration, multiple imputation uses a model to infer likely values of the misclassified variable.
% This section basically translates Carroll et al. for a technically advanced 1st year graduate student.
We now elaborate on \emph{Maximum Likelihood Adjustment} (MLA), a new method we propose for correcting misclassification bias. Our method tailors \citet{carroll_measurement_2006}'s presentation of the general statistical theory of likelihood modeling for measurement error correction to the context of automated content analysis.\footnote{In particular, see Chapter 8 (especially Example 8.4) and Chapter 15 (especially Section 15.4.2).} The MLA approach deals with misclassification bias by maximizing a likelihood that correctly specifies an \emph{error model} of the probability of the automated classifications conditional on the true value and the outcome \citep{carroll_measurement_2006}.
In contrast to the GMM and the MI approach, which predict values of the misclassified variable, the MLA method accounts for all possible values of the variable by ``integrating them out'' of the likelihood.
``Integrating out'' means summing the joint likelihood over all possible values of the variable, weighted by the likelihood of the error model.

MLA methods have four advantages in the context of ACs that reflect the benefits of integrating out partially observed discrete variables. First, they are general in that they can be applied to any model with a convex likelihood, including generalized linear models (GLMs) and generalized additive models (GAMs).
Second, assuming the model is correctly specified, MLA estimators are fully consistent whereas regression calibration estimators are only approximately consistent \citep{carroll_measurement_2006}. Practically, this means that MLA methods can have greater statistical efficiency and require less manually annotated data to make precise estimates.
Fourth, and most important, MLA can be effective when misclassification is systematic.
\subsubsection{When an Automated Classifier Predicts an Independent Variable}

In general, if we want to estimate a model $P(Y|\Theta_Y, X, Z)$ for $Y$ given $X$ and $Z$ with parameters $\Theta_Y$, we can use AC classifications $W$ predicting $X$ to gain statistical power without introducing misclassification bias by maximizing $\mathcal{L}(\Theta|Y,W)$, the likelihood of the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\}$ in a joint model of $Y$ and $W$ \citep{carroll_measurement_2006}.
The joint probability of $Y$ and $W$ can be factored into the product of three terms: $P(Y|X,Z,\Theta_Y)$, the model with parameters $\Theta_Y$ we want to estimate; $P(W|X,Y,Z,\Theta_W)$, a model for $W$ with parameters $\Theta_W$; and $P(X|Z,\Theta_X)$, a model for $X$ with parameters $\Theta_X$.
Calculating these three conditional probabilities is sufficient to calculate the joint probability of the dependent variable and the automated classifications and thereby obtain a consistent estimate despite misclassification. $P(W|X,Y,Z,\Theta_W)$ is called the \emph{error model} and $P(X|Z,\Theta_X)$ is called the \emph{exposure model} \citep{carroll_measurement_2006}.

To illustrate, consider the regression model $Y=B_0 + B_1 X + B_2 Z + \varepsilon$ and automated classifications $W$ of the independent variable $X$.
We can assume that the probability of $W$ follows a logistic regression model of $Y$, $X$, and $Z$ and that the probability of $X$ follows a logistic regression model of $Z$. In this case, the likelihood model below is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\} = \{\{B_0, B_1, B_2\}, \{\alpha_0, \alpha_1, \alpha_2, \alpha_3\}, \{\gamma_0, \gamma_1\}\}$.

\begin{align}
\mathcal{L}(\Theta | Y, W) &= \prod_{i=1}^{N}\sum_{x} {P(Y_i| X_i = x, Z_i, \Theta_Y)P(W_i|X_i = x, Y_i, Z_i, \Theta_W)P(X_i = x|Z_i, \Theta_X)} \label{eq:covariate.reg.general}\\
P(Y_i| X_i, Z_i, \Theta_Y) &= \phi\left(Y_i - (B_0 + B_1 X_i + B_2 Z_i)\right) \\
P(W_i| X_i, Y_i, Z_i, \Theta_W) &= \frac{1}{1 + e^{-(\alpha_0 + \alpha_1 Y_i + \alpha_2 X_i + \alpha_3 Z_i)}} \label{eq:covariate.logisticreg.w} \\
P(X_i| Z_i, \Theta_X) &= \frac{1}{1 + e^{-(\gamma_0 + \gamma_1 Z_i)}}
\end{align}
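This likelihood can be computed directly. The following R sketch is our own illustration, not the paper's implementation; the parameter packing, the unit error variance in the outcome model, and the optimizer call are assumptions. It evaluates the log-likelihood by summing the product of the outcome, error, and exposure models over both values of the unobserved binary $X$:

```r
# Our own sketch of the MLA log-likelihood for the illustrative model.
# theta packs {B0,B1,B2}, {alpha0..alpha3}, {gamma0,gamma1}; the outcome
# model assumes unit error variance (an assumption of this sketch).
mla_loglik <- function(theta, y, w, z) {
  B <- theta[1:3]; a <- theta[4:7]; g <- theta[8:9]
  lik <- 0
  for (x in 0:1) {                                         # integrate out X
    p_y <- dnorm(y - (B[1] + B[2] * x + B[3] * z))         # outcome model
    pw1 <- plogis(a[1] + a[2] * y + a[3] * x + a[4] * z)   # error model P(W=1|.)
    p_w <- ifelse(w == 1, pw1, 1 - pw1)
    px1 <- plogis(g[1] + g[2] * z)                         # exposure model P(X=1|Z)
    p_x <- if (x == 1) px1 else 1 - px1
    lik <- lik + p_y * p_w * p_x                           # weighted sum over x
  }
  sum(log(lik))
}
# A generic optimizer can then maximize it, e.g.:
# optim(rep(0, 9), function(t) -mla_loglik(t, y, w, z), method = "BFGS")
```

Note that the true $X$ never appears as data: only $Y$, $W$, and $Z$ enter the function, which is exactly what makes the approach usable when manual annotations are scarce.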

Such simulations allow exploration of how results depend on assumptions about the misclassification process.

In our simulations, we tested four error correction methods: \emph{GMM calibration} (GMM) \citep{fong_machine_2021}, \emph{multiple imputation} (MI) \citep{blackwell_unified_2017-1}, \emph{Zhang's pseudo-likelihood model} (PL) \citep{zhang_how_2021}, and our \emph{maximum likelihood adjustment} approach (MLA). We use the \texttt{predictionError} R package \citep{fong_machine_2021} for the GMM method, the \texttt{Amelia} R package for the MI approach, and our own implementation of \citet{zhang_how_2021}'s PL approach in R.
We implement our MLA approach in the R package \texttt{misclassificationmodels}.
For PL and MLA, we quantify uncertainty using the quadratic approximation based on the Fisher information.\footnote{The code for reproducing our simulations and our experimental R package is available here: \url{https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e}.}
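For readers unfamiliar with this approximation, the sketch below shows the standard recipe with a toy likelihood of our own (it is not the package's code): invert the Hessian of the negative log-likelihood at its optimum and take square roots of the diagonal.

```r
# Toy negative log-likelihood for a normal linear model; the third
# parameter is log(sigma) so the optimization is unconstrained.
nll <- function(theta, y, x) {
  -sum(dnorm(y, mean = theta[1] + theta[2] * x, sd = exp(theta[3]), log = TRUE))
}
set.seed(1)
x <- rnorm(500)
y <- 1 + 2 * x + rnorm(500)
fit <- optim(c(0, 0, 0), nll, y = y, x = x, hessian = TRUE, method = "BFGS")
se  <- sqrt(diag(solve(fit$hessian)))  # quadratic (Fisher information) SEs
```

For this toy model, the slope's standard error from the inverse Hessian closely matches the analytical standard error reported by \texttt{lm}, which is the sanity check one would run before trusting the approximation in a more complex likelihood.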

In addition, we compare these error correction methods to two common approaches in communication science: the \emph{feasible} estimator (i.e., conventional content analysis that uses only manually annotated data and not ACs)
%and illustrates the motivation for using an AC in these scenarios: validation alone provides insufficient statistical power for a sufficiently precise hypothesis test.
We simulate regression models with two independent variables ($X$ and $Z$). This sufficiently constrains our study's scope, but the scenario is general enough to apply to a wide range of research studies.
%Simulating studies with two covariates lets us study how measurement error in one covariate can cause bias in coefficient estimates of other covariates.
Whether the methods we evaluate below are effective or not depends on the conditional dependence structure among independent variables, the dependent variable $Y$, and automated classifications $W$.
This structure determines if adjustment for systematic misclassification is required \citep{carroll_measurement_2006}.
In Figure \ref{bayesnets}, we illustrate our scenarios via Bayesian networks representing the conditional dependence structure of variables \citep{pearl_fusion_1986}:
%In these figures, an edge between two variables indicates that they have a direct relationship. Two nodes that are not neighbors are statistically independent given the variables between them on the graph. For example, in Figure \ref{fig:simulation.1a}, the automatic classifications $W$ are conditionally independent of $Y$ given $X$ because all paths between $W$ and $Y$ contain $X$. This indicates that the model $Y=B_0 +B_1 W+ B_2 Z$ (the \emph{naïve estimator}) has non-differential error because the automatic classifications $W$ are conditionally independent of $Y$ given $X$. However, in Figure \ref{fig:simulation.1b}, there is an edge between $W$ and $Y$ to indicate that $W$ is not conditionally independent of $Y$ given other variables. Therefore, the naïve estimator has differential error.
We first simulate two cases where an AC measures an independent variable without (\emph{Simulation 1a}) and with differential error (\emph{Simulation 1b}). Then, we simulate using an AC to measure the dependent variable, either one with misclassifications that are uncorrelated (\emph{Simulation 2a}) or correlated with an independent variable (\emph{Simulation 2b}). GMM is not designed to correct misclassifications in dependent variables, so we omit this method in \emph{Simulations 2a} and \emph{2b}.
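To make the distinction concrete, here is a hypothetical generative sketch of the two error types; the coefficients are our own choices, not the simulations' actual parameters. Nondifferential error makes $W$ depend only on $X$, while differential error lets $W$ also depend on $Y$:

```r
# Our own illustrative sketch of nondifferential vs. differential error.
set.seed(1)
n <- 1e4
z <- rbinom(n, 1, 0.5)
x <- rbinom(n, 1, plogis(-0.5 + z))                      # X correlated with Z
y <- x + z + rnorm(n)
w_nondiff <- rbinom(n, 1, plogis(-2 + 4 * x))            # depends on X only
w_diff    <- rbinom(n, 1, plogis(-2 + 4 * x + 0.5 * y))  # also depends on Y
# Holding X fixed, nondifferential errors are unrelated to Y,
# while differential errors remain correlated with Y:
cor(w_nondiff[x == 0], y[x == 0])  # near zero
cor(w_diff[x == 0], y[x == 0])     # clearly positive
```

The conditional correlations computed at the end mirror the Bayesian-network reading: in the nondifferential case every path from $W$ to $Y$ passes through $X$, so conditioning on $X$ breaks the association.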

In this simulated example, $Y$ is a continuous variable and $X$ is a binary variable.
%Say that human content coders can observe $X$ perfectly, but each observation is so expensive that observing $X$ for a large sample is infeasible.
%Instead, the human coders can measure $X$ without error for a subsample of size $m \ll N$.
%To scale up content analysis, a SML-based AC makes predictions $W$ of $X$—for instance predicting if any comments from a social media user include toxicity.
Both simulations have a normally distributed dependent variable $Y$ and two binary independent variables $X$ and $Z$, which are balanced ($P(X)=P(Z)=0.5$) and correlated (Pearson's $\rho=\Sexpr{round(sim2a.cor.xz,2)}$). %Simulating balanced covariates serves simplicity so that accuracy is adequate to quantify the predictive performance of our simulated classifier. Simulating correlated covariates is helpful to study how misclassification in one variable affects parameter inference in other covariates.
To represent a study design where an AC is needed to obtain sufficient statistical power, $Z$ and $X$ can explain only \Sexpr{format.percent(sim1.R2)} of the variance in $Y$.
% TODO, bring back when these simulations are in the appendix.
%Additional simulations in appendix \ref{appendix:sim1.imbalanced} show results for variations of \emph{Simulation 1} with imbalanced covariates explaining a range of variances, different classifier accuracies, heteroskedastic misclassifications, and deviations from normality in the outcome $Y$.