
revisions and appendix update

2023-02-24 15:10:00 -08:00
parent c5e0a01713
commit 3dc090ec6a
18 changed files with 254 additions and 438 deletions


@@ -35,6 +35,8 @@ source('resources/real_data_example.R')
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}
\setcounter{secnumdepth}{3}
%Code listing style named "mystyle"
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour}, commentstyle=\color{codegreen},
@@ -115,13 +117,18 @@ Despite this popularity, even highly accurate classifiers make errors that cause
As we show in a systematic literature review of SML applications,
communication scholars largely ignore misclassification bias.
In principle, existing statistical methods can use ``gold standard'' validation data, such as that created by human annotators, to correct misclassification bias and produce consistent estimates.
We introduce and test such methods, including a new method we design and implement in the R package \texttt{misclassificationmodels}, via Monte Carlo simulations designed to reveal each method's limitations. Based on our results, we recommend our method as it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods.
}
% fix bug in apa7 package: https://tex.stackexchange.com/questions/645947/adding-appendices-in-toc-using-apa7-package
\begin{document}
\maketitle
%\section{Introduction}
\tableofcontents
\clearpage
\emph{Automated classifiers} (ACs) based on supervised machine learning (SML) have rapidly gained popularity
as part of the \emph{automated content analysis} toolkit in communication science \citep{baden_three_2022}. With ACs, researchers can categorize large samples of text, images, video or other types of data into predefined categories \citep{scharkow_thematic_2013}. Studies for instance use SML-based classifiers to study frames \citep{burscher_teaching_2014}, tonality \citep{van_atteveldt_validity_2021}, %even ones as seemingly straightforward as sentiment \citep{van_atteveldt_validity_2021}, toxicity \citep{fortuna_toxic_2020}
or civility \citep{hede_toxicity_2021} in news media texts or social media posts.
@@ -145,7 +152,7 @@ Our primary contribution, an effort rescue ACs from this dismal state, is to \em
% Such biases can easily result when classifier errors affect human behavior, such as that of social media moderators \maskparencite{teblunthuis_effects_2021}. Studies using classifiers from APIs that are also used in sociotechnical systems may therefore be particularly prone to differential error, which can cause misleading statistics even when classification accuracy is high.
% Our Supplementary Materials present numerous extensions of these scenarios. We show that none of the existing error correction methods are effective in all scenarios.
%— multiple imputation fails in scenario 2; GMM calibration fails in scenario 1b and is not designed for scenario 2; and the pseudo-likelihood method fails in scenario 1 and in scenario 2b. When correctly applied, our likelihood modeling is the only correction method recovering the true parameters in all scenarios. %We provide our implementation as an R package.
% , and our approach based on maximum likelihood methods \citep{carroll_measurement_2006} .
@@ -228,8 +235,8 @@ ACs boasting high performance often have biases related to social categories \ci
Much of this critique targets unjust consequences of these biases to individuals. Our example shows that these biases can also contaminate scientific studies using ACs as measurement devices. Even very accurate ACs can cause both type-I and type-II errors, which become more likely when classifiers are less accurate or more biased, or when effect sizes are small.
We argue that current common practices to address such limitations are insufficient. These practices assert validity by reporting classifier performance on manually annotated data quantified as metrics including accuracy, precision, recall, or the F1 score \citep{hase_computational_2022, baden_three_2022, song_validations_2020}.
These steps promote confidence in results by making misclassification transparent, but our example indicates bias can flow downstream into statistical inferences, despite high predictiveness.
Instead of relying on transparency rituals to ward off misclassification bias, researchers can and should use validation data to understand and correct it.
% \citep{obermeyer_dissecting_2019, kleinberg_algorithmic_2018, bender_dangers_2021, wallach_big_2019, noble_algorithms_2018}.
%For example, \citet{hede_toxicity_2021} show that, when applied to news datasets, the Perspecitve API overestimates incivility related to topics such as racial identity, violence, and sex.
@@ -238,9 +245,10 @@ Instead of practicing transparancy and hoping not to be mislead by misclassifica
%Importantly, these errors are correctable using human annotations. Although this example required \Sexpr{iv.sample.count} annotations, a large number representing considerable effort, to consistently do so, this is a small fraction of the entire dataset.
These claims may surprise readers because of the widespread misconception that misclassification causes only conservative bias (i.e., bias towards null effects). The misconception persists because the claim is true for bivariate least squares regression when misclassifications are nonsystematic
\citep{carroll_measurement_2006, loken_measurement_2017, van_smeden_reflection_2020}.\footnote{Measurement error is \emph{classical} when it is nonsystematic and the variance of an AC's predictions is greater than the variance of the true value \citep{carroll_measurement_2006}.
Measurement error in an independent variable is called ``differential'' if it is not conditionally independent of the dependent variable given the other independent variables.
Measurement error in an independent variable can be nondifferential and not classical when the variance of the misclassified variable is less than the variance of the true value. This is called Berkson error and is in general easier to deal with than classical error. It is hard to imagine how an AC would have Berkson errors, as its predictions would then have lower variance than the training data. Following prior work, we thus do not consider Berkson errors \citep{fong_machine_2021, zhang_how_2021}. We call measurement error in the dependent variable \emph{systematic} when it is correlated
with an independent variable. We use this more general term to simplify our discussions that pertain equally to misclassified independent and dependent variables.} As a result, researchers interested in a hypothesis of a statistically significant relationship may not consider misclassification an important threat to validity \citep{loken_measurement_2017}.
However, as shown in our example, misclassification bias can be anti-conservative \citep{carroll_measurement_2006, loken_measurement_2017, van_smeden_reflection_2020}. First, in regression models with more than one independent variable, or in nonlinear models, such as the logistic regression used in our example, even nonsystematic misclassification can cause bias away from 0.
Second, systematic misclassification can bias inference in any direction.
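Both effects can be seen in a short simulation: a binary $X$ correlated with $Z$, a 15\% nonsystematic flip rate, and a two-covariate linear model. The data generating process and all numbers below are illustrative assumptions, not the paper's actual simulation settings:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
z = rng.normal(0, 1, n)
x = (z + rng.normal(0, 1, n) > 0).astype(float)    # binary X, correlated with Z
y = 1.0 + 2.0 * x + 1.0 * z + rng.normal(0, 1, n)  # true model: B_X = 2, B_Z = 1

flip = rng.random(n) < 0.15                        # nonsystematic misclassification
w = np.where(flip, 1 - x, x)                       # classifier output W

def ols(design, outcome):
    """Least-squares coefficients for a design matrix."""
    return np.linalg.lstsq(design, outcome, rcond=None)[0]

ones = np.ones(n)
b_true = ols(np.column_stack([ones, x, z]), y)     # oracle analysis with true X
b_naive = ols(np.column_stack([ones, w, z]), y)    # naive analysis with W
```

Even though the flips are independent of everything else, the naive estimate of $B_X$ falls well below 2 while the estimate of $B_Z$ is pulled away from its true value of 1, because part of the signal that $W$ misses is absorbed by the correlated covariate.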
@@ -321,35 +329,36 @@ In contrast, an AC can make classifications $W$ for the entire dataset but intro
\emph{Multiple imputation} (MI) treats misclassification as a missing data problem. It understands the true value of $X$ to be observed in manually annotated data $X^*$ and missing otherwise \citep{blackwell_unified_2017-1}.
%For example, the regression calibration step in \citet{fong_machine_2021}'s GMM method uses least squares regression to impute unobserved values of the covariate $X$. Indeed, \citet{carroll_measurement_2006} describe regression calibration when validation data are available as ``simply a poor person's imputation methodology'' (pp. 70).
Like regression calibration, multiple imputation uses a model to infer likely values of possibly misclassified variables. The difference is that multiple imputation samples several (hence \emph{multiple} imputation) entire datasets, filling in the missing data from the predictive probability distribution of $X$ conditional on other variables $\{W,Y,Z\}$, then runs a statistical analysis on each of these sampled datasets and pools the results \citep{blackwell_unified_2017-1}. Note that $Y$ is included among the imputing variables, giving the MI approach the potential to address \emph{differential error}, which arises when systematic misclassification makes automatic classifications conditionally dependent on the outcome given the other independent variables.
\citet{blackwell_unified_2017-1} claim that the MI method is relatively robust when it comes to small violations of the assumption of nondifferential error. Moreover, in theory, the MI approach can be used for correcting misclassifications both in independent and dependent variables.
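The pooling step of MI can be sketched with Rubin's rules; the five point estimates and variances below are hypothetical imputation outputs, not results from \texttt{Amelia}:

```python
import numpy as np

# Hypothetical estimates of one coefficient from m = 5 imputed datasets,
# each with its within-imputation variance (squared standard error).
est = np.array([0.52, 0.48, 0.55, 0.50, 0.47])
var = np.array([0.010, 0.012, 0.011, 0.010, 0.013])

m = len(est)
q_bar = est.mean()                   # pooled point estimate
w_bar = var.mean()                   # average within-imputation variance
b = est.var(ddof=1)                  # between-imputation variance
t = w_bar + (1 + 1 / m) * b          # total variance of the pooled estimate
se_pooled = np.sqrt(t)
```

The between-imputation term is what propagates uncertainty about the true classifications into the final standard error.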
\emph{``Pseudo-likelihood''} methods (PL)—even if not always explicitly labeled this way—are another approach for correcting misclassification bias. \citet{zhang_how_2021} proposes a method that approximates the error model using quantities from the AC's confusion matrix—the positive and negative predictive values in the case of a mismeasured independent variable and the AC's false positive and false negative rates in the case of a mismeasured dependent variable. Because quantities from the confusion matrix are neither data nor model parameters, \citet{zhang_how_2021}'s method is technically a ``pseudo-likelihood'' method. A clear benefit is that this method only requires summary quantities derived from manually annotated data, for instance via a confusion matrix. %We will discuss likelihood methods in greater depth in the presentation of our MLA framework below.
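As a minimal illustration of correcting with confusion-matrix quantities (a simple prevalence correction, not Zhang's full regression method; all rates below are invented):

```python
def corrected_prevalence(observed_rate, sensitivity, specificity):
    """Invert observed = pi*sens + (1 - pi)*(1 - spec) for the true rate pi."""
    return (observed_rate + specificity - 1) / (sensitivity + specificity - 1)

# A classifier with sensitivity 0.9 and specificity 0.8 applied to data with
# true prevalence 0.3 reports 0.3*0.9 + 0.7*0.2 = 0.41 positives.
pi_hat = corrected_prevalence(0.41, 0.9, 0.8)  # recovers 0.3
```

The same logic, applied cell by cell to a regression's inputs rather than to a single rate, is the intuition behind using predictive values or error rates from the confusion matrix as plug-in corrections.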
\subsection{Proposing Maximum Likelihood Adjustment for Misclassification}
% This section basically translates Carroll et al. for a technically advanced 1st year graduate student.
We now elaborate on \emph{Maximum Likelihood Adjustment} (MLA), a new method we propose for correcting misclassification bias. Our method tailors \citet{carroll_measurement_2006}'s presentation of the general statistical theory of likelihood modeling for measurement error correction to the context of automated content analysis.\footnote{In particular see Chapter 8 (especially example 8.4) and Chapter 15 (especially 15.4.2).} The MLA approach deals with misclassification bias by maximizing a likelihood that correctly specifies an \emph{error model} of the probability of the automated classifications conditional on the true value and the outcome \citep{carroll_measurement_2006}.
In contrast to the GMM and the MI approach, which predict values of the misclassified variable, the MLA method accounts for all possible values of the variable by ``integrating them out'' of the likelihood.
``Integrating out'' means adding possible values of a variable to the likelihood, weighted by the likelihood of the error model.
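For a binary misclassified variable, the marginalized likelihood can be sketched as follows (a sketch consistent with the factorization into outcome, error, and exposure models described in the text; $i$ indexes observations):

```latex
\mathcal{L}(\Theta \mid Y, W, Z) = \prod_{i=1}^{n} \sum_{x \in \{0,1\}}
  P(Y_i \mid x, Z_i, \Theta_Y)\,
  P(W_i \mid x, Y_i, \Theta_W)\,
  P(x \mid Z_i, \Theta_X)
```

Each observation's contribution is a weighted sum over both possible true values, with weights supplied by the error and exposure models.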
MLA methods have four advantages in the context of ACs that reflect the benefits of integrating out partially observed discrete variables. First, they are general in that they can be applied to any model with a convex likelihood including generalized linear models (GLMs) and generalized additive models (GAMs).
Second, assuming the model is correctly specified, MLA estimators are fully consistent whereas regression calibration estimators are only approximately consistent \citep{carroll_measurement_2006}. Practically, this means that MLA methods can have greater statistical efficiency and require less manually annotated data to make precise estimates.
%The MLA approach is conceptually different from the GMM one. The GMM approach first imputes likely values and then runs the main analysis on imputed values. By contrast, MLA approaches estimate—all in one step—the main analysis using the full dataset and the error model estimated using only the validation data \citep{carroll_measurement_2006}.
Third, the MLA approach is applicable both for correcting for misclassification in a dependent and an independent variable.
Fourth, and most important, MLA can be effective when misclassification is systematic.
%The idea is to use an \emph{error model} of the conditional probability of the automatic classifications given the true classifications and other variables on which automatic classifications depend.
%In other words, the error model estimates the conditional probability mass function of the automatic classifications.
% When a variable is measured with error, this error introduces uncertainty. The overall idea of correcting an analysis with a mismeasured variable through likelihood modeling is to use
%Including the error model in the likelihood effectively accounts for uncertainty of the true classifications and, assuming the error model gives consistent estimates of the conditional probability of the automatic classifications given the true values, is sufficient to obtain consistent estimates using MLA \citep{carroll_measurement_2006}.
\subsubsection{When an Automated Classifier Predicts an Independent Variable}
In general, if we want to estimate a model $P(Y|\Theta_Y, X, Z)$ for $Y$ given $X$ and $Z$ with parameters $\Theta_Y$, we can use AC classifications $W$ predicting $X$ to gain statistical power without introducing misclassification bias by maximizing ($\mathcal{L}(\Theta|Y,W)$), the likelihood of the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\}$ in a joint model of $Y$ and $W$ \citep{carroll_measurement_2006}.
The joint probability of $Y$ and $W$ can be factored into the product of three terms: $P(Y|X,Z,\Theta_Y)$, the model with parameters $\Theta_Y$ we want to estimate; $P(W|X,Y, \Theta_W)$, a model for $W$ with parameters $\Theta_W$; and $P(X|Z, \Theta_X)$, a model for $X$ with parameters $\Theta_X$.
Calculating these three conditional probabilities is sufficient to calculate the joint probability of the dependent variable and automated classifications and thereby obtain a consistent estimate despite misclassification. $P(W|X,Y, \Theta_W)$ is called the \emph{error model} and $P(X|Z, \Theta_X)$ is called the \emph{exposure model} \citep{carroll_measurement_2006}.
To illustrate, consider the regression model $Y=B_0 + B_1 X + B_2 Z + \varepsilon$, in which the discrete independent variable $X$ is predicted by an AC.
We can assume that the probability of $W$ follows a logistic regression model of $Y$, $X$ and $Z$ and that the probability of $X$ follows a logistic regression model of $Z$. In this case, the likelihood model below is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\} = \{\{B_0, B_1, B_2\}, \{\alpha_0, \alpha_1, \alpha_2\}, \{\gamma_0, \gamma_1\}\}$.
@@ -362,9 +371,9 @@ We can assume that the probability of $W$ follows a logistic regression model of
\end{align}
\noindent where $\phi$ is the normal probability density function. Note that Equation \ref{eq:covariate.reg.general} models differential error (i.e., $Y$ is not independent of $W$ conditional on $X$ and $Z$) via a linear relationship between $W$ and $Y$. When error is nondifferential, the dependence between $W$ and $Y$ can be removed from Equations \ref{eq:covariate.reg.general} and \ref{eq:covariate.logisticreg.w}.
Estimating the three conditional probabilities in practice requires specifying models on which validity of the method depends.
This framework is very general and a wide range of probability models, such as generalized additive models (GAMs) or Gaussian process classification, may be used to estimate $P(W| X, Y, Z, \Theta_W)$ and $P(X|Z,\Theta_X)$ \citep{williams_bayesian_1998}.
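A stripped-down numerical sketch of this scheme, under simplifying assumptions we introduce for illustration (no covariate $Z$, nondifferential error, and sensitivity/specificity treated as known instead of estimated from validation data; this is not the \texttt{misclassificationmodels} implementation):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 3000
B0, B1, SIGMA, PI = 1.0, 2.0, 0.5, 0.4   # true outcome and exposure parameters
SENS, SPEC = 0.85, 0.85                  # error model, treated as known here

x = rng.binomial(1, PI, n)                       # unobserved true X
y = B0 + B1 * x + rng.normal(0, SIGMA, n)        # observed outcome
w = np.where(x == 1, rng.binomial(1, SENS, n),   # classifier output W
             rng.binomial(1, 1 - SPEC, n))

def nll(theta):
    """Negative log-likelihood with the unobserved X integrated out."""
    b0, b1, log_sigma, logit_pi = theta
    sigma, pi = np.exp(log_sigma), 1 / (1 + np.exp(-logit_pi))
    lik = np.zeros(n)
    for xv in (0, 1):                            # sum over possible true values
        f_y = norm.pdf(y, loc=b0 + b1 * xv, scale=sigma)   # outcome model
        p_w = np.where(w == 1,
                       SENS if xv else 1 - SPEC,
                       1 - SENS if xv else SPEC)           # error model
        p_x = pi if xv else 1 - pi                         # exposure model
        lik += f_y * p_w * p_x
    return -np.log(lik).sum()

fit = minimize(nll, x0=[0.0, 1.0, 0.0, 0.0], method="Nelder-Mead",
               options={"maxiter": 4000})
b0_hat, b1_hat = fit.x[0], fit.x[1]
```

Despite never observing $X$, the maximizer recovers the outcome-model coefficients because the error and exposure models weight each candidate value of $X$ appropriately inside the likelihood.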
\subsubsection{When an Automated Classifier Predicts a Dependent Variable}
@@ -382,6 +391,7 @@ If we assume that the probability of $Y$ follows a logistic regression model of
\end{align}
If the AC's errors are conditionally independent of $X$ and $Z$ given $W$, the dependence of $W$ on $X$ and $Z$ can be omitted from Equations \ref{eq:depvar.general} and \ref{eq:depvar.w}.
Additional details on the likelihood modeling approach are available in Appendix \ref{appendix:derivation} of the Supplement.
@@ -389,20 +399,21 @@ Additional details on the likelihood modeling approach available in Appendix \re
% \TODO{Create a table summarizing the simulations and the parameters.}
We now present four Monte Carlo simulations (\emph{Simulations 1a}, \emph{1b}, \emph{2a}, and \emph{2b}) with which we evaluate existing methods (GMM, MI, PL) and our approach (MLA) for correcting misclassification bias.
Monte Carlo simulations are a tool for evaluating statistical methods, including (automated) content analysis \citep[e.g.,][]{song_validations_2020,bachl_correcting_2017,geis_statistical_2021, fong_machine_2021,zhang_how_2021}.
They are defined by a data generating process from which datasets are repeatedly sampled. Repeating an analysis for each of these datasets provides an empirical distribution of the results the analysis would obtain over study replications. Monte Carlo simulation affords exploration of finite-sample performance, robustness to assumption violations, comparison across several methods, and ease of interpretability \citep{mooney_monte_1997}.
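As a toy version of this logic (a correctly specified bivariate regression with invented parameters), one can check how often the nominal 95\% confidence interval covers the true slope across replications:

```python
import numpy as np

rng = np.random.default_rng(7)
reps, n, B1 = 500, 200, 2.0
covered = 0
for _ in range(reps):                      # repeatedly sample from the DGP
    x = rng.normal(0, 1, n)
    y = 1.0 + B1 * x + rng.normal(0, 1, n)
    xc = x - x.mean()
    b1 = (xc @ y) / (xc @ xc)              # OLS slope
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x
    se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    covered += (b1 - 1.96 * se <= B1 <= b1 + 1.96 * se)
coverage = covered / reps                  # should sit near 0.95
```

An analogous loop with misclassified variables and each correction method applied underlies the simulations described in this section.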
\subsection{Parameters of the Monte Carlo Simulations}
In our simulations, we tested four error correction methods: \emph{GMM calibration} (GMM) \citep{fong_machine_2021}, \emph{multiple imputation} (MI) \citep{blackwell_unified_2017-1}, \emph{Zhang's pseudo-likelihood model} (PL) \citep{zhang_how_2021}, and our \emph{maximum likelihood adjustment} approach (MLA). We use the \texttt{predictionError} R package \citep{fong_machine_2021} for the GMM method, the \texttt{Amelia} R package for the MI approach, and our own implementation of \citet{zhang_how_2021}'s PL approach in R.
We develop our MLA approach in the R package \texttt{misclassificationmodels}.
For PL and MLA, we quantify uncertainty using the Fisher information quadratic approximation.
In addition, we compare these error correction methods to two common approaches in communication science: the \emph{feasible} estimator (i.e., conventional content analysis that uses only manually annotated data and not ACs)
%and illustrates the motivation for using an AC in these scenarios—validation alone provide insufficient statistical power for a sufficiently precise hypothesis test.
and the \emph{naïve} estimator (i.e., using AC-based classifications $W$ as stand-ins for $X$, thereby ignoring misclassifications).
According to our systematic review, the \emph{naïve} approach reflects standard practice in studies employing SML for text classification.
We evaluate each of the six analytical approaches in terms of \emph{consistency} (whether the estimates of parameters $\hat{B_X}$ and $\hat{B_Z}$ have expected values nearly equal to the true values $B_X$ and $B_Z$), \emph{efficiency} (how precisely the parameters are estimated and how precision improves with additional data), and \emph{uncertainty quantification} (how well the 95\% confidence intervals approximate the range including 95\% of parameter estimates across simulations).
To evaluate efficiency, we repeat each simulation with different amounts of total observations, i.e., unlabeled data to be classified by an AC (ranging from \Sexpr{min(N.sizes)} to \Sexpr{max(N.sizes)} observations), and manually annotated observations (ranging from \Sexpr{min(m.sizes)} to \Sexpr{max(m.sizes)}
@@ -414,15 +425,14 @@ observations). Since our review indicated that ACs are most often used to create
%\end{equation}
%These simulations are designed to verify that error correction methods from prior work are effective in ideal scenarios and to create the simplest possible cases where these methods are inconsistent. Showing how prior methods fail is instructive for understanding how our MLA approach does better both in these artificial simulations and in practical projects.
\subsection{Four Prototypical Scenarios for our Monte Carlo Simulations}
We simulate regression models with two independent variables ($X$ and $Z$). This constrains our study's scope while remaining general enough to apply to a wide range of research studies.
%Simulating studies with two covariates lets us study how measurement error in one covariate can cause bias in coefficient estimates of other covariates.
Whether the methods we evaluate below are effective or not depends on the conditional dependence structure among independent variables, the dependent variable $Y$, and automated classifications $W$.
This structure determines if adjustment for systematic misclassifications is required \citep{carroll_measurement_2006}.
In Figure \ref{bayesnets}, we illustrate our scenarios via Bayesian networks representing the conditional dependence structure of variables \citep{pearl_fusion_1986}:
%In these figures, an edge between two variables indicates that they have a direct relationship. Two nodes that are not neighbors are statistically independent given the variables between them on the graph. For example, in Figure \ref{fig:simulation.1a}, the automatic classifications $W$ are conditionally independent of $Y$ given $X$ because all paths between $W$ and $Y$ contain $X$. This indicates that the model $Y=B_0 +B_1 W+ B_2 Z$ (the \emph{naïve estimator}) has non-differential error because the automatic classifications $W$ are conditionally independent of $Y$ given $X$. However, in Figure \ref{fig:simulation.1b}, there is an edge between $W$ and $Y$ to indicate that $W$ is not conditionally independent of $Y$ given other variables. Therefore, the naïve estimator has differential error.
We first simulate two cases where an AC measures an independent variable without (\emph{Simulation 1a}) and with differential error (\emph{Simulation 1b}). Then, we simulate using an AC to measure the dependent variable, either one with misclassifications that are uncorrelated (\emph{Simulation 2a}) or correlated with an independent variable (\emph{Simulation 2b}). GMM is not designed to correct misclassifications in dependent variables, so we omit this method in \emph{Simulations 2a} and \emph{2b}.
We first consider studies with the goal of testing hypotheses about the coefficients of the regression model
\begin{equation}
Y=B_0 + B_1 X + B_2 Z + \varepsilon
\label{mod:true.ols}
\end{equation}
In our first real-data example, $Y$ was a discrete variable---whether a comment self-disclosed a racial or ethnic identity, $X$ was if a comment was toxic, and $Z$ was the number of likes.
In this simulated example, $Y$ is a continuous variable, $X$ is a binary variable measured with an AC, and $Z$ is a normally distributed variable with mean 0 and standard deviation \Sexpr{sim1.z.sd} measured without error.
%The simulated example could represent a study of $Y$, the time until an social media account is banned, $X$ if the account posted a comment including toxicity, and $Z$ the account's reputation score. $X$ and $Z$ are negatively correlated because high-reputation accounts may be less likely to post comments including toxicity.
% TODO, bring back when these simulations are in the appendix.
%Additional simulations in appendix \ref{appendix:sim1.imbalanced} show results for variations of \emph{Simulation 1} with imbalanced covariates explaining a range of variances, different classifier accuracies, heteroskedastic misclassifications and deviance from normality in the an outcome $Y$.
In \emph{Simulation 1a} (Figure \ref{fig:simulation.1a}), we simulate an AC with \Sexpr{format.percent(sim1a.acc)} accuracy.\footnote{Classifier accuracy varies between our simulations because it is difficult to jointly specify classifier accuracy and the required correlations among variables, and due to random variation between simulation runs. We report the median accuracy over simulation runs.} This reflects a situation where $X$ may be difficult to predict, but the AC, represented as a logistic regression model having linear predictor $W^*$, provides a useful signal.
We simulate nondifferential misclassification because $W=X+\xi$, $\xi$ is normally distributed with mean $0$, and $\xi$ and $W$ are conditionally independent of $Y$ given $X$ and $Z$.
%($P(\xi| Y,X,Z) = P(\xi|X,Z)$).
% False negatives may cause delays in moderation increasing $Y$ (time-until-ban), while false-positives could draw moderator scrutiny and cause them to issue speedy bans.
% This mechanism is not mediated by observable variables such as reputation ($Z$) or the true use of toxicity ($X$). Therefore, we expect differential error.
\subsubsection{Measurement Error in a Dependent Variable (\textit{Simulation 2a} and \textit{2b})}
We then simulate using an AC to measure the dependent variable $Y$, a binary independent variable $X$, and a continuous independent variable $Z$. The goal is to estimate $B_1$ and $B_2$ in the following logistic regression model:
\begin{equation}
\operatorname{logit} P(Y=1) = B_0 + B_1 X + B_2 Z
\end{equation}
For each method, we visualize the consistency, efficiency, and accuracy of uncertainty quantification in each scenario.
%Our main results are presented as plots visualizing the consistency (i.e., does the method, on average, recover the true parameter?), efficiency (i.e., how precise are estimates and does precision improve as sample size increases?), and the accuracy of uncertainty quantification of each method in each scenario.
For example, Figure \ref{fig:sim1a.x} visualizes results for \emph{Simulation 1a}. Each subplot shows a simulation with a given total sample size (No. observations) and a given sample of manually annotated observations (No. manually annotated observations).
To assess a method's consistency, we locate the expected value of the point estimate across simulations at the center of the black circle. As an example, see the leftmost column in the bottom-left subplot of Figure \ref{fig:sim1a.x}. For the naïve estimator, the circle is far below the dashed line indicating the true value of $B_X$. Here, ignoring misclassification causes bias toward 0 and the estimator is inconsistent. To assess a method's efficiency, we mark the region in which the point estimate falls in 95\% of the simulations with black lines.
The black lines in the bottom-left subplot of Figure \ref{fig:sim1a.x} for example show that the feasible estimator, which uses only manually annotated data, is consistent but less precise than estimates from error correction methods. To assess each method's uncertainty quantification, compare the gray lines, which show the expected value of a method's approximate 95\% confidence intervals across simulations, to the corresponding black lines.
The \emph{PL} column in the bottom-left subplot of Figure \ref{fig:sim1a.x} for instance shows that the method's 95\% confidence interval is biased towards 0 when the number of manually annotated observations is smaller. This is to be expected because the PL estimator does not account for uncertainty in misclassification probabilities estimated using the sample of manually annotated observations.
%Now that we have explained how to interpret our plots, we unpack them for each simulated scenario.
\subsection{\emph{Simulation 1a:} Nonsystematic Misclassification of an Independent Variable}
Figure \ref{fig:sim1a.x} illustrates \emph{Simulation 1a}. Here, the naïve estimator is severely biased in its estimation of $B_X$.
Fortunately, error correction methods (GMM, MI, MLA) produce consistent estimates and acceptably accurate confidence intervals.
Notably, the PL method is inconsistent and considerable bias remains when the sample of annotations is much smaller than the entire dataset. This is likely due to $P(X=x)$ missing from the PL estimation.\footnote{Compare Equation \ref{eq:mle.covariate.chainrule.4} in Appendix \ref{appendix:derivation} to Equations 24-28 from \citet{zhang_how_2021}.} Figure
\ref{fig:sim1a.x} also shows that MLA and GMM estimates become more precise in larger datasets.
As \citet{fong_machine_2021} also observed, this precision improvement is less pronounced for MI estimates, indicating that
GMM and MLA use automated classifications more efficiently than MI.
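A schematic way to see why the missing $P(X=x)$ term matters for PL: inferring the unobserved $X$ from a classification $W$ requires not only the misclassification probabilities but also the prevalence of $X$, via Bayes' rule,
\[
P(X=x \mid W, Z) = \frac{P(W \mid X=x, Z)\, P(X=x \mid Z)}{\sum_{x'} P(W \mid X=x', Z)\, P(X=x' \mid Z)}.
\]
An estimator built from the misclassification probabilities alone can therefore remain biased even when those probabilities are estimated accurately.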
\begin{figure}
<<example1.x,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
p <- plot.simulation.iv(plot.df.example.1, iv='x')
grid.draw(p)
@
\caption{Simulation 1a: Nonsystematic misclassification of an independent variable. Error correction methods, except for PL, obtain precise and accurate estimates given sufficient manually annotated data. \label{fig:sim1a.x}}
\end{figure}
%It is important to correct misclassification error even when an AC is only used as a statistical control \citep[for example]{weld_adjusting_2022}, because when a covariate $Z$ is correlated with $X$, misclassifications of $X$ cause bias in the \emph{naïve} estimates of $B_Z$, the regression coefficient of $Z$ on $Y$. As Figure \ref{fig:sim1a.z} in Appendix \ref{appendix:main.sim.plots} shows, methods that effectively correct estimates of $X$ in \emph{Simulation 1a} also correct estimates of $B_Z$.
In brief, when misclassifications cause nondifferential error, MLA and GMM are effective, efficient, and provide accurate uncertainty quantification. They complement each other due to different assumptions: MLA depends on correctly specifying the likelihood but its robustness to incorrect specifications is difficult to analyze \citep{carroll_measurement_2006}. The GMM approach depends on the exclusion restriction instead of distributional assumptions \citep{fong_machine_2021}.
MLA's advantage over GMM comes from the relative ease with which it can be extended to, for instance, generalized linear models (GLMs) or generalized additive models (GAMs).
In cases similar to \emph{Simulation 1a}, we therefore recommend both GMM and MLA to correct for misclassification.
\subsection{\emph{Simulation 1b:} Systematic Misclassification of an Independent Variable}
Figure \ref{fig:sim1b.x} illustrates \emph{Simulation 1b}. Here, systematic misclassification gives rise to differential error and creates more extreme misclassification bias that is more difficult to correct.
As Figure \ref{fig:sim1b.x} shows, the naïve estimator is opposite in sign to the true parameter.
Of the four methods we test, only the MLA and MI approaches provide consistent estimates. This is expected because they use $Y$ to adjust for misclassifications. The bottom row of Figure \ref{fig:sim1b.x} shows how the precision of the MI and MLA estimates increases with additional observations. As in \emph{Simulation 1a}, MLA uses this data more efficiently than MI does. However, due to the low accuracy and bias of the AC, additional unlabeled data improves precision less than one might expect. Both methods provide acceptably accurate confidence intervals. Figure \ref{fig:sim1b.z} in Appendix \ref{appendix:main.sim.plots} shows that, as in \emph{Simulation 1a}, effective correction for misclassifications of $X$ is required to consistently estimate $B_Z$, the coefficient of $Z$ on $Y$. Inspecting results from methods that do not correct for differential error is useful for understanding their limitations. When few annotations of $X$ are observed, GMM is nearly as bad as the naïve estimator. PL is also visibly biased. Both improve when a greater proportion of the data is labeled since they combine AC-based estimates with the feasible estimator.
In sum, our simulations suggest that the MLA approach is superior in conditions of differential error. Although estimations by the MI approach are consistent, the method's practicality is limited by its inefficiency.
\begin{figure}
<<example2.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
p <- plot.simulation.iv(plot.df.example.2, iv='x')
grid.draw(p)
@
\caption{\emph{Simulation 1b:} Systematic misclassification of an independent variable. Only the MLA approach obtains consistent estimates of $B_X$. \label{fig:sim1b.x}}
\end{figure}
\subsection{\emph{Simulation 2a:} Nonsystematic Misclassification of a Dependent Variable}
Figure \ref{fig:sim2a.x} illustrates \emph{Simulation 2a}: nonsystematic misclassification of a dependent variable. This also introduces bias as evidenced by the naïve estimator's inaccuracy. Our MLA method
is able to correct this error and provide consistent estimates.
Surprisingly, the MI estimator is inconsistent and does not improve with more human-labeled data.
%Note that the GMM estimator is not designed to correct misclassifications in the outcome.
The PL approach is also inconsistent, especially when only a small fraction of observations is annotated manually. It is closer to recovering the true parameter than the MI or the naïve estimator, but provides only modest improvements in precision compared to the feasible estimator.
The precision of the MLA estimator improves with additional observations to a greater extent than that of the PL estimator.
When the amount of human-labeled data is low, inaccuracies in the 95\% confidence intervals of both the MLA and PL methods become visible due to the poor finite-sample properties of the quadratic approximation for standard errors.
%As before, PL's inaccurate confidence intervals are due to its use of finite-sample estimates of automated classification probabilities.
%In both cases, the poor finite-sample properties of the fischer-information quadratic approximation contribute to this inaccuracy. In Appendix \ref{appendix:sim1.profile}, we show that the MLA method's inaccuracy vanishes when using the profile-likelihood method instead.
In brief, our simulations suggest that MLA is the best error correction method when random misclassifications affect the dependent variable. It is the only consistent option and more efficient than the PL method, which is almost consistent.
\begin{figure}
<<example3.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
p <- plot.simulation.dv(plot.df.example.3,'z')
grid.draw(p)
@
\caption{Simulation 2a: Nonsystematic misclassification of a dependent variable. Only the MLA approach obtains consistent estimates. \label{fig:sim2a.x}}
\end{figure}
\subsection{\emph{Simulation 2b:} Systematic Misclassification of a Dependent Variable}
\begin{figure}
<<example.4.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
p <- plot.simulation.dv(plot.df.example.4,'x')
grid.draw(p)
@
\caption{Simulation 2b: Systematic misclassification of a dependent variable. Only the MLA approach obtains consistent estimates. \label{fig:sim2b.x}}
\end{figure}
In \emph{Simulation 2b}, misclassifications of the dependent variable $Y$ are correlated with an independent variable $X$. As shown in Figure \ref{fig:sim2b.x}, this causes dramatic bias in the naïve estimator.
Similar to \emph{Simulation 2a}, MI is inconsistent. PL is also inconsistent because it does not account for $Y$ when correcting for misclassifications.
As in \emph{Simulation 1b}, our MLA method obtains consistent estimates, but only does much better than the feasible estimator when the dataset is large.
Figure \ref{fig:sim2b.z} in Appendix \ref{appendix:main.sim.plots} shows that the precision of estimates for the coefficient for $X$ improves with additional data to a greater extent. As such, this imprecision is mainly in estimating the coefficient for $Z$, the variable correlated with misclassification.
Therefore, our simulations suggest that MLA is the best method when misclassifications in the dependent variable are correlated with an independent variable.
\section{Transparency about Misclassification Is Not Enough—We Have To Fix It! Recommendations for Automated Content Analysis}
%This suggests that quantifying an AC's predictive performance by comparing human-labeled validation data to automated classifications sufficiently establishes an AC's validity and thereby the validity of downstream analyses.
Like \citet{grimmer_text_2013}, we are deeply concerned that computational methods may produce invalid evidence. In this sense, their validation mantra animates this paper. But transparency about misclassification rates via metrics such as precision or recall leaves unanswered an important question: Is comparing automated classifications to some external ground truth sufficient to claim that results are valid? Or is there something else we can do and should do?
We think there is: Using statistical methods to not only quantify but also correct for misclassification. Our study provides several recommendations in this regard, as summarized in Figure \ref{fig:FigureRecommendations}.
\begin{figure}[hbt!]
\centering
\caption{Overview of our recommendations for correcting misclassification bias. \label{fig:FigureRecommendations}}
\end{figure}

One may for example need a large dataset to study an effect one assumes to be small.
Often, ACs are seen as a cost-saving procedure without consideration of the threats to validity posed by misclassification.
Moreover, validating an existing AC or building a new AC is also expensive, for instance due to costs of computational resources or manual annotation of (perhaps smaller) test and training datasets.
We therefore caution researchers against preferring automated over manual content analysis unless doing so is necessary to obtain useful evidence. We agree with \citet{baden_three_2022} who argue that ``social science researchers may be well-advised to eschew the promises of computational tools and invest instead into carefully researcher-controlled, limited-scale manual studies'' (p. 11). In particular, we recommend using manually annotated data \textit{ante facto}: Researchers should begin by examining human-annotated data to discern whether an AC is necessary. In our simulations, the feasible estimator is less precise but consistent in all cases. So if fortune shines and this estimate sufficiently answers one's research question, manual coding is sufficient. Here, scholars should rely on existing recommendations for descriptive and inferential statistics in the context of manual content analysis \citep{geis_statistical_2021, bachl_correcting_2017}. If, however, the feasible estimator fails to provide convincing evidence, for example by not rejecting the null, manually annotated data is not wasted. It can be reused to build an AC or correct misclassification bias.
%One potential problem of this \textit{ante facto} approach is that conducting two statistical tests of the same hypothesis increases the chances of false discover. A simple solution to this is to adjust the significance threshold $\alpha$ for drawing conclusions from the feasible estimate. %We recommend p < .01. %That said, it might useful use an AC in a preliminary analysis, prior to collecting validation data when an AC such as one available from an API, is available for reuse and confusion matrix quantities necessary for the pseudo-likelihood (PL) method are published. Although (PL) is inconsistent when used for a covariate, this can be corrected if the true rate of $X$ can be estimated.
%Caution is still warranted because ACs can perform quite differently from one dataset to another so we recommend collecting validation representative of your study's dataset and using another appropriate method for published studies.
\subsubsection{Step 3: Correct for Misclassification Bias Instead of Being Naïve}
Across our simulations, we showed that the naïve estimator is biased. Testing different error correction methods, we found that these generate different levels of consistency, efficiency, and accuracy in uncertainty quantification. That said, our proposed MLA method is versatile: it is the only method capable of producing consistent estimates in all of the prototypical situations studied here. We recommend the MLA method as the first ``go-to'' method. As shown in Appendix \ref{appendix:nozp}, this method requires specifying a valid error model to obtain consistent estimates. One should take care that the error model does not omit variables, nonlinearities, or interactions.
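To sketch the idea in the notation of \emph{Simulation 1} (the full derivation is in Appendix \ref{appendix:derivation}): manually annotated observations contribute the joint likelihood of $(Y, W, X)$, while unannotated observations contribute a likelihood that marginalizes over the unobserved $X$,
\[
P(Y, W \mid Z) = \sum_{x} P(Y \mid X=x, Z)\, P(W \mid X=x, Y, Z)\, P(X=x \mid Z),
\]
with the error model $P(W \mid X, Y, Z)$ and the covariate model $P(X \mid Z)$ estimated jointly with the primary model $P(Y \mid X, Z)$.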
Our \textbf{misclassificationmodels} R package provides reasonable default error models and a user-friendly interface to facilitate adoption of our MLA method (see Appendix \ref{appendix:misclassificationmodels}).
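To illustrate the workflow, a minimal sketch of such a call follows; the function and argument names here are illustrative, and the actual interface is documented in Appendix \ref{appendix:misclassificationmodels}.
\begin{lstlisting}[style=mystyle, language=R]
library(misclassificationmodels)
# 'w' holds automated classifications of the ground-truth variable 'x',
# which is observed only for the manually annotated subset of 'd'.
# First formula: primary model for Y; second formula: error model for W,
# including all observed variables. (Names are illustrative.)
res <- glm_fixit(y ~ x + z, formula2 = w ~ x + y + z, data = d)
summary(res)
\end{lstlisting}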
When feasible, we recommend comparing the MLA approach to another error correction method. Consistency between two correction methods shows that results are robust independent of the correction method. If the AC is used to predict an independent variable, GMM is a good choice if error is nondifferential. Otherwise, MI can be considered.
Unfortunately, if the AC is used to predict a dependent variable, our simulations do not support a strong suggestion for a second method.
PL might be a reasonable choice given enough manually annotated data and nondifferential error.
This range of viable choices in error correction methods also motivates our next recommendation.
\section{Conclusion and Limitations}
Misclassification bias is an important threat to validity in studies that use automatic classifiers to measure statistical variables.
As we showed in an example with data from the Perspective API, widely used and very accurate automatic classifiers can cause type-1 and type-2 errors.
As evidenced by our literature review, this problem has not attracted enough attention within communication science \citep[but see][]{bachl_correcting_2017} or even in the broader computational social science community.
Therefore, although current best practices of reporting metrics of classifier performance on manually annotated validation data are important, they provide little protection from misclassification bias.
These practices use annotations to enact a transparency ritual to ward off misclassification bias, but annotations can do much more. With the right statistical model, they can correct misclassification bias.
We introduce maximum likelihood adjustment (MLA), a new method we designed to correct misclassification bias, and use Monte-Carlo simulations to evaluate it and compare it to other recently proposed methods.
Our method is the only one that is effective across a wide range of scenarios. It is also straightforward to use. Our implementation in the R package \texttt{misclassificationmodels} provides a familiar formula interface for regression models.
Remarkably, our simulations show that our method can use even an automatic classifier below common accuracy standards to obtain consistent estimates. Therefore, low accuracy is not necessarily a barrier to using an AC.
Based on these results, we provide four recommendations for the future of automated content analysis: Researchers should (1) attempt manual content analysis before building or validating ACs to see whether human-labeled data is sufficient, (2) use manually annotated data to test for systematic misclassification and choose appropriate error correction methods, (3) correct for misclassifications via error correction methods, and (4) be transparent about the methodological decisions involved in AC-based classifications and error correction.
Our study has several limitations. First, the simulations and methods we introduce focus on misclassification by automated tools. They provisionally assume that human annotators do not make errors, especially systematic ones.
This assumption can be reasonable if intercoder reliability is very high but, as with ACs, this may not always be the case.
%Alternatively, validation data can be treated as a gold standard if the goal is measuring \emph{how a person categorizes content}, as opposed to the more common approach of measuring presumably objective content categories. That said, the prevailing approaches in content analysis use human coders to measure a latent category who are prone to misclassification.
Thus, it may be important to account for measurement error by human coders \citep{bachl_correcting_2017} and by automated classifiers simultaneously. In theory, it is possible to extend our MLA approach in order to do so \citep{carroll_measurement_2006}.
However, because the true values of content categories are never observed, accounting for automated and human misclassification at once requires latent variable methods that bear considerable additional complexity and assumptions \citep{pepe_insights_2007}. We leave the integration of such methods into our MLA framework for future work. In addition, our method requires an additional assumption that the error model is correct. As we argue in Appendix \ref{appendix:assumption}, this assumption is often acceptable.
Second, the simulations we present do not consider all possible factors that may influence the performance and robustness of error correction methods, including classifier accuracy, heteroskedasticity, and violations of distributional assumptions. We have begun to investigate such factors by extending our simulations, as shown in Appendix \ref{appendix:main.sim.plots}.
\setcounter{biburlnumpenalty}{9001}
\printbibliography[title = {References}]
\clearpage
\appendix
\addcontentsline{toc}{section}{Appendices}
\stepcounter{section}
\section{Perspective API Example}
\label{appendix:perspective}
Our example relies on the publicly available Civil Comments dataset \citep{cjadams_jigsaw_2019}. The dataset contains around 2 million comments collected from independent English-language news sites between 2015 and 2017. We rely on a subset of \Sexpr{f(dv.example[['n.annotated.comments']])} comments which were manually annotated both for toxicity (\emph{toxicity}) and disclosure of identity (\emph{disclosure}) in a comment. The dataset also includes counts of likes each comment received (\emph{number of likes}).
\subcaption{\emph{Example 2}: Misclassification in a dependent variable. \label{fig:real.data.example.dv.app}}
\end{subfigure}
\caption{Real-data example including correction using MLA.}
\end{figure}
% Our maximum-likelihood based error correction technique in this example requires specifying models for the Perspective's scores and, in the case where these scores are used as a covariate, a model for the human annotations. In our first example, where toxicity was used as a covariate, we used the \emph{human annotations}, \emph{identity disclosure}, and the interaction of these two variables in the model for scores. We omitted \emph{likes} from this model because they are virtually uncorrelated with misclassifications (Pearson's $\rho=\Sexpr{iv.example[['civil_comments_cortab']]['toxicity_error','likes']}$). Our model for the human annotations is an intercept-only model.
\section{Deriving the Maximum Likelihood Adjustment}
\label{appendix:derivation}
In the following, we derive our MLA approach for addressing misclassifications.
\subsection{When an AC Measures an Independent Variable}
To explain why the MLA approach is effective, we follow \citet{carroll_measurement_2006} and begin by observing the following fact from basic probability theory:
\begin{align}
P(Y,W) &= \sum_{x}{P(Y,W,X=x)}\\
&= \sum_{x}{P(W|Y,X=x)P(Y|X=x)P(X=x)}
\end{align}
\noindent The summands factor into the error model $P(W|Y,X)$, the outcome model $P(Y|X)$, and the truth model $P(X)$, so specifying these three models suffices to evaluate $L(\theta|Y,W)$.
\subsection{Comment on model assumptions}
\label{appendix:assumption}
How burdensome is the assumption that the error model be able to consistently estimate the conditional probability of $W$ given $Y$? If this assumption were much more difficult than those already accepted by the model for $Y$ given $X$ and $Z$, one would fear that using the MLA correction method introduces greater validity threats than it removes. In particular, one may worry that unobserved variables $U$ are omitted from our model for $P(Y,W)$. As demonstrated in Appendix \ref{appendix:misspec}, the MLA method is less effective when variables are omitted from the error model.
To see why, first suppose $U$ is an omitted variable from $P(W|X,Y,Z)$. Then $U$ is correlated with $W$ and at least one of $X$, $Y$, and $Z$. If $U$ is correlated with $Y$, then either it is an omitted variable from $P(Y|X,Z)$ or $P(Y|X,Z)=P(Y|U,X,Z)$.
However, if we believe our outcome model for $P(Y|X,Z)$ is consistent, this threat is substantially reduced: if one can assume a model for $P(Y|X,Z)$, it is often reasonable to assume that the variables needed to model $P(W|X,Y,Z)$ are observed.
Furthermore, since $W$ is an output from an automatic classifier it depends only on the classifier's features, which are observable in principle. As a result, and as suggested by \citet{fong_machine_2021}, one should consider including all such features in the error model.
Assuming the latter, observe by conditional probability,
\begin{align}
P(W|U,X,Y,Z)&=\frac{P(U,W,X,Y,Z)}{P(U,X,Y,Z)} = \frac{P(Y|U,W,X,Z)P(U,W,X,Z)}
{P(Y|U,X,Z)P(U,X,Z)}\\ &= \frac{P(U,W,X,Z)}{P(U,X,Z)} = P(W|X,Y,Z)
\end{align}
\noindent Note that $W$ is not an omitted variable from $P(Y|X,Z)$. As a result, $P(W|U,X,Y,Z) = P(W|X,Y,Z)$ and $U$ is not omitted from our model for $P(W|X,Y,Z)$.
However, due to the highly nonlinear nature of machine learning classifiers, specifying the functional form of the error model may require care in practice. One option is to calibrate an AC's scores to one's dataset and thereby obtain accurate estimates of its predicted probabilities.
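As a sketch of this calibration option (in the spirit of Platt scaling), one can regress the human annotations on the raw classifier scores; the data and variable names below are illustrative assumptions, not part of our package or simulations:

```r
# Hypothetical recalibration of AC scores against human annotations
# using logistic regression (Platt-style); data are simulated for illustration.
set.seed(2)
score <- runif(500)                              # raw AC scores in [0, 1]
truth <- rbinom(500, 1, plogis(4 * score - 2))   # human annotations
cal <- glm(truth ~ score, family = binomial)     # logistic calibration model
calibrated <- predict(cal, type = "response")    # calibrated P(truth = 1 | score)
```

The fitted values can then serve as better-behaved estimates of the classifier's predicted probabilities when specifying the error model.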
In sum, if one can assume a model for $P(Y|X,Z)$, it is often reasonable to assume that the variables needed to model $P(W|X,Y,Z)$ are observed. Any such variables that are unobserved must be independent of $Y$.
\section{misclassificationmodels: The R package} \label{appendix:misclassificationmodels}
p <- plot.simulation.iv(plot.df.example.2, iv='z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ in multivariate regression with $X$ measured using machine learning, model accuracy correlated with $X$ and $Y$, and differential error. Only multiple imputation and our MLA model with a full specification of the error model obtain consistent estimates of $B_Z$.\label{fig:sim1b.z}}
\end{figure}
\begin{figure}
p <- plot.simulation.dv(plot.df.example.3,'z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ in \emph{simulation 2a}, multivariate regression with $Y$ measured using an AC that makes errors. Only our MLA model with a full specification of the error model obtains consistent estimates.}
\end{figure}
\begin{figure}
p <- plot.simulation.dv(plot.df.example.4,'x')
grid.draw(p)
@
\caption{Estimates of $B_X$ in \emph{simulation 2b}, multivariate regression with $Y$ measured using machine learning, model accuracy correlated with $Z$ and $Y$, and differential error. Only our MLA model with a full specification of the error model obtains consistent estimates. \label{fig:sim2b.z}}
\end{figure}
\subsection{Simulating what happens when an error model is misspecified}
\label{appendix:misspec}
In simulations 1b and 2b, the MLA method was able to correct systematic misclassification using the error models in equations \ref{eq:covariate.reg.general} and \ref{eq:depvar.general}.
However, this depends on the error model consistently estimating the conditional probability of automatic classifications given the true value and the outcome.
If the misclassifications and the outcome are conditionally dependent given a variable $Z$ that is omitted from the model, this will not be possible.
Here, we demonstrate how such misspecification of the error model can affect results.
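To see the problem in one identity: an error model fit without $Z$ can at best recover the mixture
\begin{align}
P(W|X,Y) = \sum_{z}{P(W|X,Y,Z=z)P(Z=z|X,Y)},
\end{align}
\noindent which depends on the joint distribution of $Z$ with the other variables. When $W$ and the outcome are conditionally dependent given $Z$, no $Z$-free error model can reproduce $P(W|X,Y,Z)$, and the correction inherits this misspecification.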
\subsubsection{Systematic Misclassification of an Independent Variable with $Z$ omitted from the error model}
\label{appendix:noz}
What happens in simulation 1b, representing systematic misclassification of an independent variable, when the error model is missing the variable $Z$? As shown in Figure \ref{fig:iv.noz}, this incorrect MLA model is unable to fully correct misclassification bias. Although the estimate of $B_X$ is close to correct, the estimate of $B_Z$ is clearly biased, albeit improved compared to the naïve estimator.
%Here we refer to $P(Y|X,Z,\Theta_Y)$ as the ``outcome model'', $P(W|Y,X,Z,\Theta_W)$ as the ``proxy model'', and $P(X|Z,\Theta_X)$ as the ``truth model''.
\label{fig:iv.predacc}
\end{figure}
\subsection{Simulating misclassification in imbalanced variables}
For simplicity, our main simulations have balanced classified variables, but classifiers are often used to measure imbalanced variables, which can be more difficult to predict. Here, by replicating versions of our simulations 1a and 2a with 5,000 classifications and 200 annotations, we show that MLA correction performs similarly well with imbalanced classified variables. Notably, the quality of the methods' uncertainty quantification tends to degrade as imbalance increases, suggesting that imbalanced data require additional validation data for effective misclassification correction.
\subsubsection{Imbalance in classified independent variables}
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<iv.imbalance.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=0.5,cache=F>>=
p <- plot.robustness.3.iv('x',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\label{fig:dv.predacc.x}
\caption{Estimates of $B_X$ from simulation 1a with 5,000 classifications and 200 annotations as the probability of $X$ varies from $0.5$ to $0.95$, as the facet labels indicate.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
<<iv.imbalance.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.3.iv('z',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\label{fig:dv.predacc}
\caption{Estimates of $B_Z$ from simulation 1a with 5,000 classifications and 200 annotations as the probability of $X$ varies from $0.5$ to $0.95$, as the facet labels indicate.}
\end{subfigure}
\caption{Imbalance in a misclassified independent variable. Imbalance requires additional statistical power and biases uncertainty quantification in error correction methods. PL has very wide confidence intervals and so is excluded for clarity.}
\label{fig:dv.predacc}
\end{figure}
\subsubsection{Imbalance in classified dependent variables}
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<dv.imbalance.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=0.5,cache=F>>=
p <- plot.robustness.3.dv('x',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\label{fig:dv.predacc.x}
\caption{Estimates of $B_X$ from simulation 2a with 5,000 classifications and 200 annotations with $Y$'s base rate ranging from $0.5$ to $0.95$.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
<<dv.imbalance.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.3.dv('z',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\label{fig:dv.predacc}
\caption{Estimates of $B_Z$ from simulation 2a with 5,000 classifications and 200 annotations with $Y$'s base rate ranging from $0.5$ to $0.95$.}
\end{subfigure}
\caption{Imbalance in a misclassified dependent variable. Imbalance requires additional statistical power and biases uncertainty quantification in error correction methods. PL has very wide confidence intervals and so is excluded for clarity.}
\label{fig:dv.predacc}
\end{figure}
\subsection{Simulating a range of classifier biases}
Now, we explore what happens as misclassification becomes more or less systematic in replications of simulations 1b and 2b having 1,000 classifications and 100 annotations. We vary the amount of systematic misclassification in simulation 1b via the logistic regression coefficient of $Y$ on $W$ in our simulated data generating process, while keeping the overall classifier accuracy close to 0.73. Similarly, in simulation 2b we use a range of values for the coefficient of $Z$ on $W$.
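To illustrate the kind of data generating process varied here, the sketch below draws a proxy $W$ whose errors depend on $Y$ through a logistic term; the specific coefficient values are illustrative assumptions rather than the exact values used in our simulations:

```r
# Hypothetical DGP with systematic misclassification of an independent variable:
# the proxy w depends on the outcome y as well as on the true value x.
set.seed(1)
n <- 1000
z <- rbinom(n, 1, 0.5)                                 # covariate
x <- rbinom(n, 1, 0.5)                                 # true independent variable
y <- rbinom(n, 1, plogis(0.3 * x + 0.3 * z))           # outcome
y_bias <- -1                      # coefficient of y on w; varied in simulation 1b
w <- rbinom(n, 1, plogis(-1.5 + 3 * x + y_bias * y))   # systematic misclassification
accuracy <- mean(w == x)                               # overall classifier accuracy
```

Varying `y_bias` makes the error more or less systematic; in our simulations the remaining parameters are tuned to hold overall accuracy roughly constant.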
\subsubsection{Systematic independent variable misclassification}
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<iv.bias.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=0.5,cache=F>>=
p <- plot.robustness.4.iv('x')
grid.draw(p)
@
\label{fig:dv.predacc.x}
\caption{Estimates of $B_X$ in simulation 1b variants with the logistic regression coefficient of $Y$ on $W$ ranging from $-3$ to $0.25$, as indicated by the facet labels.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
<<iv.bias.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.4.iv('z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ in simulation 1b variants with the logistic regression coefficient of $Y$ on $W$ ranging from $-3$ to $0.25$ as indicated by facet labels.}
\end{subfigure}
\caption{Systematically misclassified independent variable. As misclassification becomes more systematic, it causes greater bias in the naïve estimator and becomes more difficult to correct. Nevertheless, the MLA method remains consistent.}
\label{fig:dv.predacc}
\end{figure}
\subsubsection{Systematic dependent variable misclassification}
In the case of systematic misclassification in the dependent variable, we can observe that the bias in the naïve estimator switches from negative to positive as systematic misclassification increases.
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<dv.bias.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=0.5,cache=F>>=
p <- plot.robustness.4.dv('x')
grid.draw(p)
@
\label{fig:dv.predacc.x}
\caption{Estimates of $B_X$ in variants of simulation 2b with 1,000 classifications and 100 annotations as the logistic regression coefficient of $Z$ on $W$ ranges, as indicated on the facet labels.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
<<dv.bias.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.4.dv('z')
grid.draw(p)
@
\label{fig:dv.predacc}
\caption{Estimates of $B_Z$ in variants of simulation 2b with 1,000 classifications and 100 annotations as the logistic regression coefficient of $Z$ on $W$ ranges, as indicated on the facet labels.}
\end{subfigure}
\caption{Systematically misclassified dependent variable. As misclassification becomes more systematic, it causes greater bias in the naïve estimator and becomes more difficult to correct. Nevertheless, the MLA method remains consistent.}
\label{fig:dv.predacc}
\end{figure}
%However, if one can assume the model for $Y$, then one believes that $Y$ and $X$ are conditionally independent given other observed variables.
In this section we present the design of our simulation studies.
\subsection{Definition of MLA Models}
We model example 1 and 2,
\section{Discussion}

\node[unobserved] (y) {$Y$};
\node[observed, above=of y] (x) {$X$};
\node[observed, right=of y] (w) {$W$};
% \node[unobserved, above=of w] (k) {$K$};
\node[observed,right=of x] (z) {$Z$};
\node[unobserved] (y) {$Y$};
\node[observed={white}{gray!40}, above=of y] (x) {$X$};
\node[observed, right=of y] (w) {$W$};
% \node[unobserved, above=of w] (k) {$K$};
\node[observed,right=of x] (z) {$Z$};
% \node[residual,below=of y] (e) {$\varepsilon$};
% \node[residual,below=of w] (xi) {$\xi$};
\draw[-] (x) -- (y);
\draw[-] (z) -- (w);
\draw[-] (y) -- (w);
\draw[-] (x) -- (z);
% \draw[-] (k) -- (w);
% \draw[-] (y) -- (xi);
% \draw[-] (w) -- (xi);
\end{tikzpicture}
\caption{In \emph{Simulation 2b}, the edge connecting $W$ and $Z$ signifies that the predictions $W$ are not conditionally independent of $Z$ given $Y$, indicating systematic misclassification. \label{fig:simulation.2b}}
\end{subfigure}
\vspace{1em}
\begin{subfigure}[t]{0.2\textwidth}

\node[mylabel, anchor=south west] (independent) [below=8ex of correct,xshift=1.23in] {Independent\\ variable};
\node[mylabel, anchor=south west] (dependent) [below=10ex of independent] {Dependent\\ variable};
\node[outcome box] (outcome_systematic_iv) [below =3.8in of report_manual] {Use MLA or MI.};
\node[outcome box] (outcome_nonsystematic_iv) [above =3ex of outcome_systematic_iv] {Use GMM or MLA.};
% \node[outcome box] (outcome_systematic_dv) [below =2in of outcome_nonsystematic_iv] {Use MLA.};
\node[outcome box] (outcome_dv) [below =3ex of outcome_systematic_iv] {Use MLA.};
% & \node[] (iv_1) {Independent variable}; & \node[decision box] (dv_1) {Dependent variable}; \\

library(data.table)
library(ggplot2)
source('resources/functions.R')
plot.robustness.1 <- function(iv='x'){
## robustness check 1 test g
r <- readRDS('robustness_1.RDS')
baseline_df <- readRDS('remembr.RDS')[['plot.df.example.2']]
robust_df <- data.table(r$robustness_1)
## just compare the mle methods in the two examples
robust_df <- robust_df[Bzy!=0]
robust_df <- robust_df[Bzx!=0]
baseline_df[method=='true', method:='True']
robust_df[method=='true', method:='True']
baseline_df <- baseline_df[(method=='mle') | (method=='True') | (method=='naive')]
robust_df <- robust_df[(method=='mle') | (method=='True')]
baseline_df[method=='mle',method:='MLE Reported']
robust_df[method=='mle',method:='No Z in Error Model']
df <- rbind(baseline_df, robust_df, fill=TRUE)
df[method=='naive', method:='Naive']
df <- df[(N %in% c(1000,5000)) & (m %in% c(200,100))]
p <- plot.simulation(df,iv=iv,levels=c('MLE Reported','No Z in Error Model', 'Naive', 'True'))
grid.draw(p)
}
plot.robustness.1.checkassumption <- function(iv='x'){
## robustness check 1 test g
r <- readRDS('robustness_1.RDS')
baseline_df <- readRDS('remembr.RDS')[['plot.df.example.2']]
robust_df <- data.table(r$robustness_1)
## just compare the mle methods in the two examples
robust_df <- robust_df[Bzy==0]
robust_df <- robust_df[Bzx!=0]
baseline_df[method=='true', method:='True']
robust_df[method=='true', method:='True']
baseline_df <- baseline_df[(method=='mle') | (method=='naive')]
robust_df <- robust_df[(method=='mle') | (method=='True')]
baseline_df[method=='mle',method:='MLE Reported']
robust_df[method=='mle',method:='No Z in Error Model']
df <- rbind(baseline_df, robust_df, fill=TRUE)
df[method=='naive', method:='Naive']
df <- df[(N %in% c(1000,5000)) & (m %in% c(200,100))]
p <- plot.simulation(df,iv=iv,levels=c('MLE Reported','No Z in Error Model', 'Naive', 'True'))
grid.draw(p)
}
plot.robustness.1.dv <- function(iv='z'){
## robustness check 1 test g
r <- readRDS('robustness_1_dv.RDS')
baseline_df <- readRDS('remembr.RDS')[['plot.df.example.4']]
robust_df <- data.table(r$robustness_1_dv)
## just compare the mle methods in the two examples
baseline_df[method=='true', method:='True']
robust_df[method=='true', method:='True']
robust_df <- robust_df[Bxy!=0]
robust_df <- robust_df[Bzy!=0]
# robust_df <- robust_df[Bzx==-0.1]
baseline_df <- baseline_df[(method=='mle') | (method=='True') | (method=='naive')]
robust_df <- robust_df[(method=='mle') | (method=='True')]
baseline_df[method=='mle',method:='MLE Reported']
robust_df[method=='mle',method:='No Z in Error Model']
df <- rbind(baseline_df, robust_df, fill=TRUE)
df <- df[(N %in% c(1000,5000)) & (m %in% c(200,100))]
df[method=='naive', method:='Naive']
p <- plot.simulation(df,iv=iv,levels=c('MLE Reported','No Z in Error Model','Naive', 'True'))
grid.draw(p)
}
plot.robustness.2.iv <- function(iv, n.annotations=100, n.classifications=5000){
r <- readRDS("robustness_2.RDS")
robust_df <- data.table(r[['robustness_2']])
robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
robust_df <- robust_df[,method := new.levels[method]]
robust_df <- robust_df[method != "Feasible"]
p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
p <- p + facet_wrap(prediction_accuracy~., ncol=4,as.table=F)
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p,
top=grid.text("AC Accuracy",x=0.32,just='right'))
grid.draw(p)
}
robust2 <- readRDS("robustness_2_dv.RDS")
robust_2_df <- data.table(robust2[['robustness_2_dv']])
robust_2_min_acc <- min(robust_2_df[,prediction_accuracy])
robust_2_max_acc <- max(robust_2_df[,prediction_accuracy])
plot.robustness.2.dv <- function(iv, n.annotations=100, n.classifications=5000){
r <- readRDS("robustness_2_dv.RDS")
robust_df <- data.table(r[['robustness_2_dv']])
#temporary work around a bug in the makefile
## if('Px' %in% names(robust_df))
## robust_df <- robust_df[is.na(Px)]
robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
robust_df <- robust_df[,method := new.levels[method]]
robust_df <- robust_df[method != "Feasible"]
p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
p <- p + facet_wrap(prediction_accuracy~., ncol=4,as.table=F)
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p,
top=grid.text("AC Accuracy",x=0.32,just='right'))
grid.draw(p)
}
plot.robustness.3.iv <- function(iv, n.annotations=100, n.classifications=5000){
r <- readRDS('robustness_3.RDS')
robust_df <- data.table(r[['robustness_3']])
r2 <- readRDS('robustness_3_proflik.RDS')
robust_df_proflik <- data.table(r2[['robustness_3_proflik']])
new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
robust_df <- robust_df[,method := new.levels[method]]
robust_df <- robust_df[method != "Feasible"]
robust_df <- robust_df[method=='MLE',method:='Fischer approximation']
robust_df_proflik <- robust_df_proflik[(m==n.annotations) & (N==n.classifications)]
robust_df_proflik <- robust_df_proflik[,method := new.levels[method]]
robust_df_proflik <- robust_df_proflik[method=='MLE',method:='Profile likelihood']
robust_df_proflik <- robust_df_proflik[method != "Feasible"]
# combine both sets of estimates before subsetting to the focal design point
df <- rbind(robust_df, robust_df_proflik)
df <- df[(m==n.annotations) & (N==n.classifications)]
p <- .plot.simulation(df, iv=iv, levels=c("True","Naïve","MI", "GMM", "Profile likelihood","Fischer approximation", "PL", "Feasible"))
p <- p + facet_wrap(Px~., ncol=3,as.table=F)
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p,
top=grid.text("P(X)",x=0.32,just='right'))
grid.draw(p)
}
plot.robustness.3.dv <- function(iv, n.annotations=100, n.classifications=1000){
r <- readRDS('robustness_3_dv.RDS')
robust_df <- data.table(r[['robustness_3_dv']])
new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","mle"="MLE", "zhang"="PL","feasible"="Feasible")
robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
robust_df <- robust_df[,method := new.levels[method]]
robust_df <- robust_df[method != "Feasible"]
p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
p <- p + facet_wrap(B0~., ncol=3,as.table=F)
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p,
top=grid.text("P(Y)",x=0.32,just='right'))
grid.draw(p)
}
plot.robustness.4.iv <- function(iv, n.annotations=100, n.classifications=1000){
r <- readRDS('robustness_4.RDS')
robust_df <- data.table(r[['robustness_4']])
new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
robust_df <- robust_df[,method := new.levels[method]]
robust_df <- robust_df[method != "Feasible"]
robust_df <- robust_df[,y_bias:=factor(y_bias,levels=sort(unique(y_bias),decreasing=TRUE))]
p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
p <- p + facet_wrap(y_bias~., ncol=3,as.table=T)
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p,
top=grid.text("Coefficient of Y for W",x=0.32,just='right'))
grid.draw(p)
}
plot.robustness.4.dv <- function(iv, n.annotations=100, n.classifications=1000){
r <- readRDS('robustness_4_dv.RDS')
robust_df <- data.table(r[['robustness_4']])
new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","mle"="MLE", "zhang"="PL","feasible"="Feasible")
robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
robust_df <- robust_df[,method := new.levels[method]]
robust_df <- robust_df[method != "Feasible"]
robust_df <- robust_df[,z_bias:=factor(z_bias, levels=sort(unique(z_bias),decreasing=TRUE))]
p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
p <- p + facet_wrap(z_bias~., ncol=3,as.table=F)
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p,
top=grid.text("Coefficient of Z on W",x=0.32,just='right'))
grid.draw(p)
}
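A recurring pitfall in the functions above is worth flagging: inside a `data.table`, a derived column must be assigned with the update operator `:=` (writing `y_bias=factor(...)` in the `j` slot is a syntax error), and base R's `sort()` takes `decreasing=`, not `descending=`. A minimal, self-contained sketch of the corrected idiom, using a toy table in place of the simulation results:

```r
library(data.table)

# Toy stand-in for robust_df; 'y_bias' mirrors the column used for faceting.
dt <- data.table(method = c("mle", "naive", "true"), y_bias = c(0.2, -0.1, 0.2))

# ':=' adds/updates the column by reference; sort() takes 'decreasing', not 'descending'.
dt[, y_bias := factor(y_bias, levels = sort(unique(y_bias), decreasing = TRUE))]

levels(dt$y_bias)  # largest value first, matching the facet ordering used in the plots
```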

@@ -39,7 +39,7 @@ plot.simulation <- function(plot.df, iv='x', levels=c("true","naive", "amelia.fu
 p <- p + geom_hline(aes(yintercept=true.est),linetype=2)
 p <- p + geom_pointrange(shape=1,size=0.5)
-p <- p + geom_linerange(aes(ymax=mean.ci.upper, ymin=mean.ci.lower),position=position_nudge(x=0.4), color='grey40')
+p <- p + geom_linerange(aes(ymax=median.ci.upper, ymin=median.ci.lower),position=position_nudge(x=0.4), color='grey40')
 return(p)
 }
@@ -48,10 +48,10 @@ plot.simulation <- function(plot.df, iv='x', levels=c("true","naive", "amelia.fu
 plot.simulation.iv <- function(plot.df, iv='x'){
 plot.df <- plot.df[(N!=8000) & (m!=800) & (m!=200)]
-new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
+new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLA", "zhang"="PL","feasible"="Feasible")
 plot.df[,method := new.levels[method]]
-return(plot.simulation(plot.df, iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible")))
+return(plot.simulation(plot.df, iv, levels=c("True","Naïve","MI", "GMM", "MLA", "PL", "Feasible")))
 }
@@ -59,10 +59,10 @@ plot.simulation.dv <- function(plot.df, iv='x'){
 plot.df <- copy(plot.df)
 plot.df <- plot.df[(N!=8000) & (m!=800) & (m!=200)]
-new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
+new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLA", "zhang"="PL","feasible"="Feasible")
 plot.df[,method:=new.levels[method]]
-return(plot.simulation(plot.df, iv, levels=c("True","Naïve", "MI","MLE","PL","Feasible")))
+return(plot.simulation(plot.df, iv, levels=c("True","Naïve", "MI","MLA","PL","Feasible")))
 }
 plot.simulation.irr <- function(plot.df,iv='x'){

@@ -18,13 +18,13 @@ plot.robustness.1 <- function(iv='x'){
 baseline_df <- baseline_df[(method=='mle') | (method=='True') | (method=='naive')]
 robust_df <- robust_df[(method=='mle') | (method=='True')]
-baseline_df[method=='mle',method:='MLE Reported']
+baseline_df[method=='mle',method:='MLA Reported']
 robust_df[method=='mle',method:='No Z in Error Model']
 df <- rbind(baseline_df, robust_df, fill=TRUE)
 df[method=='naive', method:='Naive']
 df <- df[(N %in% c(1000,5000)) & (m %in% c(200,100))]
-p <- plot.simulation(df,iv=iv,levels=c('MLE Reported','No Z in Error Model', 'Naive', 'True'))
+p <- plot.simulation(df,iv=iv,levels=c('MLA Reported','No Z in Error Model', 'Naive', 'True'))
 grid.draw(p)
 }
@@ -44,13 +44,13 @@ plot.robustness.1.checkassumption <- function(iv='x'){
 baseline_df <- baseline_df[(method=='mle') | (method=='naive')]
 robust_df <- robust_df[(method=='mle') | (method=='True')]
-baseline_df[method=='mle',method:='MLE Reported']
+baseline_df[method=='mle',method:='MLA Reported']
 robust_df[method=='mle',method:='No Z in Error Model']
 df <- rbind(baseline_df, robust_df, fill=TRUE)
 df[method=='naive', method:='Naive']
 df <- df[(N %in% c(1000,5000)) & (m %in% c(200,100))]
-p <- plot.simulation(df,iv=iv,levels=c('MLE Reported','No Z in Error Model', 'Naive', 'True'))
+p <- plot.simulation(df,iv=iv,levels=c('MLA Reported','No Z in Error Model', 'Naive', 'True'))
 grid.draw(p)
 }
@@ -73,14 +73,14 @@ plot.robustness.1.dv <- function(iv='z'){
 baseline_df <- baseline_df[(method=='mle') | (method=='True') | (method=='naive')]
 robust_df <- robust_df[(method=='mle') | (method=='True')]
-baseline_df[method=='mle',method:='MLE Reported']
+baseline_df[method=='mle',method:='MLA Reported']
 robust_df[method=='mle',method:='No Z in Error Model']
 df <- rbind(baseline_df, robust_df, fill=TRUE)
 df <- df[(N %in% c(1000,5000)) & (m %in% c(200,100))]
 df[method=='naive', method:='Naive']
-p <- plot.simulation(df,iv=iv,levels=c('MLE Reported','No Z in Error Model','Naive', 'True'))
+p <- plot.simulation(df,iv=iv,levels=c('MLA Reported','No Z in Error Model','Naive', 'True'))
 grid.draw(p)
 }
@@ -91,11 +91,11 @@ plot.robustness.2.iv <- function(iv, n.annotations=100, n.classifications=5000){
 robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
-new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
+new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLA", "zhang"="PL","feasible"="Feasible")
 robust_df <- robust_df[,method := new.levels[method]]
 robust_df <- robust_df[method != "Feasible"]
-p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
+p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLA", "PL", "Feasible"))
 p <- p + facet_wrap(prediction_accuracy~., ncol=4,as.table=F)
 p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
@@ -123,11 +123,11 @@ plot.robustness.2.dv <- function(iv, n.annotations=100, n.classifications=5000){
 ## robust_df <- robust_df[is.na(Px)]
 robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
-new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
+new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLA", "zhang"="PL","feasible"="Feasible")
 robust_df <- robust_df[,method := new.levels[method]]
 robust_df <- robust_df[method != "Feasible"]
-p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
+p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLA", "PL", "Feasible"))
 p <- p + facet_wrap(prediction_accuracy~., ncol=4,as.table=F)
 p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
@@ -139,30 +139,30 @@ plot.robustness.2.dv <- function(iv, n.annotations=100, n.classifications=5000){
 }
-plot.robustness.3.iv <- function(iv, n.annotations=100, n.classifications=5000){
+plot.robustness.3.iv <- function(iv, n.annotations=200, n.classifications=5000){
 r <- readRDS('robustness_3.RDS')
 robust_df <- data.table(r[['robustness_3']])
 r2 <- readRDS('robustness_3_proflik.RDS')
 robust_df_proflik <- data.table(r2[['robustness_3_proflik']])
-new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
+new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLA", "zhang"="PL","feasible"="Feasible")
 robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
 robust_df <- robust_df[,method := new.levels[method]]
-robust_df <- robust_df[method != "Feasible"]
-robust_df <- robust_df[method=='MLE',method:='Fischer likelihood']
+robust_df <- robust_df[(method != "Feasible") & (Bzx==0.3)]
+robust_df <- robust_df[(method != "PL")]
+## robust_df <- robust_df[method=='MLA',method:='Fischer likelihood']
-robust_df_proflik <- robust_df_proflik[(m==n.annotations) & (N==n.classifications)]
-robust_df_proflik <- robust_df_proflik[method=='MLE',method:='Profile likelihood']
+## robust_df_proflik <- robust_df_proflik[(m==n.annotations) & (N==n.classifications)]
+## robust_df_proflik <- robust_df_proflik[,method := new.levels[method]]
+## robust_df_proflik <- robust_df_proflik[method=='MLA']
+## robust_df_proflik <- robust_df_proflik[method=='MLA',method:='Profile likelihood']
+## robust_df_proflik <- robust_df_proflik[method != "Feasible"]
+## df <- rbind(robust_df, robust_df_proflik)
-robust_df_proflik <- robust_df_proflik[,method := new.levels[method]]
-robust_df_proflik <- robust_df_proflik[method != "Feasible"]
-df <- rbind(robust_df, robust_df_proflik)
-p <- .plot.simulation(df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
+p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM","MLA", "PL", "Feasible"))
 p <- p + facet_wrap(Px~., ncol=3,as.table=F)
 p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
@@ -177,16 +177,17 @@ plot.robustness.3.dv <- function(iv, n.annotations=100, n.classifications=1000){
 r <- readRDS('robustness_3_dv.RDS')
 robust_df <- data.table(r[['robustness_3_dv']])
-new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","mle"="MLE", "zhang"="PL","feasible"="Feasible")
+new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","mle"="MLA", "zhang"="PL","feasible"="Feasible")
 robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
 robust_df <- robust_df[,method := new.levels[method]]
 robust_df <- robust_df[method != "Feasible"]
+robust_df <- robust_df[,Py := round(plogis(B0),2)]
+robust_df <- robust_df[(method != "PL")]
-p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
+p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLA", "PL", "Feasible"))
-p <- p + facet_wrap(B0~., ncol=3,as.table=F)
+p <- p + facet_wrap(Py~., ncol=3,as.table=F,scales='free')
 p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
 p <- arrangeGrob(p,
@@ -194,20 +195,21 @@ plot.robustness.3.dv <- function(iv, n.annotations=100, n.classifications=1000){
 grid.draw(p)
 }
 plot.robustness.4.iv <- function(iv, n.annotations=100, n.classifications=1000){
 r <- readRDS('robustness_4.RDS')
 robust_df <- data.table(r[['robustness_4']])
-new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
+new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLA", "zhang"="PL","feasible"="Feasible")
 robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
 robust_df <- robust_df[,method := new.levels[method]]
 robust_df <- robust_df[method != "Feasible"]
-robust_df <- robust_df[,y_bias=factor(robust_df$y_bias,levels=sort(unique(robust_df$y_bias),decreasing=TRUE))]
-p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
+robust_df <- robust_df[,y_bias:=factor(robust_df$y_bias,levels=sort(unique(robust_df$y_bias),decreasing=TRUE))]
+robust_df <- robust_df[Bzx==1]
+p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLA", "PL", "Feasible"))
 p <- p + facet_wrap(y_bias~., ncol=3,as.table=T)
 p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
@@ -219,45 +221,21 @@ plot.robustness.4.iv <- function(iv, n.annotations=100, n.classifications=1000){
 }
-plot.robustness.4.iv <- function(iv, n.annotations=100, n.classifications=1000){
-r <- readRDS('robustness_4.RDS')
-robust_df <- data.table(r[['robustness_4']])
-new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","gmm"="GMM", "mle"="MLE", "zhang"="PL","feasible"="Feasible")
-robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
-robust_df <- robust_df[,method := new.levels[method]]
-robust_df <- robust_df[method != "Feasible"]
-robust_df <- robust_df[,y_bias=factor(robust_df$y_bias,levels=sort(unique(robust_df$y_bias),decreasing=TRUE))]
-p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
-p <- p + facet_wrap(y_bias~., ncol=3,as.table=T)
-p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
-p <- arrangeGrob(p,
-top=grid.text("Coefficient of Y for W",x=0.32,just='right'))
-grid.draw(p)
-}
 plot.robustness.4.dv <- function(iv, n.annotations=100, n.classifications=1000){
 r <- readRDS('robustness_4_dv.RDS')
 robust_df <- data.table(r[['robustness_4']])
-new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","mle"="MLE", "zhang"="PL","feasible"="Feasible")
+new.levels <- c("true"="True","naive"="Naïve","amelia.full"="MI", "mecor"="mecor","mle"="MLA", "zhang"="PL","feasible"="Feasible")
 robust_df <- robust_df[(m==n.annotations) & (N==n.classifications)]
 robust_df <- robust_df[,method := new.levels[method]]
 robust_df <- robust_df[method != "Feasible"]
-robust_df <- robust_df[,z_bias=factor(z_bias, levels=sort(unique(z_bias),descending=TRUE))]
-p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLE", "PL", "Feasible"))
+robust_df <- robust_df[,z_bias:=factor(z_bias, levels=sort(unique(z_bias),decreasing=TRUE))]
+robust_df <- robust_df[Bzx==1]
+p <- .plot.simulation(robust_df, iv=iv, levels=c("True","Naïve","MI", "GMM", "MLA", "PL", "Feasible"))
 p <- p + facet_wrap(z_bias~., ncol=3,as.table=F)
 p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()

Binary file not shown.
