
Update on Overleaf.

This commit is contained in:
Valerie Hase
2023-03-03 15:22:06 +00:00
committed by node
parent e38eb94d40
commit 6562f84292
2 changed files with 371 additions and 370 deletions


@@ -234,7 +234,7 @@ Much of this critique targets unjust consequences of these biases to individuals
We argue that current common practices to address such limitations are insufficient. These practices assert validity by reporting classifier performance on manually annotated data quantified via metrics like accuracy, precision, recall, or the F1 score \citep{hase_computational_2022, baden_three_2022, song_validations_2020}.
These steps promote confidence in results by making misclassification transparent, but our example indicates bias can flow downstream into statistical inferences, despite high predictiveness.
Instead of relying only on transparency rituals to ward off misclassification bias, researchers can and should use validation data to correct it.
% \citep{obermeyer_dissecting_2019, kleinberg_algorithmic_2018, bender_dangers_2021, wallach_big_2019, noble_algorithms_2018}.
%For example, \citet{hede_toxicity_2021} show that, when applied to news datasets, the Perspective API overestimates incivility related to topics such as racial identity, violence, and sex.
@@ -358,7 +358,7 @@ In general, if we want to estimate a model $P(Y|\Theta_Y, X, Z)$ for $Y$ given $
The joint probability of $Y$ and $W$ can be factored into the product of three terms: $P(Y|X,Z,\Theta_Y)$, the model with parameters $\Theta_Y$ we want to estimate, $P(W|X,Y, \Theta_W)$, a model for $W$ having parameters $\Theta_W$, and $P(X|Z, \Theta_X)$, a model for $X$ having parameters $\Theta_X$.
Calculating these three conditional probabilities is sufficient to calculate the joint probability of the dependent variable and automated classifications and thereby obtain a consistent estimate despite misclassification. $P(W|X,Y, \Theta_W)$ is called the \emph{error model} and $P(X|Z, \Theta_X)$ is called the \emph{exposure model} \citep{carroll_measurement_2006}.
To illustrate, consider the regression model $Y=B_0 + B_1 X + B_2 Z + \varepsilon$ and automated classifications $W$ of the independent variable $X$.
We can assume that the probability of $W$ follows a logistic regression model of $Y$, $X$, and $Z$ and that the probability of $X$ follows a logistic regression model of $Z$. In this case, the likelihood model below is sufficient to consistently estimate the parameters $\Theta = \{\Theta_Y, \Theta_W, \Theta_X\} = \{\{B_0, B_1, B_2\}, \{\alpha_0, \alpha_1, \alpha_2\}, \{\gamma_0, \gamma_1\}\}$.
\begin{align}
@@ -401,6 +401,7 @@ We now present four Monte Carlo simulations (\emph{Simulations 1a}, \emph{1b}, \
Monte Carlo simulations are a tool for evaluating statistical methods, including (automated) content analysis \citep[e.g.,][]{song_validations_2020, bachl_correcting_2017, geis_statistical_2021, fong_machine_2021, zhang_how_2021}.
They are defined by a data generating process from which datasets are repeatedly sampled. Repeating an analysis for each of these datasets provides an empirical distribution of results the analysis would obtain over study replications. Monte Carlo simulation affords exploration of finite-sample performance, robustness to assumption violations, comparison across several methods, and ease of interpretability \citep{mooney_monte_1997}.
Such simulations allow exploration of how results depend on assumptions about the data-generating process and analytical choices and are thus an important tool for designing studies that account for misclassification.
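To make this logic concrete, the following R sketch (with hypothetical parameter values, not those of the simulations reported here) runs a miniature Monte Carlo study: a binary variable $X$ is measured by a classifier $W$ with 15\% nondifferential error, and the naïve regression on $W$ is compared to an oracle regression on the true $X$:

```r
# Miniature Monte Carlo study: nondifferential misclassification attenuates B_1
set.seed(42)
one_rep <- function(n = 1000, err = 0.15) {
  Z <- rnorm(n)
  X <- rbinom(n, 1, plogis(Z))                   # exposure model
  Y <- 1 + 0.5 * X - 0.3 * Z + rnorm(n)          # outcome model, true B_1 = 0.5
  W <- ifelse(rbinom(n, 1, err) == 1, 1 - X, X)  # classifier flips 15% of labels
  c(naive  = unname(coef(lm(Y ~ W + Z))["W"]),   # regression on classifications
    oracle = unname(coef(lm(Y ~ X + Z))["X"]))   # regression on true labels
}
res <- rowMeans(replicate(500, one_rep()))
res  # naive mean estimate of B_1 is pulled toward zero; the oracle's is not
```

Averaged over replications, the naïve slope is attenuated toward zero while the oracle recovers $B_1$, illustrating why high classifier accuracy alone does not guarantee unbiased downstream estimates.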
% Code for reproducing our simulations is available here: \url{https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e}.
@@ -522,7 +523,7 @@ Notably, the PL method is inconsistent and considerable bias remains when the sa
As \citet{fong_machine_2021} also observed, this precision improvement is less pronounced for MI estimates, indicating that
GMM and MLA use automated classifications more efficiently than MI.
\begin{figure}[htbp!]
<<example1.x, echo=FALSE, message=FALSE, warning=FALSE, results='asis', dev='pdf', fig.width=6, fig.asp=.65, cache=F>>=
p <- plot.simulation.iv(plot.df.example.1, iv='x')
grid.draw(p)
@@ -617,12 +618,12 @@ We think there is: Using statistical methods to not only quantify but also corre
\subsubsection{Step 1: Attempt Manual Content Analysis}
Manual content annotation is often done \textit{post facto}, for instance to calculate predictiveness of an already existing AC such as Google's Perspective classifier. We propose to instead use manually annotated data \textit{ante facto}, i.e., before building or validating an AC.
Practically speaking, the main reason to use an AC is feasibility: to avoid the costs of manually coding a large dataset.
One may for example need a large dataset to study an effect one assumes to be small. Manually labeling such a dataset is expensive.
ACs are often seen as a cost-saving procedure, yet scholars frequently fail to consider the threats to validity posed by misclassification.
Moreover, validating an existing AC or building a new AC is also expensive, for instance due to costs of computational resources or manual annotation of (perhaps smaller) test and training datasets.
We therefore caution researchers against preferring automated over manual content analysis unless doing so is necessary to obtain useful evidence. We agree with \citet{baden_three_2022} who argue that ``social science researchers may be well-advised to eschew the promises of computational tools and invest instead into carefully researcher-controlled, limited-scale manual studies'' (p. 11). In particular, we recommend using manually annotated data \textit{ante facto}: Researchers should begin by examining human-annotated data to discern whether an AC is necessary. In our simulations, the feasible estimator is less precise but consistent in all cases. So if fortune shines and this estimate sufficiently answers one's research question, manual coding is sufficient.
Here, scholars should rely on existing recommendations for descriptive and inferential statistics when using manual content analysis \citep{geis_statistical_2021, bachl_correcting_2017}. If, however, the feasible estimator fails to provide convincing evidence, for example by not rejecting the null, manually annotated data is not wasted. It can be reused to build an AC or correct misclassification bias.
%One potential problem of this \textit{ante facto} approach is that conducting two statistical tests of the same hypothesis increases the chances of false discovery. A simple solution is to adjust the significance threshold $\alpha$ for drawing conclusions from the feasible estimate. %We recommend p < .01. %That said, it might be useful to use an AC in a preliminary analysis, prior to collecting validation data, when an AC, such as one available from an API, is available for reuse and the confusion matrix quantities necessary for the pseudo-likelihood (PL) method are published. Although PL is inconsistent when used for a covariate, this can be corrected if the true rate of $X$ can be estimated.
%Caution is still warranted because ACs can perform quite differently from one dataset to another, so we recommend collecting validation data representative of your study's dataset and using another appropriate method for published studies.
@@ -634,7 +635,7 @@ We therefore caution researchers against preferring automated over manual conten
As demonstrated in our simulations, knowing whether an AC makes systematic misclassifications is important: It determines which correction methods can work.
Fortunately, manually annotated data can be used to detect systematic misclassification.
For example, \citet{fong_machine_2021} suggest using Sargan's J-test of the null hypothesis that the product of the AC's predictions and regression residuals has an expected value of 0.
More generally, one can test if the data's conditional independence structures can be represented by Figures \ref{fig:simulation.1a} or \ref{fig:simulation.2a}. This can be done, for example, via likelihood ratio tests of $P(W|X,Z) = P(W|X,Y,Z)$ (if an AC measures an independent variable $X$) or of $P(W|Y) = P(W|Y,Z,X)$ (if an AC measures a dependent variable $Y$) or by visual inspection of plots relating misclassifications to other variables \citep{carroll_measurement_2006}.
We strongly recommend using such methods to test for differential error and to design an appropriate correction.
% For example, ``algorithmic audits'' \citep[e.g.,][]{rauchfleisch_false_2020, kleinberg_algorithmic_2018} evaluate the performance of ACs across different subgroups in the data.
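As an illustration of the likelihood ratio test for an independent variable (simulated data with made-up parameter values; \texttt{glm} and \texttt{anova} from base R), one compares an error model without $Y$ to one that includes $Y$:

```r
# Sketch: likelihood ratio test for differential error in classifications W
set.seed(7)
n <- 5000
Z <- rnorm(n)
X <- rbinom(n, 1, plogis(Z))
Y <- 1 + 0.5 * X - 0.3 * Z + rnorm(n)
# W depends on Y given X, i.e. its error is differential by construction
W <- rbinom(n, 1, plogis(-2 + 4 * X + 0.8 * Y))
m0 <- glm(W ~ X + Z,     family = binomial)  # assumes nondifferential error
m1 <- glm(W ~ X + Y + Z, family = binomial)  # allows differential error
anova(m0, m1, test = "Chisq")  # small p-value flags P(W|X,Z) != P(W|X,Y,Z)
```

If error were nondifferential, adding $Y$ should not significantly improve model fit; here it does, correctly flagging differential error.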
@@ -646,7 +647,7 @@ More generally, one can test if the data's conditional independence structures c
\subsubsection{Step 3: Correct for Misclassification Bias Instead of Being Naïve}
Across our simulations, we showed that the naïve estimator is biased. Testing different error correction methods, we found that these generate different levels of consistency, efficiency, and accuracy in uncertainty quantification. That said, our proposed MLA method should be considered a versatile method because it is the only method capable of producing consistent estimates in all the prototypical situations studied here. We recommend the MLA method as the first ``go-to'' method. As shown in Appendix \ref{appendix:robustness}, this method requires specifying a valid error model to obtain consistent estimates. One should take care that the model does not omit variables, including nonlinearities and interactions.
Our \textbf{misclassificationmodels} R package provides reasonable default error models and a user-friendly interface to facilitate adoption of our MLA method (see Appendix \ref{appendix:misclassificationmodels}).
When feasible, we recommend comparing the MLA approach to another error correction method. Agreement between two correction methods shows that results are robust to the choice of correction method. If the AC is used to predict an independent variable, GMM is a good choice if error is nondifferential. Otherwise, MI can be considered.
@@ -657,7 +658,7 @@ This range of viable choices in error correction methods also motivates our nex
\subsubsection{Step 4: Provide a Full Account of Methodological Decisions}
Finally, we add our voices to those
recommending that researchers report methodological decisions so others can understand and replicate their design \citep{pipal_if_2022, reiss_reporting_2022}, especially in the context of machine learning \citep{mitchell_model_2019}. These decisions include but are not limited to choices concerning test and training data (e.g., size, sampling, split in cross-validation procedures, balance), manual annotations (size, number of annotators, intercoder values, size of data annotated for intercoder testing), and the classifier itself (choice of algorithm or ensemble, different accuracy metrics). They extend to reporting different error correction methods as proposed by our third recommendation.
In our review, we found that reporting such decisions is not yet common, at least in the context of SML-based text classification.
When correcting for misclassification, uncorrected results will often provide a lower bound on effect sizes; corrected analyses will provide more accurate but less conservative results.
Therefore, both corrected and uncorrected estimates should be presented as part of making potential multiverses of findings transparent.
@@ -670,24 +671,23 @@ Therefore, both corrected and uncorrected estimates should be presented as part
\section{Conclusion and Limitations}
Misclassification bias is an important threat to validity in studies that use automated classifiers to measure statistical variables.
As we showed in an example with data from the Perspective API, widely used and very accurate automated classifiers can cause type-1 and type-2 errors.
As evidenced by our literature review, this problem has not attracted enough attention within communication science \citep[but see][]{bachl_correcting_2017} or even in the broader computational social science community.
Although current best practices of reporting metrics of classifier performance on manually annotated validation data, for instance metrics like precision or recall, are important, they provide little protection from misclassification bias.
These practices use annotations to enact a transparency ritual to ward against misclassification bias, but annotations can do much more. With the right statistical model, they can correct misclassification bias.
We introduce maximum likelihood adjustment (MLA), a new method we designed to correct misclassification bias, and use Monte Carlo simulations to
evaluate it in comparison to other recently proposed error correction methods.
Our MLA method is the only one that is effective across a wide range of scenarios. It is also straightforward to use. Our implementation in the R package \texttt{misclassificationmodels} provides a familiar formula interface for regression models.
Remarkably, our simulations show that our method can use even an automated classifier below common accuracy standards to obtain consistent estimates. Therefore, low accuracy is not necessarily a barrier to using an AC.
Based on these results, we provide four recommendations for the future of automated content analysis: Researchers should (1) attempt manual content analysis before building or validating ACs to see whether human-labeled data is sufficient, (2) use manually annotated data to test for systematic misclassification and choose appropriate error correction methods, (3) correct for misclassifications via error correction methods, and (4) be transparent about the methodological decisions involved in AC-based classifications and error correction.
Our study has several limitations. First, the simulations and methods we introduce focus on misclassification by automated tools. They provisionally assume that human annotators do not make errors, especially systematic ones.
This assumption can be reasonable if intercoder reliability is very high but, as with ACs, this may not always be the case.
%Alternatively, validation data can be treated as a gold standard if the goal is measuring \emph{how a person categorizes content}, as opposed to the more common approach of measuring presumably objective content categories. That said, the prevailing approaches in content analysis use human coders, who are prone to misclassification, to measure a latent category.
Thus, it may be important to account for measurement error by human coders \citep{bachl_correcting_2017} and by automated classifiers simultaneously. In theory, it is possible to extend our MLA approach in order to do so \citep{carroll_measurement_2006}.
However, because the true values of content categories are never observed, accounting for automated and human misclassification at once requires latent variable methods that bear considerable additional complexity and assumptions \citep{pepe_insights_2007}. We leave the integration of such methods into our MLA framework for future work. In addition, our method requires the additional assumption that the error model is correct. As we argue in Appendix \ref{appendix:robustness} (section \ref{appendix:assumption}), this assumption is often acceptable.
Second, the simulations we present do not consider all possible factors that may influence the performance and robustness of error correction methods, including classifier accuracy, heteroskedasticity, and violations of distributional assumptions. We are working to investigate such factors, as shown in Appendix \ref{appendix:robustness}, by extending our simulations.
\setcounter{biburlnumpenalty}{9001} \setcounter{biburlnumpenalty}{9001}
\printbibliography[title = {References}] \printbibliography[title = {References}]
@@ -840,20 +840,20 @@ As above, the conditional probability of $W$ given $Y$ must be calculated using
We implement these methods in \texttt{R} using the \texttt{optim} function for maximum likelihood estimation. Our implementation supports models specified using \texttt{R}'s formula syntax. It can fit linear and logistic regression models when an AC measures an independent variable and logistic regression models when an AC measures the dependent variable. Our implementation provides two methods for approximating confidence intervals: the Fisher information quadratic approximation and the profile likelihood method provided in the \texttt{R} package \texttt{bbmle}. The Fisher approximation usually works well in simple models fit to large samples and is fast enough for practical use for the large number of simulations we present. However, the profile likelihood method provides more accurate confidence intervals \citep{carroll_measurement_2006}.
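To illustrate the estimation strategy, here is a minimal sketch of maximum likelihood adjustment for a linear outcome model with a misclassified binary covariate, written directly against \texttt{optim}. It is a simplified stand-in, not the actual package code; all data-generating values are hypothetical, and for brevity the error model here uses $X$ and $Y$ only:

```r
# Minimal MLA sketch: maximize the joint likelihood of (Y, W), summing over the
# two possible values of X whenever X is not manually annotated
set.seed(1)
n <- 2000; Z <- rnorm(n)
X <- rbinom(n, 1, plogis(-0.5 + Z))                 # true (mostly unobserved) X
Y <- 1 + 0.5 * X - 0.3 * Z + rnorm(n)               # true B = (1, 0.5, -0.3)
W <- rbinom(n, 1, plogis(-2 + 4 * X + 0.5 * Y))     # differential classifications
val <- seq_len(n) <= 300                            # validation subset: X known

negll <- function(th) {
  B <- th[1:3]; sigma <- exp(th[4]); a <- th[5:7]; g <- th[8:9]
  lik <- function(x)                                # product of the three models
    dnorm(Y, B[1] + B[2] * x + B[3] * Z, sigma) *       # outcome model
    dbinom(W, 1, plogis(a[1] + a[2] * x + a[3] * Y)) *  # error model
    dbinom(x, 1, plogis(g[1] + g[2] * Z))               # exposure model
  -sum(log(lik(X)[val])) - sum(log((lik(0) + lik(1))[!val]))
}
start <- c(unname(coef(lm(Y ~ W + Z))), log(sd(Y)), rep(0, 5))
fit <- optim(start, negll, method = "BFGS", control = list(maxit = 500))
round(fit$par[1:3], 2)  # estimates of B_0, B_1, B_2
```

Only 300 of the 2,000 units are manually annotated; for the remaining units the likelihood marginalizes over $X$, which is what lets the automated classifications contribute information without biasing the estimate of $B_1$.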
\subsection{Comment on Model Assumptions}
\label{appendix:assumption}
How burdensome is the assumption that the error model can consistently estimate the conditional probability of $W$ given $Y$? If this assumption were much more difficult to satisfy than those already accepted by the model for $Y$ given $X$ and $Z$, one would fear that using the MLA correction method introduces greater validity threats than it removes. In particular, one may worry that unobserved variables $U$ are omitted from our model for $P(Y,W)$. As demonstrated in Appendix \ref{appendix:robustness} (section \ref{appendix:misspec}), the MLA method is less effective when variables are omitted from the error model.

However, if we believe our outcome model for $P(Y|X,Z)$ is consistent, this threat is substantially reduced. If one can assume a model for $P(Y|X,Z)$, it is often reasonable to assume that the variables needed to model $P(W|X,Y,Z)$ are observed.
Furthermore, since $W$ is an output from an automated classifier, it depends only on the classifier's features, which are observable in principle. As a result, and as suggested by \citet{fong_machine_2021}, one should consider including all such features in the error model.
However, due to the highly nonlinear nature of machine learning classifiers, specifying the functional form of the error model may require care in practice. One option is to calibrate an AC to one's dataset and thereby obtain accurate estimates of its predicted probabilities.
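The calibration option can be sketched in a few lines. The following Platt-style Python illustration (hypothetical scores; not part of our package) regresses annotated true labels on raw classifier scores so that the resulting predicted probabilities match observed frequencies:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Hypothetical setup: a classifier emits informative but uncalibrated
# scores s; annotated ground truth is available for a validation sample.
n = 2000
truth = rng.random(n) < 0.5
s = truth + rng.normal(size=n)                 # raw, uncalibrated scores

# Platt-style calibration: logistic regression of the truth on the scores.
def nll(theta):
    eta = theta[0] + theta[1] * s
    return -np.sum(truth * eta - np.logaddexp(0.0, eta))

a, b = minimize(nll, x0=np.zeros(2), method="BFGS").x
calibrated = 1 / (1 + np.exp(-(a + b * s)))    # calibrated P(X = 1 | s)
print(round(calibrated.mean(), 2))
```

Because logistic regression with an intercept matches the mean predicted probability to the observed base rate, the calibrated probabilities are usable as error-model inputs.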
\section{misclassificationmodels: The R package} \label{appendix:misclassificationmodels}
The package provides a function that conducts regression analysis while correcting for misclassification using information from manually annotated data. The function is very similar to \textbf{glm()} but with two changes:
\begin{itemize}
\item The formula interface has been extended with the double-pipe operator to denote a proxy variable. For example, \textbf{x || w} indicates that \textit{w} is the proxy of the ground truth \textit{x}.
@@ -876,61 +876,60 @@ summary(res)
For more information about the package, please see here: \url{https://osf.io/pyqf8/?view_only=c80e7b76d94645bd9543f04c2a95a87e}.
\section{Robustness Tests}\label{appendix:robustness}

This appendix presents robustness tests for our simulations. In the following sections, we show what happens when the error model is misspecified (see section \ref{appendix:misspec}), when the accuracy of the classifier varies (see section \ref{appendix:accuracy}), when the classified variable is not balanced but skewed (see section \ref{appendix:imbalanced}), and when the degree of systematic misclassification changes (see section \ref{appendix:degreebias}).
%\subsection{Additional plots for Simulations 1 and 2}
%\label{appendix:main.sim.plots}

%\begin{figure}[htbp!]
%<<example1.g,echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
%p <- plot.simulation.iv(plot.df.example.1,iv='z')
%grid.draw(p)
%@
%\caption{Estimates of $B_Z$ in \emph{simulation 1a}, multivariate regression with $X$ measured using machine learning and model accuracy independent of $X$, $Y$, and $Z$. All methods obtain precise and accurate estimates given sufficient validation data.}
%\end{figure}

%\begin{figure}[htbp!]
%<<example2.g, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
%p <- plot.simulation.iv(plot.df.example.2, iv='z')
%grid.draw(p)
%@
%\caption{Estimates of $B_Z$ in multivariate regression with $X$ measured using machine learning and model accuracy correlated with $X$ and $Y$ and error is differential. Only multiple imputation and our MLA model with a full specification of the error model obtain consistent estimates of $B_X$.\label{fig:sim1b.z}}
%\end{figure}

%\begin{figure}[htbp!]
%<<example3.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
%p <- plot.simulation.dv(plot.df.example.3,'z')
%grid.draw(p)
%@
%\caption{Estimates of $B_Z$ in \emph{simulation 2a}, multivariate regression with $Y$ measured using an AC that makes errors. Only our MLA model with a full specification of the error model obtains consistent estimates.}
%\end{figure}

%\begin{figure}[htbp!]
%<<example.4.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.65,cache=F>>=
%p <- plot.simulation.dv(plot.df.example.4,'x')
%grid.draw(p)
%@
%\caption{Estimates of $B_X$ in \emph{simulation 2b} multivariate regression with $Y$ measured using machine learning, model accuracy correlated with $Z$ and $Y$ and differential error. Only our MLA model with a full specification of the error model obtains consistent estimates. \label{fig:sim2b.z}}
%\end{figure}

%\clearpage
\subsection{Robustness Test I: Misspecification of the Error Correction Model}
\label{appendix:misspec}
In \emph{Simulations 1b} and \emph{2b}, the MLA method was able to correct systematic misclassification using the error models in equations \ref{eq:covariate.reg.general} and \ref{eq:depvar.general}.
However, this depends on the error model consistently estimating the conditional probability of automated classifications given the true value and the outcome.
If the misclassifications and the outcome are conditionally dependent given a variable $Z$ that is omitted from the error model, this will not be possible.
Here, we demonstrate how misspecification of the error correction model affects results in the context of misclassification in an independent variable (see section \ref{appendix:misspec.iv}) and a dependent variable (see section \ref{appendix:misspec.dv}).
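To make concrete the kind of dependence the error model must capture, the following Python sketch (illustrative parameters, not our simulation code) generates differential misclassification of a binary independent variable, where the flip probability depends on the outcome, and shows the bias this induces in the naïve regression estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b_x = 100_000, 1.0

x = rng.random(n) < 0.5                        # true binary independent variable
y = b_x * x + rng.normal(size=n)               # outcome generated from the truth

# Differential error: the probability that the classifier flips x
# increases with the outcome y (an assumed, illustrative error process).
flip = rng.random(n) < 1 / (1 + np.exp(-(-2.0 + 0.8 * y)))
w = np.where(flip, ~x, x)                      # observed proxy classification

naive = np.polyfit(w.astype(float), y, 1)[0]   # regress y on the proxy w
oracle = np.polyfit(x.astype(float), y, 1)[0]  # regress y on the true x
print(f"naive slope: {naive:.2f}, oracle slope: {oracle:.2f}")
```

An error model for $P(W|X,Y)$ that omits the dependence on $y$ in the flip probability above could not reproduce this error process, which is the situation the following simulations examine.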
\subsubsection{Systematic Misclassification of an Independent Variable}
\label{appendix:misspec.iv}
Repeating \emph{Simulation 1b}, what happens when the error model is misspecified? Figure \ref{fig:iv.noz} visualizes the effects on $B_X$ (upper panel) and $B_Z$ (lower panel). It shows that a misspecified MLA model is unable to fully correct misclassification bias: Although estimates of $B_X$ are close to the true value and estimates of $B_Z$ improve on the naïve estimator, $B_Z$ remains clearly biased.
%Here we refer to $P(Y|X,Z,\Theta_Y)$ as the ``outcome model'', $P(W|Y,X,Z,\Theta_W)$ as the ``proxy model'', and $P(X|Z,\Theta_X)$ as the ``truth model''.
@@ -942,7 +941,7 @@ p <- plot.robustness.1('x')
grid.draw(p)
@
\label{fig:iv.noz.x}
\caption{Estimates of $B_X$ are close to the true value despite the misspecified error correction model.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
@@ -952,15 +951,15 @@ p <- plot.robustness.1('z')
grid.draw(p)
@
\label{fig:iv.noz.z}
\caption{Estimates of $B_Z$ are biased given a misspecified error correction model.}
\end{subfigure}
\caption{Robustness Test I: Misspecification of the Error Correction Model, Simulation 1b}
\label{fig:iv.noz}
\end{figure}
\subsubsection{Systematic Misclassification of a Dependent Variable}
\label{appendix:misspec.dv}
Next, we repeat \emph{Simulation 2b} with a misspecified error correction model. Figure \ref{fig:dv.noz} shows that the misspecified error model is again unable to obtain consistent estimates of $B_Z$, although estimates of $B_X$ remain close to the true value.
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
@@ -970,7 +969,7 @@ p <- plot.robustness.1.dv('x')
grid.draw(p)
@
\label{fig:dv.noz.x}
\caption{Estimates of $B_X$ are close to the true value despite the misspecified error correction model.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
@@ -980,24 +979,25 @@ p <- plot.robustness.1.dv('z')
grid.draw(p)
@
\label{fig:dv.noz.z}
\caption{Estimates of $B_Z$ are biased given a misspecified error correction model.}
\end{subfigure}
\caption{Robustness Test I: Misspecification of the Error Correction Model, Simulation 2b}
\label{fig:dv.noz}
\end{figure}

\clearpage
\subsection{Robustness Test II: Varying Accuracy of the Automated Classifier}
\label{appendix:accuracy}
<<load.robustness.2, echo=FALSE, message=FALSE, warning=FALSE, result='hide'>>=
source('resources/robustness_check_plots.R')
@
Next, we repeat \emph{Simulation 1a} to show how varying accuracy of the AC affects estimates of the coefficients $B_X$ and $B_Z$. Here, we let classifier accuracy range
from \Sexpr{format.percent(min(robust_2_min_acc))} to \Sexpr{format.percent(max(robust_2_max_acc))}.
In Figure \ref{fig:iv.predacc}, we present results for 5,000 classifications and 100 annotations.
As expected, a more accurate classifier causes less misclassification bias. All error correction methods also provide more precise estimates when used with a more accurate classifier.
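The link between accuracy and bias can be sketched in Python (nondifferential, symmetric errors assumed for simplicity; the error rates are illustrative, not those of the simulation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, b_x = 200_000, 1.0
x = rng.random(n) < 0.5
y = b_x * x + rng.normal(size=n)

biases = []
for err in (0.25, 0.10, 0.02):                 # accuracies of 75%, 90%, 98%
    w = np.where(rng.random(n) < err, ~x, x)   # nondifferential misclassification
    naive = np.polyfit(w.astype(float), y, 1)[0]
    biases.append(abs(naive - b_x))

# With symmetric, nondifferential error on a balanced binary variable,
# the naive slope is attenuated by roughly (1 - 2 * err), so the bias
# shrinks steadily as accuracy rises.
print([round(b, 2) for b in biases])
```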
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
@@ -1005,8 +1005,7 @@ As expected, in both scenarios a more accurate classifier causes less misclassif
p <- plot.robustness.2.iv('x')
grid.draw(p)
@
\caption{Estimates of $B_X$ improve with higher accuracy of the AC.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
@@ -1014,18 +1013,25 @@ grid.draw(p)
p <- plot.robustness.2.iv('z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ improve with higher accuracy of the AC.}
\end{subfigure}
\caption{Robustness Test II: Varying Accuracy of the Automated Classifier, Simulation 1a}
\label{fig:iv.predacc}
\end{figure}

\clearpage
\subsection{Robustness Test III: Misclassification in Imbalanced Variables}
\label{appendix:imbalanced}
For simplicity, our main simulations include balanced classified variables. However, classifiers are often used to measure imbalanced variables, which can be more difficult to predict. As a next robustness test, we therefore replicate \emph{Simulation 1a} (see section \ref{appendix:imbalanced.iv}) and \emph{Simulation 2a} (see section \ref{appendix:imbalanced.dv}) to analyze whether the MLA error correction method performs similarly well with imbalanced classified variables. We do so for the scenario with 5,000 classifications and 200 manual annotations.
\subsubsection{Imbalance in Classified Independent Variables}
\label{appendix:imbalanced.iv}
Replicating \emph{Simulation 1a}, Figure \ref{fig:iv.imbalanced} illustrates that our MLA method performs similarly well with imbalance in classified independent variables.
%Although the Fischer approximation for confidence intervals performs poorly, the profile likelihood method works well.
However, the quality of uncertainty quantification tends to degrade as imbalance increases, suggesting that imbalanced data requires additional validation data for effective misclassification correction. Note that the PL approach produces very large confidence intervals and is excluded from Figure \ref{fig:iv.imbalanced} for readability.
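The intuition for why imbalance demands more validation data can be seen in a back-of-the-envelope Python sketch (binomial approximation; the 0.2 error rate is illustrative): the standard error of the minority class's estimated error rate grows as that class's share of the annotations shrinks.

```python
import numpy as np

# With an imbalanced class, few validation annotations land in the minority
# class, so its error rate is estimated imprecisely.
def minority_se(p_minority, n_annotations, error_rate=0.2):
    n_min = p_minority * n_annotations          # expected minority annotations
    # Binomial standard error of the estimated error rate in that class.
    return np.sqrt(error_rate * (1 - error_rate) / n_min)

for p in (0.5, 0.2, 0.05):
    print(p, round(minority_se(p, 200), 3))
```

With 200 annotations and a 5\% minority class, only about 10 annotations inform that class's error rates, which is why uncertainty quantification degrades.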
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
@@ -1033,8 +1039,7 @@ For simplicity, our main simulations have balanced classified variables. But cl
p <- plot.robustness.3.iv('x',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to true values given imbalance in $X$.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
@@ -1042,25 +1047,22 @@ grid.draw(p)
p <- plot.robustness.3.iv('z',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\caption{Estimates of $B_Z$ are close to true values given imbalance in $X$.}
\end{subfigure}
\caption{Robustness Test III: Misclassification in Imbalanced Variables, Simulation 1a}
\label{fig:iv.imbalanced}
\end{figure}
\subsubsection{Imbalance in Classified Dependent Variables}
\label{appendix:imbalanced.dv}
Replicating \emph{Simulation 2a}, Figure \ref{fig:dv.imbalanced} further illustrates that our MLA method performs similarly well with imbalance in a classified dependent variable. The PL approach is again excluded due to its very large confidence intervals.
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
<<dv.imbalance.x, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=0.5,cache=F>>=
p <- plot.robustness.3.dv('x',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to true values given imbalance in $Y$.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
@@ -1068,21 +1070,22 @@ grid.draw(p)
p <- plot.robustness.3.dv('z',n.classifications=5000, n.annotations=200)
grid.draw(p)
@
\caption{Estimates of $B_Z$ are close to true values given imbalance in $Y$.}
\end{subfigure}
\caption{Robustness Test III: Misclassification in Imbalanced Variables, Simulation 2a}
\label{fig:dv.imbalanced}
\end{figure}

\clearpage
\subsection{Robustness Test IV: Different Degrees of Systematic Misclassification}
\label{appendix:degreebias}

Lastly, we explore what happens if misclassification is more or less systematic. To do so, we replicate \emph{Simulation 1b} (see section \ref{appendix:degreebias.iv}) and \emph{Simulation 2b} (see section \ref{appendix:degreebias.dv}) with 1,000 classifications and 100 manual annotations. We vary the amount of systematic misclassification in \emph{Simulation 1b} via the logistic regression coefficient of $Y$ on $W$ while keeping the overall classifier accuracy close to 0.73. In \emph{Simulation 2b}, we similarly use a range of values for the coefficient of $Z$ on $W$.

\subsubsection{Systematic Misclassification in an Independent Variable}
\label{appendix:degreebias.iv}

Replicating \emph{Simulation 1b}, Figure \ref{fig:iv.degreebias} shows that our MLA method performs well even for higher degrees of systematic misclassification in the independent variable. At fairly high degrees of systematic misclassification, however, estimates of $B_Z$ in particular become inconsistent.
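How a single coefficient can make misclassification more systematic while leaving overall accuracy roughly unchanged can be sketched in Python (parameter values are illustrative, not those used in our simulations):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = (rng.random(n) < 0.5).astype(float)
y = x + rng.normal(size=n)                     # true coefficient of x is 1.0

# gamma controls how systematic the misclassification is: the proxy w is
# drawn from a logistic error model of x, with gamma weighting the outcome y.
def simulate(gamma):
    p_w = 1 / (1 + np.exp(-(-1.5 + 3.0 * x + gamma * y)))
    w = (rng.random(n) < p_w).astype(float)
    accuracy = np.mean(w == x)                 # overall classifier accuracy
    naive = np.polyfit(w, y, 1)[0]             # naive slope of y on the proxy
    return accuracy, naive

for gamma in (0.0, 0.5, 1.0):
    acc, naive = simulate(gamma)
    print(f"gamma={gamma}: accuracy={acc:.2f}, naive slope={naive:.2f}")
```

As gamma grows, accuracy stays in a similar range while the naive slope drifts further, which is why reporting accuracy alone does not reveal systematic misclassification.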
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
@@ -1090,28 +1093,24 @@ Now, we explore what happens as misclassification is more or less systematic in
p <- plot.robustness.4.iv('x')
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to true values given different degrees of misclassification in $X$.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
<<iv.bias.z, echo=FALSE, message=FALSE, warning=FALSE, result='asis', dev='pdf', fig.width=6, fig.asp=.5,cache=F>>=
p <- plot.robustness.4.iv('z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ are close to true values given different degrees of misclassification in $X$.}
\end{subfigure}
\caption{Robustness Test IV: Different Degrees of Systematic Misclassification, Simulation 1b}
\label{fig:iv.degreebias}
\end{figure}
\subsubsection{Systematic Misclassification in a Dependent Variable}
\label{appendix:degreebias.dv}
Replicating \emph{Simulation 2b}, Figure \ref{fig:dv.degreebias} yields similar conclusions. In the case of systematic misclassification in the dependent variable, the bias in the naïve estimator switches from negative to positive as systematic misclassification increases.
\begin{figure}[htpb!]
\begin{subfigure}{0.95\textwidth}
@@ -1119,8 +1118,7 @@ In the case of systematic misclassification in the dependent variable, we can ob
p <- plot.robustness.4.dv('x')
grid.draw(p)
@
\caption{Estimates of $B_X$ are close to true values given different degrees of misclassification in $Y$.}
\end{subfigure}
\begin{subfigure}{0.95\textwidth}
@@ -1128,11 +1126,10 @@ grid.draw(p)
p <- plot.robustness.4.dv('z')
grid.draw(p)
@
\caption{Estimates of $B_Z$ become inconsistent with increasing misclassification in $Y$.}
\end{subfigure}
\caption{Robustness Test IV: Different Degrees of Systematic Misclassification, Simulation 2b}
\label{fig:dv.degreebias}
\end{figure}
%However, if one can assume the model for $Y$, then one believes that $Y$ and $X$ are conditionally independent given other observed variables.
View File

@@ -23,8 +23,10 @@ plot.robustness.1 <- function(iv='x'){
df <- rbind(baseline_df, robust_df, fill=TRUE)
df[method=='naive', method:='Naive']
df[method=='MLA Reported', method:='Correct MLA']
df[method=='No Z in Error Model', method:='Misspec. MLA']
df <- df[(N %in% c(1000,5000)) & (m %in% c(200,100))]
p <- plot.simulation(df,iv=iv,levels=c('Correct MLA','Misspec. MLA', 'Naive', 'True'))
grid.draw(p)
}
@@ -79,8 +81,10 @@ plot.robustness.1.dv <- function(iv='z'){
df <- rbind(baseline_df, robust_df, fill=TRUE)
df <- df[(N %in% c(1000,5000)) & (m %in% c(200,100))]
df[method=='naive', method:='Naive']
df[method=='MLA Reported', method:='Correct MLA']
df[method=='No Z in Error Model', method:='Misspec. MLA']
p <- plot.simulation(df,iv=iv,levels=c('MLA Reported','No Z in Error Model','Naive', 'True')) p <- plot.simulation(df,iv=iv,levels=c('Correct MLA','Misspec. MLA','Naive', 'True'))
grid.draw(p) grid.draw(p)
} }
@@ -102,7 +106,7 @@ plot.robustness.2.iv <- function(iv, n.annotations=100, n.classifications=5000){
p <- arrangeGrob(p, p <- arrangeGrob(p,
top=grid.text("AC Accuracy",x=0.32,just='right')) top=grid.text("Varying Accuracy of the AC",x=0.42,just='right'))
grid.draw(p) grid.draw(p)
} }
@@ -133,7 +137,7 @@ plot.robustness.2.dv <- function(iv, n.annotations=100, n.classifications=5000){
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip() p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p, p <- arrangeGrob(p,
top=grid.text("AC Accuracy",x=0.32,just='right')) top=grid.text("Varying Accuracy of the AC",x=0.42,just='right'))
grid.draw(p) grid.draw(p)
} }
@@ -168,7 +172,7 @@ plot.robustness.3.iv <- function(iv, n.annotations=200, n.classifications=5000){
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip() p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p, p <- arrangeGrob(p,
top=grid.text("P(X)",x=0.32,just='right')) top=grid.text("Imbalance in X",x=0.32,just='right'))
grid.draw(p) grid.draw(p)
} }
@@ -191,7 +195,7 @@ plot.robustness.3.dv <- function(iv, n.annotations=100, n.classifications=1000){
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip() p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p, p <- arrangeGrob(p,
top=grid.text("P(Y)",x=0.32,just='right')) top=grid.text("Imbalance in Y",x=0.32,just='right'))
grid.draw(p) grid.draw(p)
} }
@@ -215,7 +219,7 @@ plot.robustness.4.iv <- function(iv, n.annotations=100, n.classifications=1000){
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip() p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p, p <- arrangeGrob(p,
top=grid.text("Coefficient of Y for W",x=0.32,just='right')) top=grid.text("Varying Degree of Misclassification in X",x=0.52,just='right'))
grid.draw(p) grid.draw(p)
} }
@@ -240,7 +244,7 @@ plot.robustness.4.dv <- function(iv, n.annotations=100, n.classifications=1000){
p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip() p <- p + scale_x_discrete(labels=label_wrap_gen(14)) + ylab("Estimate") + xlab("Method") + coord_flip()
p <- arrangeGrob(p, p <- arrangeGrob(p,
top=grid.text("Coefficient of Z on W",x=0.32,just='right')) top=grid.text("Varying Degree of Misclassification in Y",x=0.52,just='right'))
grid.draw(p) grid.draw(p)
} }