From 38deeddefbb81d807272eafc80f1aad28703c04c Mon Sep 17 00:00:00 2001
From: nathante
Date: Mon, 6 Mar 2023 17:55:28 +0000
Subject: [PATCH] Update on Overleaf.

---
 article.Rtex | 84 ++++++++++++++++++++++++++++------------------
 1 file changed, 46 insertions(+), 38 deletions(-)

diff --git a/article.Rtex b/article.Rtex
index 32d8c94..e7f2c28 100644
--- a/article.Rtex
+++ b/article.Rtex
@@ -574,7 +574,7 @@ grid.draw(p)
 \caption{Simulation 2a: Nonsystematic misclassification of a dependent variable. Only the MLA approach obtains consistent estimates. \label{fig:sim2a.x}}
 \end{figure}
-\subsection{Simulation 2b: Systematic Misclassification of a Dependent Variable}
+\subsection{\emph{Simulation 2b}: Systematic Misclassification of a Dependent Variable}
 \begin{figure}
 <>=
@@ -682,7 +682,7 @@ Remarkably, our simulations show that our method can use even an automated class
 Based on these results, we provide four recommendations for the future of automated content analysis: Researchers should (1) attempt manual content analysis before building or validating ACs to see whether human-labeled data is sufficient, (2) use manually annotated data to test for systematic misclassification and choose appropriate error correction methods, (3) correct for misclassification via error correction methods, and (4) be transparent about the methodological decisions involved in AC-based classifications and error correction.
-Our study has several limitations. First, the simulations and methods we introduce focus on misclassification by automated tools. They provisionally assume that human annotators do not make errors, especially notn systematic ones.
+Our study has several limitations. First, the simulations and methods we introduce focus on misclassification by automated tools. They provisionally assume that human annotators do not make errors, especially not systematic ones.
 This assumption can be reasonable if intercoder reliability is very high but, as with ACs, this may not always be the case.
 %Alternatively, validation data can be treated as a gold standard if the goal is measuring \emph{how a person categorizes content}, as opposed to the more common approach of measuring presumably objective content categories.
 That said, the prevailing approaches in content analysis use human coders to measure a latent category who are prone to misclassification. Thus, it may be important to account for measurement error by human coders \citep{bachl_correcting_2017} and by automated classifiers simultaneously. In theory, it is possible to extend our MLA approach in order to do so \citep{carroll_measurement_2006}.
@@ -704,30 +704,32 @@ Our example relies on the publicly available Civil Comments dataset \citep{cjada
 Each comment was labeled by up to ten manual annotators (although selected comments were labeled by even more annotators). Originally, the dataset represents \emph{toxicity} and \emph{disclosure} as proportions of annotators who labeled a comment as toxic or as disclosing aspects of personal identity including race and ethnicity. For our analysis, we converted these proportions into indicators of the majority view to transform both variables to a binary scale.
-%\begin{figure}[htbp!]
-%\centering
-%\begin{subfigure}{\linewidth}
-%<>=
-%p <- plot.civilcomments.iv.example(include.models=c("Automatic %Classification", "All Annotations", "Annotation Sample", "Error %Correction"))
-%print(p)
-%@
-%\subcaption{\emph{Example 1}: Misclassification in an independent %variable.\label{fig:real.data.example.iv.app}}
-%\end{subfigure}
+Our MLA method works in this scenario, as shown in Figure \ref{fig:real.data.example.app}.
-%\begin{subfigure}{\linewidth}
-%<>=
-%p <- plot.civilcomments.dv.example(include.models=c("Automatic %Classification", "All Annotations", "Annotation Sample", "Error %Correction"))
-%print(p)
-%@
-%\subcaption{\emph{Example 2}: Misclassification in a dependent variable. %\label{fig:real.data.example.dv.app}}
+\begin{figure}[htbp!]
+\centering
+\begin{subfigure}{\linewidth}
+<>=
+p <- plot.civilcomments.iv.example(include.models=c("Automatic Classifications", "Manual Annotations", "Annotation Sample", "Error Correction"))
+print(p)
+@
+\subcaption{\emph{Example 1}: Misclassification in an independent variable. \label{fig:real.data.example.iv.app}}
+\end{subfigure}
-%\end{subfigure}
-%\caption{Real-data example including correction using MLA.}
-%\end{figure}
+\begin{subfigure}{\linewidth}
+<>=
+p <- plot.civilcomments.dv.example(include.models=c("Automatic Classifications", "All Annotations", "Annotation Sample", "Error Correction"))
+print(p)
+@
+\subcaption{\emph{Example 2}: Misclassification in a dependent variable. \label{fig:real.data.example.dv.app}}
-% Our maximum-likelihood based error correction technique in this example requires specifying models for the Perspective's scores and, in the case where these scores are used as a covariate, a model for the human annotations. In our first example, where toxicity was used as a covariate, we used the \emph{human annotations}, \emph{identity disclosure}, and the interaction of these two variables in the model for scores.
We omitted \emph{likes} from this model because they are virtually uncorrelated with misclassifications (Pearson's $\rho=\Sexpr{iv.example[['civil_comments_cortab']]['toxicity_error','likes']}$). Our model for the human annotations is an intercept-only model.
+\end{subfigure}
+\caption{Real-data example including correction using MLA. \label{fig:real.data.example.app}}
+\end{figure}
-% In our second example, where toxicity is the outcome, we use the fully interacted model of the \emph{human annotations}, \emph{identity disclosure}, and \emph{likes} in our model for the human annotations because all three variables are correlated with the Perspective scores.
+Our maximum-likelihood-based error correction technique in this example requires specifying models for the Perspective scores and, in the case where these scores are used as a covariate, a model for the human annotations. In our first example, where toxicity was used as a covariate, we used the \emph{human annotations}, \emph{identity disclosure}, and the interaction of these two variables in the model for scores. We omitted \emph{likes} from this model because they are virtually uncorrelated with misclassifications (Pearson's $\rho=\Sexpr{iv.example[['civil_comments_cortab']]['toxicity_error','likes']}$). Our model for the human annotations is an intercept-only model.
+
+In our second example, where toxicity is the outcome, we use the fully interacted model of the \emph{human annotations}, \emph{identity disclosure}, and \emph{likes} in our model for the scores because all three variables are correlated with the Perspective scores.
 \section{Systematic Literature Review}
 \label{appendix:lit.review}
@@ -879,6 +881,8 @@ For more information about the package, please see here: \url{https://osf.io/py
 \section{Additional plots for Simulations 1 and 2}
 \label{appendix:main.sim.plots}
+Appendix \ref{appendix:main.sim.plots} includes additional plots for our main simulations across \emph{Simulations 1a--2b}.
It visualizes estimates of $B_Z$, the coefficient of the second independent variable in our inferential model. Here, Figure \ref{fig:sim1a.z} visualizes estimates of $B_Z$ in \emph{Simulation 1a}, Figure \ref{fig:sim1b.z} in \emph{Simulation 1b}, Figure \ref{fig:sim2a.z} in \emph{Simulation 2a}, and Figure \ref{fig:sim2b.z} in \emph{Simulation 2b}.
+
 \begin{figure}[htbp!]
 <>=
@@ -886,7 +890,8 @@ p <- plot.simulation.iv(plot.df.example.1,iv='z')
 grid.draw(p)
 @
-\caption{Estimates of $B_Z$ in \emph{simulation 1a}, multivariate regression with $X$ measured using machine learning and model accuracy independent of $X$, $Y$, and $Z$. All methods obtain precise and accurate estimates given sufficient validation data.}
+\caption{Estimates of $B_Z$ in \emph{Simulation 1a}, multivariate regression with $X$ measured using an AC and model accuracy independent of $X$, $Y$, and $Z$. All error correction methods obtain precise and accurate estimates of $B_Z$ given sufficient validation data.}
+\label{fig:sim1a.z}
 \end{figure}
 \begin{figure}[htbp!]
@@ -894,7 +899,7 @@ p <- plot.simulation.iv(plot.df.example.2, iv='z')
 grid.draw(p)
 @
-\caption{Estimates of $B_Z$ in multivariate regression with $X$ measured using machine learning and model accuracy correlated with $X$ and $Y$ and error is differential. Only multiple imputation and our MLA model with a full specification of the error model obtain consistent estimates of $B_X$.\label{fig:sim1b.z}}
+\caption{Estimates of $B_Z$ in \emph{Simulation 1b}, multivariate regression with $X$ measured using an AC and differential error. Only multiple imputation and our MLA approach obtain consistent estimates of $B_Z$.\label{fig:sim1b.z}}
 \end{figure}
 \begin{figure}[htbp!]
@@ -903,7 +908,8 @@ grid.draw(p)
 p <- plot.simulation.dv(plot.df.example.3,'z')
 grid.draw(p)
 @
-\caption{Estimates of $B_Z$ in \emph{simulation 2a}, multivariate regression with $Y$ measured using an AC that makes errors.
Only our MLA model with a full specification of the error model obtains consistent estimates.}
+\caption{Estimates of $B_Z$ in \emph{Simulation 2a}, multivariate regression with $Y$ measured using an AC and misclassifications uncorrelated with the independent variables. Only our MLA approach obtains consistent estimates of $B_Z$.}
+\label{fig:sim2a.z}
 \end{figure}
 \begin{figure}[htbp!]
@@ -912,7 +918,7 @@ grid.draw(p)
 p <- plot.simulation.dv(plot.df.example.4,'x')
 grid.draw(p)
 @
-\caption{Estimates of $B_X$ in \emph{simulation 2b} multivariate regression with $Y$ measured using machine learning, model Accuracy correlated with $Z$ and $Y$ and differential error. Only our MLA model with a full specification of the error model obtains consistent estimates. \label{fig:sim2b.z}}
+\caption{Estimates of $B_X$ in \emph{Simulation 2b}, multivariate regression with $Y$ measured using an AC and misclassifications correlated with the independent variables. Only our MLA approach obtains consistent estimates of $B_X$. \label{fig:sim2b.z}}
 \end{figure}
@@ -995,12 +1001,17 @@ grid.draw(p)
 source('resources/robustness_check_plots.R')
 @
-Next, we repeat \emph{Simulation 1a} and \emph{Simulation 2a} to show how varying accuracy of the AC affects estimates of independent variables $B_X$ and $B_Z$. Here, we let classifier accuracy range
-from \Sexpr{format.percent(min(robust_2_min_acc))} to \Sexpr{format.percent(max(robust_2_max_acc))}.
-In Figure \ref{fig:iv.predacc}, we present results for \emph{Simulation 1a}, where an independent variable is automatically classified with 5,000 classifications and 100 annotations.
-As expected, a more accurate classifier causes less misclassification bias. All the error correction methods also provide more precise estimates when used with a more accurate classifiers.
+According to our literature review, the accuracy of reported classifiers varies widely.
But how does the performance of the classifier affect error correction methods and the remaining bias in inferential modeling? To test this, we repeat \emph{Simulation 1a} (see Section \ref{appendix:iv.predacc}) and \emph{Simulation 2a} (see Section \ref{appendix:dv.predacc}) to show how the varying accuracy of the AC affects estimates of the coefficients $B_X$ and $B_Z$. Here, we let classifier accuracy range
+from \Sexpr{format.percent(min(robust_2_min_acc))} to \Sexpr{format.percent(max(robust_2_max_acc))}. We present results for a scenario with 5,000 classifications and 100 manual annotations.
-As Figure \ref{fig:dv.predacc} shows, these patterns are similar when the dependent variable is automatically classified as in \emph{Simulation 2a}.
+\subsubsection{Varying Accuracy of an AC Predicting an Independent Variable}
+\label{appendix:iv.predacc}
+In Figure \ref{fig:iv.predacc}, we present results for \emph{Simulation 1a}, where the independent variable is created via an AC.
+As expected, a more accurate classifier causes less misclassification bias. All the error correction methods also provide more precise estimates when used with a more accurate classifier.
+
+\subsubsection{Varying Accuracy of an AC Predicting a Dependent Variable}
+\label{appendix:dv.predacc}
+We then repeat these simulations for \emph{Simulation 2a}, where the dependent variable is created via an AC. As Figure \ref{fig:dv.predacc} shows, patterns are similar: error correction methods provide more precise estimates when used with a more accurate classifier.
 \begin{figure}[htpb!]
 \begin{subfigure}{0.95\textwidth}
@@ -1018,8 +1029,8 @@ grid.draw(p)
 @
 \caption{Estimates of $B_Z$ improve with higher accuracy of the AC.}
 \end{subfigure}
-\caption{Robustness Test II: Varying Accuracy of the Automated Classifier, Simulation 2a}
-\label{fig:dv.predacc}
+\caption{Robustness Test II: Varying Accuracy of the Automated Classifier, Simulation 1a}
+\label{fig:iv.predacc}
 \end{figure}
 \begin{figure}[htpb!]
@@ -1054,7 +1065,7 @@ For simplicity, our main simulations include balanced classified variables. How
 \label{appendix:imbalanced.iv}
 Replicating \emph{Simulation 1a}, Figure \ref{fig:iv.imbalanced} illustrates that our MLA method performs similarly well with imbalance in classified independent variables.
-However, the quality of uncertainty quantification of methods tends to degrade as imbalance increases. This suggests that imbalanced data requires additional validation data for effective misclassification correction. Please note that the PL approach has very large confidence intervals and is thus excluded in Figure \ref{fig:iv.imbalanced} for readability.
+However, the quality of the methods' uncertainty quantification tends to degrade as imbalance increases, as seen by comparing the neighboring black and gray lines when the probability of $X$ is 0.95 in Figure \ref{fig:iv.imbalanced.bx}. This suggests that imbalanced data requires additional validation data for effective misclassification correction. Please note that the PL approach has very large confidence intervals and is thus excluded from Figure \ref{fig:iv.imbalanced} for readability.
 \begin{figure}[htpb!]
 \begin{subfigure}{0.95\textwidth}
@@ -1062,7 +1073,7 @@ However, the quality of uncertainty quantification of methods tends to degrade a
 p <- plot.robustness.3.iv('x',n.classifications=5000, n.annotations=200)
 grid.draw(p)
 @
-\caption{Estimates of $B_X$ are close to true values given imbalance in $X$.}
+\caption{Estimates of $B_X$ are close to true values given imbalance in $X$. \label{fig:iv.imbalanced.bx}}
 \end{subfigure}
 \begin{subfigure}{0.95\textwidth}
@@ -1109,7 +1120,6 @@ Lastly, we explore what happens if misclassification is more or less systematic.
 \label{appendix:degreebias.iv}
 Replicating \emph{Simulation 1b}, Figure \ref{fig:iv.degreebias} underlines that our MLA method performs well even for higher degrees of systematic misclassification in the independent variable.
-% With fairly high degrees of systematic misclassification, however, estimations of $B_Z$ in particular become inconsistent.
 \begin{figure}[htpb!]
 \begin{subfigure}{0.95\textwidth}
@@ -1180,8 +1190,6 @@ nondifferential measurement error and random error in the outcome are relatively
 Research using ACs based on supervised machine learning may be particularly prone to differential and systematic measurement error.
 Problems of bias and generalizability have machine learning field of machine learning more generally has
-
-
 %Statistical theory and simulations have shown that all these methods are effective (though some are more efficient) when ``ground-truth'' observations are unproblematic and when classifiers only make random, but not systematic, errors. We contribute by testing these methods in more difficult cases likely to arise in text-as-data studies.
 %
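The nonsystematic-misclassification scenario behind Simulation 1a can be sketched numerically. The snippet below is an illustrative stand-alone sketch in Python (the paper's simulations are implemented in R; the sample size, coefficients, and 20% error rate here are assumptions for illustration, not the paper's values): a binary independent variable observed through a classifier with symmetric, nonsystematic error attenuates the naive OLS coefficient by roughly the factor 1 - 2e.

```python
import numpy as np

# Hypothetical parameters, not the paper's actual simulation code.
rng = np.random.default_rng(0)
n = 100_000
b_x, b_z = 1.0, 0.5            # assumed true coefficients
e = 0.2                        # nonsystematic misclassification rate

x = rng.binomial(1, 0.5, n)    # true labels (the role of manual annotations)
z = rng.binomial(1, 0.5, n)    # correctly measured covariate
y = b_x * x + b_z * z + rng.normal(0.0, 1.0, n)

# Flip 20% of labels independently of x, y, and z (nonsystematic error)
w = np.where(rng.random(n) < e, 1 - x, x)

def ols_slope(covariate):
    """OLS coefficient on `covariate` in a regression of y on it plus z."""
    design = np.column_stack([np.ones(n), covariate, z])
    return np.linalg.lstsq(design, y, rcond=None)[0][1]

b_true = ols_slope(x)   # estimate using true labels
b_naive = ols_slope(w)  # estimate using classifier output, no correction

# For a symmetric error rate e, the naive estimate is attenuated by 1 - 2e
print(f"true-label estimate: {b_true:.3f}")
print(f"naive AC estimate:   {b_naive:.3f}")
```

Because the error here is independent of the other variables, the bias is pure attenuation; under differential error (Simulation 1b) it need not even point toward zero.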
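Recommendations (2) and (3) above boil down to estimating the classifier's error rates from a small annotated validation sample and then correcting. As a minimal illustration of that idea (deliberately simpler than the MLA and multiple-imputation estimators the paper evaluates, and in Python rather than R; all numbers are assumptions), the classic Rogan-Gladen adjustment recovers a prevalence from classifier output plus estimated sensitivity and specificity:

```python
import numpy as np

# Hypothetical error rates and sample sizes for illustration only.
rng = np.random.default_rng(1)
n, n_val = 200_000, 5_000
prev, sens, spec = 0.30, 0.90, 0.80

x = rng.binomial(1, prev, n)                        # true (latent) labels
w = np.where(x == 1,
             rng.random(n) < sens,                  # positives detected w.p. sens
             rng.random(n) < 1 - spec).astype(int)  # false positives w.p. 1 - spec

# Estimate the error rates on a small random validation sample,
# playing the role of the manually annotated subset.
val = rng.choice(n, n_val, replace=False)
sens_hat = w[val][x[val] == 1].mean()
spec_hat = 1.0 - w[val][x[val] == 0].mean()

p_obs = w.mean()                                    # naive prevalence estimate
p_corr = (p_obs - (1 - spec_hat)) / (sens_hat + spec_hat - 1)  # Rogan-Gladen

print(f"naive: {p_obs:.3f}  corrected: {p_corr:.3f}  true: {x.mean():.3f}")
```

The naive proportion is badly biased whenever sensitivity and specificity differ from one, while the corrected estimate lands near the truth; the same logic, extended to regression coefficients, is what the error correction methods compared in the simulations implement.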