diff --git a/article.Rtex b/article.Rtex
index a41d61a..ad5b878 100644
--- a/article.Rtex
+++ b/article.Rtex
@@ -1254,20 +1254,20 @@ Instead of tayloring an AC for a research study, using predictive features direc
 \item These issues become even more important, and also more complex in important research designs such as those involving multiple languages.
-\subsection{Imperfect human-coded validation data}
+% \subsection{Imperfect human-coded validation data}
-All approaches stated above depend on the human-coded validation data $X^*$. Most often, ACs are also trained on human-coded material. The content analysis literature has long been documented how unreliable human coding and manual content analysis papers routinely report intercoder reliability as a result \citep{krippendorff_content_2018}. Intercoder reliability metrics typically assume that human coders are interchangeable and the only source of disagreement is ``coder idiosyncrasies'' \citep{krippendorff_reliability_2004}. A previous monte-carlo simulation operationalizes these ``coder idiosyncrasies'' as a fixed probability that a coder makes a random guess independent of the coder and of the material \citep{geis_statistical_2021}. In this work, we accept this ``interchangeable coders making random errors'' (ICMRE) assumption. Under this optimistic assumption, only ``coder idiosyncrasies'' cause misclassification error in the validation data.
+% All approaches stated above depend on the human-coded validation data $X^*$. Most often, ACs are also trained on human-coded material. The content analysis literature has long documented how unreliable human coding can be, and manual content analysis papers routinely report intercoder reliability as a result \citep{krippendorff_content_2018}. Intercoder reliability metrics typically assume that human coders are interchangeable and that the only source of disagreement is ``coder idiosyncrasies'' \citep{krippendorff_reliability_2004}. A previous Monte Carlo simulation operationalizes these ``coder idiosyncrasies'' as a fixed probability that a coder makes a random guess, independent of the coder and of the material \citep{geis_statistical_2021}. In this work, we accept this ``interchangeable coders making random errors'' (ICMRE) assumption. Under this optimistic assumption, only ``coder idiosyncrasies'' cause misclassification error in the validation data.
-\citet{song_validations_2020}'s monte-carlo simulation demonstrates that human-coded $X^*$ with a lower intercoder reliability generates more biased classification accuracy of the AC. So even if manual annotation errors are only due to the ICMRE assumption, they may bias results. None of the above correction approaches account for the imperfect human coding of $X^*$, although \citet{zhang_how_2021} identifies the omission of this as a weakness of his proposed approach. Even in the context of manual content analysis, these ``coder idiosyncrasies'' are not routinely adjusted (although methods are available, e.g. \citet{bachl_correcting_2017}).
+% \citet{song_validations_2020}'s Monte Carlo simulation demonstrates that human-coded $X^*$ with lower intercoder reliability yields more biased estimates of the AC's classification accuracy. So even if manual annotation errors arise only from the ``coder idiosyncrasies'' posited by the ICMRE assumption, they may bias results. None of the above correction approaches accounts for the imperfect human coding of $X^*$, although \citet{zhang_how_2021} identifies this omission as a weakness of his proposed approach. Even in the context of manual content analysis, these ``coder idiosyncrasies'' are not routinely adjusted for (although methods are available, e.g. \citet{bachl_correcting_2017}).
+% An advantage of our proposed method over prior approaches is that it automatically accounts for imperfections in human coding under the ICMRE assumption, because the random errors in the validation data are independent of the AC's errors.
-Precision of estimates can be improved using more than one independent coder. With two coders, for example, two sets of validation data are generated, $X^*_{1}$, $X^*_{2}$. We then list-wise delete all data that $X^*_{1} \neq X^*_{2}$. If the ICMRE assumption holds, the deleted data, where two coders disagree, can only be due to ``coder idiosyncrasies''. As coders are assumed to be interchangeable, the probability of two interchangeable coders both making the same misclassification error is much less than the probability that one makes a misclassification error . Using such ``labeled-only, coherent-only'' (LOCO) data improves the precision of consistent estimates in our simulation.
+% Precision of estimates can be improved by using more than one independent coder. With two coders, for example, two sets of validation data are generated, $X^*_{1}$ and $X^*_{2}$. We then list-wise delete all data where $X^*_{1} \neq X^*_{2}$. If the ICMRE assumption holds, the deleted data, where the two coders disagree, can only be due to ``coder idiosyncrasies''. As coders are assumed to be interchangeable, the probability of two interchangeable coders both making the same misclassification error is much smaller than the probability that one coder makes a misclassification error. Using such ``labeled-only, coherent-only'' (LOCO) data improves the precision of consistent estimates in our simulation.
-\subsection{Measurement error in validation data}
+% \subsection{Measurement error in validation data}
-The simulations above assume that validation data is perfectly accurate. This is obviously unrealistic because, validation data, such as that obtained from human classifiers, normally has inaccuracies.
-To evaluate the robustness of correction methods to imperfect validation data, we extend our scenarios with with nondifferential error with simulated validation data that is misclassified \Sexpr{format.percent(med.loco.accuracy)} of the time at random.
+% The simulations above assume that validation data is perfectly accurate. This is obviously unrealistic because validation data, such as that obtained from human classifiers, normally has inaccuracies.
+% To evaluate the robustness of correction methods to imperfect validation data, we extend our scenarios with nondifferential error, simulating validation data that is misclassified \Sexpr{format.percent(med.loco.accuracy)} of the time at random.
 \subsubsection{Recommendation II: Employ at Least Two Manual Coders, not One}
diff --git a/resources/real_data_example.R b/resources/real_data_example.R
index f59aaa4..ac639c4 100644
--- a/resources/real_data_example.R
+++ b/resources/real_data_example.R
@@ -12,9 +12,9 @@
 iv.sample.count <- iv.example$cc_ex_tox.likes.race_disclosed.medsampsample.count
 dv.sample.count <- dv.example$cc_ex_tox.likes.race_disclosed.largesampsample.count
-plot.cc.example <- function(datalist, name, varnames=NULL, varorder=NULL, include.models=c("Automated Classifications", "Manual Annotations")){
+plot.cc.example <- function(datalist, name, varnames=NULL, varorder=NULL, include.models=c("Automatic Classification", "All Annotations")){
- model.names <- c("Automated Classifications", "Manual Annotations", "Annotation Sample", "Error Correction")
+ model.names <- c("Automatic Classification", "All Annotations", "Annotation Sample", "Error Correction")
 glm.par.names <- paste0(name,"coef_",c("pred", "coder", "sample"), "_model")
@@ -62,6 +62,8 @@ plot.cc.example <- function(datalist, name, varnames=NULL, varorder=NULL, includ
 df[,Model:= factor(gsub('\\.',' ', Model), levels=rev(model.names))]
 df <- df[Model %in% include.models]
+ rename_models <- c("Automatic Classification"="Automated Classifications", "All Annotations"="Manual Annotations")
+ levels(df$Model) <- ifelse(levels(df$Model) %in% names(rename_models), rename_models[levels(df$Model)], levels(df$Model))
 p <- ggplot(df[variable != "Intercept"], aes(y = Estimate, x=Model, ymax=LowerCI, ymin=UpperCI, group=variable))
 p <- p + geom_pointrange(shape=1) + facet_wrap('variable',scales='free_x',nrow=1,as.table=F) + geom_hline(aes(yintercept=0),linetype='dashed',color='gray40') + coord_flip() + xlab("")
@@ -70,8 +72,9 @@
 }
-plot.civilcomments.dv.example <- function(include.models=c("Automatic Classification", "All Annotations")){
- return(plot.cc.example(dv.example, "cc_ex_tox.likes.race_disclosed.medsamp", varnames=c("Intercept", "Likes", "Identity Disclosure", "Likes:Identity Disclosure"),varorder=c("Intercept", "Likes", "Identity Disclosure", "Likes:Identity Disclosure"), include.models=include.models) + ylab("Coefficients and 95% Confidence Intervals") + ggtitle("Logistic Regression Predicting Toxicity"))
+plot.civilcomments.dv.example <- function(include.models=c("Automatic Classification", "All Annotations")){
+ p <- plot.cc.example(dv.example, "cc_ex_tox.likes.race_disclosed.medsamp", varnames=c("Intercept", "Likes", "Identity Disclosure", "Likes:Identity Disclosure"),varorder=c("Intercept", "Likes", "Identity Disclosure", "Likes:Identity Disclosure"), include.models=include.models) + ylab("Coefficients and 95% Confidence Intervals") + ggtitle("Logistic Regression Predicting Toxicity")
+ return(p)
 }
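The LOCO filtering step described in the commented-out article text can be sketched in R. This is an illustrative simulation, not code from the replication materials; the 10% random-guess rate, the base rate, and all variable names are assumptions.

```r
# Sketch of "labeled-only, coherent-only" (LOCO) filtering under the ICMRE
# assumption: two interchangeable coders each make a random guess with a
# fixed probability, and documents where they disagree are list-wise deleted.
set.seed(42)
n <- 10000
truth <- rbinom(n, 1, 0.3)                    # latent true class per document

code.icmre <- function(x, p.guess = 0.1) {    # one coder under ICMRE
  guessing <- rbinom(length(x), 1, p.guess) == 1
  x[guessing] <- rbinom(sum(guessing), 1, 0.5) # random guess when guessing
  x
}

coder1 <- code.icmre(truth)
coder2 <- code.icmre(truth)

agree <- coder1 == coder2                     # keep only coherent codes
loco <- coder1[agree]

# A LOCO label is wrong only if both coders independently made the same
# error, so LOCO labels are more accurate than a single coder's labels:
single.accuracy <- mean(coder1 == truth)
loco.accuracy <- mean(loco == truth[agree])
```

With these assumed parameters, a single coder is right roughly 95% of the time, while the agreement-filtered LOCO labels are nearly error-free, at the cost of discarding the (few) disagreeing documents.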