32 lines
2.0 KiB
Org Mode
** Next simulations to run
- Focus on the BHK *multiple imputation* approach. We don't actually need overimputation when we have some gold-standard data.
- The idea of this paper is basically to say "Supervised learning fits well into this framework. So use it!"
- Important findings will involve:
  - How well does this work when the learner is biased?
    Probably works great as long as you can account for the bias in imputation + regression.
  - How well does this work when features are unavailable?
    - if the
  - How well does this work when features are available?
  - How well does this work when measurement error is correlated with the independent and dependent variables?
  - How well does this work when there's a very large number of missing values?

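A rough sketch of what the "biased learner" simulation could look like. Everything here is an assumption for illustration: the data-generating process, the coefficient values, and the names =ols= and =simulate= are all made up. It is also a simplified multiple imputation, since it draws imputations from a fitted model without propagating uncertainty in the imputation model's parameters, so it only speaks to point estimates and not to standard errors.

```python
import math
import random
import statistics


def ols(rows, y):
    """Least-squares coefficients; each row must already contain a leading 1.0."""
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate(n=5000, n_val=500, n_imputations=20, seed=1):
    """Return (naive slope, pooled imputation slope); the true slope is 2.0."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]              # true covariate
    y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]   # outcome
    # Hypothetical biased, noisy learner prediction standing in for x.
    w = [0.5 + 0.8 * xi + rng.gauss(0, 0.6) for xi in x]

    # Gold-standard rows, sampled completely at random.
    val_idx = sorted(rng.sample(range(n), n_val))
    is_val = set(val_idx)

    # Naive analysis: plug the proxy straight into the regression.
    naive = ols([[1.0, wi] for wi in w], y)[1]

    # Imputation model x ~ w + y, fit on the gold-standard rows only.
    v_rows = [[1.0, w[i], y[i]] for i in val_idx]
    v_x = [x[i] for i in val_idx]
    g = ols(v_rows, v_x)
    resid = [xi - sum(gj * rj for gj, rj in zip(g, r)) for r, xi in zip(v_rows, v_x)]
    sigma = math.sqrt(sum(e * e for e in resid) / (n_val - len(g)))

    slopes = []
    for _ in range(n_imputations):
        # Keep observed x where we have it; otherwise draw from the fitted model.
        x_imp = [
            x[i] if i in is_val
            else g[0] + g[1] * w[i] + g[2] * y[i] + rng.gauss(0, sigma)
            for i in range(n)
        ]
        slopes.append(ols([[1.0, xi] for xi in x_imp], y)[1])
    # Pool point estimates by averaging across imputations.
    return naive, statistics.fmean(slopes)


if __name__ == "__main__":
    naive_slope, mi_slope = simulate()
    print("naive:", naive_slope, "imputed:", mi_slope)
```

Under these assumptions the naive slope should be attenuated, while the imputation-based slope should land near the true value, because the imputation model is fit on x itself and so absorbs the learner's bias.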
** How well does supervised ML measurement fit into BHK?
*** Unbiased proxy assumption: If data is missing but there's a proxy variable that's unbiased, we can use m=1 techniques instead of m=2 techniques.

m=1 techniques are simpler and more powerful, but m=2 techniques may be more robust, given that we can't assume supervised learners are unbiased.

That said, if we have access to the features, then we can use the predictions and the features together as our proxy variable. Since any bias in the predictions w will be correlated with the features, including the features in the likelihood will reduce the bias.

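A toy check of the intuition that features can soak up the learner's bias. This is not the BHK likelihood; it is just a linear adjustment, and the feature =z=, the coefficients, and the function names are hypothetical. The error in the proxy is made to depend on =z=, that dependence is estimated on the gold-standard rows, and then it is subtracted everywhere.

```python
import math
import random
import statistics


def fit_line(u, v):
    """Intercept and slope of the least-squares line v ~ u."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    slope = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / sum((a - mu) ** 2 for a in u)
    return mv - slope * mu, slope


def corr(u, v):
    """Pearson correlation."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


def rmse(errors):
    return math.sqrt(statistics.fmean(e * e for e in errors))


def feature_adjustment(n=5000, n_val=1000, seed=3):
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]                 # hypothetical feature
    x = [0.5 * zi + rng.gauss(0, 1) for zi in z]            # true value
    # Assumed learner: its error (w - x) depends on the feature z.
    w = [xi + 0.7 * zi + rng.gauss(0, 0.5) for xi, zi in zip(x, z)]

    # Estimate the feature-dependent error on randomly sampled gold-standard rows.
    val = rng.sample(range(n), n_val)
    a, b = fit_line([z[i] for i in val], [w[i] - x[i] for i in val])
    w_adj = [wi - (a + b * zi) for wi, zi in zip(w, z)]

    # Only possible in a simulation: compare against the true x on every row.
    err_raw = [wi - xi for wi, xi in zip(w, x)]
    err_adj = [wi - xi for wi, xi in zip(w_adj, x)]
    return {
        "corr_raw": corr(err_raw, z),
        "corr_adj": corr(err_adj, z),
        "rmse_raw": rmse(err_raw),
        "rmse_adj": rmse(err_adj),
    }


if __name__ == "__main__":
    print(feature_adjustment())
```

If the bias really is a function of the features, the adjusted error should be close to uncorrelated with =z= and smaller overall. Whether the same holds inside the full likelihood is what the simulations need to show.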
*** IMMA assumption: The distribution of the mismeasurement indicator, m, is the same no matter the value of the missing data.

This isn't a problem if ground truth is randomly sampled.
It is a problem if ground truth is based on a stratified sample.

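A small illustration of why the sampling scheme matters, under one assumed case: the chance of getting a gold-standard label depends on the true value itself. The error rates, sampling probabilities, and the name =precision_estimates= are invented for the example. It compares the classifier's precision as estimated from a random validation sample and from a selective one.

```python
import random


def precision_estimates(n=20000, seed=2):
    """Return precision on (all rows, random validation rows, selective validation rows)."""
    rng = random.Random(seed)
    x = [rng.random() < 0.3 for _ in range(n)]              # true binary label
    # Assumed classifier: 80% recall, 10% false-positive rate.
    w = [rng.random() < (0.8 if xi else 0.1) for xi in x]

    def precision(idx):
        flagged = [i for i in idx if w[i]]
        return sum(x[i] for i in flagged) / len(flagged)

    # Ground truth sampled completely at random.
    random_val = [i for i in range(n) if rng.random() < 0.2]
    # Ground truth sampled with probability that depends on the true label.
    selective_val = [i for i in range(n) if rng.random() < (0.4 if x[i] else 0.1)]

    return precision(range(n)), precision(random_val), precision(selective_val)


if __name__ == "__main__":
    print(precision_estimates())
```

In this setup the random sample should recover the population precision, while the selective sample overstates it, so any error model calibrated on the selective sample would be off unless the sampling is modeled or reweighted.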
*** Measurement error distribution assumption: The distribution of the proxy variable (conditional on the missing data, the observed data, and its distributional parameters) is known up to its parameters. The parameters are either known or a consistent estimator is available.

This isn't a problem if we have ground truth because we can use the ground truth to estimate the parameters.

If we don't have ground truth, we'll have to guess at the parameters.
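
To make "use the ground truth to estimate the parameters" concrete, here is one possible version under an assumed additive normal error model (the model and the symbols u, \mu_u, \sigma_u, and V are illustrative assumptions, not something fixed by these notes). Let V be the set of rows with gold-standard values:

\begin{align*}
w_i &= x_i + u_i, \qquad u_i \sim \mathcal{N}(\mu_u, \sigma_u^2) \\
\hat{\mu}_u &= \frac{1}{|V|} \sum_{i \in V} (w_i - x_i) \\
\hat{\sigma}_u^2 &= \frac{1}{|V|} \sum_{i \in V} \left( w_i - x_i - \hat{\mu}_u \right)^2
\end{align*}

If V is a random sample and the error model holds, both estimators are consistent as |V| grows, which is what the assumption asks for. A learner whose error depends on x or on the features would need a richer error model than this one.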