Reflexive Consistency About The Right Thing
Today at work I proposed a sampling scheme for our ML training. Our data is highly imbalanced, and more so because we don't know ground truth for most of the examples that are (roughly) at least 2% likely to be in the minority class. Our architecture does not allow us to use all the data in training, so we downsample the target=0 examples and, to a lesser extent, the no-ground-truth examples, giving them weights inversely proportional to their sample rates.
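For concreteness, here's roughly what that downsampling looks like. This is just a sketch; the column names, label types, and rates are made up, but the inverse-of-sample-rate weighting is the point.

```python
import numpy as np
import pandas as pd

# Hypothetical version of the current scheme: keep every labeled positive,
# downsample the negatives heavily and the unlabeled examples less so, and
# weight each kept row by 1 / sample_rate so the training objective stays
# unbiased in expectation. All names and rates here are illustrative.
SAMPLE_RATES = {"positive": 1.0, "negative": 0.05, "unlabeled": 0.25}

def downsample(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    rates = df["label_type"].map(SAMPLE_RATES)
    keep = rng.random(len(df)) < rates
    sampled = df.loc[keep].copy()
    sampled["weight"] = 1.0 / rates[keep]
    return sampled
```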
The proposal: What if we sample our data in proportion to each example's probability of target=1? Then calibrate to get probabilities from the new model, resample and train, calibrate, repeat? That would definitely converge.
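A rough sketch of that loop, with `train_model()` and `calibrate()` standing in for whatever the real pipeline does, and the sampling budget, round count, and tolerance chosen arbitrarily:

```python
import numpy as np

# Sketch of the proposal as stated: sample each example with probability
# proportional to the current calibrated score, weight by the inverse of that
# rate, retrain, recalibrate, and stop when the scores stop moving.
def score_proportional_iteration(X, y, scores, train_model, calibrate,
                                 budget=100_000, n_rounds=5, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    model = None
    for _ in range(n_rounds):
        rate = np.clip(scores * budget / scores.sum(), 1e-6, 1.0)
        keep = rng.random(len(X)) < rate
        model = train_model(X[keep], y[keep], sample_weight=1.0 / rate[keep])
        new_scores = calibrate(model, X)           # calibrated P(target=1)
        if np.max(np.abs(new_scores - scores)) < tol:
            break                                  # a fixed point, but of what?
        scores = new_scores
    return model, scores
```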
The rebuttal: "I agree that [it] should get convergence, but can you explain why you think it would converge to the correct result?"
No, no I cannot, and I'm pretty sure it would not. I didn't suggest re-labeling, just adjusting the sampling scheme. Why would I expect that sending the training algorithm a huge helping of the most extreme examples would result in good discrimination? Shouldn't I rather send over a good variety of every type of example?
My modified proposal: Okay, right, this is about the sampling scheme. That mostly affects the target=0 examples, and to some extent the no-ground-truth examples. Maybe the actual win here is sampling roughly equal amounts from each "type" of example, and maybe we can approximate that by binning on score bands that are equal-width in some space and drawing roughly the same number from each bin. And then, sure, if that works well, we can make it reflexively consistent and try that experiment too.
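Something like this, where the band count, per-band budget, and the choice of log-odds as the "some space" are all guesses on my part:

```python
import numpy as np

# Sketch of the modified proposal: bucket examples into equal-width score
# bands (equal-width in log-odds space here) and take roughly the same number
# from each band, again weighting by the inverse of each band's sample rate.
def equal_band_sample(scores, per_band=10_000, n_bands=20, seed=0):
    rng = np.random.default_rng(seed)
    p = np.clip(scores, 1e-6, 1 - 1e-6)
    logodds = np.log(p / (1 - p))
    edges = np.linspace(logodds.min(), logodds.max(), n_bands + 1)
    bands = np.clip(np.digitize(logodds, edges) - 1, 0, n_bands - 1)
    chosen, weights = [], []
    for b in range(n_bands):
        idx = np.flatnonzero(bands == b)
        if len(idx) == 0:
            continue
        take = min(per_band, len(idx))
        pick = rng.choice(idx, size=take, replace=False)
        chosen.append(pick)
        weights.append(np.full(take, len(idx) / take))  # inverse sample rate
    return np.concatenate(chosen), np.concatenate(weights)
```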
What mistake did I make? I started by thinking about reflexive consistency. I wanted to make the scores output by the system consistent with the scores input to the system. But that sounded scary (though maybe it's still a good idea!), so I backed off to "hey, we downsample low scores quite a lot, [cue key mistake] and since we downsample high scores less than low, maybe we should rescore so that we know which examples are now thought of as high score and can be downsampled less!" This conflated sample rate with score: we weren't actually downsampling low scores, we were downsampling overrepresented examples, and as it happens, the overrepresented examples mostly have low scores. So the correct observation should have been:
"We downsample overrepresented examples for performance reasons. Let's try (a) being more deliberate about checking for overrepresentation even within each class of example, for example we know we end up with tons of really low scoring examples, and (b) if that works, then our overrepresentation checks use the output of past models as an input, so iterate on new models until we hit an approximate fixed point such that we're using the most up-to-date guess about scores, i.e. from the new model, to get good information to the new model's training."