External Views on Internal Truth

The way to maintain a truth-seeking culture on your data science team, while respecting the constraints of sales, marketing, board presentations, simplicity, and so on, is to separate messy, hard-to-explain, truth-oriented analysis from useful views on top of the truth.

This is an example of a more general heuristic: any time you're optimizing for "two objectives at once", take a step back and try to figure out how you can optimize for just one objective. A common solution is to split The Thing into multiple parts, each of which can optimize for just one objective, and then combine the results. That's what we'll do here. (Another common solution is to combine your objectives via some sort of parameter, expose the parameter as a "business" lever, refactor until the parameter you're exposing is a good business lever, then optimize for the explicit combination of objectives.)

Suppose, for example, you're something like Kiva, making micro-loans to under-served individuals. You've got a pretty great repayment rate, like 95%. But you also reject a number of possible loans. What can you say about the rejected population?

The most blatant problem is if the external-facing repayment rate of 95% is taken as The Important Metric internally, and then you completely ignore the rejected population and talk internally as if 95% is a fundamental number rather than one point on a tradeoff curve. Maybe you reject 10% of possible loans, and actually 90% of them would have repaid, so you could just stop rejecting and still have a 94.5% external-facing repayment rate. But maybe you're really good at discrimination (in the classifier sense: telling good loans from bad), and actually only 20% of them would have repaid, and you'd have an 87.5% external-facing repayment rate. Internally, it's super-important to figure out which is true. Externally, maybe you straight-up can't talk about this, because it's too fraught with PR peril. Don't let the PR peril dictate what you study internally.
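The arithmetic behind those two scenarios is just a weighted average of the accepted and rejected populations. Here is a minimal sketch; the function name and parameter names are mine, and the numbers are the ones from the paragraph above.

```python
def blended_repayment_rate(accepted_rate, reject_share, rejected_rate):
    """Repayment rate you would see if you funded everyone.

    accepted_rate: repayment rate among loans you currently accept
    reject_share:  fraction of applicants you currently reject
    rejected_rate: (unknown!) repayment rate the rejected would have had
    """
    accept_share = 1.0 - reject_share
    return accept_share * accepted_rate + reject_share * rejected_rate


# Scenario 1: the rejected population is nearly as good as the accepted one.
optimistic = blended_repayment_rate(0.95, 0.10, 0.90)

# Scenario 2: the rejection rule is actually discriminating well.
pessimistic = blended_repayment_rate(0.95, 0.10, 0.20)

print(f"{optimistic:.1%}")   # 94.5%
print(f"{pessimistic:.1%}")  # 87.5%
```

The point is that `rejected_rate` is the one input you can't read off the external-facing number, and it moves the answer by seven points.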

But maybe you do need to talk about it. Maybe you need to say "we estimate that only 10% of those we reject would actually have been fruitful partners" or something. I'm not good at PR. Basically, external-facing false-positives. Here it becomes crucial to do good internal analysis and then sanitize it for external consumption.

You probably don't actually have a simple accept/reject binary internally. Maybe you have bins, like "absolutely loan", "probably loan", "on the fence", "pretty bad", and "definitely do not". And what you really do is cherry-pick from the "pretty bad", accept everything better, and reject the "definitely do not", except you also accept 5% of what you think you should reject, to get an idea of ground truth.

Internally you want to know the actual counts of repayment in each bucket, your estimates and confidence intervals from extrapolating the ground-truth data to the unseen population in various ways, etc. Make sure this happens, even if the external formula has to be literally just "# of that 5% which repaid / total # in that 5%". That's just one piece of interesting data internally. Keep digging. Stratify and recombine, smooth small data, and so on. But don't just say "this is what we present externally, so this is the number we will inscribe into our intuitions internally".
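To make "counts per bucket, with confidence intervals, then stratify and recombine" concrete, here is a rough sketch. Everything in it is an assumption for illustration: the counts are made up, the data layout is hypothetical, and the Wilson score interval is just one reasonable choice for small-sample proportions, not the only one.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# HYPOTHETICAL counts: bucket -> (applicants, funded, repaid).
# "definitely do not" funds only the 5% ground-truth sample.
buckets = {
    "absolutely loan":   (400, 400, 392),
    "probably loan":     (300, 300, 285),
    "on the fence":      (150, 150, 135),
    "pretty bad":        (100, 40, 30),   # cherry-picked, so a biased sample
    "definitely do not": (200, 10, 3),
}


def bucket_report(buckets):
    """Per-bucket observed repayment rate with a Wilson interval."""
    report = {}
    for name, (_applicants, funded, repaid) in buckets.items():
        rate = repaid / funded if funded else float("nan")
        report[name] = (rate, *wilson_interval(repaid, funded))
    return report


def recombined_rate(buckets):
    """Stratify and recombine: weight each bucket's observed rate by its
    share of ALL applicants, not just the funded ones."""
    total = sum(applicants for applicants, _, _ in buckets.values())
    return sum(
        applicants * (repaid / funded)
        for applicants, funded, repaid in buckets.values()
    ) / total


for name, (rate, lo, hi) in bucket_report(buckets).items():
    print(f"{name:18s} rate={rate:.2f}  interval=({lo:.2f}, {hi:.2f})")
print(f"fund-everyone estimate: {recombined_rate(buckets):.3f}")
```

Note how wide the interval on the tiny "definitely do not" sample is compared to the big buckets, and that the cherry-picked "pretty bad" rate overstates what the unfunded part of that bucket would do. Those are exactly the caveats that disappear when only the single external number survives.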
