[Review] Counterfactual Estimation and Optimization of Click Metrics for Search Engines
Paper Review
Counterfactual Estimation and Optimization of Click Metrics for Search Engines - Lihong Li, Shunbao Chen, Jim Kleban, Ankur Gupta
Also see Stripe's talk on how they use these concepts in production.
Condensed abstract: In this paper, we propose to address [the expense of A/B testing] using causal inference techniques, under the contextual-bandit framework. This approach effectively allows one to run (potentially infinitely) many A/B tests offline from search logs, making it possible to estimate and optimize online metrics quickly and inexpensively.
In Brief
If you've got a system that makes choices about what to do with items, and you get feedback about how good your choice turned out, you can choose to randomize your choices somewhat in a way that lets you retrospectively analyze how any decision algorithm would have performed.
The important things are to make sure you have a nonzero chance of making every choice on every item, and to make sure you know the actual chance of making every single choice for every single item.
If you can do that, then the procedure is: pull out the shiny new algorithm you're thinking of using, and label each historical item with the choice the new algorithm would make on it. Weight each item by the reciprocal of the probability that the real historical system made the choice it actually made. Then sum the weighted values of the items where the historical choice matches what your shiny new algorithm would have done, and divide by the total number of historical items. Congratulations, you have estimated your value metric for your shiny new algorithm as if it had been run on the historical data!
(You probably will want to incur a little bit of bias to get rid of a lot of variance by capping weights at 20, or 100, or something.)
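Here's a minimal sketch of that estimator in Python. It assumes you've logged, for each item, the choice the historical system made, the probability (propensity) it had of making it, and the observed value; the function name and the `weight_cap` default are just illustrative:

```python
import numpy as np

def ips_estimate(logged_actions, propensities, values, new_policy_actions, weight_cap=20.0):
    """Capped inverse-propensity-scored estimate of a new policy's average value.

    logged_actions     : choice the historical system actually made for each item
    propensities       : probability the historical system had of making that choice
    values             : observed value/metric for each logged item
    new_policy_actions : choice the shiny new algorithm would make for each item
    weight_cap         : cap on 1/propensity, trading a little bias for a lot less variance
    """
    logged_actions = np.asarray(logged_actions)
    propensities = np.asarray(propensities, dtype=float)
    values = np.asarray(values, dtype=float)
    new_policy_actions = np.asarray(new_policy_actions)

    # Only items where the new policy agrees with the historical choice contribute.
    match = logged_actions == new_policy_actions
    weights = np.minimum(1.0 / propensities, weight_cap)

    # Divide by the number of ALL logged items, not just the matching ones.
    return float(np.sum(values * weights * match)) / len(values)
```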
Worked Example
CW: the real world is racist and classist
Suppose your two choices are "approve this loan" or "don't approve this loan". You have many potential decision algorithms, but they all boil down in the end to "approve or reject". Let's consider two real-world decisioning algorithms and see how counterfactual estimation works:
- Algorithm A: approve if the applicant looks white to you, reject otherwise.
- Algorithm B: approve if the applicant looks rich to you, reject otherwise.
You're currently using algorithm A but have been informed that you're acting like a super-racist asshole, so you're considering switching to algorithm B next year. You run an A/B test, randomly using algorithm B 10% of the time this year, and then compare the results.
Unfortunately, next year, you are informed that since your intuitions are those of a super-racist asshole, your evaluation of whether an applicant looks rich is far too correlated with whether they look white to you. This time you're handed a nice new approach that will at least be up to industry standard even if it's still not ethical:
- Approve if the applicant has a high enough credit score, reject otherwise.
You go to check how well this approach will work. Here's your data:
| | APPROVE no default | APPROVE but default | REJECT |
| --- | ---: | ---: | ---: |
| WHITE, RICH, CREDIT | 922 | 28 | 0 |
| WHITE, RICH, ~CREDIT | 35 | 15 | 0 |
| WHITE, ~RICH, CREDIT | 706 | 14 | 80 |
| WHITE, ~RICH, ~CREDIT | 135 | 45 | 20 |
| ~WHITE, RICH, CREDIT | 88 | 2 | 810 |
| ~WHITE, RICH, ~CREDIT | 68 | 22 | 10 |
| ~WHITE, ~RICH, CREDIT | 0 | 0 | 700 |
| ~WHITE, ~RICH, ~CREDIT | 0 | 0 | 300 |
There's a massive problem here! You want to know what the default rate will be when approving based on credit score. The naive thing is to take the approved-but-defaulted counts in the CREDIT rows and divide by all the approvals in those rows: (28+14+2+0)/(950+720+90+0) = 2.5%. ...right? Well, no, definitely not. There's a lot of selection bias in that calculation. We rejected 810 applications from non-white, rich-looking people with good credit. Worse, we rejected literally every application from non-white, non-rich people; 700 of those had good credit, and you have no idea what fraction of them would have defaulted.
What you should have done is instead (e.g.) choose to randomly approve 5% of those loans you'd otherwise reject. This gives you a little bit of data in every category. The analysis after-the-fact to see how the new credit-score algorithm performs will be more difficult, but will not have divide-by-zero problems.
(By-the-book, you'd do something more like making a random 50/50 decision on 10% of apps, but in this case we're only going to calculate a metric over the approved apps, so we don't need to experiment with randomly rejecting cases.)
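The critical part is recording the propensity at decision time. Here's a minimal sketch of what the decision-making side might look like, assuming a hypothetical `base_policy` function that returns your usual approve/reject decision and the 5% exploration rate described above:

```python
import random

EXPLORE_RATE = 0.05  # chance of approving an application the base policy would reject

def decide_and_log(application, base_policy, log):
    """Make a (possibly randomized) decision and record the probability it had of happening."""
    base_decision = base_policy(application)

    if base_decision == "approve":
        decision, propensity = "approve", 1.0
    elif random.random() < EXPLORE_RATE:
        decision, propensity = "approve", EXPLORE_RATE  # lucked into the 5%
    else:
        decision, propensity = "reject", 1.0 - EXPLORE_RATE

    # The propensity must be logged now; you cannot reconstruct it after the fact.
    log.append({"application": application, "decision": decision, "propensity": propensity})
    return decision
```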
If you had done your tests that way, you'd get this data:
| | APPROVE no default | APPROVE but default | REJECT |
| --- | ---: | ---: | ---: |
| WHITE, RICH, CREDIT | 922 | 28 | 0 |
| WHITE, RICH, ~CREDIT | 35 | 15 | 0 |
| WHITE, ~RICH, CREDIT | 784 | 16 | 0 |
| WHITE, ~RICH, ~CREDIT | 150 | 50 | 0 |
| ~WHITE, RICH, CREDIT | 44 | 1 | 855 |
| ~WHITE, RICH, ~CREDIT | 4 | 1 | 95 |
| ~WHITE, ~RICH, CREDIT | 35 | 0 | 665 |
| ~WHITE, ~RICH, ~CREDIT | 12 | 3 | 285 |
You'll weight everyone you would always have approved as 1.0 and everyone who lucked into the 5% random-approval group as 20.0 (the reciprocal of that 5% probability). Now you can evaluate any scheme you want! Let's say you're just interested in the default rate, no other metric.
- Racist: (28+15+16+50)/(950+50+800+200) = 5.45%
- Classist: (28+15+1*20+1*20)/(950+50+900+100) = 4.15%
- Disclaimist: (28+16+1*20+0*20)/(950+800+900+700) = 1.91%
For completeness, here are the approval rates under the three schemes (no special weighting needed for these, of course); the code sketch after these lists reproduces all of the arithmetic.
- Racist: (950+50+800+200)/(total=4000) = 50%
- Classist: (950+50+900+100)/4000 = 50%
- Disclaimist: (950+800+900+700)/4000 = 83.75%
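To make the arithmetic concrete, here's a short Python sketch that hard-codes the second table and reproduces the three default rates and approval rates, using propensities of 1.0 for white applicants (always approved under the old policy) and 0.05 for everyone else (the exploration rate):

```python
# Rows of the second table: (white, rich, credit, approved_no_default, approved_default, rejected)
rows = [
    (True,  True,  True,  922, 28,   0),
    (True,  True,  False,  35, 15,   0),
    (True,  False, True,  784, 16,   0),
    (True,  False, False, 150, 50,   0),
    (False, True,  True,   44,  1, 855),
    (False, True,  False,   4,  1,  95),
    (False, False, True,   35,  0, 665),
    (False, False, False,  12,  3, 285),
]

policies = {
    "Racist":      lambda white, rich, credit: white,
    "Classist":    lambda white, rich, credit: rich,
    "Disclaimist": lambda white, rich, credit: credit,
}

TOTAL = sum(ok + bad + rej for *_, ok, bad, rej in rows)  # 4000 applications

for name, approve in policies.items():
    weighted_defaults = 0.0
    would_approve = 0
    for white, rich, credit, ok, bad, rej in rows:
        if not approve(white, rich, credit):
            continue
        would_approve += ok + bad + rej
        # Historical approval probability: 1.0 for white applicants, 0.05 for everyone else.
        propensity = 1.0 if white else 0.05
        weighted_defaults += bad / propensity
    default_rate = weighted_defaults / would_approve
    approval_rate = would_approve / TOTAL
    print(f"{name:12s} default rate {default_rate:.2%}, approval rate {approval_rate:.2%}")
```

Running this prints default rates of 5.45%, 4.15%, and 1.91% and approval rates of 50%, 50%, and 83.75%, matching the hand calculations above.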
The moral of the story is that you should always throw some random decisions into the mix. This lets you do exploratory analysis later. But you must faithfully record how likely each decision was at the moment you made it, or else you won't know whether that 20.0 weight should really have been 10.0, or maybe 35.0. Who remembers what the probability was last October?