Estimating Now What We'll Later Know The Rate Will Be

Here's the setup: a stream of events comes in over time, most from working sources and some from broken sources. We're trying to discriminate between them in real time, accepting only the legitimate events. When we accept an event from a broken source, we only learn that it was broken at some point in the future. Nevertheless, we'd like to estimate the fraction of our accepted events which are from broken sources in as near to real time as possible.

We have lots of historical data, so overall we have a very good idea of our past broken rates, and also of the distribution of delays before we learn which events were broken. Unfortunately, sometimes we'd like to know the broken rate within a relatively small segment, and we might not have much historical data for that specific segment.

Let's discuss some ways of performing the estimation and their pros and cons.

Past Performance Perfectly Predicts Pfuture

Bucket the delay times and find the overall historical distribution of delays per broken event. Then estimate the number of broken events accepted within the last day as the number you've seen so far divided by the historical fraction revealed within one day. So if you've seen 10, and historically about 5% are revealed within one day, that's 10/5% = 200 that you'll eventually learn you accepted today. Then just sum over whatever time period you're interested in and divide by the total accepted during that period.

Pros: Super simple. Fairly intuitive.
Cons: If you usually accept 100, today you accepted 110, and you already know 10 were from broken sources, you're going to estimate that 200 of your 110 were broken, and that's just silly.
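A minimal sketch of this estimator in Python. The function and argument names are hypothetical; it assumes we track, per day, the accepted count, the already-revealed broken count, and the historical reveal-fraction curve.

```python
def naive_estimate(revealed_broken, accepted, reveal_fraction):
    """Estimate the fraction of accepted events that will turn out to be broken.

    All inputs are lists indexed by "days ago" (0 = today):
      revealed_broken[d] - events accepted d days ago already known to be broken
      accepted[d]        - total events accepted d days ago
      reveal_fraction[d] - historical fraction of broken events revealed within d+1 days
    """
    # Scale each day's known-broken count up by how small a fraction we'd
    # expect to have been revealed by now.
    eventual_broken = sum(
        revealed_broken[d] / reveal_fraction[d] for d in range(len(accepted))
    )
    return eventual_broken / sum(accepted)

# The running example: 10 known broken so far today, ~5% of broken events
# typically revealed within a day, 110 accepted. 10 / 5% = 200 eventual broken.
print(naive_estimate([10], [110], [0.05]))  # ~1.82, i.e. "200 of your 110"
```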

Truncate, Truncate

Okay, first of all: maybe within one day you only get to learn that 0.1% of the broken events were actually broken, and you'd be multiplying some numbers by 1000x. That's too ridiculous, so maybe constrain that particular fraction to be >= 2% or so.

And secondly, please don't estimate that 200 of your 110 were broken. Just cap it at 110.

Pros: Still simple and intuitive.
Cons: Not as silly, but still: what's more likely? That usually you accept 90-105ish legitimate events per day and today you accepted 0, or that today you still accepted 90-105ish, and this time all of those 10 broken events came from the same broken source (correlating the delays), and the delay happened to be very short?
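Bolting both fixes onto the sketch above (same hypothetical names; the 2% floor and the cap at total accepted are the values from the text):

```python
MIN_REVEAL_FRACTION = 0.02  # never scale a known-broken count up by more than 50x

def truncated_estimate(revealed_broken, accepted, reveal_fraction):
    """Naive estimate, but with the reveal fraction floored and the total capped."""
    eventual_broken = sum(
        revealed_broken[d] / max(reveal_fraction[d], MIN_REVEAL_FRACTION)
        for d in range(len(accepted))
    )
    # Never claim more broken events than we accepted in total.
    eventual_broken = min(eventual_broken, sum(accepted))
    return eventual_broken / sum(accepted)

print(truncated_estimate([10], [110], [0.05]))  # capped: 110 / 110 = 1.0
```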

Smooth It Out

There's a lot of noise in the very recent events. What if we just took a weighted average and put pretty small weight on the most recent windows? And actually we have a very convenient curve that has small numbers for recent time windows and large numbers for older time windows: the same one that says "5% of them within one day". Let's just use it and see what happens. Now the first day will contribute 10/5% * 5% = 10 to the numerator of our rate, and 110*5% = 5.5 to the denominator. Whoops, we lost the anti-silly cap; let's say instead the first day contributes 110*5% + 10 = 15.5 to the denominator.

Pros: Still actually pretty simple, and since we used the same curve, a bunch of terms cancel and it looks even simpler than the others. Probably pretty good results too, since we're heavily discounting the noisiest time periods.
Cons: Heavily discounts the most recent time periods, so maybe it lags more than we'd like. Maybe a lot more, for small segments which are noisy for more than just the most recent times.
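Because the weights are the same curve as the scale-up factors, the per-day numerator terms cancel exactly. A sketch with the same hypothetical inputs, using the capped denominator described above:

```python
def smoothed_estimate(revealed_broken, accepted, reveal_fraction):
    """Weight each day by its reveal fraction; the numerator terms cancel.

    Numerator per day:   (revealed / fraction) * fraction == revealed
    Denominator per day: accepted weighted by the fraction, plus the
                         already-known-broken count at full weight (the cap).
    """
    numerator = sum(revealed_broken)
    denominator = sum(
        accepted[d] * reveal_fraction[d] + revealed_broken[d]
        for d in range(len(accepted))
    )
    return numerator / denominator

# The running example: 10 / (110 * 5% + 10) = 10 / 15.5.
print(smoothed_estimate([10], [110], [0.05]))  # ~0.65
```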

Will It Bayes?

All of this noise reduction, smoothing, and heuristic estimation seems unnecessary. Can't we just put a prior on our estimates whose parameters are set by data? Maybe we say that the revealed breakage rate at a certain delay window is described by a normal distribution over the natural scale (log-odds), with mean equal to our previous untruncated point estimate (so it could be 0.1%) translated to log-odds, and variance computed as if we divided our historical data into all our known segments, found each segment's rates by time delay, and took the variance per time delay over all segment rates. I suppose taking the mean that way too would make sense; it may be somewhat different from the raw mean. Also, maybe rather than weighting segments equally or by their volume, we weight by sqrt(volume). Also also, since we're going to take the inverse of this rate, maybe the natural scale is for a magnitude in [1, inf) rather than in (0, 1)? I don't know the best practices here.

Anyway, now that we have a distribution over discovery-delay rates, let's also do something like the above to get a distribution over broken rates, then sample from (broken-rate distribution) x (discovery-delay-rate distribution) such that we never exceed total accepted cases, and take the prior-probability-weighted mean (or maybe solve it analytically) of that for our contribution to the numerator for this particular delay window. Raw accepted count for the denominator. Weight against recency as desired, though that should be much less needed now.

Pros: This should work well! It incorporates a lot more information and structure than the previous methods.
Cons: Literally incomprehensible. I can barely understand what I wrote above, and I know it has holes and TBDs.
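For whatever it's worth, here's one possible concrete reading of that sketch in Python: normal priors on the log-odds scale for both rates, Monte Carlo in place of the analytic option, and an assumed rule for combining the two sampled rates (the post leaves that step open). All names and prior parameters below are hypothetical.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bayes_window_contribution(revealed_broken, accepted,
                              reveal_logodds_mean, reveal_logodds_sd,
                              broken_logodds_mean, broken_logodds_sd,
                              n_samples=100_000, seed=0):
    """Monte Carlo estimate of one delay window's contribution to the numerator.

    Both the reveal fraction and the broken rate get normal priors on the
    log-odds scale, with means and across-segment standard deviations supplied
    from historical data.
    """
    rng = np.random.default_rng(seed)
    reveal_frac = sigmoid(rng.normal(reveal_logodds_mean, reveal_logodds_sd, n_samples))
    broken_rate = sigmoid(rng.normal(broken_logodds_mean, broken_logodds_sd, n_samples))

    # Each sample implies an eventual broken count for this window, either by
    # scaling up what we've already seen or from the broken-rate prior alone.
    # Taking the smaller of the two is an assumption, but it keeps one noisy
    # observation from blowing up the estimate.
    eventual = np.minimum(revealed_broken / reveal_frac, broken_rate * accepted)
    # We already know at least `revealed_broken` are broken, and we can never
    # have more broken events than we accepted.
    eventual = np.clip(eventual, revealed_broken, accepted)
    return eventual.mean()

# The running example with made-up prior parameters: 10 known broken of 110
# accepted, reveal fraction centered on 5%, broken rate centered on 10%.
contribution = bayes_window_contribution(
    10, 110,
    reveal_logodds_mean=logit(0.05), reveal_logodds_sd=0.5,
    broken_logodds_mean=logit(0.10), broken_logodds_sd=0.5,
)
print(contribution / 110)  # this window's estimated broken rate
```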
