Efficient overfitting of training data (Kaggle Bowl 2017)

During the Kaggle Data Science Bowl 2017, the leaderboard was based on only $198$ samples. The opportunity for overfitting was quickly understood, but initially only the naive option was mentioned, testing 1 submission per sample taking 66 days (still doable within the competition duration, but less than ideal).

But then Oleg Trott got a perfect score in just 14 submissions! topic) I was really curious how he managed to do this. Together with Cas, I found out one way it can be done. Perhaps Oleg will be posting his solution soon, but as I write this, it's not yet public.

EDIT: Kaggle is not happy with these overfitted perfect scores, which is reasonable. It gives a wrong impression to press or newcomers. Therefore I ask you to not post your perfect solution to the leaderboard. You can still post test ones to get extra training data though, giving you a slight advantage over others who will get this data later. Test submissions with have terrible logloss so won't influence the leaderboard.

The score is calculated as logloss, with probabilities capped at $1 \cdot 10^{-15}$. Therefore, the maximum error is $-\ln(10^{-15})$. With a resolution of $0.00001$ on the public leaderboard, that leaves $-\ln(10^{-15})/198/0.00001 = 17443$ discernible values or $14.090$ bits of information. The solution has $198$ bits of information, so theoretically, it seems possible to get it in 15 attempts.

Here's how to do it. You want to be able to tell from the sum of the sample log losses which position had an error. Therefore, you want the errors in "binary representation" to be like $0000$, $0001$, $0010$, $0100$, $1000$, but with a "bit" being the minimum resolution. That way you can tell from the sum which position was wrong.

EDIT: I initially used only half the probability range for positives, to prevent collisions between scores from positives and negatives. By "collision" I mean the score being decomposable in multiple ways. As it turns out, collisions are pretty rare, so we use the whole range and sample 15 bits each time (including a 0 prediction).

We can check 15 positions per submission. To have the other positions not interfere, we predict $p=0.5$ for all of them, which will give a logloss of $(198-15)/198 \cdot \ln(0.5)$ whether they're positives or negatives. This will simply add a constant to the score.

EDIT: I initially used exponential function, but after seeing that Oleg uses sigmoid, I found that there are fewer collisions that way. You should use that way, by simply replacing "exp" by "logistic.cdf" (scipy). Comments about why this works are welcome!

Let's look at a simplified example, where we probe 4 positions and predict $p=0.5$ for the rest. For the first position we predict $p=1$, for an logloss of $0$ if correct. For the second position, we want the minimum discernible error if it's a positive. The minimum discernible error is $0.00001$, so we predict $p = \exp(-1 \cdot 2^0 \cdot 0.00001 \cdot 198) = 0.99802$. For the third one, we want double ($2^1$) that error ($0.00002$), so we predict $\exp(-1 \cdot 2^1 \cdot 0.00001 \cdot 198) = 0.99605$. For the fourth one, we want 4 ($2^2$) times that error ($0.00004$) so $p=0.99211$.

We submit this solution and get an error of (for example) $0.70714$. We first subtract the constant term due to 194 times $p=0.5$ which is $\ln(0.5) / 198 \cdot 194 = 0.67914$, which leaves us $0.02800$. Position 1 can contribute either $0.00000$ or $-\log(1o^{-15})/198 = 0.17444$, position 2 can contribute either $0.00001$ or $0.03144$, position 3 is either $0.00002$ or $0.02795$ and position 4 is either $0.00004$ or $0.02446$.

In this case it's fairly obvious which sums to $0.02800$, but you can generally just try all possibilities, there are just 32768 for 15 bits. The solution is $0.00000 + 0.00001 + 0.02795 + 0.00004$. So we know position 1, 2 and 4 are positives, while position 3 is a negative.

Just repeat this for 15 positions at a time until you have everything. If there were ever multiple solutions for how to obtain the score, then you can do another submission to compare them.

See my implementation on Kaggle. (Jan 18)


Topics: #imadethis #machine-learning #data-science #cross-validation


No comments yet

You need to be logged in to comment.