Sunday, June 18, 2017

paired cross-validation for better model selection

Source: Ensemble Methods: Foundations and Algorithms. Section 1.3

We model to generalise, so test error is the main criterion of expected performance. If one is fitting many models, their performance should be compared on the test set; for selecting the parameters of a specific model, one also needs a validation set (apart from the training and test sets). When data is scarce, we use k-fold cross-validation for both validation and testing. However, a simple comparison of average generalisation error estimates is not reliable, since the winning algorithm may perform well merely due to the randomness of the data split. That is where the paired cross-validation t-test comes into play.

Let's take the example of the $5 \times 2$ cv paired t-test, where we run 2-fold cross-validation 5 times.
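Here is a minimal sketch of running the $5 \times 2$ cross-validation itself, assuming scikit-learn is available and using a synthetic dataset with logistic regression and a decision tree as the two algorithms; these specific choices are illustrative, not from the book. It simply records the $5 \times 2$ test-error rates of each algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, random_state=0)
err_a = np.zeros((5, 2))   # error rates of algorithm a: one row per repetition, one column per fold
err_b = np.zeros((5, 2))
for i in range(5):                                       # 5 repetitions with different random splits
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=i)
    for j, (train, test) in enumerate(cv.split(X, y)):   # train on one half, test on the other
        err_a[i, j] = 1 - LogisticRegression(max_iter=1000).fit(X[train], y[train]).score(X[test], y[test])
        err_b[i, j] = 1 - DecisionTreeClassifier(random_state=0).fit(X[train], y[train]).score(X[test], y[test])
```

The two $5 \times 2$ arrays of error rates are all the tests below need.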

1) Student's t hypothesis test (to compare a pair of models): Run 2-fold cross-validation 5 times. In each run, the two algorithms a and b are trained on one half of the data and tested on the other half. The error difference on fold $j$ is $d^{(j)} = err_a^{(j)} - err_b^{(j)}$, for $j \in \{1,2\}$. Within a single run, the mean and variance of this error difference are $\mu = (d^{(1)} + d^{(2)})/2$ and $s^2 = (d^{(1)} - \mu)^2 + (d^{(2)} - \mu)^2$. Let $\mu_i$ and $s_i^2$ denote the mean and variance from the $i$-th run of 2-fold cross-validation. Averaging the mean error difference over the 5 runs gives the t-statistic
\[t = \frac{\frac{1}{5}\sum_{i=1}^{5} \mu_i}{\sqrt{\frac{1}{5}\sum_{i=1}^{5} s_i^2}},\]
which under the null hypothesis follows a t-distribution with 5 degrees of freedom.
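A minimal sketch of computing this statistic, assuming the per-fold error rates are available as $5 \times 2$ arrays (for instance, the err_a and err_b arrays from the earlier sketch); the function name and the made-up example errors are just for illustration.

```python
import numpy as np
from scipy import stats

def five_by_two_cv_t_test(err_a, err_b):
    """t-statistic and two-sided p-value for the 5x2 cv paired t-test."""
    d = np.asarray(err_a, float) - np.asarray(err_b, float)  # (5, 2) fold-wise error differences
    mu = d.mean(axis=1)                          # mean difference per repetition
    s2 = ((d - mu[:, None]) ** 2).sum(axis=1)    # variance per repetition
    t = mu.mean() / np.sqrt(s2.mean())           # the statistic from the formula above
    p = 2 * stats.t.sf(abs(t), df=5)             # t-distribution with 5 degrees of freedom
    return t, p

# Example with made-up error rates (replace with real 5 x 2 cv results):
rng = np.random.default_rng(0)
print(five_by_two_cv_t_test(rng.uniform(0.10, 0.15, (5, 2)),
                            rng.uniform(0.12, 0.17, (5, 2))))
```

A small p-value suggests the two algorithms' errors differ by more than the randomness of the splits can explain.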

2) McNemar's test (when training is expensive, so only a single train/test run is affordable): On the test set, count $err_{01}$, the number of instances on which the first algorithm is wrong while the second is correct, and similarly $err_{10}$, the number on which the first is correct while the second is wrong. Under the null hypothesis that the two algorithms have the same error rate, the quantity
\[\frac{(|err_{01}-err_{10}|-1)^2}{err_{01}+err_{10}}\]
approximately follows a $\chi^2_1$ distribution.
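A minimal sketch of McNemar's test, assuming y_true, pred_a and pred_b are the true labels and the two algorithms' predictions on a single held-out test set (hypothetical names); the continuity-corrected statistic is the one given above.

```python
import numpy as np
from scipy import stats

def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar statistic and p-value (chi-square, 1 dof)."""
    wrong_a = np.asarray(pred_a) != np.asarray(y_true)
    wrong_b = np.asarray(pred_b) != np.asarray(y_true)
    err_01 = int(np.sum(wrong_a & ~wrong_b))   # first wrong, second correct
    err_10 = int(np.sum(~wrong_a & wrong_b))   # first correct, second wrong
    if err_01 + err_10 == 0:                   # the two algorithms never disagree
        return 0.0, 1.0
    chi2 = (abs(err_01 - err_10) - 1) ** 2 / (err_01 + err_10)
    p = stats.chi2.sf(chi2, df=1)
    return chi2, p

# Example with made-up predictions:
y  = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
pa = np.array([0, 1, 0, 0, 1, 1, 1, 1, 0, 0])
pb = np.array([0, 1, 1, 0, 0, 1, 1, 0, 0, 0])
print(mcnemar_test(y, pa, pb))
```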

3) Friedman test (to compare many models across many datasets): First, rank the algorithms on each dataset according to their average errors (rank 1 for the lowest error). The Friedman test then checks whether the average ranks, taken over all datasets, differ significantly. If they do, the Nemenyi post-hoc test provides a critical difference: two algorithms are considered statistically different if their average ranks differ by more than this value.
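A minimal sketch of the Friedman test followed by a Nemenyi critical difference, assuming an errors matrix of average errors with one row per dataset and one column per algorithm (the numbers below are made up); the constant 2.343 is the standard Nemenyi critical value for 3 algorithms at the 0.05 level.

```python
import numpy as np
from scipy import stats

# Made-up average errors: rows are datasets, columns are algorithms.
errors = np.array([[0.12, 0.15, 0.20],
                   [0.08, 0.09, 0.14],
                   [0.22, 0.21, 0.25],
                   [0.10, 0.13, 0.12],
                   [0.18, 0.19, 0.23]])
n_datasets, k = errors.shape

# Friedman test: does at least one algorithm rank consistently differently?
stat, p = stats.friedmanchisquare(*errors.T)
print(f"Friedman statistic = {stat:.3f}, p = {p:.3f}")

# Average rank of each algorithm (rank 1 = lowest error on a dataset).
avg_ranks = np.apply_along_axis(stats.rankdata, 1, errors).mean(axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k (k + 1) / (6 N)),
# with q_alpha = 2.343 for k = 3 algorithms at alpha = 0.05.
cd = 2.343 * np.sqrt(k * (k + 1) / (6 * n_datasets))
print("average ranks:", avg_ranks, " critical difference:", cd)
```

Two algorithms are judged significantly different when their average ranks differ by more than the critical difference.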

