Beating the Holdout: Bounds for KFold and Progressive Cross-Validation

A. Blum, Adam Kalai, and John Langford
Conference Paper, Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 203-208, July 1999

Abstract

The empirical error on a test set, the hold-out estimate, is often a more reliable estimate of generalization error than the observed error on the training set, the training estimate. K-fold cross-validation is used in practice with the hope of being more accurate than the hold-out estimate without reducing the number of training examples. We argue that the k-fold estimate does in fact achieve this goal. Specifically, we show that for any nontrivial learning problem and any learning algorithm that is insensitive to example ordering, the k-fold estimate is strictly more accurate than a single hold-out estimate on 1/k of the data, for 2 ≤ k ≤ n (k = n is leave-one-out), based on its variance and all higher moments. Previous bounds were termed "sanity-check" bounds because they compared the k-fold estimate to the training estimate and, further, restricted the VC dimension and required a notion of hypothesis stability [2]. To avoid these dependencies, we consider a k-fold hypothesis that is a randomized combination, or average, of the k individual hypotheses. We introduce progressive validation as another possible improvement on the hold-out estimate. This estimate of the generalization error is, in many ways, as good as that of a single hold-out, but on average it uses only half as many examples for testing. The procedure also involves a hold-out set, but after an example has been tested, it is added to the training set and the learning algorithm is rerun.
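
The k-fold estimate averages the hold-out errors of k hypotheses, each trained with one fold withheld, and the paper's analysis applies to a k-fold hypothesis that picks one of those k hypotheses uniformly at random. The following is a minimal Python sketch of that procedure, not the authors' code; it assumes a hypothetical interface learner(X, y) -> h, where h(x) returns a predicted label, and numpy arrays for the data.

import numpy as np

def kfold_estimate(learner, X, y, k, rng=None):
    # Split the example indices into k roughly equal folds.
    rng = rng or np.random.default_rng(0)
    idx = np.arange(len(X))
    folds = np.array_split(idx, k)
    hyps, fold_errs = [], []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        h = learner(X[train], y[train])  # train with this fold held out
        fold_errs.append(np.mean([h(X[i]) != y[i] for i in fold]))  # hold-out error on the fold
        hyps.append(h)
    def kfold_hypothesis(x):
        # Randomized k-fold hypothesis: one of the k hypotheses, chosen uniformly.
        return hyps[rng.integers(k)](x)
    return float(np.mean(fold_errs)), kfold_hypothesis  # k-fold estimate and combined hypothesis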
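
Progressive validation, as described above, tests each hold-out example once and then folds it into the training set before the learner is rerun. A minimal sketch under the same hypothetical learner(X, y) -> h interface:

def progressive_validation(learner, X, y, m):
    # The last m examples form the hold-out sequence.
    n = len(X)
    train_idx = list(range(n - m))
    errors = 0
    for i in range(n - m, n):
        h = learner(X[train_idx], y[train_idx])  # rerun the learner on the current training set
        errors += int(h(X[i]) != y[i])           # test on the next hold-out example first,
        train_idx.append(i)                      # then add it to the training set
    return errors / m                            # progressive validation estimate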

BibTeX

@conference{Blum-1999-16679,
author = {A. Blum and Adam Kalai and John Langford},
title = {Beating the Holdout: Bounds for KFold and Progressive Cross-Validation},
booktitle = {Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99)},
year = {1999},
month = {July},
pages = {203--208},
}