
John Cook is an applied mathematician working in Houston, Texas. His career has been a blend of research, software development, consulting, and management.

Why Every Statistician Should Know About Cross-Validation

10.14.2012

Surprisingly, many statisticians see cross-validation as something data miners do, but not a core statistical technique. I thought it might be helpful to summarize the role of cross-validation in statistics.

Cross-validation is primarily a way of measuring the predictive performance of a statistical model. Every statistician knows that the model fit statistics are not a good guide to how well a model will predict: high R^2 does not necessarily mean a good model. It is easy to over-fit the data by including too many degrees of freedom and so inflate R^2 and other fit statistics. For example, in a simple polynomial regression I can just keep adding higher order terms and so get better and better fits to the data. But the predictions from the model on new data will usually get worse as higher order terms are added.

One way to measure the predictive ability of a model is to test it on a set of data not used in estimation. Data miners call this a “test set” and the data used for estimation is the “training set”. For example, the predictive accuracy of a model can be measured by the mean squared error on the test set. This will generally be larger than the MSE on the training set because the test data were not used for estimation.
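
To make the point concrete, here is a small illustrative Python sketch (mine, not from the original post): polynomials of increasing degree are fitted to noisy data, and the MSE is computed on the training set and on a held-back test set. The training MSE keeps falling as terms are added, while the test MSE eventually gets worse.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 60)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

    # Keep a third of the observations back as a test set.
    test = rng.choice(x.size, size=20, replace=False)
    train = np.setdiff1d(np.arange(x.size), test)

    for degree in (1, 3, 10, 15):
        coeffs = np.polyfit(x[train], y[train], degree)
        mse_train = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
        mse_test = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
        print(f"degree {degree:2d}: training MSE {mse_train:.3f}, test MSE {mse_test:.3f}")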

However, there is often not enough data to allow some of it to be kept back for testing. A more sophisticated version of training/test sets is leave-one-out cross-validation (LOOCV) in which the accuracy measures are obtained as follows. Suppose there are n independent observations, y_1,\dots,y_n.

  1. Let observation i form the test set, and fit the model using the remaining data. Then compute the error (e_{i}^*=y_{i}-\hat{y}_{i}) for the omitted observation. This is sometimes called a “predicted residual” to distinguish it from an ordinary residual.
  2. Repeat step 1 for i=1,\dots,n.
  3. Compute the MSE from e_{1}^*,\dots,e_{n}^*. We shall call this the CV.

This is a much more efficient use of the available data, as you only omit one observation at each step. However, it can be very time consuming to implement (except for linear models; see below).
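
As a rough illustration of the three steps above, here is a minimal Python sketch of LOOCV for an ordinary least-squares model; the code and names are mine, used only to show the mechanics.

    import numpy as np

    def loocv_mse(X, y):
        """Leave-one-out CV statistic for an ordinary least-squares model (illustrative)."""
        n = len(y)
        errors = np.empty(n)
        for i in range(n):
            keep = np.arange(n) != i                      # step 1: omit observation i
            beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
            errors[i] = y[i] - X[i] @ beta                # predicted residual e_i*
        return np.mean(errors ** 2)                       # step 3: the CV (PRESS = n * CV)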

Other statistics (e.g., the MAE) can be computed similarly. A related measure is the PRESS statistic (predicted residual sum of squares), equal to n times the MSE.

Variations on cross-validation include leave-k-out cross-validation (in which k observations are left out at each step) and k-fold cross-validation (where the original sample is randomly partitioned into k subsamples and one is left out in each iteration). Another popular variant is the .632+ bootstrap of Efron & Tibshirani (1997), which has better properties but is more complicated to implement.

Minimizing a CV statistic is a useful way to do model selection such as choosing variables in a regression or choosing the degrees of freedom of a nonparametric smoother. It is certainly far better than procedures based on statistical tests and provides a nearly unbiased measure of the true MSE on new observations.
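
For instance, one might choose the degree of a polynomial regression by minimizing a cross-validated MSE. The sketch below is illustrative only; it uses the k-fold variant mentioned above simply to keep the computation light.

    import numpy as np

    def kfold_cv_mse(x, y, degree, k=5, seed=0):
        """k-fold cross-validated MSE for a polynomial of the given degree (illustrative)."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), k)
        errors = []
        for fold in folds:
            train = np.setdiff1d(np.arange(len(y)), fold)
            coeffs = np.polyfit(x[train], y[train], degree)
            errors.extend(y[fold] - np.polyval(coeffs, x[fold]))
        return np.mean(np.square(errors))

    # Select the degree with the smallest CV value, e.g. among degrees 1..10:
    # best_degree = min(range(1, 11), key=lambda d: kfold_cv_mse(x, y, d))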

However, as with any variable selection procedure, it can be misused. Beware of looking at statistical tests after selecting variables using cross-validation: the tests do not take account of the variable selection that has taken place and so the p-values can mislead.

It is also important to realise that it doesn’t always work. For example, if there are exact duplicate observations (i.e., two or more observations with equal values for all covariates and for the y variable), then leaving one observation out will not be effective.

Another problem is that a small change in the data can cause a large change in the model selected. Many authors have found that k-fold cross-validation works better in this respect.

In a famous paper, Shao (1993) showed that leave-one-out cross-validation does not lead to a consistent estimate of the model. That is, if there is a true model, then LOOCV will not always find it, even with very large sample sizes. In contrast, certain kinds of leave-k-out cross-validation, where k increases with n, will be consistent. Frankly, I don’t consider this a very important result, as there is never a true model. In reality, every model is wrong, so consistency is not really an interesting property.

Cross-validation for linear models

While cross-validation can be computationally expensive in general, it is very easy and fast to compute LOOCV for linear models. A linear model can be written as

    \[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}. \]

Then

    \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \]

and the fitted values can be calculated using

    \[ \mathbf{\hat{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y}, \]

where \mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' is known as the “hat-matrix” because it is used to compute \mathbf{\hat{Y}} (“Y-hat”).

If the diagonal values of \mathbf{H} are denoted by h_{1},\dots,h_{n}, then the cross-validation statistic can be computed using

    \[ \text{CV} = \frac{1}{n}\sum_{i=1}^n [e_{i}/(1-h_{i})]^2, \]

where e_{i} is the residual obtained from fitting the model to all n observations. See Christensen’s book Plane Answers to Complex Questions for a proof. Thus, it is not necessary to actually fit n separate models when computing the CV statistic for linear models. This remarkable result allows cross-validation to be used while only fitting the model once to all available observations.
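
The following Python sketch (again illustrative, not from the article) computes the CV statistic from a single least-squares fit using the hat matrix; it should agree with the brute-force leave-one-out loop sketched earlier.

    import numpy as np

    def loocv_linear(X, y):
        """CV = (1/n) * sum_i [e_i / (1 - h_i)]^2 from a single fit (illustrative sketch)."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta                        # ordinary residuals e_i
        h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages: diagonal of the hat matrix
        return np.mean((residuals / (1 - h)) ** 2)

    # For any design matrix X and response y this should match (up to rounding)
    # the explicit leave-one-out loop sketched earlier:
    # assert np.isclose(loocv_linear(X, y), loocv_mse(X, y))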

Relationships with other quantities

Cross-validation statistics and related quantities are widely used in statistics, although it has not always been clear that these are all connected with cross-validation.

Jackknife

A jackknife estimator is obtained by recomputing an estimate leaving out one observation at a time from the estimation sample. The n estimates allow the bias and variance of the statistic to be calculated.
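
A minimal sketch of the idea for a generic statistic such as the sample mean; the function and formulas below are the standard jackknife bias and variance estimates, but the code itself is mine and only illustrative.

    import numpy as np

    def jackknife(sample, statistic=np.mean):
        """Jackknife bias and variance estimates for a statistic of an i.i.d. sample (sketch)."""
        sample = np.asarray(sample)
        n = len(sample)
        theta_hat = statistic(sample)
        # Recompute the statistic leaving out one observation at a time.
        loo = np.array([statistic(np.delete(sample, i)) for i in range(n)])
        bias = (n - 1) * (loo.mean() - theta_hat)
        variance = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
        return bias, variance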

Akaike’s Information Criterion

Akaike’s Information Criterion is defined as

    \[ \text{AIC} = -2\log\mathcal{L} + 2p, \]

where \mathcal{L} is the maximized likelihood using all available data for estimation and p is the number of free parameters in the model. Asymptotically, minimizing the AIC is equivalent to minimizing the CV value. This is true for any model (Stone 1977), not just linear models. It is this property that makes the AIC so useful in model selection when the purpose is prediction.
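
For a Gaussian linear model, the AIC can be written, up to an additive constant, in terms of the residual sum of squares. The sketch below assumes that form and counts only the regression coefficients in p; conventions differ on whether the error variance is also counted, so treat it as illustrative.

    import numpy as np

    def aic_linear(X, y):
        """AIC for a Gaussian linear model, up to an additive constant (illustrative sketch)."""
        n, p = X.shape
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        # -2 log L = n * log(RSS / n) + constant, using the Gaussian MLE of the error variance
        return n * np.log(rss / n) + 2 * p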

Schwarz Bayesian Information Criterion

A related measure is Schwarz’s Bayesian Information Criterion:

    \[ \text{BIC} = -2\log\mathcal{L} + p\log(n), \]

where n is the number of observations used for estimation. Because of the heavier penalty, the model chosen by BIC is either the same as that chosen by AIC, or one with fewer terms. Asymptotically, for linear models, minimizing BIC is equivalent to leave-v-out cross-validation when v = n[1-1/(\log(n)-1)] (Shao 1997).

Many statisticians like to use BIC because it is consistent: if there is a true underlying model, then with enough data the BIC will select that model. However, in reality there is rarely, if ever, a true underlying model, and even if there were, selecting that model will not necessarily give the best forecasts (because the parameter estimates may not be accurate).

Cross-validation for time series

When the data are not independent, cross-validation becomes more difficult, as leaving out an observation does not remove all the associated information, owing to the correlations with other observations. For time series forecasting, a cross-validation statistic is obtained as follows:

  1. Fit the model to the data y_1,\dots,y_t and let \hat{y}_{t+1} denote the forecast of the next observation. Then compute the error (e_{t+1}^*=y_{t+1}-\hat{y}_{t+1}) for the forecast observation.
  2. Repeat step 1 for t=m,\dots,n-1, where m is the minimum number of observations needed for fitting the model.
  3. Compute the MSE from e_{m+1}^*,\dots,e_{n}^*.
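
Here is a rough Python sketch of this rolling forecast-origin procedure. Since the post does not fix a particular forecasting model, a simple least-squares AR(1) forecast is used as a stand-in; the function and its names are mine, for illustration only.

    import numpy as np

    def time_series_cv_mse(y, m):
        """One-step-ahead CV MSE: fit on y[:t], forecast y[t], for t = m, ..., n-1 (sketch)."""
        y = np.asarray(y, dtype=float)
        errors = []
        for t in range(m, len(y)):
            # Toy "model": least-squares AR(1), y_{s+1} ~ a + b * y_s, fitted on y[:t]
            # (for this stand-in model, m must be at least 3).
            b, a = np.polyfit(y[: t - 1], y[1:t], 1)
            forecast = a + b * y[t - 1]
            errors.append(y[t] - forecast)                # forecast error e_{t+1}*
        return np.mean(np.square(errors))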

References

An excellent and comprehensive recent survey of cross-validation results is Arlot and Celisse (2010).