
John Cook is an applied mathematician working in Houston, Texas. His career has been a blend of research, software development, consulting, and management. John is a DZone MVB and is not an employee of DZone; you can read more from him at his website.

Data Calls the Model's Bluff

03.06.2013

I hear a lot of people saying that simple models work better than complex models when you have enough data. For example, here’s a tweet from Giuseppe Paleologo this morning:

Isn’t it ironic that almost all known results in asymptotic statistics don’t scale well with data?

There are several things people could mean when they say that complex models don’t scale well.

First, they may mean that the implementation of complex models doesn’t scale. The computational effort required to fit the model increases disproportionately with the amount of data.

Second, they could mean that complex models aren’t necessary. A complex model might do even better than a simple model, but simple models work well enough given lots of data.

A third possibility, less charitable than the first two, is that the complex models are a bad fit, and this becomes apparent given enough data. The data calls the model’s bluff. If a statistical model performs poorly with lots of data, it must have performed poorly with a small amount of data too, but you couldn’t tell. It’s simple over-fitting.
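As an illustration of that bluff-calling (my example, not from the original post), here is a minimal Python sketch, assuming numpy and an illustrative data-generating process: the true relationship is linear, a deliberately over-complex polynomial is fit to a small sample, and a large held-out sample exposes the misfit that the small sample hid.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    # True relationship: linear with moderate noise (an illustrative choice).
    x = rng.uniform(-1, 1, n)
    y = 2 * x + rng.normal(0, 0.5, n)
    return x, y

# Fit an over-complex model (degree-9 polynomial) to a small sample.
x_train, y_train = simulate(15)
coeffs = np.polyfit(x_train, y_train, 9)

# In-sample error looks fine -- the bluff.
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# A large held-out sample calls it: out-of-sample error is much worse.
x_test, y_test = simulate(100_000)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(f"train MSE: {train_mse:.3f}   test MSE: {test_mse:.3f}")
```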

I believe that’s what Giuseppe had in mind in his remark above. When I replied that the problem is modeling error, he said “Yes, big time.” The results of asymptotic statistics scale beautifully when the model is correct. But giving a poorly fitting model more data isn’t going to make it perform better.

The wrong conclusion would be to say that complex models work well for small data. I think the conclusion is that you can’t tell that complex models are not working well with small data. It’s a researcher’s paradise. You can fit a sequence of ever more complex models, getting a publication out of each. Evaluate your model using simulations based on your assumptions and you can avoid the accountability of the real world.

If the robustness of simple models is important with huge data sets, it’s even more important with small data sets.

Model complexity should be able to increase with data, not decrease. I don't mean that it must increase, but that it could. With more data, you have the ability to test the fit of more complex models. When people say that simple models scale better, they may mean that they haven't been able to do better, that the data has exposed the problems with other things they've tried.
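To make that last point concrete, here is another small sketch (again my own, assuming numpy and an illustrative cubic ground truth): a simple linear model and a slightly more complex cubic model are compared on held-out data at several sample sizes. With a handful of points the two are hard to tell apart; with plenty of data, the genuinely better complex model can actually be validated.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    # True relationship: mildly nonlinear (cubic) with noise (illustrative).
    x = rng.uniform(-1, 1, n)
    y = x + 1.5 * x**3 + rng.normal(0, 0.3, n)
    return x, y

x_test, y_test = simulate(50_000)   # a common held-out set

for n in (10, 100, 10_000):
    x_train, y_train = simulate(n)
    report = []
    for deg in (1, 3):              # simple (linear) vs. more complex (cubic)
        coeffs = np.polyfit(x_train, y_train, deg)
        mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        report.append(f"degree {deg}: {mse:.4f}")
    print(f"n = {n:>6}   held-out MSE   " + "   ".join(report))
```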

Published at DZone with permission of John Cook, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)