Reflections on fields where sound research is hard.
I once heard Lucy Sanders, CEO of the National Center for Women in Information Technology, comment on how hard it was to get politicians in Washington to understand that, as she put it, “the plural of anecdote is not data.” That statement has stuck with me for years, and was the inspiration for today’s post.
Much of science and engineering is based on the following process:
1. Observe the world repeatedly, taking measurements of each observation.
2. Find a mathematical formula that the measurements fit.
3. Assume that that formula captures a fundamental truth and use it to predict what you will measure in future observations.
4. Observe the world many more times, taking measurements of each observation, and see if they verify your proposed formula.
Science focuses mostly on steps 1, 2, and 4, while engineering is more interested in step 3, and in particular in designing scenarios where the measurements will meet particular objectives.
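As a toy illustration (the falling-object scenario and all the numbers here are made up), the four steps might look like this: measure, fit a one-parameter formula by least squares, predict, then check the prediction against a fresh observation.

```python
# Step 1: made-up measurements of a falling object's speed (t, v),
# where the underlying truth is roughly v = 9.8 * t plus measurement noise.
measurements = [(1, 9.6), (2, 19.9), (3, 29.2), (4, 39.5)]

# Step 2: find a formula v = g * t that fits (least squares, one parameter).
g = sum(t * v for t, v in measurements) / sum(t * t for t, _ in measurements)

# Step 3: assume the formula is a fundamental truth and predict the next reading.
predicted = g * 5

# Step 4: observe again and see whether the prediction holds up.
new_observation = (5, 48.8)
print(g, predicted, abs(predicted - new_observation[1]))
```

Engineering's interest in step 3 would run the same code in reverse: given the fitted g, design for a target v by choosing t.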
Step 2 is always possible. For any finite set of data, there is an infinite set of finitely-expressible formulae that fit every single datum perfectly. However, most of these formulae provide no predictive power; step 4 will show them to be false. While we can never reduce the set of possible formulae to a single formula, we do have a set of rough rules of thumb for detecting “overfitting”. Roughly speaking, these suggest that
there should be an unambiguous process for deriving the formula from the data, and
applying this process to a subset of the data should produce a formula similar to the one produced by applying it to the whole data set,
or, almost equivalently,
the family of formulae from which you draw this formula should have many fewer degrees of freedom than you have datapoints.
But both of these are just approximations. The real meaning of overfitting is that the formula you found matches the random chance in your data rather than the underlying truths of the universe.
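To see that last point at its starkest, here is a sketch with invented numbers: ten measurements of a truth y = 2x, each off by ±1 of noise, fit perfectly by a polynomial with ten degrees of freedom. The fit is flawless on the data it was built from, and wildly wrong two steps past it, because what it captured was the noise.

```python
# Hypothetical data: the truth is y = 2x; each reading is off by +/-1 of noise.
def truth(x):
    return 2 * x

xs = list(range(10))
noise = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
ys = [truth(x) + n for x, n in zip(xs, noise)]

def interpolate(xs, ys, x):
    """The degree-9 polynomial through all 10 points (Lagrange form)."""
    total = 0.0
    for xi, yi in zip(xs, ys):
        term = float(yi)
        for xj in xs:
            if xj != xi:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Ten degrees of freedom, ten datapoints: the fit is perfect...
assert all(abs(interpolate(xs, ys, x) - y) < 1e-6 for x, y in zip(xs, ys))

# ...but it fit the noise, so two steps past the data it is off by thousands.
print(interpolate(xs, ys, 12), truth(12))
```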
Consider the following scenario:
1. I flip a coin 10 times, and it comes up heads every time.
2. I notice that there is a very simple formula for this: flip(this coin) = heads.
3. I assume that that formula captures a fundamental truth and predict that the next ten flips will be heads every time as well.
4. I flip the coin 10 more times, and it comes up heads every time.
From this I conclude… actually, what I conclude varies depending on how I think. If I am a first-principles kind of person, I say “I have no reason to believe the tails side won’t come up eventually; the experiment shows nothing.” If I am a gut-feeling data-driven kind of person, I say “This coin is amazing! It always comes up heads!” Or I could try to hedge my bets and observe that “This coin always comes up heads (p < 0.00001)”.
But if all coins are perfectly fair, roughly one in a million will give me those 20 heads in a row. For the low-low price of $10,000 in materials (a million one-cent coins, plus my wages during the search) I can find you that “always-heads” coin.
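A quick sketch of that arithmetic (the simulation and its seed are, of course, made up):

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# A perfectly fair coin comes up heads 20 times in a row with probability:
p = 0.5 ** 20
print(p)  # about one in a million (exactly 1 / 1,048,576)

# Buy a million one-cent coins ($10,000 in materials) and flip each 20 times:
lucky = sum(
    1 for _ in range(1_000_000)
    if all(random.random() < 0.5 for _ in range(20))
)
print(lucky)  # the perfectly ordinary coins that, by pure luck, look
              # "always heads" -- usually one or two, occasionally none
```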
If you test enough things, some of them will look significant. At the level of significance accepted by most publications today (p = 0.05) you only have to test 20 things. If you can test a false postulate each week, you’ll find 2 or 3 “significant” findings every year; if you can test one each working day you’ll find one you can confirm in two independent studies every 18 months.
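The arithmetic behind those figures, assuming 52 weeks and roughly 250 working days per year:

```python
alpha = 0.05  # the significance level accepted by most publications

# Test k false postulates; chance that at least one looks "significant":
def p_some_finding(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

print(p_some_finding(20))      # ~0.64: test 20 things, expect a "finding"

# One false postulate per week:
print(52 * alpha)              # ~2.6 spurious "findings" per year

# One per working day (~250/year), demanding two independent confirmations,
# each pair of which passes by chance with probability alpha ** 2:
print(250 * 1.5 * alpha ** 2)  # ~0.94: about one confirmed pair per 18 months
```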
There are a lot of people “doing studies”. Given the huge pool, I fully expect that “studies show” anything and everything.
Bad as statistical accidents are, things are far worse when steps 1 and 2 are problematic. Suppose you want to tell managers how to make their businesses profitable. Given that (a) the number of very profitable businesses is not that large, (b) each manager does a lot of things, (c) most of those things cannot be easily measured, and (d) even the measurable ones are difficult to measure well, the chance that you can get data in any meaningful quantity is slim to none. So instead you resort to case studies.
Case studies are fine things; they basically mean that you pick one example and study it in detail. They let you see the whole picture, and they are widely used in law, business, and politics. But they are a bad way of making formula-style predictions. It takes a lot of time to do (or even read) a case study, so the total number is always low. Additionally, the number of distinct things that you can fuzzily quantify in each case study is very large. A large number of variables and a low number of samples means you are almost guaranteed to find some truly remarkable trend. Maybe dirtier stores sold more product, or all the best stores had low-pressure sodium lamps, or the managers of the most profitable stores were single-mantra people, or the managers of the most profitable stores were hand-in-every-pot people, or… who knows what, but something, something every time.
You see, with a hundred variables and ten datapoints, you almost always overfit. Some parts of the noise are always going to be beautifully orderly.
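A sketch with invented numbers makes the point: score a hundred pure-noise “variables” across ten “stores” against an equally random “profit” figure, and something always correlates remarkably.

```python
import random

random.seed(2)  # fixed seed for reproducibility

# Hypothetical case-study pile: 10 stores, each with 100 fuzzily
# quantified attributes and a profit figure -- all of it pure noise.
n_stores, n_vars = 10, 100
profit = [random.random() for _ in range(n_stores)]
attributes = [[random.random() for _ in range(n_stores)] for _ in range(n_vars)]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# With 100 variables and only 10 datapoints, some attribute almost
# always shows a "remarkable" correlation with profit -- by chance alone.
best = max(abs(pearson(a, profit)) for a in attributes)
print(best)
```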
There is some intuitive understanding of this, which is why rationales start to appear. “Low-pressure sodium lights emit 589.3 nm yellow illumination, which is near the center of the warm, friendly glow of a wood fire, creating in the minds of the customers a subconscious feeling of comfort and pleasure.” An effort to make bad data look more like first-principles science results in untested, mostly nonsensical claims supported by digested case-study anecdotes.