Thursday, August 17, 2017

"Theory vs. Data" in statistics too

Via Brad DeLong -- still my favorite blogger after all these years -- I stumbled on this very interesting essay from 2001, by statistician Leo Breiman. Breiman basically says that statisticians should do less modeling and more machine learning. The essay has several responses from statisticians of a more orthodox persuasion, including the great David Cox (whom every economist should know). Obviously, the world has changed a lot since 2001 -- where random forests were the hot machine learning technique back then, it's now deep learning -- but it seems unlikely that this overall debate has been resolved. And the parallels to the methodology debates in economics are interesting.

In empirical economics, the big debate is between two different types of model-makers. Structural modelers want to use models that come from economic theory (constrained optimization of economic agents, production functions, and all that), while reduced-form modelers just want to use simple stuff like linear regression (and rely on careful research design to make those simple models appropriate).

I'm pretty sure I know who's right in this debate: both. If you have a really solid, reliable theory that has proven itself in lots of cases so you can be confident it's really structural instead of some made-up B.S., then you're golden. Use that. But if economists are still trying to figure out which theory applies in a certain situation (and let's face it, this is usually the case), reduced-form stuff can both A) help identify the right theory and B) help make decently good policy in the meantime.

Statisticians, on the other hand, debate whether you should actually have a model at all! The simplistic reduced-form models that structural econometricians turn up their noses at -- linear regression, logit models, etc. -- are the exact things Breiman criticizes for being too theoretical! 

Here's Breiman:
[I]n the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form: "Assume that the data are generated by the following model: ..."
I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made...
[Data generating process modeling] has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions...
[t]he conclusions are about the model’s mechanism, and not about nature’s mechanism. It follows that...[i]f the model is a poor emulation of nature, the conclusions may be wrong...
These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon—once a model is made, then it becomes truth and the conclusions from it are [considered] infallible.
This sounds very similar to the things reduced-form econometric modelers say when they criticize their structural counterparts. For example, here's Francis Diebold (a fan of structural modeling, but paraphrasing others' criticisms):
A cynical but not-entirely-false view is that structural causal inference effectively assumes a causal mechanism, known up to a vector of parameters that can be estimated. Big assumption. And of course different structural modelers can make different assumptions and get different results.
In both cases, the criticism is that if you have a misspecified theory, results that look careful and solid will actually be wildly wrong. But the kind of simple stuff that (some) structural econometricians think doesn't make enough a priori assumptions is exactly the stuff Breiman says (often) makes way too many.

So if even OLS and logit are too theoretical and restrictive for Breiman's tastes, what does he want to do instead? Breiman wants to toss out the idea of a model entirely. Instead of making any assumption about the DGP, he wants to use an algorithm - a set of procedural steps to make predictions from data. As discussant Brad Efron puts it in his comment, Breiman wants "a black box with lots of knobs to twiddle." 

Breiman has one simple, powerful justification for preferring black boxes to formal DGP modeling: it works. He shows lots of examples where machine learning beat the pants off traditional model-based statistical techniques, in terms of predictive accuracy. Efron is skeptical, accusing Breiman of cherry-picking his examples to make machine learning methods look good. But LOL, that was back in 2001. As of 2017, machine learning - in particular, deep learning - has accomplished such magical feats that no one now questions the notion that these algorithmic techniques really do have some secret sauce. 
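
To make the "it works" claim concrete, here's a toy sketch in Python. It is not one of Breiman's examples, and it uses a k-nearest-neighbor average rather than a random forest, purely to keep the code short. The "nature" here (y = sin(x) plus noise), the sample sizes, and k = 5 are all assumptions I made up for illustration. The point is only that when the assumed linear data model is a poor emulation of the mechanism, a procedure with no assumed functional form can predict held-out data much better.

```python
import math
import random


def make_data(n, rng):
    """Draw n points from a made-up nonlinear 'nature': y = sin(x) + noise."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (the assumed 'data model')."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def fit_knn(xs, ys, k=5):
    """k-nearest-neighbor average: a procedure with no assumed functional form."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def mse(predict, xs, ys):
    """Mean squared prediction error on the given points."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(100, rng)  # held-out data from the same range

line_mse = mse(fit_line(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y), test_x, test_y)

print(f"linear model test MSE: {line_mse:.3f}")
print(f"k-NN test MSE:         {knn_mse:.3f}")
```

On this made-up data the straight line should come out with a much larger test error than the nearest-neighbor average. That says nothing, of course, about cases where the linear model is actually right.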

Of course, even Breiman admits that algorithms don't beat theory in all situations. In his comment, Cox points out that when the question being asked lies far out of past experience, theory becomes more crucial:
Often the prediction is under quite different conditions from the data; what is the likely progress of the incidence of the epidemic of v-CJD in the United Kingdom, what would be the effect on annual incidence of cancer in the United States of reducing by 10% the medical use of X-rays, etc.? That is, it may be desired to predict the consequences of something only indirectly addressed by the data available for analysis. As we move toward such more ambitious tasks, prediction, always hazardous, without some understanding of underlying process and linking with other sources of information, becomes more and more tentative.
And Breiman agrees:
I readily acknowledge that there are situations where a simple data model may be useful and appropriate; for instance, if the science of the mechanism producing the data is well enough known to determine the model apart from estimating parameters. There are also situations of great complexity posing important issues and questions in which there is not enough data to resolve the questions to the accuracy desired. Simple models can then be useful in giving qualitative understanding, suggesting future research areas and the kind of additional data that needs to be gathered. At times, there is not enough data on which to base predictions; but policy decisions need to be made. In this case, constructing a model using whatever data exists, combined with scientific common sense and subject-matter knowledge, is a reasonable path...I agree [with the examples Cox cites].
In a way, this compromise is similar to my post about structural vs. reduced-form models - when you have solid, reliable structural theory or you need to make predictions about situations far away from the available data, use more theory. When you don't have reliable theory and you're considering only a small change from known situations, use less theory. This seems like a general principle that can be applied in any scientific field, at any level of analysis (though it requires plenty of judgment to put into practice, obviously).
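
Here's a minimal sketch of that principle, with invented numbers. I assume a hypothetical "nature" that really is linear (y = 2x plus noise) and that we only observe x between 0 and 1. The "more theory" predictor fits the linear form; the "less theory" predictor just averages the 5 nearest observations. The function names and every constant are illustrative assumptions, not anything from Breiman or Cox.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical "nature": y = 2x plus small noise, observed only for x in [0, 1].
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 0.05) for x in xs]

# "More theory": assume the linear form and estimate it by least squares.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x


def theory_predict(x):
    """Prediction from the fitted linear model."""
    return intercept + slope * x


def knn_predict(x, k=5):
    """'Less theory': average the k observed points closest to x."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


# Compare a small change from known situations (x = 0.5)
# with a situation far from the available data (x = 5.0).
for x_new in (0.5, 5.0):
    truth = 2.0 * x_new
    print(f"x = {x_new}: truth = {truth:.2f}, "
          f"linear model = {theory_predict(x_new):.2f}, "
          f"k-NN = {knn_predict(x_new):.2f}")
```

Near the data (x = 0.5) the two should roughly agree. Far from it (x = 5) the nearest-neighbor average can only echo the largest values it has seen, while the fitted line carries on correctly if, and only if, the linear theory is actually right.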

So it's cool to see other fields having the same debate, and (hopefully) coming to similar conclusions.

In fact, it's possible that another form of the "theory vs. data" debate could be happening within machine learning itself. Some types of machine learning are more interpretable, which means it's possible - though very hard - to open them up and figure out why they gave the correct answers, and maybe generalize from that. That allows you to figure out other situations where a technique can be expected to work well, or even to use insights gained from machine learning to allow the creation of good statistical models.

But deep learning, the technique that's blowing everything else away in a huge array of applications, tends to be the least interpretable of all - the blackest of all black boxes. Deep learning is just so damned deep - to use Efron's term, it just has so many knobs on it. Even compared to other machine learning techniques, it looks like a magic spell. I enjoyed this cartoon by Valentin Dalibard and Petar Veličković (tweeted by Dendi Suhubdy):

Deep learning seems like the outer frontier of atheoretical, purely data-based analysis. It might even qualify as a new type of scientific revolution - a whole new way for humans to understand and control their world. Deep learning might finally be the realization of the old dream of holistic science or complexity science - a way to step beyond reductionism by abandoning the need to understand what you're predicting and controlling.

But this, as they say, would lead us too far afield...

(P.S. - Obviously I'm doing a ton of hand-waving here, I barely know any machine learning yet, and the paper I'm writing about is 16 years out of date! I'll try to start keeping track of cool stuff that's happening at the intersection of econ and machine learning, and on the general philosophy of the thing. For example, here's a cool workshop on deep learning, recommended by the good folks at r/badeconomics. It's quite possible deep learning is no longer anywhere near as impenetrable and magical as outside observers often claim...)


  1. The Breiman paper is excellent. Think of machine learning as *really* good interpolation: if you have enough data it will do an excellent job at generalising to 'nearby' situations. I'm not an economist, but... it strikes me that in economics you often want to extrapolate to not-so-well-covered situations, and you often care about more than good prediction: you want interpretability.

  2. It sometimes seems the point of Economics was to create models! But the point of science is to create explanations. In particular, scientific explanations. Explanations have two parts: a mechanism (i.e. a story as a result of which the phenomenon happens) and a criterion for considering the explanation valid (that's where the "scientific" in scientific explanations comes in). I think the point of statistical models (i.e. models with a data generating process) is that they try to explicitly validate a mechanism and that's why they make sense in an explanatory enterprise such as Economics.

    1. The point of economics (as with any science) is to build better tools to make better decisions.
      So, if you can robustly extract the deep structure, you might not care about what it truly is, as long as it improves your decision making.
      Explicit models are robustness checks, in case you cannot trust the implicit deep structure. Fortunately (or rather unfortunately), deep learning will do nothing (OK, maybe little) for aggregate behavior.
      So, the economics of models will live another day.

  3. The cartoon mentioned here is actually a screenshot from one of my slide decks, and is itself by Valentin Dalibard.

  4. Breiman's essay is very famous within ML and stats. But I think we (I speak mostly of ML here) have moved beyond the parametric vs. black box dichotomy by recognizing the weaknesses of both approaches and working to address them, or combine the best aspects of both. Active research areas in ML and computer science theory pushing in this direction are:

    - Robust (frequentist) statistics. Can we provably fit simple models even when we know they are wrong? And do so in a computationally efficient manner?
    - Robust (Bayesian) statistics. Can we determine when our Bayesian model is failing, and adjust accordingly? E.g., by determining which parts of the data don't adhere to the modeling assumptions.
    - Interrogating black box models to understand how and why they are making predictions. E.g., which training data points most influenced the prediction about a new piece of data?
    - Combining parametric models for the parts of a data generating process we think we understand well (which provides sample efficiency) with black box/nonparametric likelihoods for the parts we don't understand well (which provides flexibility).

    Happy to provide some references if you are interested.

    1. Yeah, send some references! I'd love to see them!

      One question: what does it mean to "provably fit" a model?

  5. Deep learning is Psychohistory!

  6. Or multi-level modeling bootstrapping.

  7. As an ML practitioner, my take is that, philosophically, ML says 'forget Occam's razor, and just massively overparameterize your model. Then make "parsimony" an extra model parameter and jointly optimize that with your actual model parameters'. Essentially you start with an expressive model and slowly strangle it until you get to the optimal level of strangulation. Optimal being maximum out-of-sample accuracy.

  8. Machines don't (yet) propagandize their political viewpoints by aiming to make their models agree with their politics, but economists do. And it's been proven desperately necessary to find a way to keep economists' politics out of policy. So I think machines pulling models out of their asses would be far preferable to today's situation of economists pulling models out of *their* asses.

    If a computer ends up saying something stupid, we'll be fine as long as Krugman is still around to mock it.

    As an aside, since economists can't possibly include everything in their models that would make them resemble reality (like interbank market sentiment, changes in sectors' market power characteristics, political economy decisions and political economy uncertainty, or even a decent explanation of what counts as capital), why not just let computers invent a model with inputs and parameters X(1) thru X(n)? Wasn't it someone like Friedman or Hayek who said "it doesn't have to be realistic if it works"? That was the explanation my prof gave me for RBC. :-)

  9. Keep in mind this was before the third neural network craze, and at the time neural networks weren't very hot. What Breiman probably was advocating was using random forests, which he invented, so he had a vested interest.

  10. To economists, appropriate scientific methodology is an alien landscape; something far beyond their comprehension. Mostly, they seem intent on promoting some ideological position which they support with a grab bag of empirical observations divorced from anything resembling actual theory.

    An obvious example is my own sphere of interest - trade theory. When applied to globalization in current scenarios, basic theory predicts slowed GDP growth, lower productivity, reduced investment in productive capacity and severely retarded wage growth in the developed economies. These poor outcomes are a result of the global restructuring and capital reallocation associated with allowing free trade and capital transfer with a huge population currently industrializing off a low capital base. These conditions can be expected to continue while wages are much lower and the return on capital higher than prevails in Western economies.

    Instead of appropriately warning us of the challenging conditions which lay ahead, economists denied their own basic theory. With an ideological commitment to free trade and globalization, they continually made false or misleading statements about its impacts on ordinary working families. Brad DeLong, your current guru, is a noted purveyor of many of these myths.

    1. Nathanael, 12:50 PM

      Good point. A surprising amount of the problems of economics has come when economists had a perfectly good theory which made good predictions, and actually denied the later predictions of the theory because they didn't like them.

      This is... not science.

  11. It's interesting to think of machine learning being used to inform policy, but I think ML use in the private sector could make enforcing those policies very difficult, especially for highly regulated industries.

    For example, I work at a financial services company, and whenever we do marketing, we need to explain our models and segmentations to our compliance department so they can ensure we're not systematically excluding certain populations (I believe it's part of the Fair Lending Act).

    If we used machine learning/black box magic to make those decisions, we couldn't point to the rationale for each decision. And I don't know if saying "well, this is what the computer told us to do and we don't know how it works" will satisfy regulators. So using ML to create policies, while the private sector potentially can't use ML and still satisfy those regulations, is an interesting dynamic.

  12. David J. Littleboy, 6:14 PM

    Rather than rant at ML (which I'm not fond of), I'll point you at someone who does it better than I could, Gary Marcus.

    In the video, Marcus talks really fast, and makes lots of really important points. To paraphrase: "AI is like a drunk looking for his keys where the light is. We need more lampposts. ML is explicating important aspects of perception, but it doesn't do cognition." When we say "AI" in our hype, everyone thinks that we're doing cognition, but we're not.

    (Another point he makes is that correlation is not causation. You dump a zillion tons of economic data into an ML system, and it finds the correlations, even a few you might not have noticed already. But they're just correlations, and you'll lose your shirt if you use those correlations to bet on next week's stock market.)

    1. Nathanael, 12:53 PM

      Good point. So-called "deep learning" is basically very powerful correlators. They're good for certain types of fuzzy pattern recognition. Period.

      It's not AI; it replicates at best one of the subunits of intelligence. If you have an actual understanding of *causation*, you're better off using a model.

      If you don't really understand the causation (like, cough, most of economics) then stick with the correlating "deep learning" machines.

  13. Ok. But any approach is only as good as the fit of the data to the real world. So the key step is to continually check the definition of the data against the reality, and then check the collection process. Much economics data has poor fit - think of the process of hedonic adjustment of prices. This is kind of the usual state in the social sciences, as part of the ongoing social contest revolves around changing the definitions or rules of the game to one's advantage. Computers may be part of the solution, but the other part is more sociological groundwork.

  14. Have you seen Cosma's project, related to this set of topics:


  15. I always thought of Breiman's paper as important in making a distinction between methods used for "explaining" vs. "predicting". I had not thought about Breiman's paper in the theory vs. data context. But it makes sense now that you put it this way.