The Causal Advantage
Why high-quality causal data, not bigger datasets, is what lets a system reach its full potential.
Ibrahim AbuAlhaol, PhD, P.Eng., SMIEEE
AI Technical Lead
For a decade, the default answer to almost every hard problem in AI has been the same: collect more data. The answer is wrong more often than the field likes to admit. A larger pile of records makes a model more confident about patterns it has already seen. It does not teach the model why those patterns hold, and it does not tell anyone what happens when the world shifts.
That gap is the difference between a system that describes the past and a system that can act on the future. The book The Why Axis, by economists Uri Gneezy and John List, makes the point in plain terms: you learn why people behave the way they do by running deliberate experiments, not by mining ever-larger logs of what they did. The same lesson now governs whether an AI system earns its keep.
A system that knows only correlations can predict the world as it is. A system that knows causes can act on the world to change it. Only one of those is worth betting a business on.
The two questions data can answer
Judea Pearl, who spent his career formalizing cause and effect, describes a ladder with three rungs. The bottom rung is association: what tends to occur together. Most machine learning lives here. A model trained on millions of historical records learns that customers who do A often go on to do B. That is useful for ranking and forecasting, and it is all that correlation can give you.
The higher rungs ask harder questions. The second rung is intervention: what happens if we change A on purpose. The third is the counterfactual: what would have happened to this specific customer had we acted differently. Observation alone, without an experiment or explicit causal assumptions, cannot climb past the first rung, because two variables can move together for reasons that have nothing to do with one causing the other. Ice cream sales and drowning rates rise in the same months. No amount of additional sales data will tell you that summer heat, not dessert, is behind both.
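A small simulation makes the ice cream example concrete. The sketch below is illustrative only: the coefficients, noise levels, and variable names are invented assumptions, not real sales or drowning figures. Temperature drives both series, and neither series affects the other. The raw correlation comes out strongly positive anyway. It falls to roughly zero once temperature is regressed out of both.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a least-squares line fitted on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)

# Assumed toy world: temperature causes both series; they do not cause each other.
temperature = [rng.uniform(5, 35) for _ in range(5000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 2) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print(f"raw correlation:               {raw_corr:.2f}")
print(f"after removing temperature:    {partial_corr:.2f}")
```

The adjustment only works here because the confounder was measured and named. In a real log, nobody hands you the temperature column.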
Why a bigger dataset does not close the gap
This is the part that catches careful teams off guard. When you double an observational dataset, you sharpen your estimate of the correlations already present. You do not add a single causal fact. The confounders that distorted the small dataset distort the large one just as much, only now with tighter error bars that make the wrong conclusion look authoritative.
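The effect is easy to reproduce. The following sketch assumes a toy data-generating process in which a hidden variable `u` drives both an exposure `x` and an outcome `y`, while the true effect of `x` on `y` is exactly zero. All names and coefficients are hypothetical. It fits a naive regression of `y` on `x` at two sample sizes and reports the slope with its standard error.

```python
import random

def naive_slope(n, seed):
    """Regress y on x while ignoring the hidden confounder u.

    Returns the least-squares slope and its standard error.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                       # hidden confounder
        x = u + rng.gauss(0, 1)                   # exposure, partly driven by u
        y = 0.0 * x + 2.0 * u + rng.gauss(0, 1)   # true effect of x on y is zero
        xs.append(x)
        ys.append(y)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    return slope, std_err

small_slope, small_se = naive_slope(500, seed=1)
large_slope, large_se = naive_slope(50_000, seed=2)

print(f"n=500:    slope={small_slope:.3f} +/- {small_se:.3f}")
print(f"n=50000:  slope={large_slope:.3f} +/- {large_se:.3f}")
```

Under these assumptions both runs settle near a slope of 1, not the true value of 0. The larger run simply reports the wrong number with about a tenth of the uncertainty.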
Gneezy and List built their careers on the alternative. They ran field experiments, changing one variable at a time in the real world, to separate what causes a behavior from what merely accompanies it. A randomized change, even on a modest sample, answers a question that petabytes of passive logs cannot. The value sits in how the data was generated, not in how much of it piled up.
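A rough sketch of that contrast, under invented assumptions: a treatment with a true effect of 0.5, and a hidden trait (called `intent` here, a hypothetical name) that raises the outcome and also makes treatment more likely in the passive log. The log has 200,000 rows. The experiment has 2,000 rows with treatment assigned by coin flip.

```python
import random

TRUE_EFFECT = 0.5  # assumed ground truth for this toy example

def outcome(rng, treated, intent):
    """Outcome depends on the treatment and on a hidden trait."""
    return TRUE_EFFECT * treated + 1.0 * intent + rng.gauss(0, 1)

def difference_in_means(rows):
    """Mean outcome of treated rows minus mean outcome of control rows."""
    treated = [y for t, y in rows if t == 1]
    control = [y for t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

rng = random.Random(7)

# Passive log: high-intent units are more likely to end up treated.
observational = []
for _ in range(200_000):
    intent = rng.gauss(0, 1)
    treated = 1 if intent + rng.gauss(0, 1) > 0 else 0
    observational.append((treated, outcome(rng, treated, intent)))

# Small experiment: a coin flip decides treatment, independent of intent.
randomized = []
for _ in range(2_000):
    intent = rng.gauss(0, 1)
    treated = rng.randint(0, 1)
    randomized.append((treated, outcome(rng, treated, intent)))

observational_estimate = difference_in_means(observational)
randomized_estimate = difference_in_means(randomized)

print(f"true effect:             {TRUE_EFFECT}")
print(f"200,000 logged rows:     {observational_estimate:.2f}")
print(f"2,000 randomized rows:   {randomized_estimate:.2f}")
```

In this setup the large log overstates the effect several times over, while the small experiment lands close to the truth. Randomization breaks the link between the hidden trait and the treatment, which is something additional logged rows cannot do.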
The same trap appears in production AI. A recommendation model trained on click logs learns that a banner and a purchase coincide. Whether the banner caused the purchase is a separate question, and the log alone cannot settle it. Teams that confuse the two ship features that test well on history and then disappoint in the live experiment.
Quality and structure beat raw volume
Recent work from large labs lands on the same conclusion from a different direction. In the paper Textbooks Are All You Need, a Microsoft Research team trained a small code model, phi-1, on roughly seven billion tokens of textbook-quality material, part of it filtered from the web and part of it synthetically generated. On code benchmarks such as HumanEval, the 1.3 billion parameter model matched or beat many far larger models trained on the usual scraped web. The training set was tiny by modern standards. It was also clean, well structured, and dense with the reasoning the task actually required.
Read that result next to the causal point and a single idea emerges. Data becomes knowledge when it is organized around the relationships that matter, not when it is merely abundant. A curated corpus that encodes how and why, and a dataset built from deliberate interventions, both do the same job: they hand the system the structure of the problem instead of leaving it to guess from noise.
Knowledge, in this sense, is not a synonym for a large database. It is causal structure that has been captured, cleaned, and arranged so a system can reason with it. That is what lets a system reach its potential with less compute and fewer surprises.
Building systems that climb the ladder
The research frontier has a name for this goal. Bernhard Schölkopf and colleagues, including Yoshua Bengio, frame causal representation learning as the discovery of the few high-level causal variables that drive a process, recovered from messy low-level observations. A model that learns those variables transfers to new conditions, because the causes it relies on still hold when the surface statistics change. A model that learns only correlations tends to break when the distribution moves.
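A toy contrast shows why the distinction matters under shift. This is not causal representation learning itself, only a minimal sketch with made-up variables: an outcome driven by a `cause`, and a `proxy` that tracks the cause in the training regime but becomes unrelated to it in a shifted regime. One line is fitted on the proxy and one on the cause, and both are scored on fresh data from each regime.

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given data."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_regime(rng, n, proxy_tracks_cause):
    """Draw (cause, proxy, outcome); the outcome always depends on the cause only."""
    cause = [rng.gauss(0, 1) for _ in range(n)]
    outcome = [2.0 * c + rng.gauss(0, 0.5) for c in cause]
    if proxy_tracks_cause:
        proxy = [c + rng.gauss(0, 0.3) for c in cause]
    else:
        proxy = [rng.gauss(0, 1) for _ in range(n)]  # link to the cause is gone
    return cause, proxy, outcome

rng = random.Random(3)
train_c, train_p, train_y = make_regime(rng, 5000, proxy_tracks_cause=True)
same_c, same_p, same_y = make_regime(rng, 5000, proxy_tracks_cause=True)
shift_c, shift_p, shift_y = make_regime(rng, 5000, proxy_tracks_cause=False)

proxy_model = fit_line(train_p, train_y)
causal_model = fit_line(train_c, train_y)

proxy_in, proxy_shift = mse(proxy_model, same_p, same_y), mse(proxy_model, shift_p, shift_y)
causal_in, causal_shift = mse(causal_model, same_c, same_y), mse(causal_model, shift_c, shift_y)

print(f"proxy model   in-distribution MSE={proxy_in:.2f}  shifted MSE={proxy_shift:.2f}")
print(f"causal model  in-distribution MSE={causal_in:.2f}  shifted MSE={causal_shift:.2f}")
```

Under these assumptions the proxy model looks acceptable on held-out data from the training regime and degrades sharply after the shift, while the model built on the cause scores about the same in both.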
For a team shipping systems today, the practical path is concrete:
- Treat experiments as a data source, not an afterthought. Every randomized A/B test, holdout, or staged rollout produces causal evidence that no amount of observational logging replaces.
- Write down the assumed causal structure of the problem before training. A simple diagram of what drives what exposes the confounders that a model will otherwise absorb as signal.
- Spend the data budget on quality and coverage of the decision space, not on raw row count. A smaller set that spans the conditions you will actually face beats a larger set drawn from a single regime.
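The second step, writing down the causal structure, can start as something very small. The sketch below encodes an assumed diagram for the banner example as a plain dictionary and lists the variables that feed into both the treatment and the outcome. The graph, its node names, and the helper functions are hypothetical illustrations, not a prescribed tool or a full identification procedure.

```python
def ancestors(graph, node, blocked=frozenset()):
    """All nodes with a directed path into `node` that avoids `blocked` nodes."""
    parents = {}
    for cause, effects in graph.items():
        for effect in effects:
            parents.setdefault(effect, set()).add(cause)
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen and parent not in blocked:
                seen.add(parent)
                stack.append(parent)
    return seen

def common_causes(graph, treatment, outcome):
    """Variables that influence the treatment and also reach the outcome
    by a path that does not pass through the treatment."""
    causes_of_treatment = ancestors(graph, treatment)
    causes_of_outcome = ancestors(graph, outcome, blocked={treatment})
    return sorted(causes_of_treatment & causes_of_outcome)

# Assumed diagram: each key points to the variables it directly affects.
assumed_graph = {
    "season": ["banner_shown", "purchase"],
    "user_intent": ["banner_shown", "purchase"],
    "banner_shown": ["purchase"],
    "page_load_time": ["purchase"],
}

confounders = common_causes(assumed_graph, "banner_shown", "purchase")
print(confounders)  # ['season', 'user_intent']
```

The output is only as good as the diagram. Its value is that it forces the team to state which variables they believe sit behind both the action and the result, before a model absorbs them as signal.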
The shift this asks for
The instinct to hoard data made sense when storage was the bottleneck and any signal was scarce. That era is over. The constraint now is the quality and the causal content of what you feed a system, and feeding it more of the wrong thing actively degrades trust in the output.
Organizations that internalize this will run fewer, sharper data efforts and get more from smaller models. They will ask why before they ask how much. The result is systems that hold up when conditions change, which is the only test that matters once a model leaves the lab.
What leaders should do
- Fund a standing experimentation capability. Make randomized tests a first-class input to every model, and treat their results as the data of record when they disagree with observational logs.
- Require a causal diagram for any high-stakes model before approving the training budget. If the team cannot name the likely confounders, the model is not ready.
- Redirect data spend from volume to curation. Reward teams for shrinking a dataset while raising its quality and coverage, the way phi-1 did, rather than for the size of the corpus.
- Judge models on out-of-distribution performance, not on fit to historical data. Hold back a slice of genuinely different conditions and make passing it the bar for shipping.
References & Extended Literature
- Gneezy, U., & List, J. A. (2013). "The Why Axis: Hidden Motives and the Undiscovered Economics of Everyday Life." PublicAffairs. https://en.wikipedia.org/wiki/The_Why_Axis
- Pearl, J., & Mackenzie, D. (2018). "The Book of Why: The New Science of Cause and Effect." Basic Books. https://en.wikipedia.org/wiki/The_Book_of_Why
- Pearl, J. (2019). "The Seven Tools of Causal Inference, with Reflections on Machine Learning." Communications of the ACM, 62(3). https://ftp.cs.ucla.edu/pub/stat_ser/r481.pdf
- Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., & Bengio, Y. (2021). "Towards Causal Representation Learning." Proceedings of the IEEE. arXiv:2102.11107. https://arxiv.org/abs/2102.11107
- Gunasekar, S., et al. (2023). "Textbooks Are All You Need." Microsoft Research. arXiv:2306.11644. https://arxiv.org/abs/2306.11644