Scaling Laws Revisited: When Data, Not Compute, is the Bottleneck
Abstract:
The formula for AI progress has long appeared straightforward: more compute, more data, better models. Yet while compute is growing with better hardware and bigger clusters, data is stagnating—calling into question the very scaling laws that powered the last decade. The internet—often described as the “fossil fuel” of AI—offers only a finite reservoir of training data, raising a critical question: how can we sustain the scaling trends that underpin modern AI progress?
This thesis explores one promising direction: trading off compute for data. The central idea is to leverage additional compute to compensate for limited data. We present three simple strategies towards this goal:
- Make the task harder. We show that making the training objective more challenging can improve the generalization ability of current models.
- Make the supervision richer. We demonstrate that providing dense gradient feedback can improve the sample efficiency of post-training foundation models.
- Make the tasks unsupervised. We find that large language models can improve at reasoning without access to any ground truth question–answer pairs, reducing reliance on costly supervision.
Collectively, these results suggest new ways of extending scaling laws in an era where data growth can no longer be taken for granted.
Thesis Committee Members:
Deepak Pathak, Co-Chair
Katerina Fragkiadaki, Co-Chair
Deva Ramanan
Hao Liu
Yejin Choi, Stanford University
