Woofun AI reports that Lilian Weng, a former OpenAI vice president and Peking University graduate, has published an analysis titled 'Scaling Laws, Carefully' after a three-year delay, arguing that the foundational formula guiding billions of dollars in AI investment is fragile. The article dismantles the industry's five-year reliance on a single mathematical relationship, which claimed that performance improves in fixed proportion as model size, data volume, and compute increase. This law transformed AI from alchemy into a predictable business practice, yet it contains significant methodological flaws and ignores the impending exhaustion of high-quality data, and that finding marks a critical inflection point for the sector. The analysis details how the opposing conclusions of OpenAI and DeepMind on budget allocation were driven by measurement discrepancies and calculation errors rather than genuine scientific divergence.
The core of the Scaling Laws hypothesis is that as the number of parameters N, the amount of data D, and the compute C increase, the training loss, measured above its floor E, declines along a straight line on logarithmic axes. This is expressed as L(x) = E + A/x^α, where x stands for N, D, or C. Here, E represents the theoretical optimal loss, or the entropy of the data itself, while A and α are constants obtained by fitting. Training a model with N parameters on D tokens requires a total of approximately 6ND floating-point operations (FLOPs): 2ND for the forward pass and 4ND for the backward pass. Before Kaplan and colleagues at OpenAI introduced this power-law relationship in 2020, deep learning was often dismissed as mystical because practitioners knew what worked but lacked a theoretical understanding of why. The ability to predict performance by training smaller models and extrapolating the results to larger scales, without spending billions on actual training, became the primary justification for the massive capital allocation toward large-scale models.
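As a minimal sketch of the two formulas above, the following Python evaluates L(x) = E + A/x^α and the 6ND compute estimate. The constants E, A, and α are hypothetical placeholders chosen only for illustration, not fitted values from any paper; the GPT-3 figures (175 billion parameters, 300 billion tokens) are the ones quoted in this article.

```python
def power_law_loss(x, E, A, alpha):
    """Loss predicted by L(x) = E + A / x**alpha."""
    return E + A / x**alpha


def training_flops(n_params, n_tokens):
    """Approximate training compute C ~ 6ND (2ND forward + 4ND backward)."""
    forward = 2 * n_params * n_tokens
    backward = 4 * n_params * n_tokens
    return forward + backward


# Hypothetical constants, for illustration only.
E, A, ALPHA = 1.7, 400.0, 0.3

for x in (1e6, 1e8, 1e10):
    print(f"x = {x:.0e}: predicted loss = {power_law_loss(x, E, A, ALPHA):.3f}")

# Compute estimate for a 175B-parameter model trained on 300B tokens.
print(f"GPT-3 scale: about {training_flops(175e9, 300e9):.2e} FLOPs")
```

Because only the reducible part A/x^α follows the power law, multiplying x by 100 shrinks the gap to E by a factor of 100^α, while E itself never moves.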
However, the critical question of how to split a fixed compute budget optimally between model size and data volume led to contradictory strategies that dominated the industry for years.
In 2020, Kaplan's team at OpenAI concluded that the optimal model size N_opt is approximately proportional to C^0.73, which means that when compute grows tenfold, model size should grow about 5.5 times while data volume grows only about 1.8 times. This conclusion directly shaped the training of GPT-3, a 175-billion-parameter model trained on only 300 billion tokens, which by later standards was severely undertrained. In 2022, DeepMind's Chinchilla team reached the opposite conclusion: N_opt is approximately proportional to C^0.50, so model size and data volume should grow in equal proportion. Engineers simplified this into a rule of thumb of about 20 tokens per parameter. To validate it, DeepMind compared its Gopher model (280 billion parameters, 300 billion tokens) against Chinchilla (70 billion parameters, 1.4 trillion tokens). Despite using roughly the same compute, Chinchilla significantly outperformed Gopher, shifting the industry consensus from maximizing model size to recognizing that most models were undertrained. The gap between the exponents 0.73 and 0.50 meant that the same budget could be deployed in radically different ways depending on which research team's methodology was adopted.
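The arithmetic behind the two allocation rules can be sketched as follows. The sketch assumes C ∝ N·D, so that an exponent a for model size implies an exponent 1 − a for data; the function names are illustrative, and the model sizes and token counts are the ones quoted in this article.

```python
def growth_split(compute_factor, n_exponent):
    """Return (model_factor, data_factor) for a given compute multiplier.

    Assumes N_opt ~ C**n_exponent and C ~ N * D,
    which gives D_opt ~ C**(1 - n_exponent).
    """
    model_factor = compute_factor ** n_exponent
    data_factor = compute_factor ** (1.0 - n_exponent)
    return model_factor, data_factor


def training_flops(n_params, n_tokens):
    """Approximate training compute, C ~ 6ND."""
    return 6 * n_params * n_tokens


# How a tenfold compute increase is split under each exponent.
for label, exponent in (("Kaplan", 0.73), ("Chinchilla", 0.50)):
    n_x, d_x = growth_split(10.0, exponent)
    print(f"{label}: 10x compute -> model x{n_x:.1f}, data x{d_x:.1f}")

# Gopher and Chinchilla: similar compute, very different token/parameter ratios.
for name, n, d in (("Gopher", 280e9, 300e9), ("Chinchilla", 70e9, 1.4e12)):
    flops = training_flops(n, d)
    print(f"{name}: {flops:.2e} FLOPs, {d / n:.1f} tokens per parameter")
```

Under the 0.73 exponent almost all of a tenfold budget increase goes to parameters, whereas under 0.50 both factors are the square root of ten, which is where the equal-proportion rule comes from.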
A 2024 paper by two researchers, published in the top machine learning journal TMLR, traced this disagreement to two root causes: how parameters were counted and the scale of the experiments. In neural networks, the embedding layer converts text tokens into numerical vectors, and in smaller models this layer can account for up to one-third of the total parameters. Kaplan excluded embeddings when counting parameters, whereas Chinchilla included them. This simple difference in measurement distorted the fitted power-law relationship. The researchers provided a correction formula, N = N_\E + ω·N_\E^(1/3), where N is the total parameter count, N_\E is the count excluding embeddings, and ω is a constant. For smaller models the second term has a significant impact, but for larger models its share of the total tends toward zero, so the two counting methods eventually converge.
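A small sketch shows how the correction term fades with scale. The value of ω used here is a hypothetical placeholder, since the article does not state it, so the printed shares are illustrative rather than a reproduction of the paper's numbers.

```python
def total_params(n_non_embedding, omega):
    """Total parameters from N = N_\\E + omega * N_\\E**(1/3)."""
    return n_non_embedding + omega * n_non_embedding ** (1.0 / 3.0)


def correction_share(n_non_embedding, omega):
    """Fraction of the total contributed by the omega * N_\\E**(1/3) term."""
    extra = omega * n_non_embedding ** (1.0 / 3.0)
    return extra / (n_non_embedding + extra)


OMEGA = 5.0e4  # hypothetical value, for illustration only

for n_ne in (1e7, 1e8, 1e10, 1e12):
    share = correction_share(n_ne, OMEGA)
    print(f"N_\\E = {n_ne:.0e}: total = {total_params(n_ne, OMEGA):.3e}, "
          f"correction share = {share:.2%}")
```

The second term grows only as the cube root of N_\E, so its share of the total falls steadily as models get larger.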
The scale of the experiments also played a crucial role: Kaplan tested models of up to 1.5 billion parameters, while Chinchilla covered models of more than 16 billion parameters. On logarithmic axes, small differences in a fit are amplified when the line is extrapolated to larger scales. Using a unified counting method, the researchers found that the exponent was indeed close to 0.73 within Kaplan's small-scale range but converged toward 0.50 as scale increased. Kaplan was not wrong within his experimental scope; the error lay in generalizing a local pattern into a universal conclusion.
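The dependence of a fitted exponent on the fitting range can be illustrated with a toy curve. This sketch fits a straight line in log-log space to the counting relation N = N_\E + ω·N_\E^(1/3) over a small-scale window and a large-scale window. The value of ω and both windows are hypothetical, and the sketch does not attempt to reproduce the 0.73 or 0.50 exponents; it only shows that a curve which is not a pure power law yields different slopes depending on where it is fitted.

```python
import math


def total_params(n_non_embedding, omega):
    """Total parameters from N = N_\\E + omega * N_\\E**(1/3)."""
    return n_non_embedding + omega * n_non_embedding ** (1.0 / 3.0)


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


def geometric_grid(lo, hi, count):
    """Points evenly spaced on a logarithmic axis between lo and hi."""
    ratio = (hi / lo) ** (1.0 / (count - 1))
    return [lo * ratio**i for i in range(count)]


OMEGA = 5.0e4  # hypothetical value, for illustration only

small_window = geometric_grid(1e5, 1e7, 20)   # hypothetical small-scale range
large_window = geometric_grid(1e9, 1e11, 20)  # hypothetical large-scale range

slope_small = loglog_slope(small_window, [total_params(n, OMEGA) for n in small_window])
slope_large = loglog_slope(large_window, [total_params(n, OMEGA) for n in large_window])

print(f"slope fitted on the small window: {slope_small:.2f}")
print(f"slope fitted on the large window: {slope_large:.2f}")
```

A line fitted only on the small window would look like a clean power law there, yet extrapolating it to the large window would miss by a wide margin, which is the failure mode described above.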
Although that settled the parameter-counting issue, Lilian Weng's analysis points out that even the Chinchilla methodology, which the industry followed for two years, contained severe flaws. The Chinchilla paper used three independent estimation methods: fixing model size while varying data, plotting iso-FLOP curves, and directly fitting the loss function L(N,D) = E + A/N^α + B/D^β. All three initially seemed to support the 0.50 exponent. Method 3 offered an elegant mathematical derivation: minimizing L(N,D) under the constraint C ≈ 6ND yields the closed-form solution N_opt ∝ (C/6)^(β/(α+β)). When α ≈ β, the exponent becomes approximately 0.5.
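The closed-form solution can be checked numerically. Substituting D = C/(6N) into L(N,D) and setting the derivative with respect to N to zero gives N_opt = G·(C/6)^(β/(α+β)) with G = (αA/(βB))^(1/(α+β)), and D_opt = (C/6)/N_opt. The sketch below uses the rounded constants commonly quoted from the Chinchilla paper (E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28); they are not given in this article and are included here purely as an illustrative assumption.

```python
def parametric_loss(n, d, E, A, B, alpha, beta):
    """L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n**alpha + B / d**beta


def compute_optimal(c_flops, A, B, alpha, beta):
    """Minimize L(N, D) subject to C = 6ND; returns (N_opt, D_opt)."""
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_exponent = beta / (alpha + beta)
    n_opt = g * (c_flops / 6.0) ** n_exponent
    d_opt = (c_flops / 6.0) / n_opt
    return n_opt, d_opt


# Assumed constants (rounded values commonly quoted from the Chinchilla paper).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28
C = 5.88e23  # compute budget in FLOPs, i.e. 6 * 70e9 * 1.4e12

n_opt, d_opt = compute_optimal(C, A, B, ALPHA, BETA)
print(f"exponent beta/(alpha+beta) = {BETA / (ALPHA + BETA):.3f}")
print(f"N_opt = {n_opt:.3e}, D_opt = {d_opt:.3e}")

# Sanity check: nudging N away from N_opt at fixed compute raises the loss.
best = parametric_loss(n_opt, d_opt, E, A, B, ALPHA, BETA)
for scale in (0.5, 2.0):
    n = n_opt * scale
    d = C / (6.0 * n)
    print(f"N = {scale} * N_opt: loss - best = "
          f"{parametric_loss(n, d, E, A, B, ALPHA, BETA) - best:.5f}")
```

With these rounded constants the exponent is β/(α+β) rather than exactly 0.5, which shows how sensitive the allocation rule is to the precision of the two fitted exponents.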
However, in 2024, the Epoch AI team manually extracted the original data points from Chinchilla's charts, re-derived the results using Method 3, and discovered two critical bugs. The first bug was in the loss calculation: Chinchilla minimized the average of the per-sample Huber loss rather than the sum. The intended optimization objective was min Σ Huber_δ(log L̂(Nᵢ,Dᵢ) − log Lᵢ), using a Huber loss (δ = 10⁻³), which is less sensitive to outliers, and the L-BFGS-B optimizer. Averaging instead of summing made the loss values far smaller, so the optimizer's stopping criterion was met prematurely, before it reached the true optimal parameters. The second bug was rounding: the two core exponents that determine the shape of the power law were reported to only two decimal places. This small loss of precision caused significant errors when deriving other constants. The reported confidence intervals were also unreasonably narrow: intervals that tight would require over 600,000 experiments, whereas fewer than 500 were actually conducted.
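Why averaging can end an optimization early is easiest to see with a tolerance check. The sketch below applies a relative-reduction stopping test of the form documented for SciPy's L-BFGS-B, (f_prev − f_next) / max(|f_prev|, |f_next|, 1) ≤ ftol, to the same improvement step under sum and mean reduction. Whether the original fit used SciPy, and the residuals, point count, and tolerance used here, are all assumptions made for illustration.

```python
def huber(residual, delta=1e-3):
    """Huber loss: quadratic near zero, linear beyond delta."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude * magnitude
    return delta * (magnitude - 0.5 * delta)


def objective(residuals, reduction):
    """Huber objective over log-loss residuals, reduced by 'sum' or 'mean'."""
    total = sum(huber(r) for r in residuals)
    return total if reduction == "sum" else total / len(residuals)


def stops_on_ftol(f_prev, f_next, ftol):
    """Relative-reduction stopping test (the max(..., 1) makes it absolute for small f)."""
    return (f_prev - f_next) / max(abs(f_prev), abs(f_next), 1.0) <= ftol


N_POINTS = 200                 # hypothetical number of data points
FTOL = 1e-6                    # hypothetical tolerance
before = [0.0100] * N_POINTS   # hypothetical residuals log(L_hat) - log(L)
after = [0.0099] * N_POINTS    # the same residuals after one improving step

for reduction in ("sum", "mean"):
    f_prev = objective(before, reduction)
    f_next = objective(after, reduction)
    verdict = "stops" if stops_on_ftol(f_prev, f_next, FTOL) else "keeps going"
    print(f"{reduction}: improvement = {f_prev - f_next:.2e} -> optimizer {verdict}")
```

The step is identical in both cases, but dividing by the number of points shrinks the measured improvement below the tolerance, so the mean-reduced run halts while the sum-reduced run continues.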
Woofun AI data shows that the industry relied on these flawed formulas for two years, with the entire sector adjusting its training practices based on conclusions that were mathematically compromised. Weng's blog includes an interactive simulator in which users adjust sliders for loss accuracy, loss noise, and fitting range to see how each change alters the Scaling Law curve. The analysis shows that OpenAI's conclusion suffered from local bias due to small-scale experiments, while DeepMind's approach was marred by methodological errors in loss calculation and precision. Both sides of the most significant academic debate in the AI industry had fundamental shortcomings.
Beyond these fitting issues of parameter counting, loss calculation, and decimal precision, the classic Scaling Laws theory faces a more existential threat: the assumption of unlimited data. The formula presumes that every training token is unique and never repeated, yet major research institutions project that high-quality text data will be exhausted between 2026 and 2028. Repeated training on the same data is now inevitable, which collapses the premise of the classic formula.
A large-scale experiment in 2023, covering approximately 400 models with tens of millions to 9 billion parameters and up to 1,500 repeated training rounds, introduced the concept of 'effective data volume' to replace raw token counts. If U unique tokens are repeated for R additional rounds, the effective data volume is not U×(1+R) but follows an exponential-saturation formula, D_eff = U + U·R*·(1 − e^(−R/R*)), where R* is a fitted constant. While significant new information is still learned in the first few rounds, the marginal benefit of each further round decays exponentially and eventually approaches zero. Counterintuitively, additional parameters depreciate faster than repeatedly used data, suggesting that under resource constraints, running more training rounds is more cost-effective than increasing model size.
A new paper published in May 2026 proposed adding an explicit overfitting penalty term to the classic loss function, where the penalty is proportional to the model size and the number of repeated training rounds R. The complete formula includes constants P, δ, and κ derived from experimental data, together with the ratio N/U of model parameters to unique data volume. The core finding is that larger models are more sensitive to repeated data: a 500-million-parameter model might perform well after 10 rounds, whereas a 5-billion-parameter model would suffer significant declines. Stronger weight decay was also found to significantly reduce overfitting from repeated training.
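The saturation of repeated data can be sketched with one commonly cited parameterization of effective data, D_eff = U + U·R*·(1 − e^(−R/R*)), where R counts repetitions after the first pass. Both this exact form and the value of R* below are assumptions for illustration; R* in particular is a hypothetical placeholder rather than a fitted constant.

```python
import math


def effective_data(unique_tokens, repetitions, r_star):
    """Effective tokens after `repetitions` extra passes over `unique_tokens`.

    Uses D_eff = U + U * R_star * (1 - exp(-R / R_star)).
    """
    saturating = r_star * (1.0 - math.exp(-repetitions / r_star))
    return unique_tokens * (1.0 + saturating)


R_STAR = 15.0  # hypothetical constant, for illustration only
U = 1.0        # unique data, in arbitrary units

for reps in (0, 1, 3, 9, 39, 99):
    seen = U * (1 + reps)  # tokens actually processed
    eff = effective_data(U, reps, R_STAR)
    print(f"{reps + 1:>3} passes: seen = {seen:6.1f}, "
          f"effective = {eff:6.2f} ({eff / seen:.0%} of seen)")
```

Under this form the effective data can never exceed U·(1 + R*), no matter how many passes are run, which is the sense in which repetition stops paying off.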
Consequently, between 2025 and 2026, the industry pivoted toward three approaches to circumvent data scarcity. Reinforcement learning, exemplified by DeepSeek R1 and OpenAI's o-series models, lets models generate their own training signals through verifiable tasks such as mathematics and programming. Test-time compute has models think longer at inference without increasing training costs. Synthetic data generated by existing large models is used to train the next generation. All three strategies share the premise that the traditional power-law relationship based solely on increasing model size is no longer sufficient.
Lilian Weng's background adds weight to this critique. She holds a bachelor's degree from Peking University and a doctorate in network science and complex systems from Indiana University Bloomington, where she initially studied how information spreads within social networks rather than deep learning. After working at Dropbox and Affirm, she joined OpenAI in 2018 and contributed to the Dactyl robot, which learned to solve a Rubik's Cube over two years. She later established the applied research team and built the Safety Systems team, which grew to over 80 scientists, engineers, and policy experts before her departure. Promoted to VP of Research and Safety in August 2024, she left three months later.
Weng's personal blog, Lil'Log, which she started in 2017 to organize her learning notes, has become one of the most cited resources in the AI field, and universities use it for teaching. She once stated, 'Explaining a concept clearly is the best way to test whether you truly understand it.' Over nine years, she has written dozens of articles on reinforcement learning, diffusion models, and large language model agents, each grounded in basic principles and illustrated with original diagrams. In February 2025, she co-founded Thinking Machines Lab with former OpenAI CTO Mira Murati, OpenAI co-founder John Schulman, former research VP Barret Zoph, and Luke Metz. The company secured $2 billion in seed financing led by a16z at a valuation of $12 billion. Alongside this work, she completed this in-depth analysis of Scaling Laws.
The platforms we use daily, including ChatGPT, Claude, and Gemini, rely on these formulas to determine future training strategies. The utility of the next generation of AI will depend not on who possesses the most GPUs, but on who can navigate these mathematical nuances with greater precision. This marks a shift from the era of brute-force scaling to one of algorithmic efficiency and data synthesis. The industry must now accept that the era of infinite data growth is over, and that the path forward requires a fundamental rethinking of how models are trained and evaluated in a resource-constrained environment.