The AI industry has undergone a fundamental structural shift: the bottleneck has migrated from model training to inference. In 2023, David Cahn of Sequoia Capital identified a critical disconnect in the ecosystem, which he termed the '$200 billion problem.' He noted that every $1 spent on GPUs required approximately $1 more for power, so the industry needed roughly $200 billion in revenue to recoup its annual capital expenditures. Even under optimistic revenue assumptions, a gap of more than $125 billion remained between investment and actual end-user payments. By 2024, the discrepancy had widened significantly as the largest technology companies increased capital expenditure, and the issue was recast as the '$600 billion problem.' The core concern remained the same: building GPU capacity ahead of demand risked massive capital waste.
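A rough sketch of the arithmetic behind these two figures follows, in Python. The $50 billion and $150 billion annual GPU spend inputs and the 50% gross margin are assumptions chosen so that the results match the totals quoted above; they are not stated in this article, and the function name is hypothetical.

```python
def required_revenue(gpu_spend_bn: float,
                     cost_multiplier: float = 2.0,
                     gross_margin: float = 0.5) -> float:
    """Revenue (in $ billions) needed to pay back a year of GPU spending.

    cost_multiplier: total data center cost per $1 of GPUs. A value of 2.0
        encodes the "$1 of power for every $1 of GPUs" rule from the text.
    gross_margin: ASSUMED margin earned by whoever sells the compute.
    """
    data_center_cost = gpu_spend_bn * cost_multiplier
    return data_center_cost / (1.0 - gross_margin)


# ASSUMED annual GPU spend inputs, in $ billions (not from the article).
revenue_2023 = required_revenue(50.0)
revenue_2024 = required_revenue(150.0)

# The article cites a gap of more than $125 billion in 2023, which implies
# at most this much actual end-user revenue under the same assumptions.
implied_actual_revenue_2023 = revenue_2023 - 125.0

print(revenue_2023, revenue_2024, implied_actual_revenue_2023)
```

Under these assumptions, the required revenue scales linearly with GPU spending, which is why a tripling of capital expenditure turns a $200 billion requirement into a $600 billion one.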
However, the resolution to this equation was not found in training but in inference, a realization that only recently began influencing market pricing and valuation models.
Cerebras Systems' initial public offering on Thursday served as a definitive market signal, drawing demand for 20 times the initial offering amount, with the stock priced at nearly double the final figure announced on Wednesday. The surge in demand was driven not by speculation about a 'next Nvidia killer' but by a recognition that inference, not training, is the true bottleneck. Cerebras' architecture prioritizes extremely fast inference, a capability that resonates with Wall Street given the demand-driven nature of the inference market. J.P. Morgan estimates that the inference market is 10 to 50 times larger than the training market. Unlike training, which is a one-time event, inference is an ongoing process consumed every time an AI agent executes a task. As the industry transitions toward agent-based systems in which machines assign tasks to other machines, demand scales with compute consumption rather than user count, fundamentally altering the economics of the sector.
Nvidia's latest financial reporting reinforces this paradigm shift from the top of the supply chain. On its earnings call, Jensen Huang said that AI demand is growing parabolically with the arrival of agent-based AI, which has evolved from simple inference to logical reasoning and autonomous task scheduling. Huang explicitly noted that 'Tokens are now profitable,' tying computing power directly to revenue and profit. Nvidia accordingly restructured its financial reporting around two distinct platforms: Data Center and Edge Computing. The Data Center segment generated approximately $75 billion in the current quarter, a year-on-year increase of 92%, split between Hyperscale at $38 billion and ACIE at $37 billion.
Meanwhile, the Edge Computing segment, covering endpoints such as PCs, robots, and cars where physical AI operates, reached $6.4 billion, a 29% year-on-year increase. Although Edge Computing represents less than 8% of total revenue, its elevation to a 'second platform' signals a bifurcation of the inference market into cloud and endpoint domains.
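The segment figures above can be cross-checked with a few lines of Python. This is a minimal sketch that assumes total revenue is simply the sum of the two reported platforms, with no other segments; the variable names are illustrative.

```python
# Segment revenue figures quoted in the text, in billions of US dollars.
data_center = {"Hyperscale": 38.0, "ACIE": 37.0}
edge_computing = 6.4

data_center_total = sum(data_center.values())

# ASSUMPTION: total revenue is Data Center plus Edge Computing only.
total_revenue = data_center_total + edge_computing
edge_share = edge_computing / total_revenue

print(f"Data Center total: ${data_center_total:.1f}B")
print(f"Edge Computing share of revenue: {edge_share:.1%}")
```

On these numbers the two Data Center sub-segments sum to the stated $75 billion, and Edge Computing comes to about 7.9% of the combined total, consistent with the "less than 8%" figure.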
The technology roadmap aligns with this strategic pivot. The Vera Rubin chip, shipping in the third quarter, promises up to 35 times the inference throughput of the Blackwell architecture. Huang also outlined a new total addressable market of $200 billion for the Vera CPU, which is designed specifically for agent-based workloads, with leading AI companies expected to adopt it immediately. As major corporations restructure their financial frameworks around 'service tokens,' the debate over where the bottleneck lies has effectively concluded. The focus now shifts to identifying which entities will capture value when inference becomes the primary scarce resource. Woofun AI analysis suggests that value will accrue to those who control the aggregation and routing layers rather than to those who merely own hardware.
The transformation spans both cloud inference, in which data center GPUs are rented out to serve API tokens, and endpoint inference running on local chips. Cloud inference is where the bottleneck is currently most acute, as Anthropic's operational challenges show. Usage far exceeded provisioned capacity, leading to complaints about limited performance and reduced context windows. To address this, Anthropic secured the entire Colossus 1 data center from SpaceX in May 2026, comprising over 220,000 Nvidia GPUs and 300 megawatts of power dedicated solely to inference. The acquisition triggered immediate policy shifts: on May 6, usage limits for Claude Code were doubled, and on May 13, weekly limits increased by 50%. By June 15, Anthropic had separated agent-based usage from flat subscriptions, implementing a metered credit system ranging from $20 to $200 per month. This move underscores that agent consumption outruns what flat-rate plans can cover, necessitating pricing based on actual operating costs.
In this fragmented supply chain, Hyperbolic has emerged as a distinctive player by operating without owning any GPUs. Launched in June 2025, its on-demand GPU marketplace surpassed 200,000 developers within months, serving leading research labs and consumer platforms. Hyperbolic aggregates fragmented capacity from dozens of independent clouds, including CoreWeave, Lambda Labs, and Nebius, creating a standardized pool from which developers can access the cheapest available compute. Woofun AI notes that this asset-light model provides a competitive advantage through real-time visibility into pricing and availability, allowing the firm to detect supply surpluses and demand surges before they affect the broader market. While the company plans eventually to act as a market maker using its own capital, its current value lies in the aggregation layer itself, which connects disparate suppliers with consumers.
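The idea of routing a request to the cheapest available compute in a pooled supply can be illustrated with a short Python sketch. This is not Hyperbolic's actual implementation or API; the `Offer` type, the `cheapest_offer` function, the provider and GPU labels, and all prices and inventories are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    provider: str          # hypothetical supplier label
    gpu_model: str         # hypothetical GPU type label
    price_per_hour: float  # USD per GPU-hour (made-up value)
    gpus_available: int    # GPUs currently free at this supplier


def cheapest_offer(offers: list[Offer], gpu_model: str,
                   gpus_needed: int) -> Optional[Offer]:
    """Return the lowest-priced offer that can fill the request, or None."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.gpus_available >= gpus_needed
    ]
    return min(candidates, key=lambda o: o.price_per_hour, default=None)


# Illustrative placeholder data; not real prices or inventories.
offers = [
    Offer("provider-a", "gpu-x", 2.40, 16),
    Offer("provider-b", "gpu-x", 1.95, 4),
    Offer("provider-c", "gpu-x", 2.10, 32),
]

best = cheapest_offer(offers, "gpu-x", gpus_needed=8)
print(best.provider if best else "no capacity")
```

In this toy data the lowest-priced supplier lacks enough free GPUs for an 8-GPU request, so the request goes to the next cheapest one. A real aggregator would also have to weigh factors such as location, reliability, and contract terms, which this sketch ignores.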
Venice offers a contrasting model focused on privacy within the inference economy. It functions as a routing layer for approximately 75 models, two-thirds of which are open-source, and offers subscription tiers that trigger buybacks of its VVV token. Each of its DIEM tokens represents approximately $1 of daily computing power. Despite reported annual recurring revenue of $70 million, a more realistic assessment places the figure between $6 million and $15 million. The platform has approximately 136,000 active token holders and receives around 9.9 million website visits per month.
However, its business model relies on renting GPUs from partners like NEAR AI Cloud and Phala, meaning its gross profit is the subscription fee minus inference costs. The value proposition is a privacy premium, ensuring data is not stored or used for training, though this protection varies between open-source and closed-source models.
Ultimately, the ecosystem is stratifying into layers in which Hyperbolic acts as the oil refinery and Venice as the gas station. Hyperbolic aggregates and standardizes the fragmented supply that Venice and similar platforms depend upon. Even as agent-based and physical AI technologies multiply demand across cloud and edge sectors, the '$600 billion problem' identified by Cahn may indeed manifest as excess supply.
However, this surplus creates an ideal environment for asset-light companies. When GPU prices fall and supply fragments, firms that do not own depreciating hardware but can route workloads to the most cost-effective locations will capture the margins. Woofun AI assesses that the ultimate winners will be those capable of providing real-time visibility into GPU availability and pricing, positioning Hyperbolic as the definitive aggregation layer for the inference era.