The structural limitations of discrete tokenization are forcing a fundamental re-evaluation of the path to Artificial General Intelligence. In December 2024, Ilya Sutskever declared at NeurIPS that pre-training as we know it will end, while Yann LeCun departed Meta in March 2026 to found AMI Labs, asserting that the large language model approach is fundamentally flawed. The two deep learning pioneers have taken different routes, but their moves point to the same conclusion: current models retain commercial value and growing user penetration, yet the token paradigm faces a hard structural ceiling that prevents true world understanding. Woofun AI analysis suggests this shift is not merely theoretical but is driven by empirical evidence that discrete sequences cannot capture the continuous, high-dimensional nature of human cognition.
In May 2026, research teams led by Kaiming He at MIT and by the Seed team at ByteDance published papers demonstrating that language generation can occur in continuous embedding spaces rather than in discrete token sequences. The ELF project used Flow Matching to evolve noise into target embeddings in just 32 sampling steps, matching the quality of discrete models that require 1024 steps. The approach also cut training data requirements to approximately 45 billion tokens, less than one-tenth of the volume needed by mainstream autoregressive methods. Woofun AI's analysis attributes this efficiency gain to modeling the continuous patterns of sensory cortex activity rather than the lossy compression protocol of human language.
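To make the sampling idea concrete, here is a minimal sketch of Euler integration along a flow-matching path, written in plain Python. It is not the ELF implementation. The velocity field is an exact closed form for a single known target embedding, standing in for the trained network a real model would use, and the three-word embedding table and the names `oracle_velocity`, `euler_sample`, and `nearest_token` are hypothetical, chosen only for illustration. The one detail taken from the text is the 32-step budget.

```python
import random

# Hypothetical 3-dimensional "embedding table" for illustration only.
EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

def oracle_velocity(x, t, target):
    # For the straight-line path x_t = (1 - t) * noise + t * target with a
    # single target point, the exact velocity at (x, t) is (target - x) / (1 - t).
    # A real model would replace this oracle with a learned network.
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def euler_sample(target, steps=32, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x

def nearest_token(x, table):
    # Decode a continuous vector by picking the closest embedding (squared L2).
    return min(table, key=lambda tok: sum((a - b) ** 2 for a, b in zip(x, table[tok])))

sample = euler_sample(EMBEDDINGS["dog"], steps=32)
decoded = nearest_token(sample, EMBEDDINGS)
print(decoded)
```

The number of steps is fixed in advance and does not depend on which token, or how many tokens, the vector stands for. With the oracle field the final step lands on the target up to floating-point error, so the step count here only affects the trajectory, not the endpoint. With a learned field, more steps would generally mean a more accurate integration.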
Four days after the ELF release, the ByteDance Seed team unveiled Cola DLM, which compresses language into a semantic latent space with a text VAE and then applies Flow Matching to model global priors. With only 2 billion parameters, Cola DLM outperformed autoregressive models of similar size and rivaled the 100-billion-parameter LLaDA2.0 across eight benchmarks. The paper explicitly framed the diffusion process as transporting latent priors rather than reconstructing token-level observations, indicating that tokens are not a necessary condition for high-performance language modeling. This result highlights a significant epistemological gap between simulating human symbols and understanding the causal relationships of the physical world.
Major technology firms are rapidly pivoting toward native multimodal unity to capitalize on these continuous-space advantages. Google's Gemini series, which evolved from version 1.0 in December 2023 to 3.1 Pro in 2026, trains text, images, audio, and video within a single model that shares attention layers. The Gemini Embedding 2 model, released in March 2026, unified all inputs into a single 3072-dimensional vector space, effectively eliminating modality boundaries. In contrast, OpenAI's GPT-4V relied on attached visual encoders and projection layers, a circuitous path. Excessive resource consumption reportedly led the company to abandon video generation functions in favor of agent architectures and code tooling.
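What a single shared vector space buys in practice can be sketched with a toy cross-modal lookup. The example below assumes nothing about the Gemini API and calls no real service: the four 4-dimensional vectors are made up by hand (a real model such as the one described above would produce 3072 dimensions), and `ITEMS`, `cosine`, and `nearest` are hypothetical names.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hand-made stand-ins for embeddings of items from different modalities.
ITEMS = {
    ("text", "a dog barking"): [0.9, 0.1, 0.0, 0.1],
    ("audio", "bark.wav"): [0.8, 0.2, 0.1, 0.0],
    ("image", "city_skyline.jpg"): [0.0, 0.1, 0.9, 0.3],
    ("video", "traffic.mp4"): [0.1, 0.0, 0.8, 0.5],
}

def nearest(query_key):
    # Return the most similar other item, regardless of its modality.
    query = ITEMS[query_key]
    others = [key for key in ITEMS if key != query_key]
    return max(others, key=lambda key: cosine(query, ITEMS[key]))

match = nearest(("text", "a dog barking"))
print(match)
```

Because every item lives in the same space, the lookup never branches on modality: a text query can land on an audio clip, and an image on a video, using one similarity function.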
ByteDance is uniquely positioned to validate continuous unified spaces at industrial scale, leveraging its Seedance video generation models and its access to massive TikTok video datasets. The Cola DLM paper explicitly pointed toward unified modeling of discrete text and continuous modalities as the next frontier. Anthropic, by contrast, has focused exclusively on text reasoning and code execution, generating $2.5 billion in annual revenue from Claude Code and achieving an implied valuation of $1.2 trillion in May 2026. Woofun AI notes that this specialization, while commercially successful, risks accumulating technical debt if the industry standard shifts to unified continuous-space understanding within the next two to three years.
The financial and architectural implications of this paradigm shift are profound for the broader ecosystem. Companies specializing in video tokenization, such as those developing VQ-VAE, MAGVIT, and OmniTokenizer, face existential threats as continuous data representation renders discrete encoding obsolete.
Furthermore, the industry's reliance on token-based pricing models is becoming unsustainable: if diffusion models can generate text of any length in a fixed number of steps, computational cost decouples from output length and per-token billing loses its basis. This disruption suggests that intermediate products bridging modalities will lose their business case as native unified architectures become the default standard.
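The pricing argument comes down to simple counting, sketched below under stated assumptions. An autoregressive decoder is assumed to need one forward pass per generated token, while a fixed-step sampler is assumed to refine all positions in parallel for a set number of steps (32 here, borrowed from the ELF figure cited earlier). The function names are hypothetical, and the sketch counts sequential passes only: each pass still has to process every position, so the compute inside a pass continues to grow with length.

```python
def autoregressive_passes(output_tokens):
    # Assumption: one sequential forward pass per generated token.
    return output_tokens

def fixed_step_passes(output_tokens, steps=32):
    # Assumption: all positions are refined in parallel, so the number of
    # sequential passes is the step budget, whatever the output length.
    return steps

for n in (100, 1_000, 10_000):
    print(n, autoregressive_passes(n), fixed_step_passes(n))
```

Under these assumptions the autoregressive count grows linearly with output length while the fixed-step count stays flat, which is the decoupling the paragraph above describes.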
Ultimately, the transition away from tokens represents the first step toward Recursive Self-Improvement, where models learn through active exploration rather than passive data compression. While ELF and Cola DLM demonstrate the efficiency of continuous spaces, their training data remains derived from human-generated content, which is already a compressed representation of reality. LeCun's JEPA approach addresses this by prioritizing the prediction of physical consequences in abstract spaces over realistic output generation. The future of AGI likely depends not on accumulating more data, but on models that can act in the world and learn from the feedback of their own actions, a trajectory that current token-based systems are structurally incapable of supporting.