Woofun AI reports that Clipto.AI launched a desktop product at the end of May that immediately topped global rankings on Product Hunt, introducing a multimodal search tool capable of navigating terabytes of videos, audio, images, and documents via natural language queries. While the immediate function appears to be search, the underlying architecture targets a fundamental infrastructure deficit in the current artificial intelligence landscape: the absence of a persistent user model. Over the past few years, large-scale AI models have achieved unprecedented efficiency in content generation, enabling the creation of code, images, and video, yet this capability has precipitated a paradoxical crisis where the volume of stored data outpaces the ability to reuse it. Knowledge workers including journalists, creators, lawyers, and researchers now face a bottleneck where the time-consuming aspect of their workflow is no longer content creation, but rather the arduous task of sifting through meeting recordings, live broadcast archives, podcast interviews, transcripts, project documents, and screenshots to locate specific information. The core issue identified by Takeo Kanade is not a failure of search algorithms, but a structural lack of a 'Memory Layer' that allows AI agents to truly understand users rather than just the world. Kanade argues that while AI has spent the last decade building comprehensive world models, it remains devoid of user models, rendering agents incapable of maintaining context or understanding individual needs without a dedicated memory system. This evolution reflects a hidden trend in AI technology over the past two decades, shifting from understanding content to generating content, and now necessitating a phase focused on organizing content.
Kanade posits that Clipto is not merely a search utility but a foundational 'Memory Layer' designed to bridge the gap between scattered personal data and the emerging agent ecosystem. He observes that over the last decade, AI has successfully constructed world models, yet it lacks the user models required to transform fragmented data across devices into personalized context that AI can continuously understand and utilize. Without a mechanism for long-term memory, no matter how intelligent an agent becomes, it cannot genuinely comprehend the user, as every interaction resets the context. Search is therefore only the initial step; the ultimate objective of Clipto is to construct the missing Memory Layer essential for the AI era. The solution relies on a fully local multimodal memory construction mechanism: users import local data including videos, audio files, images, and documents, which are then processed using the device's own computing power and a self-developed desktop multimodal large model. This system perceives, analyzes, and structures all files to create a personalized memory system featuring cognitive maps with temporal and spatial alignment. In practical application, users describe their needs in natural language, prompting the desktop large model to analyze intent and context before deploying a local search agent to locate relevant content within seconds, whether that is a specific person, scene, dialogue, or complete event segment.
Woofun AI data shows that this local processing approach avoids the high cost of uploading massive datasets and invoking cloud models, keeps confidential work materials more secure, and remains usable during network interruptions or while working on the move.
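The local flow described above (import files, structure them into time-aligned units, then answer a natural language query on the device) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Segment` schema, the `LocalMemoryIndex` class, and the keyword-overlap ranking are hypothetical stand-ins and do not reflect Clipto's actual model or data structures, which the article does not disclose. A real system would replace the word-overlap score with output from a multimodal model.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Segment:
    """One time-aligned slice of an imported file (hypothetical schema)."""
    path: str       # local file the slice came from
    start_s: float  # start offset in seconds (0.0 for documents)
    end_s: float    # end offset in seconds
    text: str       # transcript, caption, or OCR text for the slice


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class LocalMemoryIndex:
    """In-memory index; nothing leaves the local process."""
    segments: List[Segment] = field(default_factory=list)

    def add(self, segment: Segment) -> None:
        self.segments.append(segment)

    def search(self, query: str, top_k: int = 3) -> List[Segment]:
        """Rank segments by how many query words they share (toy scoring)."""
        query_words = tokenize(query)
        scored = []
        for seg in self.segments:
            overlap = len(query_words & tokenize(seg.text))
            if overlap:
                scored.append((overlap, seg))
        # Stable sort: ties keep their import order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [seg for _, seg in scored[:top_k]]


index = LocalMemoryIndex()
index.add(Segment("meetings/kickoff.mp4", 120.0, 150.0,
                  "budget review for the launch plan"))
index.add(Segment("podcasts/ep12.mp3", 30.0, 60.0,
                  "interview about computer vision"))
index.add(Segment("notes/plan.pdf", 0.0, 0.0,
                  "launch plan timeline and budget"))

results = index.search("where did we discuss the launch budget")
for seg in results:
    print(f"{seg.path} [{seg.start_s:.0f}s-{seg.end_s:.0f}s]: {seg.text}")
```

Storing a time range with each segment is what lets a query return a specific moment inside a long recording rather than just a file name, which is the distinction the article draws between searching for files and searching for memory.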
The architecture of Clipto establishes a critical memory connection between underlying large models and upper-layer agents, allowing users to ask conversational questions over terabytes of private data to receive answers, summaries, analyses, or organized content generated automatically from existing materials. All computations and processing operations occur entirely on the user's local device, a design choice that directly addresses the limitations of previous software solutions which focused primarily on storage rather than true content understanding. The core innovation lies in using local multimodal models to convert diverse media formats into data structures that AI can interpret, enabling a shift from searching for files to searching for memory. Kanade emphasizes that search is merely the first step, with the more critical function being the establishment of a Memory Layer capable of continuously accumulating personalized context. While AI has spent the last decade building knowledge bases about the world, the future requires systems that understand each user's individual knowledge and experiences.
This perspective is deeply rooted in Kanade's professional trajectory, which has spanned nearly twenty years of witnessing key stages of AI development from research to industrialization. His career began in 2004 with an internship at Microsoft Research Asia, a time when deep learning was still years away and AI remained largely a laboratory research topic. One of his early projects involved helping Xbox automatically analyze vast amounts of home photos and videos taken by users to extract key segments from hours of footage and generate short family videos. While this may seem routine today, it addressed fundamental computer vision issues at the time, requiring machines to understand content before they could generate it by identifying who appeared in the footage, what happened, and which scenes were important versus ignorable.
He subsequently pursued a Ph.D. at Carnegie Mellon University under the guidance of the legendary computer vision expert Takeo Kanade, continuing research on image and video understanding to enable robots to comprehend the real world by accumulating visual experience over time. To many, a video is simply a sequence of images, but he views it as a complex information structure encompassing time, people, events, and relationships, where understanding video equates to understanding the real world. In 2017, he founded HuiChuan Intelligence and launched the text-to-video generation platform ZhiYing, coinciding with the rapid growth of the mobile internet and short-video industry. As new content creators flooded the market, the problem shifted from machines being unable to understand content to the low efficiency of content creation, prompting him to shift his technical focus from understanding to generation. Text-to-video generation, intelligent editing, and digital humans became central to ZhiYing's product development, anticipating the later popularity of AIGC. At the end of 2020, ZhiYing was acquired by Tencent, and he joined the company to lead the Tencent ZhiYing team, developing full-stack AIGC products including text-to-image generation, text-to-video generation, and digital humans. Had the industry followed a traditional path, he might have continued investing in generative AI, but the explosive growth of generative capabilities inspired a new line of thinking. As generation became increasingly easy, a new bottleneck emerged: the management of the vast amounts of videos, recordings, and documents created by users. AI had solved content creation but failed to address the understanding of individual content, leading him to realize that before generation, understanding is necessary, and after understanding, memory is essential.
In Kanade's view, the maturation of agents depends on resolving the memory issue first, as today's large-scale AI models, despite their ability to write code, conduct analyses, and generate reports, suffer from an inherent flaw: they do not understand users. Every interaction with a new AI product feels like meeting someone with amnesia, requiring the user to repeatedly explain their identity, current tasks, and past actions, with all context disappearing once the conversation ends. The entire AI infrastructure lacks a crucial ability: a user model. While large-scale models possess almost all publicly available internet knowledge, they cannot truly understand a specific individual because that person's data is not on the internet but scattered across computers, phones, NAS drives, cloud storage, cameras, meeting records, and various local devices. For AI, this information is virtually invisible, a problem that will become increasingly apparent as agents become widely used. When discussing agents, the focus often centers on task completion, but the emergence of millions or even hundreds of millions of agents raises critical questions about how they will understand users, know their past actions, and share personal context. Kanade argues that it is neither practical nor necessary for each agent to rebuild its own user memory; instead, a reasonable approach involves an independent Memory Layer. The 'Living Memory Graph Agent' would handle task execution while the Memory Layer manages users' memories, allowing all agents to understand users based on this unified system. This structure mirrors the operating systems of the internet era, where countless applications rely on a single underlying file system, suggesting the current agent ecosystem needs a similar memory system as common infrastructure.
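The idea of one independent Memory Layer serving many agents, rather than each agent rebuilding its own user memory, can be sketched as follows. This is a minimal illustration under stated assumptions: `MemoryLayer`, `Agent`, and their methods are hypothetical names invented for this example, not an interface the article or Clipto describes.

```python
from typing import Dict, List


class MemoryLayer:
    """Shared per-user context store (hypothetical interface)."""

    def __init__(self) -> None:
        self._facts: Dict[str, List[str]] = {}

    def remember(self, user_id: str, fact: str) -> None:
        self._facts.setdefault(user_id, []).append(fact)

    def recall(self, user_id: str) -> List[str]:
        # Return a copy so callers cannot mutate the shared store.
        return list(self._facts.get(user_id, []))


class Agent:
    """Executes tasks; delegates all user memory to the shared layer."""

    def __init__(self, name: str, memory: MemoryLayer) -> None:
        self.name = name
        self.memory = memory

    def run(self, user_id: str, task: str) -> str:
        context = self.memory.recall(user_id)
        # Record the action so other agents can see it later.
        self.memory.remember(user_id, f"{self.name} handled: {task}")
        return f"{self.name} ran '{task}' (facts recalled: {len(context)})"


memory = MemoryLayer()
writer = Agent("writer", memory)
scheduler = Agent("scheduler", memory)

first = writer.run("u1", "draft report")
second = scheduler.run("u1", "book meeting")
print(first)
print(second)
print(memory.recall("u1"))
```

Because both agents hold a reference to the same `MemoryLayer`, the second agent sees what the first one did without the user repeating it, which mirrors the file-system analogy in the paragraph above: many applications, one shared store.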