In a bold leap for multimodal AI, Meituan's LongCat team has unveiled LongCat Flash Omni, a fully open-source 560-billion-parameter model engineered to process text, images, video, and audio simultaneously, delivering real-time responses that would otherwise require several specialized models. Trained on a distributed GPU cluster, the model ships with both its weights and its training recipe, inviting researchers and developers to experiment, fine-tune, and build on top of a shared multimodal foundation, or to replicate and adapt the architecture for other domains.
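Because the weights are public, getting started is mostly a download. The snippet below is a minimal sketch of pulling an open-weights checkpoint with the Hugging Face Hub client; the repository id meituan-longcat/LongCat-Flash-Omni is an assumed placeholder, so check the team's official release page for the exact location.

```python
# Minimal sketch: fetch an open-weights release from the Hugging Face Hub.
# The repo id is an assumed placeholder; substitute the official identifier.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meituan-longcat/LongCat-Flash-Omni",   # assumption, not confirmed
    allow_patterns=["*.json", "*.safetensors"],     # config files + weight shards
)
print("checkpoint downloaded to", local_dir)
```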
What sets LongCat Flash Omni apart is its selective activation mechanism: although the model contains 560 billion parameters, only about 27 billion are activated for each token. This dynamic routing sharply reduces the compute required per token during inference, keeping latency low despite the model's scale. The architecture pairs a transformer backbone with a cross-modal attention module that fuses high-level audio embeddings, visual frames, and textual tokens into a common latent space. Benchmark results show the model matching or exceeding state-of-the-art performance on multimodal tasks ranging from Visual Question Answering to Audio-Video Retrieval, without sacrificing real-time responsiveness. LongCat Flash Omni was trained on a curated multimodal corpus comprising 50 million image-text pairs, 10 million video-transcript triplets, and 5 million audio-visual embeddings, ensuring balanced coverage across modalities.
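To make the selective-activation idea concrete, here is a generic top-k mixture-of-experts layer in PyTorch. It is not LongCat's routing code, just a sketch of the underlying pattern: a small router scores every expert, each token is dispatched to only the top-k of them, and the remaining expert weights sit idle for that token.

```python
# Generic top-k mixture-of-experts layer (illustrative, not LongCat's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Route each token to k of E experts so only a fraction of the
    layer's parameters participate in any single forward pass."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (num_tokens, d_model)
        scores = self.router(x)                    # (num_tokens, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)        # renormalize over chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # k expert "slots" per token
            chosen = top_idx[:, slot]
            for e in chosen.unique().tolist():     # run each selected expert once
                mask = chosen == e
                out[mask] += gates[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# With 2 of 16 experts per token, roughly 1/8 of the expert parameters are active
# per token: the same proportional trick behind the 27B-of-560B activation figure.
layer = TopKMoE()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```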
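The cross-modal fusion step can be sketched in the same spirit. The toy block below projects audio, visual, and text features into one shared width and lets text tokens attend over the concatenated audio/visual sequence; the dimensions and the single-attention-layer design are illustrative assumptions, not LongCat's actual module.

```python
# Toy cross-modal attention fusion (illustrative, not LongCat's module).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Project audio, visual, and text features into one shared width, then
    let text tokens attend over the concatenated audio/visual sequence."""

    def __init__(self, d_text=512, d_audio=128, d_visual=768, d_model=512, n_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, audio, visual):
        q = self.text_proj(text)                             # (B, T_text, d_model)
        kv = torch.cat(
            [self.audio_proj(audio), self.visual_proj(visual)], dim=1
        )                                                    # (B, T_audio + T_vis, d_model)
        fused, _ = self.attn(q, kv, kv)                      # text queries attend to audio + video
        return self.norm(q + fused)                          # residual in the shared latent space

# Usage: a batch of 2 with 16 text tokens, 50 audio frames, 20 visual patches.
fusion = CrossModalFusion()
out = fusion(torch.randn(2, 16, 512), torch.randn(2, 50, 128), torch.randn(2, 20, 768))
print(out.shape)  # torch.Size([2, 16, 512])
```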
Beyond performance, the open‑source release is a strategic move that could democratize advanced multimodal AI. Developers can fine‑tune LongCat Flash Omni on domain‑specific datasets—such as medical imaging paired with diagnostic reports or surveillance footage with narrative commentary—without the prohibitive costs of training a 560‑billion‑parameter model from scratch. The community can also contribute improvements to the selective activation algorithm, explore new modalities, and integrate the model into edge devices for applications like live captioning, interactive virtual assistants, and immersive AR/VR experiences. Looking ahead, the LongCat team plans to extend the model with additional modalities such as 3D point clouds and sensor data, aiming to support robotics and autonomous systems.
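For readers who want to try domain adaptation without touching all 560 billion parameters, parameter-efficient fine-tuning is the usual route. The sketch below uses the Hugging Face peft library to wrap a checkpoint with LoRA adapters; the repository id, the target_modules names, and the use of AutoModelForCausalLM for an omni-modal model are all assumptions that would need to be checked against the released code.

```python
# Hedged sketch: LoRA fine-tuning of an open-weights checkpoint with peft.
# Repo id, target module names, and the loading class are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

repo = "meituan-longcat/LongCat-Flash-Omni"      # assumed repo id
base = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,                      # custom architectures ship their own code
)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],         # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()               # only the adapter weights are trainable

# From here, train on domain-specific pairs (e.g., images with diagnostic reports)
# using a standard Trainer or a custom training loop.
```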
Want the full story?
Read on MarkTechPost →