LongCat Flash Omni: 560B Omni‑Modal Model for Audio‑Visual Interaction
Meituan’s LongCat team has released LongCat Flash Omni, a 560‑billion‑parameter omni‑modal model that brings together text, image, video, and audio processing in a single architecture. What sets this model apart is its sparse Mixture‑of‑Experts activation strategy: only about 27 billion parameters are actively engaged per token, dramatically reducing the computational burden while preserving the expressive power of a massive network. Built on a transformer‑based backbone that extends the modern cross‑modal attention paradigm, LongCat Flash Omni can ingest and generate multimodal content in real time, making it suitable for applications ranging from live captioning and interactive storytelling to real‑time surveillance and entertainment.
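To make the activation strategy concrete, here is a minimal sketch of generic top‑k Mixture‑of‑Experts routing, the standard mechanism behind "only a fraction of parameters fire per token." All dimensions, expert counts, and the gating function are toy assumptions for illustration; they do not reflect LongCat Flash Omni's actual configuration.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route a token vector `x` to its top-k experts by gate score.

    Only k of the experts run per token; the rest stay idle, which is
    why active parameters are a small fraction of the total.
    """
    logits = x @ gate_w                       # (num_experts,) gating scores
    top_k = np.argsort(logits)[-k:]           # indices of the k best experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()                  # softmax over selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, num_experts = 8, 16                        # toy sizes, not the real model's
gate_w = rng.normal(size=(d, num_experts))
# Each "expert" here is a tiny linear map standing in for a full FFN block.
expert_mats = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda x, m=m: x @ m for m in expert_mats]

out = moe_forward(rng.normal(size=d), gate_w, experts, k=2)
print(out.shape)  # (8,)
```

With 2 of 16 experts active, roughly 1/8 of the expert parameters participate per token; the 27B‑of‑560B figure reported for LongCat Flash Omni reflects the same principle at scale.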
The team leveraged a distributed training pipeline that spans thousands of GPUs, coupled with a novel sparse‑attention mechanism that selectively routes information between modalities. This approach not only keeps inference latency low but also lets the model scale without the quadratic cost of full attention over long multimodal token sequences. Benchmarks on standard multimodal datasets show state‑of‑the‑art performance across vision‑language, audio‑visual, and speech‑language tasks, while the open‑source release invites researchers and developers to fine‑tune the model for niche domains. Importantly, LongCat Flash Omni’s architecture allows for modular plug‑ins, so future modalities—such as haptic or spatial data—can be integrated without a full retrain.
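The idea of sparse attention that "selectively routes information between modalities" can be sketched as masked attention, where each query token is only allowed to attend to a restricted key set. This is a generic illustration under assumed toy shapes and an assumed block mask; the article does not disclose LongCat Flash Omni's actual routing scheme.

```python
import numpy as np

def sparse_attention(q, k, v, mask):
    """Masked scaled dot-product attention; mask[i, j] = True allows i -> j.

    Cost grows with the number of allowed (i, j) pairs rather than the
    full n^2 grid, which is the point of sparse routing.
    """
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)        # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))

# Hypothetical "modality routing" mask: tokens 0-2 (say, text) and 3-5
# (say, audio) attend within their own modality, plus one shared bridge
# token (index 0) visible to everyone.
mask = np.zeros((n, n), dtype=bool)
mask[:3, :3] = True
mask[3:, 3:] = True
mask[:, 0] = True

out = sparse_attention(q, k, v, mask)
print(out.shape)  # (6, 4)
```

Here each token computes scores against at most 4 keys instead of all 6; at the sequence lengths of real multimodal inputs, that gap is what keeps inference latency low.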
The release marks a significant milestone for the AI community, demonstrating that an ultra‑large omni‑modal model can be both efficient and versatile. By making the weights public, Meituan encourages a collaborative ecosystem where academia and industry can experiment with new multimodal paradigms, potentially accelerating breakthroughs in real‑time AI interaction. Whether it’s powering next‑generation virtual assistants or enabling immersive AR experiences, LongCat Flash Omni sets a new benchmark for what is achievable when multimodal intelligence meets practical deployment constraints.
Want the full story?
Read on MarkTechPost →