đź“° LongCat-AudioDiT 2026: State-of-the-Art Diffusion TTS with Zero-Shot Voice Cloning LongCat-AudioD
Source: Mastodon
LongCat‑AudioDiT, unveiled this week by the Finnish startup LongCat AI, pushes text‑to‑speech (TTS) into a new regime by generating audio directly in a latent waveform space with a diffusion transformer. Trained on a diverse multilingual corpus, the model can clone an unseen speaker’s timbre from as little as three seconds of reference audio and produce speech scoring above 0.90 on standard speaker‑similarity benchmarks, a level previously reserved for multi‑hour fine‑tuning pipelines.
The breakthrough stems from a latent diffusion process that iteratively refines a compressed audio representation, eliminating the separate vocoder stage that has long been a bottleneck for both quality and speed. Compared with earlier diffusion‑based TTS systems, LongCat‑AudioDiT reaches comparable fidelity in just eight sampling steps, cutting inference time by roughly 60% while avoiding the flat, unnatural prosody that has plagued earlier zero‑shot attempts.
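The few-step refinement loop described above can be sketched in miniature. This is a hypothetical illustration only: the actual LongCat‑AudioDiT architecture, conditioning interface, and noise schedule are not public, so the `denoiser` below is a stand‑in for the diffusion transformer, and the text/speaker embeddings and DDIM‑style schedule are assumptions, not details from the article.

```python
import numpy as np

def denoiser(z_t, t, text_emb, speaker_emb):
    """Placeholder for the diffusion transformer: predicts the noise in
    z_t given text and speaker conditioning (both assumed, not sourced)."""
    # A real model would run a transformer here; this deterministic
    # pseudo-prediction just keeps the sampling loop runnable.
    return 0.1 * z_t + 0.01 * (text_emb.mean() + speaker_emb.mean())

def sample_latent(text_emb, speaker_emb, latent_shape=(128, 64),
                  num_steps=8, seed=0):
    """Deterministic DDIM-style sampling in num_steps steps
    (eight, per the article's reported figure)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(latent_shape)       # start from pure noise
    # Toy linear alpha schedule from nearly-noise to nearly-clean.
    alphas = np.linspace(0.05, 0.999, num_steps)
    for i in reversed(range(num_steps)):
        eps = denoiser(z, i, text_emb, speaker_emb)
        a = alphas[i]
        # Predict the clean latent, then step toward it.
        z0_hat = (z - np.sqrt(1 - a) * eps) / np.sqrt(a)
        a_prev = alphas[i - 1] if i > 0 else 1.0
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1 - a_prev) * eps
    return z  # a latent decoder would then map this to a waveform

text_emb = np.full((16,), 0.5)     # toy text conditioning vector
speaker_emb = np.full((16,), 0.2)  # e.g. derived from a 3 s reference clip
latent = sample_latent(text_emb, speaker_emb)
print(latent.shape)  # (128, 64)
```

The point of the sketch is structural: because refinement happens in a compact latent space rather than on raw samples, each of the eight steps is cheap, which is where the reported ~60% inference-time reduction would come from.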
The significance is twofold. First, generating high‑fidelity, personalized speech on the fly opens the door to truly bespoke voice assistants, dynamic audiobook narration, and rapid localisation of video content without the costly collection of speaker‑specific data. Second, the latent‑space approach dovetails with recent advances in diffusion transformers, such as the Sparse‑Alignment DiT architecture we covered in our March 30 piece on A‑SelecT, suggesting a broader shift toward more efficient, end‑to‑end generative pipelines across modalities.
Looking ahead, the community will be watching whether LongCat releases the model weights and training code, which could accelerate adoption in open‑source ecosystems like Hugging Face. Benchmarks on the Seed‑TTS‑Eval suite are expected in the coming weeks, and industry players are already hinting at integration trials in automotive infotainment and e‑learning platforms. The race to combine real‑time performance with zero‑shot cloning fidelity is now on, and LongCat‑AudioDiT has set a high bar for the next wave of conversational AI.