MOSS-TTS-Nano: Real-Time Voice AI on CPU, Part of an Open-Source Stack Rivaling Gemini - Firethering


2026-04-15 | Source: Mastodon

MOSS‑TTS‑Nano, a 100‑million‑parameter text‑to‑speech model released by MOSI.AI and the OpenMOSS community, can generate natural‑sounding speech in real time on a standard CPU. The open‑source stack, announced on Firethering, claims speaker‑similarity scores that beat Google’s Gemini 2.5 Pro and ElevenLabs in independent benchmarks, and it can synthesize a voice from a plain text description without any reference recording.

The breakthrough lies in the model’s “deployment‑first” design. At 0.1 billion parameters it fits comfortably in RAM, runs at 48 kHz stereo without GPU acceleration, and supports twenty languages. Installation requires only Conda, Python 3.12+ and a handful of pip packages, making it accessible to developers and hobbyists who lack specialised hardware. By keeping inference on‑device, MOSS‑TTS‑Nano also sidesteps the privacy concerns that accompany cloud‑based services.

The release matters because high‑quality TTS has traditionally been split between two extremes: heavyweight commercial APIs that demand cloud resources, and lightweight open‑source tools that sound robotic. MOSS‑TTS‑Nano collapses that divide, offering a middle ground that could accelerate voice‑enabled applications on edge devices, from Nordic smart‑home assistants to on‑premise customer‑service bots. Its zero‑shot voice‑cloning capability opens the door to rapid prototyping of localized audio content without costly recording sessions, a prospect especially appealing to smaller media firms and educational platforms.

What to watch next is how the community scales the model and integrates it into broader AI pipelines. Early adopters are already testing the stack in multilingual call‑center simulations and real‑time captioning for live events. Follow‑up research will likely compare MOSS‑TTS‑Nano against other open‑source contenders such as Coqui TTS, while MOSI.AI hints at a larger 500 M‑parameter sibling aimed at studio‑grade fidelity. The race to bring studio‑quality voice synthesis to the CPU is now on, and MOSS‑TTS‑Nano has put the Nordic AI scene squarely in the spotlight.
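
The two deployment claims in the article, that a 100‑million‑parameter model fits comfortably in RAM and that it runs in real time, can be checked with simple arithmetic. The sketch below is only illustrative: it assumes weights are stored as 32‑bit floats (the article does not state the precision), it ignores activations and runtime overhead, and `synthesize` is a hypothetical stand‑in for whatever call produces audio samples, not the MOSS‑TTS‑Nano API.

```python
import time
from typing import Callable, Sequence


def weight_memory_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate memory needed for the model weights alone, in MiB.

    bytes_per_param=4 assumes float32 storage; this is an assumption,
    not a figure from the MOSS-TTS-Nano release.
    """
    return num_params * bytes_per_param / (1024 ** 2)


def real_time_factor(synth_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent synthesizing / duration of audio produced.

    A value below 1.0 means audio is generated faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synth_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int = 48_000,  # 48 kHz, as quoted in the article
    channels: int = 2,          # stereo, as quoted in the article
) -> float:
    """Time one call to a hypothetical `synthesize` function.

    `synthesize` is assumed to return interleaved samples, so the audio
    duration is len(samples) / (sample_rate * channels).
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / (sample_rate * channels)
    return real_time_factor(elapsed, audio_seconds)


if __name__ == "__main__":
    # 100 million parameters at 4 bytes each.
    print(f"weights: {weight_memory_mib(100_000_000):.1f} MiB")

    # Dummy synthesizer that returns one second of silent stereo audio.
    def fake_synthesize(text: str) -> list[float]:
        return [0.0] * (48_000 * 2)

    print(f"RTF: {measure_rtf(fake_synthesize, 'hello'):.6f}")
```

Under the float32 assumption, 100 million parameters come to roughly 381 MiB of weights, which is consistent with the claim that the model fits in ordinary RAM. A real measurement would swap `fake_synthesize` for an actual model call and average over several utterances.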
