Gemini 3.1 Flash TTS – with directed prompts

2026-04-16 | Source: HN

Google has added a new layer of control to its Gemini 3.1 Flash TTS model, letting developers steer the voice output with “directed prompts” embedded directly in the text. The feature, announced today, expands the model’s existing support for more than 70 languages and 30 distinct voice personas by allowing inline tags that specify tone, speed, emotion and even speaker identity. The prompts are parsed by the API at inference time, producing audio that matches the stylistic cues the user supplies without needing separate post-processing steps.

The upgrade matters because it turns a high-quality, low-latency text-to-speech engine into a programmable sound generator. Content creators can now generate multilingual podcasts, e-learning modules or interactive voice assistants that adapt their delivery on the fly, while marketers can embed brand-specific vocal traits without hiring voice talent. Google also continues to embed its SynthID watermark in every clip, a safeguard that helps platforms flag AI-generated audio and mitigate deep-fake misuse.

As we reported on 16 April, Gemini 3.1 Flash TTS already impressed with Japanese-language synthesis and emotion control via voice tags. Today’s directed-prompt capability pushes the model from a static voice service toward a dynamic audio authoring tool, narrowing the gap with proprietary solutions from rivals such as Amazon Polly and Microsoft Azure Speech.

What to watch next: Google has opened the preview endpoint (gemini-3.1-flash-tts-preview) to a limited set of developers, and a broader public rollout is expected later this quarter. Integration into the upcoming Gemini AI app for macOS could bring on-device prompt editing, while updates to the SynthID detection framework will be crucial for maintaining trust as the technology spreads across media platforms.
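
For readers who want a concrete picture of the inline-tag idea described at the top of this piece, here is a minimal Python sketch that assembles a request body without sending it anywhere. Only the model name comes from the article. The bracketed tag syntax (`[tone=warm]`), the helper names `directed` and `build_request`, the voice name `Kore`, and the payload layout are all assumptions made for illustration, so check Google's documentation for the actual format before relying on any of it.

```python
import json

# Model name as given in the article; everything else here is an assumption.
MODEL = "gemini-3.1-flash-tts-preview"


def directed(text, **cues):
    """Prefix text with hypothetical inline tags such as [tone=warm]."""
    tags = "".join(f"[{key}={value}]" for key, value in cues.items())
    return f"{tags} {text}" if tags else text


def build_request(segments, voice_name):
    """Assemble a JSON-serialisable request body (assumed shape, not confirmed)."""
    prompt = " ".join(directed(text, **cues) for text, cues in segments)
    return {
        "model": MODEL,
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}
            },
        },
    }


# Each segment carries its own delivery cues, so style can change mid-script.
segments = [
    ("Welcome back to the show.", {"tone": "warm", "speed": "slow"}),
    ("Today's news is big!", {"emotion": "excited"}),
]
request = build_request(segments, voice_name="Kore")  # placeholder voice name
print(json.dumps(request, indent=2))
```

Keeping the cues next to the text they govern is what would let a single script shift tone or pace between sentences, which is the behaviour the article attributes to directed prompts.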
