StepFun Unveils StepAudio 2.5 Realtime, a Breakthrough Real-Time Speech AI Model

2026-05-25 | Source: Mastodon

StepFun has released StepAudio 2.5 Realtime, a real-time speech LLM that takes audio input and returns audio output directly over a WebSocket connection.

StepFun has unveiled StepAudio 2.5 Realtime, an end-to-end real-time speech large language model (LLM). The model takes audio input and produces audio output directly over a WebSocket connection, and it supports both Chinese and English. Using million-scale persona data and roleplay-specific reinforcement learning from human feedback (RLHF), StepAudio 2.5 Realtime achieves stable character consistency.

The release matters because it departs from traditional pipeline systems, which typically chain separate components for speech recognition and text-to-speech synthesis. A unified model can make interactions more seamless and natural, which benefits voice assistants, chatbots, and other conversational AI applications. As we reported on May 25, real-time multimodal AI integration is becoming increasingly important, and StepAudio 2.5 Realtime is a notable step in that direction.

The next thing to watch is how the model is applied in industries such as customer service, education, and entertainment. Further advances in real-time speech LLMs are also likely to drive work on voice-controlled interfaces and emotional intelligence in AI systems.
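To make the audio-in, audio-out WebSocket flow more concrete, here is a minimal Python sketch of how a client might split raw audio into small frames and wrap each one as a text message for a WebSocket. It is not StepFun's API: the event name `audio.append`, the `audio` field, the 16 kHz 16-bit mono format, and the 20 ms frame size are all assumptions chosen for illustration, and the actual network connection is left out.

```python
import base64
import json
from typing import Iterator

# Assumed audio format (hypothetical, not taken from StepFun documentation).
SAMPLE_RATE_HZ = 16_000   # samples per second
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM
FRAME_MS = 20             # duration of each streamed frame


def pcm_frames(pcm: bytes, frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Split raw PCM bytes into fixed-duration frames (the last may be shorter)."""
    frame_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


def audio_event(frame: bytes) -> str:
    """Wrap one frame as a JSON text message (hypothetical schema)."""
    payload = base64.b64encode(frame).decode("ascii")
    return json.dumps({"type": "audio.append", "audio": payload})


def decode_audio_event(message: str) -> bytes:
    """Recover the raw audio bytes from a message in the same schema."""
    return base64.b64decode(json.loads(message)["audio"])


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    messages = [audio_event(f) for f in pcm_frames(one_second_of_silence)]
    print(len(messages))  # 50 frames of 20 ms each
```

In a real client, each message would be sent over an open WebSocket as it is produced, while audio messages coming back from the model are decoded and played concurrently. That overlap is what lets an end-to-end model respond without first waiting for a separate transcription step to finish.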

