Shanghai-based AI firm StepFun has launched StepAudio 2.5 Realtime, an end-to-end real-time speech model that maps audio input directly to audio output without an intermediate text step. The model supports both Chinese and English, and StepFun claims it leads key voice AI benchmarks, outperforming established models such as GPT Realtime 1.5 and Gemini Live. StepFun, previously recognized for its parameter-efficient large language models (LLMs), is applying a similar strategy to voice AI, focusing on customizable, stable personas for extended, immersive interactions.
Key Takeaways
- StepAudio 2.5 Realtime is an end-to-end real-time speech model supporting Chinese and English with customizable personas.
- The model reportedly leads in five voice AI benchmarks, surpassing competitors like GPT Realtime 1.5 and Gemini Live.
- A core innovation is its advanced persona stability, achieved through roleplay-specific Reinforcement Learning from Human Feedback (RLHF) and a large-scale persona dataset.
- StepAudio demonstrates strong paralinguistic comprehension, analyzing vocal cues like emotion and speaking rate directly from audio.
- StepFun aims to enhance long-form roleplay interactions with AI, addressing common “out-of-character” failure modes.
StepFun’s prior success lies in developing text-based LLMs that achieve superior performance with fewer parameters compared to larger rivals. For instance, their Step 3.5 Flash model, with 196 billion parameters, topped several reasoning benchmarks against models with trillions of parameters. This new voice model follows the same philosophy, aiming for high efficiency and effectiveness in generating realistic and engaging voice interactions, particularly for roleplaying applications.
A significant challenge in current AI persona systems is maintaining character consistency. The failure mode, often called “out-of-character” (OOC) behavior, occurs when the AI deviates from its assigned personality, especially under conversational pressure or during prolonged interaction. StepFun asserts that StepAudio 2.5 Realtime overcomes this limitation through roleplay-specific RLHF training, which prioritizes persona stability by fine-tuning the model on human feedback targeted at staying in character across diverse and complex dialogues. Training starts from a substantial dataset of human-authored persona examples, which is then algorithmically expanded into a million-scale feature matrix intended to keep performance robust even in uncommon conversational scenarios.
On the technical side, StepAudio 2.5 Realtime’s paralinguistic comprehension is a noteworthy advancement. The model interprets non-verbal cues embedded in speech, such as speaking rate, emotional tone, and inferred age, directly from the audio stream. This analysis happens before a response is generated, allowing for more nuanced and contextually appropriate outputs. In benchmarks measuring this acoustic feature perception, StepAudio scored 82.18, narrowly ahead of GPT Realtime 1.5 (80.46) and far ahead of Gemini Live (58.05) and DouBao Realtime (16.09).

Furthermore, in evaluations conducted through a mobile app and scored by human raters, StepAudio scored 80.41, compared with 68.01 for GPT Realtime 1.5 and 67.16 for Gemini Live. Objective tests of general dialogue quality also put StepAudio at 86.36, ahead of GPT Realtime 1.5 at 81.60. While these are StepFun’s proprietary benchmarks, the margins, which over GPT Realtime 1.5 range from under two points in paralinguistics to more than twelve in the human evaluation, suggest a meaningful advance in voice AI capabilities.
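To put the reported figures side by side, the short Python sketch below computes StepAudio's lead over each competitor on the three benchmarks quoted above. The scores are copied from the article; the benchmark keys and the `margins` helper are illustrative names chosen here, not part of any StepFun tooling.

```python
# Scores as reported in the article (StepFun's own benchmarks).
# Dictionary keys and the helper name are illustrative assumptions.
LEADER = "StepAudio 2.5 Realtime"

SCORES = {
    "paralinguistic_perception": {
        LEADER: 82.18,
        "GPT Realtime 1.5": 80.46,
        "Gemini Live": 58.05,
        "DouBao Realtime": 16.09,
    },
    "human_evaluation": {
        LEADER: 80.41,
        "GPT Realtime 1.5": 68.01,
        "Gemini Live": 67.16,
    },
    "dialogue_quality": {
        LEADER: 86.36,
        "GPT Realtime 1.5": 81.60,
    },
}


def margins(scores, leader=LEADER):
    """Return the leader's point lead over every other model in one benchmark."""
    lead = scores[leader]
    return {
        name: round(lead - value, 2)
        for name, value in scores.items()
        if name != leader
    }


for benchmark, scores in SCORES.items():
    print(benchmark, margins(scores))
```

On these numbers, the lead over GPT Realtime 1.5 is 1.72 points on paralinguistic perception, 4.76 on dialogue quality, and 12.40 on the human evaluation.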
Founded in April 2023 by Jiang Daxin, a Microsoft veteran with extensive experience on projects like Bing and Cortana, StepFun is positioned as a leading AI startup in China, having secured approximately $1.7 billion in funding. The launch of StepAudio comes at a time when voice AI is rapidly evolving, with OpenAI’s advanced voice mode setting a high standard. StepFun’s direct comparison and claim of superiority point to a competitive landscape in which real-time voice processing and persona fidelity are key differentiators.
The company has introduced Xiao Yue as a flagship AI persona, designed to offer a “soul-level companion” experience that mimics natural human interaction. This persona is fully configurable, allowing for distinct opinions, catchphrases, and emotional boundaries. Developers can leverage StepFun’s API to create their own custom personas, with comprehensive documentation available on their developer platform.
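As a rough illustration of what a configurable persona might look like in code, the sketch below models the three traits mentioned above (opinions, catchphrases, emotional boundaries) as a plain Python data structure and renders them into an instruction string. Everything here is hypothetical: the `PersonaConfig` class, its field names, and the prompt wording are assumptions made for illustration and do not reflect StepFun's actual API or schema, which is described in the company's own developer documentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PersonaConfig:
    """Hypothetical persona definition; NOT StepFun's real schema."""

    name: str
    opinions: List[str] = field(default_factory=list)
    catchphrases: List[str] = field(default_factory=list)
    emotional_boundaries: List[str] = field(default_factory=list)

    def to_instructions(self) -> str:
        """Render the persona as a plain-text instruction block."""
        lines = [f"You are {self.name}. Stay in character for the whole conversation."]
        if self.opinions:
            lines.append("Opinions: " + "; ".join(self.opinions))
        if self.catchphrases:
            lines.append("Catchphrases: " + "; ".join(self.catchphrases))
        if self.emotional_boundaries:
            lines.append("Emotional boundaries: " + "; ".join(self.emotional_boundaries))
        return "\n".join(lines)


# Invented example values, used only to show the shape of the data.
persona = PersonaConfig(
    name="Demo Persona",
    opinions=["prefers tea over coffee"],
    catchphrases=["Let's figure it out together."],
    emotional_boundaries=["does not pretend to be human when asked directly"],
)
instructions = persona.to_instructions()
print(instructions)
```

Keeping the persona as structured data rather than a hand-written prompt makes it easier to validate, version, and reuse the same character across sessions, whatever form the real API expects.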
Long-Term Technological Impact on the Industry
The advancements demonstrated by StepAudio 2.5 Realtime could catalyze a shift in how decentralized applications and AI-powered services are developed and experienced within the broader Web3 ecosystem. Highly responsive, context-aware voice AI addresses the need for more intuitive and accessible user interfaces, moving beyond traditional text-based or click-driven interactions. For blockchain platforms, this could mean enhanced user engagement through voice-controlled dApps, more sophisticated virtual assistants for decentralized finance (DeFi) platforms, or AI-driven NPCs in Web3 gaming environments that maintain consistent personalities.
Processing audio end-to-end without text conversion may also reduce the computational overhead and latency typically associated with multi-stage AI pipelines. That could, in turn, make AI functionality cheaper and more efficient to deploy via decentralized compute networks or alongside Layer 2 scaling solutions.
Furthermore, the focus on robust persona management and paralinguistic understanding sets a new benchmark for AI-driven characters in virtual worlds and metaverses, potentially fostering deeper immersion and more meaningful human-AI interactions. As AI models become more adept at understanding and replicating human nuance in speech, their integration into Web3 could unlock new paradigms for community management, content creation, and personalized user experiences, driving mainstream adoption by making decentralized technologies feel more natural and human-centric.
Details can be found on the website: decrypt.co
