Tongyi Large Model Unveils New-Gen End-to-End Speech Interaction Model

【#Tech24H】On December 23, Tongyi Large Model released Fun-Audio-Chat, its new-generation end-to-end speech interaction model. Beyond holding a conversation, it is positioned as an AI voice companion that understands what you say, perceives your emotions, and can carry out tasks for you. On the technical side, the model's end-to-end S2S (speech-to-speech) architecture generates speech output directly from speech input, with no need to chain separate modules (ASR + LLM + TTS), which gives higher efficiency and lower latency. The Shared LLM layer runs at a 5 Hz frame rate, while SRH generates high-quality speech at a 25 Hz frame rate, an arrangement that reduces GPU computational overhead by nearly 50%. The training data covers real-world scenarios such as audio understanding, voice Q&A, emotion recognition, and tool calling, which keeps the model grounded in practical use.
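
To give a feel for the two frame rates quoted above, here is a minimal sketch of the arithmetic in Python. It only counts frames for an utterance of a given length and shows one way 25 Hz frames could be packed into 5 Hz steps. The function names and the grouping scheme are hypothetical illustrations, not Fun-Audio-Chat's actual implementation, and the reported saving of nearly 50% is the article's figure, not something this arithmetic derives.

```python
# Illustrative arithmetic only; not the model's real code.
LLM_FRAME_RATE_HZ = 5      # Shared LLM layer rate, as quoted in the article
SPEECH_FRAME_RATE_HZ = 25  # SRH speech generation rate, as quoted in the article


def frame_counts(duration_s: float) -> tuple[int, int]:
    """Return (llm_frames, speech_frames) for an utterance of the given length."""
    llm_frames = round(duration_s * LLM_FRAME_RATE_HZ)
    speech_frames = round(duration_s * SPEECH_FRAME_RATE_HZ)
    return llm_frames, speech_frames


def group_frames(frames: list[int], group_size: int) -> list[list[int]]:
    """Hypothetical grouping: pack consecutive fine-rate frames into coarse steps."""
    return [frames[i:i + group_size] for i in range(0, len(frames), group_size)]


if __name__ == "__main__":
    llm_frames, speech_frames = frame_counts(10.0)
    print(llm_frames, speech_frames)  # 50 250

    # Each 5 Hz step would span 25 / 5 = 5 fine-rate frames under this assumption.
    ratio = SPEECH_FRAME_RATE_HZ // LLM_FRAME_RATE_HZ
    groups = group_frames(list(range(speech_frames)), ratio)
    print(len(groups), len(groups[0]))  # 50 5
```

For a 10-second utterance, the backbone would step through 50 positions instead of 250 under these assumptions, which is the intuition behind running the large shared layer at the lower rate and leaving fine-grained speech generation to a separate head.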