
【#Tech24H】On May 7, Xiaomi AI Lab's Next-Generation Kaldi team released OmniVoice, which not only achieves top-tier performance in Chinese and English but also surpasses commercial systems on multilingual tasks. It is the industry's first voice-cloning TTS model to cover hundreds of languages. The model generalizes well to low-resource minority languages and supports speech synthesis for nearly any language worldwide. OmniVoice's most striking breakthrough is its minimalist architecture. Using only a single bidirectional Transformer network, it converts text directly to speech and eliminates redundant structures and steps: no separate text modeling, no complex hybrid architectures, and no multi-level token prediction. It is the simplest non-autoregressive (NAR) TTS model available today. [ By Zhang Liyan | Tang Ruohan ]
