ZONOS2: real-time TTS with 8B params, 900M active, and high-fidelity voice cloning
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Video: https://reddit.com/link/1u4lk5c/video/kyhdw0uog07h1/player
Zyphra has released ZONOS2, its next-generation real-time text-to-speech model focused on expressive, high-fidelity voice cloning. It is open-source under Apache 2.0 and also available on Zyphra Cloud on AMD hardware.

The model is designed to solve the usual TTS tradeoff between quality and speed. Zyphra says ZONOS2 is the first sparse MoE TTS model released open-source, with 8B total parameters and 900M active parameters at inference. The goal is straightforward: fast, efficient, and expressive speech synthesis without the usual compromise pileup.

A major focus is voice cloning. Zyphra claims ZONOS2 is especially strong at capturing the distinctive characteristics of a speaker, producing more natural-sounding clones across a wide range of voices. The cloning is zero-shot, so no fine-tuning is needed.

On the audio side, ZONOS2 predicts Descript Audio Codec (DAC) tokens for 44.1 kHz studio-quality audio. That gives better fidelity, but is harder to model than lower-quality codec setups. Zyphra says it closes that gap by scaling up both the model and the training data.

For text handling, ZONOS2 does not use a phonemizer. Instead, it reads raw UTF-8 bytes, which Zyphra says improves coverage for lower-resource languages, boosts performance on Chinese, Korean, and Japanese, and supports native code-switching mid-sentence.

Training also scaled heavily, from roughly 200K hours to 6M+ hours of audio. Zyphra says it used staged data filtering with increasing transcript-agreement strictness across pretraining, midtraining, and annealing. The intended result is fewer hallucinations, mispronunciations, and repetitions.

Zyphra is also releasing ZTTS1-Eval, a new benchmark for TTS evaluation. It includes clean and in-the-wild datasets across up to 17 languages, with newer evaluation models such as Qwen3-ASR, ReDimNet, and MSR-UTMOS, plus prosody metrics.

That is the gist. Big model, open weights, Apache 2.0, voice cloning, and enough infrastructure behind it to make the old TTS baseline look like scrap metal.
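To make the "raw UTF-8 bytes instead of a phonemizer" point concrete, here is a minimal sketch of what byte-level text input looks like in general. It is not ZONOS2's actual preprocessing code; the function names `text_to_byte_ids` and `byte_ids_to_text` are made up for illustration, and the real model may add special tokens or offsets that are not described in the post.

```python
# Illustrative only: generic byte-level text encoding, not ZONOS2's real pipeline.
# Function names are hypothetical.

def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as a sequence of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Invert text_to_byte_ids by decoding the bytes back to a string."""
    return bytes(ids).decode("utf-8")


# A code-switched sentence: English, Chinese, and Korean in one string.
sample = "Hello, 世界! 안녕"
ids = text_to_byte_ids(sample)

# Every script maps into the same 256-symbol vocabulary, so no
# language-specific phonemizer or word list is involved.
print(len(sample), "characters ->", len(ids), "byte tokens")
print(byte_ids_to_text(ids) == sample)
```

The tradeoff is visible in the counts: CJK and Hangul characters take three bytes each, so byte-level inputs are longer than character-level ones, in exchange for a tiny fixed vocabulary that covers any language.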
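For a sense of what "8B total, 900M active" means in practice, the arithmetic can be sketched as below. The parameter counts are the ones quoted in the post; the "about 2 FLOPs per active parameter per token" figure is a common rule of thumb for transformer forward passes, used here as an assumption rather than anything Zyphra has published.

```python
# Back-of-the-envelope arithmetic using the figures quoted in the post.
TOTAL_PARAMS = 8_000_000_000   # 8B total parameters
ACTIVE_PARAMS = 900_000_000    # 900M active parameters per token

# Fraction of the weights that participate in any single forward step.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Assumption: roughly 2 FLOPs per active parameter per generated token.
FLOPS_PER_PARAM = 2
sparse_flops_per_token = FLOPS_PER_PARAM * ACTIVE_PARAMS
dense_flops_per_token = FLOPS_PER_PARAM * TOTAL_PARAMS

print(f"active fraction: {active_fraction:.2%}")
print(f"compute vs. a dense 8B model: {dense_flops_per_token / sparse_flops_per_token:.1f}x less per token")
```

Note that this only covers compute. All 8B parameters still have to be held in memory, so the memory footprint is that of the full model.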
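The post does not say how "transcript agreement" is measured or what thresholds were used. As a rough sketch of the idea only, the example below assumes agreement is word error rate (WER) between a dataset transcript and an ASR re-transcription, with a threshold that tightens at each stage. The metric, the threshold values, and all names (`word_error_rate`, `STAGE_MAX_WER`, `keep_sample`) are hypothetical.

```python
# Hypothetical sketch of staged filtering; metric and thresholds are assumptions.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up thresholds: later stages tolerate less disagreement.
STAGE_MAX_WER = {"pretraining": 0.30, "midtraining": 0.15, "annealing": 0.05}


def keep_sample(transcript: str, asr_transcript: str, stage: str) -> bool:
    """Keep a sample if its transcript agrees closely enough for this stage."""
    return word_error_rate(transcript, asr_transcript) <= STAGE_MAX_WER[stage]


label = "the quick brown fox jumps over the lazy dog today"
asr = "the quick brown fox jumps over the hazy dog today"  # one wrong word
for stage in STAGE_MAX_WER:
    print(stage, keep_sample(label, asr, stage))
```

With one mismatched word out of ten, this sample would survive the two looser stages and be dropped at the strictest one, which is the general shape of the curriculum the post describes.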