Benchmarked Kokoro 82M vs Supertonic 3 TTS on CPU
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Wanted a real head-to-head on the two TTS models that actually run well on CPU. Couldn't find one with proper numbers, so I ran one. Posting because the result was not what I expected going in.

Quick context for anyone who hasn't seen Supertonic 3 yet: it's a flow-matching TTS where you can dial down inference steps to trade quality for speed. Default is 5 steps, "speed mode" is 2. Kokoro 82M everyone here knows by now.

Hardware: AMD EPYC 7763, 4 vCPUs, 16 GB RAM, no GPU. Roughly comparable to a Ryzen 5600 or a decent N100 box.

Setup: 6 text lengths from 12 to 1712 chars, 5 runs each across 4 configurations (Supertonic at 2 and 5 steps, Kokoro on two backends), 120 timed runs total. CUDA explicitly disabled. Warmup run discarded.

Mean RTF (real-time factor: synthesis time divided by audio duration; lower is faster):
Wall-clock latency on the medium text (196 chars, about 13 seconds of audio):
Details for the Long and Extended texts are in the GitHub repo below.

Throughput in chars per second at steady state: Supertonic 2-step gets to ~111, Supertonic 5-step ~55, Kokoro hovers around 33 to 36 regardless of backend.

The quality side, which actually flips the ranking: Supertonic at 2 steps is fast, but the audio is rough. Words slur, prosody is mechanical, not something I'd ship. At 5 steps it cleans up a lot and is genuinely usable. Kokoro on either backend still produces the most natural speech of anything I've tested in this size class. It's #1 on the TTS Arena leaderboard for a reason.

So the practical ranking is more like:
Two things that surprised me:
The detailed write-up and the GitHub repo with all 24 audio samples and the benchmarks are linked in the comments below 👇

This evaluation of both TTS models was performed using Neo AI Engineer, which built the eval harness, handled model runtime issues, and consolidated results. I reviewed everything manually.

If anyone has an N100 or a Pi 5 lying around and runs this, I'd love to see the numbers. That's the tier I actually want to deploy on.
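For anyone rerunning this on other hardware, here is a minimal sketch of the measurement loop described in the setup: one discarded warmup, 5 timed runs, then mean RTF, mean latency, and chars per second. It is not the harness from the repo. `fake_synthesize`, `benchmark`, and the 24 kHz sample rate are placeholders assumed for illustration, so swap in the real model call and its actual output sample rate.

```python
import statistics
import time
from typing import Callable, Dict, Sequence

# Assumption: placeholder output rate; use whatever your model actually emits.
SAMPLE_RATE = 24_000


def fake_synthesize(text: str) -> Sequence[float]:
    """Hypothetical stand-in for a real TTS call; returns silent samples."""
    # Pretend speech runs at about 15 characters per second.
    n_samples = max(1, int(len(text) / 15 * SAMPLE_RATE))
    return [0.0] * n_samples


def benchmark(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
    runs: int = 5,
    warmup: int = 1,
) -> Dict[str, float]:
    """Time `synthesize` on `text` and report mean RTF, latency, chars/s."""
    if not text:
        raise ValueError("text must be non-empty")

    # Warmup runs are executed but never timed.
    for _ in range(warmup):
        synthesize(text)

    rtfs, latencies = [], []
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(text)
        elapsed = time.perf_counter() - start

        audio_seconds = len(audio) / sample_rate
        rtfs.append(elapsed / audio_seconds)  # RTF below 1.0 is faster than real time
        latencies.append(elapsed)

    mean_latency = statistics.mean(latencies)
    return {
        "mean_rtf": statistics.mean(rtfs),
        "mean_latency_s": mean_latency,
        "chars_per_s": len(text) / mean_latency,
    }


if __name__ == "__main__":
    sample_text = "The quick brown fox jumps over the lazy dog. " * 4
    print(benchmark(fake_synthesize, sample_text, SAMPLE_RATE))
```

Timing only the synthesis call, with `time.perf_counter()`, keeps model loading and file writes out of the RTF. If you post numbers, include the text length and step count so they line up with the configurations above.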