We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| Hey, I work on inference tooling at Mininglamp AI. We needed faster prefill for a 4B VLM running on Apple Silicon. Problem was MLX only does weight-only quant — activations stay FP16 the whole way through. So we wrote Cider, a small SDK that adds W8A8 activation quant on top of MLX. Numbers on M5 Pro (64GB, 307 GB/s), 4516 token context:
Under the hood it's custom Metal kernels we registered as MLX primitives. At M=4096 the per-channel path runs 1.84x faster than W8A16 on the same shape. Not just for our model btw — works with anything that runs through MLX. One catch: INT8 TensorOps only compile on M5 and above. pip install on M4 still works, just falls back to the regular path. Repo: https://github.com/Mininglamp-AI/cider Edit: adding accuracy numbers since it came up. Wikitext2 PPL on Qwen3-8B: FP16 9.73, W8A16 9.71, W8A8 per-channel 9.76. Llama3-8B: FP16 6.14, W8A16 6.15, W8A8 per-channel 6.27. Per-group gs=64 keeps it tighter if precision matters more than speed for your use case. [link] [comments] |
More from r/LocalLLaMA
-
Local benchmarks with a RTX 3090 - Qwen3.6 27b vs Ornith
Jul 2
-
It's officially over. One of the fathers of AI at Nvidia doesn't believe in AGI and compares OpenAI and Anthropic's closed models to AOL and Prodigy's closed internets. Says the future is every business having a customized open source model.
Jul 2
-
6x P40 running Minimax M2.7_Q3_XL
Jul 2
-
Fine-tuned Gemma-4-31B specifically for Copywriting & Creative Writing Tasks (Scored +290 Elo over base using EqBench3)
Jul 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.