You can now convert EXL3 quants on Apple Silicon Macs
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hi, I'm back with an update, and this time it's fairly big news for local LLMs. Normally, high-fidelity quants like EXL3 are CUDA-gated, and if you need 96GB-128GB of memory, the RTX cards that offer it are specialized and expensive. Apple Silicon Macs with 64GB+ of memory, on the other hand, are easy to find. They aren't cheap, but they are available to ordinary buyers. You can now run inference on EXL3 models on macOS and even convert them there. I've done it with MiniCPM5 and Qwen3.6-27B. The mean KLD of MiniCPM5 is on par with the same model converted on an RTX card, and Qwen3.6-27B is just a tiny bit behind.
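For anyone unfamiliar with the metric: mean KLD is the KL divergence between the original model's next-token distribution and the quantized model's, averaged over token positions. Lower is better, and 0 means the two distributions match. Below is a minimal sketch of that calculation in plain Python. The function names and the toy logits are made up for illustration, and this is not the evaluation code from PonyExl3 or ExLlamaV3.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kld(ref_rows, quant_rows):
    """Average the per-position KL divergence over all token positions."""
    if len(ref_rows) != len(quant_rows) or not ref_rows:
        raise ValueError("need the same non-zero number of positions")
    values = [kl_divergence(r, q) for r, q in zip(ref_rows, quant_rows)]
    return sum(values) / len(values)

# Toy logits (hypothetical): two token positions, vocabulary of three tokens.
reference = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
quantized = [[1.9, 0.6, -1.1], [0.0, 0.3, 2.8]]
print(f"mean KLD: {mean_kld(reference, quantized):.6f} nats")
```

In a real comparison the rows would be the full-vocabulary logits from the unquantized model and the quant, collected on the same evaluation text.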
If you don't know EXL3, it's wonderful work from turboderp and co. It gives the best quant quality for its size on a consumer machine, and in general it's about half a bit per weight better than MLX quants.
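To put half a bit per weight into memory terms, here is a rough back-of-the-envelope sketch. It assumes the claim means similar quality at about 0.5 fewer bits per weight, takes 27 billion parameters from the Qwen3.6-27B name, and uses 4.0 bpw purely as an example bitrate. It counts weights only and ignores embeddings kept at higher precision, the KV cache, and runtime overhead.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 27e9  # assumption: parameter count implied by the "27B" in the name

# Size at an example bitrate, and the saving from dropping 0.5 bits per weight.
print(f"4.0 bpw: {weight_size_gb(N_PARAMS, 4.0):.2f} GB")
print(f"3.5 bpw: {weight_size_gb(N_PARAMS, 3.5):.2f} GB")
print(f"saved by 0.5 bpw: {weight_size_gb(N_PARAMS, 0.5):.2f} GB")
```

Under those assumptions the difference comes to a little under 1.7 GB for a model of this size.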
Grab it (Apache 2.0): https://github.com/beamivalice/PonyExl3
Cheers,
Beam