Agents-A1 GGUF quants (35B Qwen3.5-MoE agent model) — NVFP4 for Blackwell + working MTP speculative decoding (up to 1.22× single-user, 91% draft acceptance)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Repo → huggingface.co/LordNeel/Agents-A1-GGUF

I made GGUF quants of InternScience/Agents-A1, a 35B Mixture of Experts agent model (Qwen3.5-MoE, ~3B active, 256 experts / 8+1 active, hybrid linear+full attention, 256K context). It's built for long-horizon search, tool-calling, and scientific/engineering agentic work. The base model's own benchmarks are strong for the ~35B class (their numbers, not mine; see their card).

Two things made this more than a plain quant dump:
Quants + quality (vs BF16)

Quality measured with KL-divergence over top-64 next-token distributions on 32 prompts (more meaningful than my deliberately-small PPL eval). Lower KLD = closer to BF16.
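
For anyone unfamiliar with the metric, here is a minimal sketch of a top-k KL divergence between a reference and a quantized model at one token position. It is not the exact procedure behind the numbers in the repo. It assumes the top-64 set is taken from the BF16 distribution, that both distributions are renormalized over that set, and that tokens missing from the quant's output get a small floor probability. The names `topk_kld`, `mean_kld`, and `floor` are hypothetical.

```python
import math

def topk_kld(ref_logprobs, quant_logprobs, k=64, floor=1e-10):
    """KL(ref || quant) restricted to the reference model's top-k tokens.

    Both arguments map token id -> log-probability for one position.
    Assumption: the top-k set comes from the reference (BF16) model.
    """
    # Pick the k most likely tokens under the reference model.
    top = sorted(ref_logprobs, key=ref_logprobs.get, reverse=True)[:k]
    p = [math.exp(ref_logprobs[t]) for t in top]
    # Tokens the quantized model did not report get a small floor probability.
    q = [math.exp(quant_logprobs[t]) if t in quant_logprobs else floor for t in top]
    # Renormalize so both are proper distributions over the shared support.
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(ref_positions, quant_positions, k=64):
    """Average the per-position KLD over all evaluated positions."""
    vals = [topk_kld(r, q, k) for r, q in zip(ref_positions, quant_positions)]
    return sum(vals) / len(vals)
```

A value of 0 means the quant reproduces the reference distribution exactly on the compared tokens. Larger values mean the quant's next-token probabilities have drifted further from BF16.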
(BF16 reference: 162 gen tok/s. All numbers on a single RTX PRO 6000 Blackwell, full offload.)

Sweet spots:

MTP / speculative decoding

The upstream checkpoint advertises MTP in config but ships no MTP tensors. I grafted in the
So ~1.15–1.22× free throughput on a single stream, depending on how aggressively you set the draft length.

Running it

You need a recent llama.cpp build with MTP support; the flags and the NVFP4 path are documented in the model card.

Caveats
Full metrics, KLD reports, checksums, charts, and the MTP audit are all in the repo. Feedback welcome, especially from anyone running these on non-Blackwell cards.