r/LocalLLaMA · · 2 min read

Agents-A1 GGUF quants (35B Qwen3.5-MoE agent model) — NVFP4 for Blackwell + working MTP speculative decoding (up to 1.22× single-user, 91% draft acceptance)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Agents-A1 GGUF quants (35B Qwen3.5-MoE agent model) — NVFP4 for Blackwell + working MTP speculative decoding (up to 1.22× single-user, 91% draft acceptance)

Repo → huggingface.co/LordNeel/Agents-A1-GGUF

I made GGUF quants of InternScience/Agents-A1 — a 35B Mixture of Experts agent model (Qwen3.5-MoE, ~3B active, 256 experts / 8+1 active, hybrid linear+full attention, 256K context). It's built for long-horizon search, tool-calling, and scientific/engineering agentic work. The base model's own benchmarks are strong for the ~35B class (their numbers, not mine — see their card).

Two things made this more than a plain quant dump:

  • NVFP4 build for Blackwell GPUs
  • MTP (multi-token prediction) grafted in for real speculative decoding, measured.

Text-only. The base is multimodal but I'm not shipping an mmproj, so no vision/video with these files.

Quants + quality (vs BF16)

Quality measured with KL-divergence over top-64 next-token distributions on 32 prompts (more meaningful than my deliberately-small PPL eval). Lower KLD = closer to BF16.

Quant Size Gen tok/s KLD mean Top-1 match
Q3_K_M 16.8 GB 269 0.0655 28/32
IQ4_XS 18.7 GB 258 0.0151 29/32
NVFP4 19.7 GB 265 0.0420 31/32
Q4_K_M 21.2 GB 263 0.1225 27/32
Q5_K_M 24.7 GB 258 0.0091 30/32
Q6_K 28.5 GB 245 0.0049 32/32
Q8_0 36.9 GB 223 0.0053 30/32

(BF16 reference: 162 gen tok/s. All numbers on a single RTX PRO 6000 Blackwell, full offload.)

Sweet spots: IQ4_XS for compact, Q5_K_M/Q6_K for near-BF16. Heads up — Q4_K_M has oddly high KLD despite a good PPL delta, so I'd reach for IQ4_XS or Q5_K_M over it unless you're using the MTP variant.

MTP / speculative decoding

The upstream checkpoint advertises MTP in config but ships no MTP tensors. I grafted in the wang-yang/Agents-A1-MTPLX-Q4 sidecar and converted it through llama.cpp's Qwen3.5-MoE MTP path (MTP block kept at Q6_K). Single-user serving, temperature=0:

Variant Mode tok/s Speedup Draft acceptance
IQ4_XS-MTP target-only 225 1.00×
IQ4_XS-MTP n_max=2 275 1.22× 76.5%
IQ4_XS-MTP n_max=1 260 1.16× 86.5%
Q4_K_M-MTP n_max=1 265 1.15× 91.5%
Q4_K_M-MTP n_max=2 274 1.19× 77.2%

So ~1.15–1.22× free throughput on a single stream depending on how aggressive you set the draft length.

Running it

You need a recent llama.cpp build with qwen35moe support (NVFP4/MTP need newer builds still).

hf download LordNeel/Agents-A1-GGUF agents-a1-IQ4_XS.gguf --local-dir ./agents-a1 llama-server -m ./agents-a1/agents-a1-IQ4_XS.gguf -ngl 99 -c 8192 -b 4096 -ub 512 --flash-attn on 

MTP flags and the NVFP4 path are documented in the model card.

Caveats

  • Text-only (no mmproj).
  • NVFP4 needs a Blackwell GPU + FP4-capable build (BLACKWELL_NATIVE_FP4 = 1).
  • PPL eval is small/directional — trust the KLD numbers more.
  • MTP weights are grafted from a separate sidecar, not native to the original release.

Full metrics, KLD reports, checksums, charts, and the MTP audit are all in the repo. Feedback welcome, especially from anyone running these on non-Blackwell cards.

https://preview.redd.it/xm9r1q48ahah1.png?width=1776&format=png&auto=webp&s=16fffe8d9f460584429298a42c1c68ac336ea206

https://preview.redd.it/td59qp48ahah1.png?width=1622&format=png&auto=webp&s=514828c8eb7cfe8d9ed7b7aa5a4dd7959fd7f33b

https://preview.redd.it/e6m3br48ahah1.png?width=1626&format=png&auto=webp&s=ac8ffd4b93f048f4e4df28cab6ba9ce591a9dab3

https://preview.redd.it/5o68bq48ahah1.png?width=1701&format=png&auto=webp&s=e7696771c8e4176767477ef0d4bf3997eb0304e3

https://preview.redd.it/29z6cq48ahah1.png?width=1626&format=png&auto=webp&s=2a2398d9a81879d81ca34d566bd36e7a882c77d4

submitted by /u/Blahblahblakha
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA