r/LocalLLaMA · June 30, 2026 · 2 min read

Agents-A1 GGUF quants (35B Qwen3.5-MoE agent model) — NVFP4 for Blackwell + working MTP speculative decoding (up to 1.22× single-user, 91% draft acceptance)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Agents-A1 GGUF quants (35B Qwen3.5-MoE agent model) — NVFP4 for Blackwell + working MTP speculative decoding (up to 1.22× single-user, 91% draft acceptance)

Repo → huggingface.co/LordNeel/Agents-A1-GGUF

I made GGUF quants of InternScience/Agents-A1 — a 35B Mixture of Experts agent model (Qwen3.5-MoE, ~3B active, 256 experts / 8+1 active, hybrid linear+full attention, 256K context). It's built for long-horizon search, tool-calling, and scientific/engineering agentic work. The base model's own benchmarks are strong for the ~35B class (their numbers, not mine — see their card).

Two things made this more than a plain quant dump:

NVFP4 build for Blackwell GPUs
MTP (multi-token prediction) grafted in for real speculative decoding, measured.

Text-only. The base is multimodal but I'm not shipping an mmproj, so no vision/video with these files.

Quants + quality (vs BF16)

Quality measured with KL-divergence over top-64 next-token distributions on 32 prompts (more meaningful than my deliberately-small PPL eval). Lower KLD = closer to BF16.

Quant	Size	Gen tok/s	KLD mean	Top-1 match
Q3_K_M	16.8 GB	269	0.0655	28/32
IQ4_XS	18.7 GB	258	0.0151	29/32
NVFP4	19.7 GB	265	0.0420	31/32
Q4_K_M	21.2 GB	263	0.1225	27/32
Q5_K_M	24.7 GB	258	0.0091	30/32
Q6_K	28.5 GB	245	0.0049	32/32
Q8_0	36.9 GB	223	0.0053	30/32

(BF16 reference: 162 gen tok/s. All numbers on a single RTX PRO 6000 Blackwell, full offload.)

Sweet spots: IQ4_XS for compact, Q5_K_M/Q6_K for near-BF16. Heads up — Q4_K_M has oddly high KLD despite a good PPL delta, so I'd reach for IQ4_XS or Q5_K_M over it unless you're using the MTP variant.

MTP / speculative decoding

The upstream checkpoint advertises MTP in config but ships no MTP tensors. I grafted in the wang-yang/Agents-A1-MTPLX-Q4 sidecar and converted it through llama.cpp's Qwen3.5-MoE MTP path (MTP block kept at Q6_K). Single-user serving, temperature=0:

Variant	Mode	tok/s	Speedup	Draft acceptance
IQ4_XS-MTP	target-only	225	1.00×	—
IQ4_XS-MTP	n_max=2	275	1.22×	76.5%
IQ4_XS-MTP	n_max=1	260	1.16×	86.5%
Q4_K_M-MTP	n_max=1	265	1.15×	91.5%
Q4_K_M-MTP	n_max=2	274	1.19×	77.2%

So ~1.15–1.22× free throughput on a single stream depending on how aggressive you set the draft length.

Running it

You need a recent llama.cpp build with qwen35moe support (NVFP4/MTP need newer builds still).

hf download LordNeel/Agents-A1-GGUF agents-a1-IQ4_XS.gguf --local-dir ./agents-a1 llama-server -m ./agents-a1/agents-a1-IQ4_XS.gguf -ngl 99 -c 8192 -b 4096 -ub 512 --flash-attn on

MTP flags and the NVFP4 path are documented in the model card.

Caveats

Text-only (no mmproj).
NVFP4 needs a Blackwell GPU + FP4-capable build (BLACKWELL_NATIVE_FP4 = 1).
PPL eval is small/directional — trust the KLD numbers more.
MTP weights are grafted from a separate sidecar, not native to the original release.

Full metrics, KLD reports, checksums, charts, and the MTP audit are all in the repo. Feedback welcome, especially from anyone running these on non-Blackwell cards.

https://preview.redd.it/xm9r1q48ahah1.png?width=1776&format=png&auto=webp&s=16fffe8d9f460584429298a42c1c68ac336ea206

https://preview.redd.it/td59qp48ahah1.png?width=1622&format=png&auto=webp&s=514828c8eb7cfe8d9ed7b7aa5a4dd7959fd7f33b

https://preview.redd.it/e6m3br48ahah1.png?width=1626&format=png&auto=webp&s=ac8ffd4b93f048f4e4df28cab6ba9ce591a9dab3

https://preview.redd.it/5o68bq48ahah1.png?width=1701&format=png&auto=webp&s=e7696771c8e4176767477ef0d4bf3997eb0304e3

https://preview.redd.it/29z6cq48ahah1.png?width=1626&format=png&auto=webp&s=2a2398d9a81879d81ca34d566bd36e7a882c77d4

submitted by /u/Blahblahblakha
[link] [comments]