Why do we benchmark quants on perplexity and prose but never on tool call validity?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
The mixed-precision quant discussion here lately (the MoE-aware stuff that keeps shared experts and the edge layers at higher precision) is great, but it's almost all measured against perplexity and general output quality. What I never see measured is structured output: tool call JSON, function schemas, constrained formats.
My intuition, and I'd like to be wrong, is that those degrade earlier than prose does. A model at Q4_K_M can still write a perfectly readable paragraph while quietly producing JSON that's a brace short or hallucinating a field name. Prose has a lot of valid continuations at each token. A schema has very few. So the same quant error that's invisible in text is fatal in a tool call.
If that holds, then for agentic use you can't quantize as aggressively as the perplexity charts suggest, and a lot of people are picking quants on the wrong metric. Has anyone benchmarked the acceptance rate of valid tool calls across quant levels on one model? Not perplexity. Just: did the JSON parse?
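To make the metric concrete, here's a minimal sketch of the check I have in mind, stdlib only. Everything in it is an assumption for illustration: the `WEATHER_TOOL` schema, the function names, and the sample strings, which are hand-written stand-ins and not real model outputs. In a real run you'd collect raw outputs from the same prompts at each quant level and feed them through the same function.

```python
import json

# Hypothetical tool schema for illustration: argument name -> expected Python type.
WEATHER_TOOL = {"city": str, "days": int}

def classify_tool_call(raw, schema):
    """Return 'ok', 'parse_error', or 'schema_error' for one raw output string."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "parse_error"  # e.g. a missing closing brace
    # Must be an object with exactly the expected field names.
    if not isinstance(call, dict) or set(call) != set(schema):
        return "schema_error"  # e.g. a hallucinated or missing field
    # Strict type check, so a JSON boolean does not pass as an integer.
    for name, expected in schema.items():
        if type(call[name]) is not expected:
            return "schema_error"
    return "ok"

def acceptance_rate(outputs, schema):
    """Count each outcome and report the fraction of fully valid calls."""
    counts = {"ok": 0, "parse_error": 0, "schema_error": 0}
    for raw in outputs:
        counts[classify_tool_call(raw, schema)] += 1
    total = len(outputs)
    counts["rate"] = counts["ok"] / total if total else 0.0
    return counts

# Hand-written stand-ins, NOT real model outputs.
samples = [
    '{"city": "Oslo", "days": 3}',      # valid
    '{"city": "Oslo", "days": 3',       # a brace short
    '{"city": "Oslo", "num_days": 3}',  # wrong field name
]
report = acceptance_rate(samples, WEATHER_TOOL)
print(report)
```

Splitting parse errors from schema errors matters here, because a missing brace and a made-up field name are different failure modes and might not show up at the same quant level.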