The harness matters more than the model. A 27B behind good critics changed my mind.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I saw someone test Qwen3.6-27B with a 3-critic harness: code review, test review, and Playwright e2e. Each critic had context. The result was that the model was usable for coding work. This matches what I have come to believe from running agents in production: the harness around the model is more important than the model itself.
A smaller model makes more mistakes. That is real and expected. But a good critic pipeline catches the extra mistakes, and suddenly the gap between a 27B model and a frontier model gets smaller. The reliability comes from the process, not the model size.
The mistake I see teams make is that they focus on model selection and prompt-tuning but do not verify the results. Then they blame the model when it is flaky. The model is not your reliability layer. The harness is.
Fresh context per critic is important. A reviewer that has not seen the code catches things a self-review never will.
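To make the idea concrete, here is a minimal Python sketch of a critic loop. It is not the harness from the test mentioned above. The names `run_harness`, `Verdict`, `Critic`, and `Generator` are hypothetical, and the model calls are stood in by plain callables. The point it illustrates is that each critic receives only the task and the candidate output, never the generator's conversation history, so every review starts from a fresh context.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    critic: str         # which critic produced this verdict
    passed: bool        # True if the critic found nothing blocking
    feedback: str = ""  # what to fix, if anything


# Hypothetical signatures: a critic sees only (task, candidate).
# It gets no generator history, which is the "fresh context" part.
Critic = Callable[[str, str], Verdict]

# The generator sees the task plus the failed verdicts from the last round.
Generator = Callable[[str, List[Verdict]], str]


def run_harness(
    task: str,
    generate: Generator,
    critics: List[Critic],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate that every critic passes, or None."""
    failures: List[Verdict] = []
    for _ in range(max_rounds):
        candidate = generate(task, failures)
        # Every critic is called independently with the same two inputs.
        verdicts = [critic(task, candidate) for critic in critics]
        failures = [v for v in verdicts if not v.passed]
        if not failures:
            return candidate
    # Out of rounds: report failure instead of shipping an unverified result.
    return None
```

In a real setup, the three critics would wrap something like a code-review prompt, a test-review prompt, and an e2e run, each started in a new session. Returning `None` when the rounds run out is a design choice for this sketch: the harness refuses to pass along output that no critic has approved.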
For people running models in production, I ask: where does your reliability come from? Is it the model, or the scaffolding around it?