r/LocalLLaMA · June 30, 2026 · 1 min read

The harness matters more than the model. A 27B behind good critics changed my mind.

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I saw someone test Qwen3.6-27B with a 3-critic harness: code review, test review, and Playwright e2e. Each critic had context. The result was that the model is usable for coding work. This matches what I have come to believe from running agents in production: the harness around the model is more important than the model itself.

A smaller model makes more mistakes. That is real and expected. A good critic pipeline catches the extra mistakes, and suddenly the gap between a 27B model and a frontier model gets smaller. The reliability comes from the process, not the model size.
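To make the arithmetic behind that concrete, here is a back-of-the-envelope sketch. It assumes each critic misses errors independently of the others, which real critics will not do perfectly, and it ignores false positives. The function name and any numbers you plug in are hypothetical, not measurements from the test mentioned above.

```python
from math import prod
from typing import Sequence


def residual_error_rate(model_error_rate: float, catch_rates: Sequence[float]) -> float:
    """Fraction of outputs that are wrong AND slip past every critic.

    Assumption: critics miss errors independently, so the chance that an
    error survives all of them is the product of their miss rates.
    """
    miss_all = prod(1.0 - c for c in catch_rates)
    return model_error_rate * miss_all


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    critics = [0.6, 0.6, 0.6]  # each critic catches 60% of the errors it sees
    for label, err in [("small model", 0.30), ("large model", 0.10)]:
        print(label, "raw:", err, "after critics:", residual_error_rate(err, critics))
```

Under this simple model the relative gap between the two raw error rates does not change, but the absolute gap shrinks by the same factor as the error rates themselves, which is what matters when you are counting broken outputs that reach you.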

The mistake I see teams make is that they focus on model selection and prompt-tuning but do not verify the results. Then they blame the model when it is flaky. The model is not your reliability layer. The harness is.

Fresh context per critic is important. A reviewer that has not seen the code catches things a self-review never will.
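A minimal sketch of what a multi-critic loop with fresh context per critic could look like. This is not the harness from the test I mentioned; `Verdict`, `make_critic`, and `run_harness` are hypothetical names, and the `generate` and `check` callables stand in for whatever you actually run (a model call, a test runner, a Playwright suite).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    critic: str
    passed: bool
    notes: str


# A critic takes only the candidate output and returns a Verdict.
Critic = Callable[[str], Verdict]


def make_critic(name: str, check: Callable[[str], List[str]]) -> Critic:
    """Wrap a check function (hypothetical: model call, test run, e2e run)."""
    def run(candidate: str) -> Verdict:
        # Fresh context: the critic sees the candidate only, never the
        # generator's history or the other critics' verdicts.
        problems = check(candidate)
        return Verdict(name, not problems, "; ".join(problems))
    return run


def run_harness(
    generate: Callable[[str], str],
    critics: List[Critic],
    task: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Verdict]]:
    """Generate, run every critic, and retry with their notes until all pass."""
    feedback = ""
    verdicts: List[Verdict] = []
    for _ in range(max_rounds):
        candidate = generate(task + feedback)
        verdicts = [critic(candidate) for critic in critics]
        failures = [v for v in verdicts if not v.passed]
        if not failures:
            return candidate, verdicts
        # Only the critics' notes go back to the generator.
        feedback = "\nFix: " + " | ".join(f"{v.critic}: {v.notes}" for v in failures)
    # Nothing passed within the budget: report failure instead of shipping it.
    return None, verdicts
```

The point of the design is that a candidate is accepted only when every critic passes, and a run that never passes returns `None` rather than a best guess.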

For people running models in production, I ask: where does your reliability come from? Is it the model or the scaffolding around it?

submitted by /u/recro69
