Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| I wanted to know how much of a coding agent's performance came from the model and how much came from the harness, so I vibed a setup to allow me to test multiple agentic harnesses/model combinations on the same task. ALl the images above all come from the same model, but with a different harness. Still working on getting automated/metric evaluation instead of subjective opinion. Things I noticed not present in the images:
[link] [comments] |
More from r/LocalLLaMA
-
What's in your RAG?
Jul 2
-
Palantir CEO rages against closed models
Jul 2
-
A cheap trick for reliable structured output: feed the validation error back into the retry
Jul 2
-
SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing.
Jul 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.