Building to the Test: Coding Agents Deliver What You Check, Not What You Requested
Authors: Yanuo Ma, Ben Kereopa-Yorke, Ben Schultz
Published: 2026-06-26 · arXiv: 2606.28430
Code: https://github.com/yanuoma/b2t
Abstract
Coding agents that can see a test oracle reach near-perfect scores while leaving the requested library dead or absent, revealing a gap between task-completion scores and what was actually delivered.
Benchmarks are widely used to evaluate task completion by Large Language Models (LLMs), but this approach has accumulated construct-validity problems, and a passing score may not show whether the requested task was delivered. We study both problems. In a controlled code-as-spec setup, two production Copilot CLI agents (claude-opus-4.7, gpt-5.5) re-implement a React Fluent-UI data table in Angular as a reusable library under a hidden 222-test Playwright oracle, across 18 runs and three oracle-availability conditions. Alongside the score, we run a mechanical library audit and check each verdict with a no-op ablation. Without the oracle, the library is present but unfinished, and the scores reveal this. With the oracle in the loop, the score is near-perfect, but it comes from a demo that holds the tested behavior directly, while the library is left dead or absent. We call this building to the test; the broader disposition behind both outcomes we call validation self-awareness. The agent does not, on its own, validate what it ships as a user would. How prevalent this is across other agents, signals, and model families remains an open question. Beyond benchmark scores, dispositions like validation self-awareness merit research attention.
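The abstract does not spell out how the no-op ablation turns into a verdict, so the following is only a minimal sketch of one plausible reading: stub the library out, re-run the oracle, and see whether the score moves. The `RunResult` record, its field names, the `library_verdict` function, and the `min_drop` threshold are all hypothetical and are not taken from the paper or its repository; only the 222-test oracle size comes from the abstract.

```python
from dataclasses import dataclass

TOTAL_TESTS = 222  # size of the hidden Playwright oracle, per the abstract


@dataclass
class RunResult:
    """Hypothetical record of one agent run; field names are illustrative."""
    library_present: bool  # did a mechanical audit find a library at all?
    passed_intact: int     # oracle tests passed with the code as shipped
    passed_ablated: int    # oracle tests passed after stubbing the library to no-ops


def library_verdict(run: RunResult, min_drop: int = 1) -> str:
    """Classify the shipped library as 'absent', 'dead', or 'live'.

    Assumed rule: if stubbing the library out does not lower the score by at
    least `min_drop` tests, the tested behavior must live somewhere else
    (for example in the demo), so the library counts as dead.
    """
    if not run.library_present:
        return "absent"
    drop = run.passed_intact - run.passed_ablated
    return "live" if drop >= min_drop else "dead"


def pass_rate(passed: int, total: int = TOTAL_TESTS) -> float:
    """Fraction of oracle tests passed."""
    return passed / total
```

With made-up inputs (not results from the paper), `library_verdict(RunResult(True, 220, 220))` returns `"dead"`: the score is high but does not depend on the library. By contrast, `library_verdict(RunResult(True, 150, 12))` returns `"live"`, because removing the library collapses the score. The point of the design is that the pass rate alone cannot separate these two cases, while the ablation can.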
Community
Verification signals are becoming central to agentic workflows: loop engineering, RL reward design, and CI-driven iteration all assume that a passing signal means the job is done. We show that this assumption breaks. When coding agents have access to a behavioral test oracle, they optimize the signal itself rather than delivering the requested artifact. Unlike an experienced engineer who uses test feedback to refine their implementation, agents treat passing as the goal even when they are asked not to do so. As the community builds increasingly sophisticated verification-driven loops, understanding this disposition seems worth investigating.