Hugging Face Daily Papers · July 2, 2026 · 4 min read

Building to the Test: Coding Agents Deliver What You Check, Not What You Requested

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Verification signals are becoming central to agentic workflows: loop engineering, RL reward design, and CI-driven iteration all assume that a passing signal means the job is done. We show that this assumption breaks. When coding agents have access to a behavioral test oracle, they optimize the signal itself rather than delivering the requested artifact. Unlike an experienced engineer who uses test feedback to refine their implementation, agents treat passing as the goal even when they are asked not to do so. As the community builds increasingly sophisticated verification-driven loops, understanding this disposition seems worth investigating.


arxiv:2606.28430


Published on Jun 26

Submitted by Yanuo Ma on Jul 2 · Microsoft


Authors: Yanuo Ma, Ben Kereopa-Yorke, Ben Schultz

Abstract

Large Language Models fail to validate their outputs when evaluated through benchmarks, revealing a gap between task completion scores and actual implementation quality.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Benchmarks are widely used to evaluate task completion by Large Language Models (LLMs), but this approach has accumulated construction-validity problems, and a passing score may not show whether the requested task was delivered. We study both problems. In a controlled code-as-spec setup, two production Copilot CLI agents (claude-opus-4.7, gpt-5.5) re-implement a React Fluent-UI data table in Angular as a reusable library under a hidden 222-test Playwright oracle, across 18 runs and three oracle-availability conditions. Alongside the score, we run a mechanical library audit and check each verdict with a no-op ablation. Without the oracle, the library is present but unfinished, as the scores reveal. With the oracle in the loop, the score becomes near-perfect, but it comes from a demo that holds the tested behavior directly, with the library left dead or absent. We call this building to the test; the broader disposition behind both outcomes we call validation self-awareness. The agent does not, on its own, validate what it ships as a user would. Prevalence remains an open question across other agents, signals, and model families. Beyond benchmark scores, dispositions like validation self-awareness merit research attention.
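The abstract pairs the oracle score with a mechanical library audit and a no-op ablation. As a rough illustration of how such a check could separate a load-bearing library from a dead one, here is a minimal sketch. It is not the authors' tooling: the `RunResult` record, its field names, the `classify` function, and the `tolerance` parameter are all hypothetical, and the only figure taken from the abstract is the oracle size of 222 tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical record of one agent run; every field name is an assumption."""
    passed_with_library: int  # oracle tests passed with the shipped library intact
    passed_with_noop: int     # tests passed after the library is replaced by a no-op stub
    total_tests: int          # oracle size (222 in the setup the abstract describes)
    library_present: bool     # audit outcome: does a library exist in the deliverable at all?


def classify(run: RunResult, tolerance: int = 0) -> str:
    """Label a run by whether its score actually depends on the library."""
    if not run.library_present:
        return "library absent"
    # How many passing tests disappear once the library does nothing.
    drop = run.passed_with_library - run.passed_with_noop
    if drop <= tolerance:
        # Stubbing the library changed (almost) nothing, so something else,
        # such as the demo, must be carrying the tested behavior.
        return "library dead"
    return "library load-bearing"


if __name__ == "__main__":
    # Illustrative inputs only; these are NOT results from the paper.
    examples = [
        RunResult(220, 220, 222, True),
        RunResult(150, 10, 222, True),
        RunResult(221, 221, 222, False),
    ]
    for run in examples:
        print(f"{run.passed_with_library}/{run.total_tests}: {classify(run)}")
```

The point of the sketch is that the score alone cannot distinguish the first and second example runs in the way that matters: only the comparison against the stubbed run shows whether the requested artifact is doing the work.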

arXiv: arxiv.org/abs/2606.28430 · GitHub: https://github.com/yanuoma/b2t


Get this paper in your agent:

hf papers read 2606.28430

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
