Hugging Face Daily Papers · 4 min read

TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.28480

Published on Jun 26, 2026 · Submitted by taesiri on Jun 30, 2026

Authors: Shoufa Chen, Luyuan Wang, Xuan Yang, Zhiheng Liu, Yuren Cong, Yuanfeng Ji, Feiyan Zhou, Xiaohui Zhang, Fanny Yang, Belinda Zeng

Organization: AI at Meta

Code: https://github.com/facebookresearch/TUA-Bench

Abstract

TUA-Bench presents a comprehensive benchmark for evaluating general-purpose terminal-use agents across diverse digital activities and specialized workflows, revealing significant performance gaps among current frontier agents.

As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically native to the shell. We introduce TUA-Bench, a general-purpose benchmark for terminal-use agents. TUA-Bench includes 120 real-world tasks across five task families, covering routine digital activities (including document editing, email management, and live-web information seeking) as well as scientific and engineering workflows co-designed with PhD-level domain experts that require specialized software. This breadth distinguishes TUA-Bench from prior shell-focused or domain-specific benchmarks. Each task is manually designed, runs in a real terminal with a deterministic setup script, and is evaluated by an execution-based scoring protocol. We find that the strongest frontier agent, Claude Code with Claude Opus 4.8 at max reasoning effort, achieves 65.8% overall performance, with substantial gaps across both tracks. By providing a broad and realistic evaluation of terminal-use capabilities, TUA-Bench aims to accelerate the transition from narrow, task-specific assistants to general-purpose agents capable of operating reliably across diverse digital environments.
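The abstract describes each task as a deterministic setup script followed by execution-based scoring. The sketch below illustrates that general pattern only; it is not TUA-Bench's harness. The `Task` fields, the `run_task` and `overall_score` helpers, the demo task, and the assumption of binary pass/fail scoring are all hypothetical, and a POSIX `sh` is assumed to be available.

```python
import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative, not TUA-Bench's schema."""
    name: str
    setup: str  # shell commands that build the initial state deterministically
    check: str  # shell commands whose exit status decides pass/fail


def run_task(task: Task, agent_cmd: str) -> bool:
    """Run setup, then the agent, then the checker inside a fresh directory."""
    with tempfile.TemporaryDirectory() as workdir:
        # Setup must succeed; otherwise the task state is not the intended one.
        subprocess.run(["sh", "-c", task.setup], cwd=workdir, check=True)
        # The agent may fail; only the final state of the directory is scored.
        subprocess.run(["sh", "-c", agent_cmd], cwd=workdir, check=False)
        result = subprocess.run(["sh", "-c", task.check], cwd=workdir)
        return result.returncode == 0


def overall_score(results: list[bool]) -> float:
    """Percentage of tasks passed, assuming binary per-task scores."""
    return 100.0 * sum(results) / len(results) if results else 0.0


demo = Task(
    name="uppercase-greeting",
    setup="printf 'hello' > input.txt",
    check="test \"$(cat output.txt)\" = 'HELLO'",
)

# Stand-in "agents": fixed shell commands instead of a real model-driven agent.
good_agent = "tr a-z A-Z < input.txt > output.txt"
bad_agent = "cp input.txt output.txt"

results = [run_task(demo, good_agent), run_task(demo, bad_agent)]
print(results, overall_score(results))
```

Scoring the final state of the working directory, rather than the agent's transcript, is what makes the check execution-based in this sketch. If scoring were binary, 79 of 120 tasks passing would round to the reported 65.8%; the abstract does not say whether the protocol is binary or awards partial credit, so that reading is only one possibility.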

Community

Visit our website at https://tuabench.ai/ for more details.


Get this paper in your agent:

hf papers read 2606.28480
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
