Hugging Face Daily Papers · June 30, 2026 · 4 min read

TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Visit our website at https://tuabench.ai/ for more details.

Code: https://github.com/facebookresearch/TUA-Bench

arxiv:2606.28480

Published on Jun 26 · Submitted by taesiri on Jun 30 · AI at Meta

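Authors: Shoufa Chen, Luyuan Wang, Xuan Yang, Zhiheng Liu, Yuren Cong, Yuanfeng Ji, Feiyan Zhou, Xiaohui Zhang, Fanny Yang, Belinda Zeng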

Abstract

TUA-Bench presents a comprehensive benchmark for evaluating general-purpose terminal-use agents across diverse digital activities and specialized workflows, revealing significant performance gaps among current frontier agents.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically native to the shell. We introduce TUA-Bench, a general-purpose benchmark for terminal-use agents. TUA-Bench includes 120 real-world tasks across five task families, covering routine digital activities (including document editing, email management, and live-web information seeking) as well as scientific and engineering workflows that require specialized software and were co-designed with PhD-level domain experts. This breadth distinguishes TUA-Bench from prior shell-focused or domain-specific benchmarks. Each task is manually designed, runs in a real terminal with a deterministic setup script, and is evaluated by an execution-based scoring protocol. We find that the strongest frontier agent, Claude Code with Claude Opus 4.8 at max reasoning effort, achieves 65.8% overall performance, with substantial gaps across both tracks. By providing a broad and realistic evaluation of terminal-use capabilities, TUA-Bench aims to accelerate the transition from narrow, task-specific assistants to general-purpose agents capable of operating reliably across diverse digital environments.
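The abstract describes each task as a deterministic setup script followed by execution-based scoring. The sketch below illustrates that general pattern only; it is not TUA-Bench's harness, which this page does not show. The names `Task`, `run_task`, `check_typo_fixed`, `TYPO_TASK`, `fixing_agent`, and `idle_agent` are all hypothetical, the toy task is invented for illustration, and the code assumes a POSIX `sh` with `printf`, `sed`, and `mv` on the path.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record; the field names are illustrative, not TUA-Bench's."""
    name: str
    setup_script: str               # shell commands that create the initial state
    check: Callable[[Path], float]  # scores the final workspace in [0, 1]


def run_task(task: Task, agent: Callable[[Path], None]) -> float:
    """Build a fresh workspace, let the agent act in it, then score the end state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Deterministic setup: the same script runs in an empty directory every time.
        subprocess.run(["sh", "-c", task.setup_script], cwd=workdir, check=True)
        agent(workdir)
        # Execution-based scoring: inspect what the agent left behind.
        return task.check(workdir)


def check_typo_fixed(workdir: Path) -> float:
    """Return 1.0 only if notes.txt holds the corrected sentence."""
    target = workdir / "notes.txt"
    if not target.exists():
        return 0.0
    return 1.0 if target.read_text() == "the quick brown fox\n" else 0.0


# Invented toy task: the setup script plants a typo that the agent must repair.
TYPO_TASK = Task(
    name="fix-typo",
    setup_script="printf 'teh quick brown fox\\n' > notes.txt",
    check=check_typo_fixed,
)


def fixing_agent(workdir: Path) -> None:
    """Stand-in for a real agent: issues one shell command that repairs the file."""
    command = "sed 's/teh/the/' notes.txt > fixed.txt && mv fixed.txt notes.txt"
    subprocess.run(["sh", "-c", command], cwd=workdir, check=True)


def idle_agent(workdir: Path) -> None:
    """Stand-in for an agent that does nothing."""


if __name__ == "__main__":
    print(run_task(TYPO_TASK, fixing_agent))
    print(run_task(TYPO_TASK, idle_agent))
```

Scoring the final workspace rather than the agent's transcript means any sequence of commands that reaches the correct end state earns credit, which is the usual motivation for execution-based evaluation.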