TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents
Authors: Shoufa Chen, Luyuan Wang, Xuan Yang, Zhiheng Liu, Yuren Cong, Yuanfeng Ji, Feiyan Zhou, Xiaohui Zhang, Fanny Yang, Belinda Zeng
Organization: AI at Meta
Published: 2026-06-26
arXiv: 2606.28480
Code: https://github.com/facebookresearch/TUA-Bench
Project website: https://tuabench.ai/
Abstract
TUA-Bench presents a comprehensive benchmark for evaluating general-purpose terminal-use agents across diverse digital activities and specialized workflows, revealing significant performance gaps among current frontier agents.
As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use tasks beyond coding. However, existing benchmarks do not adequately evaluate general-purpose terminal computer-use agents (TUAs): general computer-use benchmarks primarily target graphical user interfaces (GUIs), whereas terminal-based benchmarks largely emphasize technical and programming-centric workflows historically native to the shell. We introduce TUA-Bench, a general-purpose benchmark for terminal-use agents. TUA-Bench includes 120 real-world tasks across five task families, covering routine digital activities (including document editing, email management, and live-web information seeking) as well as scientific and engineering workflows that require specialized software and were co-designed with PhD-level domain experts. This breadth distinguishes TUA-Bench from prior shell-focused or domain-specific benchmarks. Each task is manually designed, runs in a real terminal with a deterministic setup script, and is evaluated by an execution-based scoring protocol. We find that the strongest frontier agent, Claude Code with Claude Opus 4.8 at max reasoning effort, achieves 65.8% overall performance, with substantial gaps across both tracks. By providing a broad and realistic evaluation of terminal-use capabilities, TUA-Bench aims to accelerate the transition from narrow, task-specific assistants to general-purpose agents capable of operating reliably across diverse digital environments.
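To make "deterministic setup script" and "execution-based scoring" concrete, here is a minimal sketch of how such a task loop could be structured. It is not TUA-Bench's actual harness: the `Task` fields, `run_task`, `overall_score`, the toy task, and the two stand-in agents are all hypothetical names introduced for illustration, and averaging per-task scores into a percentage is an assumed aggregation that the abstract does not specify.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative, not TUA-Bench's schema."""
    name: str
    setup_script: str               # shell commands that build the initial state
    check: Callable[[Path], float]  # inspects the final state, returns a score in [0, 1]


def run_task(task: Task, agent: Callable[[Path], None]) -> float:
    """Create a fresh working directory, let the agent act in it, then score the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Deterministic setup: the same script always yields the same starting files.
        subprocess.run(["sh", "-c", task.setup_script], cwd=workdir, check=True)
        agent(workdir)
        # Execution-based scoring: judge the resulting state, not the agent's transcript.
        return task.check(workdir)


def overall_score(scores: list[float]) -> float:
    """Mean per-task score as a percentage (an assumed aggregation)."""
    return 100.0 * sum(scores) / len(scores)


# Toy task: fix the typo "teh" in notes.txt.
toy = Task(
    name="fix-typo",
    setup_script="printf 'teh quick brown fox\\n' > notes.txt",
    check=lambda d: 1.0 if (d / "notes.txt").read_text() == "the quick brown fox\n" else 0.0,
)


def sed_agent(workdir: Path) -> None:
    """Stand-in for a real agent: issues a single shell command."""
    subprocess.run(["sed", "-i.bak", "s/teh/the/", "notes.txt"], cwd=workdir, check=True)


def idle_agent(workdir: Path) -> None:
    """Stand-in for an agent that does nothing."""


if __name__ == "__main__":
    scores = [run_task(toy, sed_agent), run_task(toy, idle_agent)]
    print(f"overall: {overall_score(scores):.1f}%")
```

The checker reads the files left behind in the working directory rather than anything the agent says about its work, which is the property the phrase "execution-based" points at. A fresh temporary directory per run keeps one attempt from leaking state into the next.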