Hierarchical Experimentalist Agents
Authors: Abhranil Chandra, Sankaran Vaidyanathan, Utsav Dhanuka, Varun Gandhi, Scott Niekum
Organization: University of Massachusetts Amherst
Published: 2026-06-28
arXiv: 2606.29315
Project page: https://general-exp-3-continual-learning-agent.github.io/HeXA/
Code: https://github.com/General-Exp-3-Continual-Learning-Agent/HeXA-Hierarchical-Experimentalist-Agents
Abstract
HExA enables large language models to improve through active experimentation and skill learning in novel domains without requiring training or external supervision.
Large language models (LLMs) are increasingly used to take actions in the real world and support human decision-making, yet most agents rely on parametric knowledge, fixed post-training data, retrieval, or search. This paradigm breaks down in novel domains and for sophisticated queries that cannot be answered from prior knowledge alone. Knowing the laws of physics, for instance, does not by itself enable LLMs to answer queries or complete long-horizon tasks in a complex physical system. To address this, we introduce Hierarchical Experimentalist Agents (HExA), an in-context self-improvement framework that learns from active experimentation. HExA iteratively designs and refines query-relevant experiments, learns a reusable library of composable skills from experience, and integrates experimental evidence to answer queries or take actions. HExA is training-free, compatible with any black-box model, and does not require external supervision, oracles, or offline data. To evaluate active experimentation, we introduce Interphyre, a tool-calling benchmark built on the PHYRE 2D procedural physics environment, where agents propose interventions and test hypotheses through simulation APIs. Experiments show that current LLM agents struggle in these settings, especially on the hardest levels of Interphyre. Claude Sonnet 4.6 achieves only 2% success, while HExA raises the same model's success rate to as much as 77%. HExA also improves open-weight models and outperforms agentic baselines such as ReAct and Reflexion. Moreover, using only skills learned from easier levels and transferred without active experimentation, HExA achieves 44% success, demonstrating the reusability and generalization of its learned skills. Overall, HExA shows that learning through active experimentation can help agents discover useful knowledge, acquire reusable skills, and make efficient progress on novel long-horizon tasks.
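The loop the abstract describes (propose an intervention, run it in simulation, refine, then distill a reusable skill) can be illustrated with a deliberately small toy. The sketch below is not HExA and does not use Interphyre or PHYRE: `simulate`, `SkillLibrary`, `solve_by_experiment`, and the projectile setup are all hypothetical names and assumptions chosen for illustration, and a simple bisection rule stands in for the LLM that would design and refine experiments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

HIDDEN_G = 9.81  # assumption: a constant the agent never reads directly


def simulate(speed: float) -> float:
    """Toy stand-in for a simulation API: landing distance of a 45-degree launch."""
    return speed ** 2 / HIDDEN_G


@dataclass
class SkillLibrary:
    """Hypothetical store of reusable skills, keyed by name."""
    skills: Dict[str, Callable[[float], float]] = field(default_factory=dict)

    def add(self, name: str, fn: Callable[[float], float]) -> None:
        self.skills[name] = fn

    def get(self, name: str) -> Optional[Callable[[float], float]]:
        return self.skills.get(name)


def solve_by_experiment(
    target: float, library: SkillLibrary, tol: float = 0.01, max_trials: int = 50
) -> Tuple[float, int]:
    """Find a launch speed that lands near `target` using only simulator feedback."""
    lo, hi = 0.0, 100.0
    speed, landed, trials = 0.0, 0.0, 0
    while trials < max_trials:
        speed = (lo + hi) / 2          # propose an intervention
        landed = simulate(speed)       # run the experiment
        trials += 1
        if abs(landed - target) <= tol:
            break
        if landed < target:            # refine the next experiment from the outcome
            lo = speed
        else:
            hi = speed
    # Distill the evidence into a skill: distance is proportional to speed squared.
    k = landed / speed ** 2
    library.add("speed_for_distance", lambda d, k=k: (d / k) ** 0.5)
    return speed, trials


library = SkillLibrary()
speed, trials = solve_by_experiment(20.0, library)
print(f"experimentation: speed={speed:.3f} after {trials} trials")

# Transfer: reuse the learned skill on a new target with no further experiments.
skill = library.get("speed_for_distance")
print(f"transfer: landed at {simulate(skill(35.0)):.3f} for target 35.0")
```

The second call mirrors, in miniature, the transfer setting mentioned in the abstract: the skill extracted on one target is applied to a new target without running further experiments.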