Hugging Face Daily Papers · June 30, 2026 · 3 min read

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Accepted by ECCV 2026.


arxiv:2606.29445


Published on Jun 28 · Submitted by fansunqi on Jun 30

Authors: Sunqi Fan, Qingle Liu, Runqi Yin, Meng-Hao Guo, Shuojin Yang

Abstract

A new benchmark evaluates multimodal large language models' ability to understand video content and perform GUI tasks, while a novel keyframe extraction method improves performance on both video question answering and video-guided agentic tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Video understanding is a fundamental capability for multimodal intelligence, and recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance on Video Question Answering (VideoQA) benchmarks. However, existing benchmarks primarily evaluate whether models can perceive shallow visual cues, and rarely examine whether MLLMs can learn deeper knowledge or procedural skills from video tutorials and generalize them to downstream long-horizon agentic tasks. To address this gap, we introduce VG-GUIBench (Video-Guided GUI Benchmark), a new benchmark designed to evaluate whether MLLM-based GUI agents can follow video tutorials to complete the corresponding interactive GUI tasks. Furthermore, we observe that model performance on both VideoQA and video-guided agentic tasks critically depends on effective keyframe extraction. Based on this observation, we propose TASKER (Task-driven And Scene-aware Keyframe searchER), a keyframe extraction algorithm that jointly considers task relevance and scene dynamics to identify informative frames. Experimental results demonstrate that TASKER achieves significant performance improvements on both VideoQA and video-guided agentic task benchmarks, outperforming the best baseline by 2.0% on the EgoSchema full set and by 1.8% on the NExT-QA dataset. These results further highlight the potential of generalized keyframe extraction methods for video understanding tasks. Our code and data are available at https://github.com/VG-GUI-TASKER/VG-GUI-TASKER.
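The abstract says only that TASKER weighs task relevance together with scene dynamics; it does not spell out how. As a rough illustration of that general idea, and not the authors' algorithm, the sketch below assumes two per-frame scores have already been computed elsewhere: a hypothetical `relevance` score (for example, how well a frame matches the task text) and a hypothetical `scene_change` score (for example, how much a frame differs from the previous one). The function names, the linear mixing weight `alpha`, and the top-k selection rule are all assumptions made for this example.

```python
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Min-max scale values to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_keyframes(
    relevance: Sequence[float],
    scene_change: Sequence[float],
    k: int,
    alpha: float = 0.5,
) -> List[int]:
    """Pick k frame indices by a weighted mix of two per-frame scores.

    Illustrative only: `alpha` weights task relevance against scene change.
    This is an assumed scoring rule, not the rule used by TASKER.
    """
    if len(relevance) != len(scene_change):
        raise ValueError("relevance and scene_change must have equal length")
    if not relevance or k <= 0:
        return []

    rel = normalize(relevance)
    chg = normalize(scene_change)
    combined = [alpha * r + (1.0 - alpha) * c for r, c in zip(rel, chg)]

    # Rank by descending combined score; ties go to the earlier frame.
    ranked = sorted(range(len(combined)), key=lambda i: (-combined[i], i))

    # Return the chosen frames in temporal order.
    return sorted(ranked[:k])


if __name__ == "__main__":
    relevance = [0.1, 0.9, 0.2, 0.8, 0.1, 0.3]
    scene_change = [0.0, 0.2, 0.9, 0.7, 0.1, 0.0]
    print(select_keyframes(relevance, scene_change, k=3))  # [1, 2, 3]
```

Setting `alpha=1.0` reduces the sketch to pure relevance ranking, and `alpha=0.0` to pure scene-change ranking, which makes it easy to see what each signal contributes on its own. For the actual method, see the paper and the linked repository.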

Project page: https://vg-gui-tasker.github.io/ · GitHub: https://github.com/VG-GUI-TASKER/VG-GUI-TASKER

Community

fansunqi

Paper submitter about 23 hours ago

Accepted by ECCV 2026.


Get this paper in your agent:

hf papers read 2606.29445

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
