Hugging Face Daily Papers · 4 min read

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://nju-link.github.io/OmniCap-IF/
Code: https://github.com/NJU-LINK/omnicap-if
Organization: NJU-LINK Lab
arxiv:2606.08572

Published on Jun 7 · Submitted by jiahaowang on Jun 9
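Authors: Jiahao Wang, An Ping, Yanghai Wang, Yuanxing Zhang, Shihao Li, Hanyan Bian, Yichi Ren, Yize Zhang, Han Wang, Haowen Chen, Junze Li, Jiaqi Wang, Yiyang Hu, Zhuze Xu, Zijie Zhang, Jiaheng Liu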

Abstract

OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning.

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
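
To make the two evaluation dimensions concrete, here is a minimal sketch of how a checklist-style score might be aggregated per dimension. It is an illustration only: the `ChecklistItem` type, the `score_by_dimension` function, the example checklist entries, and the simple pass-rate aggregation are all assumptions made for this sketch, not the benchmark's actual data format or scoring procedure, which the abstract does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One hypothetical checklist entry judged against a generated caption."""
    dimension: str    # assumed labels: "format" or "content"
    description: str  # what the instruction asked for
    passed: bool      # verdict from a judge (human or model)


def score_by_dimension(items):
    """Return the fraction of passed checklist items for each dimension."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for item in items:
        totals[item.dimension] += 1
        hits[item.dimension] += int(item.passed)
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Hypothetical verdicts for a single caption.
example = [
    ChecklistItem("format", "caption is a numbered list", True),
    ChecklistItem("format", "caption has at most 80 words", False),
    ChecklistItem("content", "mentions the background music", True),
    ChecklistItem("content", "describes the opening shot", True),
]

scores = score_by_dimension(example)
print(scores)  # {'format': 0.5, 'content': 1.0}
```

Keeping the two dimensions separate, rather than averaging them into one number, is what would let a "format-content tradeoff" show up: a caption can satisfy every content item while failing formatting constraints, or the reverse.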

Community

Paper author Paper submitter about 9 hours ago

OmniCap-IF targets instruction-following in omni-modal video captioning, where models must not only understand visual and audio streams, but also obey complex user-specified structural, stylistic, temporal, visual, audio, and audio-visual constraints. We introduce the OmniCap-IF benchmark for fine-grained checklist-based evaluation, construct the OmniCap-IF-54K instruction-tuning dataset, and train the OmniCaptioner-IF model family to improve controllable omni-video captioning.

[Figure: overview framework]

Get this paper in your agent:

hf papers read 2606.08572
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 2

Datasets citing this paper 2
