Accepted by ECCV 2026. Project page: https://live-edit.github.io/
Authors: Xinyu Wang, Chongbo Zhao, Fangneng Zhan, Yue Ma
Organization: Tsinghua University
Published: 2026-06-25. arXiv: 2606.26740. Code: https://github.com/cp-cp/LiveEdit
LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing
Abstract
A novel streaming video editing framework enables causal, frame-by-frame editing with stable long-horizon preservation and real-time responsiveness through a three-stage distillation pipeline and an AR-oriented mask cache.
Streaming video editing has made rapid progress, yet practical deployment is still limited by two core issues: maintaining stable backgrounds and non-edited regions over time, and achieving the low latency required for real-time interactive scenarios. Meanwhile, recent streaming video generation methods are mostly developed for synthesis and cannot be directly applied to editing due to the strict preservation requirement and region-specific control. In this work, we present a novel streaming video editing framework that performs causal, frame-by-frame editing with strong content preservation and real-time responsiveness. Our key design is a three-stage distillation pipeline that progressively transfers editing capability from a powerful bidirectional foundation model to an efficient unidirectional streaming editor, enabling stable long-horizon edits without sacrificing visual fidelity. To further support real-time deployment, we introduce an AR-oriented mask cache that reuses region-related computation across frames, substantially reducing redundant processing and accelerating inference. Finally, we establish a dedicated benchmark for streaming video editing. Extensive evaluations demonstrate that our method achieves state-of-the-art visual quality among streaming baselines while drastically boosting inference speed to 12.66 FPS, making it suitable for interactive and augmented reality applications.
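To make the idea of reusing region-related computation across frames concrete, here is a minimal, generic sketch. It is not the paper's AR-oriented mask cache: the abstract does not describe that mechanism in detail, so every name here (`RegionCache`, `edited_coords`, `edit_frame`) is hypothetical. The sketch assumes a binary edit mask that stays fixed over several frames, memoizes the mask-dependent work (finding the edited pixel coordinates), and copies non-edited pixels through unchanged.

```python
from typing import Callable, Dict, List, Tuple

Mask = Tuple[Tuple[int, ...], ...]   # immutable 2D binary mask, 1 = edited pixel
Coords = List[Tuple[int, int]]


class RegionCache:
    """Memoize a mask-dependent computation across frames (hypothetical helper)."""

    def __init__(self, compute: Callable[[Mask], Coords]):
        self._compute = compute
        self._store: Dict[Mask, Coords] = {}
        self.misses = 0  # how many times the computation actually ran

    def get(self, mask: Mask) -> Coords:
        # Recompute only when this mask has not been seen before.
        if mask not in self._store:
            self.misses += 1
            self._store[mask] = self._compute(mask)
        return self._store[mask]


def edited_coords(mask: Mask) -> Coords:
    """Stand-in for region-related work: list the (row, col) of edited pixels."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def edit_frame(frame, mask, cache, edit_pixel):
    """Apply edit_pixel inside the mask; pixels outside it are copied unchanged."""
    out = [list(row) for row in frame]
    for r, c in cache.get(mask):
        out[r][c] = edit_pixel(frame[r][c])
    return out


# Three toy 2x3 "frames" processed causally, one at a time, with a fixed mask.
frames = [
    [[1, 2, 3], [4, 5, 6]],
    [[2, 3, 4], [5, 6, 7]],
    [[3, 4, 5], [6, 7, 8]],
]
mask: Mask = ((0, 1, 0), (0, 1, 1))
cache = RegionCache(edited_coords)
outputs = [edit_frame(f, mask, cache, lambda v: v * 10) for f in frames]
```

In this toy setup the mask-dependent work runs once for the first frame and is reused for the following two, while every pixel outside the mask keeps its original value. A real streaming editor would cache far heavier intermediate results than a coordinate list, and how the paper does so is not stated in the abstract.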