Hugging Face Daily Papers


Hugging Face Daily Papers research 6d ago

The Verification Horizon: No Silver Bullet for Coding Agent Rewards

Abstract Verification challenges in AI agents arise from the difficulty of aligning proxy signals with human intent, requiring adaptive verification systems that evolve alongside generative capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A classical intuition holds…

26
Hugging Face Daily Papers research 6d ago

GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

Abstract Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a…

7
Hugging Face Daily Papers research 6d ago

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

Abstract Research investigates how different supervisory signals and training strategies improve the stability and performance of large language models in tool-use tasks, addressing issues like catastrophic collapse and format sensitivity through interleaved supervised…

21
Hugging Face Daily Papers research 6d ago

OpenBioRQ: Unsolved Biomedical Research Questions for Agents

Abstract A new biomedical benchmark evaluates agentic models' ability to verify sources and avoid false citations by testing unsolved research questions with no answer keys, revealing significant failures in retrieval-grounded reasoning and tool usage. Generated by…

9
Hugging Face Daily Papers research 6d ago

In-Context World Modeling for Robotic Control

Abstract ICWM enables robot policies to infer system variables from self-generated interactions, allowing adaptation to novel configurations without parameter updates by treating system identification as an in-context adaptation problem. Generated by…

8
Hugging Face Daily Papers research 6d ago

Confidence-Aware Tool Orchestration for Robust Video Understanding

Abstract Robust-TO addresses the Blind Trust Problem in video reasoning by integrating per-frame trustworthiness into an agentic framework that improves accuracy under realistic perturbations through calibrated evidence weighting and reliability-aware reasoning. Generated by…

17
Hugging Face Daily Papers research 6d ago

ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Abstract ViQ presents a visual quantization framework that balances semantic richness and detail preservation in discrete representations, enabling efficient multimodal training with native-resolution inputs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A unified representation…

26
Hugging Face Daily Papers research 6d ago

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Abstract On-policy skill distillation framework extracts dense hindsight supervision from completed trajectories to improve language agent training efficiency and performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Outcome-based reinforcement learning provides a stable…

20
Hugging Face Daily Papers research 6d ago

DanceOPD: On-Policy Generative Field Distillation

Abstract A novel on-policy generative field distillation framework called DanceOPD is proposed to unify text-to-image generation, local editing, and global editing capabilities in flow-matching models through capability-specific routing and velocity-based training. Generated by…

10
Hugging Face Daily Papers research 6d ago

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

Abstract A vision-language model-based hierarchical question graph framework evaluates video generation models' adherence to physical laws with granular violation detection and human correlation validation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models are…

23
Hugging Face Daily Papers research 6d ago

Do Thinking Tokens Help with Safety?

Abstract Research reveals that reasoning models' safety outcomes are predictable from early hidden representations, with deliberation appearing but not substantially influencing final responses, and current safety interventions inadvertently suppress genuine deliberation…

25
Hugging Face Daily Papers research 6d ago

Forecasting Future Behavior as a Learning Task

Abstract Behavior Forecasters are trained to predict large reasoning model outputs from single trajectories, outperforming large language models while requiring significantly less computational cost. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Trust in an AI system is often…

24
Hugging Face Daily Papers research 6d ago

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

Abstract Standard LLM agents rely on plan content remaining in context rather than maintaining it as persistent state, with evidence shown through replay pairing diagnostics and compression stress tests. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Long-horizon agents depend on…

27
Hugging Face Daily Papers research 6d ago

Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

Abstract A novel speaker verification framework combines frozen self-supervised features with ECAPA-TDNN and MoE modules to improve identity verification across both speech and non-verbal vocalizations while maintaining speech performance. Generated by…

30
Hugging Face Daily Papers research 6d ago

Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching

Abstract Lite Any Stereo V2 (LAS2) presents an efficient stereo matching approach that achieves state-of-the-art accuracy with significantly reduced latency through optimized architecture and training strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent advances in…

9
Hugging Face Daily Papers research 6d ago

PrivacyAlign: Contextual Privacy Alignment for LLM Agents

Abstract Researchers develop a human-centered approach to align AI agents with privacy norms by creating a comprehensive dataset of privacy judgments and using annotation-conditioned reward modeling to improve agent behavior. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI…

7
Hugging Face Daily Papers research 7d ago

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Abstract Jailbreak attacks expose vulnerabilities in aligned large language models, revealing that harmful intent is encoded in structured intermediate uncertainty dynamics rather than output representations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Jailbreak attacks reveal…

23
Hugging Face Daily Papers research 7d ago

Distill Once, Adapt Life-Long: Exploring Dataset Distillation for Continual Test-Time Adaptation

Abstract DO-ALL is a test-time adaptation framework that uses dataset distillation to create synthetic anchors for stable long-term model performance without retaining source data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Continual Test-Time Adaptation (CTTA) aims to…

20
Hugging Face Daily Papers research 7d ago

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Abstract ReNIO enhances on-policy distillation for language models by reweighting negative trajectories based on token-level probability ratios, improving reasoning performance in mathematical and code generation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct On-policy…

25
Hugging Face Daily Papers research 7d ago

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

Abstract Tool Suppression occurs when JSON Schema constraints and tool calling are jointly enabled, preventing open-weight models from invoking tools despite maintaining schema compliance, with the issue stemming from grammar-based token masking that makes tool-call tokens…

5
Hugging Face Daily Papers research 7d ago

Autodata: An agentic data scientist to create high quality synthetic data

Abstract Autodata enables AI agents to function as data scientists who create high-quality training data through meta-optimization, demonstrating improved performance across multiple task domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce Autodata, a general…

30
Hugging Face Daily Papers research 7d ago

Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

Abstract Autoregressive video diffusion extends diffusion distillation frameworks to real-time streaming generation through causal training paradigms, achieving state-of-the-art performance with fast convergence and interactive world modeling capabilities. Generated by…

4
Hugging Face Daily Papers research 7d ago

Improved Large Language Diffusion Models

Abstract Masked diffusion language models with fully bidirectional attention outperform autoregressive counterparts on various benchmarks while maintaining competitiveness with established models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Modern large language models are…

18
Hugging Face Daily Papers research 7d ago

MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

Abstract A novel-view video synthesis method that enhances motion-aware diffusion models through multi-view point tracking supervision to improve geometric consistency and motion fidelity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Synthesizing a novel-view video from a…

37
Hugging Face Daily Papers research 7d ago

ShutterMuse: Capture-Time Photography Guidance with MLLMs

Abstract Researchers developed a new benchmark and dataset for photography assistance, along with a unified multimodal model that provides both composition guidance and pose recommendations during image capture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Real-world photography…

12
Hugging Face Daily Papers research 7d ago

The Hitchhiker's Guide to Agentic AI: From Foundations to Systems

Abstract The book provides a comprehensive guide to building autonomous AI systems, covering foundational elements like transformer architecture and training methods, along with advanced topics such as reinforcement learning, agent architectures, and production deployment.…

5
Hugging Face Daily Papers research 7d ago

RL-Index: Reinforcement Learning for Retrieval Index Reasoning

Abstract RL-Index introduces an agentic indexing framework that shifts reasoning from query time to indexing stage by using LLM-generated rationales and reinforcement learning to improve retrieval effectiveness and reduce latency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

25
Hugging Face Daily Papers research 7d ago

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

Abstract Two-channel evaluation shows output compression reduces costs while input compression increases costs and degrades accuracy across models and datasets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct "Talk short. Drop grammar. Save token." This caveman style is widely…

28
Hugging Face Daily Papers research 7d ago

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

Abstract LLM agents frequently select higher-privilege tools unnecessarily, and while safety alignment doesn't ensure least-privilege choices, a post-training defense can reduce excessive privilege use without sacrificing performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

26
Hugging Face Daily Papers research 7d ago

UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Abstract UnityShots is a memory-driven audio-video generation system that maintains consistent subject appearance and audio across video cuts using fixed-size long-term and short-term memory slots with boundary-conditioned gates and discrete cut-type priors. Generated by…

7
Hugging Face Daily Papers research 7d ago

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

Abstract A novel label-free framework for visual reasoning called V-Zero is presented, which uses contrastive evidence gating to improve fine-grained visual reasoning without requiring annotated answer labels, achieving faster training than traditional methods. Generated by…

12
Hugging Face Daily Papers research 7d ago

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Abstract EBench is a comprehensive simulation benchmark for evaluating generalist mobile manipulation policies across diverse tasks and dimensions, revealing distinct capability profiles and generalization patterns among state-of-the-art models. Generated by…

18
Hugging Face Daily Papers research 7d ago

TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy

Abstract Camera-controllable video virtual try-on framework uses a 4D proxy with explicit human-environment decoupling and DiT-based video generation for omnidirectional viewing. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While Video Virtual Try-on (VVT) has achieved…

4
Hugging Face Daily Papers research 7d ago

Are We Ready For An Agent-Native Memory System?

Abstract Large language model agents' memory systems have evolved into complex data management frameworks requiring systematic evaluation across multiple modules and workloads to understand their performance characteristics and trade-offs. Generated by…

7
Hugging Face Daily Papers research 7d ago

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Abstract Multimodal Chain-of-Thought reasoning shows selective effectiveness across different tasks, with limitations in maintaining visual introspection during reasoning processes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Chain-of-Thought (CoT) has become a standard method…

17
Hugging Face Daily Papers research 7d ago

DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

Abstract DomainShuttle enables open domain subject-driven text-to-video generation with high fidelity and flexibility across in-domain and cross-domain scenarios through domain-aware modeling and dual RoPE schemes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Open domain…

10
Hugging Face Daily Papers research 7d ago

RoPE-Aware Bit Allocation for KV-Cache Quantization

Abstract Block-GTQ introduces a RoPE-aware bit allocation method for key-cache quantization that improves attention accuracy and downstream performance through adaptive bit distribution and packed cache serving. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing low-bit…

22
Hugging Face Daily Papers research 7d ago

Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

Abstract This survey explores multimodal code intelligence systems that generate and reason with code based on visual inputs, categorizing approaches across GUI, scientific visualization, structured graphics, and emerging frameworks while identifying verification-centered…

25
Hugging Face Daily Papers research 7d ago

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Abstract Implicit Visual Chain-of-Thought decomposes visual conditioning into structural and semantic cascades for improved structure-aware image generation with sketch supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Unified multi-modal large language models (MLLMs)…

7
Hugging Face Daily Papers research 7d ago

Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Abstract A large-scale synthetic dataset and specialized model architecture are introduced to address the challenges of artistic text recognition by improving data diversity and model flexibility for irregular text layouts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct WordArt…

9
Hugging Face Daily Papers research 7d ago

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Abstract Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present…

20
Hugging Face Daily Papers research 7d ago

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

Abstract Long-term memory in LLM agents should be evaluated as an auditable post-interaction artifact by reconstructing structured user state from the agent's memory, as demonstrated by MEMPROBE, a benchmark testing memory recovery against synthetic ground truth across 50…

21
Hugging Face Daily Papers research 7d ago

Critique of Agent Model

Abstract True artificial agency requires internalized structures for goals, identity, decision-making, self-regulation, and learning, distinguishing autonomous systems from task-specific ones. Generated by Qwen/Qwen2.5-Coder-32B-Instruct What is an agent? What constitutes…

24
Hugging Face Daily Papers research 7d ago

InSight: Self-Guided Skill Acquisition via Steerable VLAs

Abstract InSight enables autonomous skill acquisition for vision-language-action models through primitive-action level steerability and automated demonstration generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-language-action (VLA) models can learn manipulation…

19
Hugging Face Daily Papers research 7d ago

Multi4D: High-Fidelity Dynamic Gaussian Splatting via Multi-Level Competitive Allocation

Abstract Multi4D addresses the trade-off between motion consistency and visual fidelity in dynamic 3D Gaussian splatting through a multi-level competitive allocation framework that enables adaptive specialization and efficient representation. Generated by…

21
Hugging Face Daily Papers research 7d ago

Semantic Browsing: Controllable Diversity for Image Generation

Abstract Text-to-image models are enhanced with controlled diversity through semantic browsing capabilities that enable structured navigation of image variations based on meaningful semantic decisions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Modern text-to-image models…

4
Hugging Face Daily Papers research 8d ago

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Abstract Large language models face challenges in archive-grounded reasoning tasks involving evidence retrieval and synthesis across diverse document collections, with performance varying significantly across domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language…

26
Hugging Face Daily Papers research 8d ago

ChartWalker: Benchmarking the Cross-Chart RAG Task

Abstract ChartWalker presents a novel framework for cross-chart retrieval-augmented generation with hierarchical knowledge graph construction and structure-aware sampling for challenging multi-modal analytical tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Cross-Chart…

33
Hugging Face Daily Papers research 8d ago

QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

Abstract QG-MIL introduces a gated transformer aggregator for multiple instance learning in medical imaging that stabilizes attention distribution and improves prediction consistency across different medical domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Attention-based…

38
Hugging Face Daily Papers research 8d ago

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Abstract EventVLA addresses long-horizon robotic manipulation challenges by introducing a sparse visual evidence memory framework with visual anchors and dynamic Keyframe Evidence Memory module for improved task performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory…

23
