Hugging Face Daily Papers


Hugging Face Daily Papers research 6h ago

MemLearner: Learning to Query Context memory for Video World Models

Abstract MemLearner improves video world models by using learning-based adaptive context querying with query tokens to enhance scene consistency and memory in long video sequences with occlusions and dynamic objects. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video World…

24
Hugging Face Daily Papers research 12h ago

SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE

Abstract A novel zero-shot framework injects spherical priors into pre-trained diffusion transformers for 360 panoramic generation, using spherical RoPE and semantic distortion guidance to overcome topological constraints without training or optimization. Generated by…

35
Hugging Face Daily Papers research 13h ago

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

Abstract TRIAGE introduces a role-typed credit assignment framework that enhances agentic reinforcement learning by providing more nuanced credit assignment than standard GRPO methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic reinforcement learning requires assigning…

26
Hugging Face Daily Papers research 14h ago

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

Abstract SWE-Interact presents a testbed that evaluates coding agents in realistic multi-turn, user-driven software engineering scenarios, revealing significant gaps between single-turn performance and interactive task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We…

6
Hugging Face Daily Papers research 14h ago

Play2Perfect: What Matters in Dexterous Play Pretraining for Precise Assembly?

Abstract A reinforcement learning framework called Play2Perfect enables sample-efficient robotic assembly tasks by first learning general manipulation skills through playful interaction with diverse objects, then adapting these skills for precise assembly through fine-tuning.…

34
Hugging Face Daily Papers research 14h ago

PolyFlow: Continuous Topology Embedding Flow Matching for Artist-style Mesh Generation

Abstract PolyFlow introduces a continuous mesh representation using a topology embedder and applies flow-matching with Transformers for parallel mesh generation, achieving faster inference and precise resolution control compared to autoregressive methods. Generated by…

5
Hugging Face Daily Papers research 15h ago

Hierarchical Experimentalist Agents

Abstract HExA enables large language models to improve through active experimentation and skill learning in novel domains without requiring training or external supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models (LLMs) are increasingly used to take…

24
Hugging Face Daily Papers research 16h ago

Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

Abstract Approach-level diversity in LLM mathematical reasoning captures strategic variation in problem-solving methods, revealing limitations of surface-level diversity metrics and highlighting challenges in directly optimizing diverse reasoning approaches. Generated by…

11
Hugging Face Daily Papers research 16h ago

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Abstract The Act2Answer protocol evaluates embodied vision-language-action models by having agents answer questions through physical actions, revealing knowledge retention and generalization patterns across different semantic categories. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

35
Hugging Face Daily Papers research 16h ago

Lexical Consensus: Grounded Word Learning and Shared Meaning in Artificial Agents

Abstract Grounded word learning experiments using visual embeddings and lexical learners reveal that perceptual distance, rather than semantic relatedness, determines acquisition success, with distinct patterns in naming and retrieval performance. Generated by…

34
Hugging Face Daily Papers research 18h ago

MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

Abstract Multi-teacher On-Policy Distillation (MOPD) enables efficient integration of multiple domain capabilities in large language models through specialized reinforcement learning teachers and on-policy distillation, achieving superior performance over existing methods.…

33
Hugging Face Daily Papers research 18h ago

Goku: A Million-Scale Universal Dataset and Benchmark for Instruction-Based Video Editing

Abstract A large-scale video editing dataset and model are introduced that support multi-task and structural manipulations through advanced data synthesis and network architectures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing instruction-based video editing datasets…

38
Hugging Face Daily Papers research 19h ago

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Abstract A testbed called QVal is introduced for evaluating dense supervision signals in long-horizon LLM agent tasks by measuring how well method scores align with Q-values, enabling fair comparison of different supervision approaches without training. Generated by…

22
Hugging Face Daily Papers research 20h ago

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

Abstract Flexible Spoken Language Model (FlexiSLM) introduces dynamic frame rate capabilities for speech input and output, achieving superior performance over fixed-frame-rate models while enabling controllable inference speed. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spoken…

15
Hugging Face Daily Papers research 21h ago

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation

Abstract Procedural memory enhances LLM agents on workplace tasks through skill transfer across roles and models, with varying generalization capabilities affecting deployment strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Procedural memory is increasingly used to…

22
Hugging Face Daily Papers research 22h ago

SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History

Abstract SkillHone enables continuous evolution of agent skills by maintaining persistent decision histories and incorporating practice feedback for improved performance across research and tool-mediated analysis tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agent skills…

35
Hugging Face Daily Papers research 22h ago

DataEvolver: Self-Evolving Multi-Agent Data Construction for Text-Rich Image Generation

Abstract DataEvolver is a self-evolving multi-agent framework that improves text-rich image generation by leveraging feedback from rejected samples to iteratively enhance data quality. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-rich image generation is one of the most…

11
Hugging Face Daily Papers research 22h ago

MuSViT: A Foundation Vision Model for Sheet Music Representation

Abstract MuSViT is a vision transformer-based foundation model pre-trained on millions of sheet music pages that demonstrates superior performance in music score recognition and symbol detection tasks through both linear probing and fine-tuning approaches. Generated by…

10
Hugging Face Daily Papers research 23h ago

Scenes as Objects, Not Primitives: Instance-Structured 3D Tokenization from Unposed Views

Abstract A feed-forward framework decomposes 3D scenes into instance-structured token groups from multi-view images, enabling direct object-level reconstruction, segmentation, and manipulation without 3D annotations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A 3D scene is…

38
Hugging Face Daily Papers research 23h ago

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

Abstract A multilingual safety and fairness benchmark for speech models reveals persistent vulnerabilities across languages and naturalistic conditions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speech-capable models are increasingly deployed in real-world applications across…

36
Hugging Face Daily Papers research 23h ago

Xiaomi-GUI-0 Technical Report

Abstract A native multimodal GUI agent trained in real-device environments demonstrates superior performance and stability compared to traditional benchmark-based approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Graphical user interface (GUI) agents build on…

7
Hugging Face Daily Papers research 1d ago

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Abstract GEAR trains a vector-quantized tokenizer and autoregressive generator jointly end-to-end using representation alignment, overcoming non-differentiability issues through a dual read-out approach that improves convergence speed and feature quality. Generated by…

36
Hugging Face Daily Papers research 1d ago

Little Brains, Big Feats: Exploring Compact Language Models

Abstract Small language models can effectively perform retrieval-augmented generation tasks directly on-device without GPU acceleration. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While large language models have been dominating the research landscape recently, small language…

13
Hugging Face Daily Papers research 1d ago

Multi-Block Diffusion Language Models

Abstract Multi-Block Diffusion Language Models extend single-block diffusion to concurrent block decoding with improved training strategies and optimized decoding algorithms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Block Diffusion Language Models (BD-LMs) improve…

35
Hugging Face Daily Papers research 1d ago

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Abstract Reinforcement learning with metacognitive feedback and metacognitive data selection improve large language model calibration by enabling accurate self-assessment of performance and uncertainty. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Metacognition is a critical…

38
Hugging Face Daily Papers research 1d ago

TerraDiT-Ω: Unified Spatial Control for Satellite Image Synthesis with Any Geospatial Primitive

Abstract TerraDiT-Ω generates satellite imagery from native geospatial primitives using Geometry-Aware Local Attention, enabling flexible conditioning and improved downstream geospatial tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative models have achieved…

36
Hugging Face Daily Papers research 1d ago

PhotoQuilt: Training-Free Arbitrary-Resolution Photomosaics via Bootstrapped Tiled Denoising

Abstract PhotoQuilt is a training-free framework that generates high-resolution photomosaics by combining global layout composition with separate tile generation in latent space, overcoming limitations of diffusion models in balancing local detail and global structure. Generated…

5
Hugging Face Daily Papers research 1d ago

BrainJanus: A Unified Model for Understanding and Generation across Brain, Vision, and Language

Abstract BrainJanus represents the first unified brain model integrating brain, vision, and language through a shared Omni space, enabling bidirectional mapping between neural activity and sensory stimuli via a tokenized representation and autoregressive architecture. Generated…

38
Hugging Face Daily Papers research 1d ago

BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

Abstract Speculative decoding with adaptive block size selection improves inference efficiency by predicting optimal block sizes from prefilling representations, achieving significant speedup with minimal overhead. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speculative…

30
Hugging Face Daily Papers research 1d ago

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

Abstract AVTok is a unified tokenizer for audio-video generation that uses a dual-stream transformer architecture with shared encoder-decoder and modal-specific queries to create compact one-dimensional latent representations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

21
Hugging Face Daily Papers research 1d ago

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

Abstract Evolutionary fine-tuning enables large language models to develop cross-task problem-solving capabilities by learning from search trajectories, demonstrating improved performance on mathematical conjectures and optimization tasks. Generated by…

11
Hugging Face Daily Papers research 1d ago

DOPD: Dual On-policy Distillation

Abstract DOPD addresses privilege illusion in on-policy distillation by dynamically routing token-level supervision between teacher and student policies based on advantage gaps and probabilities, improving capability transfer in large language and vision-language models. Generated by…

6
Hugging Face Daily Papers research 1d ago

Dockerless: Environment-Free Program Verifier for Coding Agents

Abstract A Dockerless environment-free agentic patch verifier improves code patch evaluation accuracy and enables effective post-training without execution-based verification costs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Program verifiers play a central role in training…

21
Hugging Face Daily Papers research 1d ago

LUMOS: A Semantic Operating-System Layer for Accessibility-Grounded AI Agents

Abstract LUMOS provides a semantic interaction layer that converts operating system metadata into machine-readable formats, enabling AI agents to interact more efficiently with computer interfaces than through traditional visual methods. Generated by…

34
Hugging Face Daily Papers research 1d ago

One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

Abstract InnerZoom addresses GUI grounding challenges by preserving target-region awareness across decoder layers through a single-forward pass that bridges cross-layer evidence, achieving state-of-the-art performance with reduced computational cost. Generated by…

16
Hugging Face Daily Papers research 1d ago

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Abstract OSWorld 2.0 presents a comprehensive benchmark for evaluating computer-use agents through complex, real-world workflows that reveal current limitations in agent reasoning and task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing computer-use benchmarks…

24
Hugging Face Daily Papers research 1d ago

MirrorPPR: Exemplar-Based Portrait Photo Retouching

Abstract An exemplar-based portrait retouching framework using a Diffusion Transformer with LoRA adaptation and self-augmented training data achieves superior quality and identity preservation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While text-guided image editing has made…

24
Hugging Face Daily Papers research 1d ago

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

Abstract Delayed verification in multi-agent LLM systems can cause instability leading to oscillations, but grounded factual answering stabilizes the system by making truth an absorbing boundary. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-agent large language model (LLM)…

14
Hugging Face Daily Papers research 1d ago

Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

Abstract Research reveals that language backbones in Vision-Language-Action models are highly redundant for robotic manipulation tasks, while vision and action pathways are more critical, suggesting a need for deliberate capacity allocation in future architectures. Generated by…

11
Hugging Face Daily Papers research 1d ago

LLM Program Optimization via Retrieval Augmented Search

Abstract Blackbox adaptation methods using retrieval-augmented search and atomic edit decomposition improve program optimization performance for both C++ and Python code. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent work has demonstrated the potential of large language…

19
Hugging Face Daily Papers research 1d ago

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Abstract HeRA aligns individual attention heads in MLLMs to preserve local neighborhood relationships across modalities, improving vision-centric task performance and reducing visual hallucinations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Representation alignment has…

27
Hugging Face Daily Papers research 1d ago

SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Abstract SWE-Together is a multi-turn coding benchmark created from real user-agent interactions, featuring a reactive LLM simulator to evaluate agents based on both final correctness and interaction efficiency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Most coding-agent…

32
Hugging Face Daily Papers research 1d ago

DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model

Abstract DreamForge-World 0.1 Preview adapts a video generation architecture with a residual action pathway to enable real-time interactive world simulation on consumer hardware with low computational requirements. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present…

18
Hugging Face Daily Papers research 1d ago

RocketSmith: Agentic Additive Manufacturing of High-Powered Rockets

Abstract An agentic system using large language models automates high-power rocket design processes, enabling successful flight testing with consistent simulation results. Generated by Qwen/Qwen2.5-Coder-32B-Instruct RocketSmith is an agentic system which intelligently automates…

9
Hugging Face Daily Papers research 1d ago

A Gravitational Interpretation of Fine-Tuning Reversion

Abstract Post-alignment safety degradation arises from geometric properties of training history, where fine-tuning reversion follows a persistent direction defined by early training dynamics. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Fine-tuning on harmless data can partially…

35
Hugging Face Daily Papers research 1d ago

One Model, Many Latencies: Universal Speech Enhancement for Diverse Real-Time Applications

Abstract A universal speech enhancement model provides configurable algorithmic and computational latency controls using parallel convolutions and early-exit mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Different real-time speech applications impose distinct latency…

9
Hugging Face Daily Papers research 1d ago

SAM2Matting: Generalized Image and Video Matting

Abstract SAM2Matting advances video matting by decoupling tracking and matting tasks through a tracker-to-matting framework that leverages foundational trackers with region-proposal bridges and dedicated matting heads. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Despite…

36
Hugging Face Daily Papers research 1d ago

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

Abstract Asynchronous pipeline parallelism with PipeDream-2BW can achieve near-synchronous performance through optimizer selection and error feedback correction, overcoming traditional stability concerns. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Modern large-scale LLM…

29
Hugging Face Daily Papers research 1d ago

One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models

Abstract A faithful 3D world representation should account for layered geometry, where a single camera ray may contain multiple visible and geometrically valid surfaces. Monocular depth estimation, however, reduces this structure to one scalar depth per pixel. Transparent scenes…

29
Hugging Face Daily Papers research 1d ago

RaysUp: Ultra-light Universal Feature Upsampling via Geometry-Aware Ray Representation

Abstract RaysUp is a lightweight, task-agnostic feature upsampling framework that reconstructs high-resolution features using geometry-aware ray domain techniques with improved efficiency and accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Pre-trained Vision Foundation…

37
