Hugging Face Daily Papers

500 articles archived

Hugging Face Daily Papers research 14d ago

CEO-Bench: Can Agents Play the Long Game?

Abstract CEO-Bench evaluates language model agents' ability to manage a simulated startup over 500 days, testing their proficiency in long-term planning, noise handling, adaptability, and multi-task coordination through a Python interface. Generated by…

5
Hugging Face Daily Papers research 14d ago

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Abstract Quality-aware self-distillation improves vision-language model performance for GUI grounding by enhancing coordinate-token teacher signals through correctness-aware gating and probability scaling. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Graphical user interface…

38
Hugging Face Daily Papers research 14d ago

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Abstract IndustryBench-MIPU is introduced as the first large-scale benchmark for multi-image industrial product understanding, focusing on structured attribute extraction from heterogeneous product images to evaluate multimodal models' ability to recover dense technical…

24
Hugging Face Daily Papers research 14d ago

Kairos: A Native World Model Stack for Physical AI

Abstract Kairos is a native world model framework that learns from diverse experiences, maintains persistent states through hybrid temporal attention, and supports efficient deployment for physical AI applications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct World models are…

33
Hugging Face Daily Papers research 14d ago

Learning User Simulators with Turing Rewards

Abstract A reinforcement learning approach using Turing test-based rewards trains language models to generate responses indistinguishable from human users in conversational and forum discussion settings. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Learning to simulate human…

26
Hugging Face Daily Papers research 14d ago

Physics-IQ Verified

Abstract A systematic evaluation of the Physics-IQ benchmark reveals limitations in measuring physical understanding of video generative models, leading to improvements in prompt quality and sample-level scoring that enhance reliability for assessing physically accurate video…

29
Hugging Face Daily Papers research 14d ago

Guava: An Effective and Universal Harness for Embodied Manipulation

Abstract A harness framework for embodied tool use combines high-level reasoning with external modules, enabling compact models to perform complex manipulation tasks with minimal training data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Language models trained on large-scale…

15
Hugging Face Daily Papers research 14d ago

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Abstract A new benchmark suite called RNG-Bench is introduced to evaluate multimodal foundation models' ability to reconstruct past observations and use them for decision-making in multi-step interactions, featuring two games with controlled difficulty parameters and a memory…

23
Hugging Face Daily Papers research 14d ago

Sumi: Open Uniform Diffusion Language Model from Scratch

Abstract A large-scale uniform diffusion language model pretrained from scratch demonstrates competitive performance on knowledge and reasoning tasks while highlighting differences in commonsense reasoning compared to autoregressive models. Generated by…

15
Hugging Face Daily Papers research 14d ago

ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

Abstract ActWorld extends navigation-centric interactive world models to support object interaction through a chunk-autoregressive framework with hierarchical action-aware memory and persistent memory banks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Interactive world models…

9
Hugging Face Daily Papers research 14d ago

Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences

Abstract A unified scientific generative language model encodes diverse scientific objects and spatial interactions as token sequences, demonstrating strong performance across multiple domains through autoregressive next-token prediction. Generated by…

16
Hugging Face Daily Papers research 14d ago

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Abstract The SAGA framework uses multimodal large language models to provide attribute-aware supervision for vision encoders through Group Relative Policy Optimization, improving zero-shot image retrieval performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision encoders for…

21
Hugging Face Daily Papers research 15d ago

Self-Evolving Visual Questioner

Abstract A vision-language model autonomously improves its question-generation capabilities through self-evolution, enhancing both question quality and answerer performance without external supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-language models (VLMs)…

10
Hugging Face Daily Papers research 15d ago

Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

Abstract Multi-agent LLM systems with shared state are analyzed with formal methods, identifying concurrency anomalies and establishing a verified consistency hierarchy with mechanized proofs of soundness and completeness. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

14
Hugging Face Daily Papers research 15d ago

The Price of Anarchy in Disaggregated Inference

Abstract Disaggregated inference architectures separate prefill and decode phases across distinct GPU pools, and a game-theoretic analysis characterizes how GPU saturation affects system performance through regime transitions and payoff structure changes, enabling an adaptive…

25
Hugging Face Daily Papers research 15d ago

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Abstract The DR-DCI framework combines retrieval with direct corpus interaction by dynamically pulling relevant documents into a local workspace, enabling scalable and efficient agentic search across large corpora. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic search over…

27
Hugging Face Daily Papers research 15d ago

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Abstract Visual-Seeker enables visual-native multimodal deep search through active visual reasoning, outperforming proprietary models on real-world web environments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal large language models (MLLMs) have demonstrated…

25
Hugging Face Daily Papers research 15d ago

EgoCS-400K: An Egocentric Gameplay Dataset for World Models

Abstract EgoCS-400K is a large-scale egocentric Counter-Strike dataset that bridges passive web videos and costly real-world embodied data by providing temporally aligned video-action-language trajectories with detailed player states and game events. Generated by…

16
Hugging Face Daily Papers research 15d ago

RepSelect: Robust LLM Unlearning via Representation Selectivity

Abstract RepSelect isolates forget-set-specific representations in LLMs by collapsing top principal components of weight gradients, achieving deeper and more robust unlearning compared to existing methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Making large language models…

32
Hugging Face Daily Papers research 15d ago

RefGC-SR^2: Reference-guided Generated Content Super-Resolution and Refinement

Abstract A new reference-guided generated content super-resolution-refinement task is introduced that simultaneously recovers high-resolution details and refines generative artifacts using a frequency-aware diffusion transformer model. Generated by…

32
Hugging Face Daily Papers research 15d ago

Text-Vision Co-Instructed Image Editing

Abstract A unified text-visual image editing framework is presented that combines semantic intent from textual instructions with spatial guidance from visual prompts to achieve more precise and faithful image manipulation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing…

16
Hugging Face Daily Papers research 15d ago

Learning from the Self-future: On-policy Self-distillation for dLLMs

Abstract d-OPSD introduces a novel on-policy self-distillation framework for diffusion language models by adapting self-teacher construction and supervision mechanisms to match the non-autoregressive nature of diffusion models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

29
Hugging Face Daily Papers research 15d ago

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

Abstract Parallel loop Transformers achieve better code generation performance with two loops due to refined representations, while additional loops cause diminishing returns and increased positional mismatch costs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Looped…

5
Hugging Face Daily Papers research 15d ago

Rethinking the Role of Efficient Attention in Hybrid Architectures

Abstract Hybrid architectures combining full attention with efficient attention modules like sliding-window attention exhibit distinct scaling behaviors and optimization trajectories, with efficient attention primarily affecting the emergence speed of long-context capabilities…

29
Hugging Face Daily Papers research 15d ago

Variable-Width Transformers

Abstract A novel transformer architecture with nonuniform width allocation across layers achieves better performance and efficiency compared to uniform designs by utilizing a parameter-free residual resizing mechanism. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Scaling model…

5
Hugging Face Daily Papers research 15d ago

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Abstract The ChLogic benchmark reveals persistent performance gaps between English and Chinese logical reasoning in large language models, influenced by surface realization differences and translation artifacts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models…

37
Hugging Face Daily Papers research 15d ago

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

Abstract MemSlides presents a hierarchical memory framework for personalized presentation agents that separates long-term user profiles, working memory for session constraints, and tool memory for reusable execution experiences to enable stable personalization and reliable local…

21
Hugging Face Daily Papers research 15d ago

MotionVLA: Vision-Language-Action Model for Humanoid Motion

Abstract A dual-stream frequency tokenizer and autoregressive model are proposed to improve humanoid motion generation by separately encoding pose and physical dynamics, achieving better diversity and consistency compared to single-codebook approaches. Generated by…

11
Hugging Face Daily Papers research 15d ago

Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Abstract Spectral Forcing, a time-conditional 2D-DCT low-pass operator, improves diffusion model efficiency by explicitly separating signal from noise in pixel-space models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Pixel-space diffusion models are trained on full-bandwidth…

32
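The summary describes Spectral Forcing as a time-conditional 2D-DCT low-pass operator. A minimal sketch of a generic operator of that kind follows, assuming an orthonormal DCT-II, a diagonal cutoff on the frequency index sum u + v, and a linear schedule in a noise level t in [0, 1]. The cutoff shape, the schedule, and all function names are hypothetical illustrations, not the paper's operator.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows (C @ C^T = I)."""
    mat = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        mat.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n)])
    return mat

def matmul(a, b):
    """Plain-Python matrix product of an (n x m) and an (m x p) list of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dct2(x):
    """2D DCT-II of an h x w image: C_h @ x @ C_w^T."""
    ch, cw = dct_matrix(len(x)), dct_matrix(len(x[0]))
    return matmul(matmul(ch, x), transpose(cw))

def idct2(c):
    """Inverse of the orthonormal 2D DCT-II: C_h^T @ c @ C_w."""
    ch, cw = dct_matrix(len(c)), dct_matrix(len(c[0]))
    return matmul(matmul(transpose(ch), c), cw)

def cutoff_for_t(t, size):
    """Hypothetical linear schedule: higher noise level t keeps fewer frequencies."""
    return max(1, round((1.0 - t) * size))

def spectral_low_pass(x, t):
    """Zero every DCT coefficient with u + v >= cutoff(t), then invert."""
    h, w = len(x), len(x[0])
    cut = cutoff_for_t(t, h + w - 1)  # h + w - 1 distinct values of u + v
    coeffs = dct2(x)
    filtered = [[coeffs[u][v] if u + v < cut else 0.0 for v in range(w)]
                for u in range(h)]
    return idct2(filtered)
```

At t = 0 every coefficient survives and the image is returned unchanged; at t = 1 only the DC term survives, so every pixel becomes the image mean (2.5 for [[1, 2], [3, 4]]).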
Hugging Face Daily Papers research 15d ago

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Abstract A unified Vision-Language-Action pretraining framework leverages heterogeneous data sources including human egocentric videos and robot trajectories through a reliability-aware training approach that improves performance on embodied AI tasks. Generated by…

6
Hugging Face Daily Papers research 15d ago

ProCUA-SFT Technical Report

Abstract Training computer-use agents using a large-scale synthetic dataset with automated task generation and verification achieves significantly improved performance on desktop interaction benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training computer-use agents…

4
Hugging Face Daily Papers research 15d ago

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Abstract Zone of Proximal Policy Optimization (ZPPO) improves knowledge distillation by using reformulated prompts that help students learn from both correct and incorrect responses, enhancing performance especially at smaller model sizes. Generated by…

32
Hugging Face Daily Papers research 15d ago

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

Abstract OPD-Evolver is a self-evolving agent framework that combines slow-fast co-evolution with on-policy self-distillation to enhance memory management and policy learning across multiple domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory has become a standard…

28
Hugging Face Daily Papers research 15d ago

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Abstract Research agents face significant challenges when evidence is in a different language than the query, with performance degrading even when gold evidence is provided directly. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep research agents are increasingly evaluated on…

28
Hugging Face Daily Papers research 15d ago

A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

Abstract Training instability in reinforcement learning with verifiable rewards is analyzed through token-level gradient dynamics, leading to a stable policy optimization method that updates only on positive-advantage completions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

20
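The summary says the proposed method updates only on positive-advantage completions. A minimal sketch of that filtering idea follows, assuming GRPO-style group-normalized advantages and one summed log-probability per completion. Both are assumptions rather than details from the paper, which analyzes token-level gradients. The helper names are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def positive_advantage_mask(rewards, eps=1e-6):
    """True for completions whose advantage is strictly positive ("winners")."""
    return [a > 0 for a in group_advantages(rewards, eps)]

def masked_policy_loss(logprobs, rewards, eps=1e-6):
    """Surrogate loss -A * log pi, averaged over positive-advantage completions only.

    logprobs: one (summed) log-probability per completion, a simplifying assumption.
    """
    advs = group_advantages(rewards, eps)
    kept = [(lp, a) for lp, a in zip(logprobs, advs) if a > 0]
    if not kept:
        return 0.0  # no winners in this group, so no update
    return -sum(a * lp for lp, a in kept) / len(kept)
```

With rewards [1, 0, 0, 1], only the first and last completions are kept. A group with identical rewards has zero advantages everywhere and produces no update.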
Hugging Face Daily Papers research 15d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Abstract End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive…

31
Hugging Face Daily Papers research 15d ago

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Abstract UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and…

19
Hugging Face Daily Papers research 15d ago

Looped World Models

Abstract Looped World Models introduce iterative latent state refinement through shared transformer blocks, achieving 100x parameter efficiency while adapting computational depth to prediction complexity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Current world models face a…

14
Hugging Face Daily Papers research 15d ago

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

Abstract A framework called TRIAGE is proposed to improve clinical early warning systems by training large language models to generate dialectical reasoning for continuous risk scoring with better calibration and interpretability. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

29
Hugging Face Daily Papers research 15d ago

LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching

Abstract LectūraAgents is a multi-agent framework that enables personalized learning through adaptive embodied teaching by mimicking professor-student interactions and generating coordinated teaching actions aligned with learner profiles. Generated by…

9
Hugging Face Daily Papers research 15d ago

Aligning Quantum Operators with Large Language Models

Abstract Large language models can be adapted to understand quantum operators by mapping unitary matrices into their latent space, enabling quantum circuit synthesis and language-conditioned gate constraint specification. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Can Large…

19
Hugging Face Daily Papers research 15d ago

Attacks on Machine-Text Detectors Retain Stylistic Fingerprints

Abstract Machine-text detection remains challenging in the face of evasion techniques, but stylistic features can provide a robust defense when analyzed across multiple documents rather than individual instances. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Despite considerable progress…

17
Hugging Face Daily Papers research 16d ago

You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

Abstract Temporal Difference in Vision (TDV) presents a novel self-supervised learning approach for video data that eliminates traditional inductive biases by leveraging causal relationships between past and future frames. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Progress in…

30
Hugging Face Daily Papers research 16d ago

Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

Abstract Track2View generates novel camera viewpoints from videos by using 3D point tracks to establish explicit spatiotemporal correspondences, achieving superior visual quality and camera accuracy compared to existing methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

9
Hugging Face Daily Papers research 16d ago

ExpRL: Exploratory RL for LLM Mid-Training

Abstract ExpRL uses human-written question-answer data as reward scaffolds to provide automated reinforcement learning priming for language models, outperforming traditional methods on math reasoning tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse reward reinforcement…

23
Hugging Face Daily Papers research 16d ago

Human Universal Grasping

Abstract A flow-matching model generates diverse human grasps from RGB-D images, enabling zero-shot robotic grasping with improved performance over existing methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Humans can grasp objects effortlessly, whereas multi-fingered robots…

25
Hugging Face Daily Papers research 16d ago

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

Abstract EgoPhys enables deformable digital twin generation from egocentric RGB video by using generalizable priors and compact codebooks to predict dense spring stiffness fields without per-spring optimization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Humans naturally…

33
Hugging Face Daily Papers research 16d ago

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Abstract Sparse autoencoders exhibit feature stability patterns in which stable features carry most of the predictive signal, while unstable features, though not individually reproducible, reflect reproducible low-dimensional structure. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse…

13
Hugging Face Daily Papers research 16d ago

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Abstract LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving high performance with reduced computational latency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-Language-Action models (VLAs)…

33
Hugging Face Daily Papers research 16d ago

MVEB: Massive Video Embedding Benchmark

Abstract A large-scale video embedding benchmark evaluates diverse models across multiple video understanding tasks, revealing that different model architectures excel in specific domains and demonstrating the nuanced impact of audio on performance based on dataset…

7