Hugging Face Daily Papers


Hugging Face Daily Papers research 21d ago

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Abstract Researchers propose a novel router redesign for Mixture-of-Experts models that aligns router rows with the principal singular directions of expert matrices using Manifold Power Iteration to improve model effectiveness. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

33
Hugging Face Daily Papers research 21d ago

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Abstract A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

16
Hugging Face Daily Papers research 21d ago

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Abstract A new benchmark called ComBench is introduced to evaluate large language models' combinatorial reasoning abilities through Olympiad-level problems that test both proof construction and explicit mathematical constructions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

37
Hugging Face Daily Papers research 21d ago

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Abstract An AI framework called Arbor enables autonomous scientific research by combining strategic coordination, isolated hypothesis testing, and a persistent knowledge tree to iteratively improve research outcomes across multiple domains.

18
Hugging Face Daily Papers research 21d ago

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Abstract TRL-Bench establishes a standardized benchmark for evaluating tabular representation learning models across multiple granularities, revealing that encoder performance varies by task type and requires capability-specific assessment rather than single leaderboard…

6
Hugging Face Daily Papers research 21d ago

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Abstract A recursive automated composition framework enables scalable reinforcement learning for language models by automatically combining verifiable environments through compositional operators. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

11
Hugging Face Daily Papers research 21d ago

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Abstract A teacher-student framework decouples complex reasoning from efficient reward deployment in text-to-image training, achieving superior preference accuracy and optimization performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

22
Hugging Face Daily Papers research 21d ago

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

Abstract Large language model agents require specialized environments for training and evaluation, which can be categorized by their engineering lifecycle stages and evolved through various paradigms including neural and symbolic approaches.

8
Hugging Face Daily Papers research 21d ago

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Abstract Embodied-R1.5 is a unified embodied foundation model that integrates embodied reasoning capabilities and achieves state-of-the-art performance on embodied vision-language benchmarks through a multi-task balanced reinforcement learning approach.

35
Hugging Face Daily Papers research 21d ago

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Abstract InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

18
Hugging Face Daily Papers research 21d ago

i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

Abstract A comprehensive experimental study of text-to-image diffusion models reveals key design choices and training insights leading to the development of i1, a 3B-parameter model that matches leading performance while maintaining full openness.

21
Hugging Face Daily Papers research 21d ago

World Model Self-Distillation: Training World Models to Solve General Tasks

Abstract A scalable framework combines self-distillation and reinforcement learning to transfer task-solving abilities from vision-language models to video diffusion models without requiring labeled task-video data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

15
Hugging Face Daily Papers research 21d ago

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Abstract Bebop addresses the efficiency bottleneck in reinforcement learning training of large language models by optimizing multi-token prediction techniques through entropy-aware sampling and novel training objectives that improve acceptance rates and inference throughput.

28
Hugging Face Daily Papers research 21d ago

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Abstract World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving superior performance in zero-shot out-of-distribution manipulation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

10
Hugging Face Daily Papers research 21d ago

Lius: Translation Model Based Instructional Lingustic Using Continual Instruction Tuning In Kupang Malay

Abstract Continual Instruction Tuning enables effective fine-tuning of large language models for low-resource language translation, achieving superior performance compared to standard instruction tuning and multilingual models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

4
Hugging Face Daily Papers research 21d ago

DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch

Abstract A large-scale dataset called DeNovoSWE is introduced for training code agents to generate entire software repositories from documentation, significantly improving performance on long-horizon software engineering tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

15
Hugging Face Daily Papers research 21d ago

ICA Lens: Interpreting Language Models Without Training Another Dictionary

Abstract Independent component analysis (ICA) is revived as an efficient method for discovering interpretable directions in language model representations, offering a faster alternative to sparse autoencoder training while maintaining competitive performance in probing tasks.

22
Hugging Face Daily Papers research 22d ago

PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf

Abstract A human-centered writing assistant system called PaperMentor integrates expert research advice with specialized agents to provide actionable feedback during manuscript drafting, outperforming AI baselines in usability and relevance.

38
Hugging Face Daily Papers research 22d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing.

38
Hugging Face Daily Papers research 22d ago

In-Context Multiple Instance Learning

Abstract Pretraining a Perceiver-style architecture on synthetic bag-structured data enables efficient, task-adaptive classification from few labeled examples in multiple instance learning scenarios. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

10
Hugging Face Daily Papers research 22d ago

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

Abstract Video generative models achieve improved long-range consistency through coarse-to-fine token generation using a multi-scale autoencoder and diffusion model architecture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

28
Hugging Face Daily Papers research 22d ago

Decentralized Multi-Agent Systems with Shared Context

Abstract The Decentralized Language Models (DeLM) framework enables scalable large language model reasoning through parallel agents that asynchronously coordinate via a shared verified context, improving performance and efficiency over centralized approaches.

25
Hugging Face Daily Papers research 22d ago

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

Abstract SkillHarm is a benchmark for evaluating skill-based attacks across the skill-use lifecycle, demonstrating significant vulnerabilities in current agents with attack success rates up to 86.3%. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

36
Hugging Face Daily Papers research 22d ago

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Abstract The CapCode framework uses randomized testing with performance caps to detect and prevent shortcut exploitation in agent evaluation, while CapReward rewards systems that adhere to intended task specifications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

21
Hugging Face Daily Papers research 22d ago

The Role of Feedback Alignment in Self-Distillation

Abstract Self-distillation effectiveness depends on structural alignment between feedback and solver reasoning, with step-aligned critique outperforming binary rewards and reference solutions by targeting specific reasoning failures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

32
Hugging Face Daily Papers research 22d ago

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Abstract Next Forcing introduces a multi-chunk prediction framework that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

19
Hugging Face Daily Papers research 22d ago

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Abstract FadeMem introduces a distance-aware key-value memory consolidation mechanism that organizes historical video data into a temporal hierarchy, improving long-video generation by preserving recent context and long-range anchors under fixed cache constraints.

36
Hugging Face Daily Papers research 22d ago

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Abstract CPPO addresses limitations in reinforcement learning with verifiable rewards by introducing position-weighted thresholds and cumulative prefix budgeting to better handle autoregressive generation challenges. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

12
Hugging Face Daily Papers research 22d ago

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Abstract Sparse autoencoders trained on language model representations reveal interpretable features for speech synthesis that can be manipulated to control linguistic and prosodic attributes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

19
Hugging Face Daily Papers research 22d ago

Kwai Keye-VL-2.0 Technical Report

Abstract Kwai Keye-VL-2.0-30B-A3B is an open-source Mixture-of-Experts multimodal foundation model that enables long-video understanding and agentic intelligence through DeepSeek Sparse Attention and specialized training infrastructure.

36
Hugging Face Daily Papers research 22d ago

IR3DE: A Linear Router for Large Language Models

Abstract A ridge regression-based routing method achieves competitive performance in selecting domain-expert LLMs for different tasks while enabling dynamic addition/removal of experts without retraining. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

28
Hugging Face Daily Papers research 22d ago

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

Abstract A psychologically-informed refusal framework called PsychoSafe is developed for large language models to improve harmful request handling through structured supportive communication, showing enhanced refusal quality and resource referral while maintaining performance on…

14
Hugging Face Daily Papers research 22d ago

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

Abstract BrainSurgery is a tool for robust and reproducible tensor manipulation of neural network checkpoints through declarative YAML plans with built-in validation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

12
Hugging Face Daily Papers research 22d ago

UniPET: a universal network for high-quality PET image denoising across varied dose reduction factors

Abstract A universal PET image denoising framework addresses variability in dose reduction factors through domain generalization techniques and region-aware learning strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

26
Hugging Face Daily Papers research 22d ago

U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training

Abstract A novel U-shaped deep learning model with test-time training layers and dual-domain adaptation mechanisms achieves robust PET image denoising under distribution shifts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

32
Hugging Face Daily Papers research 22d ago

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Abstract The Role-Agent framework enables LLM agents to function as both agent and environment through bootstrapped co-evolution, improving performance via environment-aware reasoning and targeted practice. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

33
Hugging Face Daily Papers research 22d ago

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Abstract Research reveals that vision and text tokens in multimodal models evolve asynchronously, leading to inefficient computation; a new asymmetric routing framework reduces visual processing overhead while maintaining performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

9
Hugging Face Daily Papers research 22d ago

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Abstract MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead.

33
Hugging Face Daily Papers research 22d ago

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Abstract Current AI agents struggle with long-horizon professional GUI workflows, achieving low success rates due to issues with workflow consistency and domain-specific software understanding. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

15
Hugging Face Daily Papers research 22d ago

Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Abstract Retrospective Harness Optimization (RHO) is a self-supervised method that improves AI agent performance by optimizing the agent harness using only past trajectories through diverse task selection, parallel re-solving, and self-validation techniques.

8
Hugging Face Daily Papers research 22d ago

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Abstract An autoregressive diffusion method for video-to-video lip synchronization achieves real-time performance through distillation and optimized inference schedules. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

29
Hugging Face Daily Papers research 22d ago

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Abstract Flow-DPPO replaces ratio clipping with divergence proximal constraints in flow matching models, improving training stability and multi-objective optimization through exact KL divergence computation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

34
Hugging Face Daily Papers research 22d ago

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Abstract A multi-agent framework automates data journalism by generating evidence-grounded, multimodal news stories while maintaining transparency and verifiability. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

10
Hugging Face Daily Papers research 22d ago

WorldOlympiad: Can Your World Model Survive a Triathlon?

Abstract WorldOlympiad presents a comprehensive benchmark for evaluating video-based world models across physical faithfulness, geometric consistency, and interaction fidelity, revealing significant gaps in current generative models' capabilities.

13
Hugging Face Daily Papers research 22d ago

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Abstract Chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, but a training-free method called QK-Restore can restore long-context capabilities by reverting query-key…

8
Hugging Face Daily Papers research 22d ago

Rethinking the Divergence Regularization in LLM RL

Abstract DRPO improves LLM reinforcement learning stability by replacing hard masks with smooth regularization that provides continuous gradient corrections beyond trust-region boundaries. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

29
Hugging Face Daily Papers research 22d ago

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Abstract EEVEE is a novel test-time prompt learning framework for LLM agents that handles heterogeneous data streams through task clustering and co-evolving router-prompt optimization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

6
Hugging Face Daily Papers research 22d ago

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Abstract A large language model trained on synthesized delegation intelligence achieves superior performance on long-horizon research tasks through task decomposition and subagent coordination. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

12
Hugging Face Daily Papers research 22d ago

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Abstract Latent Memory introduces a compressed representation approach for external memory in question answering, reducing token consumption and storage requirements while maintaining competitive performance across text-only and multimodal benchmarks.

28
Hugging Face Daily Papers research 22d ago

Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

Abstract Struct-Searcher introduces a belief revision theory-based structural agentic workflow for multimodal information seeking that improves accuracy over existing vision-language models and deep research agents. Generated by Qwen/Qwen2.5-Coder-32B-Instruct.

17
