Hugging Face Daily Papers


Hugging Face Daily Papers research 9d ago

HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

Abstract HAKARI-Bench provides a lightweight benchmark for comparing retrieval methods across multiple configurations and languages, enabling efficient model selection and performance analysis. Generated by Qwen/Qwen2.5-Coder-32B-Instruct With the rapid spread of…

23
Hugging Face Daily Papers research 9d ago

AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction

Abstract AOHP presents an Android-based operating system framework that treats AI agents as first-class entities, enhancing task completion rates and reducing execution costs through specialized agent-oriented mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI agents…

16
Hugging Face Daily Papers research 9d ago

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Abstract The Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with the DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured…

19
Hugging Face Daily Papers research 9d ago

UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation

Abstract UniverSat introduces a Universal Patch Encoder for Vision Transformers that enables robust, sensor-agnostic spatial feature extraction across diverse Earth Observation data types. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision Transformers (ViT) dominate computer…

6
Hugging Face Daily Papers research 9d ago

FastMix: Fast Data Mixture Optimization via Gradient Descent

Abstract FASTMIX automates optimal data mixture discovery during training by formulating mixture selection as a bilevel optimization problem that jointly optimizes mixture coefficients and model parameters through iterative updates. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

19
Hugging Face Daily Papers research 9d ago

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Abstract A principled synthesis engine generates high-quality terminal-agent tasks through a multi-dimensional capability taxonomy and evidence-guided research, creating a distilled dataset that enables significant performance gains in LLM training. Generated by…

5
Hugging Face Daily Papers research 9d ago

PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

Abstract PoLAR introduces a geometrically structured latent action representation in hyperbolic space that separates transition extent from transition mode, improving robotic policy learning performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Latent action pretraining…

12
Hugging Face Daily Papers research 9d ago

Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Abstract DR-MV3D presents a map-grounded learning framework with dense rewards to improve multi-view 3D visual question answering through global map construction, view-trajectory planning, and egocentric grounding. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-view 3D…

15
Hugging Face Daily Papers research 9d ago

Causal Discovery in the Era of Agents

Abstract Language models should assist causal discovery workflows by providing contextual support and explanations rather than generating causal conclusions, as demonstrated through a platform that integrates data analysis and expert knowledge. Generated by…

31
Hugging Face Daily Papers research 9d ago

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Abstract EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Enterprise agents…

30
Hugging Face Daily Papers research 9d ago

PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

Abstract PolicyTrim is a reinforcement learning-based framework that enhances VLA model efficiency by extending reliable action chunk lengths and reducing redundant physical steps through dynamic exploration and redundancy-aware rewards. Generated by…

25
Hugging Face Daily Papers research 9d ago

Safe Few-Step Generation via Velocity Editing

Abstract VESFlow is a training-free safety method for flow matching-based text-to-image generation that edits velocity fields to ensure safe output while maintaining prompt integrity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Flow matching has recently emerged as a strong…

16
Hugging Face Daily Papers research 9d ago

Tmax: A simple recipe for terminal agents

Abstract A novel RL training approach for terminal agents achieves superior performance using a simplified recipe and expanded dataset, enabling effective training with fewer parameters than previous methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Terminal-using agents…

36
Hugging Face Daily Papers research 9d ago

OpenRath: Session-Centered Runtime State for Agent Systems

Abstract OpenRath introduces a PyTorch-like programming model for multi-agent systems using Session as a central runtime abstraction that enables explicit fork, merge, and replay operations while recording comprehensive execution state. Generated by…

21
Hugging Face Daily Papers research 9d ago

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Abstract Trajectory-Augmented Policy Optimization (TAPO) enhances large language model reasoning by creating explicit corrective trajectories that preserve erroneous reasoning while incorporating natural-language diagnoses and corrections, outperforming traditional…

31
Hugging Face Daily Papers research 9d ago

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Abstract Large language models can be trained through reinforcement learning to develop a meta-capability enabling continuous learning and adaptation across long sequences of tasks in dynamic environments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This work presents a general…

31
Hugging Face Daily Papers research 9d ago

PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Abstract PlanBench-XL evaluates large language model agents' ability to plan and adapt in complex tool-rich environments with limited visibility and dynamic disruptions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct LLM agents increasingly operate in large tool ecosystems, where…

10
Hugging Face Daily Papers research 9d ago

World Action Models: A Survey

Abstract World Action Models are predictive-action systems that generate future states for decision-making, with designs balancing representational richness against computational constraints. Generated by Qwen/Qwen2.5-Coder-32B-Instruct World Action Models (WAMs) are embodied…

30
Hugging Face Daily Papers research 9d ago

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Abstract HydraHead is a novel attention hybridization architecture that combines Full Attention and Linear Attention at the head level, achieving superior long-context performance with reduced training overhead through interpretability-driven selection and scale-normalized…

34
Hugging Face Daily Papers research 9d ago

KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking

Abstract KaLM-Reranker-V1 is a fast reranker that decouples query and passage computation using an encoder-decoder architecture with Matryoshka embedding pooling and cross-attention for efficient relevance modeling. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As retrieval systems…

32
Hugging Face Daily Papers research 9d ago

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Abstract Reinforcement learning approaches for improving LLM reasoning capabilities are enhanced by a Bayesian Manifold Curriculum framework that structures problem sampling based on task manifold relationships and endogenous non-stationarity. Generated by…

20
Hugging Face Daily Papers research 9d ago

EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory

Abstract EvoEmbedding is a dynamic embedding model that generates adaptive representations by maintaining a continuously updated latent memory, enabling improved retrieval performance in long-context scenarios. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing embedding…

32
Hugging Face Daily Papers research 9d ago

Exploring the Design Space of Reward Backpropagation for Flow Matching

Abstract FlowBP addresses limitations in flow matching model alignment by using a surrogate trajectory framework that reduces memory usage and gradient chaining while maintaining performance across multiple text-to-image models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

23
Hugging Face Daily Papers research 9d ago

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Abstract Search agents face challenges in real-world evaluation due to limited benchmarks and coarse metrics, necessitating more nuanced assessment approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search Agents (SAs) typically leverage large language models (LLMs) to…

14
Hugging Face Daily Papers research 9d ago

CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks

Abstract Calibrated verifier telemetry enhances LLM agents in knowledge-intensive question answering by providing confidence scores and grounding verification, reducing both over-retrieval and unsupported answers. Generated by Qwen/Qwen2.5-Coder-32B-Instruct LLM agents in…

7
Hugging Face Daily Papers research 9d ago

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Abstract The PhySciBench benchmark reveals limited performance of current LLM agents in physical science research, leading to the development of the DelveAgent framework, which improves accuracy through modular design and physics-grounded mechanisms. Generated by…

5
Hugging Face Daily Papers research 9d ago

When, Where, and How: Adaptive Binning for Tabular Self-Supervised Learning

Abstract Adaptive Binning introduces a training-adaptive discretization method for self-supervised learning on medical tabular data, improving representation learning through feature-wise refinement and heterogeneous feature handling. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

13
Hugging Face Daily Papers research 9d ago

Characterizing Narrative Content in Web-scale LLM Pretraining Data

Abstract A comprehensive analysis of narrative structures in large-scale language model training data reveals measurable, multidimensional narrative patterns that vary across different content sources and topics. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The narrative…

21
Hugging Face Daily Papers research 9d ago

SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

Abstract SproutRAG is an attention-guided hierarchical retrieval-augmented generation framework that organizes sentence-level chunks into semantically coherent units using learned inter-sentence attention, enabling multi-granularity retrieval without additional LLM calls or…

33
Hugging Face Daily Papers research 9d ago

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

Abstract MCompassRAG enhances retrieval-augmented generation by using topic-level metadata to guide chunk selection, improving both efficiency and precision in complex research tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Retrieval-augmented generation (RAG) systems…

32
Hugging Face Daily Papers research 10d ago

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Abstract Multimodal large language models exhibit social bias driven by specific visual attributes, with fashion style and socioeconomic cues having the greatest impact on model judgments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal large language models (MLLMs) are…

37
Hugging Face Daily Papers research 10d ago

Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models

Abstract Reflective Masking enables iterative local refinement in Mask Diffusion Models through lightweight post-training, supporting multi-turn reasoning without architectural changes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While reasoning on autoregressive (AR) models is…

26
Hugging Face Daily Papers research 10d ago

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Abstract GeneralVLA-2 addresses limitations in vision-language-action systems by introducing GeoFuse-MV3D for improved 3D reconstruction and an enhanced KnowledgeBank for better memory management in robotic manipulation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

32
Hugging Face Daily Papers research 10d ago

SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

Abstract SpatialAvatar-0 enables high-quality 4D head avatar generation by combining feed-forward prediction with per-subject refinement through a shared Gaussian representation, achieving superior performance across multiple benchmarks. Generated by…

20
Hugging Face Daily Papers research 10d ago

Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations

Abstract A novel approach for B2B conversation classification reduces token usage by 99% while improving performance and maintaining robustness as context length increases. Generated by Qwen/Qwen2.5-Coder-32B-Instruct In-context learning (ICL) is the standard method for…

8
Hugging Face Daily Papers research 10d ago

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Abstract Current memory agents lack reliable shared institutional deployment due to challenges in balancing utility, access control, and forgetting across multiple principals with diverse authorization contexts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory benchmarks for…

5
Hugging Face Daily Papers research 10d ago

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

Abstract A 3D brain MRI generative model uses a masked-autoencoder tokenizer to create clinically informative embeddings that support both medical task performance and controlled image generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Three-dimensional (3D) brain MRI is…

6
Hugging Face Daily Papers research 10d ago

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Abstract The WorldLines benchmark evaluates long-term memory in embodied agents through household scenarios, while the ObsMem framework addresses challenges in partial observability and memory translation for decision-making. Generated by Qwen/Qwen2.5-Coder-32B-Instruct To assist humans…

19
Hugging Face Daily Papers research 12d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Abstract LEDGERAGENT is a method for customer service agents that maintains task states in a separate ledger to improve policy adherence and state management during tool calling. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Policy-adherent tool-calling agents in customer-service…

36
Hugging Face Daily Papers research 12d ago

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Abstract PerceptionDLM enables efficient parallel region perception in multimodal diffusion language models through structured attention masking and efficient prompting, achieving faster inference without sacrificing caption quality. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

12
Hugging Face Daily Papers research 13d ago

Context-Aware RL for Agentic and Multimodal LLMs

Abstract ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks. Generated by…

21
Hugging Face Daily Papers research 13d ago

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

Abstract A comprehensive corpus and access layer for U.S. local ordinance codes has been developed to enable machine-readable legal AI research, addressing the lack of authoritative legal text at scale for local regulations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Progress…

4
Hugging Face Daily Papers research 13d ago

ReSyn: A Generalized Recursive Regular Expression Synthesis Framework

Abstract A divide-and-conquer framework named ReSyn enhances regex synthesis accuracy by decomposing complex problems, combined with a parameter-efficient synthesizer called Set2Regex that handles example permutation invariance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

27
Hugging Face Daily Papers research 13d ago

LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

Abstract LegalHalluLens audits AI systems in legal workflows by identifying specific error patterns and directional biases in hallucinations across different claim types, enabling more reliable deployment through targeted diagnostic and mitigation approaches. Generated by…

31
Hugging Face Daily Papers research 13d ago

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

Abstract ACIE, an agentic RAG system deployed in a clinical setting, demonstrates high accuracy in extracting medical information from complex patient contexts, achieving a 96.5% acceptance rate by nuclear-medicine physicians across 7,326 judgments. Generated by…

5
Hugging Face Daily Papers research 13d ago

The Data Manifold under the Microscope

Abstract A benchmarking framework is introduced to study data-manifold geometry by extending dSprites and COIL-20 datasets with additional transformation dimensions and dense sampling, enabling accurate estimation of curvature, reach, and volume for theoretical analysis and…

36
Hugging Face Daily Papers research 13d ago

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Abstract Analysis of FID variance across different training and sampling seeds reveals significant reproducibility issues in image generation evaluation, with retraining causing larger fluctuations than resampling, and recommends updated evaluation protocols with error bars and…

21
Hugging Face Daily Papers research 13d ago

Duration Aware Scheduling for ASR Serving Under Workload Drift

Abstract Duration-aware scheduling policies improve ASR serving latency by leveraging audio length as a predictor for processing time, with SJF and HRRN algorithms showing significant median latency reductions while maintaining throughput. Generated by…

26
Hugging Face Daily Papers research 13d ago

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Abstract Uniform 4-bit training with RHT-based quantization outperforms E2M1-based methods by eliminating shrinkage bias and improving training stability across large language model architectures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct FP4 training promises substantial…

31
Hugging Face Daily Papers research 13d ago

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Abstract Multi-LCB addresses the limitation of LiveCodeBench by providing a multi-language benchmark for evaluating LLMs across twelve programming languages while maintaining contamination controls and evaluation protocols. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

33
