News / #long-context Tag Long Context 228 articles archived under #long-context · RSS Sign in to follow arXiv — NLP / Computation & Language research 1mo ago Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks arXiv:2605.23170v1 Announce Type: new Abstract: Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11… 33 Hugging Face Daily Papers research 1mo ago Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps Abstract RTPurbo leverages intrinsic sparsity in full-attention LLMs to achieve efficient long-context inference with minimal training overhead, enabling significant speedups while maintaining near-lossless accuracy. AI-generated summary Long-context inference in large language… 10 arXiv — Machine Learning research 1mo ago EntmaxKV: Support-Aware Decoding for Entmax Attention arXiv:2605.21649v1 Announce Type: new Abstract: Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting… 26 arXiv — Machine Learning research 1mo ago Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents arXiv:2605.21768v1 Announce Type: new Abstract: Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session… 13 arXiv — NLP / Computation & Language research 1mo ago ACC: Compiling Agent Trajectories for Long-Context Training arXiv:2605.21850v1 Announce Type: new Abstract: Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents… 11 Hugging Face Daily Papers research 1mo ago Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention Abstract Gated DeltaNet-2 improves upon existing linear attention models by separating erase and write operations through distinct channel-wise gates, achieving superior performance in long-context language modeling and retrieval tasks. AI-generated summary Linear attention… 29 Hugging Face Daily Papers research 1mo ago ACC: Compiling Agent Trajectories for Long-Context Training Abstract Agent Context Compilation (ACC) enhances long-context reasoning in LLMs by converting multi-turn agent trajectories into structured QA pairs, enabling direct supervision of distant context integration without additional annotation. AI-generated summary Recent… 28 Hugging Face Daily Papers research 1mo ago Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs Abstract Mix-Quant is a phase-aware quantization framework that accelerates long-context, multi-turn LLM inference by applying high-throughput NVFP4 quantization to the prefilling phase while maintaining BF16 precision for decoding. AI-generated summary LLM agents have recently… 30 arXiv — NLP / Computation & Language research 1mo ago Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning arXiv:2605.20201v1 Announce Type: new Abstract: Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context --… 7 arXiv — NLP / Computation & Language research 1mo ago Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task arXiv:2605.20626v1 Announce Type: new Abstract: We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then… 8 Latent.Space news-outlet 1mo ago Railway: The Agent-Native Cloud — Jake Cooper 3M Users, 100K Signups/Week, Own-Metal Data Centers, $200K+ Coding Agent Spend, and the Death of PRs 21 Hugging Face Daily Papers research 1mo ago RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably Abstract Rotary Positional Embeddings in Transformer models lose locality bias and token relevance consistency as context length increases, leading to unpredictable attention patterns that cannot be mitigated by multi-head, multi-layer architectures. AI-generated summary We… 9 r/LocalLLaMA community 1mo ago RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round. Three configs, tested at real coding-agent context lengths (not just 512 tokens). The main finding surprised me. TL;DR: 35B Q4_K_XL, no MTP,… 38 Hugging Face Daily Papers research 1mo ago Context Memorization for Efficient Long Context Generation Abstract Attention-state memory enables efficient long-prefix inference by storing precomputed attention states in lightweight memory, improving accuracy and reducing latency compared to traditional methods. AI-generated summary Modern large language model (LLM) applications… 13 Hugging Face Daily Papers research 1mo ago PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents Abstract PEEK enables large language model agents to efficiently reuse orientation knowledge about recurring external contexts through a persistent context map that reduces computational costs and improves performance. AI-generated summary Large language model (LLM) agents… 4 arXiv — Machine Learning research 1mo ago Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery arXiv:2605.18854v1 Announce Type: new Abstract: Coding agents accumulate extensive context during long-running tasks, yet fixed context windows force practitioners to choose between truncation and task failure. While numerous memory condensation strategies have been proposed,… 12 arXiv — Machine Learning research 1mo ago SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference arXiv:2605.18856v1 Announce Type: new Abstract: Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods… 22 arXiv — Machine Learning research 1mo ago KVBuffer: IO-aware Serving for Linear Attention arXiv:2605.19049v1 Announce Type: new Abstract: Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by… 28 arXiv — NLP / Computation & Language research 1mo ago GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment arXiv:2605.19577v1 Announce Type: new Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter… 17 arXiv — NLP / Computation & Language research 1mo ago OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond arXiv:2605.19660v1 Announce Type: cross Abstract: The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel… 30 arXiv — NLP / Computation & Language research 1mo ago PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents arXiv:2605.19932v1 Announce Type: cross Abstract: Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory,… 22 Hugging Face Daily Papers research 1mo ago GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment Abstract GoLongRL presents an open-source approach for long-context reinforcement learning with diverse reward optimization through capability-oriented data construction and TMN-Reweight methodology. AI-generated summary We present GoLongRL, a fully open-source,… 37 arXiv — Machine Learning research 1mo ago ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference arXiv:2605.16360v1 Announce Type: new Abstract: Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision… 13 arXiv — Machine Learning research 1mo ago SE-GA: Memory-Augmented Self-Evolution for GUI Agents arXiv:2605.16883v1 Announce Type: new Abstract: Autonomous Graphical User Interface (GUI) agents often struggle with multi-step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work… 26 arXiv — NLP / Computation & Language research 1mo ago CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection arXiv:2605.16839v1 Announce Type: new Abstract: Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed… 31 arXiv — NLP / Computation & Language research 1mo ago Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps arXiv:2605.16928v1 Announce Type: new Abstract: Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an… 13 arXiv — NLP / Computation & Language research 1mo ago KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference arXiv:2605.18071v1 Announce Type: new Abstract: Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding,… 28 arXiv — NLP / Computation & Language research 1mo ago Context Memorization for Efficient Long Context Generation arXiv:2605.18226v1 Announce Type: new Abstract: Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the… 25 r/LocalLLaMA community 1mo ago Is there any <3B model with usable 200k+ context window? I need a small model for processing conversation transcripts from larger models, so need usable context window out to at least 200k tokens. I know some models claim to support this, but I don’t know which are actually good at this in practice. Also desirable: low hallucination… 15 r/LocalLLaMA community 1mo ago Configuration Qwen3.6-35b-a3b (12Gb VRAM) Has anyone here tested different KV cache quantizations and compared their performance? I’m currently using the model in Q5_K_M with Q4 KV cache on a 12 GB VRAM GPU. With this setup, I’m offloading about 27 MoE layers to the CPU and getting around 40 tok/s with a 128k total… 38 arXiv — Machine Learning research 1mo ago DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts arXiv:2605.15422v1 Announce Type: new Abstract: Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both… 36 arXiv — NLP / Computation & Language research 1mo ago RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably arXiv:2605.15514v1 Announce Type: new Abstract: We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its… 29 arXiv — NLP / Computation & Language research 1mo ago Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation arXiv:2605.15913v1 Announce Type: new Abstract: Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG).… 29 arXiv — NLP / Computation & Language research 1mo ago RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents arXiv:2605.16045v1 Announce Type: new Abstract: Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents by overcoming the limited context windows of LLMs. However, existing memory systems invoke LLMs to process… 31 r/LocalLLaMA community 1mo ago Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and… 4 r/LocalLLaMA community 1mo ago Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4 CPU is just a secondhand 10900x. Using 128k context, unquantized kv cache. Model is at q8_0 to mitigate some weird behavior I was seeing at lower quants. Speed is very slow at around 50tps pp, 10tps tg, but usable for coding agent workflows. Anybody else running MoE models in… 22 r/LocalLLaMA community 1mo ago Developers who use local AI - Q4_0 vs Q8_0 KV quant? I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territory. I don't really need a full study,… 25 r/LocalLLaMA community 1mo ago Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090 Saw some posts around PP being slower, so they were cautious on trying it. Here's a real-world datapoint. Settings: Headless RTX 3090 24G OpenCode Model unsloth's Qwen3.6-27B-MTP-Q4_K_M.gguf 128k context q8_0 kv cache --spec-draft-n-max: 3 --draft-p-min: 0 Use Cases: Research… 8 r/LocalLLaMA community 1mo ago Deepseek V4's 1M context window: the breaking point Just ran to verify deepseek v4's context claim of 1M and ran it across three production codebases like 45k (microservice), 180k (monorepo backend) and 520k(full stack app). For the observation, tasks included dependency tracing, cross file refractors and bug isolation to see… 13 Ahead of AI (Sebastian Raschka) research 1mo ago Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention From Gemma 4 to DeepSeek V4, How New Open-Weight LLMs Are Reducing Long-Context Costs 22 r/LocalLLaMA community 1mo ago RAG on Snapdragon X2 Laptop, 200K documents. Qualcomm recently released the new 𝐒𝐧𝐚𝐩𝐝𝐫𝐚𝐠𝐨𝐧 𝐗2 𝐥𝐚𝐩𝐭𝐨𝐩 𝐜𝐡𝐢𝐩𝐬𝐞𝐭. I immediately ordered one: ASUS Zenbook A16 16" 3K OLED Touchscreen Laptop — Snapdragon X2 Elite Extreme (2026) A few things I really like about this machine: 𝐄𝐱𝐭𝐫𝐞𝐦𝐞𝐥𝐲 𝐥𝐢𝐠𝐡𝐭.… 26 r/LocalLLaMA community 1mo ago Gemma4 26b MoE running in MLX with turboquant (and custom kernel) TL;DR I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turbo quant and rotating KV cache support. The answer was yes, and I'm now able to run Gemma4 26b on my MacBook Air M5 at 128k context with 4 concurrent batches 😄 At 8k context… 12 Hugging Face Daily Papers research 1mo ago Long Context Pre-Training with Lighthouse Attention Abstract Lighthouse Attention enables efficient training of causal transformers at long sequences by using hierarchical selection-based attention that reduces computational complexity while maintaining model performance. AI-generated summary Training causal transformers at… 33 r/LocalLLaMA community 1mo ago Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version) In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests. The project was a test - building a full iterative step-by-step pygame; a small mystery dungeon-style game. At first I set 100-200k context… 28 arXiv — NLP / Computation & Language research 1mo ago EndPrompt: Efficient Long-Context Extension via Terminal Anchoring arXiv:2605.14589v1 Announce Type: new Abstract: Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to… 14 Hugging Face Daily Papers research 1mo ago MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models Abstract A new benchmark evaluates memory capabilities in vision-language models through multi-session conversations, revealing limitations of both long-context and memory-augmented approaches. AI-generated summary Memory is essential for large vision-language models (LVLMs) to… 25 Hugging Face Daily Papers research 1mo ago RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation Abstract RealICU benchmark evaluates large language models for ICU decision support using hindsight-annotated patient trajectories, revealing limitations in clinical recommendation accuracy and early interpretation bias. AI-generated summary Intensive care units (ICU) generate… 32 Hugging Face Daily Papers research 1mo ago MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading Abstract MemReread addresses long-context reasoning challenges by avoiding intermediate retrieval and employing question decomposition with rereading to recover discarded information, maintaining linear time complexity. AI-generated summary To tackle long-context reasoning tasks… 38 Hugging Face Daily Papers research 1mo ago Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context Abstract Long-context continued pre-training enhances vision-language models' ability to handle extended documents while maintaining performance across diverse contexts through strategic data mixture design. AI-generated summary Long-context modeling is becoming a core… 24 r/LocalLLaMA community 1mo ago 24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M models, 128k context): Model tok/s Key… 19 Page 4 of 5 · 228 articles ← Newer Older →