Tag

Long Context

228 articles archived under #long-context

arXiv — NLP / Computation & Language research 1mo ago

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

arXiv:2605.23170v1 Announce Type: new Abstract: Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11…

33
Hugging Face Daily Papers research 1mo ago

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Abstract RTPurbo leverages intrinsic sparsity in full-attention LLMs to achieve efficient long-context inference with minimal training overhead, enabling significant speedups while maintaining near-lossless accuracy. AI-generated summary Long-context inference in large language…

10
arXiv — Machine Learning research 1mo ago

EntmaxKV: Support-Aware Decoding for Entmax Attention

arXiv:2605.21649v1 Announce Type: new Abstract: Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting…

26
arXiv — Machine Learning research 1mo ago

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

arXiv:2605.21768v1 Announce Type: new Abstract: Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session…

13
arXiv — NLP / Computation & Language research 1mo ago

ACC: Compiling Agent Trajectories for Long-Context Training

arXiv:2605.21850v1 Announce Type: new Abstract: Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents…

11
Hugging Face Daily Papers research 1mo ago

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Abstract Gated DeltaNet-2 improves upon existing linear attention models by separating erase and write operations through distinct channel-wise gates, achieving superior performance in long-context language modeling and retrieval tasks. AI-generated summary Linear attention…

29
Hugging Face Daily Papers research 1mo ago

ACC: Compiling Agent Trajectories for Long-Context Training

Abstract Agent Context Compilation (ACC) enhances long-context reasoning in LLMs by converting multi-turn agent trajectories into structured QA pairs, enabling direct supervision of distant context integration without additional annotation. AI-generated summary Recent…

28
Hugging Face Daily Papers research 1mo ago

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Abstract Mix-Quant is a phase-aware quantization framework that accelerates long-context, multi-turn LLM inference by applying high-throughput NVFP4 quantization to the prefilling phase while maintaining BF16 precision for decoding. AI-generated summary LLM agents have recently…

30
arXiv — NLP / Computation & Language research 1mo ago

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

arXiv:2605.20201v1 Announce Type: new Abstract: Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context --…

7
arXiv — NLP / Computation & Language research 1mo ago

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

arXiv:2605.20626v1 Announce Type: new Abstract: We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then…

8
Latent.Space news-outlet 1mo ago

Railway: The Agent-Native Cloud — Jake Cooper

3M Users, 100K Signups/Week, Own-Metal Data Centers, $200K+ Coding Agent Spend, and the Death of PRs

21
Hugging Face Daily Papers research 1mo ago

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

Abstract Rotary Positional Embeddings in Transformer models lose locality bias and token relevance consistency as context length increases, leading to unpredictable attention patterns that cannot be mitigated by multi-head, multi-layer architectures. AI-generated summary We…

9
r/LocalLLaMA community 1mo ago

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help

MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round. Three configs, tested at real coding-agent context lengths (not just 512 tokens). The main finding surprised me. TL;DR: 35B Q4_K_XL, no MTP,…

38
Hugging Face Daily Papers research 1mo ago

Context Memorization for Efficient Long Context Generation

Abstract Attention-state memory enables efficient long-prefix inference by storing precomputed attention states in lightweight memory, improving accuracy and reducing latency compared to traditional methods. AI-generated summary Modern large language model (LLM) applications…

13
Hugging Face Daily Papers research 1mo ago

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Abstract PEEK enables large language model agents to efficiently reuse orientation knowledge about recurring external contexts through a persistent context map that reduces computational costs and improves performance. AI-generated summary Large language model (LLM) agents…

4
arXiv — Machine Learning research 1mo ago

Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery

arXiv:2605.18854v1 Announce Type: new Abstract: Coding agents accumulate extensive context during long-running tasks, yet fixed context windows force practitioners to choose between truncation and task failure. While numerous memory condensation strategies have been proposed,…

12
arXiv — Machine Learning research 1mo ago

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

arXiv:2605.18856v1 Announce Type: new Abstract: Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods…

22
arXiv — Machine Learning research 1mo ago

KVBuffer: IO-aware Serving for Linear Attention

arXiv:2605.19049v1 Announce Type: new Abstract: Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by…

28
arXiv — NLP / Computation & Language research 1mo ago

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

arXiv:2605.19577v1 Announce Type: new Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter…

17
arXiv — NLP / Computation & Language research 1mo ago

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

arXiv:2605.19660v1 Announce Type: cross Abstract: The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel…

30
arXiv — NLP / Computation & Language research 1mo ago

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

arXiv:2605.19932v1 Announce Type: cross Abstract: Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory,…

22
Hugging Face Daily Papers research 1mo ago

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Abstract GoLongRL presents an open-source approach for long-context reinforcement learning with diverse reward optimization through capability-oriented data construction and TMN-Reweight methodology. AI-generated summary We present GoLongRL, a fully open-source,…

37
arXiv — Machine Learning research 1mo ago

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

arXiv:2605.16360v1 Announce Type: new Abstract: Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision…

13
arXiv — Machine Learning research 1mo ago

SE-GA: Memory-Augmented Self-Evolution for GUI Agents

arXiv:2605.16883v1 Announce Type: new Abstract: Autonomous Graphical User Interface (GUI) agents often struggle with multi-step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work…

26
arXiv — NLP / Computation & Language research 1mo ago

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

arXiv:2605.16839v1 Announce Type: new Abstract: Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed…

31
arXiv — NLP / Computation & Language research 1mo ago

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

arXiv:2605.16928v1 Announce Type: new Abstract: Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an…

13
arXiv — NLP / Computation & Language research 1mo ago

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

arXiv:2605.18071v1 Announce Type: new Abstract: Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding,…

28
arXiv — NLP / Computation & Language research 1mo ago

Context Memorization for Efficient Long Context Generation

arXiv:2605.18226v1 Announce Type: new Abstract: Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the…

25
r/LocalLLaMA community 1mo ago

Is there any <3B model with usable 200k+ context window?

I need a small model for processing conversation transcripts from larger models, so need usable context window out to at least 200k tokens. I know some models claim to support this, but I don’t know which are actually good at this in practice. Also desirable: low hallucination…

15
r/LocalLLaMA community 1mo ago

Configuration Qwen3.6-35b-a3b (12Gb VRAM)

Has anyone here tested different KV cache quantizations and compared their performance? I’m currently using the model in Q5_K_M with Q4 KV cache on a 12 GB VRAM GPU. With this setup, I’m offloading about 27 MoE layers to the CPU and getting around 40 tok/s with a 128k total…

38
arXiv — Machine Learning research 1mo ago

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

arXiv:2605.15422v1 Announce Type: new Abstract: Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both…

36
arXiv — NLP / Computation & Language research 1mo ago

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

arXiv:2605.15514v1 Announce Type: new Abstract: We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its…

29
arXiv — NLP / Computation & Language research 1mo ago

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

arXiv:2605.15913v1 Announce Type: new Abstract: Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG).…

29
arXiv — NLP / Computation & Language research 1mo ago

RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents

arXiv:2605.16045v1 Announce Type: new Abstract: Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents by overcoming the limited context windows of LLMs. However, existing memory systems invoke LLMs to process…

31
r/LocalLLaMA community 1mo ago

Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and…

4
r/LocalLLaMA community 1mo ago

Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4

CPU is just a secondhand 10900x. Using 128k context, unquantized kv cache. Model is at q8_0 to mitigate some weird behavior I was seeing at lower quants. Speed is very slow at around 50tps pp, 10tps tg, but usable for coding agent workflows. Anybody else running MoE models in…

22
r/LocalLLaMA community 1mo ago

Developers who use local AI - Q4_0 vs Q8_0 KV quant?

I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territory. I don't really need a full study,…

25
r/LocalLLaMA community 1mo ago

Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

Saw some posts around PP being slower, so they were cautious on trying it. Here's a real-world datapoint. Settings: Headless RTX 3090 24G OpenCode Model unsloth's Qwen3.6-27B-MTP-Q4_K_M.gguf 128k context q8_0 kv cache --spec-draft-n-max: 3 --draft-p-min: 0 Use Cases: Research…

8
r/LocalLLaMA community 1mo ago

Deepseek V4's 1M context window: the breaking point

Just ran a test to verify DeepSeek V4's 1M context claim across three production codebases: 45k (microservice), 180k (monorepo backend) and 520k (full-stack app). Tasks included dependency tracing, cross-file refactors and bug isolation to see…

13
Ahead of AI (Sebastian Raschka) research 1mo ago

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

From Gemma 4 to DeepSeek V4, How New Open-Weight LLMs Are Reducing Long-Context Costs

22
r/LocalLLaMA community 1mo ago

RAG on Snapdragon X2 Laptop, 200K documents.

Qualcomm recently released the new Snapdragon X2 laptop chipset. I immediately ordered one: ASUS Zenbook A16 16" 3K OLED Touchscreen Laptop, Snapdragon X2 Elite Extreme (2026). A few things I really like about this machine: Extremely light.…

26
r/LocalLLaMA community 1mo ago

Gemma4 26b MoE running in MLX with turboquant (and custom kernel)

TL;DR I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turbo quant and rotating KV cache support. The answer was yes, and I'm now able to run Gemma4 26b on my MacBook Air M5 at 128k context with 4 concurrent batches 😄 At 8k context…

12
Hugging Face Daily Papers research 1mo ago

Long Context Pre-Training with Lighthouse Attention

Abstract Lighthouse Attention enables efficient training of causal transformers at long sequences by using hierarchical selection-based attention that reduces computational complexity while maintaining model performance. AI-generated summary Training causal transformers at…

33
r/LocalLLaMA community 1mo ago

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests. The project was a test - building a full iterative step-by-step pygame; a small mystery dungeon-style game. At first I set 100-200k context…

28
arXiv — NLP / Computation & Language research 1mo ago

EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

arXiv:2605.14589v1 Announce Type: new Abstract: Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to…

14
Hugging Face Daily Papers research 1mo ago

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Abstract A new benchmark evaluates memory capabilities in vision-language models through multi-session conversations, revealing limitations of both long-context and memory-augmented approaches. AI-generated summary Memory is essential for large vision-language models (LVLMs) to…

25
Hugging Face Daily Papers research 1mo ago

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Abstract RealICU benchmark evaluates large language models for ICU decision support using hindsight-annotated patient trajectories, revealing limitations in clinical recommendation accuracy and early interpretation bias. AI-generated summary Intensive care units (ICU) generate…

32
Hugging Face Daily Papers research 1mo ago

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

Abstract MemReread addresses long-context reasoning challenges by avoiding intermediate retrieval and employing question decomposition with rereading to recover discarded information, maintaining linear time complexity. AI-generated summary To tackle long-context reasoning tasks…

38
Hugging Face Daily Papers research 1mo ago

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Abstract Long-context continued pre-training enhances vision-language models' ability to handle extended documents while maintaining performance across diverse contexts through strategic data mixture design. AI-generated summary Long-context modeling is becoming a core…

24
r/LocalLLaMA community 1mo ago

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M models, 128k context): Model tok/s Key…

19
