Tag

Long Context

228 articles archived under #long-context · RSS

arXiv — NLP / Computation & Language research 28d ago

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

arXiv:2606.04302v1 Announce Type: new Abstract: Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation…

10
arXiv — NLP / Computation & Language research 28d ago

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv:2606.04511v1 Announce Type: new Abstract: Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe…

5
arXiv — NLP / Computation & Language research 28d ago

Cartridges at Scale: Training Modular KV Caches over Large Document Collections

arXiv:2606.04557v1 Announce Type: new Abstract: Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable…

14
arXiv — NLP / Computation & Language research 29d ago

Memory Retrieval for Changing Preferences

arXiv:2606.02976v1 Announce Type: new Abstract: Long-context dialogue systems must decide both when to access memory and which parts of the interaction history are relevant. Existing approaches typically rely on heuristic retrieval signals or always-on memory usage, failing to…

19
arXiv — NLP / Computation & Language research 29d ago

EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

arXiv:2606.03363v1 Announce Type: new Abstract: Text-to-SQL enables natural language access to databases, and recent LLMs have substantially advanced its capabilities. Existing benchmarks such as Spider, BIRD, and Spider 2.0 evaluate schema generalization, large-scale databases,…

15
arXiv — NLP / Computation & Language research 29d ago

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

arXiv:2606.02812v1 Announce Type: cross Abstract: Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but…

38
r/MachineLearning community 29d ago

MiniMax dropped a new attention architecture. [N]

It contains something interesting about context windows. They’re natively scaling to 1M tokens with MiniMax Sparse Attention (MSA), bypassing standard quadratic complexity by completely restructuring the memory access patterns at the operator level. Instead of relying on…

26
r/LocalLLaMA community 1mo ago

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

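Llama benchmark results

| model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | pp512 | 977.40 ± 2.02 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | tg128 | 70.54 ± 0.12 |

I've…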

22
arXiv — NLP / Computation & Language research 1mo ago

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

arXiv:2606.00024v1 Announce Type: new Abstract: Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth required to fetch the extensive Key-Value (KV) cache. Most existing KV management methods rely on key-only pruning before…

19
arXiv — NLP / Computation & Language research 1mo ago

MemPro: Agentic Memory Systems as Evolvable Programs

arXiv:2606.00619v1 Announce Type: new Abstract: Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge beyond finite context windows. Existing agentic memory systems typically follow a memory…

5
arXiv — NLP / Computation & Language research 1mo ago

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

arXiv:2606.00724v1 Announce Type: new Abstract: Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in…

28
arXiv — NLP / Computation & Language research 1mo ago

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

arXiv:2606.01223v1 Announce Type: new Abstract: Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into…

10
arXiv — NLP / Computation & Language research 1mo ago

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

arXiv:2606.01294v1 Announce Type: new Abstract: Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of…

21
arXiv — NLP / Computation & Language research 1mo ago

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

arXiv:2606.01336v1 Announce Type: new Abstract: As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs…

35
Hugging Face Daily Papers research 1mo ago

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

Abstract LongAttnComp adapts AttnComp for long-context processing by fine-tuning lightweight attention layers and implementing token-level chunking and positional reordering techniques. AI-generated summary As real-world applications increasingly require processing inputs of…

27
NVIDIA Developer Blog official-blog 1mo ago

Run Local AI Agents with Faster Models and Multi-Node Clustering on NVIDIA DGX Spark

The rise of autonomous, long-running AI agents has introduced a new class of compute demand, namely tasks that maintain large context windows, spawn concurrent...

16
r/LocalLLaMA community 1mo ago

For Ling-2.6-1T, what would make the size feel justified first: quality per token, local serving reality, or long context stability?

The first question I have about Ling-2.6-1T is not “is the model card impressive?” It is whether the boring trade-off makes sense. It is an open-sourced Ant/InclusionAI flagship with about 1T total params / 63B activated params, up to 1M native context, and 256K currently…

21
arXiv — Machine Learning research 1mo ago

CoMem: Context Management with A Decoupled Long-Context Model

arXiv:2605.30842v1 Announce Type: new Abstract: Context management enables agentic models to solve long-horizon tasks through iterative summarization of previous interaction histories. However, this process typically incurs substantial decoding overhead for the extra…

11
arXiv — NLP / Computation & Language research 1mo ago

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

arXiv:2605.31105v1 Announce Type: new Abstract: Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache…

4
Hugging Face Daily Papers research 1mo ago

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Abstract LongTraceRL addresses long-context reasoning challenges in large language models through tiered distractor construction and rubric reward design for improved reasoning quality. AI-generated summary Long-context reasoning remains a central challenge for large language…

38
r/LocalLLaMA community 1mo ago

MiniMax M3 - Coding & Agentic Frontier, 1M Context, Multimodal

submitted by /u/dryadofelysium

14
Vercel — AI dev-tools 1mo ago

MiniMax M3 on AI Gateway

MiniMax M3 is now available on Vercel AI Gateway. M3 is MiniMax's first model with a 1M-token context window and native multimodality, built around MiniMax Sparse Attention (MSA). M3 improves on software engineering, terminal-based tool use, and agentic web browsing, and is…

8
r/LocalLLaMA community 1mo ago

Liquid AI releases LFM2.5-8B-A1B

Liquid AI released LFM2.5-8B-A1B, an edge model designed to power real-life applications. It builds on LFM2-8B-A1B with three major upgrades: an expanded 128K context window, 38T tokens of pre-training (up from 12T), and large-scale reinforcement learning. It also comes with a…

14
arXiv — NLP / Computation & Language research 1mo ago

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

arXiv:2605.29324v1 Announce Type: new Abstract: Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy…

26
arXiv — NLP / Computation & Language research 1mo ago

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

arXiv:2605.29379v1 Announce Type: new Abstract: We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's…

38
r/LocalLLaMA community 1mo ago

Upgrade path from 4x 3090s

Hey everyone, looking for some upgrade advice. Right now, I’m running 4x 3090s hosting Qwen 3.6 27B 128K in full precision. It's a great model, but I'm looking for a step up and trying to figure out the best "middle-tier" hardware path. I've seen people here mention running 8x…

5
Hacker News — AI on Front Page community 1mo ago

Bricks and Minifigs Stole a Man's $200k Lego Collection

Article URL: https://mybricklog.com/blog/bricks-minifigs-corporate-stole-old-mans-200000-lego-collection Comments URL: https://news.ycombinator.com/item?id=48314136 Points: 208 # Comments: 57

34
r/LocalLLaMA community 1mo ago

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model

I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060. All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler). Spiritbuun's CUDA optimizations for NVIDIA GPUs — fused MMA…

18
r/LocalLLaMA community 1mo ago

Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context.

Used the vLLM version of https://github.com/noonghunna/club-3090. It worked fine for maybe 20-40k context; haven't tried the new one. Has anyone used the new llama.cpp patched one for a single 3090? The project is starting to seem very bloated, at least README-wise. I use…

6
arXiv — Machine Learning research 1mo ago

Heterogeneous Parallelism for Multimodal Large Language Model Training

arXiv:2605.27678v1 Announce Type: new Abstract: Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP…

34
arXiv — NLP / Computation & Language research 1mo ago

UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

arXiv:2605.27740v1 Announce Type: new Abstract: Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but…

35
arXiv — NLP / Computation & Language research 1mo ago

Periodic RoPE for Infinite Context LLMs

arXiv:2605.27980v1 Announce Type: new Abstract: The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence…

33
arXiv — NLP / Computation & Language research 1mo ago

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

arXiv:2605.28009v1 Announce Type: new Abstract: Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and…

7
arXiv — NLP / Computation & Language research 1mo ago

ATLAS: All-round Testing of Long-context Abilities across Scales

arXiv:2605.28079v1 Announce Type: new Abstract: Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and…

4
r/LocalLLaMA community 1mo ago

KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantization affects the precision in a vacuum, but also how…

11
r/LocalLLaMA community 1mo ago

Finally pioneering beyond the local 256k context window frontier!

The autocompact at 341.5k tokens is manually set and I'll be slowly pushing it back now I'm confident there's overhead for memory eviction of key values into cache. The question now is will the proposed fix complete in those remaining 16k tokens, as I'll be cross if the trial…

11
arXiv — NLP / Computation & Language research 1mo ago

NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

arXiv:2605.26678v1 Announce Type: new Abstract: Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal -- attention, recency, layer-wise allocation,…

30
r/LocalLLaMA community 1mo ago

Long-context performance at lower quants

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k context use, it starts to get dumb all of a…

26
Hugging Face Daily Papers research 1mo ago

Language Models Need Sleep

Abstract A sleep-like consolidation mechanism for transformer models uses fast weights and recurrent passes to improve long-context processing while maintaining inference speed. AI-generated summary Transformer-based large language models are increasingly used for long-horizon…

25
Hugging Face Daily Papers research 1mo ago

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Abstract ThriftAttention reduces long-context attention computation by selectively applying higher precision to critical query-key interactions, achieving near-full precision quality at reduced bitwidth efficiency. AI-generated summary Efficient attention algorithms are critical…

8
Smol AI News news-outlet 1mo ago

not much happened today

**Inference optimization** is increasingly architectural, with **EAGLE 3.1** improving speculative decoding and long-context handling, collaborating with **vLLM** and **TorchSpec**. **Perplexity** open-sourced a rebuilt **Unigram tokenizer** cutting CPU use by **5–6×** and…

15
Hugging Face Daily Papers research 1mo ago

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

Abstract MemForest presents a memory framework for long-context LLM agents that improves scalability and reduces latency through parallel chunk extraction and hierarchical temporal indexing. AI-generated summary Memory is a fundamental component for enabling long-context LLM…

4
arXiv — NLP / Computation & Language research 1mo ago

WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems

arXiv:2605.24579v1 Announce Type: new Abstract: Long-context memory systems often fail under fixed budgets, but end-to-end evaluation does not reveal whether evidence was discarded during compression or preserved but never retrieved. We introduce a four-condition diagnostic…

27
arXiv — NLP / Computation & Language research 1mo ago

H$^{2}$MT: Semantic Hierarchy-Aware Hierarchical Memory Transformer

arXiv:2605.24930v1 Announce Type: new Abstract: Transformer-based LLMs achieve strong results on many language tasks; however, long inputs remain challenging because context windows are finite, and prefill latency and memory grow rapidly with prompt length. Flat token-stream…

13
r/LocalLLaMA community 1mo ago

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

submitted by /u/miserlou

5
r/LocalLLaMA community 1mo ago

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost,…

37
arXiv — Machine Learning research 1mo ago

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

arXiv:2605.23081v1 Announce Type: new Abstract: Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit…

37
arXiv — Machine Learning research 1mo ago

Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

arXiv:2605.23200v1 Announce Type: new Abstract: The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on…

16
arXiv — Machine Learning research 1mo ago

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

arXiv:2605.23258v1 Announce Type: new Abstract: KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical…

38
arXiv — NLP / Computation & Language research 1mo ago

The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

arXiv:2605.23071v1 Announce Type: new Abstract: Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and…

20
