Tag

Long Context

228 articles archived under #long-context

r/LocalLLaMA community 15d ago

GLM-5.2 just dropped open weights and it already looks weirdly strong for coding

GLM-5.2 just released and the early numbers look pretty insane. 1M context window, open weights, MIT license, two reasoning effort modes, and it is already showing up near the top of coding arenas. I know every new model gets hyped for 24 hours, but this one actually looks worth…

28
Smol AI News news-outlet 16d ago

GLM 5.2: the top Frontend Coding model in the world, IndexShare reduces costs

**Z.ai released GLM-5.2**, an MIT-licensed open-weight frontier model targeting **coding and long-horizon agentic tasks** with a **1M-token context window** and **two reasoning-effort modes**. It features a **744B-parameter mixture-of-experts architecture** with **40B active…

14
arXiv — Machine Learning research 16d ago

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

arXiv:2606.15157v1 Announce Type: new Abstract: KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all…

29
arXiv — NLP / Computation & Language research 16d ago

Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

arXiv:2606.16093v1 Announce Type: new Abstract: Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention but scale quadratically ($O(N^2)$) with sequence length, while…

11
r/LocalLLaMA community 16d ago

Maybe dumb question, but how do you serve multiple users with the full context length?

After experimenting with llama.cpp, I'm wondering about something. Let's say we have an LLM with a context size of 128k. Now let's say we want to have up to 8 parallel users, and we want to provide each client with the full context capabilities. With llama.cpp, how does that work? AFAIK it…

20
Ollama releases dev-tools 16d ago

v0.30.9-rc1

server: context shift for context windows larger than 8k, add error w…

28
r/LocalLLaMA community 16d ago

Context window + project size + Aider?

Forgive the naivety of this post, I'm a noob, bear with me! If a project, understood as a set of files, is larger than the context window of a model, how do you fit it in? After doing some naive research, various major LLMs like Deepseek, Kimi, and company say the solution is…

32
arXiv — NLP / Computation & Language research 17d ago

Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

arXiv:2606.14047v1 Announce Type: cross Abstract: Long-context language modeling requires not only extending context windows but maintaining coherent understanding of entity states and relationships across thousands of tokens -- a challenge that semantic similarity alone cannot…

12
arXiv — NLP / Computation & Language research 17d ago

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

arXiv:2606.14470v1 Announce Type: cross Abstract: Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software…

37
Hacker News — AI on Front Page community 18d ago

Don't trust large context windows

Article URL: https://garrit.xyz/posts/2026-05-06-dont-trust-large-context-windows
Comments URL: https://news.ycombinator.com/item?id=48524620
Points: 201
Comments: 146

27
r/LocalLLaMA community 18d ago

[NEW FAMILY OF MODELS] Supra1.5 family just released!

SupraLabs just released the Supra-1.5-exp line: Base, Instruct, and GGUF! (Reasoning soon) Hey r/LocalLLaMA! We are releasing the experimental Supra-1.5-50M family today: a new Base model with 5x the context window of the original Supra-50M, an Instruct fine-tune on top of it,…

20
r/LocalLLaMA community 18d ago

GLM 5.2 is deployed in GLM Coding Plan. API and MIT weights in a week. Voting and benchmarks on X.

The model now supports a 1M context window and two thinking modes: max and high. z.ai recommends using max for coding. Vote on X: What should we prioritize most? Options: longer context window; MIT-licensed open weights; no price increase. Other links: GLM 5.2 announcement, LLM Benchmark…

32
r/LocalLLaMA community 19d ago

MiniMax Sparse Attention (MSA)

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax…

14
NVIDIA Developer Blog official-blog 19d ago

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and...

25
r/LocalLLaMA community 20d ago

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

What it is, in plain words. Your GPU keeps two float vectors for every token of your conversation. That’s the KV cache, and it’s why long contexts eat VRAM: Llama-3.1-8B needs about 0.12 MB per token, so 100k tokens costs 12 GB and a million tokens costs 122 GB. No consumer card…
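The per-token figure in the post can be checked with a short calculation. This is a minimal sketch, not taken from the post: the helper `kv_bytes_per_token` is hypothetical, and the Llama-3.1-8B shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and the 16-bit cache precision are assumptions on my part. Under those assumptions the result lines up with the quoted numbers when "MB" and "GB" are read as binary units (MiB, GiB).

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers.

    Each layer stores one key and one value vector per KV head,
    hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed Llama-3.1-8B shape with a 16-bit (2-byte) KV cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

print(f"{per_token / MIB:.3f} MiB per token")
for n_tokens in (100_000, 1_000_000):
    total_gib = per_token * n_tokens / GIB
    print(f"{n_tokens:>9,} tokens: {total_gib:.1f} GiB")
```

Quantizing the cache changes only `bytes_per_value`, so an 8-bit cache would halve these totals under the same assumptions.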

33
Hugging Face Daily Papers research 20d ago

MiniMax Sparse Attention

Abstract: MiniMax Sparse Attention enables efficient processing of ultra-long contexts in large language models through blockwise sparsity and optimized GPU execution, achieving significant speedups while maintaining performance. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct)…

20
arXiv — NLP / Computation & Language research 20d ago

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most…

23
arXiv — NLP / Computation & Language research 20d ago

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

arXiv:2606.13115v1 Announce Type: new Abstract: While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive…

15
arXiv — NLP / Computation & Language research 20d ago

Recursive Agent Harnesses

arXiv:2606.13643v1 Announce Type: new Abstract: Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in…

35
Vercel — AI dev-tools 20d ago

GLM 5.2 now available on AI Gateway

GLM 5.2 is now available on AI Gateway. Built for long-horizon tasks, GLM 5.2 carries project-level engineering context across a single task, runs long-running tasks more reliably, and follows engineering standards more consistently. The context window for this model has been…

16
Hugging Face Daily Papers research 20d ago

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Abstract: SparDA is a decoupled sparse attention architecture that improves long-context LLM inference by reducing KV cache bottlenecks and attention complexity through aForecast projection for lookahead selection. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct) Sparse attention…

23
r/MachineLearning community 20d ago

What should context compression keep? I looked at how six agents handle it [D]

I use Claude Code, Codex CLI, OpenCode, Cline, Cursor, and Amp enough to notice a pattern in how they handle long context. They are all converging on layered progressive compression, but they disagree on what to protect. Most protect recent user messages as a first-class asset.…

20
arXiv — NLP / Computation & Language research 21d ago

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

arXiv:2606.11213v1 Announce Type: new Abstract: We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through…

6
r/LocalLLaMA community 21d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the…

25
Hugging Face Daily Papers research 22d ago

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Abstract: Chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, but a training-free method called QK-Restore can restore long-context capabilities by reverting query-key…

8
arXiv — Machine Learning research 22d ago

Blurry Window Attention

arXiv:2606.09862v1 Announce Type: new Abstract: The Softmax Attention operation in Transformer language models has a quadratic complexity in the sequence length and a growing state size in the form of KV cache, which becomes a bottleneck in long context scenarios. To overcome…

33
arXiv — NLP / Computation & Language research 22d ago

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

arXiv:2606.10537v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose…

37
arXiv — NLP / Computation & Language research 22d ago

Dynamic Linear Attention

arXiv:2606.10650v1 Announce Type: new Abstract: The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To…

34
arXiv — NLP / Computation & Language research 22d ago

REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

arXiv:2606.10694v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management…

22
arXiv — NLP / Computation & Language research 22d ago

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

arXiv:2606.11052v1 Announce Type: new Abstract: Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including…

26
arXiv — NLP / Computation & Language research 22d ago

Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling

arXiv:2606.10435v1 Announce Type: cross Abstract: Transformers achieve strong language modeling performance by providing direct token-to-token communication paths, but causal self-attention scales quadratically with context length. Recurrent and state-space models reduce this…

21
Hugging Face Daily Papers research 22d ago

Dynamic Linear Attention

Abstract: DLA addresses limitations in long-context LLMs by introducing adaptive state merging and capacity-bounded memory modeling for improved multi-state linear attention. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct) The scalability of Large Language Models (LLMs) to long…

25
Smol AI News news-outlet 23d ago

Anthropic Claude Fable 5

**Anthropic** released two major models: **Claude Fable 5** for general availability and **Claude Mythos 5** for restricted access, with fallback to **Claude Opus 4.8** for sensitive queries. **Fable 5** features a **1M-token context window** and pricing at **$10/million input…

24
arXiv — Machine Learning research 23d ago

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

arXiv:2606.07703v1 Announce Type: new Abstract: Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve…

4
Hugging Face Daily Papers research 23d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Abstract: Lookahead Sparse Attention with Neural Memory Indexer reduces GPU memory usage for long-context LLM inference while maintaining accuracy through proactive KV cache management and decoupled training. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct) Conventional LLMs keep the…

19
Hugging Face Daily Papers research 23d ago

End-to-End Context Compression at Scale

Abstract: Encoder-decoder compression techniques are improved through architectural search and large-scale pretraining to create Latent Context Language Models that efficiently handle long contexts with better performance and memory usage compared to traditional KV cache methods.…

25
r/LocalLLaMA community 23d ago

Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance

I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointed out the tool-eval-bench utility by…

9
Hugging Face Daily Papers research 23d ago

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Abstract: RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct) Efficient inference is critical for long-context language models, where…

28
arXiv — NLP / Computation & Language research 24d ago

A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models

arXiv:2606.06758v1 Announce Type: new Abstract: Final-answer accuracy, retrieval recall, and citation overlap do not by themselves identify whether a long-context or retrieval-augmented language model used the evidence it was given. A model can answer from parametric memory,…

21
arXiv — NLP / Computation & Language research 24d ago

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

arXiv:2606.06906v1 Announce Type: new Abstract: Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate…

6
r/LocalLLaMA community 24d ago

How are you all managing multiple MCP servers on startup?

Hello! I'm using openCode and loading a bunch of different MCP servers at startup. This starts becoming a mess, it eats up tokens and pollutes the context window before I even type a single prompt. How are you all handling this locally? Are you using a proxy/hub to route…

14
r/LocalLLaMA community 24d ago

Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ

Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine due to support of additional types:…

31
r/LocalLLaMA community 25d ago

KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive!

TL;DR Based on long context KLD benchmarks, KVarN appears to be just better than usual llama.cpp KV cache quants. At every size, KVarN matches precision of usual quants of one bit higher. A number of people in the comments under my previous post asked a fair question: what if we…

21
arXiv — Machine Learning research 27d ago

When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

arXiv:2606.06034v1 Announce Type: new Abstract: Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We…

23
arXiv — NLP / Computation & Language research 27d ago

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

arXiv:2606.05182v1 Announce Type: new Abstract: Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer…

14
arXiv — NLP / Computation & Language research 27d ago

Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs

arXiv:2606.06203v1 Announce Type: new Abstract: Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information --…

4
r/LocalLLaMA community 27d ago

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Model Summary
Total Parameters: 550B (55B active)
Architecture: LatentMoE - Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP)
Context Length: Up to 1M tokens
Minimum GPU Requirement: 8x GB200/B200/GB300/B300, 16x H100, 8x H200
Supported Languages: English, French,…

21
Vercel — AI dev-tools 28d ago

Nemotron 3 Ultra now available on AI Gateway

Nemotron 3 Ultra from Nvidia is now available on Vercel AI Gateway. Nemotron 3 Ultra is an open Mixture-of-Experts reasoning model built for orchestrating long-running agent workflows, with a 1M token context window. The model targets multi-turn agent workflows: planning, tool…

37
Smol AI News news-outlet 28d ago

not much happened today

**NVIDIA** released **Nemotron 3 Ultra**, a fully open **550B MoE** model with **55B active parameters** and **1M context**, optimized for long-running agent tasks with up to **5x speedup** and **30% cost reduction**. It features hybrid Mamba/attention, LatentMoE, native MTP,…

7
arXiv — NLP / Computation & Language research 28d ago

SaliMory: Orchestrating Cognitive Memory for Conversational Agents

arXiv:2606.04120v1 Announce Type: new Abstract: Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents…

10
