Tag: Long Context

228 articles archived under #long-context

r/LocalLLaMA community 15d ago GLM-5.2 just dropped open weights and it already looks weirdly strong for coding GLM-5.2 just released and the early numbers look pretty insane. 1M context window, open weights, MIT license, two reasoning effort modes, and it is already showing up near the top of coding arenas. I know every new model gets hyped for 24 hours, but this one actually looks worth… 28

Smol AI News news-outlet 16d ago GLM 5.2: the top Frontend Coding model in the world, IndexShare reduces costs **Z.ai released GLM-5.2**, an MIT-licensed open-weight frontier model targeting **coding and long-horizon agentic tasks** with a **1M-token context window** and **two reasoning-effort modes**. It features a **744B-parameter mixture-of-experts architecture** with **40B active… 14

arXiv — Machine Learning research 16d ago PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression arXiv:2606.15157v1 Announce Type: new Abstract: KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all… 29

arXiv — NLP / Computation & Language research 16d ago Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing arXiv:2606.16093v1 Announce Type: new Abstract: Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention but scale quadratically ($O(N^2)$) with sequence length, while… 11

r/LocalLLaMA community 16d ago Maybe dumb question, but how do you serve multiple users with the full context length? After experimenting with llama.cpp, I'm wondering a thing. Let's say we have an LLM with a context size of 128k.
Now let's say we want to have up to 8 parallel users, and we want to provide each client with the full context capabilities. With llama.cpp, how does that work? AFAIK it… 20

Ollama releases dev-tools 16d ago v0.30.9-rc1 server: context shift for context windows larger than 8k, add error w… 28

r/LocalLLaMA community 16d ago Context window + project size + Aider? Forgive the naivety of this post, I'm a noob, bear with me! If a project, understood as a set of files, is larger than the context window of a model, how do you fit it in? After doing some naive research, various major LLMs like Deepseek, Kimi, and company say the solution is… 32

arXiv — NLP / Computation & Language research 17d ago Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling arXiv:2606.14047v1 Announce Type: cross Abstract: Long-context language modeling requires not only extending context windows but maintaining coherent understanding of entity states and relationships across thousands of tokens -- a challenge that semantic similarity alone cannot… 12

arXiv — NLP / Computation & Language research 17d ago GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge arXiv:2606.14470v1 Announce Type: cross Abstract: Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software… 37

Hacker News — AI on Front Page community 18d ago Don't trust large context windows Article URL: https://garrit.xyz/posts/2026-05-06-dont-trust-large-context-windows Comments URL: https://news.ycombinator.com/item?id=48524620 Points: 201 # Comments: 146 27

r/LocalLLaMA community 18d ago [NEW FAMILY OF MODELS] Supra1.5 family just released! SupraLabs just released the Supra-1.5-exp line, Base, Instruct, and GGUF! (Reasoning soon) Hey r/LocalLLaMA!
We are releasing the experimental Supra-1.5-50M family today: a new Base model with 5x the context window of the original Supra-50M, an Instruct fine-tune on top of it,… 20

r/LocalLLaMA community 18d ago GLM 5.2 is deployed in GLM Coding Plan. API and MIT weights in a week. Voting and benchmarks on X. The model now supports a 1M context window and two thinking modes: max and high. z.ai recommends using max for coding. Vote on X What should we prioritize most? Longer context window MIT-licensed open weights No price increase Other links: GLM 5.2 announcement LLM Benchmark… 32

r/LocalLLaMA community 19d ago MiniMax Sparse Attention (MSA) Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax… 14

NVIDIA Developer Blog official-blog 19d ago Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and... 25

r/LocalLLaMA community 20d ago Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo What it is, in plain words. Your GPU keeps two float vectors for every token of your conversation. That’s the KV cache, and it’s why long contexts eat VRAM: Llama-3.1-8B needs about 0.12 MB per token, so 100k tokens costs 12 GB and a million tokens costs 122 GB.
No consumer card… 33

Hugging Face Daily Papers research 20d ago MiniMax Sparse Attention Abstract MiniMax Sparse Attention enables efficient processing of ultra-long contexts in large language models through blockwise sparsity and optimized GPU execution, achieving significant speedups while maintaining performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 20

arXiv — NLP / Computation & Language research 20d ago LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most… 23

arXiv — NLP / Computation & Language research 20d ago G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents arXiv:2606.13115v1 Announce Type: new Abstract: While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive… 15

arXiv — NLP / Computation & Language research 20d ago Recursive Agent Harnesses arXiv:2606.13643v1 Announce Type: new Abstract: Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in… 35

Vercel — AI dev-tools 20d ago GLM 5.2 now available on AI Gateway GLM 5.2 is now available on AI Gateway. Built for long-horizon tasks, GLM 5.2 carries project-level engineering context across a single task, runs long-running tasks more reliably, and follows engineering standards more consistently.
The context window for this model has been… 16

Hugging Face Daily Papers research 20d ago SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference Abstract SparDA is a decoupled sparse attention architecture that improves long-context LLM inference by reducing KV cache bottlenecks and attention complexity through a Forecast projection for lookahead selection. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse attention… 23

r/MachineLearning community 20d ago What should context compression keep? I looked at how six agents handle it [D] I use Claude Code, Codex CLI, OpenCode, Cline, Cursor, and Amp enough to notice a pattern in how they handle long context. They are all converging on layered progressive compression, but they disagree on what to protect. Most protect recent user messages as a first-class asset.… 20

arXiv — NLP / Computation & Language research 21d ago Beyond Compaction: Structured Context Eviction for Long-Horizon Agents arXiv:2606.11213v1 Announce Type: new Abstract: We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through… 6

r/LocalLLaMA community 21d ago FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving.
In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the… 25

Hugging Face Daily Papers research 22d ago Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It Abstract Chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, but a training-free method called QK-Restore can restore long-context capabilities by reverting query-key… 8

arXiv — Machine Learning research 22d ago Blurry Window Attention arXiv:2606.09862v1 Announce Type: new Abstract: The Softmax Attention operation in Transformer language models has a quadratic complexity in the sequence length and a growing state size in the form of KV cache, which becomes a bottleneck in long context scenarios. To overcome… 33

arXiv — NLP / Computation & Language research 22d ago Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models arXiv:2606.10537v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose… 37

arXiv — NLP / Computation & Language research 22d ago Dynamic Linear Attention arXiv:2606.10650v1 Announce Type: new Abstract: The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To… 34

arXiv — NLP / Computation & Language research 22d ago REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs arXiv:2606.10694v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons.
However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management… 22

arXiv — NLP / Computation & Language research 22d ago Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It arXiv:2606.11052v1 Announce Type: new Abstract: Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including… 26

arXiv — NLP / Computation & Language research 22d ago Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling arXiv:2606.10435v1 Announce Type: cross Abstract: Transformers achieve strong language modeling performance by providing direct token-to-token communication paths, but causal self-attention scales quadratically with context length. Recurrent and state-space models reduce this… 21

Hugging Face Daily Papers research 22d ago Dynamic Linear Attention Abstract DLA addresses limitations in long-context LLMs by introducing adaptive state merging and capacity-bounded memory modeling for improved multi-state linear attention. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The scalability of Large Language Models (LLMs) to long… 25

Smol AI News news-outlet 23d ago Anthropic Claude Fable 5 **Anthropic** released two major models: **Claude Fable 5** for general availability and **Claude Mythos 5** for restricted access, with fallback to **Claude Opus 4.8** for sensitive queries. **Fable 5** features a **1M-token context window** and pricing at **$10/million input… 24

arXiv — Machine Learning research 23d ago How Much Dense Attention is Necessary?
Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models arXiv:2606.07703v1 Announce Type: new Abstract: Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve… 4

Hugging Face Daily Papers research 23d ago FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention Abstract Lookahead Sparse Attention with Neural Memory Indexer reduces GPU memory usage for long-context LLM inference while maintaining accuracy through proactive KV cache management and decoupled training. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Conventional LLMs keep the… 19

Hugging Face Daily Papers research 23d ago End-to-End Context Compression at Scale Abstract Encoder-decoder compression techniques are improved through architectural search and large-scale pretraining to create Latent Context Language Models that efficiently handle long contexts with better performance and memory usage compared to traditional KV cache methods.… 25

r/LocalLLaMA community 23d ago Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointed out the tool-eval-bench utility by… 9

Hugging Face Daily Papers research 23d ago Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity Abstract RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Efficient inference is critical for long-context language models, where… 28

arXiv — NLP / Computation & Language research 24d ago A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models arXiv:2606.06758v1 Announce Type: new Abstract: Final-answer accuracy, retrieval recall, and citation overlap do not by themselves identify whether a long-context or retrieval-augmented language model used the evidence it was given. A model can answer from parametric memory,… 21

arXiv — NLP / Computation & Language research 24d ago EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering arXiv:2606.06906v1 Announce Type: new Abstract: Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate… 6

r/LocalLLaMA community 24d ago How are you all managing multiple MCP servers on startup? Hello! I'm using openCode and loading a bunch of different MCP servers at startup. This starts becoming a mess, it eats up tokens and pollutes the context window before I even type a single prompt. How are you all handling this locally? Are you using a proxy/hub to route… 14

r/LocalLLaMA community 24d ago Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine due to support of additional types:… 31

r/LocalLLaMA community 25d ago KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive! TL;DR Based on long context KLD benchmarks, KVarN appears to be just better than usual llama.cpp KV cache quants.
At every size, KVarN matches precision of usual quants of one bit higher. A number of people in the comments under my previous post asked a fair question: what if we… 21

arXiv — Machine Learning research 27d ago When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet arXiv:2606.06034v1 Announce Type: new Abstract: Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We… 23

arXiv — NLP / Computation & Language research 27d ago LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations arXiv:2606.05182v1 Announce Type: new Abstract: Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer… 14

arXiv — NLP / Computation & Language research 27d ago Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs arXiv:2606.06203v1 Announce Type: new Abstract: Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information --… 4

r/LocalLLaMA community 27d ago nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face Model Summary Total Parameters 550B (55B active) Architecture LatentMoE - Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP) Context Length Up to 1M tokens Minimum GPU Requirement 8x GB200/B200/GB300/B300, 16x H100, 8x H200 Supported Languages English, French,… 21

Vercel — AI dev-tools 28d ago Nemotron 3 Ultra now available on AI Gateway Nemotron 3 Ultra from Nvidia is now available on Vercel AI Gateway.
Nemotron 3 Ultra is an open Mixture-of-Experts reasoning model built for orchestrating long-running agent workflows, with a 1M token context window. The model targets multi-turn agent workflows: planning, tool… 37

Smol AI News news-outlet 28d ago not much happened today **NVIDIA** released **Nemotron 3 Ultra**, a fully open **550B MoE** model with **55B active parameters** and **1M context**, optimized for long-running agent tasks with up to **5x speedup** and **30% cost reduction**. It features hybrid Mamba/attention, LatentMoE, native MTP,… 7

arXiv — NLP / Computation & Language research 28d ago SaliMory: Orchestrating Cognitive Memory for Conversational Agents arXiv:2606.04120v1 Announce Type: new Abstract: Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents… 10

Page 2 of 5