r/LocalLLaMA

r/LocalLLaMA community 58m ago

Palantir CEO rages against closed models

For context, this week they struck a deal to buy Nvidia chips and run local models for their enterprise clients. So in this video he is railing against Anthropic and OpenAI saying they are ripping everyone off while stealing their data too. Always a special moment when the enemy…

30
r/LocalLLaMA community 3h ago

SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing.

I’m pretty jaded like most of y’all. I don’t really get excited by new models much anymore. Last few weeks have been kinda meh to be honest. Monday, I stumbled upon SenseNova’s Mixture of Transformers models and they seem kinda like a different animal than other typical image…

4
r/LocalLLaMA community 4h ago

[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode

I came across this interesting article (https://blog.exolabs.net/nvidia-dgx-spark/). I don't have a DGX Spark, but it made me curious: would this kind of architecture speed up my setup for LLMs? A Mac can host large models but the prefill speed sucks, so I tested it on my setup for…

25
r/LocalLLaMA community 4h ago

They fit! Mostly.... 2x 3090, Thermaltake Core p3

Got another 3090; had to print a bracket to angle the radiator and make room for the GPUs 💀 Ended up liking the look more than I thought I would. Qwen 27B go brrrrr. submitted by /u/anthonyg45157

6
r/LocalLLaMA community 5h ago

Making LLMs Better at Creative Writing using Entropy

submitted by /u/CountBayesie

26
r/LocalLLaMA community 7h ago

I added MTP to local SoTA Agentic Coding Model Ornith 35B FP8 E4M3

Just wanted to share that I was looking for an optimal way to run Ornith 35B in FP8 with E4M3 and MTP with vLLM but there was no out-of-the-box model with MTP drafter support. So I grafted this new model! It's 18% faster than without MTP and the drafter acceptance rate is not…

31
r/LocalLLaMA community 9h ago

I extended Gemma4-31B to 44B (88 layers) — since Google won't give us anything bigger than 31B

I've just been sitting on this thread for a while now, both as a reader and occasional poster, so I figured it was finally time to share something I've been working on over the last few weekends. Google hasn't shipped a dense Gemma4 bigger than 31B, so I decided to just build one myself. Heads…

38
r/LocalLLaMA community 10h ago

Senior SWE Bench: a new benchmark focussed on realistically underspecified feature tasks

submitted by /u/jordo45

37
r/LocalLLaMA community 10h ago

My reasons to run local models

I can finetune any model on any dataset I want. I can use techniques like speculative decoding and other SOTA approaches to get the max tps. LLM providers like Anthropic and OpenAI are not getting access to my data. The hardware is reusable for vision, text, and speech, and I can run…

10
r/LocalLLaMA community 11h ago

End of an Agony. Real production service that uses LLM to earn money my team had made and now we are so happy that it will die. Here are some of my final "experiences".

Hello everyone. About 8 months ago I posted in this sub about building a production service. Here is the link to my previous post. The idea was the same: we wanted to make a real production service that we could provide to clients to earn money. An AI assistant that works through…

30
r/LocalLLaMA community 11h ago

ZCode: New Agentic Code Editor from the Makers of GLM

submitted by /u/johnnyApplePRNG

16
r/LocalLLaMA community 12h ago

Anyone using TensTorrent gpus for your local ai? What's been your experience?

I'm always keeping an eye on competitive hardware and was looking at tenstorrent cards, particularly the p150a which while its memory bandwidth is only 512GB/s, it does have 32 GB of GDDR6 and a high-speed Ethernet fabric (4×800 GbE) so multi-card systems don't rely on PCIe…

36
r/LocalLLaMA community 12h ago

What should I test when comparing Qwen3.6-27b quants for real world effects that humans could reason about?

I tried to find some good comparisons on how different quants of Qwen3.6-27b perform in different scenarios, but I failed to find good information on what kinds of real world effects there are to running different quants like Q4_K_M, UD-Q4_K_XL, UD-Q5_K_XL, UD-Q6_K_XL and…

37
r/LocalLLaMA community 13h ago

Open benchmark: how well can multimodal LLMs read a calendar week-view from a screenshot? Humans ~99%, Q4 local models.....

Some backstory: I've been working on my local agent (openclaw), and I wanted to give it the skill to reconstruct calendar entries from a photo of the screen. I couldn't get at the calendar through an API (long story), so a photo was the only low-friction way to export the data.…

16
r/LocalLLaMA community 14h ago

Llama-b9856 Win Cuda 12.4 - Windows Defender claims it's a trojan

Hi, just downloaded this release earlier today. Attempted to run llama-server, and Windows Defender shut it down. It says it's Wacatac.H!ml. It removed the llama-server-impl.dll file from the folder. Older releases work fine. submitted by /u/Far_Course2496

10
r/LocalLLaMA community 14h ago

Deepseek Flash V4 at IQ2 or Qwen 3.6 27B Q5KM ? Any tests or benchmarks ?

Deepseek Flash V4 at IQ2 or Qwen 3.6 27B Q5KM? Any tests or benchmarks? Wondering which one would be better at speed / coding / reasoning. submitted by /u/soyalemujica

32
r/LocalLLaMA community 15h ago

Plurality Released: fully Free and Open Source AI agents/chatbot platform for local AI

Hello everyone! Some of you might recognize my user from the work I have done on Cosmos Cloud, but today I am here to talk to you about an entirely different project: Plurality. https://github.com/azukaar/plurality Plurality has been in development for a bit more than a year and…

22
r/LocalLLaMA community 15h ago

How to improve RAM offload?

I have only 12GB VRAM (RTX3060) but have enough RAM to run Qwen3.6 27B Q4 with offload. Something tells me that it won't achieve maximum performance, but why is DRAM speed only around 30GB/s (HWiNFO data) during inference with dual-channel 5200 RAM? TG is 3.12 tok/sec with 18K…
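For reference, a rough back-of-the-envelope sketch of the numbers quoted in this post. It assumes "dual channel 5200 RAM" means DDR5-5200 with a 64-bit (8-byte) bus per channel, and it takes the post's 30 GB/s and 3.12 tok/s figures at face value. The function names are illustrative only.

```python
def peak_dram_bandwidth_gbs(mt_per_s: float, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10^9 bytes).

    Assumption: each channel moves `bus_bytes` bytes per transfer.
    """
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


def implied_gb_per_token(observed_gbs: float, tokens_per_s: float) -> float:
    """GB of RAM traffic per generated token implied by the observed rates."""
    return observed_gbs / tokens_per_s


peak = peak_dram_bandwidth_gbs(5200)  # dual-channel DDR5-5200 (assumed)
observed = 30.0                       # GB/s, HWiNFO figure from the post
tg = 3.12                             # tok/s, figure from the post

print(f"theoretical peak: {peak:.1f} GB/s")
print(f"observed share of peak: {observed / peak:.0%}")
print(f"implied RAM traffic per token: {implied_gb_per_token(observed, tg):.1f} GB")
```

Under these assumptions the theoretical peak is about 83 GB/s, so the observed 30 GB/s is a bit over a third of what the memory could deliver on paper. Whether the remaining gap comes from the CPU, PCIe transfers, or the monitoring tool is not something this arithmetic can tell.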

38
r/LocalLLaMA community 15h ago

Couldn't hold back

I had been waiting for months and the cards finally got delivered today. No one at my workplace was excited, maybe because no one cares about the AI stuff that I work on. But I just wanted to share it with you guys. Can't wait to build the server and start working on them.  …

11
r/LocalLLaMA community 15h ago

July 2026; where are Intel's GPU speeds today at?

Hey all, it is 1st of July, 2H26, and I hope that Intel has been catching up on their firmware support for their B50-B70 cards in recent months. In some places of the world, they do sound like a good VRAM/money offer, and hence I would love for you to share your recent PP / TG…

19
r/LocalLLaMA community 16h ago

The gap between closed and open models might be much smaller than commonly assumed, because we don’t know what closed model providers do *in addition to* model inference

When Claude dominates GLM-5.2 in benchmarks, it’s usually assumed that Anthropic has superior model architectures, superior training pipelines, and other advanced machine learning techniques that make their models better than the competition. But actually, this doesn’t follow.…

10
r/LocalLLaMA community 16h ago

Open Models - June 2026

After an overwhelming April and an OK May, here's June. Yes, the graph has fewer items, because we got other items here last month. Finetunes: Nex-N2, Ornith-1.0, Agents-A1, Holo3.1, Tmax-27b, MusaCoder-27B, VibeThinker-3B. NVFP4 from NVIDIA for the models below:…

8
r/LocalLLaMA community 16h ago

gemma-4-31B on Cerebras is better than ChatGPT voice mode

open models will win on inference too 🚀 submitted by /u/paf1138

34
r/LocalLLaMA community 17h ago

SWE-rebench leaderboard update: GLM-5.2, Qwen3.6-27B, Qwen3.6-35B-A3B, Gemma 4 31B and more + improved UI

Hi all, We made several updates to the SWE-rebench leaderboard: added new models, refreshed recent results, and reworked the leaderboard UI to make results easier to read, compare, and understand. New Models: Claude Opus 4.8 xhigh: 56.5% — 2.48M tokens GLM-5.2: 51.1% — 2.62M…

16
r/LocalLLaMA community 17h ago

I mapped which local LLMs actually fit each RAM tier, 8 to 128GB (open dataset)

I kept answering the same question for friends ("I've got a 16GB MacBook / a 3060, what can I actually run?") and got tired of guessing, so I started a spreadsheet. It grew into a real dataset, so I put it on GitHub under CC BY for anyone to use or fix. Rule of thumb I landed…

28
r/LocalLLaMA community 18h ago

Software engineering best practices in the age of LLM coding

It is important to document requirements, capture key decisions, and record design goals. Best practices: Keep a requirements doc in the repo. Store plan files created by the LLM in the repo, as plans/<date>-<summary>.md. Store session summaries in the repo. Sometimes need to inform the LLM to…
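A minimal sketch of the plans/<date>-<summary>.md naming convention mentioned above. The post does not say what the date or summary should look like, so the ISO date format and the lowercase hyphenated slug are assumptions, and `plan_path` is a hypothetical helper name.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def plan_path(summary: str, day: date) -> PurePosixPath:
    """Build a plan file path of the form plans/<date>-<summary>.md.

    Assumptions: ISO date (YYYY-MM-DD) and a lowercase slug in which every
    run of non-alphanumeric characters becomes a single hyphen.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return PurePosixPath("plans") / f"{day.isoformat()}-{slug}.md"


print(plan_path("Add retry logic to sync job", date(2026, 7, 1)))
# prints: plans/2026-07-01-add-retry-logic-to-sync-job.md
```

An ISO date prefix keeps the files sorted chronologically in a plain directory listing, which is the main reason for choosing it here.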

28
r/LocalLLaMA community 18h ago

Agent execution visualizer

I've seen projects which stream tool use status and subagent generation, and represent it with a nice little visual based on the tool being used, etc. It would be pretty cool to pair this with some live model visualisations like a QKV heatmap across attention heads. Not for…

28
r/LocalLLaMA community 18h ago

Deepseek V4 Flash 2, 3 and 4 bits GGUFs

submitted by /u/tarruda

31
r/LocalLLaMA community 18h ago

Best tps I can get with Qwen3.5 122B on 32GB VRAM + 64GB RAM?

My attempt at running Qwen3.5 122B on my 5090 (32GB VRAM) + 64GB RAM is really bleak. I'm getting a speed that starts at 6 tps and ends at ~20 tps. Can I improve this further? build/bin/llama-server \ -m…

21
r/LocalLLaMA community 19h ago

Non Us Ally should be afraid.

Spyware-like code in Claude Code that covertly targets Chinese users. submitted by /u/zakadit

28
r/LocalLLaMA community 20h ago

Hister: Give Your AI Assistant a Private Memory

I have been working on Hister, a self hosted search engine that automatically indexes pages you visit, local files, and documentation, then keeps them searchable with stored offline previews. It also exposes an MCP endpoint, so local AI assistants can search your own indexed…

5
r/LocalLLaMA community 21h ago

Thinking about grabbing 4x Ascend GX10s

Some in this sub have tested GLM5.2 on 4x DGX Sparks (or Ascend GX10) with 400-500 tok/s prompt processing and ~15 tok/s output at 128k context. Not blazing fast, but usable imo, especially with quantization. My thinking: If there's an open-source fable 5 sometime in december or…

20
r/LocalLLaMA community 21h ago

README_EN.md · openpangu/openPangu-2.0-Flash at main

1. Introduction openPangu-2.0-Flash is an MoE model trained on Ascend. The model has 92B total parameters and 6B activated parameters. Its context length is 512k. The total pretraining data contains 34T tokens. During Post-training, openPangu-2.0-Flash is trained through unified…

15
r/LocalLLaMA community 21h ago

LokalBot - fully local macOS app: meetings, autocomplete, and day tracking that all run on your machine with a user friendly UI

Been lurking here a while, this sub is basically why LokalBot exists. It's a Mac app that records + summarizes your meetings, autocompletes your typing in any app, and tracks where your day went, with every model running on-device. No cloud, no account, no API keys. Most of the…

15
r/LocalLLaMA community 22h ago

Why can i never stop the looping?

I constantly see people here saying Qwen3.6 35B is amazing, Ornith V1 is amazing, but i cannot use these models at all without severe looping problems. What the hell am i doing wrong?? Temp 0.6 top_p 0.95 top_k 20 min_p 0.05 rep_penalty 1.1 Using Q6 of both models with K/V at…

35
r/LocalLLaMA community 23h ago

I built a desktop AI that scrubs your PII locally before it hits the cloud — here's every feature with real screenshots

Been building this for a few months. It's called Primnox. The core thing: before ANY message leaves your machine, a local DeBERTa NER model runs on-device, finds names/emails/addresses/phone numbers, swaps them for stable placeholders (FIRSTNAME, EMAIL etc), sends the tokens to…
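To illustrate the "stable placeholders" idea only, here is a minimal sketch. It is not Primnox's implementation: it swaps the DeBERTa NER model described in the post for two simple regular expressions, and the numbered placeholder format (`EMAIL_1`, `PHONE_1`) and the `Scrubber` class are assumptions made for the example.

```python
import re

# Hypothetical stand-ins for the NER model: regexes for two entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


class Scrubber:
    """Replace detected values with placeholders that stay stable per value."""

    def __init__(self) -> None:
        self.forward: dict[str, str] = {}  # original value -> placeholder
        self.reverse: dict[str, str] = {}  # placeholder -> original value

    def _placeholder(self, label: str, value: str) -> str:
        # Reuse the existing placeholder so the same value always maps
        # to the same token within a session.
        if value not in self.forward:
            n = sum(1 for p in self.reverse if p.startswith(label + "_")) + 1
            token = f"{label}_{n}"
            self.forward[value] = token
            self.reverse[token] = value
        return self.forward[value]

    def scrub(self, text: str) -> str:
        for label, pattern in PATTERNS.items():
            text = pattern.sub(
                lambda m, lab=label: self._placeholder(lab, m.group(0)), text
            )
        return text

    def restore(self, text: str) -> str:
        # Longest tokens first, so EMAIL_10 is not clobbered by EMAIL_1.
        for token in sorted(self.reverse, key=len, reverse=True):
            text = text.replace(token, self.reverse[token])
        return text


scrubber = Scrubber()
message = "Mail alice@example.com or call +1 555 123 4567; alice@example.com again."
safe = scrubber.scrub(message)
print(safe)
print(scrubber.restore(safe) == message)
```

Keeping the mapping on the local machine is what makes the round trip possible: only the placeholder text would be sent out, and the originals are substituted back into whatever comes back.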

37
r/LocalLLaMA community 1d ago

Ketch - Best Search Tool for local models

Recently I wrote a blog post to find out which search tool would be best for the pi coding agent paired with local models (currently I use Qwen3.6 35B). Before that I was using firecrawl or brave-search, but found them very decent, so I went to SearXNG, which is fine, but lacks some…

38
r/LocalLLaMA community 1d ago

Has anyone tried using llama-server as a backend for multiplayer games or co-op working?

Curious if it’s a viable small scale distributed system. submitted by /u/TheSmashingChamp

7
r/LocalLLaMA community 1d ago

Biggest, baddest model to fill 144GB VRAM + 120GB RAM to the brim, regardless of speed

I'm trying to round out my quiver of daily driver models for my personal harness. Right now I drive qwen3.6 27b for balanced code and gemma4 31b for human interaction with lots of context and a few parallel sessions. Minimax M2.7 at Q6 clocks in at 207gb base and just barely…

5
r/LocalLLaMA community 1d ago

[audio.cpp] VibeVoice 1.5B released — 90-min podcast in 22.95 min, 4.08x real-time, 2.86x faster than Python without quantization. Native C++/ggml

I’m the author of audio.cpp, a C++/ggml runtime for local audio models. I just added VibeVoice 1.5B support and wanted to share the benchmark because long-form multi-speaker TTS is a good stress test for local inference runtimes. Result on RTX 5090: VibeVoice 1.5B Audio length:…

26
r/LocalLLaMA community 1d ago

Claude Code Is Steganographically Marking Requests

submitted by /u/johnnyApplePRNG

21
r/LocalLLaMA community 1d ago

DeepSeek-V4-Flash (MXFP4): compute buffer scales ~3x just from KV cache quant type (f16 vs q8_0) — anyone else seeing this? Llama.cpp

Bartowski's DeepSeek-V4-Flash-MXFP4 GGUF, llama.cpp build 9851 (0eca4d490), deepseek4 arch. Ran the same n_ctx = 10240, same n_ubatch = n_batch = 8192, flash attention on; the only difference is -ctk / -ctv: Cache type Total KV cache (CUDA0) CUDA0 compute buffer f16 (default,…

18
r/LocalLLaMA community 1d ago

Is there an alternative to C-Payne for 100-lane PCIe 5.0 switches? Needed for 8-GPU build.

Sadly Christian is on vacation or something, which is a shame because the C-Payne PCIe gear is the best around. In the meantime I need this to add some urgent compute capacity:…

15
r/LocalLLaMA community 1d ago

Vibe Coding / Agentic workflow

Hey folks. I know that vibe coding is frowned upon pretty solidly here, and I get that, but I’m not a programmer. I just don’t realistically have the time to learn python or C++ to the level I would need to to build some of the things I’d like to create. On a side note, I do…

28
r/LocalLLaMA community 1d ago

Agents-A1 GGUF quants (35B Qwen3.5-MoE agent model) — NVFP4 for Blackwell + working MTP speculative decoding (up to 1.22× single-user, 91% draft acceptance)

Repo → huggingface.co/LordNeel/Agents-A1-GGUF I made GGUF quants of InternScience/Agents-A1 — a 35B Mixture of Experts agent model (Qwen3.5-MoE, ~3B active, 256 experts / 8+1 active, hybrid linear+full attention, 256K context). It's built for long-horizon search, tool-calling,…

27
r/LocalLLaMA community 1d ago

Devs - you have 64gb of VRAM - which model do you use for coding?

I've currently settled on an unsloth version of Qwen 3.5 122b-a10b model (UD-IQ4_NL). With 100k bf16 context window, I only had to load a few layers into CPU/RAM, it runs around 30 tok/sec which is fine for me. I've tested many models, hours of testing but I am currently deeply…

32
r/LocalLLaMA community 1d ago

Meta fights soaring hardware costs by reusing old DDR4 server memory in new DDR5-only servers — custom CXL 2.0 chip marries legacy DDR4-2400 with cutting-edge DDR5-6400

https://www.tomshardware.com/pc-components/dram/meta-fights-soaring-hardware-costs-by-reusing-old-ddr4-server-memory-in-new-ddr5-only-servers-custom-cxl-2-0-chip-marries-legacy-ddr4-2400-with-cutting-edge-ddr5-6400 submitted by /u/pulse77

34
r/LocalLLaMA community 1d ago

Dual RTX 6000, for Deepseek v4 Flash???

My last post, asking 6000 Pro owners if they regretted the purchase, got a lot of interaction; the answer was a hard NO. I came away understanding that dual RTX 6000 Pro runs DeepSeek V4 Flash extremely fast. I went to nearby stores and got offers of around $50-60k for a dual RTX 6000 Pro AI server.…

36
r/LocalLLaMA community 1d ago

This seems like a good REAP of the GLM 5.2 - Down to 290B

The coding scores don't seem to be impacted much, based on the page, but I don't see any GGUF. Does anybody know how to ask the author to generate quantized GGUFs of this REAP? https://huggingface.co/0xSero/GLM-5.2-504B submitted by /u/BoogerheadCult

30
r/LocalLLaMA community 1d ago

What are your experiences with using local AI trained on information about you?

I know people have been talking about creating a “second brain” with local AI trained on personal information, but I’m curious about how that actually played out. What kind of use did you find from having an AI that knows everything about you? I was considering typing out a…

11
