r/LocalLLaMA


r/LocalLLaMA community 10d ago

Do you think dedicated hardware for running local LLMs will become affordable anytime soon?

Models like qwen 27b dense have already proved to be useful coding/general purpose assistants, but the issue is still hardware: even entry-level hardware is relatively expensive. Will we be getting hardware specifically built for inference for consumers at an affordable price…

6
r/LocalLLaMA community 10d ago

I want to love hermes agent, but it looks so ugly, and ux is not nice

I am rechecking hermes agent currently, also because many report great experiences, but oh my, does it look ugly. The web UI uses such ugly fonts and background graphics, and for some reason the UX feels slow and tedious (even in the TUI). Pi mono agent feels quick and fast…

20
r/LocalLLaMA community 10d ago

Leaderboard for quantized models, similar to artificial analysis?

Artificial analysis’ leaderboard for models is somewhat useful for comparing model intelligence, but does not take into account quantization for open models. Is there a way to better compare quantized open models against each other and proprietary models other than running them…

35
r/LocalLLaMA community 10d ago

Agent recommendations

Hi, I have a Strix Halo with 128GB setup that runs a couple of models (GPT-OSS 120b, Qwen3.5-122b, Gemma-4-31b) on llama-swap. GPT and Qwen run quite fast at 40-50T/s, while Gemma is a slow 4-5T/s but seems to have the best quality. I'd like to vibe code a personal Webproject in…

17
r/LocalLLaMA community 10d ago

GLM-5.2 is on DeepSWE

https://deepswe.datacurve.ai/ Side note, why does this sub dislike DeepSWE? I want to know more and did some research and found this post which has since been retracted by the original author (highly respect them as they handled the correction well and admitted bias) Another…

37
r/LocalLLaMA community 10d ago

Your Favorite Workflow to Convert PDF with Complex Structure to Markdown?

I've tried markitdown, Docling, and Mineru. Are there better tools I should try? I need to process tables, floating boxes, etc. Thanks! submitted by /u/chibop1

30
r/LocalLLaMA community 10d ago

For programmers with slow local LLM setup, what's your workflow?

What's your workflow and what's the best way you have found to code with local LLM when your token generation is < 10 tk/sec? submitted by /u/segmond

14
r/LocalLLaMA community 10d ago

Finally seeing benefits of MTP after removing GGML_CUDA_ALLREDUCE

Been fighting this a while, with MTP seeing lows of 17 to sometimes the 30s. Today I dug deep and tried so many different configurations, cmake remakes, you name it. After it all I finally tried removing GGML_CUDA_ALLREDUCE and I finally saw a nice uplift in tps! Just…

36
r/LocalLLaMA community 10d ago

Local LLM Inference Optimization: The Complete Guide

I compiled a year of local LLM experiments into a practical llama.cpp optimization guide, covering VRAM fitting, KV cache, MoE placement, MTP, CPU tuning, and common OOM traps. Pass this to an LLM of your choice and get on the local model train.…

4
r/LocalLLaMA community 11d ago

Not a new model, just a Happy Father's Day and a thank you.

I know this isn't our usual discussion about context windows, quantization, or the latest model drop, but I just wanted to take a quick moment to say thank you. As a dad myself, I really appreciate this great community. Between the daily grind and family life, diving into this…

12
r/LocalLLaMA community 11d ago

Local text to image model comparison: The ultimate test.

I selected 192 prompts to evaluate various text-to-image model capabilities and generated images for all the local models I was able to make work on my GX10 Spark. For instance: Is the model good at text? At faces? At human anatomy? At respecting spatial composition, etc...? You…

4
r/LocalLLaMA community 11d ago

Gemma 4 31B Q6 vs Gemma 4 31B QAT

What should I do? I'm stuck; I've been scrolling Reddit for an hour with no luck. Which will be best overall? Creative writing, mainly. What's the KLD? Help, guys. submitted by /u/Weak-Shelter-1698

13
r/LocalLLaMA community 11d ago

A100 slow Qwen3.6-27B-FP8

Setting up a server for someone who has an A100 80GB. Even though it doesn't natively support FP8, does 43 tps decode sound too low for a single request? For comparison, the exact same vllm config on my RTX 6000 PRO runs the same single-request test at 130 tps. For 8 concurrent…

11
r/LocalLLaMA community 11d ago

Qwen 27B for planning, Qwen 35B-A3B for execution?

My 32GB unified memory setup runs both, though 27B even with MTP is something like 7-10 tok/sec. Usable but not real time by any means. (~18 tok/sec with 35B-A3B) Would it be worth using 27B to plan long horizon tasks, put together the PLAN.md, and have 35B-A3B iterate over it…

14
r/LocalLLaMA community 11d ago

Best local model for vision - 2nd benchmark update - 21 Jun 2026

I previously posted the first results of my VLM benchmark. There were a few useful comments and observations I took into account, to revise and expand my benchmark: I initially did not take into account the Gemma 4 vision budget which defaults to 280, essentially making it…

9
r/LocalLLaMA community 11d ago

Qwen 3.6 27b Abliterated (apostate)

I've been working on a project called Apostate and have finally released my first large model with it on Hugging Face: Qwen 3.6 27B with safety alignment removed, taking the refusal rate down from 92% to 7.6% with minimal impact on the model's capabilities (0.120 KL). Qwen 3.6 27B Apostate…

17
r/LocalLLaMA community 11d ago

I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers!

GitHub: https://github.com/mikechambers84/ik_llama.cpp/tree/numa-mirror Be sure to check out the numa-mirror branch. Sharing this for anyone else who's trying to use their multi-socket CPU systems for inference. I've been wanting a NUMA mirror mode for a long time, so I finally…

14
r/LocalLLaMA community 11d ago

I pretrained and post trained a 500M parameter LLM and 330M parameter Image generator from scratch

Hey folks, hope you are doing well. I started HobbyLM as a side project last month. Initially I wrote an agent harness using the Claude SDK which takes notes on various LLM architectures and does ablation studies to find an optimised or well-fit architecture for this model training, then I…

16
r/LocalLLaMA community 11d ago

What's your local "Haiku" replacement?

Seriously looking for a reliable and fast local Haiku replacement. Basically it should be able to summarize technical stuff, code documentation, architectural descriptions. Any suggestions? Edit: sorry, totally forgot that my local machine is an M4 Max 128GB. But at the same time…

6
r/LocalLLaMA community 11d ago

2× Radeon R9700 — Qwen 3.6 27B Q8 MTP on llama.cpp

There isn't much information around about multi-GPU setups with the R9700, so I'm writing this up in case it helps anyone in the same situation. Here's my setup, the tests I ran, and the numbers from the server logs. Setup ThinkStation P7, Xeon w7-3455, 128 GB RDIMM 2× Gigabyte…

11
r/LocalLLaMA community 11d ago

Tokenomics

submitted by /u/HOLUPREDICTIONS

34
r/LocalLLaMA community 11d ago

ROCm vs Vulkan vs vLLM on Dual R9700's

Just wanted to share these numbers I saw running Qwen3.6 35BA3 and Qwen3.6 27B and the big increase I saw going to vLLM. I was just expecting better concurrency but ended up with a lot better speeds. llama.cpp services Running ROCm and Vulkan Model Backend Gen 35B-A3B Q6_K_XL…

19
r/LocalLLaMA community 11d ago

Can I realistically get close to Claude/Codex capabilities locally?

For context, I have a modest 32GB rig running Nvidia GPUs (5070 Ti + 5060 Ti, the latter over an adapted x4 NVMe slot so not as fast as if I had a motherboard with multiple proper CPU-connected PCIe lanes). I can run the 27B models on it nicely enough, but the bottleneck is…

31
r/LocalLLaMA community 11d ago

Sandboxing code execution for AI agents

For those giving their agents the ability to execute code, how are you sandboxing it? The spectrum seems to be: Docker containers: familiar, decent isolation, but heavyweight for per-request sandboxing microVMs: great isolation, fast boot, but operational complexity WASM:…

5
r/LocalLLaMA community 11d ago

Rollin' MiMo-2.5 on two Halo Strixeses

Twas a very high effort post on two 128GB machines with 8060s, proxmox/containers, usb4net secondary link and a rocm llama.cpp built with a crowbar and a lot of swearing options. Not mentioning the hair pulling while trying to build the other backends. So far 356pp and 15tg,…

24
r/LocalLLaMA community 11d ago

8-16 MI50s Minimax M3 @19 tps TG (peak)

TL;DR Speeds are not too ugly for this old 2018 hardware but imo, not very usable for agentic coding (if you compare with qwen3.6 27B on 8 MI50 @ 50 tps TG 800 tps PP). More concerning is that the reasoning output is very, very long, and I still haven't checked the quality of…

27
r/LocalLLaMA community 11d ago

Claude Will Soon Require Identity Verification

https://support.claude.com/en/articles/14328960-identity-verification-on-claude submitted by /u/Few_Painter_5588

11
r/LocalLLaMA community 11d ago

R9700 abysmal performance, getting desperate

I've been trying to get my 2x R9700 setup to work for the past two weeks. This has been such a time sink I wish I had just gone with nvidia. At this point I'm close to selling the cards. I need vLLM. This is a dedicated setup for multi-user serving. I've tried the…

17
r/LocalLLaMA community 11d ago

I mapped every agent config file (AGENTS.md, CLAUDE.md, llms.txt, .cursorrules, SKILL.md...) and tagged how widely each is actually used

Every tool ships its own magic file now and after a while the names all blur together. I put together a guide to the ones agents actually read and write, with a tag on each for real adoption instead of hype. https://github.com/ItamarZand88/awesome-agent-conventions 21…

22
r/LocalLLaMA community 11d ago

Why is AutoRound being slept on so hard?

Seriously, why is almost nobody talking about AutoRound here? I’ve been experimenting with it on Qwen3.6 27B lately (running an AMD setup), and the perplexity/accuracy retention at low bits absolutely blows standard AWQ or RTN out of the water. Especially for models with complex…

6
r/LocalLLaMA community 11d ago

Watch local LLMs escape the rooms you design

Hello! I'd like to share my repo for WATCH MY ESCAPE: https://github.com/cjami/watch-my-escape It's an inverted escape room game where you design the maps and LLMs have to try to escape them. It uses traditional action verbs (e.g. push, pull, pick-up) to interact with the…

34
r/LocalLLaMA community 11d ago

GLM-5.2 benchmarked on DeepSWE: Beats Gemini & GPT-5.4, but the token volume/cost makes it wildly inefficient? (Theo - t3.gg)

Saw this breakdown from Theo (t3.gg) on X showing the latest DeepSWE leaderboard stats for the new GLM-5.2 open-weight model. The good news: it's officially surpassing GPT-5.4 and the entire Gemini lineup in raw coding capability. Seeing an open-weight model punch that high is…

15
r/LocalLLaMA community 11d ago

Gemma 4 QAT seems to respond significantly better to KV cache quantization

KLD on wikitext with 16k context. My hardware isn't up to testing 31B; if anyone else feels like investigating, it would be interesting. submitted by /u/rima_2711

16
r/LocalLLaMA community 11d ago

Vercel CEO: "Almost shocked" by how good GLM-5.2 is at coding

Guillermo Rauch (Vercel CEO) says he is "genuinely impressed, almost shocked" by GLM-5.2's coding performance. What has your experience with GLM-5.2 been so far? Source: X post. submitted by /u/BuildwithVignesh

20
r/LocalLLaMA community 11d ago

Qwen is never going to open source Qwen 3.7, are they?

Well, this was predictable. After Qwen fired Junyang Lin, the next models are no longer open source. Labs that have released open source models more recently than Qwen: GLM-5.2, 2026-06-17 Kimi-K2.7-Code, 2026-06-12 MiniMax-M3, 2026-06-11 Step-3.7-Flash, 2026-05-29…

15
r/LocalLLaMA community 11d ago

AllenAI releases MolmoMotion vision models for predicting future motion based on short frame history

AllenAI just released two models in the MolmoMotion family: https://huggingface.co/allenai/MolmoMotion-4B-H3-F30 https://huggingface.co/allenai/MolmoMotion-4B-H1-F32 MolmoMotion is a 4B vision-language model that forecasts 3D point trajectories under natural-language action…

30
r/LocalLLaMA community 11d ago

[NEW MODEL] SupraLabs started the Any2Any model family!

SupraLabs Supra-A2A-Nano-Exp - ~30M Any-to-Any Multimodal Transformer Status: Experimental / Educational Prototype 🚀 Overview Supra-A2A-Nano-Exp is a ~30M parameter autoregressive Transformer that unifies text, image, and video into a single token stream. There are: - No…

7
r/LocalLLaMA community 11d ago

What are you overengineering that nobody's ever going to use? Be honest.

Be honest. submitted by /u/johnnyApplePRNG

29
r/LocalLLaMA community 11d ago

Best image vision model runnable on RTX 6000 Pro

I'm looking at running OCR and classification on old historical scanned documents (some dating back to the 1950s). What are the current best open-source vision-enabled models runnable on an RTX 6000 Pro? Note: I've used Gemma 4 31B and have had good success with it. It's…

20
r/LocalLLaMA community 11d ago

What are people doing with their local models and what tools do you use them with?

I am trying to come up with some more uses for my DGX Sparks. Curious which tools work best for things like coding as well. What do you use instead of things like the claude.ai web interface? I have played with OpenWebUI but it just doesn't seem as capable without a lot of…

31
r/LocalLLaMA community 11d ago

What happens when they stop subsidizing LLM subscriptions?

We are literally burning through VC money like crazy with our coding subscriptions. I read the $200 Anthropic sub gets you $8000 worth of API calls. It's obvious that this doesn't hold for very long but what happens when they raise prices? The reason to keep the prices low for…

24
r/LocalLLaMA community 11d ago

It’s time to decentralize model distribution! Introducing Noema Atlas

TL;DR: Noema Atlas is a peer-to-peer network software using Iroh for local LLM weights, free and open source (Apache-2.0). Models come from whichever peers have them, with Hugging Face and mirrors as fallback (opt-in). Every file is identified by its content hash and a signed…

38
r/LocalLLaMA community 11d ago

Anyone running MiniMax M3 - pipenetwork Mixed 3_6 Quant?

Asking for a friend... who is challenged with 'only' 256GB unified RAM. submitted by /u/PracticlySpeaking

14
r/LocalLLaMA community 12d ago

GLM 5.2, what speeds are we getting locally?

Can everyone who is able to run GLM 5.2 locally report their inference engine, system specs, quantization, context size, and tokens/sec? If you're getting great numbers, expect follow-up questions. I'll start: llama.cpp, 6x RTX 3090, 128 DDR5, i7-13700K, unsloth UD-IQ2_M,…

15
r/LocalLLaMA community 12d ago

Six months ago I turned down $8,165 for an RTX 6000 PRO. Today the same vendor is selling them for $11,575. Oh, hindsight.

submitted by /u/__JockY__

34
r/LocalLLaMA community 12d ago

Qwen code companion on vscode marketplace - thoughts

I came across this extension in vscode a few days ago and tried it with LM Studio hosted models, and it really is pretty good compared to `continue`, `kilo`, `cline`, `roo`. I felt that without many tweaks it gets straight to the point, and if any tweaks are required you could do…

36
r/LocalLLaMA community 12d ago

Gemma 4 26b a4b is genuinely the best model I have tried for language learning and scientific queries!

I know gemma 4 26b is (according to this sub) a bit behind for coding tasks but for language learning and scientific (health/biology/medical/clinical/biochem) queries it’s unbeaten even by Qwen 3.5/3.6. Since the competition in the small MOE models is generally between Qwen…

28
r/LocalLLaMA community 12d ago

I wrote a free 15-part series on LLM internals — real math, real tensor shapes, real hardware constraints. All grounded in Gemma 4 12B's actual config.

If you run open-source models and want to understand what's actually happening under the hood — I spent the last few months writing a 15-part series that covers the full stack from tokenization to production serving. Most articles are grounded in Gemma 4 12B as the running…

19
r/LocalLLaMA community 12d ago

Board where every tile is an agent

I've been hacking on a project which I find extremely useful and wanted to share. Imagine a board where every tile is an agent whose job is to maintain the tile. I tried to illustrate the idea with a video here. The project is open source on GitHub and you can also try it out here…

36
r/LocalLLaMA community 12d ago

Deep Neural Network that can turn any Image into a Playable Game! BUT LOCALLY, NOT ON DATACENTER

Hi everyone!! I really wanted to share the research I've been working on. I wanted to build a neural network that can simulate games, or at least start doing that. Most video generators are too large to run on consumer hardware in real time, so I designed a model that does this from…

14
