r/LocalLLaMA


r/LocalLLaMA community 1d ago

What are your experiences with using local AI trained on information about you?

I know people have been talking about creating a “second brain” with local AI trained on personal information, but I’m curious about how that actually played out. What kind of use did you find from having an AI that knows everything about you? I was considering typing out a…

11
r/LocalLLaMA community 1d ago

The harness matters more than the model. A 27B behind good critics changed my mind.

I saw someone test Qwen3.6-27B with a 3-critic harness. The harness included code review, test review and Playwright e2e. Each critic had context. The result was that the model is usable for coding work. This matches what I have come to believe from running agents in production.…

20
r/LocalLLaMA community 1d ago

HIP: use hipBLAS for dense prefill on gfx900, keep MMQ for MoE by DEV-DUFORD · Pull Request #24588 · ggml-org/llama.cpp

Overall performance gains: Qwen3.5 4B: +36.1%; Qwen3.6 27B: +18.9%; Gemma4 12B: +65.1%; overall average: ~40%. Only for gfx900-related GPUs: Vega GPU, codename vega10, including Radeon Vega Frontier Edition, Radeon RX Vega 56/64, Radeon RX Vega 64 Liquid, Radeon Pro Vega…

5
r/LocalLLaMA community 1d ago

Running Hunyuan3D Image to 3D Object on an iPhone

submitted by /u/arduinoRPi4

22
r/LocalLLaMA community 1d ago

Benchmarked Graph-RAG vs. Graph-Free Multi-Hop RAG: The graph mostly bought us a massive rebuild bill, not accuracy.

We kept hitting the same wall building multi-hop RAG: the systems with the best accuracy (GraphRAG, HippoRAG 2, RAPTOR) all lean on a knowledge graph built offline. That gives great numbers until the moment your data changes! Every update means re-running an LLM indexing pass…

11
r/LocalLLaMA community 1d ago

I built an autonomous dev pipeline and ran the same project head to head: a 27B local on a modded 4090, then again on cheap cloud LLMs

Hey everyone! I open-sourced something I've been working on called Lullabeast. It's an autonomous dev pipeline. You describe your project and planner, executor, and reviewer agents build it phase by phase against a real git repo. How it came to be: for the last year or so I've…

9
r/LocalLLaMA community 1d ago

PageStorm: A Model Built for Creative Book Writing

Over a year ago, we set out to build a single-turn full-book writing model. Half a year ago, we published our LongPage Dataset for book scale creative writing. Today, we are announcing our first model: PageStorm Research Preview. Paper: https://arxiv.org/abs/2605.17064 Models:…

9
r/LocalLLaMA community 1d ago

TurboOCR v3 — high-speed document OCR server (C++/CUDA), ~520 img/s on RTX 5090

TurboOCR is a self-hosted, high-speed document OCR server, runs fully local. Here's What's New in v3: Speed: Full pipeline now on the newest PP-OCRv6 models (up from v5): ~270 → ~520 img/s on FUNSD (v6 tiny, RTX 5090). Still fully local, HTTP + gRPC. Structured parsing (the main…

33
r/LocalLLaMA community 1d ago

EPYC hybrid system benches and optimal CPU

Finally I've built my semi-budget setup, though not everything went as I expected. First, I purchased an EPYC 9555 QS, but was scammed and the CPU arrived dead. At that time I could only afford a placeholder 9135 with 2 CCDs. That's why I'm interested in inference numbers of people who…

33
r/LocalLLaMA community 1d ago

Well.. it's a step up from nonstop bot spam I guess

submitted by /u/ForsookComparison

27
r/LocalLLaMA community 1d ago

I benchmarked full tool catalog vs ranked catalog on a local model: 8% → 77% accuracy

Been running agents locally for a while and kept hitting the same issue: the more tools I added, the worse the model got at picking the right one. So I finally benchmarked it properly. Setup: qwen3.5-class model on an M4 MacBook, 100 tools in the catalog. One run with the full…

23
r/LocalLLaMA community 1d ago

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization (from the Qwen team)

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear…

9
r/LocalLLaMA community 1d ago

Qwen 3.6 27B Speculative Decoding Bench: Pushing ~100 TPS on a single RTX 3090

First of all, a huge thank you to the r/LocalLLaMA community and the 3090 club. This benchmark started from your shared recipes... These are my findings on my hardware (Xeon E5-2666v3, 64GB RAM, single RTX 3090 24GB) comparing 5 engines (3 llama.cpp forks + mainline + Lucebox)…

12
r/LocalLLaMA community 1d ago

Explaining Attention with Program Synthesis

The same day I discovered Tracr, this paper dropped. Very interesting and potentially accelerates LLM training significantly. The idea of programmable attention seems promising. submitted by /u/Thrumpwart

27
r/LocalLLaMA community 1d ago

Meta secretly tested ChatGPT, Gemini, and Character.AI with thousands of minor-perspective crisis prompts

Meta reportedly had hundreds of contractors pose as minors and send suicide, sex, and drug-related prompts to chatbots from OpenAI, Google, and Character.AI. In a single testing round, more than 45,00.…

21
r/LocalLLaMA community 1d ago

NEW on Hugging Face: Filter by hardware compatibility

submitted by /u/paf1138

13
r/LocalLLaMA community 1d ago

Huawei open-sources OpenPangu-2.0-Flash - 92B total,6B active

https://x.com/Chinazhidx/status/2071877413685109071 TODAY: #Huawei open-sources OpenPangu-2.0-Flash #OpenPangu 2.0 includes two 512K-context models: • Flash: 92B total,6B active—Weights+inference code+training ops released • Pro: 505B total,18B active—flagship model, coming in…

38
r/LocalLLaMA community 1d ago

Bartowski has delivered DS4 GGUF

Looking forward to comparing with Antirez's DS4 imatrix https://huggingface.co/bartowski/DeepSeek-V4-Flash-GGUF submitted by /u/challis88ocarina

31
r/LocalLLaMA community 1d ago

MTP-only GGUF subsets: Qwen3.5/3.6

They are just MTP-only GGUF subsets of Qwen3.5/3.6 Medium/Large (27B and above) models (to accelerate token generation of Qwen-based models without MTP tensors ). But I hope they help experimenting with various Qwen3.5/3.6-based fine-tunes. The reason I originally created some…

30
r/LocalLLaMA community 1d ago

nvidia/Qwen3.6-27B-NVFP4 just dropped

https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4 submitted by /u/vanbukin

37
r/LocalLLaMA community 2d ago

Norm-preserving abliteration on Qwen3.6-35B-A3B: 0% refusal, benchmarks intact, open source dataset

Been reading the mechanistic interpretability literature on refusal for a while now. The core insight from Arditi et al. (2024) is clean: refusal is mediated by a geometrically consistent direction in the residual stream. You can find it via the difference of means between…

4
r/LocalLLaMA community 2d ago

Play poker with Reachy Mini and his friend Eliza - Built in 24hrs

Hey folks, this is my entry into the recent Cerebras x Gemma 24hr Hackathon. It features a full live poker experience with Reachy Mini, his friend Eliza, and a table that's tracked by an orchestrator that guides the whole experience and resolves any showdowns. No buttons or…

11
r/LocalLLaMA community 2d ago

ascend-tribe/openPangu-2.0-Flash (They haven't uploaded it to Huggingface yet)

https://ai.gitcode.com/ascend-tribe/openPangu-2.0-Flash openPangu-2.0-Flash is an MoE model trained on Ascend. The model has 92B total parameters and 6B activated parameters. Its context length is 512k. The total pretraining data contains 34T tokens. During Post-training,…

4
r/LocalLLaMA community 2d ago

Microsoft has taken down fastcontext model from everywhere

I tried to find any reports or news as I was about to do additional testing and noticed the HF page is empty and github page is also removed. https://huggingface.co/microsoft/FastContext-1.0-4B-SFT/tree/main https://github.com/microsoft/fastcontext…

6
r/LocalLLaMA community 2d ago

Tesla V100 16GB local LLMs, single and dual NVLink benchmarks

Picked up a couple of Tesla V100-SXM2-16GB modules a while back to run local models and drive Claude Code fully offline, figured the actual numbers and the traps might save someone else the pain. They've come right down in price and the 16GB of HBM2 at ~900 GB/s still holds up…

33
r/LocalLLaMA community 2d ago

InternScience/Agents-A1 · Hugging Face

Unbelievable benchmarks for a 35B MoE, somebody verify. Here is the tech report btw: https://arxiv.org/pdf/2606.30616 submitted by /u/mlon_eusk-_-

23
r/LocalLLaMA community 2d ago

Why Dario is on fire: lesson from dotcom bubble.

Dotcom bubble (stock surge and crash in early 2000s related to internet tech) did not pop because the Internet is not legit. Instead: the Internet became successful as cheap, easy to use stuff (modems, PCs, email, ebay, WordPress, bulletin boards) that anyone could use for his…

20
r/LocalLLaMA community 2d ago

RAMpocalypse payback

https://www.tomsguide.com/computing/samsung-sk-hynix-micron-anti-trust-lawsuit-ram-prices How can we help Bathaee Dunne LLP to win the case? submitted by /u/Miriel_z

31
r/LocalLLaMA community 2d ago

Anyone using Gemma4:31b over Qwen3.6:27b or 35b(a10)

Using them in opencode. Mainly writing python scripts to set up workflows. I really do like Gemma4 even though it just sometimes doesn’t want to go the extra length. I really have to end up pushing it. It’s like really stubborn or something lol For both Qwen models, they’re…

17
r/LocalLLaMA community 2d ago

How I'm using local models for real-world coding

Just want to share since after many attempts over the past year, I finally have a setup I kinda like and does useful work for me. I only have 32GB of RAM and a 4070 8GB (laptop), just very ordinary hardware. I found that Qwen3.6-35B-A3B runs reliably at about 15 tokens per…

25
r/LocalLLaMA community 2d ago

Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought

Been running Qwen3.6-27B (8-bit) through my coding harness for a few days, alongside GLM5.2. The harness uses 3 critics — code review, test review, Playwright e2e — each with fresh context before accepting output. Qwen3.6 is legit for a 27B dense model. Benchmarks weren't lying.…

19
r/LocalLLaMA community 2d ago

I Hate Dario Amodei, and everything he stands for.

I am so incredibly sick of this guy‘s fear mongering about open source while fundamentally misunderstanding how it actually works. He recently dropped some arguments that are so completely detached from reality, it honestly feels like he’s never even touched a local model in his…

31
r/LocalLLaMA community 2d ago

Introducing LongCat-2.0, a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on Openrouter under the name 'owl-alpha'.

submitted by /u/AnticitizenPrime

18
r/LocalLLaMA community 2d ago

Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!

I've been super impressed with Krea-2-Turbo. It can generate high quality images in ~3 seconds. The quality is quite good compared to other local AI image gen models. Now, I don't want to make you watch or click a YouTube video, so I'll just give these clear instructions on how…

5
r/LocalLLaMA community 2d ago

Ornith 35B works reasonably well with Qwen3.6 35B DFlash speculative model

I saw a solid 30-40% token gen increase from this: ./llama-server --no-mmap --port 8080 --host 0.0.0.0 -kvu -ts 75,70 \ --alias qwen -hf bartowski/deepreinforce-ai_Ornith-1.0-35B-GGUF:Q8_0 -sm layer -c 255000 -cram 0 \ -ctk f16 -ctv f16 -fa 1 --jinja -t 7 --metrics --temp 0.6…

12
r/LocalLLaMA community 2d ago

on Dario’s statement

submitted by /u/turtle-toaster

32
r/LocalLLaMA community 2d ago

It’s time, Sam, it’s time.

I mean….. I’m no CEO…. but it seems like this would be the absolute perfect time to drop a super powerful GPT-OSS-2 to throw a big ol’ wet blanket on Anthropic’s IPO. It doesn’t need to be like frontier or anything, just a 20b and a 120b that is as fast as the old versions, add…

31
r/LocalLLaMA community 2d ago

An NGO for digital freedom of thought

Disclosure: I'm the chairman of this association and we're in the founding process (legal stuff, besides that we're settled). Also: I'm writing this manually, not via AI. Out of respect for this subreddit. I don't mean to spam here, but perhaps the information / opportunities I…

26
r/LocalLLaMA community 2d ago

DeepSeek V4, PR merged into llama.cpp !

The PR: https://github.com/ggml-org/llama.cpp/pull/24162 Everyone, git pull, cmake, and download GGUFs! On your marks, get set, go! submitted by /u/Squik67

4
r/LocalLLaMA community 2d ago

Qwen3-tts.cpp + Compose Desktop GUI

I improved my qwen3-tts.cpp implementation to be about 5x realtime on my RTX 5080. It is GGML based, so it should compile and run anywhere - however I only tested it with CPU & CUDA under Windows & Linux: https://github.com/Danmoreng/qwen3-tts.cpp Additionally I made a Desktop…

13
r/LocalLLaMA community 2d ago

Amodei: "Open Source Models Will Eat Your Children"

submitted by /u/johnnyApplePRNG

35
r/LocalLLaMA community 2d ago

What's the full local AI "doomsday prepper" kit for cold storage? 16-bit safetensors of LLMs (obv), copies/source codes of Llama.cpp, ComfyUI, vLLM, Kobold, LMStudio, etc, macOS, Linux OSes, Windows 10&11, etc, Rufus (including older ones), various VMs, P-E-W's Heretic/Grimoire,…

For those who want to be as paranoid and maximally doomsday prepped as possible, I am curious what the most thorough "doomsday kit" is of things to store offline copies of "just in case", to still be able to use local AI if things go truly crazy to a super extreme level. So far…

23
r/LocalLLaMA community 2d ago

Anthropic's Amodei: "Open Source models [could take us to] a very dangerous place."

submitted by /u/johnnyApplePRNG

4
r/LocalLLaMA community 2d ago

Samsung, SK hynix, Micron Sued in US Over Memory Price Fixing

submitted by /u/johnnyApplePRNG

15
r/LocalLLaMA community 2d ago

Effect of GLM 5.2 !!

All hail Z. Ai submitted by /u/Independent-Wind4462

13
r/LocalLLaMA community 2d ago

Going from single GPU to dual GPU is nice but not in the way I expected

I was expecting that when doubling my VRAM from 24GB to 2x24GB I'd use higher quants with more context, and thus get smarter LLMs, but that's not what ended up happening. At least for coding, I found that the difference in quality from, say, qwen 27B UD-Q4-XL to a Q6 or Q8 is…

21
r/LocalLLaMA community 2d ago

Instead of decentralized training effort we should build the “One dataset”

There are many threads here calling for a united LLM training run of a new open model, mainly after the govt. stunt of banning commercial frontier models, and also due to the lack of small-to-medium open-weight model releases lately. I genuinely believe at some point we’ll have “SETI…

38
r/LocalLLaMA community 2d ago

Bolt Graphics GPU will have 2 DDR5 laptop DIMM slots

They have a few working prototypes, & are aiming for pre-production examples made by end of this year, & full production by Christmas 2027. Interesting specs: 5nm GPU "High performance CPU in GPU" on-card LPDDR5X as primary memory pool 2 DDR5 SODIMM slots for 'spill over'…

38
r/LocalLLaMA community 2d ago

Anyone else end up building a web access layer for local AI agents?

I've been running local models for most of my experiments, and I kept running into the same issue. The model lives locally, but everything it needs to interact with doesn't. Every new agent ended up with another GitHub client, another Reddit integration, another documentation…

10
r/LocalLLaMA community 2d ago

Mellum2 local deployments

Hey local community, I work at JetBrains with the team that trained the Mellum2 models (12B-2.5A LLMs). Those models are trained completely from scratch, targeting fast inference: our primary goal was H100/H200 prod deployments, but local deployments are good as well. We…

37
