Tag

Edge

208 articles archived under #edge · RSS

r/MachineLearning community 1mo ago

Is AI inference platform really that saturated now? [D]

I’m thinking of expanding an on-device inference SDK into a full-blown AI inference platform, and I’m seeing more and more inference platforms popping up. Been talking with a VC from Seattle/NY. Is this space really that saturated? Submitted by /u/kampak212.

35
r/LocalLLaMA community 1mo ago

RAG for developer docs so local llm can code using latest library?

I was wondering whether a local LLM would get better at coding if it had access to the latest documentation through RAG. I'm specifically interested in Python. But then this might lead to ingesting and embedding a very large number of documents. Or I could just focus on…

28
r/LocalLLaMA community 1mo ago

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

Imagine you are using a local model for agentic coding. You discuss the idea (50k tokens), then say “implement it”. The agent reads files, writes files, runs commands, produces another 20k tokens and the code is ready. Then your next prompt is just “thank you”, and... nothing…

6
r/LocalLLaMA community 1mo ago

llama.cpp has a clever trick for speeding up KV cache decode

So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one under developer options. This is the…

23
r/LocalLLaMA community 1mo ago

Is NVIDIA still the default best choice for local LLMs in 2026?

submitted by /u/pmv143

9
r/LocalLLaMA community 1mo ago

Local model doing accounting tasks

So I've been using Qwen 3.6 27B for monthly closes, bank recs, payables and receivables. Built a simple SQLite database that it manages. Anyhow, wanted to post that I integrated Claude skills and the https://github.com/anthropics/financial-services repo. It works well. Just wanted to…

22
r/LocalLLaMA community 1mo ago

club-rdna16: practical 16GB AMD/Radeon local LLM testing repo

Following on from club-5060ti, I’ve been doing some testing with my desktop AMD GPU and wanted to make a similar repo for 16GB Radeon cards. Repo: https://github.com/5p00kyy/club-rdna16 Pages/results: https://5p00kyy.github.io/club-rdna16/ The first test machine is an RX 6900 XT…

24
r/LocalLLaMA community 1mo ago

Gmail tie-ins

Hey folks. I’m looking to set up a way to give a local LLM access to the Google Cloud SDK for Gmail functions. The goal is to have an LLM check a spreadsheet once daily and, based on criteria, send an email that will be structured exactly the same way each time, simply as a…

14
arXiv — Machine Learning research 1mo ago

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

arXiv:2605.20295v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization…

36
arXiv — NLP / Computation & Language research 1mo ago

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

arXiv:2605.20815v1 Announce Type: new Abstract: Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments…

32
r/LocalLLaMA community 1mo ago

24GB M4 Mac - is Qwen 9B only option while system is running?

I have a Mac at work that I want to run a local model on, for prototyping and basic prompts that need to stay on device. What sort of model can I run that fits at least 64k context? Any setups to share or guides are welcome. I need to have Firefox open with at least one tab. Problem I…

6
TechCrunch — AI news-outlet 1mo ago

Stability AI releases a new audio model that can create six-minute songs

Stability Audio 3.0 small model can run on-device and generate two-minute long tracks

21
r/LocalLLaMA community 1mo ago

How accurate can “whichllm” be?

Hello people, I think the question is clear, but I wanted to add some context: I work on internal tools in my job, and some of the tools are for us developers (most tools are for marketing and factory production). I am currently working on a small CLI tool that uses a local model…

12
r/LocalLLaMA community 1mo ago

what non-coding tasks have you gotten a local model to do autonomously?

coding agents are everywhere right now but i'm more interested in models that actually take actions autonomously. we built a small vlm for desktop gui automation. i mostly use it for moving data between apps that don't have apis, saves me a lot of copy pasting. still kinda janky…

11
r/LocalLLaMA community 1mo ago

Audio upscaling, cleanup, or improvement models?

I never see this type of model talked about. Are there many open models in the category? I do a lot of audio cleanup and end up using auphonic but would like to be using a local model. Edit: e.g., voice recovery, reverb removal, auto-EQ type stuff.

5
arXiv — Machine Learning research 1mo ago

R2V Agent: Teaching SLMs When to Ask for Help

arXiv:2605.16604v1 Announce Type: new Abstract: Efficient agentic systems should incur expensive frontier-model costs only on decisions where a cheaper local model is likely to fail. Existing LLM cascades usually route whole queries before execution, but task difficulty shifts…

18
arXiv — NLP / Computation & Language research 1mo ago

Language Acquisition Device in Large Language Models

arXiv:2605.16758v1 Announce Type: new Abstract: Large Language Models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PPT) on synthetic languages has been proposed to close this gap, with prior work emphasizing highly expressive formal languages…

32
arXiv — NLP / Computation & Language research 1mo ago

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

arXiv:2605.18271v1 Announce Type: new Abstract: With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature…

16
r/LocalLLaMA community 1mo ago

What’s your current local LLM setup in 2026?

Hey all, I’ve been trying to get a better sense of what people are actually running locally these days. Curious about your setup: GPU (or CPU if you’re brave), RAM / VRAM, models you use the most, and main use case (coding, chat, agents, etc.). Also, what’s the biggest bottleneck…

24
r/LocalLLaMA community 1mo ago

club-5060ti follow-up: cleaner RTX 5060 Ti local LLM recipes, benchmark explorer, and CUDA GPU compatibility notes

I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then. The project is now a more structured benchmark/recipe repo rather than scattered notes. It has a static results explorer, schema-validated benchmark JSON, clearer…

34
Zed Editor dev-tools 1mo ago

Why and How to Run Local Models in Zed

You can run local AI models in Zed to get better performance and control over your data. Here's how.

33
r/LocalLLaMA community 1mo ago

favorite Agentic Coding Harness

So far, I’ve tried Codex CLI, Claude Code, Gemini CLI, OpenCode, and recently, Pi with local models. Pi is the leanest of them all, with just four tools: read, write, edit, and bash. Its system prompt is under 2K tokens, and it's perfect for local models. I've been trying…

29
The Information — AI news-outlet 1mo ago

Edge Inference Chip Startup SiMa.ai Raising at $1.4 Billion Valuation

Nvidia might be on a tear, but some investors are still convinced that there’s demand for another kind of specialized chips. And they’re putting their money where their mouth is. For example: San Jose, Calif.-based SiMa.ai , which develops chips that work on devices such as…

14
r/LocalLLaMA community 1mo ago

What happens to local LLM if/when LLMs are no longer released for free?

I’m thinking about where this might wind up in 3-5+ years. As others have noted there’s no guarantee that Qwen, Google, and others will continue to release models in the future. Suppose the supply of new LLM models dries up overnight. Whatever is available today, May 2026, is…

6
r/LocalLLaMA community 1mo ago

Is anyone prioritizing code quality checks via a small local model?

Sorry if the title is confusing. What I'm trying to say is that coding agents can write a lot of code very quickly, and it can get messy over time if it isn't checked frequently. Shouldn't there be a tiny local model with a TESTING.md or a QUALITY.md which describes…

14
r/LocalLLaMA community 1mo ago

I built a coding agent that gets 87% on benchmarks with a 4B parameter model, here's how

I was frustrated that every coding agent (OpenCode, Cursor, Claude Code) assumes you're running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart. I find that often tool calls fail, context overflows, multi-step tasks collapse. So I…

12
r/LocalLLaMA community 1mo ago

The power of structured workflows and small local models

A month ago, I experimented with a very basic home-rolled agent loop with a handful of tools and found it worked surprisingly well in spite of how crude it was: https://www.reddit.com/r/LocalLLaMA/comments/1sl7f8e/homerolled_loop_agent_is_surprisingly_effective/ Later, I wrote…

15
r/LocalLLaMA community 1mo ago

Made a simple template manager and GUI for llama.cpp so I don't have to keep memorizing CLI flags.

Introducing Hexllama Hey, I’ve always found llama-server to be more than enough for testing out local models, mostly because it guarantees you always have the absolute latest llama.cpp features and architecture support. But keeping track of different CLI commands, context sizes,…

19
r/LocalLLaMA community 1mo ago

Using Local LLMs for research

Hey there. I am an undergrad who has been doing mostly SWE, but will be doing ML research under my professor over the summer. So I am new to research - I ask not to be judged too harshly. Generally, we will be working on Physics-Informed Neural Networks. I have seen some…

9
r/LocalLLaMA community 1mo ago

LLM Phone Home: Reliable Apps that can deliver inference from local backend

Hello all, I’m wondering what suggestions there are for an iOS app that can serve an OpenAI-compatible endpoint. I am using 3sparks, which works GREAT for that specific use, BUT there is no MCP, no web search, etc. I want to show people that a local model with web search on your…

25
r/LocalLLaMA community 1mo ago

What’s are the best abliterated or uncensored local models that allow financial advice-related questions?

Not trying to get rich quick or anything, but I’m just tired of models refusing to answer questions related to their opinions on money matters or having them be wishy-washy about financial decision making advice. Seems like this can be a blocker with both frontier closed source…

32
r/LocalLLaMA community 1mo ago

I built a self-hosted open-source MCP server that gives any local LLM real financial data — SEC filings, 13F, insider & congressional trades, short data, FRED

One thing missing when running local models as agents: real, current data. So I built Equibles — a self-hosted MCP server that scrapes and serves public U.S. financial data and exposes it as MCP tools, so any MCP-capable client (Claude Code/Desktop, Cursor, or your own…

30
r/LocalLLaMA community 1mo ago

how would you set up a local llm server for a business of 7 people?

Okay, so I've been stalking this sub for some time, and I run the occasional small 2-8B model on my laptop (not the best) for fun. But say my role at a company is to set up a local LLM, since we obviously don't want confidential data going to other companies etc. / main use case…

16
r/LocalLLaMA community 1mo ago

Are the rich RAM /poor GPU people wrong here?

Hello guys, I know everyone has their own definition of local models, but I see two "reasonable" types of frontier local models: a dense one that barely fits in 32GB or 24GB of GPU memory for the most "reasonable" GPU-wealthy people, and an MoE in the 100B-parameter range, the 100ish B billion…

21
r/LocalLLaMA community 1mo ago

Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup

Hi r/LocalLLaMA - I've been paying close attention to the edge AI ecosystem because it's an area where I see huge potential and where I truly believe AI will become more useful for day-to-day tasks. Around the Gemma 4 release I was already experimenting with local AI but the…

18
Hacker News — AI on Front Page community 1mo ago

Show HN: Find the best local LLM for your hardware, ranked by benchmarks

Article URL: https://github.com/Andyyyy64/whichllm Comments URL: https://news.ycombinator.com/item?id=48146369 Points: 224 # Comments: 38

21
r/LocalLLaMA community 1mo ago

What is the most unexpected thing you have gotten a local model to do?

Most local LLM use cases I see are chat, coding, and RAG. But with vision models getting better and faster on consumer hardware, I feel like there is a lot of untapped territory. I got a local VLM to play a board game by just looking at the screen and it worked way better than I…

25
r/LocalLLaMA community 1mo ago

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

In my opinion, MTP models are a 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests. The project was a test: building a full pygame game iteratively, step by step; a small mystery dungeon-style game. At first I set 100-200k context…

28
arXiv — Machine Learning research 1mo ago

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

arXiv:2605.14373v1 Announce Type: new Abstract: Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are…

7
r/LocalLLaMA community 1mo ago

club-5060ti: practical RTX 5060 Ti local LLM notes and configs

I put together a small public repo for RTX 5060 Ti 16GB local LLM setups: I took inspiration from the club-3090 repo, but this one is focused on documenting what we’ve actually tested on 5060 Ti hardware so the setup details are easier to share and reproduce. Current seed setup…

6
r/LocalLLaMA community 1mo ago

A VERY lightweight open web-search tool for smaller local LLMs

Hey everyone, Been playing around with local agent setups lately, mostly Cline/Roo with smaller models, and web search kept annoying me. Not because it doesn’t work, but because it usually throws way too much random page text into the context. small models really don’t handle…

29
r/LocalLLaMA community 1mo ago

Got local Qwen 3.5/3.6 generating meeting summaries entirely offline on an M4 Max. Demo with Wi-Fi off. This is the future.

I'm the founder behind Hedy, an AI meeting app. I'm a huge supporter of Local AI, and we've been working on making it "consumer friendly". Speech recognition in Hedy has always run on-device (whisper.cpp and now also parakeet). What just shipped is that the rest of the AI…

22
r/LocalLLaMA community 1mo ago

Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup?

So I've been going down a rabbit hole lately and I can't find many people actually talking about this specific use case. everyone here runs local LLMs for coding, chat, maybe some creative writing. cool. But what about using it as a proper personal knowledge base? like, dump…

24
r/LocalLLaMA community 1mo ago

The "the future is fictional" problem of many local LLMs

Many local models have a problem (one that arose from excessive RLHF training): they mostly think that everything beyond their knowledge cutoff date is "fictional" or "satirical". To be fair, even the Gemini API without web access can do this sometimes. But it…

20
r/MachineLearning community 1mo ago

Your AI Use Is Breaking My Brain: Why 10 Minutes of Prompting Fries Us[D]

It’s 2:30 AM. My youngest just woke up crying for water, completely derailing my train of thought while I was trying to debug a weird edge case in a side project. I stared at my IDE, then at my local model running in the terminal, then back at the IDE. My brain felt like…

26
r/LocalLLaMA community 1mo ago

Small local model for questions on German grammar

I'm trying to learn German. I use Qwen3.5/3.6 locally, but it is pretty bad at German grammar. Has anyone got a recommendation for a small-ish local model that knows German grammar well and can answer questions on it? EDIT: I give an example output from unquantized Qwen3.5…

38
arXiv — Machine Learning research 1mo ago

A Comparative Study of Federated Learning Aggregation Strategies under Homogeneous and Heterogeneous Data Distributions

arXiv:2605.11010v1 Announce Type: new Abstract: Federated Learning has emerged as a transformative paradigm for collaborative machine learning across distributed environments. However, its performance is strongly influenced by the aggregation strategy used to combine local model…

17
r/LocalLLaMA community 1mo ago

I've seen a lot of folks ask "can local LLMs actually do anything useful?"

And I'm here to share my experience. The answer is resoundingly 'yes'. Let me start with the local model I use every day in my AI harness: embedding models. I'm using an embedding model to give my AI's persistent memory system a semantic search protocol that makes its memory…

37
r/LocalLLaMA community 1mo ago

Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM

Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable. Autocomplete : bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L Agentic : unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL Why these models: Qwen2.5 is still the best model for infill…

9
Smol AI News news-outlet 3mo ago

not much happened today

**Gemma 4** was launched by **Google** under an **Apache 2.0 license**, marking a significant open-model release focused on **reasoning, agentic workflows, multimodality, and on-device use**. It outperforms models 10x larger and has immediate ecosystem support including…

35
