I built an autonomous dev pipeline and ran the same project head to head: a 27B local on a modded 4090, then again on cheap cloud LLMs
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hey everyone! I open-sourced something I've been working on called Lullabeast. It's an autonomous dev pipeline. You describe your project, and planner, executor, and reviewer agents build it phase by phase against a real git repo.

How it came to be: for the last year or so I've been trying to standardize a process for building, and I kept finding success with plan, execute, review loops, so I started building a system around that. Every time I hit a pain point I'd try to address it in the rules. But at some point the prompts weren't enough on their own, so I started looking at how to build this into an actual pipeline. After a few attempts, OpenClaw was the first runtime I could get working the way I needed.

I wanted to show how this actually performs, so I had it build a multi-team version of Conway's Game of Life with live analytics, and ran the same roadmap through the pipeline twice:

**Local** (modded 48GB RTX 4090, Qwen3.6-27B Q8_0, planner + executor used MTP, reviewer was non-MTP)

**Cloud** (GLM-5.2 planner, Kimi-k2.7 Code executor + reviewer)

*Pro life tip: You can save a lot on API bills if you just buy a regrettably expensive GPU lol*

Both builds are live, so check them out and tell me which one you like better. I know which one I'd pick but I want to hear yours: https://lullabeast.ai/living-proof

The secret sauce of the pipeline is the deterministic gates that sit between the agent calls. These models fail in predictable ways. They delete files randomly, drift off the spec, and say they're done without ever running the tests. So at every handoff, a gate has to pass before anything moves forward, no LLM involved. The gates check the file manifest, the git diff, the test results, and whether anything got deleted that shouldn't have. The gates run the show, so an agent never gets to advance on its own say-so. I added multiple retries so you don't have to babysit it, but once the agents use up all their retries, the pipeline escalates instead of spinning endlessly.
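To make the gate idea concrete, here is a minimal sketch of a deterministic check plus a retry-then-escalate loop. This is not Lullabeast's actual code: `HandoffResult`, `GateSpec`, `check_gate`, and `run_phase` are hypothetical names, and the sketch assumes the file lists and test status have already been collected (for example, the deleted set could come from something like `git diff --name-only --diff-filter=D`).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class HandoffResult:
    """What an agent hands off at the end of an attempt (hypothetical shape)."""
    files_present: Set[str]
    files_deleted: Set[str]
    tests_ran: bool
    tests_passed: bool


@dataclass
class GateSpec:
    """What the phase must satisfy before the pipeline moves on (hypothetical)."""
    required_files: Set[str]
    protected_files: Set[str] = field(default_factory=set)


def check_gate(spec: GateSpec, result: HandoffResult) -> List[str]:
    """Return failure reasons; an empty list means the gate passes.

    Pure set and boolean logic, so the same input always gives the same verdict.
    """
    failures: List[str] = []
    missing = spec.required_files - result.files_present
    if missing:
        failures.append(f"missing files: {sorted(missing)}")
    deleted = spec.protected_files & result.files_deleted
    if deleted:
        failures.append(f"protected files deleted: {sorted(deleted)}")
    if not result.tests_ran:
        failures.append("tests were not run")
    elif not result.tests_passed:
        failures.append("tests failed")
    return failures


def run_phase(
    spec: GateSpec,
    attempt_fn: Callable[[int, List[str]], HandoffResult],
    max_retries: int = 3,
) -> str:
    """Call the agent up to max_retries times; the gate decides the outcome.

    attempt_fn stands in for an agent call and receives the previous failure
    reasons so it can try to correct them. Returns "advance" or "escalate".
    """
    failures: List[str] = []
    for attempt in range(1, max_retries + 1):
        result = attempt_fn(attempt, failures)
        failures = check_gate(spec, result)
        if not failures:
            return "advance"
    return "escalate"
```

The point of the design is that the agent's own claim of success never appears in `check_gate`: only observable state does, and a bounded loop guarantees the phase ends in either an advance or an escalation.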
The agents run inside OpenClaw as the runtime. No frontier models anywhere in the loop, just cheap open and local ones.

Honestly speaking, it's an early beta. It does well on small, focused webapps. Push it toward something too big or complex and more issues show up. UI-heavy phases are also where it struggles the most when you run fully local. It also executes agent-written code on your host, so I suggest running it in a VM (that's what I do).

Mostly I'm putting this out to find where it breaks, so I'd really value your feedback. If there's something obvious I'm missing, or an easy way to make this better, I want to hear it. You all actually run this stuff, so your insight is exactly what I'm after. Tell me what you'd change.

Repo: https://github.com/bigbraingoldfish/lullabeast

Site: https://lullabeast.ai (there's a click-through walkthrough of the dashboard on there if you want to see it work before installing anything)