Evalatro: an open benchmark where LLMs play the real Balatro
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| Hey! I made Evalatro - an open benchmark where your LLMs play actual Balatro. Real game. It started because I kept asking Claude to help me beat levels while playing (yeah, I'm too weak). I'd just throw screenshots at it and ask for tactics. Then the idea grew into something bigger and I decided to dig a little deeper. Dug in... First I wanted to build an MCP through mods, turns out something already exists - balatrobot (respect to the author). And so it began. The model connects to the game and on each turn gets the state as a text structure, not a picture, and decides what to play on its own. No tactical hints. What's there already: I've only run a couple of models so far, just a little, so treat it as poking around, not a ranking. But it's already funny: nobody got anywhere near Ante 12. The leader, mimo-v2.5-pro, crawled to Ante 5. There was also deepseek-v4-pro, which couldn't beat the boss on ante 8, but I lost the results after the leaderboard update. So the challenge is wide open - come watch the models suffer. Would love feedback from Balatro players and the LLM crowd: is Ante 12 a sane bar or overkill? What else is worth measuring besides "reached / didn't reach"? How do I close the holes so the bench can't be cheated? I'm not exactly a master at building benchmarks. PS. I would be endlessly grateful for your stars on GitHub! Links: [link] [comments] |
More from r/LocalLLaMA
-
Palantir CEO rages against closed models
Jul 2
-
SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing.
Jul 2
-
[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode
Jul 2
-
They fit! Mostly.... 2x 3090, Thermaltake Core p3
Jul 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.