r/LocalLLaMA · June 15, 2026 · 2 min read

Evalatro: an open benchmark where LLMs play the real Balatro

Hey! I made Evalatro - an open benchmark where your LLMs play actual Balatro. Real game.

It started because I kept asking Claude to help me beat levels while playing (yeah, I'm too weak). I'd just throw screenshots at it and ask for tactics.

Then the idea grew into something bigger and I decided to dig a little deeper.

Dug in...

At first I wanted to build an MCP server through mods, but it turns out something like that already exists: balatrobot (respect to the author). And so it began.

The model connects to the game and, on each turn, receives the game state as structured text (not a screenshot), then decides what to play on its own. No tactical hints are given.
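To make the "state as text, model picks the move" idea concrete, here is a minimal sketch of one turn. Everything in it is an assumption for illustration: the state fields, the `render_state` and `parse_action` helpers, and the `play i,j` / `discard i,j` reply format are hypothetical and are not balatrobot's or Evalatro's actual API.

```python
import re

def render_state(state: dict) -> str:
    """Flatten a (hypothetical) game-state dict into plain text for the model."""
    hand = ", ".join(f"{i}:{card}" for i, card in enumerate(state["hand"]))
    return (
        f"Ante {state['ante']} | chips {state['chips']}/{state['target']}\n"
        f"Hands left: {state['hands_left']} | Discards left: {state['discards_left']}\n"
        f"Hand: {hand}\n"
        "Reply with 'play i,j,...' or 'discard i,j,...'."
    )

# Assumed reply format: a verb followed by comma-separated card indices.
ACTION_RE = re.compile(r"^(play|discard)\s+(\d+(?:\s*,\s*\d+)*)$")

def parse_action(reply: str, hand_size: int) -> tuple[str, list[int]]:
    """Validate the model's reply; raise ValueError on anything malformed."""
    match = ACTION_RE.match(reply.strip().lower())
    if not match:
        raise ValueError(f"unparseable action: {reply!r}")
    verb = match.group(1)
    indices = [int(tok) for tok in match.group(2).split(",")]
    # Balatro lets you play or discard at most five cards at once.
    if len(set(indices)) != len(indices) or not 1 <= len(indices) <= 5:
        raise ValueError("need 1-5 distinct card indices")
    if any(i >= hand_size for i in indices):
        raise ValueError("card index out of range")
    return verb, indices
```

The harness would send `render_state(...)` to the model, pass the reply through `parse_action(...)`, and forward only validated moves to the game, so a malformed reply never reaches it.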

What's there already:
- fixed seeds for reproducibility — every model sees the same deals
- the real Balatro + Steamodded + balatrobot
- a live viewer and a public leaderboard
- your run results get sent to a public dashboard at the end of a run (zero private info — no keys, no paths; source is open)
- the score is computed by the server, not the client, so you can't fake it
- the benchmark goal is to clear Ante 12 (picked it kind of arbitrarily, open to debate), not just win the base-game Ante 8
- auto-install on Windows/macOS
- you can watch the model's reasoning (that part's fun) and replay every run
- before a run it sets up a separate game profile with EVERYTHING unlocked so the model isn't limited (your main save is left untouched)
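Two of the points above, fixed seeds and server-side scoring, can be sketched in a few lines. This is an illustration under loud assumptions, not how Evalatro or Balatro work internally: the real game uses its own seeded RNG, and the seed string, the `ante_cleared` event type, and both function names are hypothetical.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "SHDC"

def deal_hand(seed: str, hand_size: int = 8) -> list[str]:
    """Shuffle a fresh 52-card deck with a seed-local RNG and deal from the top."""
    deck = [rank + suit for suit in SUITS for rank in RANKS]
    random.Random(seed).shuffle(deck)  # same seed -> same order, every time
    return deck[:hand_size]

def server_score(events: list[dict]) -> int:
    """Highest ante cleared in order, derived from the event log alone.

    Any score the client reports is ignored. Antes must be cleared 1, 2, 3, ...
    with no gaps, so a log that jumps straight to ante 12 scores 0.
    """
    cleared = 0
    for event in events:
        if event.get("type") == "ante_cleared" and event.get("ante") == cleared + 1:
            cleared += 1
    return cleared

# Two models given the same (hypothetical) seed see the same opening hand.
assert deal_hand("EVAL0001") == deal_hand("EVAL0001")
```

Note that checking the log for gaps only catches lazy cheating. A client could still forge a consistent log, so a stricter server would have to replay the recorded actions against the seed.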

I've only run a couple of models so far, and only briefly, so treat this as poking around, not a ranking. But it's already funny: nobody got anywhere near Ante 12. The current leaderboard leader, mimo-v2.5-pro, crawled to Ante 5. There was also deepseek-v4-pro, which reached Ante 8 but couldn't beat the boss there; I lost those results after a leaderboard update. So the challenge is wide open: come watch the models suffer.

Would love feedback from Balatro players and the LLM crowd: is Ante 12 a sane bar or overkill? What else is worth measuring besides "reached / didn't reach"? How do I close the holes so the bench can't be cheated? I'm not exactly a master at building benchmarks.

PS. I would be endlessly grateful for your stars on GitHub!

Links:
GitHub: https://github.com/alesha-pro/evalatro
Public Dashboard: evalatro.dev

submitted by /u/awfulalexey
