LLMs Play Balatro Autonomously via New API Framework
What It Is
A new project called BalatroLLM enables language models to play Balatro - the poker-inspired roguelike deckbuilder - without human intervention. The system consists of two parts: BalatroBot, a game mod that exposes the current game state through an HTTP API, and the BalatroLLM framework itself, which feeds that state to any OpenAI-compatible language model and translates the model's decisions back into game actions.
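One turn of that loop can be sketched with nothing but the standard library. This is a minimal illustration, not the project's actual client: the endpoint paths (`/state`, `/action`), the action format, and the prompt wording are all assumptions; only the OpenAI-style chat request shape is standard.

```python
import json
import urllib.request

# Hypothetical endpoints -- the routes exposed by the BalatroBot mod
# and the framework's action grammar may differ from these.
BOT_URL = "http://localhost:8080"  # assumed BalatroBot HTTP API
LLM_URL = "http://localhost:11434/v1/chat/completions"  # any OpenAI-compatible server


def build_chat_payload(state: dict, model: str = "llama3") -> dict:
    """Wrap serialized game state in an OpenAI-style chat request."""
    prompt = (
        "You are playing Balatro. Current state:\n"
        + json.dumps(state, indent=2)
        + "\nRespond with one action: PLAY <indices> or DISCARD <indices>."
    )
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # One turn: read state, ask the model, send its decision back.
    with urllib.request.urlopen(f"{BOT_URL}/state") as resp:
        state = json.load(resp)
    answer = post_json(LLM_URL, build_chat_payload(state))
    action = answer["choices"][0]["message"]["content"]
    post_json(f"{BOT_URL}/action", {"action": action})
```

The separation matters: the mod only serializes state, so the same HTTP surface can serve any model backend without changes.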
What makes this particularly interesting is the strategy system. Instead of hardcoding game logic, developers define strategies using Jinja2 templates that shape how the model perceives the game state and frames its decision-making process. The same underlying model can exhibit completely different playstyles - conservative, aggressive, or experimental - depending solely on how the prompt template structures the information and decision criteria.
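To make the template idea concrete, here is an illustrative strategy template in Jinja2-style syntax, rendered with a tiny stand-in function. The template text and variable names are invented for this example; the real strategies shipped with BalatroLLM use full Jinja2, which adds loops, conditionals, and filters on top of the simple substitution shown here.

```python
import re

# An invented "conservative" strategy template in Jinja2-style syntax.
CONSERVATIVE = (
    "You need {{ chips_needed }} chips with {{ discards_left }} discards left.\n"
    "Prefer guaranteed scoring hands; only discard when a flush or better "
    "is clearly reachable."
)


def render(template: str, context: dict) -> str:
    """Minimal stand-in for jinja2.Template(...).render(**context):
    substitutes {{ var }} placeholders from the context dict."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(context[m.group(1)]),
        template,
    )


prompt = render(CONSERVATIVE, {"chips_needed": 600, "discards_left": 2})
print(prompt)  # first line: "You need 600 chips with 2 discards left."
```

Swapping in an "aggressive" template that frames the same variables around expected value rather than safety changes the model's behavior with zero code changes, which is the whole point of the design.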
The framework works with local models through Ollama or vLLM, as well as commercial APIs. Models receive serialized game state (current hand, available jokers, chip requirements, remaining discards) and must decide which cards to play, discard, or hold. Each decision gets logged, creating a full trace of the model’s reasoning process.
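A decision trace of that kind is straightforward to represent as JSON lines. The field names below are illustrative, not the framework's actual log schema:

```python
import json
import time


def log_decision(log: list, state: dict, reasoning: str, action: str) -> str:
    """Append one decision to an in-memory trace and return it as a JSON
    line. Field names are hypothetical; BalatroLLM's real schema may differ."""
    entry = {
        "ts": time.time(),       # wall-clock timestamp of the decision
        "state": state,          # serialized game state the model saw
        "reasoning": reasoning,  # the model's stated rationale
        "action": action,        # the action sent back to the game
    }
    log.append(entry)
    return json.dumps(entry)


trace: list = []
line = log_decision(
    trace,
    {"hand": ["KH", "KD", "7C", "2S", "9H"], "chips_needed": 300},
    "Pair of kings scores now; discarding risks missing the blind.",
    "PLAY 1 2",
)
```

Because each line is self-contained JSON, a full run can be replayed or diffed against another model's trace with standard tools.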
Why It Matters
This project demonstrates how far language models have progressed in structured decision-making under constraints. Balatro requires planning across multiple dimensions - immediate hand value, future deck composition, joker synergies, and risk assessment. Unlike chess or Go, which are deterministic, perfect-information games, Balatro layers randomness and hidden draws on top of its strategic depth, making it a genuine test of adaptive reasoning.
The benchmark results at https://balatrobench.com/ reveal significant performance gaps between models. Some consistently reach higher stakes, while others struggle with basic poker hand recognition. These differences aren’t just about raw capability - they expose how models handle probabilistic reasoning, resource management, and multi-turn planning.
For researchers, this provides a reproducible testbed for evaluating model decision-making in a complex but bounded environment. For developers building AI agents, it offers practical patterns for structuring game state, managing API calls, and designing prompt strategies that influence behavior without retraining models.
Getting Started
The repository lives at https://github.com/coder/balatrollm. Setting up requires installing the BalatroBot mod in the game, then configuring the LLM framework to connect to a model endpoint:
1. Configure your model endpoint (Ollama, vLLM, or a commercial API).
2. Edit the strategy templates in the strategies/ directory.
3. Run the bot against your local Balatro instance.
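Of these setup steps, the endpoint configuration is the most mechanical: any OpenAI-compatible base URL works. A minimal reachability sketch, assuming Ollama's and vLLM's default local ports (the `ENDPOINTS` table here is an assumption about your local setup, not part of the framework):

```python
import json
import urllib.request

# Default local ports for common OpenAI-compatible servers:
# Ollama serves on 11434, vLLM's OpenAI server on 8000.
ENDPOINTS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}


def models_url(backend: str) -> str:
    """Build the standard /v1/models listing URL for a known backend."""
    return f"{ENDPOINTS[backend]}/models"


if __name__ == "__main__":
    # Quick sanity check that the endpoint is up before running the bot.
    with urllib.request.urlopen(models_url("ollama")) as resp:
        print([m["id"] for m in json.load(resp)["data"]])
```

Listing `/v1/models` is a cheap way to confirm both that the server is running and which model names the framework can request.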
The strategy templates are where customization happens. A template might emphasize high-value hands, prioritize specific joker combinations, or take calculated risks on marginal plays. Modifying these templates changes model behavior without touching the underlying inference code.
For those curious about model performance, the Twitch stream at https://www.twitch.tv/S1M0N38 runs continuous games with various models. Watching Claude Opus 4.6 deliberate over whether to keep a pair of threes provides unexpected entertainment - and insight into how models weigh competing priorities.
Context
This isn’t the first AI game-playing framework, but most focus on perfect-information games or simpler decision spaces. Balatro’s combination of randomness, deck-building mechanics, and scoring multipliers creates a richer challenge than tic-tac-toe solvers or basic chess engines.
The main limitation is speed. Language models aren’t optimized for real-time gameplay - each decision involves API latency and inference time. This works fine for turn-based games but wouldn’t scale to action-oriented titles. The framework also requires the game to expose its state programmatically, which limits applicability to moddable games.
Alternative approaches like reinforcement learning could potentially achieve higher performance through training, but would lose the interpretability and flexibility of prompt-based strategies. The ability to modify behavior by editing a text template, rather than retraining a model, makes this approach particularly accessible for experimentation.
The broader implication is that language models are becoming viable for complex decision-making in structured environments, not just text generation. As APIs improve and latency decreases, similar frameworks could extend to strategy games, resource management simulations, or other domains where reasoning under uncertainty matters more than reaction time.