LLMs Play Civ V, Develop Distinct Personalities

What It Is

Researchers have successfully trained large language models to play complete games of Civilization V, revealing that different models develop remarkably distinct strategic personalities. The experiment ran 1,408 full games using a hybrid architecture where the LLM handles high-level strategic decisions while Civ V’s native AI executes the tactical gameplay.

This approach solves a critical problem that plagued earlier attempts: pure LLM implementations couldn’t survive to the endgame. The hybrid method achieves a 97.5% game completion rate even with models as small as 20B parameters. Each game costs approximately $0.86 using OpenRouter pricing, consuming roughly 53,000 input tokens and 1,500 output tokens per turn.
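The article does not give the per-token rates behind that figure, but the per-game cost can be sanity-checked with a back-of-envelope sketch. The token counts below come from the article; the dollar-per-million-token rates and the 300-turn game length are assumptions for illustration:

```python
# Back-of-envelope cost model for one game. Token counts per turn are
# from the article; the per-million-token rates are hypothetical.
INPUT_TOKENS_PER_TURN = 53_000
OUTPUT_TOKENS_PER_TURN = 1_500

def game_cost(turns, usd_per_m_input, usd_per_m_output):
    """Estimated USD cost of a game lasting `turns` turns."""
    per_turn = (INPUT_TOKENS_PER_TURN * usd_per_m_input
                + OUTPUT_TOKENS_PER_TURN * usd_per_m_output) / 1_000_000
    return turns * per_turn

# Example: 300 turns at $0.05/M input and $0.20/M output (both assumed)
estimate = game_cost(300, 0.05, 0.20)
```

At those assumed rates a 300-turn game comes to roughly $0.89, in the same ballpark as the reported $0.86, which suggests the figure is driven almost entirely by input tokens.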

The personality differences emerged clearly across models. OSS-120B exhibited aggressive militaristic tendencies, achieving 31.5% more Domination victories while securing 23% fewer Cultural wins. GLM-4.6 demonstrated more balanced gameplay, splitting focus between military conquest and cultural development. Both models showed an unexpected preference for the Order ideology, selecting this communist-style government structure about 24% more frequently than the Freedom ideology.

Why It Matters

This research demonstrates that AI models possess inherent strategic biases that manifest consistently across thousands of gameplay scenarios. The personality differences aren't random noise; they represent reproducible patterns in how different architectures evaluate risk, prioritize objectives, and respond to competitive pressure.

Game developers gain a new tool for creating more varied AI opponents. Rather than hand-coding different difficulty levels or playstyles, teams can deploy different LLM architectures to generate naturally diverse strategic behaviors. The hybrid approach also suggests a practical pattern for AI integration: let language models handle abstract reasoning while specialized systems manage execution.
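Concretely, that could be as simple as backing each opponent slot with a different model. The mapping below is purely illustrative; the role names and model identifiers are assumptions, not confirmed provider IDs:

```python
# Illustrative only: vary opponent "personality" by backing each slot
# with a different model. Role names and identifiers are hypothetical.
OPPONENT_MODELS = {
    "warmonger": "openai/gpt-oss-120b",  # aggressive, domination-leaning
    "diplomat": "z-ai/glm-4.6",          # balanced military/culture split
}

def model_for_opponent(role, default="openai/gpt-oss-20b"):
    """Pick the backing model for an opponent role, with a fallback."""
    return OPPONENT_MODELS.get(role, default)
```

The point is that playstyle variety falls out of model choice rather than hand-tuned behavior trees.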

The cost structure makes this accessible for research purposes but expensive for casual use. At $0.86 per game, running the full 1,408-game dataset would cost roughly $1,200. This positions the technology somewhere between academic research and commercial deployment: affordable for serious analysis but not yet ready for consumer-scale applications.

The ideological preference for Order over Freedom raises interesting questions about training data bias. Strategy games require balancing multiple competing objectives, and the consistent tilt toward centralized control suggests these models may favor deterministic, top-down approaches over distributed decision-making.

Getting Started

The research uses a two-layer architecture that can be adapted for similar experiments. The basic pattern involves:

```python
# Pseudocode for hybrid LLM-game integration
def play_turn(game_state):
    # LLM layer: high-level strategic decision from the serialized state
    strategic_decision = llm.query(
        context=game_state.serialize(),
        prompt="Analyze situation and set priorities"
    )
    # Native AI layer: turn the strategy into concrete tactical moves
    tactical_actions = native_ai.execute(
        strategy=strategic_decision,
        available_actions=game_state.get_actions()
    )
    return tactical_actions
```
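A minimal driver around this pattern just advances the game turn by turn until it ends. The `game` interface below is a stand-in for whatever the modding framework actually exposes; all names are hypothetical:

```python
# Hypothetical outer loop: play_turn is the per-turn strategy/tactics
# step; the game object's interface is a stub for a real modding API.
def run_game(play_turn, game, max_turns=500):
    """Run until the game reports completion or a turn cap is hit."""
    for _ in range(max_turns):
        if game.is_over():
            return game.result()
        game.apply(play_turn(game.state()))
    return "timeout"  # would count against the completion rate
```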

Developers interested in replicating this work would need access to Civilization V’s API or modding framework, an LLM provider like OpenRouter (https://openrouter.ai), and infrastructure to run extended game sessions. The token consumption pattern suggests each turn requires serializing substantial game state information for the LLM to process.
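A sketch of that serialization step, assuming a compact JSON view of the state assembled into the per-turn prompt (the field names and prompt wording are invented for illustration):

```python
import json

# Hypothetical: compress the game state into a compact textual view for
# the per-turn prompt. Field names are illustrative, not Civ V's API.
def serialize_state(state):
    view = {
        "turn": state["turn"],
        "gold": state["gold"],
        "cities": [c["name"] for c in state["cities"]],
        "at_war_with": state.get("at_war_with", []),
    }
    return json.dumps(view, separators=(",", ":"))

prompt = (
    "You are the strategic layer for a Civilization V player.\n"
    "Game state: "
    + serialize_state({"turn": 42, "gold": 310,
                       "cities": [{"name": "Rome"}, {"name": "Antium"}]})
    + "\nAnalyze the situation and set priorities."
)
```

Even a compact view like this grows quickly; at roughly 53,000 input tokens per turn, the real serialization evidently carries far more detail (map, units, diplomacy, tech state).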

Models in the 20B parameter range appear sufficient for viable gameplay, making this achievable with open-source options like Llama or Mistral variants rather than requiring frontier models.

Context

This hybrid approach contrasts with pure reinforcement learning methods like DeepMind’s AlphaStar for StarCraft II, which learned through millions of self-play games. The LLM method requires far fewer games but depends on the model’s pre-existing strategic reasoning capabilities rather than learning game-specific patterns from scratch.

The 97.5% completion rate represents a major improvement over pure LLM approaches, but the 2.5% failure rate still indicates edge cases where strategic reasoning breaks down. These failures likely occur when the game state becomes too complex for the context window or when the LLM generates invalid strategic directives.
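One plausible mitigation for the invalid-directive failure mode (not described in the study) is to validate the model's output against the set of legal strategies and retry, falling back to a safe default when reasoning breaks down. The strategy names below are invented for illustration:

```python
# Hypothetical guard: reject directives outside the legal set, retry a
# few times, then fall back to a safe default. Names are illustrative.
VALID_STRATEGIES = {"expand", "fortify", "attack", "build_culture"}

def get_strategy(query_llm, retries=3, fallback="fortify"):
    """query_llm() returns the raw model text for this turn."""
    for _ in range(retries):
        directive = query_llm().strip().lower()
        if directive in VALID_STRATEGIES:
            return directive
    return fallback
```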

The personality differences between models suggest that architecture choices, not just training data, influence strategic thinking. This has implications beyond gaming for any domain requiring multi-objective optimization under uncertainty, from resource allocation to project planning.