
Agentic Text-to-SQL Benchmark Tests LLM Database Skills

A comprehensive benchmark evaluates large language models' ability to convert natural language queries into accurate SQL statements for database interactions.

What It Is

A new agentic text-to-SQL benchmark tests how well language models convert natural language queries into working SQL code. The benchmark at https://sql-benchmark.nicklothian.com/ presents 25 questions that require models to translate English requests into SQL statements, execute them against database tables, and debug any errors within a limited number of attempts.

The agent architecture mirrors real-world database interaction. When given a query like “Show order lines, revenue, units sold, revenue per unit (total revenue ÷ total units sold), average list price per product in the subcategory, gross profit, and margin percentage for each product subcategory,” the model must generate SQL, examine the results, and iterate to fix problems. This tests not just code generation but also error recovery and logical reasoning about data structures.
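The generate-execute-debug loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: `ask_model` is a hypothetical stand-in for any chat-completion call, and the retry cap mirrors the benchmark's limited debugging rounds.

```python
import sqlite3

MAX_ATTEMPTS = 3  # the benchmark similarly caps debugging rounds


def run_agent(question, conn, ask_model):
    """Generate SQL, execute it, and feed errors back for repair.

    ask_model(prompt) -> str is a hypothetical LLM call; any
    chat-completion client could fill this role.
    """
    prompt = f"Write a SQL query for: {question}"
    for _ in range(MAX_ATTEMPTS):
        sql = ask_model(prompt)
        try:
            # Execute against the database and return on success.
            rows = conn.execute(sql).fetchall()
            return sql, rows
        except sqlite3.Error as exc:
            # Re-prompt with the error message, mirroring the
            # benchmark's error-recovery step.
            prompt = (f"The query:\n{sql}\nfailed with: {exc}\n"
                      f"Fix it. Original question: {question}")
    raise RuntimeError("out of attempts")
```

The key design point is that the model sees the database error verbatim, so the loop tests error recovery rather than one-shot generation.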

The benchmark runs in under five minutes for most models, making it practical for testing different configurations and comparing approaches. A WASM version of Llama.cpp enables local testing against custom servers.

Why It Matters

Text-to-SQL remains one of the most practical applications of language models in enterprise settings. Databases power critical business operations, but writing SQL requires specialized knowledge that many analysts and business users lack. Effective text-to-SQL systems could democratize data access without compromising query accuracy.

The benchmark reveals surprising performance patterns among current models. Kimi-k2.5, Qwen 3.5 397B-A17B, and notably the smaller Qwen 3.5 27B model lead the open-source field. The strong showing from a 27B parameter model challenges assumptions about the correlation between model size and specialized task performance.

That NVIDIA’s Nemotron-Cascade-2-30B-A3B outperforms Qwen 3.5-35B-A3B despite a similar parameter count suggests that architecture and training approach matter as much as raw scale. The Mimo v2 Flash model also punches above its weight class, indicating that focused optimization for code generation tasks can yield outsized returns.

These results matter for teams evaluating which models to deploy for database applications. The benchmark’s speed enables rapid iteration on prompt engineering, system prompts, and agent configurations without expensive compute costs.

Getting Started

The benchmark runs directly in the browser at https://sql-benchmark.nicklothian.com/. Developers can select from pre-configured models or test against local instances.

For local testing with Llama.cpp, the benchmark supports WASM execution:

# Example: point the benchmark at a local Llama.cpp server.
# The command below is illustrative; check your llama.cpp build's docs.
llama-server -m ./model.gguf --host 127.0.0.1 --port 8080
# Then enter the local endpoint as the custom server in the benchmark UI,
# configure retry limits and temperature, and run the 25-question suite.

The 25 questions cover common SQL patterns: aggregations, joins, subqueries, window functions, and complex calculations. Each question tests whether the model can generate syntactically correct SQL that returns accurate results.
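To make the "revenue per unit" style of question concrete, here is an illustrative aggregation with a derived metric, run against a made-up schema (the table and column names are assumptions, not the benchmark's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines (
    subcategory TEXT,
    units       INTEGER,
    revenue     REAL
);
INSERT INTO order_lines VALUES
    ('Road Bikes', 2, 3000.0),
    ('Road Bikes', 1, 1400.0),
    ('Helmets',    5,  250.0);
""")

# Aggregation with a derived metric:
# revenue per unit = SUM(revenue) / SUM(units) per subcategory.
query = """
SELECT subcategory,
       SUM(units)                      AS total_units,
       SUM(revenue)                    AS total_revenue,
       SUM(revenue) * 1.0 / SUM(units) AS revenue_per_unit
FROM order_lines
GROUP BY subcategory
ORDER BY subcategory
"""
for row in conn.execute(query):
    print(row)
```

A model must get both the grouping and the ratio right; computing the ratio per row instead of over the group totals is a classic failure mode these questions can catch.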

Teams should test models with their actual database schemas and query patterns. The benchmark provides a baseline, but production performance depends on domain-specific terminology, table complexity, and query sophistication.

Context

Existing SQL benchmarks like Spider and WikiSQL focus primarily on single-shot generation accuracy. This agentic approach better reflects real-world usage where developers iterate on queries based on results and error messages.

The benchmark’s brevity trades comprehensive coverage for practical utility. Twenty-five questions cannot capture every SQL edge case, but the quick runtime enables experimentation that longer benchmarks discourage. Version 2 could expand question diversity while maintaining fast execution.

Alternative approaches to text-to-SQL include fine-tuned models trained specifically on SQL datasets, retrieval-augmented generation with schema examples, and hybrid systems combining rule-based parsing with LLM flexibility. Each approach involves different tradeoffs between accuracy, latency, and maintenance overhead.
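The retrieval-augmented variant mentioned above can be sketched as a prompt builder that pulls relevant table definitions into context before asking the model. The keyword-overlap scoring here is deliberately naive and purely illustrative; a production system would use an embedding retriever.

```python
def build_prompt(question, schemas):
    """Select schema snippets that share words with the question.

    schemas: dict mapping table name -> CREATE TABLE text.
    Naive keyword overlap stands in for a real embedding retriever.
    """
    q_words = set(question.lower().split())

    def score(ddl):
        # Count question words appearing in the DDL text.
        ddl_words = set(ddl.lower().replace("(", " ").replace(",", " ").split())
        return len(q_words & ddl_words)

    # Keep the two highest-scoring table definitions as context.
    relevant = sorted(schemas.values(), key=score, reverse=True)[:2]
    context = "\n".join(relevant)
    return f"Schema:\n{context}\n\nWrite SQL for: {question}"
```

The tradeoff is visible even in this sketch: retrieval keeps the prompt short on wide schemas, but a missed table means the model cannot possibly write the right join.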

The benchmark’s limitations include its fixed question set and single database schema. Production systems must handle varied schemas, ambiguous queries, and domain-specific terminology. The debugging round limit may not reflect real-world patience thresholds where users might iterate extensively.