
LiveSQLBench
A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks
News

NEW FEATURES: more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values. See the dataset for details.




Introduction
LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, continuously evolving benchmark designed to evaluate LLMs on complex, real-world text-to-SQL tasks. It features diverse real-world user queries, including Business Intelligence (BI) questions, CRUD operations, and more. Each release will include 50 new, fully open-source DBs curated by the BIRD team through expert collaboration and continuous refinement, covering a wide range of database sizes, from end-user level (around 127 columns) to industrial level (1,340+ columns). The key features of LiveSQLBench are:
1. Live Databases: Constructed dynamically from extensive and regularly updated CSV datasets, with both base (end-user level) and large (industrial level, 1,340+ columns per DB) versions to test scalability.
2. Live User Queries and SQL: Each task pairs an unambiguous user query with an annotated, gold-standard SQL statement. Queries are grounded in external knowledge, with medium-to-hard SQL complexity.
3. Contextual Reasoning (HKB): Every DB ships with a hierarchical knowledge base (HKB) in which knowledge entries reference one another, so grounding a question requires multi-hop reasoning. Two HKB formats are provided: (1) structured JSON and (2) unstructured document.
4. The First Full SQL Spectrum: Supports not only SELECT (Business Intelligence) queries but also CRUD queries (e.g., UPDATE, CREATE, and other database management operations).
5. Automated Evaluation: Each question includes verifiable test cases for accurate, reproducible scoring.
6. Truly Live & Hidden Test: New databases and tasks are added over time. Each release has both an open development phase and a hidden test phase; the hidden test set of one release becomes the open development set of the next, ensuring continuous evolution and fair evaluation.
Currently, we have released two versions of LiveSQLBench:
• LiveSQLBench-Base-Lite, containing 18 new end-user level databases with 270 new tasks, featuring HKB-JSON and JSON operations in SQL.
• LiveSQLBench-Base-Full v1, containing 22 new end-user level databases with 600 new tasks, featuring HKB-JSON and JSON operations in SQL, plus more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values.
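To make the HKB-JSON idea concrete, here is a minimal, hypothetical sketch in Python: a tiny knowledge base in which entries reference other entries, plus a helper that chains the definitions the way a multi-hop lookup would. The entry names, field names (`definition`, `depends_on`), and values are invented for illustration and are not the exact schema of the released HKB-JSON files.

```python
# Hypothetical HKB-JSON fragment: each knowledge entry can depend on other
# entries, so grounding a question may require chaining several of them.
hkb = {
    "net_revenue": {
        "definition": "gross_revenue minus refunds",
        "depends_on": ["gross_revenue", "refunds"],
    },
    "gross_revenue": {
        "definition": "SUM(order_items.unit_price * order_items.quantity)",
        "depends_on": [],
    },
    "refunds": {
        "definition": "SUM(refunds.amount) over the same period",
        "depends_on": [],
    },
}

def expand(term, kb, seen=None):
    """Collect every definition reachable from `term` (a multi-hop lookup)."""
    seen = set() if seen is None else seen
    if term in seen:
        return []
    seen.add(term)
    chain = [f"{term}: {kb[term]['definition']}"]
    for dep in kb[term]["depends_on"]:
        chain += expand(dep, kb, seen)
    return chain

print("\n".join(expand("net_revenue", hkb)))
```

A model that reads only the `net_revenue` entry misses the two lower-level definitions it builds on; resolving such chains is exactly the multi-hop grounding the HKB is designed to test.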
LiveSQLBench Leaderboard
Model Types
- • Model Base: Direct SQL generation from natural language queries
- • Agent: Multi-step reasoning with database exploration
- • CLI (coming soon): Command-line interface agentic tools, e.g., Gemini CLI, Codex CLI, Claude Code
Metric: Success Rate, defined as the ratio of the number of tasks passing their test cases to the total number of tasks.
Evaluation Methodology
- • SELECT queries: Compare execution results with the golden SQL's outputs
- • Management SQLs: Verify through comprehensive test cases
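As a rough illustration of how these two checks and the Success Rate metric can be computed, here is a simplified sketch assuming a PostgreSQL connection via `psycopg2` and a list of per-task test-case queries that each return a single boolean value; the official evaluation harness in the repository is the source of truth.

```python
from collections import Counter

import psycopg2

def run(conn, sql):
    """Execute a statement and return its rows (empty list if it returns none)."""
    with conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall() if cur.description else []

def eval_select(conn, predicted_sql, gold_sql):
    # SELECT tasks: compare execution results with the golden SQL's output,
    # treated as an unordered multiset of rows.
    return Counter(run(conn, predicted_sql)) == Counter(run(conn, gold_sql))

def eval_management(conn, predicted_sql, test_case_sqls):
    # Management (CRUD) tasks: apply the SQL, then check database state with
    # test-case queries that are each expected to return a single truthy value.
    run(conn, predicted_sql)
    return all(run(conn, tc)[0][0] for tc in test_case_sqls)

def success_rate(task_passed_flags):
    # Success Rate = number of tasks passing their test cases / total tasks.
    return sum(task_passed_flags) / len(task_passed_flags)
```

In practice each task also needs a fresh or rolled-back database state so that one task's writes do not leak into the next.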
All results below are for the selected 2024-11 — 2026-03 date window.

| Rank | Model | Organization | Success Rate (%) | Avg. Cost (USD) / Task |
|---|---|---|---|---|
| 🥇 1 | ![]() | OpenAI | 31.15 | 0.0225 |
| 🥈 2 | ![]() | OpenAI | 31.15 | 0.0383 |
| 🥉 3 | ![]() | OpenAI | 29.54 | 0.0188 |
| 4 | ![]() | OpenAI | 29.54 | 0.1752 |
| 5 | ![]() | Anthropic | 27.01 | 0.0601 |
| 6 | ![]() | Qwen | 26.90 | 0.0043 |
| 7 | ![]() | DeepSeek | 26.90 | 0.0149 |
| 8 | ![]() | Anthropic | 26.55 | 0.1045 |
| 9 | ![]() | Anthropic | 25.75 | 0.0600 |
| 10 | ![]() | DeepSeek | 23.68 | 0.0045 |
| 11 | ![]() | Qwen | 22.30 | 0.0010 |
| 12 | ![]() | OpenAI | 21.38 | 0.0394 |
| 13 | ![]() | Meta | 18.55 | 0.0014 |
| 14 | ![]() | Meta | 18.05 | 0.0029 |
| 15 | ![]() | Meta | 15.86 | 0.0006 |
| 16 | ![]() | Qwen | 15.75 | 0.0008 |
| 17 | ![]() | Mistral AI | 12.53 | 0.0045 |
| 18 | ![]() | Qwen | 8.16 | 0.0018 |
| 19 | ![]() | Mistral AI | 3.10 | 0.0026 |
| 20 | ![]() | Mistral AI | 2.41 | 0.0012 |
| - | ![]() |  | N/A | 0.0027 |
| - | ![]() |  | N/A | 0.0055 |
| - | ![]() | Qwen | N/A | 0.0088 |
| - | ![]() |  | N/A | 0.0115 |
| - | ![]() | DeepSeek | N/A | 0.0160 |
| - | ![]() |  | N/A | 0.0165 |
| - | ![]() |  | N/A | 0.0185 |
| - | ![]() | OpenAI | N/A | 0.0336 |
| - | ![]() | OpenAI | N/A | 0.0353 |
| - | ![]() | DeepSeek | N/A | 0.0427 |
| - | ![]() |  | N/A | 0.0468 |
| - | ![]() | OpenAI | N/A | 0.0788 |
| - | ![]() | OpenAI | N/A | 0.2283 |
| - | ![]() | OpenAI | N/A | 0.3729 |
| - | ![]() | OpenAI | N/A | 0.4310 |
| - | ![]() | Anthropic | N/A | 0.8936 |
| - | ![]() |  | N/A | - |
| - | ![]() |  | N/A | - |
| - | ![]() | OpenAI | N/A | - |
| - | ![]() | Anthropic | N/A | - |
Note: Success Rate is a micro-average across datasets in the selected date window, weighted by each dataset's sample count. We require full coverage: "Success Rate" is shown only if a model has results for all datasets in the window; otherwise it displays N/A. Models with reasoning ability are marked with a black-and-white logo.
* Agent I is a baseline agent we designed for interacting with the database: (1) all DB context (schema, column meanings, and external knowledge definitions) is provided to the agent up front, before the user question; (2) the agent can execute SQL; and (3) the agent can "submit" its SQL exactly once. The step budget is set to 20 (a minimal sketch of this loop is shown after these notes).
^ Agent II is based on Agent I but WITHOUT the database context provided up front. This suits cases where the database context is too large to fit in the prompt, but it demands stronger agentic reasoning from the model.
*^ Note, however, that both agents have actions that retrieve the complete (ALL) database schema, column meanings, and external knowledge definitions, which is neither suitable nor cost-effective when the database context is very large. To address this, we plan to propose a Mini-Agent that replaces these actions with more cost-friendly alternatives.
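To give a feel for the Agent I loop described above, here is a minimal, hypothetical sketch. The `ask_model` callable, the JSON action format, and the prompt layout are placeholders we invented; only the overall shape (full DB context before the question, SQL execution as an action, a single final submission, and a 20-step budget) follows the description above.

```python
import json

import psycopg2

MAX_STEPS = 20  # step budget described above

def run_agent_i(ask_model, dsn, db_context, user_question):
    """`ask_model(messages) -> str` is a placeholder for any chat-LLM API call."""
    conn = psycopg2.connect(dsn)
    messages = [
        # (1) Full DB context (schema, column meanings, HKB) before the user question.
        {"role": "system", "content": db_context},
        {"role": "user", "content": user_question},
    ]
    for _ in range(MAX_STEPS):
        reply = ask_model(messages)
        action = json.loads(reply)  # e.g. {"action": "execute_sql", "sql": "..."}
        if action["action"] == "execute_sql":
            # (2) Exploratory SQL: run it and feed the observation back to the model.
            with conn.cursor() as cur:
                cur.execute(action["sql"])
                observation = str(cur.fetchall() if cur.description else cur.rowcount)
            messages += [
                {"role": "assistant", "content": reply},
                {"role": "user", "content": observation},
            ]
        elif action["action"] == "submit":
            # (3) A single final submission, which is then scored by the test cases.
            return action["sql"]
    return None  # step budget exhausted without a submission
```

Agent II would be the same loop with `db_context` omitted, so the model has to discover schema and knowledge through its own exploratory SQL.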
LiveSQLBench Data Viewer
Here you can explore LiveSQLBench-Base-Lite examples (DBs, tasks, and HKB-JSON) from our initial release of 270 tasks across 18 end-user level databases. Each task features an unambiguous, straightforward user query grounded in external knowledge, with a medium-to-hard complexity SQL statement.
Preview: Large Version (Industrial Level) DBs and unstructured HKB-Document will be supported in the future LiveSQLBench-Large versions.
Discussion
1. Current Model Performance
LiveSQLBench-Base-Lite evaluates LLMs on PostgreSQL, the most widely used and feature-rich open-source database system. Our benchmark provides Docker-based evaluation environments for easy deployment and reproducibility. We conduct separate evaluations across the model categories above: (1) Model Base, direct SQL generation without external tools, and (2) Agent, models with external tool orchestration (CLI agentic tools will follow). Initial results on Model Base reveal significant challenges, with the best-performing model (o3-mini) achieving a 47.78% success rate. The performance gap between models is notable: a cluster of top models (o3-mini, GPT-4.1, o4-mini, o1-preview, and Gemini 2.5 Flash with thinking) shows capabilities in the 37-45% range, while others still struggle to consistently generate correct SQL queries. This suggests that, while there is improvement at the top end, complex SQL generation remains a difficult task for most current LLMs. The introduction of reasoning-specific models and newer architectures, such as OpenAI's 'o' series and Google's Gemini 2.5, shows promise but highlights the ongoing need for advances in this domain.
2. GPT-5 Early Observations
GPT-5 just arrived and shows distinct behavior on LiveSQLBench. It tends to generate long SQL, averaging 373.2 tokens per query (longest among all models; measured with tiktoken, cl100k_base). It also achieves the highest success rate on DQL questions (category "Query"), indicating strong capabilities in data retrieval and BI analysis tasks.
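The token statistic above is easy to reproduce for any generated SQL string; a small sketch (the SQL literal is just a stand-in, not a benchmark query):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used for the statistic above
sql = "SELECT customer_id, SUM(amount) AS total_spent FROM orders GROUP BY customer_id;"
print(len(enc.encode(sql)))  # token count of the generated SQL
```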
3. Agent Cost Optimization: Mini-Agent
Agent I and Agent II evaluations reveal significant cost implications due to extensive context usage. Actions that retrieve ALL column meanings, the complete schema, and every external knowledge definition produce very long contexts and therefore high costs. For example, Claude 3.7 Sonnet costs 0.0619 USD/task in the Model Base evaluation but 0.8936 USD/task in the Agent I evaluation. Such actions are also unusable when the database context is too large to fit in the prompt. To address this, we are developing a Mini-Agent that replaces the expensive actions with more cost-friendly alternatives: first getting table names, then retrieving column meanings per table instead of all columns at once, significantly reducing context length and cost.
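As a sketch of what such cost-friendly actions could look like against PostgreSQL's standard `information_schema` catalog (the function names and per-table flow mirror the Mini-Agent idea, but this is our own illustration, not the released implementation):

```python
import psycopg2

def list_tables(conn, schema="public"):
    # Cheap first action: only table names, not a full schema dump.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = %s",
            (schema,),
        )
        return [row[0] for row in cur.fetchall()]

def describe_table(conn, table, schema="public"):
    # Second action, invoked only for the tables the agent decides it needs.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name, data_type FROM information_schema.columns "
            "WHERE table_schema = %s AND table_name = %s",
            (schema, table),
        )
        return cur.fetchall()
```

The agent then pays context cost only for the tables it actually drills into, rather than for every column meaning in the database.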
Stay tuned!
Building on this first release, we are developing several new versions of LiveSQLBench:
- • LiveSQLBench-Large-Lite, featuring industrial-scale databases with 1,340+ columns
- • LiveSQLBench-Large-Full, containing the complete large-version DBs and tasks
Additionally, we are expanding to multi-dialect support, starting with SQLite for research purposes, with plans to add more dialects based on community voting.
Each new version will include both open development and hidden test sets, with hidden tests becoming the next version's open development set.
Citation
LiveSQLBench Citation
@misc{livesqlbench2025,
  author       = {BIRD Team},
  title        = {LiveSQLBench: A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks},
  year         = {2025},
  howpublished = {https://github.com/bird-bench/livesqlbench},
  note         = {Accessed: 2025-05-22}
}
BirdBench Citation
@article{li2024can,
  title   = {Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs},
  author  = {Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal = {Advances in Neural Information Processing Systems},
  volume  = {36},
  year    = {2024}
}