LiveSQLBench logo showing a cloud with SQL text inside

LiveSQLBench

A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks

Paper (Coming Soon) · GitHub · Huggingface: LiveSQLBench-Base-Lite · Huggingface: LiveSQLBench-Base-Full v1

News

[09/04/2025]:
🔥🔥🔥 We are pleased to release LiveSQLBench-Base-Full v1, a comprehensive benchmark with 600 NEW tasks over 22 NEW real, complex databases with KB docs.
NEW FEATURES: more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values. See the dataset for details.
[07/28/2025]:
We are pleased to release a SQLite version of LiveSQLBench-Base-Lite, called LiveSQLBench-Base-Lite-SQLite, which extends the benchmark from PostgreSQL to the SQLite dialect to improve accessibility. Please check the GitHub repository and dataset for more details.
[05/30/2025]:
The first version of LiveSQLBench is now available! It contains our initial release, LiveSQLBench-Base-Lite. Download it and test your text-to-SQL LLMs or agents in a contamination-free way!
BIRD Team & Google Cloud

Introduction

LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, continuously evolving benchmark designed to evaluate LLMs on complex, real-world text-to-SQL tasks, featuring diverse real-world user queries, including Business Intelligence (BI), CRUD operations, and more. Each release will include 50 new, fully open-source DBs curated by the BIRD team through expert collaboration and continuous improvement. The benchmark covers a wide range of database sizes, from end-user level (around 127 columns) to industrial level (1340+ columns). Here are the features of the LiveSQLBench benchmark:

  • 1. Live Databases: Constructed dynamically from extensive, regularly updated CSV datasets, with both base (end-user level) and large (industrial level, 1340+ columns per DB) versions to test scalability.
  • 2. Live User Queries and SQL: Each task pairs an unambiguous user query with an annotated, gold-standard SQL statement. The queries are grounded in external knowledge, and the SQL ranges from medium to hard complexity.
  • 3. Contextual Reasoning (HKB): Every DB includes a hierarchical knowledge base (HKB) in which knowledge items reference one another, requiring multi-hop reasoning (see the sketch after this list). Two HKB formats are provided: (1) structured JSON and (2) unstructured documents.
  • 4. The First Full SQL Spectrum: Supports not only SELECT (Business Intelligence) queries but also CRUD queries (e.g., UPDATE, CREATE, and other database management operations).
  • 5. Automated Evaluation: Each question includes verifiable test cases for accurate, reproducible scoring.
  • 6. Truly Live & Hidden Test: New databases and tasks are added over time. Each release features both open development and hidden test phases; the hidden test set from each release becomes the open development set for the next release, ensuring continuous evolution and fair evaluation.
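
To make the HKB idea concrete, here is a minimal sketch of how a structured HKB-JSON entry can reference other entries, so that answering a question requires hopping through a chain of definitions. The field names and the two knowledge items below are hypothetical illustrations, not the dataset's actual schema.

# Hypothetical HKB-JSON entries: each knowledge item may depend on others by id,
# so using "adjusted_revenue" first requires resolving "gross_revenue".
hkb = [
    {"id": 0, "knowledge": "gross_revenue",
     "description": "Sum of order amounts before refunds.",
     "definition": "SUM(orders.amount)", "depends_on": []},
    {"id": 1, "knowledge": "adjusted_revenue",
     "description": "Gross revenue minus refunded amounts.",
     "definition": "gross_revenue - SUM(refunds.amount)", "depends_on": [0]},
]

def resolve(kb, kid, seen=None):
    """Return knowledge item `kid` preceded by everything it transitively depends on."""
    seen = set() if seen is None else seen
    if kid in seen:
        return []
    seen.add(kid)
    item = next(k for k in kb if k["id"] == kid)
    chain = []
    for dep in item["depends_on"]:
        chain += resolve(kb, dep, seen)
    return chain + [item]

for step in resolve(hkb, 1):          # multi-hop: item 0 is pulled in before item 1
    print(step["knowledge"], "->", step["definition"])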

Currently, we have released two versions of LiveSQLBench:
• LiveSQLBench-Base-Lite, containing 18 NEW end-user level databases with 270 NEW tasks, featuring HKB-JSON and JSON operations in SQL.
• LiveSQLBench-Base-Full v1, containing 22 NEW end-user level databases with 600 NEW tasks, featuring HKB-JSON and JSON operations in SQL, plus more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values.
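
As a small, hedged illustration of the "JSON operations in SQL" feature: the PostgreSQL JSONB operators below (->, ->>, ?) are standard, but the table, column, and connection string are hypothetical placeholders, not taken from the benchmark databases, and running the snippet requires a live PostgreSQL instance.

# Hypothetical example of the kind of JSONB operations the tasks exercise.
import psycopg2

query = """
    SELECT profile ->> 'tier' AS tier,       -- extract a JSON field as text
           COUNT(*)           AS n_users
    FROM user_profiles                       -- placeholder table
    WHERE (profile -> 'flags') ? 'beta'      -- JSONB key-existence test
    GROUP BY profile ->> 'tier';
"""

with psycopg2.connect("dbname=example user=postgres") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(query)
        for tier, n_users in cur.fetchall():
            print(tier, n_users)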

LiveSQLBench Leaderboard

Model Types

• Model Base: direct SQL generation from natural language queries
• Agent: multi-step reasoning with database exploration
• CLI (coming soon): command-line agentic tools, e.g., Gemini CLI, Codex CLI, Claude Code

Metric: Success Rate, defined as the ratio of tasks that pass their test cases to the total number of tasks.

Evaluation Methodology

  • SELECT queries: compare execution results with the golden SQL outputs
  • Management SQLs: verify through comprehensive test cases
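
As a rough sketch of how such checks can be implemented (an illustration under our own assumptions, not the benchmark's actual evaluation harness; the connection string and queries are placeholders), SELECT queries can be compared by executing both the predicted and the golden SQL and matching the returned rows as order-insensitive multisets, while management SQLs are applied first and then validated by task-specific test-case queries:

# Sketch: order-insensitive comparison of predicted vs. golden SELECT results.
from collections import Counter
import psycopg2

def run(conn, sql):
    with conn.cursor() as cur:
        cur.execute(sql)
        return [tuple(row) for row in cur.fetchall()]

def select_matches(conn, predicted_sql, gold_sql):
    """True if both queries return the same multiset of rows (row order ignored)."""
    return Counter(run(conn, predicted_sql)) == Counter(run(conn, gold_sql))

# For management SQLs (UPDATE, CREATE, ...), the statement would be executed first
# and the resulting database state then checked by the task's test-case queries.

conn = psycopg2.connect("dbname=example user=postgres")   # placeholder DSN
print(select_matches(conn, "SELECT 1", "SELECT 1"))
conn.close()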
Last Updated: 2025-09-04

Date Range

2024-11 — 2026-03

Selected Datasets (2 datasets, 870 samples):
• LiveSQLBench-Base-Lite (released 2025-05-28): 270 samples
• LiveSQLBench-Base-Full v1 (released 2025-09-04): 600 samples
All results below are computed over the window 2024-11 — 2026-03.

Rank | Model | Organization | Success Rate (%) | Avg. Cost (USD) / Task
🥇 1 | o3-mini | OpenAI | 31.15 | 0.0225
🥈 2 | GPT-5 | OpenAI | 31.15 | 0.0383
🥉 3 | o4-mini | OpenAI | 29.54 | 0.0188
4 | o3 | OpenAI | 29.54 | 0.1752
5 | Claude Sonnet 4 | Anthropic | 27.01 | 0.0601
6 | Qwen3-235B-A22B | Qwen | 26.90 | 0.0043
7 | DeepSeek R1 | DeepSeek | 26.90 | 0.0149
8 | Claude 3.7 Sonnet (Thinking) | Anthropic | 26.55 | 0.1045
9 | Claude 3.7 Sonnet | Anthropic | 25.75 | 0.0600
10 | DeepSeek V3 | DeepSeek | 23.68 | 0.0045
11 | QwQ-32B | Qwen | 22.30 | 0.0010
12 | GPT-4o | OpenAI | 21.38 | 0.0394
13 | Llama 4 Scout | Meta | 18.55 | 0.0014
14 | Llama 4 Maverick | Meta | 18.05 | 0.0029
15 | Llama 3.3 70B Instruct | Meta | 15.86 | 0.0006
16 | Qwen2.5 Coder 32B | Qwen | 15.75 | 0.0008
17 | Codestral 22B | Mistral AI | 12.53 | 0.0045
18 | Qwen2.5 Coder 7B | Qwen | 8.16 | 0.0018
19 | Mistral 7B Instruct | Mistral AI | 3.10 | 0.0026
20 | Mixtral 8x7B Instruct | Mistral AI | 2.41 | 0.0012
- | Gemini 2.0 Flash | Google | N/A | 0.0027
- | Gemini 2.5 Flash | Google | N/A | 0.0055
- | Qwen3-235B-A22B (Agent I)* | Qwen | N/A | 0.0088
- | Gemini 2.0 Flash (Agent II)^ | Google | N/A | 0.0115
- | DeepSeek R1-0528 | DeepSeek | N/A | 0.0160
- | Gemini 2.5 Flash (Thinking) | Google | N/A | 0.0165
- | Gemini 2.0 Flash (Agent I)* | Google | N/A | 0.0185
- | GPT-4.1 | OpenAI | N/A | 0.0336
- | o3-mini (Agent I)* | OpenAI | N/A | 0.0353
- | DeepSeek R1-0528 (Agent I)* | DeepSeek | N/A | 0.0427
- | Gemini 2.5 Pro | Google | N/A | 0.0468
- | o1-mini | OpenAI | N/A | 0.0788
- | o1 | OpenAI | N/A | 0.2283
- | GPT-4o (Agent I)* | OpenAI | N/A | 0.3729
- | o1-preview | OpenAI | N/A | 0.4310
- | Claude 3.7 Sonnet (Agent I)* | Anthropic | N/A | 0.8936
- | Gemma 3 27B | Google | N/A | -
- | Gemini CLI (coming soon) | Google | N/A | -
- | Codex CLI (coming soon) | OpenAI | N/A | -
- | Claude Code (coming soon) | Anthropic | N/A | -

Note: Success Rate is a micro-average across datasets in the selected date window, weighted by each dataset's sample count. We require full coverage: "Success Rate" is shown only if a model has results for all datasets in the window; otherwise it displays N/A. On the leaderboard page, models with reasoning ability are marked with a black-and-white logo.
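
To spell out the aggregation in the note above, here is a tiny sketch (the data structure and the example counts are illustrative) of a sample-count-weighted micro-average with the full-coverage requirement:

# Micro-average success rate over the selected window: weight each dataset by its
# sample count; if a model lacks results for any dataset, report None (shown as N/A).
DATASET_SIZES = {"LiveSQLBench-Base-Lite": 270, "LiveSQLBench-Base-Full v1": 600}

def windowed_success_rate(tasks_passed_per_dataset):
    if set(tasks_passed_per_dataset) != set(DATASET_SIZES):   # full coverage required
        return None
    passed = sum(tasks_passed_per_dataset.values())
    total = sum(DATASET_SIZES.values())                        # 870 samples in this window
    return 100.0 * passed / total

# Made-up counts, only to show the arithmetic: 271 / 870 ≈ 31.15
print(windowed_success_rate({"LiveSQLBench-Base-Lite": 100,
                             "LiveSQLBench-Base-Full v1": 171}))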

* Agent I is a baseline agent we designed for interacting with the database. (1) All DB context (schema, column meanings, and external knowledge definitions) is provided to the agent up front, before the user question; (2) the agent can execute SQL; and (3) the agent can "submit" its SQL once. The maximum number of steps is set to 20.

^ Agent II is based on Agent I but WITHOUT the database context provided up front before the user question. This suits cases where the database context is too large to fit in the prompt, but it demands stronger agentic reasoning from the model.

*^ Note, however, that both agents have actions for retrieving the complete (ALL) database schema, column meanings, and external knowledge definitions, which is neither practical nor cost-effective when the database context is very large. To address this, we plan to propose a Mini-Agent that replaces these actions with more cost-friendly alternatives.
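
To make the two agent settings easier to picture, here is a deliberately simplified sketch of the Agent I loop as described above (Agent II is the same loop without the up-front DB context). The call_llm stub and the action format are placeholders of our own, not the actual baseline implementation.

# Simplified Agent I loop: DB context given up front, SQL execution allowed for
# exploration, a single "submit" action, and at most 20 steps.
MAX_STEPS = 20

def call_llm(messages):                         # placeholder: wire up a real LLM client here
    return {"action": "submit", "sql": "SELECT 1;"}

def run_agent_one(db_context, question, execute_sql):
    messages = [{"role": "system", "content": db_context},   # Agent II omits this message
                {"role": "user", "content": question}]
    for _ in range(MAX_STEPS):
        step = call_llm(messages)
        if step["action"] == "execute":                       # exploratory SQL
            messages.append({"role": "user", "content": str(execute_sql(step["sql"]))})
        elif step["action"] == "submit":                      # only one submission allowed
            return step["sql"]
    return None                                               # ran out of steps

print(run_agent_one("-- schema, column meanings, HKB docs --",
                    "How many users signed up last month?",
                    lambda sql: []))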

LiveSQLBench Data Viewer

Here you can explore LiveSQLBench-Base-Lite examples (DBs, tasks, and HKB-JSON) from our initial release, which features 270 tasks across 18 end-user level databases. Each task features an unambiguous, straightforward user query grounded in external knowledge, with a medium to hard complexity SQL statement.

• 180 SELECT Queries (Base Version)
• 90 Management SQLs (Base Version)
• 360 Avg SQL Tokens (Current Avg)
• 18 Databases (Base Version)

Preview: Large Version (Industrial Level) DBs and unstructured HKB-Document will be supported in the future LiveSQLBench-Full version.

Discussion

1. Current Model Performance

LiveSQLBench-Base-Lite evaluates LLMs on PostgreSQL, the most widely used and feature-rich open-source database system. Our benchmark provides Docker-based evaluation environments for easy deployment and reproducibility. We conduct separate evaluations for two categories: (1) Model Base, direct SQL generation without external tools, and (2) Agent, models with external tool orchestration (with a CLI category coming soon). Initial results on Model Base reveal significant challenges, with the best-performing model (o3-mini) achieving a 47.78% success rate. The performance gap between models is notable: a cluster of top models (o3-mini, GPT-4.1, o4-mini, o1-preview, and Gemini 2.5 Flash with thinking) shows capabilities in the 37-45% range, while others still struggle to consistently generate correct SQL queries. This suggests that, while there is improvement at the top end, complex SQL generation remains difficult for most current LLMs. The introduction of reasoning-specific models and newer architectures, such as OpenAI's 'o' series and Google's Gemini 2.5, shows promise, but highlights the ongoing need for advances in this domain.

2. GPT-5 Early Observations

GPT-5 has just arrived and shows distinct behavior on LiveSQLBench. It tends to generate long SQL, averaging 373.2 tokens per query (the longest among all models; measured with tiktoken, cl100k_base). It also achieves the highest success rate on DQL questions (category "Query"), indicating strong capabilities in data retrieval and BI analysis tasks.
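
For reference, the token counts above use tiktoken's cl100k_base encoding; a measurement in that style (the SQL string below is just an example) looks like:

# Count SQL tokens with tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sql = "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;"
print(len(enc.encode(sql)))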

3. Agent Cost Optimization: Mini-Agent

Agent I and Agent II evaluations reveal significant cost implications due to extensive context usage. Actions such as retrieving ALL column meanings, complete schemas, and external knowledge definitions pull very long contexts into the prompt and drive up cost. For example, Claude 3.7 Sonnet costs 0.0619 USD/task in the model-based evaluation but 0.8936 USD/task in the Agent I evaluation. This approach is also unsuitable when the database context is too large to fit in the prompt. To address this, we're developing a Mini-Agent that replaces expensive actions with more cost-friendly alternatives: getting table names first, then retrieving column meanings per table instead of all columns at once, significantly reducing context length and cost (see the sketch below).
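
Since the Mini-Agent is still in development, the following is only a sketch of the kind of cheaper, incremental lookups it could use; the function names are our own placeholders, while the catalog queries are standard PostgreSQL information_schema lookups and the connection string is a placeholder.

# Cost-friendly exploration: list table names first, then describe only the table
# the agent actually needs, instead of loading the full schema and all column
# meanings into the prompt at once.
import psycopg2

def list_tables(conn):
    with conn.cursor() as cur:
        cur.execute("""SELECT table_name FROM information_schema.tables
                       WHERE table_schema = 'public'""")
        return [r[0] for r in cur.fetchall()]

def describe_table(conn, table):
    with conn.cursor() as cur:
        cur.execute("""SELECT column_name, data_type FROM information_schema.columns
                       WHERE table_schema = 'public' AND table_name = %s""", (table,))
        return cur.fetchall()

conn = psycopg2.connect("dbname=example user=postgres")   # placeholder DSN
print(list_tables(conn))
print(describe_table(conn, "orders"))                      # hypothetical table name
conn.close()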

Stay tuned!

Following the first release, we are developing several new versions of LiveSQLBench:

  • LiveSQLBench-Large-Lite, featuring industrial-scale databases with 1340+ columns
  • LiveSQLBench-Large-Full, containing complete large version DBs and tasks

Additionally, we are expanding to multi-dialect support, starting with SQLite for research purposes, with plans to add more dialects based on community voting.

Each new version will include both open development and hidden test sets, with hidden tests becoming the next version's open development set.

Citation

LiveSQLBench Citation

@misc{livesqlbench2025,
  author       = {BIRD Team},
  title        = {LiveSQLBench: A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks},
  year         = {2025},
  howpublished = {https://github.com/bird-bench/livesqlbench},
  note         = {Accessed: 2025-05-22}
}

BirdBench Citation

@article{li2024can,
  title={Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
For any inquiries or feedback, please contact us at shawnxxh@gmail.com, jl0725@connect.hku.hk, bird.bench25@gmail.com.
Submit feedback on questions in the dataset via this form.