
LiveSQLBench

A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks


News

[04/04/2026]:
🖥️ We release LiveSQLBench-CLI, an evaluation framework for benchmarking CLI-based agents (OpenHands, Claude Code, Aider, etc.) on LiveSQLBench tasks via terminal interactions. Supports both base-lite and base-full-v1 datasets. First batch of results: 6 models via OpenHands across Base-Lite and Base-Full v1. Check the README for details.
[04/04/2026]:
🚀 We release LiveSQLBench-Agent, a Google ADK-based text-to-SQL agent framework with multi-provider LLM support, per-task DB isolation, and parallel execution. Check the README for details.
[03/05/2026]:
🔥🔥🔥 We release the Data Intelligence Index, a comprehensive evaluation of frontier AI Models and Agents on data-centric intelligence across various aspects: DB querying, BI analysis & data manipulation, DB application debugging, human-centric interaction, digital, data science, and more.
[03/02/2026]:
🔥🔥🔥 We are pleased to release LiveSQLBench-Large-v1, the industrial-scale counterpart with 18 databases (~1K columns each) and 480 tasks.
NEW FEATURES: 10x schema complexity, ~84K avg prompt tokens for long-context challenge, and Business Rule Drift for live context-learning evaluation.
[02/26/2026]:
🔥 Thrilled to have our BIRD-Interact, based on LiveSQLBench, accepted at ICLR 2026 (Oral)!
[09/04/2025]:
We are pleased to release LiveSQLBench-Base-Full v1, a comprehensive benchmark with 600 NEW tasks over 22 NEW real, complex databases with KB docs.
NEW FEATURES: more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values. See the dataset for details.
[07/28/2025]:
We are pleased to release a SQLite version of LiveSQLBench-Base-Lite, called LiveSQLBench-Base-Lite-SQLite, extending from PostgreSQL to SQLite dialect to improve accessibility. Please check the GitHub repository and dataset for more details.
[05/30/2025]:
The first release of LiveSQLBench is out! It contains our initial version, LiveSQLBench-Base-Lite. Download it and test your text-to-SQL LLMs or agents in a contamination-free way!
BIRD Team & Google Cloud

Introduction

LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, continuously evolving benchmark designed to evaluate LLMs on complex, real-world text-to-SQL tasks, featuring diverse real-world user queries, including Business Intelligence (BI), CRUD operations, etc. Each release will include 50 new, fully open-source DBs curated by the BIRD team through expert collaboration and continuous improvement. It will cover a wide range of database sizes, from end-user level (around 127 columns) to industrial level (1340+ columns). Here are the features of the LiveSQLBench benchmark:

  1. Live Databases: Constructed dynamically from extensive and regularly updated CSV datasets, with both base (end-user level) and large (industrial level, 1340+ columns per DB) versions to test scalability.
  2. Live User Queries and SQL: Each task pairs an unambiguous user query with an annotated, gold-standard SQL statement. The user queries are grounded in external knowledge, and the SQL statements are of medium to hard complexity.
  3. Contextual Reasoning (HKB): Every DB includes a hierarchical knowledge base (HKB) in which each knowledge entry is linked to others, so answering a query requires multi-hop reasoning. Two HKB formats are provided: (1) structured JSON format, and (2) unstructured Document format.
  4. The First Full SQL Spectrum: Supports not just SELECT (Business Intelligence) queries but also CRUD queries (e.g., UPDATE, CREATE, and other database management operations).
  5. Automated Evaluation: Each question includes verifiable test cases for accurate, reproducible scoring.
  6. Truly Live & Hidden Test: New databases and tasks are added over time. Each release features both open development and hidden test phases. The hidden test set from each release becomes the open development set for the next release, ensuring continuous evolution and fair evaluation.
  7. Business Rule Drift (Live Context-Learning): Business rules embedded in external knowledge can change across releases, requiring models to adapt to updated context rather than rely on memorized patterns. This evaluates live context-learning ability under evolving real-world conditions.

We have currently released three versions of LiveSQLBench:

  • LiveSQLBench-Base-Lite, containing 18 NEW end-user level databases with 270 NEW tasks, featuring HKB-JSON and JSON operations in SQL.
  • LiveSQLBench-Base-Full v1, containing 22 NEW end-user level databases with 600 NEW tasks, featuring HKB-JSON and JSON operations in SQL, plus more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values.
  • LiveSQLBench-Large-v1, the industrial-scale counterpart containing 18 databases (~1K columns, ~54 tables each) with 480 tasks, featuring 10x schema complexity over Base-Full-v1, ~84K avg prompt tokens for long-context challenge, and Business Rule Drift for live context-learning evaluation.

LiveSQLBench Leaderboard

Model Types

Model Base

Direct SQL generation from natural language queries

Agent

Multi-step reasoning with tools, including database exploration, predefined actions, and CLI environments.

Metric: Success Rate. Defined by the ratio of the number of tasks passing the test cases to the total number of tasks.
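As a minimal sketch of the metric above, the snippet below computes Success Rate from a list of per-task pass/fail flags. The function name `success_rate` and the example flags are hypothetical and are not taken from the LiveSQLBench evaluation code.

```python
def success_rate(passed_flags):
    """Return the percentage of tasks whose test cases all passed."""
    flags = list(passed_flags)
    if not flags:
        raise ValueError("at least one task is required")
    # Ratio of passing tasks to all tasks, expressed as a percentage.
    return 100.0 * sum(1 for ok in flags if ok) / len(flags)


# Hypothetical outcomes for five tasks (True = every test case passed).
example = [True, False, True, False, False]
rate = success_rate(example)
```

Under this definition, and assuming all 600 Base-Full v1 tasks are counted, a reported score of 48.00 would correspond to 288 passing tasks.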

Evaluation Methodology

  • SELECT queries: Compare execution results with golden SQL outputs
  • Management SQLs: Verify through comprehensive test cases
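The two checks above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's evaluator: it uses Python's built-in `sqlite3` with an in-memory database for portability (the main benchmark targets PostgreSQL), it compares SELECT results without regard to row order, and the helper names `select_results_match` and `management_sql_passes` as well as the `orders` table are hypothetical.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Execute one statement and return all result rows."""
    return conn.execute(sql).fetchall()


def select_results_match(conn, predicted_sql, gold_sql):
    """Compare two SELECT statements by their result rows, ignoring row order."""
    try:
        predicted = run(conn, predicted_sql)
    except sqlite3.Error:
        return False  # SQL that fails to execute cannot match the gold output
    return Counter(predicted) == Counter(run(conn, gold_sql))


def management_sql_passes(conn, predicted_sql, checks):
    """Apply a management statement, then verify (query, expected_rows) test cases."""
    try:
        conn.executescript(predicted_sql)
    except sqlite3.Error:
        return False
    return all(run(conn, query) == expected for query, expected in checks)


# Hypothetical toy database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, status TEXT);
    INSERT INTO orders VALUES (1, 10.0, 'open'), (2, 25.0, 'paid'), (3, 5.0, 'open');
""")

same = select_results_match(
    conn,
    "SELECT id FROM orders WHERE status = 'open' ORDER BY id DESC",
    "SELECT id FROM orders WHERE status = 'open'",
)
updated = management_sql_passes(
    conn,
    "UPDATE orders SET status = 'closed' WHERE amount < 8;",
    [("SELECT status FROM orders WHERE id = 3", [("closed",)])],
)
```

Comparing result multisets rather than SQL text lets differently written but equivalent queries pass. Whether row order should matter for a given task is a design choice this sketch does not settle.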
Last Updated: 2026-03-02

Benchmark Set: Base

Date Range: 2025-09 to 2026-06

Selected Base Datasets: 1 selected (600 samples). Success Rate requires coverage of all selected Base datasets.

Base releases:

  • 2025-05-28: LiveSQLBench-Base-Lite
  • 2025-09-04: LiveSQLBench-Base-Full v1

Verified Submission: Submissions marked with the badge are verified by the BIRD Team. To get verified, please submit your codebase for pipeline evaluation.

💡 Entries with complete coverage over the selected Base releases are ranked. Partial-coverage entries are shown in score order without rank.

| Rank | Sub Date | Model | Badge | Organization | Success Rate | Coverage | Cost / Task |
|---|---|---|---|---|---|---|---|
| 🥇 1 | 2026-05-29 | DIA (Data Intelligence Agents) | New | C3 AI | 48.00 | 1/1 (Base-Full-v1) | - |
| 🥈 2 | 2026-06-01 | MiniMax M3 (Claude Code) | Reported | MiniMax | 40.17 | 1/1 (Base-Full-v1) | - |
| 🥉 3 | 2026-03-27 | Claude Opus 4.6 (OpenHands CLI) | | Anthropic | 38.00 | 1/1 (Base-Full-v1) | - |
| 4 | 2026-04-26 | Gemini 3.1 Pro | Verified | Google | 36.50 | 1/1 (Base-Full-v1) | 0.0507 |
| 5 | 2026-03-02 | Claude Opus 4.6 | | Anthropic | 35.50 | 1/1 (Base-Full-v1) | 0.0979 |
| 6 | 2026-03-27 | Claude Sonnet 4.5 (OpenHands CLI) | | Anthropic | 35.20 | 1/1 (Base-Full-v1) | - |
| 7 | 2026-04-25 | GPT-5.5 (low) | Verified | OpenAI | 33.50 | 1/1 (Base-Full-v1) | 0.0896 |
| 8 | 2026-05-01 | GPT-5.5 (xhigh) | Verified | OpenAI | 33.33 | 1/1 (Base-Full-v1) | 0.2220 |
| 9 | 2026-06-01 | MiniMax M2.7 (Claude Code) | Reported | MiniMax | 33.17 | 1/1 (Base-Full-v1) | - |
| 10 | 2026-03-27 | Kimi 2.5 (OpenHands CLI) | | Moonshot AI | 32.20 | 1/1 (Base-Full-v1) | - |
| 11 | 2026-05-01 | GPT-5.4 | | OpenAI | 31.00 | 1/1 (Base-Full-v1) | - |
| 12 | 2026-03-02 | GPT-5.3 Codex | | OpenAI | 30.00 | 1/1 (Base-Full-v1) | - |
| 13 | 2026-05-01 | Kimi K2.6 | | Moonshot AI | 29.33 | 1/1 (Base-Full-v1) | 0.0364 |
| 14 | 2026-05-01 | GLM-5.1 | | Zhipu AI | 29.00 | 1/1 (Base-Full-v1) | - |
| 15 | 2026-03-27 | MiniMax M2.1 (OpenHands CLI) | | MiniMax | 28.70 | 1/1 (Base-Full-v1) | - |
| 16 | 2025-09-04 | Gemini 2.5 Pro | | Google | 28.67 | 1/1 (Base-Full-v1) | 0.0468 |
| 17 | 2026-03-27 | Qwen3 Coder 480B (OpenHands CLI) | | Alibaba | 26.80 | 1/1 (Base-Full-v1) | - |
| 18 | 2026-05-01 | Qwen3.6 Max | | Alibaba | 26.50 | 1/1 (Base-Full-v1) | - |
| 19 | 2025-08-17 | GPT-5 | | OpenAI | 25.50 | 1/1 (Base-Full-v1) | 0.0216 |
| 20 | 2025-09-04 | o1 | | OpenAI | 25.17 | 1/1 (Base-Full-v1) | 0.2283 |
| 21 | 2025-05-28 | DeepSeek R1 | | DeepSeek | 24.50 | 1/1 (Base-Full-v1) | 0.0142 |
| 22 | 2025-05-28 | o4-mini | | OpenAI | 24.17 | 1/1 (Base-Full-v1) | 0.0168 |
| 23 | 2025-09-04 | Gemini 2.5 Flash | | Google | 23.83 | 1/1 (Base-Full-v1) | 0.0055 |
| 24 | 2026-03-02 | Claude Sonnet 4.5 | | Anthropic | 23.83 | 1/1 (Base-Full-v1) | 0.0598 |
| 25 | 2025-05-28 | o3-mini | | OpenAI | 23.67 | 1/1 (Base-Full-v1) | 0.0222 |
| 26 | 2025-05-28 | o3 | | OpenAI | 23.67 | 1/1 (Base-Full-v1) | 0.1583 |
| 27 | 2026-03-02 | Kimi 2.5 | | Moonshot AI | 23.00 | 1/1 (Base-Full-v1) | 0.0110 |
| 28 | 2025-05-28 | Qwen3-235B-A22B | | Qwen | 22.17 | 1/1 (Base-Full-v1) | 0.0044 |
| 29 | 2025-05-28 | Claude 3.7 Sonnet (Thinking) | | Anthropic | 21.67 | 1/1 (Base-Full-v1) | 0.1168 |
| 30 | 2026-03-27 | GLM 4.7 (OpenHands CLI) | | Zhipu AI | 20.30 | 1/1 (Base-Full-v1) | - |
| 31 | 2025-05-28 | Claude Sonnet 4 | | Anthropic | 20.00 | 1/1 (Base-Full-v1) | 0.0591 |
| 32 | 2025-05-28 | DeepSeek V3 | | DeepSeek | 19.83 | 1/1 (Base-Full-v1) | 0.0044 |
| 33 | 2025-05-28 | Claude 3.7 Sonnet | | Anthropic | 19.67 | 1/1 (Base-Full-v1) | 0.0591 |
| 34 | 2026-03-02 | Qwen3 Coder 480B | | Alibaba | 19.17 | 1/1 (Base-Full-v1) | 0.0038 |
| 35 | 2025-05-28 | Llama 4 Scout | | Meta | 18.90 | 1/1 (Base-Full-v1) | - |
| 36 | 2026-03-02 | MiniMax M2.1 | | MiniMax | 18.83 | 1/1 (Base-Full-v1) | 0.0057 |
| 37 | 2025-06-09 | QwQ-32B | | Qwen | 18.17 | 1/1 (Base-Full-v1) | 0.0014 |
| 38 | 2026-03-02 | GLM 4.7 | | Zhipu AI | 18.17 | 1/1 (Base-Full-v1) | 0.0061 |
| 39 | 2025-05-28 | GPT-4o | | OpenAI | 15.50 | 1/1 (Base-Full-v1) | 0.0386 |
| 40 | 2025-05-28 | Llama 4 Maverick | | Meta | 13.17 | 1/1 (Base-Full-v1) | 0.0029 |
| 41 | 2025-06-09 | Qwen2.5 Coder 32B | | Qwen | 12.50 | 1/1 (Base-Full-v1) | 0.0008 |
| 42 | 2025-09-04 | Llama 3.3 70B Instruct | | Meta | 12.17 | 1/1 (Base-Full-v1) | 0.0006 |
| 43 | 2025-06-09 | Codestral 22B | | Mistral AI | 8.67 | 1/1 (Base-Full-v1) | 0.0045 |
| 44 | 2025-06-09 | Qwen2.5 Coder 7B | | Qwen | 6.33 | 1/1 (Base-Full-v1) | 0.0018 |
| 45 | 2025-06-09 | Mistral 7B Instruct | | Mistral AI | 2.83 | 1/1 (Base-Full-v1) | 0.0026 |
| 46 | 2025-06-09 | Mixtral 8x7B Instruct | | Mistral AI | 2.33 | 1/1 (Base-Full-v1) | 0.0012 |

Note: Success Rate is a micro-average across covered datasets in the selected benchmark set and date window, weighted by each dataset sample count. Entries with full coverage are ranked; partial-coverage entries can be shown in score order without rank. Cost is shown only when all selected releases have cost data. Models with reasoning ability are marked with a black-and-white logo.
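To make the micro-average and the ranking rule in the note concrete, here is a small sketch. The function `micro_average`, the dataset keys, and the pass counts are hypothetical illustration values, not leaderboard data.

```python
def micro_average(results, selected):
    """Pool pass counts over the covered datasets.

    results:  {dataset_name: (passed_tasks, total_tasks)} for one entry
    selected: dataset names chosen in the benchmark set
    Returns (success_rate_percent, covered_count, is_ranked).
    """
    covered = [name for name in selected if name in results]
    if not covered:
        raise ValueError("entry covers none of the selected datasets")
    passed = sum(results[name][0] for name in covered)
    total = sum(results[name][1] for name in covered)
    # Pooling counts weights each dataset by its number of samples.
    is_ranked = len(covered) == len(selected)
    return 100.0 * passed / total, len(covered), is_ranked


selected = ["Base-Lite", "Base-Full-v1"]

# Hypothetical entry covering both selected datasets: 135/270 and 150/600.
full = micro_average({"Base-Lite": (135, 270), "Base-Full-v1": (150, 600)}, selected)

# Hypothetical entry covering only one of them: scored, but not ranked.
partial = micro_average({"Base-Full-v1": (150, 600)}, selected)
```

In the first case the pooled score is 285/870, about 32.76, whereas an unweighted mean of the two per-dataset rates (50.00 and 25.00) would give 37.50. The larger dataset therefore dominates the micro-average.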

* Agent I is a baseline agent we designed for interacting with the database. (1) All DB context (schema, column meanings, and external knowledge definitions) is provided to the agent up front, before the user question; (2) the agent can execute SQL; and (3) the agent can submit SQL only once. The step limit is set to 20.

^ Agent II is based on Agent I but WITHOUT the database context provided up front before the user question. This suits cases where the database context is too large to fit in the prompt. However, it demands stronger agentic reasoning from the model.

*^ However, the agent's actions retrieve the complete (ALL) database schema, column meanings, and external knowledge definitions, which is neither suitable nor cost-effective when the database context is too large. To address this, we plan to propose a Mini-Agent that replaces these actions with more cost-friendly alternatives.
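The Agent I protocol described in the notes above (execute SQL freely, submit once, stop after 20 steps) can be sketched as a simple control loop. This is a hedged illustration only: the `policy` callable stands in for the LLM, the action names `"execute"` and `"submit"` are hypothetical, and `sqlite3` is used in place of the benchmark's PostgreSQL environment.

```python
import sqlite3

MAX_STEPS = 20  # step limit stated for Agent I


def run_agent(conn, policy, max_steps=MAX_STEPS):
    """Drive a policy until it submits SQL or the step budget runs out.

    policy(history) returns ("execute", sql) or ("submit", sql).
    Returns the submitted SQL, or None if nothing was submitted in time.
    """
    history = []
    for _ in range(max_steps):
        action, sql = policy(history)
        if action == "submit":
            return sql  # only one submission is allowed, so the episode ends
        try:
            observation = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            observation = f"error: {exc}"  # errors are fed back to the policy
        history.append((sql, observation))
    return None


def scripted_policy(history):
    """Hypothetical stand-in for an LLM: explore once, then submit."""
    if not history:
        return ("execute", "SELECT name FROM sqlite_master WHERE type = 'table'")
    return ("submit", "SELECT COUNT(*) FROM users")


# Hypothetical toy database for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

submitted = run_agent(conn, scripted_policy)
never_submitted = run_agent(conn, lambda history: ("execute", "SELECT 1"))
```

In this sketch, Agent I would additionally prepend the full DB context to the policy's prompt, while Agent II would omit it and rely on exploration actions.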

LiveSQLBench Data Viewer

Here you can explore LiveSQLBench-Base-Lite examples with DBs, tasks, and HKB-JSON. This is our initial release, featuring 270 tasks across 18 end-user level databases. Each task pairs an unambiguous, straightforward user query grounded in external knowledge with a SQL statement of medium to hard complexity.

  • SELECT Queries (Base Version): 180
  • Management SQLs (Base Version): 90
  • Avg SQL Tokens (Current Avg): 360
  • Databases (Base Version): 18

Preview: Large Version (Industrial Level) DBs and unstructured HKB-Document will be supported in the future LiveSQLBench-Full version.

Discussion

1. Current Model Performance

LiveSQLBench-Base-Lite evaluates LLMs on PostgreSQL, the most widely used and feature-rich open-source database system. Our benchmark provides Docker-based evaluation environments for easy deployment and reproducibility. We conduct separate evaluations across two categories: (1) Model Base, direct SQL generation without external tools, and (2) Agent, models with external tool orchestration. Initial results on Model Base reveal significant challenges, with the best-performing model (o3-mini) achieving a 47.78% success rate. The performance gap between models is notable: a cluster of top models (o3-mini, GPT-4.1, o4-mini, o1-preview, and Gemini 2.5 Flash with thinking) shows capabilities in the 37-45% range, while others still struggle to generate correct SQL queries consistently. This suggests that, despite improvement at the top end, complex SQL generation remains difficult for most current LLMs. Reasoning-specific models and newer architectures such as OpenAI's 'o' series and Google's Gemini 2.5 show promise, but the results highlight the ongoing need for advances in this domain.

2. GPT-5 Early Observations

GPT-5 just arrived and shows distinct behavior on LiveSQLBench. It tends to generate long SQL, averaging 373.2 tokens per query (longest among all models; measured with tiktoken, cl100k_base). It also achieves the highest success rate on DQL questions (category "Query"), indicating strong capabilities in data retrieval and BI analysis tasks.

3. Agent Cost Optimization: Mini-Agent

Agent I and Agent II evaluations reveal significant cost implications due to extensive context usage. Actions that retrieve ALL column meanings, complete schemas, and external knowledge definitions produce very long contexts and therefore high costs. For example, Claude 3.7 Sonnet costs 0.0619 USD/task under Model Base evaluation but 0.8936 USD/task under Agent I evaluation, roughly 14 times more. These actions are also unsuitable when the database context is too large to fit in the prompt. To address this, we're developing a Mini-Agent that replaces expensive actions with more cost-friendly alternatives: getting table names first, then retrieving column meanings per table instead of all columns at once, which significantly reduces context length and cost.
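The incremental retrieval idea behind the Mini-Agent (table names first, then details for one table at a time) might look like the sketch below. It is an assumption-laden illustration rather than the planned implementation: the helpers `list_tables` and `describe_table` are hypothetical, column names and declared types stand in for the benchmark's column meanings, and `sqlite3` catalog queries replace their PostgreSQL equivalents.

```python
import sqlite3


def list_tables(conn):
    """Cheap first action: return only the table names."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def describe_table(conn, table):
    """Follow-up action: return (column, declared_type) pairs for one table."""
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [(row[1], row[2]) for row in rows]


# Hypothetical two-table database for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")

tables = list_tables(conn)
order_columns = describe_table(conn, "orders")
```

With this pattern the agent's context grows only with the tables it actually inspects, instead of with the full schema of a database that may have over a thousand columns.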

Stay tuned!

We are developing several new versions of LiveSQLBench for the first release:

  • LiveSQLBench-Large-Lite featuring industrial-scale databases with 1340+ columns
  • LiveSQLBench-Large-Full containing complete large version DBs and tasks

Additionally, we are expanding to multi-dialect support, starting with SQLite for research purposes, with plans to add more dialects based on community voting.

Each new version will include both open development and hidden test sets, with hidden tests becoming the next version's open development set.

Citation

LiveSQLBench Citation

@misc{livesqlbench2025,
  author       = {BIRD Team},
  title        = {LiveSQLBench: A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks},
  year         = {2025},
  howpublished = {https://github.com/bird-bench/livesqlbench},
  note         = {Accessed: 2025-05-22}
}

BirdBench Citation

@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

For any inquiries or feedback, please contact us at shawnxxh@gmail.com, jl0725@connect.hku.hk, xxh24@connect.hku.hk, or bird.bench25@gmail.com.
Submit feedback to questions in the dataset via this form