LiveSQLBench

A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks

Paper (Coming Soon)

GitHub

LiveSQLBench-Base-Lite (2025-05-28)

LiveSQLBench-Base-Full v1 (2025-09-04)

LiveSQLBench-Base-Lite-SQLite (2025-07-28)

LiveSQLBench-Large-v1 (2026-03-02)

🔥 Data Intelligence Index Submission Guidelines

News

[04/04/2026]:

🖥️ We release LiveSQLBench-CLI, an evaluation framework for benchmarking CLI-based agents (OpenHands, Claude Code, Aider, etc.) on LiveSQLBench tasks via terminal interactions. Supports both base-lite and base-full-v1 datasets. First batch of results: 6 models via OpenHands across Base-Lite and Base-Full v1. Check the README for details.

[04/04/2026]:

🚀 We release LiveSQLBench-Agent, a Google ADK-based text-to-SQL agent framework with multi-provider LLM support, per-task DB isolation, and parallel execution. Check the README for details.

[03/05/2026]:

🔥🔥🔥 We release the Data Intelligence Index, a comprehensive evaluation of frontier AI Models and Agents on data-centric intelligence across various aspects: DB querying, BI analysis & data manipulation, DB application debugging, human-centric interaction, digital, data science, and more.

[03/02/2026]:

🔥🔥🔥 We are pleased to release LiveSQLBench-Large-v1, the industrial-scale counterpart with 18 databases (~1K columns each) and 480 tasks.
NEW FEATURES: 10x schema complexity, ~84K avg prompt tokens for long-context challenge, and Business Rule Drift for live context-learning evaluation.

[02/26/2026]:

🔥 Thrilled to have our BIRD-Interact, based on LiveSQLBench, accepted at ICLR 2026 (Oral)!

[09/04/2025]:

We are pleased to release LiveSQLBench-Base-Full v1, a comprehensive benchmark with 600 NEW tasks over 22 NEW real, complex databases with KB docs.
NEW FEATURES: more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values. See the dataset for details.

[07/28/2025]:

We are pleased to release LiveSQLBench-Base-Lite-SQLite, a SQLite version of LiveSQLBench-Base-Lite that extends the benchmark from the PostgreSQL dialect to SQLite to improve accessibility. Please check the GitHub repository and dataset for more details.

[05/30/2025]:

The first release of LiveSQLBench is out! It contains our initial version: LiveSQLBench-Base-Lite. Download it and test your text-to-SQL LLMs or agents in a contamination-free way!

BIRD Team, HKU
Google Cloud

Introduction

LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, continuously evolving benchmark designed to evaluate LLMs on complex, real-world text-to-SQL tasks. It features diverse real-world user queries, including Business Intelligence (BI) queries and CRUD operations. Each release will include 50 new, fully open-source DBs curated by the BIRD team through expert collaboration and continuous improvement. The databases will cover a wide range of sizes, from end-user level (around 127 columns) to industrial level (1340+ columns). The features of the LiveSQLBench benchmark are:

1. Live Databases: Constructed dynamically from extensive and regularly updated CSV datasets, with both base (end-user level) and large (industrial level, 1340+ columns per DB) versions to test scalability.
2. Live User Queries and SQL: Each task pairs an unambiguous user query with an annotated, gold-standard SQL statement. The user queries are grounded in external knowledge, and the SQL statements are of medium to hard complexity.
3. Contextual Reasoning (HKB): Every DB includes a hierarchical knowledge base (HKB) in which each knowledge entry is related to others, so answering a query requires multi-hop reasoning. Two HKB formats are provided: (1) structured JSON and (2) unstructured Document.
4. The First Full SQL Spectrum: Supports not just SELECT (Business Intelligence) queries, but also CRUD queries (e.g., UPDATE, CREATE, and other database management operations).
5. Automated Evaluation: Each question includes verifiable test cases for accurate, reproducible scoring.
6. Truly Live & Hidden Test: New databases and tasks are added over time. Each release features both open development and hidden test phases. The hidden test set from each release becomes the open development set for the next release, ensuring continuous evolution and fair evaluation.
7. Business Rule Drift (Live Context-Learning): Business rules embedded in external knowledge can change across releases, requiring models to adapt to updated context rather than rely on memorized patterns. This evaluates live context-learning ability under evolving real-world conditions.
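To make the multi-hop aspect of the HKB (feature 3) concrete, here is a minimal sketch of resolving a knowledge entry together with everything it depends on. The field names (`id`, `name`, `definition`, `depends_on`) and the three entries are illustrative assumptions, not the actual HKB-JSON schema or benchmark content.

```python
# Hypothetical HKB entries. Field names and contents are assumptions made
# for illustration; they are not taken from the LiveSQLBench data files.
hkb = [
    {"id": 0, "name": "Unit Revenue", "definition": "price * quantity", "depends_on": []},
    {"id": 1, "name": "Net Revenue", "definition": "Unit Revenue minus discounts", "depends_on": [0]},
    {"id": 2, "name": "High-Value Order", "definition": "Net Revenue above 1000", "depends_on": [1]},
]


def resolve(entry_id, kb):
    """Return the ids of an entry and all its transitive dependencies, prerequisites first."""
    by_id = {entry["id"]: entry for entry in kb}
    ordered, seen = [], set()

    def visit(current):
        if current in seen:
            return
        seen.add(current)
        for dep in by_id[current]["depends_on"]:
            visit(dep)
        ordered.append(current)  # appended only after its prerequisites

    visit(entry_id)
    return ordered


# Understanding "High-Value Order" requires two hops: Net Revenue, then Unit Revenue.
for i in resolve(2, hkb):
    print(hkb[i]["name"], "->", hkb[i]["definition"])
```

In this toy example, a query about "High-Value Order" cannot be answered from that entry alone; the model has to follow the chain down to "Unit Revenue" before it can write the SQL.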

So far, we have released three versions of LiveSQLBench:
• LiveSQLBench-Base-Lite, containing 18 NEW end-user level databases with 270 NEW tasks, featuring HKB-JSON and JSON operations in SQL.
• LiveSQLBench-Base-Full v1, containing 22 NEW end-user level databases with 600 NEW tasks, featuring HKB-JSON and JSON operations in SQL, plus more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values.
• LiveSQLBench-Large-v1, the industrial-scale counterpart containing 18 databases (~1K columns, ~54 tables each) with 480 tasks, featuring 10x schema complexity over Base-Full-v1, ~84K avg prompt tokens for long-context challenge, and Business Rule Drift for live context-learning evaluation.

LiveSQLBench Leaderboard

Model Types

Model Base

Direct SQL generation from natural language queries

Agent

Multi-step reasoning with tools, including database exploration, predefined actions, and CLI environments.

Metric: Success Rate, defined as the number of tasks that pass their test cases divided by the total number of tasks.

Evaluation Methodology

• SELECT queries: Compare execution results with golden SQL outputs
• Management SQLs: Verify through comprehensive test cases
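The two checks above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's evaluation code: it uses Python's built-in `sqlite3` with an in-memory toy database instead of the PostgreSQL environment, and the helper names (`make_db`, `results_match`, `passes_test_case`), the `orders` table, and all queries are hypothetical.

```python
import sqlite3
from collections import Counter


def make_db():
    """Build a fresh in-memory toy database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
        INSERT INTO orders VALUES (1, 'ann', 120.0), (2, 'bob', 80.0), (3, 'ann', 40.0);
    """)
    return conn


def results_match(conn, predicted_sql, gold_sql, ordered=False):
    """SELECT check: compare the predicted query's rows with the gold query's rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    if ordered:
        return predicted == gold
    return Counter(predicted) == Counter(gold)  # row order ignored, duplicates kept


def passes_test_case(conn, management_sql, check_sql, expected_rows):
    """Management SQL check: run the statement, then verify the resulting DB state."""
    conn.executescript(management_sql)
    return conn.execute(check_sql).fetchall() == expected_rows


gold_sql = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
good_sql = "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer ORDER BY total DESC"
bad_sql = "SELECT customer, MAX(amount) FROM orders GROUP BY customer"

print(results_match(make_db(), good_sql, gold_sql))  # same rows, different order
print(results_match(make_db(), bad_sql, gold_sql))   # wrong aggregate
```

Comparing row multisets rather than SQL text lets differently written but equivalent queries pass, while a management statement is judged by the database state it leaves behind.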

Last Updated: 2026-03-02

Benchmark Set

Base

Date Range: 2025-09 to 2026-07 (1 Base dataset selected, 600 samples). Success Rate requires coverage of all selected Base datasets.

Available Base datasets:
• LiveSQLBench-Base-Lite (2025-05-28)
• LiveSQLBench-Base-Full v1 (2025-09-04)

Verified Submission: Submissions marked with the badge are verified by the BIRD Team. To get verified, please submit your codebase for pipeline evaluation.

💡Only entries with results for all selected Base datasets are shown. Switch benchmark sets to compare Base, SQLite, and Large independently.

Rank / Sub Date	Model	Organization	Success Rate (%)	Cost / Task (USD)	Link
🥇1 2026-05-29	DIA (Data Intelligence Agents) [New]	C3 AI	48.00	-	🔗
🥈2 2026-06-01	MiniMax M3 (Claude Code) [Reported]	MiniMax	40.17	-	🔗
🥉3 2026-03-27	Claude Opus 4.6 (OpenHands CLI)	Anthropic	38.00	-	🔗
4 2026-04-26	Gemini 3.1 Pro	Google	36.50	0.0507	🔗
5 2026-03-02	Claude Opus 4.6	Anthropic	35.50	0.0979	🔗
6 2026-03-27	Claude Sonnet 4.5 (OpenHands CLI)	Anthropic	35.20	-	🔗
7 2026-04-25	GPT-5.5 (low)	OpenAI	33.50	0.0896	🔗
8 2026-05-01	GPT-5.5 (xhigh)	OpenAI	33.33	0.2220	🔗
9 2026-06-01	MiniMax M2.7 (Claude Code) [Reported]	MiniMax	33.17	-	🔗
10 2026-03-27	Kimi 2.5 (OpenHands CLI)	Moonshot AI	32.20	-	🔗
11 2026-05-01	GPT-5.4	OpenAI	31.00	-	🔗
12 2026-03-02	GPT-5.3 Codex	OpenAI	30.00	-	🔗
13 2026-05-01	Kimi K2.6	Moonshot AI	29.33	0.0364	🔗
14 2026-05-01	GLM-5.1	Zhipu AI	29.00	-	🔗
15 2026-03-27	MiniMax M2.1 (OpenHands CLI)	MiniMax	28.70	-	🔗
16 2025-09-04	Gemini 2.5 Pro	Google	28.67	0.0468	🔗
17 2026-03-27	Qwen3 Coder 480B (OpenHands CLI)	Alibaba	26.80	-	🔗
18 2026-05-01	Qwen3.6 Max	Alibaba	26.50	-	🔗
19 2025-08-17	GPT-5	OpenAI	25.50	0.0216	🔗
20 2025-09-04	o1	OpenAI	25.17	0.2283	🔗
21 2025-05-28	DeepSeek R1	DeepSeek	24.50	0.0142	🔗
22 2025-05-28	o4-mini	OpenAI	24.17	0.0168	🔗
23 2025-09-04	Gemini 2.5 Flash	Google	23.83	0.0055	🔗
24 2026-03-02	Claude Sonnet 4.5	Anthropic	23.83	0.0598	🔗
25 2025-05-28	o3-mini	OpenAI	23.67	0.0222	🔗
26 2025-05-28	o3	OpenAI	23.67	0.1583	🔗
27 2026-03-02	Kimi 2.5	Moonshot AI	23.00	0.0110	🔗
28 2025-05-28	Qwen3-235B-A22B	Qwen	22.17	0.0044	🔗
29 2025-05-28	Claude 3.7 Sonnet (Thinking)	Anthropic	21.67	0.1168	🔗
30 2026-03-27	GLM 4.7 (OpenHands CLI)	Zhipu AI	20.30	-	🔗
31 2025-05-28	Claude Sonnet 4	Anthropic	20.00	0.0591	🔗
32 2025-05-28	DeepSeek V3	DeepSeek	19.83	0.0044	🔗
33 2025-05-28	Claude 3.7 Sonnet	Anthropic	19.67	0.0591	🔗
34 2026-03-02	Qwen3 Coder 480B	Alibaba	19.17	0.0038	🔗
35 2025-05-28	Llama 4 Scout	Meta	18.90	-	🔗
36 2026-03-02	MiniMax M2.1	MiniMax	18.83	0.0057	🔗
37 2025-06-09	QwQ-32B	Qwen	18.17	0.0014	🔗
38 2026-03-02	GLM 4.7	Zhipu AI	18.17	0.0061	🔗
39 2025-05-28	GPT-4o	OpenAI	15.50	0.0386	🔗
40 2025-05-28	Llama 4 Maverick	Meta	13.17	0.0029	🔗
41 2025-06-09	Qwen2.5 Coder 32B	Qwen	12.50	0.0008	🔗
42 2025-09-04	Llama 3.3 70B Instruct	Meta	12.17	0.0006	🔗
43 2025-06-09	Codestral 22B	Mistral AI	8.67	0.0045	🔗
44 2025-06-09	Qwen2.5 Coder 7B	Qwen	6.33	0.0018	🔗
45 2025-06-09	Mistral 7B Instruct	Mistral AI	2.83	0.0026	🔗
46 2025-06-09	Mixtral 8x7B Instruct	Mistral AI	2.33	0.0012	🔗

Note: Success Rate is a micro-average across datasets in the selected benchmark set and date window, weighted by each dataset sample count. We require full coverage within the active set, so Base, SQLite, and Large leaderboards are compared independently. Cost is shown only when all selected releases have cost data. Models with reasoning ability are marked with a black-and-white logo.
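As a worked illustration of the micro-average described in the note, the sketch below pools passed and total task counts across datasets before dividing. The task totals (270 and 600) come from the release descriptions; the pass counts and the function name are hypothetical and chosen only for the arithmetic.

```python
def micro_average_success_rate(per_dataset):
    """Success rate (%) pooled over datasets.

    per_dataset maps a dataset name to (tasks_passed, total_tasks), so each
    dataset is weighted by its sample count.
    """
    passed = sum(p for p, _ in per_dataset.values())
    total = sum(t for _, t in per_dataset.values())
    if total == 0:
        raise ValueError("no tasks in the selected datasets")
    return 100.0 * passed / total


# Hypothetical pass counts for one model; totals are the published task counts.
selected = {
    "LiveSQLBench-Base-Lite": (81, 270),      # 30.00% on its own
    "LiveSQLBench-Base-Full v1": (150, 600),  # 25.00% on its own
}

micro = micro_average_success_rate(selected)
macro = sum(100.0 * p / t for p, t in selected.values()) / len(selected)
print(f"micro-average: {micro:.2f}%  (macro-average would be {macro:.2f}%)")
```

Because the larger dataset contributes more tasks, the micro-average sits closer to its rate than the unweighted mean of the two per-dataset rates does.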

* Agent I is a baseline agent we designed for interacting with the database. (1) All DB context (schema, column meanings, and external knowledge definitions) is provided to the agent up front, before the user question; (2) the agent can execute SQL; and (3) the agent can submit its SQL once. The step limit is set to 20.

^ Agent II is based on Agent I but WITHOUT the database context provided up front. This suits cases where the database context is too large to fit in the prompt, but it demands stronger agentic reasoning from the model.

*^ However, the agent's actions retrieve the complete (ALL) database schema, column meanings, and external knowledge definitions, which is neither suitable nor cost-effective when the database context is too large. To address this, we plan to propose a Mini-Agent that replaces these actions with more cost-friendly alternatives.

LiveSQLBench Data Viewer

Here you can explore LiveSQLBench-Base-Lite examples with DBs, tasks, and HKB-JSON. This is our initial release, featuring 270 tasks across 18 end-user level databases. Each task has an unambiguous, straightforward user query grounded in external knowledge, paired with a SQL statement of medium to hard complexity.

Statistics (Base Version):
• SELECT Queries: 180
• Avg SQL Tokens (current average): 360

Preview: Large Version (Industrial Level) DBs and unstructured HKB-Document will be supported in the future LiveSQLBench-Full version.


Discussion

1. Current Model Performance

LiveSQLBench-Base-Lite evaluates LLMs on PostgreSQL, the most widely used and feature-rich open-source database system. Our benchmark provides Docker-based evaluation environments for easy deployment and reproducibility. We conduct separate evaluations across two categories: (1) Model Base, which is direct SQL generation without external tools, and (2) Agent, which covers models with external tool orchestration. Initial results on Model Base reveal significant challenges: the best-performing model (o3-mini) achieves a 47.78% success rate. The performance gap between models is notable, with a cluster of top models (o3-mini, GPT-4.1, o4-mini, o1-preview, and Gemini 2.5 Flash with thinking) showing capabilities in the 37-45% range, while others still struggle to generate correct SQL queries consistently. This suggests that, despite improvement at the top end, complex SQL generation remains difficult for most current LLMs. Reasoning-specific models and newer model families such as OpenAI's 'o' series and Google's Gemini 2.5 show promise, but the results highlight the ongoing need for advances in this domain.

2. GPT-5 Early Observations

GPT-5 just arrived and shows distinct behavior on LiveSQLBench. It tends to generate long SQL, averaging 373.2 tokens per query (longest among all models; measured with tiktoken, cl100k_base). It also achieves the highest success rate on DQL questions (category "Query"), indicating strong capabilities in data retrieval and BI analysis tasks.

3. Agent Cost Optimization: Mini-Agent

Agent I and Agent II evaluations reveal significant cost implications due to extensive context usage. Actions that retrieve ALL column meanings, complete schemas, and external knowledge definitions produce very long contexts and therefore high costs. For example, Claude 3.7 Sonnet costs 0.0619 USD/task in the Model Base evaluation but 0.8936 USD/task in the Agent I evaluation, more than 14 times as much. These actions are also unsuitable when the database context is too large to fit in the prompt. To address this, we are developing a Mini-Agent that replaces expensive actions with more cost-friendly alternatives: getting table names first, then retrieving column meanings per table instead of all columns at once, which significantly reduces context length and cost.
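The difference between the two retrieval styles can be sketched as below. This is only an illustration of the idea under stated assumptions: the Mini-Agent is still in development, so the action names (`get_all_column_meanings`, `get_table_names`, `get_column_meanings`) and the toy column descriptions are hypothetical, and character count stands in for token count.

```python
# Hypothetical column-meaning store; table names, column names, and
# descriptions are invented for illustration.
COLUMN_MEANINGS = {
    "customers": {
        "id": "Unique customer identifier",
        "region": "Sales region code of the customer",
    },
    "orders": {
        "id": "Unique order identifier",
        "customer_id": "Customer who placed the order",
        "amount": "Order total in USD",
    },
    "shipments": {
        "order_id": "Order being shipped",
        "carrier": "Name of the shipping carrier",
        "eta": "Estimated arrival date",
    },
}


def get_all_column_meanings():
    """Expensive action: return every column description in one response."""
    return "\n".join(
        f"{table}.{column}: {meaning}"
        for table, columns in COLUMN_MEANINGS.items()
        for column, meaning in columns.items()
    )


def get_table_names():
    """Cheaper first step: list table names only."""
    return sorted(COLUMN_MEANINGS)


def get_column_meanings(table):
    """Cheaper second step: describe the columns of one table."""
    return "\n".join(
        f"{table}.{column}: {meaning}"
        for column, meaning in COLUMN_MEANINGS[table].items()
    )


# A question about order totals only needs the "orders" table.
full_context = get_all_column_meanings()
mini_context = "\n".join(get_table_names()) + "\n" + get_column_meanings("orders")
print(len(full_context), "characters vs", len(mini_context), "characters")
```

The saving grows with schema size: with hundreds of columns per database, fetching descriptions only for the tables a question touches keeps most of the schema out of the prompt.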

Stay tuned!

We are developing several new versions of LiveSQLBench for the first release:

• LiveSQLBench-Large-Lite featuring industrial-scale databases with 1340+ columns
• LiveSQLBench-Large-Full containing complete large version DBs and tasks

Additionally, we are expanding to multi-dialect support, starting with SQLite for research purposes, with plans to add more dialects based on community voting.

Each new version will include both open development and hidden test sets, with hidden tests becoming the next version's open development set.

Related Articles

NeurIPS 2023

Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs

This paper introduces BIRD, a Big benchmark for large-scale database grounded in text-to-SQL tasks, containing 12,751 pairs of text-to-SQL data and 95 databases with a total size of 33.4 GB, spanning 37 professional domains.

ICLR 2026 Oral

BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

BIRD-INTERACT evaluates text-to-SQL systems through dynamic multi-turn interactions with both users and database environments.

Citation

LiveSQLBench Citation

@misc{livesqlbench2025,
  author       = {BIRD Team},
  title        = {LiveSQLBench: A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks},
  year         = {2025},
  howpublished = {https://github.com/bird-bench/livesqlbench},
  note         = {Accessed: 2025-05-22}
}

BirdBench Citation

@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

For any inquiries or feedback, please contact us at shawnxxh@gmail.com, jl0725@connect.hku.hk, xxh24@connect.hku.hk, bird.bench25@gmail.com

Submit feedback to questions in the dataset via this form
