LiveSQLBench logo showing a cloud with SQL text inside

LiveSQLBench

A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks

Paper (Coming Soon) · GitHub · Huggingface: LiveSQLBench-Base-Lite · Huggingface: LiveSQLBench-Base-Full v1

News

[09/04/2025]:
🔥🔥🔥 We are pleased to release LiveSQLBench-Base-Full v1, a comprehensive benchmark with 600 NEW tasks over 22 NEW real, complex databases with KB docs.
NEW FEATURES: more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values. See the dataset for details.
[07/28/2025]:
We are pleased to release a SQLite version of LiveSQLBench-Base-Lite, called LiveSQLBench-Base-Lite-SQLite, which extends the benchmark from PostgreSQL to the SQLite dialect to improve accessibility. Please check the GitHub repository and dataset for more details.
[05/30/2025]:
The first version of LiveSQLBench is now available! It contains our initial release, LiveSQLBench-Base-Lite. Download it and test your text-to-SQL LLMs or agents in a contamination-free way!
BIRD Team & Google Cloud

Introduction

LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, continuously evolving benchmark designed to evaluate LLMs on complex, real-world text-to-SQL tasks, featuring diverse real-world user queries, including Business Intelligence (BI), CRUD operations, and more. Each release will include 50 new, fully open-source DBs curated by the BIRD team through expert collaboration and continuous improvement. The benchmark covers a wide range of database sizes, from end-user level (around 127 columns) to industrial level (1340+ columns). Here are the features of the LiveSQLBench benchmark:

  • 1. Live Databases: Constructed dynamically from extensive, regularly updated CSV datasets, with both base (end-user level) and large (industrial level, 1340+ columns per DB) versions to test scalability.
  • 2. Live User Queries and SQL: Each task pairs an unambiguous user query with an annotated, gold-standard SQL statement. The queries are grounded in external knowledge, and the SQL ranges from medium to hard complexity.
  • 3. Contextual Reasoning (HKB): Every DB includes a hierarchical knowledge base (HKB) in which knowledge items reference one another, requiring multi-hop reasoning (see the sketch after this list). Two HKB formats are provided: (1) structured JSON and (2) unstructured documents.
  • 4. The First Full SQL Spectrum: Supports not only SELECT (Business Intelligence) queries but also CRUD queries (e.g., UPDATE, CREATE, and other database management operations).
  • 5. Automated Evaluation: Each question includes verifiable test cases for accurate, reproducible scoring.
  • 6. Truly Live & Hidden Test: New databases and tasks are added over time. Each release features both open development and hidden test phases; the hidden test set from each release becomes the open development set for the next release, ensuring continuous evolution and fair evaluation.
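
To make the HKB idea concrete, here is a minimal sketch of how a structured HKB-JSON entry can reference other entries, so that answering a question requires hopping through a chain of definitions. The field names and the two knowledge items below are hypothetical illustrations, not the dataset's actual schema.

# Hypothetical HKB-JSON entries: each knowledge item may depend on others by id,
# so using "adjusted_revenue" first requires resolving "gross_revenue".
hkb = [
    {"id": 0, "knowledge": "gross_revenue",
     "description": "Sum of order amounts before refunds.",
     "definition": "SUM(orders.amount)", "depends_on": []},
    {"id": 1, "knowledge": "adjusted_revenue",
     "description": "Gross revenue minus refunded amounts.",
     "definition": "gross_revenue - SUM(refunds.amount)", "depends_on": [0]},
]

def resolve(kb, kid, seen=None):
    """Return knowledge item `kid` preceded by everything it transitively depends on."""
    seen = set() if seen is None else seen
    if kid in seen:
        return []
    seen.add(kid)
    item = next(k for k in kb if k["id"] == kid)
    chain = []
    for dep in item["depends_on"]:
        chain += resolve(kb, dep, seen)
    return chain + [item]

for step in resolve(hkb, 1):          # multi-hop: item 0 is pulled in before item 1
    print(step["knowledge"], "->", step["definition"])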

Currently, we have released two versions of LiveSQLBench:
• LiveSQLBench-Base-Lite, containing 18 NEW end-user level databases with 270 NEW tasks, featuring HKB-JSON and JSON operations in SQL.
• LiveSQLBench-Base-Full v1, containing 22 NEW end-user level databases with 600 NEW tasks, featuring HKB-JSON and JSON operations in SQL, plus more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values.
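
As a small, hedged illustration of the "JSON operations in SQL" feature: the PostgreSQL JSONB operators below (->, ->>, ?) are standard, but the table, column, and connection string are hypothetical placeholders, not taken from the benchmark databases, and running the snippet requires a live PostgreSQL instance.

# Hypothetical example of the kind of JSONB operations the tasks exercise.
import psycopg2

query = """
    SELECT profile ->> 'tier' AS tier,       -- extract a JSON field as text
           COUNT(*)           AS n_users
    FROM user_profiles                       -- placeholder table
    WHERE (profile -> 'flags') ? 'beta'      -- JSONB key-existence test
    GROUP BY profile ->> 'tier';
"""

with psycopg2.connect("dbname=example user=postgres") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(query)
        for tier, n_users in cur.fetchall():
            print(tier, n_users)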

LiveSQLBench Leaderboard

Model Types

• Model Base: direct SQL generation from natural language queries
• Agent: multi-step reasoning with database exploration
• CLI (coming soon): command-line agentic tools, e.g., Gemini CLI, Codex CLI, Claude Code

Metric: Success Rate, defined as the ratio of tasks that pass their test cases to the total number of tasks.

Evaluation Methodology

  • SELECT queries: compare execution results with the golden SQL outputs
  • Management SQLs: verify through comprehensive test cases
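
As a rough sketch of how such checks can be implemented (an illustration under our own assumptions, not the benchmark's actual evaluation harness; the connection string and queries are placeholders), SELECT queries can be compared by executing both the predicted and the golden SQL and matching the returned rows as order-insensitive multisets, while management SQLs are applied first and then validated by task-specific test-case queries:

# Sketch: order-insensitive comparison of predicted vs. golden SELECT results.
from collections import Counter
import psycopg2

def run(conn, sql):
    with conn.cursor() as cur:
        cur.execute(sql)
        return [tuple(row) for row in cur.fetchall()]

def select_matches(conn, predicted_sql, gold_sql):
    """True if both queries return the same multiset of rows (row order ignored)."""
    return Counter(run(conn, predicted_sql)) == Counter(run(conn, gold_sql))

# For management SQLs (UPDATE, CREATE, ...), the statement would be executed first
# and the resulting database state then checked by the task's test-case queries.

conn = psycopg2.connect("dbname=example user=postgres")   # placeholder DSN
print(select_matches(conn, "SELECT 1", "SELECT 1"))
conn.close()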
Last Updated: 2025-09-04

Date Range

2024-11 — 2026-03

Selected Datasets (2 datasets, 870 samples):
• LiveSQLBench-Base-Lite (released 2025-05-28): 270 samples
• LiveSQLBench-Base-Full v1 (released 2025-09-04): 600 samples
All results below are computed over the window 2024-11 — 2026-03.

Rank | Model | Organization | Success Rate (%) | Avg. Cost (USD) / Task
🥇 1 | o3-mini | OpenAI | 31.15 | 0.0225
🥈 2 | GPT-5 | OpenAI | 31.15 | 0.0383
🥉 3 | o4-mini | OpenAI | 29.54 | 0.0188
4 | o3 | OpenAI | 29.54 | 0.1752
5 | Claude Sonnet 4 | Anthropic | 27.01 | 0.0601
6 | Qwen3-235B-A22B | Qwen | 26.90 | 0.0043
7 | DeepSeek R1 | DeepSeek | 26.90 | 0.0149
8 | Claude 3.7 Sonnet (Thinking) | Anthropic | 26.55 | 0.1045
9 | Claude 3.7 Sonnet | Anthropic | 25.75 | 0.0600
10 | DeepSeek V3 | DeepSeek | 23.68 | 0.0045
11 | QwQ-32B | Qwen | 22.30 | 0.0010
12 | GPT-4o | OpenAI | 21.38 | 0.0394
13 | Llama 4 Scout | Meta | 18.55 | 0.0014
14 | Llama 4 Maverick | Meta | 18.05 | 0.0029
15 | Llama 3.3 70B Instruct | Meta | 15.86 | 0.0006
16 | Qwen2.5 Coder 32B | Qwen | 15.75 | 0.0008
17 | Codestral 22B | Mistral AI | 12.53 | 0.0045
18 | Qwen2.5 Coder 7B | Qwen | 8.16 | 0.0018
19 | Mistral 7B Instruct | Mistral AI | 3.10 | 0.0026
20 | Mixtral 8x7B Instruct | Mistral AI | 2.41 | 0.0012
- | Gemini 2.0 Flash | Google | N/A | 0.0027
- | Gemini 2.5 Flash | Google | N/A | 0.0055
- | Qwen3-235B-A22B (Agent I)* | Qwen | N/A | 0.0088
- | Gemini 2.0 Flash (Agent II)^ | Google | N/A | 0.0115
- | DeepSeek R1-0528 | DeepSeek | N/A | 0.0160
- | Gemini 2.5 Flash (Thinking) | Google | N/A | 0.0165
- | Gemini 2.0 Flash (Agent I)* | Google | N/A | 0.0185
- | GPT-4.1 | OpenAI | N/A | 0.0336
- | o3-mini (Agent I)* | OpenAI | N/A | 0.0353
- | DeepSeek R1-0528 (Agent I)* | DeepSeek | N/A | 0.0427
- | Gemini 2.5 Pro | Google | N/A | 0.0468
- | o1-mini | OpenAI | N/A | 0.0788
- | o1 | OpenAI | N/A | 0.2283
- | GPT-4o (Agent I)* | OpenAI | N/A | 0.3729
- | o1-preview | OpenAI | N/A | 0.4310
- | Claude 3.7 Sonnet (Agent I)* | Anthropic | N/A | 0.8936
- | Gemma 3 27B | Google | N/A | -
- | Gemini CLI (coming soon) | Google | N/A | -
- | Codex CLI (coming soon) | OpenAI | N/A | -
- | Claude Code (coming soon) | Anthropic | N/A | -

Note: Success Rate is a micro-average across datasets in the selected date window, weighted by each dataset's sample count. We require full coverage: "Success Rate" is shown only if a model has results for all datasets in the window; otherwise it displays N/A. On the leaderboard page, models with reasoning ability are marked with a black-and-white logo.
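
To spell out the aggregation in the note above, here is a tiny sketch (the data structure and the example counts are illustrative) of a sample-count-weighted micro-average with the full-coverage requirement:

# Micro-average success rate over the selected window: weight each dataset by its
# sample count; if a model lacks results for any dataset, report None (shown as N/A).
DATASET_SIZES = {"LiveSQLBench-Base-Lite": 270, "LiveSQLBench-Base-Full v1": 600}

def windowed_success_rate(tasks_passed_per_dataset):
    if set(tasks_passed_per_dataset) != set(DATASET_SIZES):   # full coverage required
        return None
    passed = sum(tasks_passed_per_dataset.values())
    total = sum(DATASET_SIZES.values())                        # 870 samples in this window
    return 100.0 * passed / total

# Made-up counts, only to show the arithmetic: 271 / 870 ≈ 31.15
print(windowed_success_rate({"LiveSQLBench-Base-Lite": 100,
                             "LiveSQLBench-Base-Full v1": 171}))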

* Agent I is a baseline agent we designed for interacting with the database. (1) All DB context (schema, column meanings, and external knowledge definitions) is provided to the agent up front, before the user question; (2) the agent can execute SQL; and (3) the agent can "submit" its SQL once. The maximum number of steps is set to 20.

^ Agent II is based on Agent I but WITHOUT the database context provided up front before the user question. This suits cases where the database context is too large to fit in the prompt, but it demands stronger agentic reasoning from the model.

*^ Note, however, that both agents have actions for retrieving the complete (ALL) database schema, column meanings, and external knowledge definitions, which is neither practical nor cost-effective when the database context is very large. To address this, we plan to propose a Mini-Agent that replaces these actions with more cost-friendly alternatives.
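
To make the two agent settings easier to picture, here is a deliberately simplified sketch of the Agent I loop as described above (Agent II is the same loop without the up-front DB context). The call_llm stub and the action format are placeholders of our own, not the actual baseline implementation.

# Simplified Agent I loop: DB context given up front, SQL execution allowed for
# exploration, a single "submit" action, and at most 20 steps.
MAX_STEPS = 20

def call_llm(messages):                         # placeholder: wire up a real LLM client here
    return {"action": "submit", "sql": "SELECT 1;"}

def run_agent_one(db_context, question, execute_sql):
    messages = [{"role": "system", "content": db_context},   # Agent II omits this message
                {"role": "user", "content": question}]
    for _ in range(MAX_STEPS):
        step = call_llm(messages)
        if step["action"] == "execute":                       # exploratory SQL
            messages.append({"role": "user", "content": str(execute_sql(step["sql"]))})
        elif step["action"] == "submit":                      # only one submission allowed
            return step["sql"]
    return None                                               # ran out of steps

print(run_agent_one("-- schema, column meanings, HKB docs --",
                    "How many users signed up last month?",
                    lambda sql: []))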

LiveSQLBench Data Viewer

Here you can explore LiveSQLBench-Base-Lite examples (DBs, tasks, and HKB-JSON) from our initial release, which features 270 tasks across 18 end-user level databases. Each task features an unambiguous, straightforward user query grounded in external knowledge, with a medium to hard complexity SQL statement.

• 180 SELECT Queries (Base Version)
• 90 Management SQLs (Base Version)
• 360 Avg SQL Tokens (Current Avg)
• 18 Databases (Base Version)

Preview: Large Version (Industrial Level) DBs and unstructured HKB-Document will be supported in the future LiveSQLBench-Full version.

Discussion

1. Current Model Performance

LiveSQLBench-Base-Lite evaluates LLMs on PostgreSQL, the most widely used and feature-rich open-source database system. Our benchmark provides Docker-based evaluation environments for easy deployment and reproducibility. We conduct separate evaluations for two categories: (1) Model Base, direct SQL generation without external tools, and (2) Agent, models with external tool orchestration (with a CLI category coming soon). Initial results on Model Base reveal significant challenges, with the best-performing model (o3-mini) achieving a 47.78% success rate. The performance gap between models is notable: a cluster of top models (o3-mini, GPT-4.1, o4-mini, o1-preview, and Gemini 2.5 Flash with thinking) shows capabilities in the 37-45% range, while others still struggle to consistently generate correct SQL queries. This suggests that, while there is improvement at the top end, complex SQL generation remains difficult for most current LLMs. The introduction of reasoning-specific models and newer architectures, such as OpenAI's 'o' series and Google's Gemini 2.5, shows promise, but highlights the ongoing need for advances in this domain.

2. GPT-5 Early Observations

GPT-5 has just arrived and shows distinct behavior on LiveSQLBench. It tends to generate long SQL, averaging 373.2 tokens per query (the longest among all models; measured with tiktoken, cl100k_base). It also achieves the highest success rate on DQL questions (category "Query"), indicating strong capabilities in data retrieval and BI analysis tasks.
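
For reference, the token counts above use tiktoken's cl100k_base encoding; a measurement in that style (the SQL string below is just an example) looks like:

# Count SQL tokens with tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sql = "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;"
print(len(enc.encode(sql)))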

3. Agent Cost Optimization: Mini-Agent

Agent I and Agent II evaluations reveal significant cost implications due to extensive context usage. Actions such as retrieving ALL column meanings, complete schemas, and external knowledge definitions pull very long contexts into the prompt and drive up cost. For example, Claude 3.7 Sonnet costs 0.0619 USD/task in the model-based evaluation but 0.8936 USD/task in the Agent I evaluation. This approach is also unsuitable when the database context is too large to fit in the prompt. To address this, we're developing a Mini-Agent that replaces expensive actions with more cost-friendly alternatives: getting table names first, then retrieving column meanings per table instead of all columns at once, significantly reducing context length and cost (see the sketch below).
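
Since the Mini-Agent is still in development, the following is only a sketch of the kind of cheaper, incremental lookups it could use; the function names are our own placeholders, while the catalog queries are standard PostgreSQL information_schema lookups and the connection string is a placeholder.

# Cost-friendly exploration: list table names first, then describe only the table
# the agent actually needs, instead of loading the full schema and all column
# meanings into the prompt at once.
import psycopg2

def list_tables(conn):
    with conn.cursor() as cur:
        cur.execute("""SELECT table_name FROM information_schema.tables
                       WHERE table_schema = 'public'""")
        return [r[0] for r in cur.fetchall()]

def describe_table(conn, table):
    with conn.cursor() as cur:
        cur.execute("""SELECT column_name, data_type FROM information_schema.columns
                       WHERE table_schema = 'public' AND table_name = %s""", (table,))
        return cur.fetchall()

conn = psycopg2.connect("dbname=example user=postgres")   # placeholder DSN
print(list_tables(conn))
print(describe_table(conn, "orders"))                      # hypothetical table name
conn.close()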

Stay tuned!

Following the first release, we are developing several new versions of LiveSQLBench:

  • LiveSQLBench-Large-Lite, featuring industrial-scale databases with 1340+ columns
  • LiveSQLBench-Large-Full, containing complete large version DBs and tasks

Additionally, we are expanding to multi-dialect support, starting with SQLite for research purposes, with plans to add more dialects based on community voting.

Each new version will include both open development and hidden test sets, with hidden tests becoming the next version's open development set.

Citation

LiveSQLBench Citation

@misc{livesqlbench2025,
  author       = {BIRD Team},
  title        = {LiveSQLBench: A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks},
  year         = {2025},
  howpublished = {https://github.com/bird-bench/livesqlbench},
  note         = {Accessed: 2025-05-22}
}

BirdBench Citation

@article{li2024can,
  title={Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
For any inquiries or feedback, please contact us at shawnxxh@gmail.com, jl0725@connect.hku.hk, bird.bench25@gmail.com.
Submit feedback on questions in the dataset via this form.