The Data Intelligence Index is a comprehensive evaluation of frontier AI models on data-centric intelligence. As models and agents become more powerful, we need a systematic way to measure their performance across diverse data challenges, from database querying and SQL debugging to data science and beyond.

We assess frontier models across multiple aspects of data-centric intelligence, including DB querying, BI analysis, application debugging, human-centric interaction, multi-modal understanding, data science, and more. This provides a single view of both performance and cost efficiency. For methodology details on this index, see the blog page.

SQL Only: restrict the index to pure SQL benchmarks (BIRD-SQL, LiveSQLBench, BIRD-Critic, BIRD-Interact). Include Vision: add BIRD-Vision and show only models with vision results.

BIRD Data Intelligence Index

Evaluating frontier AI on data-centric intelligence across multiple aspects, including DB querying, BI analysis, application debugging, human-centric interaction, multi-modal understanding, data science, and more.

Data Intelligence Index = average score (%) across each aspect. · Presented by HKU BIRD Team · bird-bench.github.io · Mar 2026
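The index formula above can be sketched in a few lines of Python. The aspect names and scores below are illustrative placeholders, not actual leaderboard values:

```python
def data_intelligence_index(aspect_scores: dict[str, float]) -> float:
    """Unweighted mean of per-aspect scores (%), rounded to one decimal."""
    return round(sum(aspect_scores.values()) / len(aspect_scores), 1)

# Illustrative scores only -- not real leaderboard numbers.
example = {
    "db_querying": 62.0,
    "bi_analysis": 55.0,
    "sql_debugging": 40.9,
    "interaction": 28.1,
    "data_science": 50.0,
}
index = data_intelligence_index(example)  # -> 47.2
```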

Model Profiles

Hover legend to highlight

Each axis is normalized to the top score in that dimension. Hover for raw scores.
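One way this per-axis normalization could be implemented is to divide each model's score on an axis by the best score on that axis, so the top model maps to 1.0. The model names and scores here are hypothetical:

```python
def normalize_axes(scores_by_model: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Scale every axis so the best model on that axis maps to 1.0."""
    axes = next(iter(scores_by_model.values())).keys()
    axis_max = {axis: max(scores[axis] for scores in scores_by_model.values())
                for axis in axes}
    return {model: {axis: scores[axis] / axis_max[axis] for axis in axes}
            for model, scores in scores_by_model.items()}

# Hypothetical two-model, two-axis example.
normalized = normalize_axes({
    "Model A": {"querying": 50.0, "debugging": 40.0},
    "Model B": {"querying": 25.0, "debugging": 40.0},
})
```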

Score by Aspect

Representative score (%) per aspect · sorted by index

Cost vs Performance

Average cost per task vs Data Intelligence Index
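A simple cost-efficiency reading of this chart is index points per dollar of average task cost. This helper is an illustrative sketch, not the leaderboard's official metric:

```python
def index_per_dollar(index_score: float, avg_cost_per_task: float) -> float:
    """Index points per dollar of average task cost (higher is better)."""
    if avg_cost_per_task <= 0:
        raise ValueError("average cost per task must be positive")
    return index_score / avg_cost_per_task
```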

Benchmark Details

Tests basic text-to-SQL generation across SQLite, PostgreSQL, and MySQL on the Mini Dev set. Models must translate natural language questions into correct SQL queries over complex real-world database schemas. 1,500 instances. bird-bench.github.io
Evaluates whether models can identify and fix issues in database applications across 5 SQL dialect variants (SQLite, PostgreSQL, MySQL, SQL Server, Oracle). Models must diagnose broken SQL and produce working fixes. 1,804 instances. bird-critic.github.io
Contamination-free, periodically updated benchmark covering business intelligence queries and full database manipulation (SELECT, UPDATE, CREATE, DELETE). Evolving business rules and real-world tasks make memorization impossible. Lite: 270, Full: 600, Large: 480 tasks. livesqlbench.ai
Evaluates whether models can clarify ambiguous requests with the user, iteratively refine queries, and explore a complex database environment (ICLR 2026, Oral). Two evaluation modes: a-Interact (agentic, the model explores freely) and c-Interact (conversational, the model communicates with the user first). Each task contains two subtasks, requiring the model to interact with both the user and the database environment to solve. Metrics: Success Rate (SR) per subtask, and Reward (a weighted combination of both SRs). Lite: 300, Full: 600, Mini: 300 instances. bird-interact.github.io
Evaluates data science capabilities, starting with code translation across three domains: data querying, data management, and deep learning. 800 instances (300 + 300 + 200).
Combines visual inputs, database queries, and tool use. Models analyze images alongside structured data to answer grounded questions. Life track (300) and Science track (300). * Only Opus 4.6, Sonnet 4.5, and Kimi 2.5 have been evaluated.
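The BIRD-Interact Reward metric described above can be sketched as a weighted combination of the two subtask success rates. The equal weights used here are placeholder assumptions; the benchmark defines its own weighting:

```python
def interact_reward(sr_subtask1: float, sr_subtask2: float,
                    w1: float = 0.5, w2: float = 0.5) -> float:
    """Reward as a weighted combination of two subtask success rates.
    The default 0.5/0.5 weights are placeholders, not the benchmark's values."""
    return w1 * sr_subtask1 + w2 * sr_subtask2
```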

Key Findings

Unsolved

Top model averages under 50%. SQL debugging peaks at 40.9%, human-centric interaction at just 28.8%.

Leader

Opus 4.6 wins overall, ranking 1st on DB querying, BI analysis, debugging, and code translation.

Value

Kimi 2.5 leads Vision (47.0% vs Opus 43.8%) at a fraction of the cost. Best accuracy-per-dollar.

Hard

Human-centric interaction: Even Opus 4.6 achieves just 28.1% and Qwen3 Coder drops to 18.4%.

Varied

Opus dominates overall, but Kimi leads multi-modal, and Qwen3 matches Opus on DB querying at 1/17th cost.