The Data Intelligence Index is a comprehensive evaluation of frontier AI models on data-centric intelligence. As models and agents become more powerful, we need a systematic way to measure their performance across diverse data challenges, from querying databases to debugging SQL in production.

We assess 6 models across 6 aspects: DB Querying, BI Analysis & Data Manipulation, DB Application Debugging, Human-centric Interaction, Multi-modal Querying, and Data Science Code Translation. This provides a single view of both performance and cost efficiency across data-centric intelligence. For methodology details on this index, see the blog page.

BIRD Data Intelligence Index

Evaluating frontier AI on data-centric intelligence across 6 aspects: DB querying, BI analysis & data manipulation, DB application debugging, human-centric interaction, multi-modal querying, and data science code translation.

Data Intelligence Index = average score (%) across each aspect. · Presented by HKU BIRD Team · bird-bench.github.io · Mar 2026
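As the caption above states, the index is the unweighted mean of the six aspect scores. A minimal sketch of that computation (the aspect keys are illustrative names, not an official schema, and the scores are placeholders rather than real leaderboard numbers):

```python
from statistics import mean

# The six evaluated aspects; key names are illustrative, not official.
ASPECTS = [
    "db_querying",
    "bi_analysis_data_manipulation",
    "db_application_debugging",
    "human_centric_interaction",
    "multi_modal_querying",
    "data_science_code_translation",
]

def data_intelligence_index(scores: dict[str, float]) -> float:
    """Unweighted average score (%) across the six aspects."""
    missing = set(ASPECTS) - scores.keys()
    if missing:
        raise ValueError(f"missing aspect scores: {missing}")
    return mean(scores[a] for a in ASPECTS)

# Placeholder model scoring 50% on every aspect -> index of 50.0
example = {a: 50.0 for a in ASPECTS}
print(round(data_intelligence_index(example), 1))  # 50.0
```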

Model Profiles


Score by Aspect

Representative score (%) per aspect · sorted by index

Cost vs Performance

Average cost per task vs Data Intelligence Index

Benchmark Details

Tests basic text-to-SQL generation across SQLite, PostgreSQL, and MySQL on the Mini Dev set. Models must translate natural language questions into correct SQL queries over complex real-world database schemas. 1,500 instances. bird-bench.github.io
Evaluates whether models can identify and fix issues in database applications across 7 SQL dialect variants: SQLite, Flash, PG-530, and Open set (PostgreSQL, MySQL, SQL Server, Oracle). Models must diagnose broken SQL and produce working fixes. 1,804 instances. bird-critic.github.io
Contamination-free, periodically updated benchmark covering business intelligence queries and full database manipulation (SELECT, UPDATE, CREATE, DELETE). Evolving business rules and real-world tasks make memorization impossible. Lite: 270, Full: 600 tasks. livesqlbench.ai
Evaluates whether models can clarify ambiguous requests with the user, iteratively refine queries, and explore complex DB environments (ICLR 2026, Oral). Two evaluation modes: a-Interact (agentic, the model explores flexibly) and c-Interact (conversational, the model communicates with the user first). Each task contains two subtasks, requiring the model to interact with both the user and the DB environment to solve them. Metrics: Success Rate (SR) per subtask, and Reward (a weighted combination of both SRs). Lite: 300, Full: 600, Mini: 300 instances. bird-interact.github.io
Data science code translation across three domains: data querying, data management, and deep learning. Models must bridge between languages and frameworks in real data workflows. 800 instances (300 + 300 + 200).
Combines visual inputs, database queries, and tool use. Models analyze images alongside structured data to answer grounded questions. Life track (300) and Science track (300). * Only Opus 4.6, Sonnet 4.5, Kimi 2.5 evaluated.
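The BIRD-Interact entry above defines Reward as a weighted combination of the two subtask Success Rates. A hedged sketch of that scoring; the equal 0.5/0.5 weights here are an illustrative assumption, not the benchmark's published weighting:

```python
def task_reward(sr_sub1: float, sr_sub2: float,
                w1: float = 0.5, w2: float = 0.5) -> float:
    """Weighted combination of the two per-subtask success rates (0.0-1.0).

    The default equal weights are placeholders; the actual benchmark
    weighting may differ.
    """
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return w1 * sr_sub1 + w2 * sr_sub2

# Solving only the first subtask yields half the reward under equal weights.
print(task_reward(1.0, 0.0))  # 0.5
```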

Key Findings

Unsolved

The top model averages under 50%. SQL debugging peaks at 40.9%, and human-centric interaction at just 28.8%.

Leader

Opus 4.6 wins overall, ranking 1st on DB querying, BI analysis, debugging, and code translation.

Value

Kimi 2.5 leads multi-modal querying (47.0% vs Opus's 43.8%) at a fraction of the cost. Best accuracy per dollar.

Hard

Human-centric interaction: Even Opus 4.6 achieves just 28.1% and Qwen3 Coder drops to 18.4%.

Varied

Opus dominates overall, but Kimi leads multi-modal querying, and Qwen3 matches Opus on DB querying at 1/17th the cost.