The Data Intelligence Index is a comprehensive evaluation of frontier AI models on data-centric intelligence. As models and agents become more powerful, we need a systematic way to measure their performance across diverse data challenges, from database querying and SQL debugging to data science and beyond.

We assess frontier models across various aspects of data-centric intelligence, including DB querying, BI analysis, application debugging, human-centric interaction, digital, data science, and more. This provides a single view of both performance and cost efficiency. For methodology details on this index, see the blog page.

✨ v0.3 update · May 1, 2026
  • Curated human-verified suite: v0.3 uses higher-quality, more representative tasks verified by humans, reducing the previous 8k tasks to around 2k while keeping the core data-centric skill coverage.
  • Base vs Agent: To evaluate raw model ability and agentic capability separately, we design two evaluation settings: Base measures direct single-step generation, while Agent measures looped CLI/tool use.
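To make the Base and Agent distinction concrete, here is a minimal sketch of the two settings. It is not the DII harness: the `Model` and `Tool` callables, the `FINAL:` stop marker, and the `max_steps` budget are all hypothetical names chosen for illustration.

```python
from typing import Callable, Optional

# Hypothetical interfaces, assumed for illustration only.
Model = Callable[[str], str]  # prompt -> model reply
Tool = Callable[[str], str]   # CLI command -> observation text


def run_base(model: Model, task: str) -> str:
    """Base setting: one direct generation, no tools, no retries."""
    return model(task)


def run_agent(model: Model, tool: Tool, task: str, max_steps: int = 5) -> Optional[str]:
    """Agent setting: loop between model replies and tool observations.

    The model either issues a command (run through the tool) or ends the
    loop with a reply starting with the assumed marker "FINAL:".
    """
    transcript = task
    for _ in range(max_steps):
        reply = model(transcript)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        observation = tool(reply)
        # Feed the command and its result back into the next prompt.
        transcript += f"\n$ {reply}\n{observation}"
    return None  # step budget exhausted without a final answer
```

The only structural difference is the loop: Base scores whatever the model produces in one shot, while Agent lets the model inspect its environment before committing to an answer.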

SQL Only: restrict the index to pure SQL benchmarks (Mini-Dev (multi-dialect), LiveSQLBench, BIRD-Critic, BIRD-Interact).
Include Vision: add BIRD-Vision and show only models with vision results.


BIRD Data Intelligence Index v0.3

Base: direct single-step generation · Agent: CLI tool-use loop

Evaluating frontier AI on data-centric intelligence across various aspects, including DB querying, BI analysis, application debugging, human-centric interaction, digital, data science, and more.

Data Intelligence Index = average of the per-aspect scores. · Presented by The BIRD Team · bird-bench.github.io · May 2026
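As a rough illustration of the definition above, the index can be read as an unweighted mean over aspect scores. The sketch below assumes equal weighting and a score for every aspect; the aspect keys and numbers are made up for the example and are not real results.

```python
from statistics import mean


def data_intelligence_index(aspect_scores: dict[str, float]) -> float:
    """Unweighted mean of per-aspect scores (in percent).

    Assumption: every aspect has a score; how the real index treats
    missing aspects is not specified here.
    """
    if not aspect_scores:
        raise ValueError("need at least one aspect score")
    return mean(aspect_scores.values())


# Hypothetical scores, for illustration only.
example_scores = {"db_querying": 60.0, "bi_analysis": 50.0, "data_science": 40.0}
example_index = data_intelligence_index(example_scores)  # (60 + 50 + 40) / 3
```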

Model Profiles


Each axis is normalized to the top score in that benchmark. Missing results remain blank.
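The axis normalization described above could look like the following sketch: each benchmark's scores are divided by that benchmark's top score, and missing results stay missing rather than becoming zero. The nested-dict layout and the use of `None` for a missing result are assumptions made for this example.

```python
from typing import Optional

# Assumed layout: model name -> benchmark name -> score, or None if missing.
Scores = dict[str, dict[str, Optional[float]]]


def normalize_to_top(scores: Scores) -> Scores:
    """Scale each benchmark so its best model sits at 1.0."""
    # Top score per benchmark, ignoring missing results.
    top: dict[str, float] = {}
    for per_model in scores.values():
        for bench, value in per_model.items():
            if value is not None:
                top[bench] = max(top.get(bench, value), value)

    normalized: Scores = {}
    for model, per_model in scores.items():
        normalized[model] = {
            # Missing results stay blank; a top score of 0 is also left
            # blank to avoid dividing by zero.
            bench: None if value is None or top[bench] == 0 else value / top[bench]
            for bench, value in per_model.items()
        }
    return normalized


# Hypothetical scores, for illustration only.
raw = {
    "model_a": {"sql": 80.0, "vision": None},
    "model_b": {"sql": 40.0, "vision": 30.0},
}
profile = normalize_to_top(raw)
```

Keeping missing results as `None` means a model with no vision run shows a gap on that axis instead of a misleading zero.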

Score by Aspect

Benchmark overall score (%) · sorted by score

Cost vs Performance

Average cost per task vs Data Intelligence Index

Benchmark Details

DII uses portable, higher-quality, and more representative task subsets verified by humans, so the DII task set may not exactly match the linked benchmark datasets. We will release a unified DII collection on Hugging Face. Stay tuned.

Key Findings

Analysis

Key findings are currently under analysis. We are reviewing Base and Agent results, trajectory-level behavior, and benchmark-specific error patterns before publishing summarized conclusions.