Data Intelligence Index

DII v0.3

v0.3 moves from a broad beta pool to a more curated, human-verified benchmark suite of around 2K tasks. The goal is to keep the evaluation compact while making each task more representative of real data work.

We also separate Base Mode and Agent Mode. Base Mode evaluates direct single-step, long-context model generation, while Agent Mode is designed for looped CLI/tool-use workflows, so raw model ability and agentic capability can be tracked separately.

Why Build This?

Most AI evaluations focus on general reasoning, coding, or chat quality. But everyone works with data, from analysts querying databases to everyday users exploring spreadsheets. The question is: how well can AI actually work with data?

Existing benchmarks test isolated skills like a single SQL query or a standalone coding task. Real data work is harder and more diverse: it spans querying, database operations, debugging, multi-turn interaction, multi-modal querying, and data science, among others.

The Data Intelligence Index is our attempt to measure this comprehensively. It brings together tests across a broad range of data-centric capabilities, covering database querying, BI analysis and manipulation, application debugging, human-centric interaction, data science, SQL optimization, computer-use, and more. See the aspect details below.

What Data Intelligence Aspects We Evaluate

DB Querying · Mini-Dev (multi-dialect)

This aspect measures foundational database querying: translating natural-language questions into correct SQL over real-world database schemas. Mini-Dev (multi-dialect) evaluates this ability across PostgreSQL, MySQL, and SQLite.

BI Analysis & Data Manipulation · LiveSQLBench

This aspect measures BI analysis and database manipulation beyond standard SELECT queries, including UPDATE, CREATE, and DELETE workflows grounded in business logic documents. LiveSQLBench evaluates this capability with periodically updated, contamination-resistant tasks.

DB Application Debugging · BIRD-Critic · BIRD-Critic 1.5: SWE

This aspect measures database application debugging ability. BIRD-Critic evaluates whether models can diagnose and fix broken SQL in database applications across multiple SQL dialects. BIRD-Critic 1.5 stress-tests harness engineering in complex business environments, using production-grade tasks drawn from 40+ heterogeneous data management repositories that cover Python, Node.js, Ruby, and PHP.

Human-centric Interaction · BIRD-Interact · Tapilot-Crossing

This aspect measures human-centric data interaction: handling ambiguity, context, clarification, follow-up goals, and iterative data analysis across turns. BIRD-Interact evaluates dynamic Text-to-SQL interaction with both the user and the data environment. Tapilot-Crossing evaluates conversational tabular data analysis across code updates, clarification, best-guess reasoning, plot question answering, insight mining, and private-library settings.

Data Science · DS-CodeTrans · DARE-Bench

This aspect evaluates data-science coding ability across data querying, data management, modeling, and deep learning workflows. DS-CodeTrans tests data-science code translation across SQL, pandas, NumPy, PyTorch, and TensorFlow-style programs. DARE-Bench evaluates machine-learning modeling and data-science instruction fidelity with verifiable, Kaggle-derived tasks.

SQL Optimization · Effi-SQL

This aspect measures execution-aware SQL optimization: improving query efficiency while preserving semantic correctness. Effi-SQL evaluates semantics-consistent SQL rewrite optimization with execution-grounded scoring.
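To make "improving efficiency while preserving semantic correctness" concrete, here is a minimal sketch of an execution-based check using Python's built-in sqlite3 module. It is an illustration only, not Effi-SQL's actual scoring: the `orders` table, the two queries, and the helper names `build_demo_db`, `timed_run`, and `same_results` are all hypothetical. Note that matching results on one database instance is evidence of equivalence, not proof.

```python
import sqlite3
import time
from collections import Counter


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10, 150.0), (2, 10, 20.0), (3, 11, 50.0), (4, 12, 300.0)],
    )
    return conn


def timed_run(conn: sqlite3.Connection, sql: str):
    """Execute a query and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


def same_results(conn: sqlite3.Connection, original_sql: str, rewritten_sql: str) -> bool:
    """Compare result sets as multisets, ignoring row order."""
    rows_a, _ = timed_run(conn, original_sql)
    rows_b, _ = timed_run(conn, rewritten_sql)
    return Counter(rows_a) == Counter(rows_b)


# Original query uses a redundant self-referencing subquery.
ORIGINAL = (
    "SELECT DISTINCT customer_id FROM orders "
    "WHERE customer_id IN (SELECT customer_id FROM orders WHERE amount > 100)"
)
# Candidate rewrite filters directly.
REWRITTEN = "SELECT DISTINCT customer_id FROM orders WHERE amount > 100"

conn = build_demo_db()
print(same_results(conn, ORIGINAL, REWRITTEN))  # True on this demo data
```

A fuller harness would presumably repeat the timing over several runs and larger data before claiming a speedup; `timed_run` only measures a single execution.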

Computer-Use · Workspace-Bench

This aspect measures how well agents perform data-centric computer-use tasks across local files, spreadsheets, databases, and analysis tools. Workspace-Bench evaluates workspace learning with large-scale file dependencies, covering cross-file retrieval, contextual reasoning, and adaptive decision-making in realistic digital workspaces.

New User Experience · UX-Tau-Bench

This aspect measures a new paradigm of human-agent interaction for data-centric tasks: generating user-friendly interactive UIs, rather than relying only on text chat, to improve the user experience. UX-Tau-Bench evaluates whether an agent can reason over the task context, explore the data environment, generate an interactive React/TSX clarification UI, collect structured user feedback through user simulation, and use that feedback to complete the final data-management action.

How the Index Is Calculated

Each benchmark provides one overall score (%) per model. We use that overall score directly inside its aspect and show subset columns in Benchmark Details.

The Data Intelligence Index is the average of available aspect scores in the selected evaluation setting. If one aspect has multiple benchmarks, such as Human-centric with Tapilot-Crossing and BIRD-Interact, the aspect score is the average of available benchmark overall scores. Missing benchmark results are shown as placeholders so they can be filled later without changing the schema.
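The two-level averaging described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that missing benchmark results are represented as `None` and simply excluded from the averages; the function names and the scores in `example` are made up for the demonstration and are not real results.

```python
from statistics import mean
from typing import Mapping, Optional

# Maps benchmark name -> overall score (%), or None when no result is available yet.
BenchmarkScores = Mapping[str, Optional[float]]


def aspect_score(benchmarks: BenchmarkScores) -> Optional[float]:
    """Average the available benchmark overall scores within one aspect."""
    available = [s for s in benchmarks.values() if s is not None]
    return mean(available) if available else None


def index_score(aspects: Mapping[str, BenchmarkScores]) -> Optional[float]:
    """Average the available aspect scores to get the overall index."""
    scores = [aspect_score(b) for b in aspects.values()]
    available = [s for s in scores if s is not None]
    return mean(available) if available else None


# Illustrative numbers only (not actual leaderboard results).
example = {
    "DB Querying": {"Mini-Dev": 60.0},
    "Human-centric Interaction": {"BIRD-Interact": 30.0, "Tapilot-Crossing": 50.0},
    "SQL Optimization": {"Effi-SQL": None},  # missing result, skipped
}

print(aspect_score(example["Human-centric Interaction"]))  # 40.0
print(index_score(example))  # 50.0
```

Because each aspect is averaged first, an aspect with two benchmarks carries the same weight in the index as an aspect with one.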

Explore the Results

The interactive dashboard lets you compare all models across every v0.3 benchmark, switch Base/Agent settings, and inspect subset scores in Benchmark Details.

Open Dashboard →

We will continue updating the Data Intelligence Index as new models and benchmarks emerge. All benchmark code and data are open-source at bird-bench.github.io.

Want to contribute? We welcome new model results, benchmark datasets, and evaluation contributions. Reach out to us to get your model or benchmark included in the index.

Contact: bird.bench25@gmail.com · jl0725@connect.hku.hk · xxh24@connect.hku.hk
