Why Build This?
Most AI evaluations focus on general reasoning, coding, or chat quality. But everyone works with data, from analysts querying databases to everyday users exploring spreadsheets. The question is: how well can AI actually work with data?
Existing benchmarks test isolated skills like a single SQL query or a standalone coding task. Real data work is harder and more diverse: it spans querying, DB operations, debugging, multi-turn interaction, multi-modal querying, and data-science code translation.
The Data Intelligence Index is our attempt to measure this comprehensively. We evaluate frontier AI models across 6 distinct aspects of data-centric intelligence, using 6 benchmarks developed by the HKU BIRD Team over the past three years.
What We Evaluate
Can models translate natural language questions into correct SQL queries over real-world database schemas? We evaluate this foundational skill using BIRD-SQL (NeurIPS 2023 Spotlight), testing across SQLite, PostgreSQL, and MySQL on the Mini Dev set (1,500 instances).
Real data work goes beyond SELECT queries. Can models handle BI analysis and full database manipulation (UPDATE, CREATE, DELETE) under evolving business rules? LiveSQLBench is contamination-free and periodically refreshed, so models cannot rely on memorized answers (Lite: 270 tasks, Full: 600 tasks).
When SQL breaks in production, can models diagnose and fix it? We test this with BIRD-Critic (NeurIPS 2025) across SQL dialect variants spanning SQLite, PostgreSQL, MySQL, SQL Server, and Oracle. Models must identify issues in database applications and produce working fixes (1,804 instances). The best score is only 40.9%.
Real users ask vague questions and change their minds. Can models handle human-centric, multi-turn agentic data interactions? BIRD-Interact (ICLR 2026, Oral) uses a user simulator and database environments to test whether models can clarify ambiguous requests, iteratively refine queries, and explore complex database environments.
Can models translate data science code between languages and frameworks? This is critical for teams migrating tools or integrating heterogeneous data stacks. DS-CodeTrans tests translation across three domains: data querying, data management, and deep learning (800 instances).
Can models reason over images and structured data together? Many real-world queries require combining visual inputs with database queries. BIRD-Vision tests this across two tracks: Life (room photos, everyday scenes) and Science (microscopy scans, research data), with 300 instances each.
How the Index Is Calculated
Each of the 6 aspects produces a single representative score (%) for each model. When an aspect includes multiple sub-benchmarks or dialect variants, we compute an instance-weighted average, where each sub-benchmark contributes proportionally to its number of evaluation instances. For example, the BIRD-Critic score combines SQLite (504), Flash (200), PG-530 (530), and Open (570) variants weighted by their instance counts.
The Data Intelligence Index is then the average of the representative scores across all aspects, treating each aspect equally. By default, the index covers 5 aspects; BIRD-Vision can be toggled on in the dashboard, which restricts the comparison to models with vision results and averages over all 6 aspects.
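The two-step aggregation above can be sketched in a few lines. This is a minimal illustration, not the official scoring code: the BIRD-Critic instance counts are taken from this post, but all the per-benchmark scores below are made-up placeholders.

```python
def weighted_score(sub_scores):
    """Instance-weighted average over (score_pct, n_instances) pairs."""
    total = sum(n for _, n in sub_scores)
    return sum(s * n for s, n in sub_scores) / total

# BIRD-Critic sub-benchmarks: real instance counts from the post,
# placeholder scores for illustration only.
critic = [
    (42.0, 504),  # SQLite
    (38.0, 200),  # Flash
    (35.0, 530),  # PG-530
    (30.0, 570),  # Open
]
critic_score = weighted_score(critic)

# The index is the plain (equal-weight) mean of one representative
# score per aspect: 5 aspects by default, 6 with BIRD-Vision toggled on.
aspect_scores = [55.0, 48.0, critic_score, 28.0, 60.0]  # placeholders
index = sum(aspect_scores) / len(aspect_scores)
```

Note that the weighting happens only inside an aspect; across aspects, every aspect counts equally regardless of how many instances it contains.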
The Takeaways
We evaluated 6 frontier models (Claude Opus 4.6, Claude Sonnet 4.5, GLM 4.7, Kimi 2.5, MiniMax M2.1, and Qwen3 Coder 480B) across all aspects.
Data intelligence is still far from solved.
The top model averages under 50% across all aspects. SQL debugging peaks at 40.9%, and human-centric interaction at just 28.8%. Data-centric intelligence remains wide open for improvement.
Claude Opus 4.6 wins overall.
Opus 4.6 leads with the highest Data Intelligence Index, ranking first by a large margin on DB querying, BI analysis, application debugging, and code translation. However, its human-centric interaction score improves only marginally over Sonnet 4.5's, and Kimi 2.5 outperforms it on multi-modal querying.
Kimi 2.5: cheap and powerful.
At a fraction of Opus's cost, Kimi 2.5 leads on multi-modal querying (BIRD-Vision: 47.0% vs Opus's 43.8%) and matches Sonnet 4.5 on DB querying and BI analysis. For cost-sensitive deployments, Kimi offers the best accuracy-per-dollar on data-centric tasks.
Human-centric data tasks remain hard.
The best model scores only ~28% on this aspect. Even Opus 4.6 achieves just 28.1%, and Qwen3 Coder drops to 18.4%. Clarifying ambiguous requests, exploring complex database environments, and iteratively refining queries across multiple turns all remain far from solved.
No one-size-fits-all.
Opus dominates overall, but Kimi leads on multi-modal querying, and Qwen3 matches Opus on DB querying at 1/17th the cost. The best choice depends on the task and evaluation setting.
Explore the Results
The interactive dashboard lets you compare all models across every benchmark. Toggle BIRD-Vision on or off, switch between sub-datasets, and hover over the radar chart to profile individual models.
Open Dashboard →
We will continue updating the Data Intelligence Index as new models and benchmarks emerge. All benchmark code and data are open source at bird-bench.github.io.
Contact: bird.bench25@gmail.com · jl0725@connect.hku.hk · xxh24@connect.hku.hk