AISB 2026: AI Scientist Benchmark
NLPCC 2026 Shared Task
Evaluating autonomous AI research: problem discovery + experimental validation + communicating findings

Overview / 概述
AISB (AI Scientist Benchmark) evaluates AI systems as autonomous researchers — the complete cycle of Idea + Experiment + Report. Given a research topic and reference papers, AI systems must discover scientific problems, form hypotheses, validate through experiments, and communicate findings in a paper humans can understand.
As an NLPCC 2026 Shared Task, AISB invites participants to develop AI systems that autonomously conduct scientific research across 12 directions and 117 benchmarks. A Claim Accuracy Score (CAS) check verifies that all claimed results are real and traceable, not fabricated.
Organized by Westlake University NLP Lab under the supervision of Prof. Yue Zhang. Top-3 teams receive CCF-NLP certification.
Tracks / 赛道
Track 1: Scientific Research / 科学研究赛道
The AI system is given a research topic, reference papers, and a benchmark, and must autonomously conduct a full research cycle: discover scientific problems, form hypotheses, design experiments as validation, analyze results, and write an ICLR-format paper that communicates findings clearly.
- Paper Quality (significance, novelty, methodology, writing)
- Benchmark Performance
- Reproducibility (CAS: claims vs experiment logs)
Top-10: human expert review
Track 2: Benchmark SOTA Challenge / 基准提升赛道
The AI system is given a benchmark with known SOTA baselines and must develop a new method that improves over current SOTA. Experiments are the primary evidence, but the system must explain why the method works — interpretable improvement, not blind optimization.
| Research Direction | Benchmarks | Baseline / SOTA |
| --- | --- | --- |
| AI/CS Reasoning & Engineering / 推理与工程 | HLE-Verified (50 questions) + FeatureBench (20 tasks) | SOTA: 37-45% / 12.5% |
| Math & Proof / 数学证明 | FormalMATH (50 problems, Lean4) | 28% (DeepSeek-Prover) |
| Scientific Discovery / 科学发现 | TDC ADMET (5 endpoints) + Matbench | MAE 0.93 |
Evaluation Architecture / 评测架构
Think & Explore (explore the topic, form hypotheses)
↓
Hypothesis → Experiment → Validation
↓
Communicate Findings (present conclusions so humans can understand them)
Both tracks assess the AI's autonomous research ability; they differ in emphasis: Track 1 centers on the idea, Track 2 on the experiments.
Important Dates / 重要日期
How to Participate / 参赛方式
Register / 注册
Register via Google Form (https://forms.gle/9oWtS77UduudpRM1A), then create a HuggingFace account.
Download Data / 下载数据
Run our download script: python scripts/download_competition_data.py --all
Develop Your AI Scientist / 开发 AI 科学家系统
Build a system that designs research strategies for the fixed executor. Test locally with the dev set.
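A strategy-design system might look like the following minimal sketch. This is purely illustrative: the real executor API comes from the competition starter kit, and the class and method names here (`ResearchStrategy`, `propose`) are assumptions, not the official interface.

```python
# Hypothetical interface sketch -- the actual executor API is defined by
# the competition starter kit; names and fields here are assumptions.
class ResearchStrategy:
    """Proposes the next experiment for the fixed executor to run."""

    def propose(self, topic, history):
        # history: list of (hypothesis, score) pairs from earlier iterations
        if not history:
            return {"hypothesis": "baseline reproduces reported numbers",
                    "experiment": "run_baseline"}
        # Otherwise refine the best-scoring hypothesis so far.
        best_hypothesis, _ = max(history, key=lambda h: h[1])
        return {"hypothesis": f"refine: {best_hypothesis}",
                "experiment": "run_variant"}

strategy = ResearchStrategy()
print(strategy.propose("ADMET prediction", []))
```

Testing locally against the dev set would mean driving this loop with the provided executor and inspecting the proposed experiments before submission.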
Submit / 提交
Package your system as submission.tar.gz and upload via our HuggingFace Space.
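Packaging can be done with standard `tar`. The directory layout below is hypothetical (the required contents are those listed under Submission Requirements); only the archive name `submission.tar.gz` comes from the task description.

```shell
# Assumed layout for illustration; the official requirements list a paper,
# executable code, iterations.jsonl, and metadata.json.
mkdir -p submission/code
touch submission/paper.md submission/iterations.jsonl submission/metadata.json
tar -czf submission.tar.gz -C submission .
tar -tzf submission.tar.gz   # inspect archive contents before uploading
```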
Submission Requirements / 提交要求
Format / 格式
- ICLR format paper (paper.md or paper.tex)
- Complete executable code
- Standardized iterations.jsonl log
- metadata.json with team info
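The two JSON artifacts above could be produced as in this sketch. The field names (`team`, `track`, `iteration`, `metric`, and so on) are illustrative assumptions; the official schema is whatever the organizers' standardized log format specifies.

```python
import json

# Hypothetical field names -- the official schema is set by the organizers;
# this only illustrates the two required JSON artifacts.
metadata = {"team": "example-team", "track": 1, "contact": "team@example.com"}
iteration = {"iteration": 1,
             "hypothesis": "baseline underfits on small splits",
             "command": "python run.py",
             "metric": {"accuracy": 0.42}}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

with open("iterations.jsonl", "a") as f:  # JSON Lines: one object per line
    f.write(json.dumps(iteration) + "\n")
```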
Integrity / 诚信
- CAS (Claim Accuracy Score) must be above 0.5
- All numbers must trace to experiment logs
- Docker sandbox execution
- No network access during evaluation
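The integrity check can be pictured with a minimal sketch, assuming CAS is roughly the fraction of numeric claims in the paper that can be matched to entries in the experiment logs; the task's actual CAS definition may differ, and all names here are illustrative.

```python
# Sketch of a claim-accuracy check. Assumption: CAS = fraction of numeric
# claims in the paper that appear in the experiment logs (within tolerance).
def claim_accuracy_score(claimed_numbers, logged_numbers, tol=1e-6):
    if not claimed_numbers:
        return 0.0
    logged = list(logged_numbers)
    matched = sum(
        any(abs(c - l) <= tol for l in logged) for c in claimed_numbers
    )
    return matched / len(claimed_numbers)

# A paper claiming 0.91 and 0.42 while the logs only contain 0.42
# would score 0.5 under this toy definition.
print(claim_accuracy_score([0.91, 0.42], [0.42, 0.40]))  # → 0.5
```

Under this reading, the 0.5 threshold means at least half of a submission's claimed numbers must be backed by logged experiment results.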
Resources / 资源
- API budget: $10-$30 per track
- Wall-clock: 30 min to 2 hr per track
- Fixed executor (same for all teams)
- 1 submission per day per track
Contact / 联系方式
WestlakeNLP · Prof. Yue Zhang
Email: sunjoey035@gmail.com
NLPCC · CCF-NLP · Top-3 Teams Receive CCF-NLP Certification