AISB 2026 NLPCC Shared Task

Competition Tracks

Three tracks evaluating different facets of AI scientific capability. All tracks include mandatory integrity verification.

T1

AI/CS Reasoning & Engineering

Can AI systems produce verifiable improvements in expert-level cross-disciplinary reasoning and real software engineering? Eight reference papers are provided, including three with provocative findings on reasoning limitations.

Benchmarks

HLE-Verified (50 questions)

Humanity's Last Exam — Verified. 50 questions stratified across math/physics, CS/logic, bio/chem, social science, and cross-disciplinary categories. SOTA: Gemini 3.1 Pro ~45%.

Metrics: Exact-match accuracy + o3-mini judge for short answers
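
As a sketch, the two-stage scoring might look like the following in Python; judge_fn is a placeholder for the o3-mini judge call, whose exact interface is not specified here:

from typing import Callable

def normalize(ans: str) -> str:
    # Light canonicalization before exact-match comparison.
    return ans.strip().lower().rstrip(".")

def score_question(pred: str, gold: str,
                   judge_fn: Callable[[str, str], bool]) -> bool:
    if normalize(pred) == normalize(gold):
        return True                    # exact match accepted outright
    return judge_fn(pred, gold)        # fall back to the short-answer judge

def accuracy(preds, golds, judge_fn) -> float:
    correct = sum(score_question(p, g, judge_fn) for p, g in zip(preds, golds))
    return correct / len(golds)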
FeatureBench (20 tasks)

Agentic coding for complex feature development. 20 tasks from 13 Docker images across 24 Python repos. SOTA: GPT-5.1-Codex 12.5%.

Metrics: Docker containerized pytest pass/fail
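
A minimal sketch of containerized pass/fail evaluation; the image names and test paths are hypothetical, and the official harness may differ:

import subprocess

def run_task(image: str, test_path: str = "tests/") -> bool:
    # pytest exit code 0 means every test in the suite passed.
    result = subprocess.run(["docker", "run", "--rm", image, "pytest", test_path])
    return result.returncode == 0

# Score = fraction of the 20 tasks whose suite passes:
# pass_rate = sum(run_task(img) for img in task_images) / len(task_images)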
Scoring Formula
Paper quality + benchmark performance + reproducibility (CAS)
Baseline
SOTA: HLE ~45%, FeatureBench 12.5%
Resource Budget
No fixed resource limit. Report cost in paper.
T2

Math and Proof

Test formal mathematical reasoning and proof generation. Systems must produce machine-verifiable proofs in Lean4, not just natural language solutions. This track has zero tolerance for fabrication -- proofs either verify or they do not.

Benchmarks

FormalMATH

50 problems from undergraduate to research-level mathematics. Each problem requires a complete formal proof in Lean4 that type-checks successfully.

Metrics: Proof completion rate (%), proof length efficiency
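
To make "type-checks" concrete, here is a minimal Lean4 proof the kernel accepts; FormalMATH problems are far harder, but the acceptance criterion is identical. (Example theorem only, not drawn from the benchmark.)

-- A trivial theorem with a complete, kernel-checked proof.
theorem sample_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Proofs containing `sorry` elaborate with a warning but are incomplete
-- and would not count toward the completion rate.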
Scoring Formula
Paper quality + proof completion rate + reproducibility (CAS)
Baseline
Baseline DeepSeek-Prover-V2: 28% completion rate
Resource Budget
No fixed resource limit. Report cost in paper.
T3

Scientific Discovery

Evaluate AI systems on real scientific prediction tasks. This track requires systems to make quantitative predictions on held-out data, combining domain knowledge with computational methods. No room for fabrication -- predictions are scored against ground truth.

Benchmarks

TDC ADMET

5 ADMET endpoints (Caco-2 permeability, hERG inhibition, microsomal clearance, lipophilicity, solubility). Systems predict molecular properties from SMILES strings.

Metrics: Average MAE across 5 endpoints, normalized against baseline
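
One plausible reading of the normalization, sketched in Python; dividing each endpoint's MAE by the baseline's MAE is an assumption, and the endpoint keys are placeholders:

# Per-endpoint MAE divided by the baseline MAE, then averaged across the
# 5 endpoints, so values below 1.0 beat the baseline. (Assumed convention.)
def mae(preds, targets):
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

def normalized_avg_mae(system_mae: dict, baseline_mae: dict) -> float:
    ratios = [system_mae[ep] / baseline_mae[ep] for ep in baseline_mae]
    return sum(ratios) / len(ratios)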
Matbench Discovery

Materials property prediction. Systems predict formation energy and stability of inorganic crystals from structure data.

Metrics: F1 score for stable/unstable classification, MAE for energy
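
A sketch of both metrics, assuming stability is thresholded on predicted energy above the convex hull (at or below 0 eV/atom counts as stable); the track's exact convention may differ:

# F1 over stable/unstable labels derived from energy-above-hull predictions,
# plus raw MAE on the energies themselves. The threshold is an assumption.
from sklearn.metrics import f1_score, mean_absolute_error

def matbench_metrics(pred_e_hull, true_e_hull, threshold=0.0):
    pred_stable = [e <= threshold for e in pred_e_hull]
    true_stable = [e <= threshold for e in true_e_hull]
    return {
        "f1": f1_score(true_stable, pred_stable),
        "mae_ev_per_atom": mean_absolute_error(true_e_hull, pred_e_hull),
    }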
Scoring Formula
Paper quality + benchmark performance + reproducibility (CAS)
Baseline
Baseline RF/XGBoost: TDC avg MAE 0.82, Matbench F1 0.58
Resource Budget
No fixed resource limit. Report cost in paper.

Common to All Tracks

Integrity Verification

All submissions pass through the 4-layer integrity system. Docker sandboxing ensures reproducibility. CAS (Claim Accuracy Score) must be above 0.5 to qualify for ranking.

Dual Score

Final ranking uses DualScore = RawScore × CAS. A system that scores 90% on the task but has a CAS of 0.3 gets a DualScore of 27 and is ranked below a system that scores 60% with a CAS of 0.9 (DualScore 54).
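
As a sketch, the ranking rule and qualification gate in Python, using the worked example above:

# DualScore exactly as stated: raw task score (percent) times CAS.
def dual_score(raw_score: float, cas: float) -> float:
    return raw_score * cas

# The qualification gate is applied separately: CAS must exceed 0.5.
def qualifies(cas: float) -> bool:
    return cas > 0.5

# Worked example from the text:
# dual_score(90, 0.3) == 27.0   (and CAS 0.3 also fails the 0.5 gate)
# dual_score(60, 0.9) == 54.0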