Competition Tracks
Three tracks evaluating different facets of AI scientific capability. All tracks include mandatory integrity verification.
Track 1: AI/CS Reasoning & Engineering
HLE-Verified (50 questions) + FeatureBench (20 tasks)
Track 2: Math and Proof
FormalMATH
Track 3: Scientific Discovery
TDC ADMET + Matbench Discovery
AI/CS Reasoning & Engineering
Can AI systems produce verifiable improvements in expert-level cross-disciplinary reasoning and real software engineering? Eight reference papers are provided, including three with provocative findings on reasoning limitations.
Benchmarks
HLE-Verified (50 questions)
Humanity's Last Exam — Verified. 50 questions stratified across math/physics, CS/logic, bio/chem, social science, and cross-disciplinary categories. SOTA: Gemini 3.1 Pro ~45%.
FeatureBench (20 tasks)
Agentic coding for complex feature development. 20 tasks from 13 Docker images across 24 Python repos. SOTA: GPT-5.1-Codex 12.5%.
Math and Proof
Test formal mathematical reasoning and proof generation. Systems must produce machine-verifiable proofs in Lean4, not just natural language solutions. This track has zero tolerance for fabrication -- proofs either verify or they do not.
Benchmarks
FormalMATH
50 problems from undergraduate to research-level mathematics. Each problem requires a complete formal proof in Lean4 that type-checks successfully.
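What "type-checks successfully" means in practice can be illustrated with a minimal Lean4 proof. This is a toy statement, not an actual FormalMATH problem:

```lean
-- The proof term below is checked by Lean's kernel: it either
-- verifies or it is rejected, with no partial credit.
theorem add_comm_nat (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A natural-language argument, however convincing, scores nothing here; only a term the kernel accepts counts.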
Scientific Discovery
Evaluate AI systems on real scientific prediction tasks. This track requires systems to make quantitative predictions on held-out data, combining domain knowledge with computational methods. No room for fabrication -- predictions are scored against ground truth.
Benchmarks
TDC ADMET
5 ADMET endpoints (Caco-2 permeability, hERG inhibition, microsomal clearance, lipophilicity, solubility). Systems predict molecular properties from SMILES strings.
Matbench Discovery
Materials property prediction. Systems predict formation energy and stability of inorganic crystals from structure data.
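To make the ADMET setup concrete, here is a minimal, dependency-free baseline sketch: featurize a SMILES string by character counts and predict an endpoint with k-nearest neighbors. The featurization, alphabet, and training data are toy assumptions for illustration; a real submission would use proper molecular descriptors (e.g. from RDKit) and a trained model.

```python
from collections import Counter
import math

# Toy alphabet of SMILES characters to count; purely illustrative.
ALPHABET = "CNOSPFclBr=#()[]"

def featurize(smiles: str) -> list[float]:
    """Map a SMILES string to a fixed-length vector of character counts."""
    counts = Counter(smiles)
    return [float(counts[ch]) for ch in ALPHABET]

def knn_predict(query: str, train: list[tuple[str, float]], k: int = 3) -> float:
    """Predict an endpoint (e.g. lipophilicity) as the mean label of the
    k nearest training molecules in this toy feature space."""
    q = featurize(query)
    dists = sorted((math.dist(q, featurize(s)), y) for s, y in train)
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)
```

Even a baseline like this is scored against held-out ground truth, which is what keeps the track fabrication-proof: a prediction is right or wrong regardless of how it was produced.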
Common to All Tracks
Integrity Verification
All submissions pass through the 4-layer integrity system. Docker sandboxing ensures reproducibility. CAS (Claim Accuracy Score) must be above 0.5 to qualify for ranking.
Dual Score
Final ranking uses DualScore = RawScore x CAS. A system that scores 90% on the task but has CAS of 0.3 gets a DualScore of 27, ranked below a system that scores 60% with CAS of 0.9 (DualScore 54).
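The ranking rule can be sketched in a few lines. The function and variable names are illustrative, not part of the official scoring code; raw scores are on a 0-100 scale and CAS is in [0, 1], as in the worked example above.

```python
def dual_score(raw_score: float, cas: float) -> float:
    """DualScore = RawScore x CAS. raw_score on a 0-100 scale, CAS in [0, 1]."""
    return raw_score * cas

def qualifies(cas: float) -> bool:
    """Only submissions with CAS above 0.5 are eligible for ranking."""
    return cas > 0.5

# Worked example from the rules: 90% raw with CAS 0.3 yields DualScore 27,
# while 60% raw with CAS 0.9 yields DualScore 54 and ranks higher.
```

Note that under the qualification rule, a submission with CAS 0.3 would also fall below the 0.5 eligibility threshold, so a high raw score alone cannot rescue a low-integrity system.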