Submission Rules
Requirements, format specifications, and resource limits for all AISB 2026 competition tracks.
Submission Format
Each submission is a Docker image that, when executed with the provided task inputs, produces outputs in the specified format. The image must be self-contained: all code, model weights, and dependencies must be included.
Required Directory Structure
submission/
  Dockerfile         # Build instructions
  run.sh             # Entry point script
  src/               # Source code
  models/            # Model weights (if any)
  config.yaml        # System configuration
  metadata.json      # Team info, system description
  output/            # Generated at runtime
    results.json     # Final results
    iterations.jsonl # Step-by-step execution log
    claims.json      # Self-reported numerical claims
metadata.json Schema
{
"team_name": "string",
"system_name": "string",
"track": "T1 | T2 | T3",
"contact_email": "string",
"description": "string (max 500 chars)",
"base_models": ["list of LLM/model names used"],
"estimated_cost": "float (USD)",
"submission_date": "YYYY-MM-DD"
}
Requirements
Must
- Run inside the provided Docker sandbox without modification
- Produce output/results.json and output/iterations.jsonl
- Complete within the track-specific time limit
- Stay within the API cost budget
- Include accurate metadata.json with team and system details
- Report all numerical claims in output/claims.json for CAS verification
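The "Must" requirements above can be checked before upload. The sketch below is a hypothetical pre-flight validator, not an official tool: the file names and metadata fields come from the required directory structure and schema in this document, but the function name and error messages are illustrative.

```python
import json
from pathlib import Path

# File names from the required directory structure.
REQUIRED_OUTPUTS = ["results.json", "iterations.jsonl", "claims.json"]
# Fields from the metadata.json schema.
REQUIRED_METADATA = ["team_name", "system_name", "track",
                     "contact_email", "description", "base_models",
                     "estimated_cost", "submission_date"]

def check_submission(root):
    """Return a list of problems found in a submission directory."""
    root = Path(root)
    problems = []
    meta_path = root / "metadata.json"
    if not meta_path.exists():
        problems.append("missing metadata.json")
    else:
        meta = json.loads(meta_path.read_text())
        for field in REQUIRED_METADATA:
            if field not in meta:
                problems.append(f"metadata.json missing field: {field}")
        if meta.get("track") not in {"T1", "T2", "T3"}:
            problems.append("track must be one of T1, T2, T3")
        if len(meta.get("description", "")) > 500:
            problems.append("description exceeds 500 chars")
    out = root / "output"
    for name in REQUIRED_OUTPUTS:
        if not (out / name).exists():
            problems.append(f"missing output/{name}")
    return problems
```

An empty return value means the structural checks pass; it does not guarantee the submission will clear the integrity checks described below.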
Must Not
- Attempt to access the internet or external APIs
- Modify the Docker container or escape the sandbox
- Access test labels, answer keys, or canary tokens
- Hard-code benchmark-specific answers in source code
- Report results not produced during the evaluation run
- Use more than the allocated compute resources (GPU, memory, disk)
Evaluation Procedure
1. Submission Upload -- Teams upload their Docker image to the AISB submission portal. The image is scanned for security issues and size compliance.
2. Sandbox Execution -- The image is launched inside the Docker sandbox (Layer 1) with task inputs. Execution traces are recorded (Layer 2).
3. Output Collection -- Results, execution logs, and claims are extracted from the container output directory.
4. Integrity Check -- CAS is computed (Layer 3). Canary tokens are checked (Layer 4). Integrity status is assigned (PASS / FLAGGED / FAIL).
5. Scoring -- Raw task score is computed. DualScore = RawScore x CAS. Results are posted to the leaderboard.
6. Review Period -- FLAGGED submissions enter a 7-day review period. Teams may provide explanations for unverified claims. Final status is determined by organizers.
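Steps 4 and 5 can be sketched in a few lines. The DualScore formula and the 0.5 FAIL cutoff come from this document; the FLAGGED band below 1.0 is an assumption for illustration, not the official assignment rule.

```python
def dual_score(raw_score, cas):
    """DualScore = RawScore x CAS, per step 5 of the procedure."""
    return raw_score * cas

def integrity_status(cas, canary_hit):
    """Assign PASS / FLAGGED / FAIL (step 4).

    A detected canary token or CAS below 0.5 is FAIL per the
    disqualification criteria; the FLAGGED threshold of 1.0 is a
    hypothetical choice for this sketch.
    """
    if canary_hit or cas < 0.5:
        return "FAIL"
    if cas < 1.0:
        return "FLAGGED"
    return "PASS"
```

Note that a submission with a perfect raw score but a CAS of 0.6 would rank below a weaker but fully verified submission, which is the point of the dual-scoring design.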
Disqualification Criteria
- CAS below 0.5 (significant result fabrication)
- Canary token detected in output (test set contamination)
- Sandbox escape attempt detected
- Exceeding resource limits by more than 10%
- Evidence of hard-coded benchmark answers
- Multiple submissions from the same team under different names
Disqualified submissions remain on the leaderboard with FAIL status for transparency but are ineligible for prizes or rankings.
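The resource-limit criterion above is mechanical enough to sketch. The 10% tolerance comes from the criteria list; the resource names and limit values in the example are hypothetical.

```python
def exceeds_limits(usage, limits, tolerance=0.10):
    """Return the resources whose usage exceeds the allocated limit
    by more than the allowed tolerance (grounds for disqualification)."""
    return [resource for resource, used in usage.items()
            if used > limits[resource] * (1 + tolerance)]
```

For example, with a hypothetical limit of 10 GPU-hours, 10.9 hours of usage stays within the 10% tolerance while 11.5 hours does not.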