AISB 2026 NLPCC Shared Task

Submission Rules

Requirements, format specifications, and resource limits for all AISB 2026 competition tracks.

Submission Format

Each submission is a Docker image that, when executed with the provided task inputs, produces outputs in the specified format. The image must be self-contained: all code, model weights, and dependencies must be included.

Required Directory Structure
submission/
  Dockerfile              # Build instructions
  run.sh                  # Entry point script
  src/                    # Source code
  models/                 # Model weights (if any)
  config.yaml             # System configuration
  metadata.json           # Team info, system description
  output/                 # Generated at runtime
    results.json          # Final results
    iterations.jsonl      # Step-by-step execution log
    claims.json           # Self-reported numerical claims
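As a pre-submission sanity check, a short script can confirm the required entries from the layout above are present. This helper is our own sketch, not part of the official tooling; optional entries (models/, which is needed only if weights are shipped, and output/, which is generated at runtime) are deliberately excluded.

```python
import os

# Entries every submission must ship; models/ and output/ are optional/runtime.
REQUIRED = ["Dockerfile", "run.sh", "src", "config.yaml", "metadata.json"]

def check_layout(root: str) -> list[str]:
    """Return the required entries missing from a submission directory."""
    return [name for name in REQUIRED
            if not os.path.exists(os.path.join(root, name))]
```

Running `check_layout("submission")` before uploading catches a forgotten `run.sh` or `metadata.json` early, instead of at the portal's security and size scan.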

metadata.json Schema
{
  "team_name": "string",
  "system_name": "string",
  "track": "T1 | T2 | T3",
  "contact_email": "string",
  "description": "string (max 500 chars)",
  "base_models": ["list of LLM/model names used"],
  "estimated_cost": "float (USD)",
  "submission_date": "YYYY-MM-DD"
}
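A local check against this schema might look like the following. The field checks mirror the schema above, but the function itself is illustrative; the official portal's validator may enforce additional rules.

```python
import json
import re

ALLOWED_TRACKS = {"T1", "T2", "T3"}
FIELDS = ("team_name", "system_name", "track", "contact_email",
          "description", "base_models", "estimated_cost", "submission_date")

def validate_metadata(raw: str) -> list[str]:
    """Return a list of problems found in a metadata.json string."""
    meta = json.loads(raw)
    errors = [f"missing field: {f}" for f in FIELDS if f not in meta]
    if meta.get("track") not in ALLOWED_TRACKS:
        errors.append("track must be one of T1, T2, T3")
    if len(meta.get("description", "")) > 500:
        errors.append("description exceeds 500 characters")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", meta.get("submission_date", "")):
        errors.append("submission_date must be YYYY-MM-DD")
    return errors
```

An empty return value means the file at least matches the documented shape; it does not guarantee the metadata is accurate, which remains the team's responsibility.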

Requirements

Must

  • Run inside the provided Docker sandbox without modification
  • Produce output/results.json and output/iterations.jsonl
  • Complete within the track-specific time limit
  • Stay within the API cost budget
  • Include accurate metadata.json with team and system details
  • Report all numerical claims in output/claims.json for CAS verification
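The last requirement can be met with a small helper that writes claims where the harness expects them. The exact claims schema is not specified in this document; the metric/value dictionaries below are an illustrative shape only.

```python
import json
import os

def write_claims(output_dir: str, claims: list[dict]) -> None:
    """Write self-reported numerical claims to output/claims.json.

    The per-claim structure (e.g. {"metric": ..., "value": ...}) is an
    assumed example, not the official schema.
    """
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "claims.json"), "w") as f:
        json.dump(claims, f, indent=2)
```

Calling this at the end of the evaluation run, with only numbers actually produced during that run, keeps the claims file consistent with the "Must Not" rules below.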

Must Not

  • Attempt to access the internet or external APIs
  • Modify the Docker container or escape the sandbox
  • Access test labels, answer keys, or canary tokens
  • Hard-code benchmark-specific answers in source code
  • Report results not produced during the evaluation run
  • Use more than the allocated compute resources (GPU, memory, disk)

Evaluation Procedure

1. Submission Upload -- Teams upload their Docker image to the AISB submission portal. The image is scanned for security issues and size compliance.
2. Sandbox Execution -- The image is launched inside the Docker sandbox (Layer 1) with task inputs. Execution traces are recorded (Layer 2).
3. Output Collection -- Results, execution logs, and claims are extracted from the container output directory.
4. Integrity Check -- CAS is computed (Layer 3). Canary tokens are checked (Layer 4). Integrity status is assigned (PASS / FLAGGED / FAIL).
5. Scoring -- Raw task score is computed. DualScore = RawScore x CAS. Results are posted to the leaderboard.
6. Review Period -- FLAGGED submissions enter a 7-day review period. Teams may provide explanations for unverified claims. Final status is determined by organizers.
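The scoring rule in step 5 is simple to state in code. In this sketch, the 0.5 FAIL cutoff comes from the disqualification criteria below; the FLAGGED threshold of 0.9 is our assumption, since the document does not specify where the FLAGGED band begins.

```python
def dual_score(raw_score: float, cas: float) -> float:
    """DualScore = RawScore x CAS, per step 5 of the evaluation procedure."""
    return raw_score * cas

def integrity_status(cas: float, canary_hit: bool,
                     flag_threshold: float = 0.9) -> str:
    """Assign PASS / FLAGGED / FAIL from step 4.

    CAS < 0.5 or a detected canary token means FAIL (disqualification
    criteria); flag_threshold=0.9 is an assumed example value.
    """
    if canary_hit or cas < 0.5:
        return "FAIL"
    if cas < flag_threshold:
        return "FLAGGED"
    return "PASS"
```

For example, a raw score of 0.8 with a CAS of 0.9 yields a DualScore of 0.72, so unverified claims directly reduce leaderboard position even before any FLAGGED review.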

Disqualification Criteria

  • CAS below 0.5 (significant result fabrication)
  • Canary token detected in output (test set contamination)
  • Sandbox escape attempt detected
  • Exceeding resource limits by more than 10%
  • Evidence of hard-coded benchmark answers
  • Multiple submissions from the same team under different names

Disqualified submissions remain on the leaderboard with FAIL status for transparency but are ineligible for prizes or rankings.