Integrity Verification
AISB does not treat a benchmark score as sufficient on its own. A submission must be replayable, traceable, and defensible before it is accepted as a genuine research result.

Core Idea
The benchmark only trusts results that can be tied back to executed experiments.
Organizer replay is used to check that a submission can run and produce traceable outputs.
Claims in the paper must match both the logs and the structured claim file.
Test contamination, hidden-answer access, and suspicious outputs are checked before ranking.
Replayability
When `code/run.py` is provided, organizers should be able to re-run the submission through the organizer-side replay path.
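As a rough illustration of what a replay check might involve, the sketch below re-runs `code/run.py` in a subprocess and captures its output. It is not the organizers' actual replay path: the function name `replay_submission`, the timeout, and the shape of the returned dictionary are all assumptions made for this example.

```python
import subprocess
import sys
from pathlib import Path


def replay_submission(package_dir, timeout_s=3600):
    """Re-run code/run.py inside package_dir and capture its output.

    Hypothetical helper: the real organizer-side replay may differ.
    """
    package_dir = Path(package_dir)
    entry = package_dir / "code" / "run.py"
    if not entry.is_file():
        # Without an entry point there is nothing to replay.
        return {"replayed": False, "reason": "code/run.py not provided"}
    try:
        proc = subprocess.run(
            [sys.executable, str(entry.resolve())],
            cwd=package_dir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"replayed": False, "reason": "timed out"}
    return {
        "replayed": proc.returncode == 0,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Running a check like this locally before submitting can surface missing dependencies or hard-coded paths that would otherwise only appear during replay.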
Traceability
Important numerical claims must be supported by logs and structured claims rather than only narrative text.
Claim Verification
Reported numbers are checked against the experiment record instead of being accepted at face value.
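The comparison of reported numbers against the experiment record can be sketched as follows. This is a minimal example under stated assumptions: claims and logged metrics are represented as plain name-to-value dictionaries, and the name `verify_claims` and the rounding tolerance `abs_tol` are hypothetical. The document does not specify the actual claim file format or the tolerance the organizers apply.

```python
import math


def verify_claims(claims, logged_metrics, abs_tol=5e-4):
    """Return (name, claimed, logged) for every claim the logs do not support.

    abs_tol is an assumed allowance for rounding in the paper
    (here, three decimal places); it is not an official threshold.
    """
    failures = []
    for name, claimed in claims.items():
        logged = logged_metrics.get(name)
        # A claim fails if no log entry exists or the values disagree.
        if logged is None or not math.isclose(
            claimed, logged, rel_tol=0.0, abs_tol=abs_tol
        ):
            failures.append((name, claimed, logged))
    return failures
```

For example, with claims `{"accuracy": 0.912, "f1": 0.88}` and logged metrics `{"accuracy": 0.91234}`, the accuracy claim passes within the rounding allowance, while `f1` is reported as unsupported because no logged value exists for it.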
Contamination Detection
Hidden-answer access, label leakage, and suspicious benchmark-specific outputs are treated as integrity violations.
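One simple form such a check could take is a static scan of submitted source files for references to hidden-answer resources. The sketch below is only a crude heuristic, not the benchmark's actual detector: the marker strings in `FORBIDDEN_MARKERS` and the function `scan_for_leakage` are hypothetical, since the document does not say where hidden answers are stored or how leakage is detected.

```python
from pathlib import Path

# Hypothetical marker strings; the real hidden-answer locations are not
# specified in this document.
FORBIDDEN_MARKERS = ("hidden_answers", "test_labels")


def scan_for_leakage(package_dir, markers=FORBIDDEN_MARKERS):
    """Return (relative_path, line_number, marker) for each suspicious line.

    Only .py files under package_dir are scanned, in sorted path order.
    """
    package_dir = Path(package_dir)
    hits = []
    for path in sorted(package_dir.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for marker in markers:
                if marker in line:
                    rel = path.relative_to(package_dir).as_posix()
                    hits.append((rel, lineno, marker))
    return hits
```

A plain string match like this would miss obfuscated access and can flag harmless comments, so any hit is best treated as a prompt for manual review rather than proof of a violation.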
What This Means For Participants
Do not submit speculative papers without experiments.
Do not report numbers that your own logs cannot support.
Do not treat a self-reported benchmark score as the final official score.
Use the local replay and validation tools before treating a package as submission-ready.