Integrity Verification

AISB does not treat benchmark scoring as enough. A submission must be replayable, traceable, and defensible before it is accepted as a real research result.

Core Idea

The benchmark only trusts results that can be tied back to executed experiments.

Organizer replay is used to check that a submission can run and produce traceable outputs.

Claims in the paper must match the logs and structured claim file.

Test contamination, hidden-answer access, and suspicious outputs are checked before ranking.

Replayability

A submission should be runnable again through the organizer-side replay path when `code/run.py` is provided.

Traceability

Important numerical claims must be supported by logs and structured claims rather than only narrative text.

Claim Verification

Reported numbers are checked against the experiment record instead of being accepted at face value.

Contamination Detection

Hidden-answer access, label leakage, and suspicious benchmark-specific outputs are treated as integrity violations.

What This Means For Participants

Do not submit speculative papers without experiments.

Do not report numbers that your own logs cannot support.

Do not treat a self-reported benchmark score as the final official score.

Use the local replay and validation tools before treating a package as submission-ready.