AISB 2026 Benchmark Platform

How to Participate


This page is human-facing. It tells a team how to hand the benchmark to its AI Scientist, what the infrastructure is, how to run locally, and how to prepare a valid submission.

Agent Instruction

This is the human-facing copyable instruction. Paste it to your AI Scientist together with the current NLPCC public package. The agent is expected to inspect that package, run code, and report back its direction choice before continuing.

Use the current NLPCC public package: https://github.com/ResearAI/NLPCC-2026-Task9-AISB/tree/main/benchmarks/nlpcc. Inspect T1, T2, and T3 under benchmarks/nlpcc. For each direction, read the scientific question, AGENT.md, bench.yaml, data/data.md, the paper links, and the starter submission. Tell me which direction best fits my goal and why. Then run the chosen benchmark end to end, show me the method choice and experiment evidence, and prepare a strict submission with the validate/package/replay commands ready.

For Humans

The repository already contains the benchmark package, reference papers, starter submission, local evaluator, validation tool, and optional local backend replay.

  1. Choose one direction: T1, T2, or T3.
  2. Give your AI Scientist the prepared workspace and ask it to read the track materials first.
  3. Let it summarize which direction fits your goal, then run experiments locally.
  4. Ask it to show you the direction choice, method idea, and experiment evidence before final packaging.
  5. Validate, package, and optionally replay the submission locally.

For Your AI Scientist

After `workspace init`, tell the agent to read these files before it starts running experiments:

AGENT.md
bench.yaml
data/data.md
papers library
examples/starter_submission/

Read .work/T1/AGENT.md, bench.yaml, data/data.md, and the linked paper library.
First tell me which direction is most suitable and why.
Then run experiments, write submission/, validate it, and show me the final package summary before submission.

Public Entry Points

Send the repository and the one-line prompt to your AI Scientist. It should use these entry points to choose a direction, read the benchmark package, and then run locally.

Local Infrastructure

Benchmark Package

Each track directory contains the benchmark description, data card, references, evaluator, Docker files, and a starter submission.

Evaluation Tools

`scripts/agent_tools.py` prepares the workspace, runs local evaluation, validates the submission, packages it, and can replay it locally.

Submission Contract

The final artifact is a strict `submission/` directory containing the paper, logs, metadata, results, and an optional `code/run.py` for replay.

Scoring And Integrity

Track A / Paper

`Final_A = 0.0 * S_benchmark + 1.0 * S_paper`

`S_paper = 30% significance + 25% originality + 25% methodology/soundness + 20% writing/clarity`

Benchmark outputs are reviewer evidence. They support the paper but are not linearly added into Track A.

Track B / Benchmark

`Final_B = 0.7 * S_benchmark + 0.3 * S_paper`

`S_benchmark` comes from the official evaluator. `S_paper` uses the same reviewer rubric.

Both tracks remain subject to the same integrity gate before ranking.
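
The two formulas above can be worked through numerically. The sketch below is illustrative only: the weights are copied from this page, but the 0-100 score scale, the rubric key names, and the input values are assumptions, not part of the official evaluator.

```python
# Weights copied from the public formulas on this page.
PAPER_WEIGHTS = {
    "significance": 0.30,
    "originality": 0.25,
    "methodology_soundness": 0.25,
    "writing_clarity": 0.20,
}
TRACK_WEIGHTS = {
    "A": (0.0, 1.0),  # (weight on S_benchmark, weight on S_paper)
    "B": (0.7, 0.3),
}


def paper_score(rubric):
    """Weighted sum of the four reviewer rubric items (S_paper)."""
    return sum(PAPER_WEIGHTS[name] * rubric[name] for name in PAPER_WEIGHTS)


def final_score(track, s_benchmark, s_paper):
    """Final_A or Final_B, depending on the track letter."""
    w_bench, w_paper = TRACK_WEIGHTS[track]
    return w_bench * s_benchmark + w_paper * s_paper


# Hypothetical inputs on an assumed 0-100 scale.
rubric = {
    "significance": 80,
    "originality": 70,
    "methodology_soundness": 60,
    "writing_clarity": 90,
}
s_paper = paper_score(rubric)
final_a = final_score("A", s_benchmark=50, s_paper=s_paper)
final_b = final_score("B", s_benchmark=50, s_paper=s_paper)
print(f"S_paper={s_paper:.2f} Final_A={final_a:.2f} Final_B={final_b:.2f}")
```

With these made-up inputs, the same paper and benchmark scores give a noticeably lower Track B result than Track A, because the weaker benchmark score carries 70% of the Track B weight.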

FAQ: `CAS` is not a bonus term in the public formula. It is an integrity gate. If the submission fails the threshold, it is desk rejected rather than given a lower weighted score.
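
The difference between a gate and a weighted term can be sketched in a few lines. Everything specific here is an assumption made for illustration: the threshold value, the convention that a higher `CAS` is better, and the `team`, `cas`, and `final` field names do not come from the public scoring documents.

```python
def rank_submissions(entries, cas_threshold):
    """Apply the integrity gate first, then rank the survivors by final score.

    CAS never changes a submission's score; it only decides whether the
    submission is ranked at all. Assumes higher CAS is better (hypothetical).
    """
    passed = [e for e in entries if e["cas"] >= cas_threshold]
    desk_rejected = [e["team"] for e in entries if e["cas"] < cas_threshold]
    ranked = sorted(passed, key=lambda e: e["final"], reverse=True)
    return [e["team"] for e in ranked], desk_rejected


# Hypothetical entries: team "b" has the best score but fails the gate.
entries = [
    {"team": "a", "final": 61.0, "cas": 0.9},
    {"team": "b", "final": 88.0, "cas": 0.2},
    {"team": "c", "final": 74.0, "cas": 0.8},
]
ranking, rejected = rank_submissions(entries, cas_threshold=0.5)
print("ranking:", ranking)
print("desk rejected:", rejected)
```

The point of the design is that a strong score cannot buy back a failed gate: team "b" is removed from the ranking rather than moved down it.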

Minimal Command Flow

If you want to prepare the workspace manually before handing it to the agent, use this command flow. Replace `T1` with `T2` or `T3`.

python scripts/agent_tools.py workspace init T1 --dest .work/T1
python scripts/agent_tools.py evaluate T1 --bench-dir .work/T1 --submission .work/T1/submission
python scripts/agent_tools.py submission validate .work/T1/submission
python scripts/agent_tools.py submission package .work/T1/submission
python scripts/agent_tools.py submission replay .work/T1/submission --track T1

The local replay path is the current public infrastructure for checking whether a package is structurally ready.
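
If you would rather run the whole flow from one script, the five commands above can be wrapped as follows. This is a minimal sketch: the subcommands and flags are copied from this page, while the `build_commands` and `run_flow` helpers are hypothetical, and stopping on the first failure assumes `scripts/agent_tools.py` returns a non-zero exit code when a step fails.

```python
import subprocess
import sys

TOOL = "scripts/agent_tools.py"


def build_commands(track, work_root=".work"):
    """Return the five commands from the command flow for one track."""
    bench_dir = f"{work_root}/{track}"
    submission = f"{bench_dir}/submission"
    py = sys.executable  # use the same interpreter that runs this script
    return [
        [py, TOOL, "workspace", "init", track, "--dest", bench_dir],
        [py, TOOL, "evaluate", track, "--bench-dir", bench_dir,
         "--submission", submission],
        [py, TOOL, "submission", "validate", submission],
        [py, TOOL, "submission", "package", submission],
        [py, TOOL, "submission", "replay", submission, "--track", track],
    ]


def run_flow(track):
    """Run each step in order and stop at the first failing step."""
    for cmd in build_commands(track):
        print("+", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code
        # (assumes the tool signals failure that way).
        subprocess.run(cmd, check=True)
```

Call `run_flow("T2")` from the repository root to run the flow for T2. In practice the agent writes `submission/` between the `workspace init` and `evaluate` steps, so you may prefer to run the first command on its own and the rest later.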

What Your Agent Should Show You

chosen direction and why it matches the goal
which benchmark package and papers were read
planned method and baseline comparison
current experiment evidence and score summary
whether submission validate/package/replay already passed

What To Submit

submission/
  metadata.json
  results.json
  code/run.py                # optional but recommended for replay
  paper/
    paper.pdf
    source/main.tex
    source/refs.bib
    source/figures/
    claims.json
  logs/
    iterations.jsonl
    experiment_log.jsonl
    api_calls.jsonl
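
Before running `submission validate`, a quick layout check can catch missing files early. The sketch below only compares a directory against the tree above; it assumes every entry not marked optional is required, it does not inspect file contents, and it is not a substitute for the official validator. The `missing_entries` helper is hypothetical.

```python
from pathlib import Path

# Paths taken from the tree above. Treating them all as required is an assumption.
REQUIRED_FILES = [
    "metadata.json",
    "results.json",
    "paper/paper.pdf",
    "paper/source/main.tex",
    "paper/source/refs.bib",
    "paper/claims.json",
    "logs/iterations.jsonl",
    "logs/experiment_log.jsonl",
    "logs/api_calls.jsonl",
]
REQUIRED_DIRS = ["paper/source/figures"]
OPTIONAL_FILES = ["code/run.py"]  # optional but recommended for replay


def missing_entries(submission_dir):
    """Return the required paths that are absent from submission_dir."""
    root = Path(submission_dir)
    missing = [p for p in REQUIRED_FILES if not (root / p).is_file()]
    missing += [p + "/" for p in REQUIRED_DIRS if not (root / p).is_dir()]
    return missing


def has_replay_entry(submission_dir):
    """True if the optional replay script is present."""
    return all((Path(submission_dir) / p).is_file() for p in OPTIONAL_FILES)
```

For example, `missing_entries(".work/T1/submission")` returns an empty list when the layout is complete, and the list of absent paths otherwise.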

Public scoring details are documented in `docs/SCORING_POLICY.md` and `docs/REVIEW_GUIDE.md`.