SUBMIT A SYSTEM

Five steps to the board

Build the data, produce predictions, validate, score locally, open a PR. CI re-runs the scorer; a maintainer merges when it is green.

Get the data

python -m glassbench.build_data

Reproduces the committed JSONL byte-for-byte. At inference, use only the id, history, and query fields; a submission that reads gold_answer is rejected.
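One way to respect the field rule is to strip every row down to the permitted fields before your system sees it. The sketch below is illustrative and not part of glassbench: the helper names are hypothetical, and it assumes each JSONL line is a single JSON object carrying the field names listed above.

```python
import json
from typing import Iterable, Iterator

# The only fields this page permits at inference time.
ALLOWED_FIELDS = ("id", "history", "query")


def inference_view(row: dict) -> dict:
    """Return a copy of `row` restricted to the permitted fields."""
    return {key: row[key] for key in ALLOWED_FIELDS if key in row}


def iter_inference_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL lines and yield only the inference-safe view of each row."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield inference_view(json.loads(line))
```

Passing an open file handle for the built JSONL as `lines` means gold_answer never reaches the model code.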

Produce predictions

predictions.json

A JSON array with one row per id. Each row is an object with either the keys "id", "answer", and "confidence", or the keys "id" and "abstain" (set to true). Items missing from the array score as abstentions.
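A minimal sketch of building rows in the two shapes described above and serializing them. The helper names are hypothetical. The confidence value is shown as a float, which is an assumption, since this page does not state its type or range.

```python
import json


def make_row(item_id, answer=None, confidence=None):
    """Build one prediction row; omit `answer` to abstain on the item."""
    if answer is None:
        return {"id": item_id, "abstain": True}
    # Assumption: confidence is a number; the page does not specify a range.
    return {"id": item_id, "answer": answer, "confidence": confidence}


def dump_predictions(rows):
    """Serialize the rows as a JSON array, ready to write to predictions.json."""
    return json.dumps(rows, indent=2, ensure_ascii=False)
```

Writing the returned string to predictions.json in your submission folder produces the file the later steps expect.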

Validate

python scripts/validate_submission.py submissions/<system>/predictions.json

Catches duplicate ids and malformed rows before you score.
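For intuition, the two kinds of check named above can be sketched as follows. This is not the code in scripts/validate_submission.py, whose exact rules are not shown on this page. It is a hypothetical pre-check that assumes the row shapes described in the predictions step.

```python
from collections import Counter


def check_rows(rows):
    """Return a list of problem descriptions; an empty list means no issues found."""
    if not isinstance(rows, list):
        return ["top level must be a JSON array"]
    problems = []
    seen = []
    for index, row in enumerate(rows):
        if not isinstance(row, dict) or "id" not in row:
            problems.append(f"row {index}: not an object with an id")
            continue
        seen.append(repr(row["id"]))  # repr keeps unhashable ids countable
        is_abstain = row.get("abstain") is True
        is_answer = "answer" in row and "confidence" in row
        if not (is_abstain or is_answer):
            problems.append(f"row {index}: neither an answer row nor an abstention")
    for item_id, count in Counter(seen).items():
        if count > 1:
            problems.append(f"id {item_id} appears {count} times")
    return problems
```

Run the repository's own validator before opening a PR. A sketch like this is only useful as an early sanity check while generating predictions.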

Score locally

python -m glassbench.score --predictions submissions/<system>/predictions.json

Prints all six components, Glass, and diagnostics. Two runs on the same predictions are byte-identical.
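The byte-identical claim can be checked by hashing the output of two runs. The sketch below assumes the relevant output is what the command writes to stdout, which this page does not confirm. The helper names are hypothetical.

```python
import hashlib
import subprocess
import sys


def stdout_digest(cmd):
    """Run `cmd` and return the SHA-256 hex digest of its raw stdout bytes."""
    result = subprocess.run(cmd, check=True, capture_output=True)
    return hashlib.sha256(result.stdout).hexdigest()


def is_deterministic(cmd):
    """True if two consecutive runs of `cmd` print byte-identical stdout."""
    return stdout_digest(cmd) == stdout_digest(cmd)
```

To apply it to the scorer, pass the command above as a list, for example `[sys.executable, "-m", "glassbench.score", "--predictions", path]`, where `path` points at your predictions.json.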

Open a PR

submissions/<system>/{predictions.json, system.md}

CI runs the scorer, and a maintainer merges once it is green. The folder name (short, lowercase) becomes the leaderboard row.

Cite

@misc{glassbench2025,
  title  = {GlassBench: Does Your Memory System Know When It Didn't Know?},
  author = {build-with-bala},
  year   = {2025},
  note   = {Derived from LongMemEval (Wu et al., ICLR 2025)},
  url    = {https://github.com/build-with-bala/glassbench}
}

GlassBench · MIT License. Derived from LongMemEval (ICLR 2025, MIT).