SUBMIT A SYSTEM
Five steps to the board
Build the data, produce predictions, validate, score locally, open a PR. CI re-runs the scorer; a maintainer merges when it is green.
Get the data
python -m glassbench.build_data
Reproduces the committed JSONL byte-for-byte. At inference, use only id, history, and query; submissions that read gold_answer are rejected.
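One way to keep gold_answer out of reach is to strip each record down to the three permitted fields before it reaches your system. The sketch below assumes each JSONL line is a JSON object containing at least id, history, and query. The helper name and the sample record are hypothetical, not part of GlassBench.

```python
import json

# The only fields a system may read at inference time.
ALLOWED_FIELDS = ("id", "history", "query")

def inference_view(jsonl_line):
    """Parse one JSONL line and keep only the permitted fields."""
    record = json.loads(jsonl_line)
    return {key: record[key] for key in ALLOWED_FIELDS}

# Hypothetical record, for illustration only; real field types may differ.
sample_line = json.dumps({
    "id": "example-1",
    "history": ["placeholder turn"],
    "query": "placeholder question",
    "gold_answer": "placeholder answer",
})
view = inference_view(sample_line)
print(sorted(view))
```

Passing only the filtered view to your system makes an accidental read of gold_answer fail with a KeyError instead of silently leaking the label.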
Produce predictions
predictions.json
A JSON array, one row per id: {"id","answer","confidence"} or {"id","abstain":true}. Missing items score as abstentions.
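A minimal sketch of building rows in the two shapes described above. The helper names and the sample ids are hypothetical, and the value 0.9 assumes confidence is a number such as a probability, which the text here does not specify.

```python
import json

def make_row(item_id, answer=None, confidence=None):
    """Build one prediction row; abstain when there is no answer."""
    if answer is None:
        return {"id": item_id, "abstain": True}
    return {"id": item_id, "answer": answer, "confidence": confidence}

def write_predictions(rows, path):
    """Write all rows as a single JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

# Hypothetical ids and values, for illustration only.
rows = [
    make_row("example-1", answer="placeholder answer", confidence=0.9),
    make_row("example-2"),  # no answer available, so abstain
]
print(json.dumps(rows, indent=2))
```

Calling write_predictions(rows, path) with the path to predictions.json inside your submission folder produces the file. An explicit abstain row and an omitted item score the same way, but emitting the row makes the abstention visible in the file.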
Validate
python scripts/validate_submission.py submissions/<system>/predictions.json
Catches duplicate ids and malformed rows before you score.
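To illustrate the kind of problem this step catches, here is a rough stand-alone check for duplicate ids and malformed rows. It is a sketch, not the logic of validate_submission.py; it assumes a row must have exactly the keys of one of the two documented shapes, and the function name is hypothetical.

```python
import json
from collections import Counter

ANSWER_KEYS = {"id", "answer", "confidence"}
ABSTAIN_KEYS = {"id", "abstain"}

def check_rows(rows):
    """Return a list of human-readable problems; empty means none found."""
    if not isinstance(rows, list):
        return ["top level must be a JSON array"]
    problems = []
    seen = Counter()
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append(f"row {index}: not an object")
            continue
        keys = set(row)
        is_answer = keys == ANSWER_KEYS
        is_abstain = keys == ABSTAIN_KEYS and row["abstain"] is True
        if not (is_answer or is_abstain):
            problems.append(f"row {index}: unexpected shape {sorted(keys)}")
        if "id" in row:
            # Serialize the id so any JSON value can be counted.
            seen[json.dumps(row["id"])] += 1
    problems.extend(f"duplicate id: {key}" for key, n in seen.items() if n > 1)
    return problems

# Hypothetical rows: the second repeats the id of the first.
print(check_rows([
    {"id": "example-1", "abstain": True},
    {"id": "example-1", "answer": "placeholder", "confidence": 0.5},
]))
```

Run the repository's own validator before scoring regardless; it is the check that matches what CI enforces.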
Score locally
python -m glassbench.score --predictions submissions/<system>/predictions.json
Prints all six components, the Glass score, and diagnostics. Two runs on the same predictions produce byte-identical output.
Open a PR
submissions/<system>/{predictions.json, system.md}
CI runs the scorer; a maintainer merges when green. The folder name (short, lowercase) becomes the leaderboard row.
Cite
@misc{glassbench2025,
title = {GlassBench: Does Your Memory System Know When It Didn't Know?},
author = {build-with-bala},
year = {2025},
note = {Derived from LongMemEval (Wu et al., ICLR 2025)},
url = {https://github.com/build-with-bala/glassbench}
}
GlassBench · MIT License. Derived from LongMemEval (ICLR 2025, MIT).