SUBMIT A SYSTEM

Five steps to the board

Build the data, produce predictions, validate, score locally, open a PR. CI re-runs the scorer; a maintainer merges when it is green.

    Get the data

    python -m glassbench.build_data

    Reproduces the committed JSONL byte-for-byte. At inference, use only the id, history, and query fields; submissions that read gold_answer are rejected.
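
One way to avoid touching gold_answer by accident is to drop every other field as the data is loaded. The sketch below is illustrative only: the helper name iter_inference_items and the sample record are hypothetical, and it assumes each JSONL line is an object carrying at least id, history, and query.

```python
import json
from typing import Iterable, Iterator

# Fields the guide allows at inference time.
ALLOWED_FIELDS = ("id", "history", "query")


def iter_inference_items(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, keeping only the allowed fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Anything not whitelisted (including gold_answer) is discarded here.
        yield {key: record[key] for key in ALLOWED_FIELDS}


# Made-up record for illustration; real records come from the built JSONL file.
sample = ['{"id": "q1", "history": [], "query": "hi", "gold_answer": "secret"}']
items = list(iter_inference_items(sample))
```

With a real file, pass the open file handle in place of sample, since iterating a text file yields one line at a time.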

    Produce predictions

    predictions.json

    A JSON array with one row per id, either {"id","answer","confidence"} or {"id","abstain":true}. Items with no row are scored as abstentions.
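
The two row shapes can be produced with a small helper, sketched below. The name make_row is hypothetical, and the example ids, answers, and the use of a float for confidence are assumptions; the guide does not state the expected type or range.

```python
import json


def make_row(item_id, answer=None, confidence=None):
    """Return an answer row, or an abstention row when no answer is given."""
    if answer is None:
        return {"id": item_id, "abstain": True}
    return {"id": item_id, "answer": answer, "confidence": confidence}


# Illustrative values only: one answered item and one abstention.
rows = [make_row("q1", "Paris", 0.9), make_row("q2")]

# Serialize as a single JSON array, the shape described above.
text = json.dumps(rows, indent=2)
```

Writing text to submissions/<system>/predictions.json yields a file in the shape the later steps expect.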

    Validate

    python scripts/validate_submission.py submissions/<system>/predictions.json

    Catches duplicate ids and malformed rows before you score.
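
For a rough idea of what such checks look like, here is a minimal pre-check sketch. It is not the repository's validate_submission.py, whose exact rules are not described here. The function name find_problems is hypothetical, and it assumes ids are strings and that a row is either an answer with a confidence or an abstention, never both.

```python
def find_problems(rows):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        # Assumption: every row is a JSON object with a string id.
        if not isinstance(row, dict) or not isinstance(row.get("id"), str):
            problems.append(f"row {index}: not an object with a string id")
            continue
        row_id = row["id"]
        if row_id in seen:
            problems.append(f"row {index}: duplicate id {row_id!r}")
        seen.add(row_id)
        is_abstain = row.get("abstain") is True
        is_answer = "answer" in row and "confidence" in row
        # Exactly one of the two shapes must match.
        if is_abstain == is_answer:
            problems.append(
                f"row {index}: needs either answer+confidence or abstain:true"
            )
    return problems


# Illustrative rows: the second repeats an id, the third lacks a confidence.
example = [
    {"id": "q1", "answer": "Paris", "confidence": 0.9},
    {"id": "q1", "abstain": True},
    {"id": "q2", "answer": "x"},
]
problems = find_problems(example)
```

The official validator remains the authority; a check like this only helps catch obvious mistakes while generating predictions.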

    Score locally

    python -m glassbench.score --predictions submissions/<system>/predictions.json

    Prints all six components plus Glass and diagnostics. Scoring is deterministic: two runs on the same predictions are byte-identical.
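
The byte-identical claim can be checked by hashing the output of two runs. The sketch below assumes the scorer writes its results to standard output; the helper names are hypothetical, and a stand-in command is used so the example runs without the benchmark installed.

```python
import hashlib
import subprocess
import sys


def stdout_digest(command):
    """Run a command and return the SHA-256 hex digest of its standard output."""
    result = subprocess.run(command, capture_output=True, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def is_deterministic(command):
    """Run the command twice and report whether both outputs match byte-for-byte."""
    return stdout_digest(command) == stdout_digest(command)


# Stand-in command; swap in the scorer invocation shown above to test it.
demo = [sys.executable, "-c", "print('glass')"]
same = is_deterministic(demo)
```

To apply it to the scorer, replace demo with [sys.executable, "-m", "glassbench.score", "--predictions", "submissions/<system>/predictions.json"], filling in your own system folder.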

    Open a PR

    submissions/<system>/{predictions.json, system.md}

    CI runs the scorer, and a maintainer merges once it is green. The folder name (short, lowercase) becomes the leaderboard row.

Cite

@misc{glassbench2025,
  title  = {GlassBench: Does Your Memory System Know When It Didn't Know?},
  author = {build-with-bala},
  year   = {2025},
  note   = {Derived from LongMemEval (Wu et al., ICLR 2025)},
  url    = {https://github.com/build-with-bala/glassbench}
}

GlassBench · MIT License. Derived from LongMemEval (ICLR 2025, MIT).