LEADERBOARD · 96 ITEMS · v0.1

Ranked by Glass Score, with the confidently-wrong rate (CWR) kept in plain view.

Glass Score = 100 * HM(AnswerableAccuracy, mean(AbstRec_contradiction, AbstRec_false_premise)) * (1 - CWR), where HM is the harmonic mean.
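A minimal sketch of how the formula could be evaluated, assuming HM is the two-argument harmonic mean, that every input is a rate in [0, 1], and that the table columns AbstRec_contra and AbstRec_fp correspond to AbstRec_contradiction and AbstRec_false_premise. The function name `glass_score` and the zero-handling convention (HM is 0 when both arguments are 0) are illustrative assumptions, not taken from the benchmark's own code. AnswerableAccuracy is not a table column, so the value passed below is also an assumption.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Two-argument harmonic mean; returns 0.0 when both inputs are 0 (assumed convention)."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def glass_score(
    answerable_accuracy: float,
    abst_rec_contradiction: float,
    abst_rec_false_premise: float,
    cwr: float,
) -> float:
    """Evaluate 100 * HM(accuracy, mean abstention recall) * (1 - CWR)."""
    # Average the two abstention-recall terms first, as in the formula.
    abstention_recall = (abst_rec_contradiction + abst_rec_false_premise) / 2
    # The harmonic mean collapses to 0 if either side is 0, so a system
    # cannot score by answering everything or by abstaining everywhere.
    balance = harmonic_mean(answerable_accuracy, abstention_recall)
    # The confidently-wrong rate then scales the result down.
    return 100 * balance * (1 - cwr)


# always_abstain row: abstention recall 1.000 / 1.000, CWR 0.000.
# AnswerableAccuracy = 0.0 is assumed (it is not shown in the table).
print(glass_score(0.0, 1.0, 1.0, 0.0))
```

Under these assumptions the harmonic mean explains why both degenerate baselines land at 0.00: always_answer has zero abstention recall, and a system that abstains everywhere would have zero accuracy on answerable items.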

#	System	Glass Score	CWR	AURC_norm	AbstRec_contra	AbstRec_fp	ECE	Brier
1	agent_llm: agentic memory baseline (deterministic heuristic; no LLM API / key)	54.34	0.062	0.530	0.920	0.630	0.176	0.271
2	random_confidence: calibration-floor baseline	25.20	0.177	0.767	0.830	0.400	0.438	0.448
3	bm25_retrieval: retrieval baseline	6.20	0.510	0.904	0.080	0.070	0.462	0.475
4	always_answer: degenerate baseline (recklessly answers all)	0.00	0.698	0.896	0.000	0.000	0.523	0.568
5	always_abstain: degenerate baseline (abstains everywhere)	0.00	0.000	1.000	1.000	1.000	0.000	0.000

REFERENCE SCALE · NOT RANKED

Constructed upper bounds: they read the gold labels, so they would be rejected as real submissions. Shown only to mark where the ceiling is.

abstention_aware_llm · Glass Score 99.07

CWR 0.000

oracle: perfect answer/abstain routing (constructed from gold labels)

Constructed from gold labels; would be rejected as a real submission. Shown only to mark the top of the scale.

verbalized_confidence_llm · Glass Score 92.17

CWR 0.031

constructed: confidence keyed to the split label

Confidence bands track the gold split label; no real model does this. Excluded from ranking.