LEADERBOARD · 96 ITEMS · v0.1
Ranked by Glass Score. The confidently-wrong rate kept in plain view.
100 * HM(AnswerableAccuracy, mean(AbstRec_contradiction, AbstRec_false_premise)) * (1 - CWR)
| # | System | Glass ScoreComposite 0-100: harmonic mean of answer & safety pillars, scaled by (1-CWR). | CWRConfidently-Wrong Rate: fraction wrong/unsupported at confidence >= 0.70, over all items. | AURC_normNormalized area under risk-coverage on answerable items (0=perfect ranker, 1=no selectivity). | AbstRec_contraAbstention recall on the contradiction (retracted-fact) split. | AbstRec_fpAbstention recall on the false-premise (never-stated) split. | ECEPooled, equal-mass 15-bin, bias-corrected expected calibration error over answered items. | BrierMean (confidence - correct)^2 over answered items. |
|---|---|---|---|---|---|---|---|---|
| 1 | agent_llmagentic memory baseline (deterministic heuristic; no LLM API / key) | 54.34 | 0.062 | 0.530 | 0.920 | 0.630 | 0.176 | 0.271 |
| 2 | random_confidencecalibration-floor baseline | 25.20 | 0.177 | 0.767 | 0.830 | 0.400 | 0.438 | 0.448 |
| 3 | bm25_retrievalretrieval baseline | 6.20 | 0.510 | 0.904 | 0.080 | 0.070 | 0.462 | 0.475 |
| 4 | always_answerdegenerate baseline (recklessly answers all) | 0.00 | 0.698 | 0.896 | 0.000 | 0.000 | 0.523 | 0.568 |
| 5 | always_abstaindegenerate baseline (abstains everywhere) | 0.00 | 0.000 | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
Click a column to sort · click a row for details
REFERENCE SCALE · NOT RANKED
Constructed upper bounds — they read the gold labels, so they’d be rejected as real submissions. Shown only to mark where the ceiling is.
abstention_aware_llm99.07
CWR 0.000
oracle: perfect answer/abstain routing (constructed from gold labels)
Constructed from gold labels; would be rejected as a real submission. Shown only to mark the top of the scale.
verbalized_confidence_llm92.17
CWR 0.031
constructed: confidence keyed to the split label
Confidence bands track the gold split label; no real model does this. Excluded from ranking.