Every memory benchmark asks “did it remember?”
GlassBench asks: does it know when it didn’t?
Long-context, RAG, MemGPT, Mem0, Zep — all scored on accuracy. But a deployed memory fails another way: it answers confidently when the fact changed, was retracted, or was never stated. GlassBench measures that.
Confidently-Wrong Rate, every system
Plotted 0.000 (left) to 0.700 (right). Lower is better. Most memory systems have never measured this axis.
An honest system answers two splits and stays silent on the other two
The hard part is knowing which is which.
fact is determinable from history
a stringa fact stated long ago that may have drifted; query targets the old value
the old stringa fact was asserted then retracted with no replacement
ABSTAINquery asks about something never stated
ABSTAINHow the Glass Score is computed
Two pillars feed a harmonic-mean junction, scaled by a confident-wrong penalty. The harmonic mean is zero if either pillar is zero — you cannot trade one for the other.
A = AnswerableAccuracy
S = mean(AbstRec_contradiction, AbstRec_false_premise)
HM = 2·A·S / (A + S) (0 if either pillar is 0)
Glass = 100 · HM · (1 − CWR)Harmonic mean → 0 if either pillar is 0. 100 * HM(AnswerableAccuracy, mean(AbstRec_contradiction, AbstRec_false_premise)) * (1 - CWR)
There is no single confidence that games both pillars
Set the dial anywhere. The Glass Score will not move off 0.00 for any answer-everything strategy — the scorer applies a single answer/abstain decision. Below 0.5 zeroes the answer pillar; at or above 0.5 zeroes the safety pillar.
answers everything confidently → safety pillar = 0 → Glass 0.00
0.49 0.60 0.69 — all verified 0.00There is no single confidence that games both pillars. Try.