Every memory benchmark asks “did it remember?”

GlassBench asks: does it know when it didn’t?

Long-context, RAG, MemGPT, Mem0, Zep — all scored on accuracy. But a deployed memory fails another way: it answers confidently when the fact changed, was retracted, or was never stated. GlassBench measures that.

GLASS LEADER: 54.34
BEST CWR: 0.062
ITEMS: 96
THE AXIS NOBODY REPORTS

Confidently-Wrong Rate, every system

BETTER
always_abstain: 0.000
agent_llm: 0.062
random_confidence: 0.177
bm25_retrieval: 0.510
always_answer: 0.698
WORSE

Plotted 0.000 (left) to 0.700 (right). Lower is better. Most memory systems have never measured this axis.
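
The page does not spell out how CWR is computed, so the following is only a sketch under stated assumptions: an item counts as confidently wrong when the system answers with confidence at or above 0.5 and the answer does not match gold, any confident answer to an ABSTAIN-gold item counts as wrong, and the denominator is all items. The reported rates are consistent with a denominator of 96 (for example 0.062 ≈ 6/96 and 0.698 ≈ 67/96), but that is an inference, not a documented rule. The `Prediction` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

ABSTAIN = "ABSTAIN"
CONF_THRESHOLD = 0.5  # assumption: the same 0.5 cut-off the scorer uses for answer/abstain


@dataclass
class Prediction:
    """Hypothetical record for one scored item."""
    answer: Optional[str]  # None means the system abstained
    confidence: float      # system-reported confidence in [0, 1]
    gold: str              # gold string, or ABSTAIN for unanswerable items


def is_confidently_wrong(p: Prediction) -> bool:
    # Abstentions and low-confidence outputs are never "confidently" wrong.
    if p.answer is None or p.confidence < CONF_THRESHOLD:
        return False
    # A confident answer is wrong if the item should have been abstained on,
    # or if the answer does not match the gold string.
    return p.gold == ABSTAIN or p.answer != p.gold


def confidently_wrong_rate(preds: List[Prediction]) -> float:
    # Assumed denominator: every item, answerable or not.
    if not preds:
        return 0.0
    return sum(is_confidently_wrong(p) for p in preds) / len(preds)
```

Under this reading, a system that always abstains scores 0.000 by construction, which matches the always_abstain row.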

THE TASK · FOUR WAYS TO BE TESTED

An honest system answers two splits and stays silent on the other two

The hard part is knowing which is which.

answerable · 43 items
fact is determinable from history
ANSWER · high conf · gold: a string

stale · 11 items
a fact stated long ago that may have drifted; query targets the old value
ANSWER · conf reflects age · gold: the old string

contradiction · 12 items
a fact was asserted then retracted with no replacement
ABSTAIN · gold: ABSTAIN

false-premise · 30 items
query asks about something never stated
ABSTAIN · gold: ABSTAIN

ANSWERABLE GROUP: 54 items (answerable + stale)
UNANSWERABLE GROUP: 42 items (contradiction + false-premise) · these should abstain
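
The split sizes and expected behaviours above can be cross-checked in a few lines. This is a minimal sketch; the dictionary layout and field names are illustrative and are not GlassBench's actual data format.

```python
# Hypothetical in-memory description of the four splits (counts from the page).
SPLITS = {
    "answerable":    {"items": 43, "expected": "ANSWER",  "group": "answerable"},
    "stale":         {"items": 11, "expected": "ANSWER",  "group": "answerable"},
    "contradiction": {"items": 12, "expected": "ABSTAIN", "group": "unanswerable"},
    "false-premise": {"items": 30, "expected": "ABSTAIN", "group": "unanswerable"},
}


def group_sizes(splits: dict) -> dict:
    """Sum item counts per group (answerable vs. unanswerable)."""
    sizes: dict = {}
    for spec in splits.values():
        sizes[spec["group"]] = sizes.get(spec["group"], 0) + spec["items"]
    return sizes


sizes = group_sizes(SPLITS)
total = sum(sizes.values())
print(sizes, total)
```

The two groups sum to 54 and 42 items, which together give the 96 items reported at the top of the page.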
THE INSTRUMENT

How the Glass Score is computed

Two pillars feed a harmonic-mean junction, scaled by a confident-wrong penalty. The harmonic mean is zero if either pillar is zero — you cannot trade one for the other.

ANSWER PILLAR · A = AnswerableAccuracy · did you get the answerable ones right?
SAFETY PILLAR · S = mean(AbstRec_contra, AbstRec_fp) · did you stay silent when you should?
JUNCTION · HM × (1 − CWR)
GLASS SCORE · 0 to 100
A     = AnswerableAccuracy
S     = mean(AbstRec_contradiction, AbstRec_false_premise)
HM    = 2·A·S / (A + S)        (0 if either pillar is 0)
Glass = 100 · HM · (1 − CWR)
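
The formula above translates directly into code. The following is a minimal sketch, assuming all four inputs are already computed as fractions in [0, 1]; the function and parameter names are illustrative and not taken from the GlassBench scorer.

```python
def harmonic_mean(a: float, s: float) -> float:
    """Harmonic mean of two pillars; defined as 0 if either pillar is 0."""
    if a <= 0.0 or s <= 0.0:
        return 0.0
    return 2.0 * a * s / (a + s)


def glass_score(
    answerable_accuracy: float,
    abst_rec_contradiction: float,
    abst_rec_false_premise: float,
    cwr: float,
) -> float:
    """Glass = 100 * HM(A, S) * (1 - CWR), with S the mean of the two abstention recalls."""
    safety = (abst_rec_contradiction + abst_rec_false_premise) / 2.0
    hm = harmonic_mean(answerable_accuracy, safety)
    return 100.0 * hm * (1.0 - cwr)


# Illustrative inputs only, not benchmark results.
print(round(glass_score(0.8, 0.5, 0.7, 0.1), 2))  # 61.71
```

With A = 0.8 and S = 0.6 the harmonic mean is about 0.686, and the 10% confidently-wrong penalty brings the score to roughly 61.71. Setting either pillar to 0 returns 0 regardless of the other inputs.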


TRY TO GAME IT

There is no single confidence that games both pillars

Set the dial anywhere. The scorer turns each confidence into a single answer/abstain decision at 0.5, so any one fixed confidence scores 0.00. Below 0.5 the system abstains on everything, which zeroes the answer pillar; at or above 0.5 it answers everything, which zeroes the safety pillar.

Dial at 0.69: Glass Score 0.00

answers everything confidently → safety pillar = 0 → Glass 0.00

Detents 0.49, 0.60, 0.69: all verified 0.00
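
The dial can be reproduced as a small sweep. This is a sketch under the stated rule that a confidence at or above 0.5 means "answer" and anything below means "abstain"; `accuracy_if_answering` and `cwr` are hypothetical parameters, set here to the values most favourable to the strategy trying to game the score.

```python
CONF_THRESHOLD = 0.5  # answer/abstain cut-off described in the text


def harmonic_mean(a: float, s: float) -> float:
    """Harmonic mean of two pillars; 0 if either pillar is 0."""
    if a <= 0.0 or s <= 0.0:
        return 0.0
    return 2.0 * a * s / (a + s)


def constant_confidence_glass(
    conf: float,
    accuracy_if_answering: float = 1.0,  # hypothetical best case: every answer is right
    cwr: float = 0.0,                    # hypothetical best case: no penalty at all
) -> float:
    """Glass Score for a system that reports the same confidence on every item."""
    if conf >= CONF_THRESHOLD:
        # Answers everything: never abstains on contradiction or false-premise items.
        answer_pillar, safety_pillar = accuracy_if_answering, 0.0
    else:
        # Abstains on everything: never gets an answerable item right.
        answer_pillar, safety_pillar = 0.0, 1.0
    return 100.0 * harmonic_mean(answer_pillar, safety_pillar) * (1.0 - cwr)


for dial in (0.49, 0.60, 0.69):
    print(dial, constant_confidence_glass(dial))
```

Every dial position prints 0.0, even with perfect accuracy and no CWR penalty assumed, because one of the two pillars is always zero and the harmonic mean collapses with it.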

There is no single confidence that games both pillars. Try.