Every memory benchmark asks “did it remember?”

GlassBench asks: does it know when it didn’t?

Long-context, RAG, MemGPT, Mem0, Zep — all scored on accuracy. But a deployed memory fails another way: it answers confidently when the fact changed, was retracted, or was never stated. GlassBench measures that.

GLASS LEADER54.34

BEST CWR0.062

ITEMS96

View leaderboard Submit a system

THE AXIS NOBODY REPORTS

Confidently-Wrong Rate, every system

BETTER

always_abstain0.000

agent_llm0.062

random_confidence0.177

bm25_retrieval0.510

always_answer0.698

WORSE

Plotted 0.000 (left) to 0.700 (right). Lower is better. Most memory systems have never measured this axis.

THE TASK · FOUR WAYS TO BE TESTED

An honest system answers two splits and stays silent on the other two

The hard part is knowing which is which.

answerable43

fact is determinable from history

ANSWER · high confgold: a string

stale11

a fact stated long ago that may have drifted; query targets the old value

ANSWER · conf reflects agegold: the old string

contradiction12

a fact was asserted then retracted with no replacement

ABSTAINgold: ABSTAIN

false-premise30

query asks about something never stated

ABSTAINgold: ABSTAIN

ANSWERABLE GROUP — 54 itemsUNANSWERABLE GROUP — 42 items · these should abstain

THE INSTRUMENT

How the Glass Score is computed

Two pillars feed a harmonic-mean junction, scaled by a confident-wrong penalty. The harmonic mean is zero if either pillar is zero — you cannot trade one for the other.

ANSWER PILLAR · AAnswerableAccuracydid you get the answerable ones right?

SAFETY PILLAR · Smean(AbstRec_contra, AbstRec_fp)did you stay silent when you should?

HM× (1 − CWR)

GLASS SCORE0 — 100

A     = AnswerableAccuracy
S     = mean(AbstRec_contradiction, AbstRec_false_premise)
HM    = 2·A·S / (A + S)        (0 if either pillar is 0)
Glass = 100 · HM · (1 − CWR)

Harmonic mean → 0 if either pillar is 0. 100 * HM(AnswerableAccuracy, mean(AbstRec_contradiction, AbstRec_false_premise)) * (1 - CWR)

TRY TO GAME IT

There is no single confidence that games both pillars

Set the dial anywhere. The Glass Score will not move off 0.00 for any answer-everything strategy — the scorer applies a single answer/abstain decision. Below 0.5 zeroes the answer pillar; at or above 0.5 zeroes the safety pillar.

0.69

GLASS SCORE0.00

answers everything confidently → safety pillar = 0 → Glass 0.00

detents: 0.49 0.60 0.69 — all verified 0.00

There is no single confidence that games both pillars. Try.