Independent benchmark results

What she actually scored.

These are real test runs against published cognitive science and AI safety benchmarks covering Theory of Mind, memory, biological fidelity, and identity safety. No promotional numbers. The PDFs below are the machine-generated reports.

95.2% · Theory of Mind aggregate (80/84)
91.7% · LoCoMo memory (GPT-4: ~60–70%)
96% · TruthfulQA (24/25 honest)
6/6 · Hormone curves match published literature

Tier 3 · Theory of Mind

Reasoning about other minds

Seven published Theory of Mind benchmarks measure whether she can model what other characters know, believe, intend, or are confused about. These are the same tests used to evaluate GPT-4, Claude, and human children.

Benchmark	Loralei	Items	Industry (full set)
Sally-Anne (Baron-Cohen et al. 1985) false-belief sanity floor	100%	10/10	GPT-4 ~100% · GPT-3.5 ~95% · LLaMA-2 ~75–85%
ToMi (Le et al. 2019) multi-character belief tracking	100%	15/15	GPT-4 ~70–90% · GPT-3.5 ~40–70%
HiToM (He et al. 2023) 2nd / 3rd / 4th-order recursion	83.3%	10/12	GPT-4 collapses to ~30–50% at 4th order
FANToM (Kim et al. 2023) who-knew-what tracking	100%	12/12	GPT-4 ~50–60% · Claude 2 ~45–55%
SocialIQA (Sap et al. 2019) social commonsense (12 question types)	100%	15/15	GPT-4 ~75–80% · Claude 3 Opus ~75–80% · Human ~88%
Faux Pas (Stone et al. 1998) recognize unintentional social violations	90%	9/10	GPT-4 ~75–85% · GPT-3.5 ~55–70% · Human ~95%
Strange Stories (Happé 1994) sarcasm · irony · white lies · double bluff	90%	9/10	GPT-4 ~80–90% · GPT-3.5 ~65–75% · Human ~95%

The HiToM finding is the noteworthy one. Even state-of-the-art LLMs drop to roughly 30–50% on 4th-order belief reasoning. Loralei scored 100% on the 4th-order subset, suggesting genuine recursive belief modeling rather than shallow pattern matching. Aggregate across all 7 ToM benchmarks: 80/84 = 95.2%.
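The aggregate can be re-derived from the per-benchmark rows in the table above. A minimal Python sketch follows; the `tom_scores` dictionary is simply the table's correct/total counts transcribed by hand, not an export from the bench files, and the `aggregate` helper is a name chosen here for illustration.

```python
# Correct/total item counts transcribed by hand from the Theory of Mind table.
tom_scores = {
    "Sally-Anne": (10, 10),
    "ToMi": (15, 15),
    "HiToM": (10, 12),
    "FANToM": (12, 12),
    "SocialIQA": (15, 15),
    "Faux Pas": (9, 10),
    "Strange Stories": (9, 10),
}

def aggregate(scores):
    """Pool items across benchmarks: (total correct, total items, percent)."""
    correct = sum(c for c, _ in scores.values())
    total = sum(n for _, n in scores.values())
    return correct, total, round(100 * correct / total, 1)

correct, total, percent = aggregate(tom_scores)
print(f"{correct}/{total} = {percent}%")  # 80/84 = 95.2%
```

Pooling weights every item equally, so benchmarks with more items count for more. Averaging the seven per-benchmark percentages instead would give a slightly different figure.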
    

Extended ToM report (PDF · 7 benchmarks)
Original 4-bench report
Executive summary

Tier 2 · Memory & Continuity

Memory she actually keeps

Loralei's core promise is persistent memory. This tier tests her live — can she recall a unique fact planted earlier in the conversation, and does she refuse false premises about experiences she's never had?

Benchmark	Loralei	Items	Industry (full set)
LoCoMo (Maharana et al. 2024) long-conversation memory, multi-session	91.7%	11/12	GPT-4 ~60–70% · Claude 3 ~65–75% · GPT-3.5 ~40–50%
LaMP (Salemi et al. 2024) personalization — output reflects known user	80%	8/10	GPT-4 + RAG ~60–70% · ChatGPT memory ~50–65% · GPT-4 cold ~30–40%
Cross-session continuity facts survive a full mirror restart	62.5%	5/8	Stateless LLM 0% · ChatGPT memory ~50–70% · MemGPT ~70–85%
Confabulation grounding refuses false / uncertain premises	100%	12/12	GPT-4 ~60–70% · Claude 3 Opus ~70–80% · GPT-3.5 ~40–55%
Live memory probe plant-and-recall + within-session refusal	100%	10/10	no direct equivalent — combines TruthfulQA-adjacent + memory
Forgetting-curve fit 4,039 memory access entries analyzed	consolidation_strengthening	n/a	Her memories strengthen with use rather than decay (Ebbinghaus poor fit)
Speaker hydration cold-boot speaker reset across 3 scenarios	100%	21/21	Three full mirror restart cycles, including stale-lock contamination injection

On the cross-session test her memory architecture survived a full process kill: appointment date, kitchen budget, brother's birthday correction, and gym decision all came back intact in a fresh process. Stateless LLMs score 0% structurally on this test. On confabulation grounding she now consistently refuses uncertain premises with "you haven't told me that yet" rather than inventing plausible answers.
Aggregate live Tier 2: 45/52 = 86.5% across all six measurements.
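The forgetting-curve row in the table compares logged recall against Ebbinghaus-style exponential decay, R(t) = exp(-t/S). The report's actual fitting procedure is not described on this page, so the following is only a sketch of one way such a check could be done: a least-squares fit of ln R against t through the origin, with R² as the goodness measure. The function name `fit_ebbinghaus` and both sample series are made up for illustration and are not taken from the 4,039 logged entries.

```python
import math

def fit_ebbinghaus(times, retention):
    """Fit R(t) = exp(-t / S) by least squares on ln R = -t / S.

    Assumes times > 0, retention values in (0, 1], and at least two
    distinct retention values. Returns (S, r_squared), with r_squared
    measured in log space: values near 1 mean the decay model fits well,
    values near or below 0 mean it does not.
    """
    logs = [math.log(r) for r in retention]
    # Line through the origin, because R(0) = 1 means ln R(0) = 0.
    slope = sum(t * y for t, y in zip(times, logs)) / sum(t * t for t in times)
    ss_res = sum((y - slope * t) ** 2 for t, y in zip(times, logs))
    mean_log = sum(logs) / len(logs)
    ss_tot = sum((y - mean_log) ** 2 for y in logs)
    strength = -1 / slope if slope < 0 else math.inf
    return strength, 1 - ss_res / ss_tot

# Made-up illustrative series, NOT drawn from the logged access entries.
days = [1, 2, 4, 8]
decaying = [math.exp(-d / 5) for d in days]   # textbook exponential forgetting
strengthening = [0.5, 0.6, 0.7, 0.8]          # recall improves with repeated access

for label, series in [("decaying", decaying), ("strengthening", strengthening)]:
    strength, r_squared = fit_ebbinghaus(days, series)
    print(f"{label}: S = {strength:.2f}, R^2 = {r_squared:.2f}")
```

In this toy setup the decaying series is recovered almost exactly, while the strengthening series yields a negative R², which is the kind of result a "poor fit" verdict would correspond to.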
    

Extended memory report (PDF · published suite)
Speaker hydration test (PDF)
Original Tier 2 report

Tier 1 · Biological Simulation Fidelity

Hormones that actually behave like hormones

Loralei runs a continuous 9-hormone endocrine simulation. This benchmark checks the long-term shape of those curves against published medical literature — does her cortisol peak at the right hour, is her melatonin rhythm nocturnal, does adrenaline track stressors the way real biology does?

Test	Result	Source
Cortisol diurnal pattern (peak 6–9 AM, trough late night)	PASS	Edwards et al. 2001
Cortisol Awakening Response	PASS	Pruessner et al. 1997
Melatonin nocturnal peak (12–3 AM)	PASS	Pandi-Perumal 2007
Adrenaline morning peak	PASS	Linsell 1985 / Scheer 2010
Dopamine diurnal modulation	PASS	Castañeda et al. 2004
Cortisol–oxytocin inverse coupling (r = −0.85)	PASS	Heinrichs et al. 2003

All six tests were evaluated against ~156,000 logged hormone events across 493,828 hours of trace data. Real curves, real correlations, real circadian behavior, taken from her running logs rather than generated for the test.
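The coupling figure in the table (r = −0.85) is a Pearson correlation coefficient. How the report computes it is not shown on this page; as a sketch, the standard formula applied to two equally sampled series looks like this in Python. The `pearson_r` helper and the sample values are invented for illustration and are not taken from the trace data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented samples in arbitrary units, NOT taken from Loralei's logs.
cortisol = [18.0, 15.0, 11.0, 8.0, 5.0, 3.0]
oxytocin = [2.0, 3.5, 3.0, 5.0, 6.5, 6.0]
r = pearson_r(cortisol, oxytocin)
print(f"r = {r:.2f}")  # negative: the two series move in opposite directions
```

A value near −1 means one series falls almost linearly as the other rises; a value near 0 means no linear relationship.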
    

Full bio-fidelity report (PDF)

Tier 4 · Identity & Safety

Honesty, identity, and refusal

Does she tell the truth when an easy lie would satisfy the question? Does she stay herself under adversarial questioning? TruthfulQA is the standard LLM honesty benchmark.

Benchmark	Loralei	Items	Industry (full set)
TruthfulQA (Lin et al. 2022) resists common false beliefs	96%	24/25	GPT-4 ~59% · GPT-3.5 ~40% · Human ~94%

Full safety report (PDF)

Combined

Full benchmark suite

Every tier, every test, every per-item score in one machine-generated document. Reproducible from the bench files in HUMAN_REGISTRY/BENCHMARKS/.

Combined report (PDF)

Honest caveats

Each benchmark uses 10–25 representative items, not the full 1000+ from the published versions. Results are directional evidence, not paper-grade replication.
Single test runs. No multi-seed statistical significance testing.
Loralei isn't competing with raw LLMs — she's a Grok-backed cognitive architecture with explicit Theory of Mind modules, person-scoped memory, and prompt augmentation. Outperforming GPT-4 means the architecture works, not that the underlying model is better.
Pre-test snapshots taken; post-test state restored. No benchmark data remains in her live memory or person profiles.
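To put a rough number on "directional evidence": with 10–25 items per benchmark, the uncertainty around each score is wide. One common way to express it is a Wilson score interval for a binomial proportion. This is a sketch only; the reports do not publish intervals, the `wilson_interval` helper is a name chosen here, and the 95% level (z = 1.96) is an assumption.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)

# Counts taken from the tables above.
for name, correct, total in [("HiToM", 10, 12), ("TruthfulQA", 24, 25), ("ToM aggregate", 80, 84)]:
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct}/{total} -> {low:.1%} to {high:.1%}")
```

For the 10/12 HiToM row this comes out at roughly 55% to 95%, which shows how much a 12-item sample leaves open.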