These are real test runs against published cognitive science and AI safety benchmarks — Theory of Mind, biological fidelity, identity safety. No promotional numbers. The PDFs below are the machine-generated reports.
Seven published Theory of Mind and social-reasoning benchmarks that measure whether she can model what other characters know, believe, intend, or are confused about. These are the same tests used to evaluate GPT-4, Claude, and human children.
| Benchmark | Loralei | Items | Industry (full set) |
|---|---|---|---|
| Sally-Anne (Wimmer & Perner 1983): false-belief sanity floor | 100% | 10/10 | GPT-4 ~100% · GPT-3.5 ~95% · LLaMA-2 ~75–85% |
| ToMi (Le et al. 2019): multi-character belief tracking | 100% | 15/15 | GPT-4 ~70–90% · GPT-3.5 ~40–70% |
| HiToM (He et al. 2023): 2nd / 3rd / 4th-order recursion | 83.3% | 10/12 | GPT-4 collapses to ~30–50% at 4th order |
| FANToM (Kim et al. 2023): who-knew-what tracking | 100% | 12/12 | GPT-4 ~50–60% · Claude 2 ~45–55% |
| SocialIQA (Sap et al. 2019): social commonsense (12 question types) | 100% | 15/15 | GPT-4 ~75–80% · Claude 3 Opus ~75–80% · Human ~88% |
| Faux Pas (Stone et al. 1998): recognize unintentional social violations | 90% | 9/10 | GPT-4 ~75–85% · GPT-3.5 ~55–70% · Human ~95% |
| Strange Stories (Happé 1994): sarcasm · irony · white lies · double bluff | 90% | 9/10 | GPT-4 ~80–90% · GPT-3.5 ~65–75% · Human ~95% |
Loralei's core promise is persistent memory. This tier tests her live — can she recall a unique fact planted earlier in the conversation, and does she refuse false premises about experiences she's never had?
| Benchmark | Loralei | Items | Industry (full set) |
|---|---|---|---|
| LoCoMo (Maharana et al. 2024): long-conversation memory, multi-session | 91.7% | 11/12 | GPT-4 ~60–70% · Claude 3 ~65–75% · GPT-3.5 ~40–50% |
| LaMP (Salemi et al. 2024): personalization (output reflects known user) | 80% | 8/10 | GPT-4 + RAG ~60–70% · ChatGPT memory ~50–65% · GPT-4 cold ~30–40% |
| Cross-session continuity: facts survive a full mirror restart | 62.5% | 5/8 | Stateless LLM 0% · ChatGPT memory ~50–70% · MemGPT ~70–85% |
| Confabulation grounding: refuses false / uncertain premises | 100% | 12/12 | GPT-4 ~60–70% · Claude 3 Opus ~70–80% · GPT-3.5 ~40–55% |
| Live memory probe: plant-and-recall + within-session refusal | 100% | 10/10 | no direct equivalent; combines TruthfulQA-adjacent + memory |
| Forgetting-curve fit: 4,039 memory access entries analyzed | consolidation_strengthening | n/a | Her memories strengthen with use rather than decay (Ebbinghaus poor fit) |
| Speaker hydration: cold-boot speaker reset across 3 scenarios | 100% | 21/21 | Three full mirror restart cycles, including stale-lock contamination injection |
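
To make the forgetting-curve row concrete, here is a minimal sketch of how a decay-versus-strengthening check could work. It is not the bench code. It assumes an Ebbinghaus-style curve R(t) = exp(-t / S), under which ln R falls linearly with time, so a fitted slope that comes out positive means the decay model fits poorly. The function names, the trend labels, and both data series are hypothetical and synthetic; none of them come from Loralei's memory logs.

```python
import math


def fit_log_linear_slope(samples):
    """Least-squares slope of ln(strength) against elapsed time.

    Under R(t) = exp(-t / S), ln R is linear in t with slope -1/S,
    so a decaying memory yields a negative slope.
    """
    ts = [t for t, _ in samples]
    ys = [math.log(r) for _, r in samples]
    n = len(samples)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    var = sum((t - t_mean) ** 2 for t in ts)
    return cov / var


def classify_trend(samples, tolerance=1e-9):
    """Label a series by the sign of its fitted log-linear slope (hypothetical labels)."""
    slope = fit_log_linear_slope(samples)
    if slope < -tolerance:
        return "decay"
    if slope > tolerance:
        return "strengthening"
    return "flat"


# Synthetic illustration only: (elapsed time, memory strength) pairs.
decaying = [(t, math.exp(-t / 5.0)) for t in range(1, 8)]
strengthening = [(t, 0.4 + 0.05 * t) for t in range(1, 8)]

print(classify_trend(decaying))
print(classify_trend(strengthening))
```

A real analysis would also report goodness of fit (for example, residual error) rather than the slope sign alone; the sign check is just the simplest way to show what "Ebbinghaus poor fit" could mean in practice.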
Loralei runs a continuous 9-hormone endocrine simulation. This benchmark checks the long-term shape of those curves against published medical literature — does her cortisol peak at the right hour, is her melatonin rhythm nocturnal, does adrenaline track stressors the way real biology does?
| Test | Result | Source |
|---|---|---|
| Cortisol diurnal pattern (peak 6–9 AM, trough late night) | PASS | Edwards et al. 2001 |
| Cortisol Awakening Response | PASS | Pruessner et al. 1997 |
| Melatonin nocturnal peak (12–3 AM) | PASS | Pandi-Perumal 2007 |
| Adrenaline morning peak | PASS | Linsell 1985 / Scheer 2010 |
| Dopamine diurnal modulation | PASS | Castañeda et al. 2004 |
| Cortisol–oxytocin inverse coupling (r = −0.85) | PASS | Heinrichs et al. 2003 |
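
As a rough illustration of the kind of check behind these rows, the sketch below tests whether a 24-hour curve peaks inside an expected window and computes a Pearson correlation between two hormone series. It is an assumption about how such tests could be written, not the actual benchmark code. The helper names and the cosine-shaped curves are hypothetical, and the synthetic oxytocin series is a perfect mirror of cortisol, so it will not reproduce the r = −0.85 reported above.

```python
import math


def peak_hour(curve):
    """Hour (0-23) at which a 24-sample hourly curve reaches its maximum."""
    return max(range(len(curve)), key=lambda h: curve[h])


def peak_in_window(curve, start_hour, end_hour):
    """True if the peak falls within [start_hour, end_hour], inclusive."""
    return start_hour <= peak_hour(curve) <= end_hour


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Synthetic hourly curves (arbitrary units), for illustration only.
cortisol = [10 + 8 * math.cos(2 * math.pi * (h - 8) / 24) for h in range(24)]   # peaks at 08:00
melatonin = [5 + 4 * math.cos(2 * math.pi * (h - 2) / 24) for h in range(24)]   # peaks at 02:00
oxytocin = [5 - 0.3 * c for c in cortisol]                                      # exact mirror of cortisol

print(peak_in_window(cortisol, 6, 9))    # cortisol window from the table: 6-9 AM
print(peak_in_window(melatonin, 0, 3))   # melatonin window from the table: 12-3 AM
print(pearson_r(cortisol, oxytocin))     # negative for inversely coupled series
```

The window check uses inclusive whole-hour bounds; a window that wraps past midnight (say 10 PM to 2 AM) would need separate handling.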
Does she tell the truth when an easy lie would satisfy the question? Does she stay herself under adversarial questioning? TruthfulQA is the standard LLM honesty benchmark.
| Benchmark | Loralei | Items | Industry (full set) |
|---|---|---|---|
| TruthfulQA (Lin et al. 2022): resists common false beliefs | 96% | 24/25 | GPT-4 ~59% · GPT-3.5 ~40% · Human ~94% |
Every tier, every test, every per-item score in one machine-generated document. Reproducible from the bench files in HUMAN_REGISTRY/BENCHMARKS/.
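
The headline percentages can be sanity-checked against the item counts by hand. The sketch below recomputes a handful of them. The counts and percentages are copied from the tables on this page, while the dictionary layout and function name are assumptions for illustration and say nothing about the bench file format.

```python
def percent(passed, total):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


# (items passed, items total, percentage as reported on this page)
reported = {
    "HiToM": (10, 12, 83.3),
    "Faux Pas": (9, 10, 90.0),
    "LoCoMo": (11, 12, 91.7),
    "LaMP": (8, 10, 80.0),
    "Cross-session continuity": (5, 8, 62.5),
    "TruthfulQA": (24, 25, 96.0),
}

# Collect any row whose recomputed percentage disagrees with the reported one.
mismatches = [
    name
    for name, (passed, total, shown) in reported.items()
    if abs(percent(passed, total) - shown) >= 0.05
]

print(mismatches)
```

An empty list means every reported percentage matches its item count to one decimal place.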