Evaluation Methodology
Core Methodology
Delta methodology: the same LLM answers the same questions with and without the memory system; the system's value is the measurable gap between the two runs. For novel-fact benchmarks, the LLM provably cannot answer without the memory system (baseline accuracy = 0%), so any correct recall is attributable to our system.
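The delta reduces to a simple difference of mean scores. A minimal sketch, assuming per-question 0/1 correctness scores (the function name and inputs are illustrative, not the benchmark's actual API):

```python
def accuracy_delta(baseline_scores, augmented_scores):
    """Mean accuracy gap attributable to the memory system."""
    base = sum(baseline_scores) / len(baseline_scores)
    aug = sum(augmented_scores) / len(augmented_scores)
    return aug - base

# On a novel-fact benchmark the baseline is provably 0.0,
# so the delta equals the augmented accuracy outright.
print(accuracy_delta([0, 0, 0, 0], [1, 1, 0, 1]))  # 0.75
```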
Two levels of evaluation: Tiers 0-2 test the retrieval layer (does the memory system find the right memories?). Tier 3 tests the full end-to-end loop (does the LLM produce better responses with memory?). Tier 4 compares against existing systems.
Tier 0: Embedding Sanity Checks
Validates that the embedding layer (all-MiniLM-L6-v2) is baseline-competitive on standard benchmarks.
```bash
python -m benchmarks.sanity.run_mteb                  # all 3 tasks
python -m benchmarks.sanity.run_mteb --tasks SciFact  # single task
```
Tier 1: Synthetic Novel-Fact Retrieval
Tests whether the memory system can store and retrieve fictional facts. Zero contamination risk -- the facts are entirely fabricated, so no correct answer can come from pretraining.
- 50 facts across 8 domains
- 45 base facts, 5 corrections
- 11 related-fact clusters (for spreading activation)
- 3 cross-domain connections
- 95 total queries
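One fact record in this dataset might be shaped roughly as follows. This is a hypothetical sketch -- the field names and the example fact are illustrative, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NovelFact:
    fact_id: str
    domain: str                     # one of the 8 domains
    text: str                       # the fabricated statement
    corrects: Optional[str] = None  # fact_id this entry supersedes (the 5 corrections)
    cluster: Optional[str] = None   # related-fact cluster used by spreading activation
    queries: List[str] = field(default_factory=list)

fact = NovelFact("f001", "botany", "The zelkara fern blooms only at dusk.",
                 queries=["When does the zelkara fern bloom?"])
```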
Simulated Time
Benchmarks use a SimulatedClock that advances timestamps by a configurable amount per turn (default 30-60s). This allows temporal decay and rehearsal to operate realistically.
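A per-turn clock like this could be sketched as below; the class name matches the text, but the interface and defaults are assumptions, not the benchmark's actual API:

```python
import random

class SimulatedClock:
    """Sketch: each tick advances simulated time by a random 30-60 s step,
    so decay and rehearsal unfold over realistic spans without real waiting."""
    def __init__(self, start=0.0, min_step=30.0, max_step=60.0, seed=0):
        self.now = start
        self.min_step = min_step
        self.max_step = max_step
        self._rng = random.Random(seed)

    def tick(self):
        """Advance by one conversational turn; return the new timestamp."""
        self.now += self._rng.uniform(self.min_step, self.max_step)
        return self.now

clock = SimulatedClock()
for _ in range(3):
    clock.tick()
# after 3 turns, between 90 and 180 simulated seconds have elapsed
```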
Ablation Conditions
7 conditions, progressively adding features:
- `simple_retrieval` -- FAISS only, no cognitive features
- `with_decay` -- add grace-period decay + frequency modulation
- `with_working_memory` -- add turn-based working memory buffer
- `with_spreading` -- add spreading activation
- `with_priming` -- add priming boost
- `with_importance` -- add importance scoring
- `full_system` -- all features (default pipeline)
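Because each condition adds one feature on top of the previous one, the whole table can be generated from a cumulative prefix of feature flags. A sketch with illustrative flag names:

```python
# Each condition enables a cumulative prefix of feature flags.
FEATURES = ["decay", "working_memory", "spreading", "priming", "importance"]

def make_condition(last_enabled=None):
    """Enable every feature up to and including `last_enabled`."""
    cutoff = FEATURES.index(last_enabled) + 1 if last_enabled else 0
    return {f: i < cutoff for i, f in enumerate(FEATURES)}

CONDITIONS = {"simple_retrieval": make_condition()}
for feature in FEATURES:
    CONDITIONS[f"with_{feature}"] = make_condition(feature)
CONDITIONS["full_system"] = make_condition("importance")  # everything on
```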
Scenarios
| Scenario | What It Tests |
|---|---|
| Basic Recall | Inject fact, add filler turns, query at various distances |
| Correction Handling | Inject original, then correction, then query |
| Spreading Activation | Inject related facts, query one, check if related activate |
| Rehearsal | Pair facts, rehearse one, test both after gap |
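The Basic Recall scenario can be sketched as a loop over query distances. The toy word-overlap store below stands in for the real pipeline (which retrieves via FAISS); all names here are illustrative:

```python
class ToyStore:
    """Stand-in for the real pipeline, just for the sketch."""
    def __init__(self):
        self.items = []
    def store(self, text):
        self.items.append(text)
    def recall(self, query):
        # crude word-overlap retrieval
        words = set(query.lower().split())
        return [t for t in self.items if words & set(t.lower().split())]

def basic_recall(pipeline, fact, query, distances=(1, 5, 20)):
    """Inject the fact, pad with filler turns, query at each distance."""
    pipeline.store(fact)
    turns, results = 0, {}
    for d in sorted(distances):
        while turns < d:
            pipeline.store(f"filler turn number {turns}")
            turns += 1
        results[d] = fact in pipeline.recall(query)
    return results

print(basic_recall(ToyStore(), "the zelkara fern blooms at dusk",
                   "when does the zelkara fern bloom"))
# {1: True, 5: True, 20: True}
```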
```bash
python -m benchmarks.harness.runner                                  # all scenarios
python -m benchmarks.harness.runner --scenario recall                # single scenario
python -m benchmarks.harness.runner --condition simple_retrieval full_system
```
Tier 2: Multi-Session Dialogue
7 simulated sessions with 4-12 hour gaps. Tests cross-session recall, preference recall, cross-domain spreading, correction accuracy, and broad domain recall.
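Generating the session schedule is straightforward. A sketch with assumed names (not the benchmark's actual code):

```python
import random

def session_starts(n_sessions=7, min_gap_h=4.0, max_gap_h=12.0, seed=0):
    """Start times (in hours) for n simulated sessions
    separated by random 4-12 h gaps."""
    rng = random.Random(seed)
    t, starts = 0.0, []
    for _ in range(n_sessions):
        starts.append(t)
        t += rng.uniform(min_gap_h, max_gap_h)
    return starts

starts = session_starts()
gaps = [b - a for a, b in zip(starts, starts[1:])]
```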
Tier 3: End-to-End LLM Response Evaluation
Two evaluation methods run in parallel:
- Keyword matching: fraction of expected keywords found in the response
- LLM-as-judge: the same LLM rates the response's accuracy from 0.0 to 1.0 against ground truth
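The keyword-matching score is a simple hit fraction. A sketch (the real evaluator may tokenize differently):

```python
def keyword_score(response, expected_keywords):
    """Fraction of expected keywords found in the response, case-insensitive."""
    text = response.lower()
    hits = sum(kw.lower() in text for kw in expected_keywords)
    return hits / len(expected_keywords)

score = keyword_score("The zelkara fern blooms only at dusk.",
                      ["zelkara", "dusk", "dawn"])  # 2 of 3 keywords hit
```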
Two LLM backends:
- Mistral 7B Instruct (q4_K_M) via Ollama -- local, free, represents small/edge models
- Claude Opus 4.6 via Anthropic API -- state-of-the-art, represents frontier models
Design:
- Condition A (baseline): LLM receives query alone
- Condition B (augmented):
pipeline.recall(query) -> format memories -> prepend to prompt -> LLM generates answer
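The Condition B prompt assembly can be sketched as below; the template and function name are illustrative, not the actual implementation:

```python
def augment_prompt(query, memories):
    """Format recalled memories and prepend them to the query
    before the LLM generates (Condition B)."""
    if not memories:
        return query  # degrades gracefully to the Condition A baseline
    lines = "\n".join(f"- {m}" for m in memories)
    return f"Relevant memories:\n{lines}\n\nQuestion: {query}"

prompt = augment_prompt("When does the zelkara fern bloom?",
                        ["The zelkara fern blooms only at dusk."])
```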
```bash
python -m benchmarks.llm_eval.runner                              # Mistral (local)
python -m benchmarks.llm_eval.runner --backend anthropic          # Claude Opus
python -m benchmarks.llm_eval.runner --backend anthropic --verbose  # with detail
```
Tier 4: Comparison with Existing Systems
Architectural comparison with published LLM memory approaches. See the results page for the full feature comparison table.