Evaluation Methodology
Core Methodology
Delta methodology: the same LLM answers the same questions with and without the memory system; the system's value is the measurable gap between the two runs. For novel-fact benchmarks, the LLM provably cannot answer without the memory system (baseline accuracy = 0%), so any correct recall is attributable to our system.
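The delta reduces to a simple difference of mean scores. A minimal sketch, assuming per-question 0/1 correctness scores (the function name and inputs are illustrative, not the benchmark's actual API):

```python
def accuracy_delta(baseline_scores, augmented_scores):
    """Mean accuracy gap attributable to the memory system."""
    base = sum(baseline_scores) / len(baseline_scores)
    aug = sum(augmented_scores) / len(augmented_scores)
    return aug - base

# On a novel-fact benchmark the baseline is provably 0.0,
# so the delta equals the augmented accuracy outright.
print(accuracy_delta([0, 0, 0, 0], [1, 1, 0, 1]))  # 0.75
```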
Two levels of evaluation: Tiers 0-2 test the retrieval layer (does the memory system find the right memories?). Tier 3 tests the full end-to-end loop (does the LLM produce better responses with memory?). Tier 4 compares against existing systems.
Tier 0: Embedding Sanity Checks
Validates that the embedding layer (all-MiniLM-L6-v2) is baseline-competitive on standard benchmarks.
```bash
python -m benchmarks.sanity.run_mteb                  # all 3 tasks
python -m benchmarks.sanity.run_mteb --tasks SciFact  # single task
```
Tier 1: Synthetic Novel-Fact Retrieval
Tests whether the memory system can store and retrieve fictional facts. Zero contamination risk -- the facts are entirely fabricated, so no correct answer can come from pretraining.
- 50 facts across 8 domains
- 45 base facts, 5 corrections
- 11 related-fact clusters (for spreading activation)
- 3 cross-domain connections
- 95 total queries
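One fact record in this dataset might be shaped roughly as follows. This is a hypothetical sketch -- the field names and the example fact are illustrative, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NovelFact:
    fact_id: str
    domain: str                     # one of the 8 domains
    text: str                       # the fabricated statement
    corrects: Optional[str] = None  # fact_id this entry supersedes (the 5 corrections)
    cluster: Optional[str] = None   # related-fact cluster used by spreading activation
    queries: List[str] = field(default_factory=list)

fact = NovelFact("f001", "botany", "The zelkara fern blooms only at dusk.",
                 queries=["When does the zelkara fern bloom?"])
```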
Simulated Time
Benchmarks use a SimulatedClock that advances timestamps by a configurable amount per turn (default 30-60s). This allows temporal decay and rehearsal to operate realistically.
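A per-turn clock like this could be sketched as below; the class name matches the text, but the interface and defaults are assumptions, not the benchmark's actual API:

```python
import random

class SimulatedClock:
    """Sketch: each tick advances simulated time by a random 30-60 s step,
    so decay and rehearsal unfold over realistic spans without real waiting."""
    def __init__(self, start=0.0, min_step=30.0, max_step=60.0, seed=0):
        self.now = start
        self.min_step = min_step
        self.max_step = max_step
        self._rng = random.Random(seed)

    def tick(self):
        """Advance by one conversational turn; return the new timestamp."""
        self.now += self._rng.uniform(self.min_step, self.max_step)
        return self.now

clock = SimulatedClock()
for _ in range(3):
    clock.tick()
# after 3 turns, between 90 and 180 simulated seconds have elapsed
```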
Ablation Conditions
7 conditions, progressively adding features:
- `simple_retrieval` -- FAISS only, no cognitive features
- `with_decay` -- add grace-period decay + frequency modulation
- `with_working_memory` -- add turn-based working memory buffer
- `with_spreading` -- add spreading activation
- `with_priming` -- add priming boost
- `with_importance` -- add importance scoring
- `full_system` -- all features (default pipeline)
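Because each condition adds one feature on top of the previous one, the whole table can be generated from a cumulative prefix of feature flags. A sketch with illustrative flag names:

```python
# Each condition enables a cumulative prefix of feature flags.
FEATURES = ["decay", "working_memory", "spreading", "priming", "importance"]

def make_condition(last_enabled=None):
    """Enable every feature up to and including `last_enabled`."""
    cutoff = FEATURES.index(last_enabled) + 1 if last_enabled else 0
    return {f: i < cutoff for i, f in enumerate(FEATURES)}

CONDITIONS = {"simple_retrieval": make_condition()}
for feature in FEATURES:
    CONDITIONS[f"with_{feature}"] = make_condition(feature)
CONDITIONS["full_system"] = make_condition("importance")  # everything on
```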
Scenarios
| Scenario | What It Tests |
|---|---|
| Basic Recall | Inject fact, add filler turns, query at various distances |
| Correction Handling | Inject original, then correction, then query |
| Spreading Activation | Inject related facts, query one, check if related activate |
| Rehearsal | Pair facts, rehearse one, test both after gap |
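The Basic Recall scenario can be sketched as a loop over query distances. The toy word-overlap store below stands in for the real pipeline (which retrieves via FAISS); all names here are illustrative:

```python
class ToyStore:
    """Stand-in for the real pipeline, just for the sketch."""
    def __init__(self):
        self.items = []
    def store(self, text):
        self.items.append(text)
    def recall(self, query):
        # crude word-overlap retrieval
        words = set(query.lower().split())
        return [t for t in self.items if words & set(t.lower().split())]

def basic_recall(pipeline, fact, query, distances=(1, 5, 20)):
    """Inject the fact, pad with filler turns, query at each distance."""
    pipeline.store(fact)
    turns, results = 0, {}
    for d in sorted(distances):
        while turns < d:
            pipeline.store(f"filler turn number {turns}")
            turns += 1
        results[d] = fact in pipeline.recall(query)
    return results

print(basic_recall(ToyStore(), "the zelkara fern blooms at dusk",
                   "when does the zelkara fern bloom"))
# {1: True, 5: True, 20: True}
```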
```bash
python -m benchmarks.harness.runner                                  # all scenarios
python -m benchmarks.harness.runner --scenario recall                # single scenario
python -m benchmarks.harness.runner --condition simple_retrieval full_system
```
Tier 2: Multi-Session Dialogue
7 simulated sessions with 4-12 hour gaps. Tests cross-session recall, preference recall, cross-domain spreading, correction accuracy, and broad domain recall.
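Generating the session schedule is straightforward. A sketch with assumed names (not the benchmark's actual code):

```python
import random

def session_starts(n_sessions=7, min_gap_h=4.0, max_gap_h=12.0, seed=0):
    """Start times (in hours) for n simulated sessions
    separated by random 4-12 h gaps."""
    rng = random.Random(seed)
    t, starts = 0.0, []
    for _ in range(n_sessions):
        starts.append(t)
        t += rng.uniform(min_gap_h, max_gap_h)
    return starts

starts = session_starts()
gaps = [b - a for a, b in zip(starts, starts[1:])]
```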
Tier 3: End-to-End LLM Response Evaluation
Two evaluation methods run in parallel:
- Keyword matching: fraction of expected keywords found in the response
- LLM-as-judge: the same LLM rates the response's accuracy from 0.0 to 1.0 against ground truth
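The keyword-matching score is a simple hit fraction. A sketch (the real evaluator may tokenize differently):

```python
def keyword_score(response, expected_keywords):
    """Fraction of expected keywords found in the response, case-insensitive."""
    text = response.lower()
    hits = sum(kw.lower() in text for kw in expected_keywords)
    return hits / len(expected_keywords)

score = keyword_score("The zelkara fern blooms only at dusk.",
                      ["zelkara", "dusk", "dawn"])  # 2 of 3 keywords hit
```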
Two LLM backends:
- Mistral 7B Instruct (q4_K_M) via Ollama -- local, free, represents small/edge models
- Claude Opus 4.6 via Anthropic API -- state-of-the-art, represents frontier models
Design:
- Condition A (baseline): LLM receives query alone
- Condition B (augmented):
pipeline.recall(query) -> format memories -> prepend to prompt -> LLM generates answer
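The Condition B prompt assembly can be sketched as below; the template and function name are illustrative, not the actual implementation:

```python
def augment_prompt(query, memories):
    """Format recalled memories and prepend them to the query
    before the LLM generates (Condition B)."""
    if not memories:
        return query  # degrades gracefully to the Condition A baseline
    lines = "\n".join(f"- {m}" for m in memories)
    return f"Relevant memories:\n{lines}\n\nQuestion: {query}"

prompt = augment_prompt("When does the zelkara fern bloom?",
                        ["The zelkara fern blooms only at dusk."])
```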
```bash
python -m benchmarks.llm_eval.runner                              # Mistral (local)
python -m benchmarks.llm_eval.runner --backend anthropic          # Claude Opus
python -m benchmarks.llm_eval.runner --backend anthropic --verbose  # with detail
```
Tier 4: Comparison with Existing Systems
Architectural comparison with published LLM memory approaches. See the results page for the full feature comparison table.