Evaluation Methodology

Core Methodology

Delta methodology: the same LLM answers the same questions with and without the memory system, and the system's value is the measurable gap between the two runs. For novel-fact benchmarks, the LLM provably cannot answer without the memory system (baseline accuracy = 0%), so any correct recall is attributable to the memory system.

Two levels of evaluation: Tiers 0-2 test the retrieval layer (does the memory system find the right memories?). Tier 3 tests the full end-to-end loop (does the LLM produce better responses with memory?). Tier 4 compares against existing systems.
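The delta methodology above can be sketched in a few lines. This is a hypothetical illustration, not the benchmark's actual scoring code: `score` and `delta` are stand-in names, and substring matching stands in for whatever answer-checking the harness really uses.

```python
# Hypothetical sketch of the delta methodology: score the same model's
# answers with and without memory, and report the gap.

def score(answers, expected):
    """Fraction of answers containing the expected fact (naive substring check)."""
    hits = sum(1 for a, e in zip(answers, expected) if e.lower() in a.lower())
    return hits / len(expected)

def delta(baseline_answers, augmented_answers, expected):
    base = score(baseline_answers, expected)  # ~0.0 for novel facts by design
    aug = score(augmented_answers, expected)
    return aug - base                         # gap attributable to the memory system

expected = ["zephyrite"]
print(delta(["I don't know."], ["The mineral is zephyrite."], expected))  # -> 1.0
```

Because baseline accuracy on fabricated facts is zero, the delta equals the augmented accuracy, which is what makes the attribution clean.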

Tier 0: Embedding Sanity Checks

Validates that the embedding layer (all-MiniLM-L6-v2) is baseline-competitive on standard benchmarks.

python -m benchmarks.sanity.run_mteb                    # all 3 tasks
python -m benchmarks.sanity.run_mteb --tasks SciFact    # single task

Tier 1: Synthetic Novel-Fact Retrieval

Tests whether the memory system can store and retrieve fictional facts. Zero contamination -- facts are entirely fabricated.

  • 50 facts across 8 domains
  • 45 base facts, 5 corrections
  • 11 related-fact clusters (for spreading activation)
  • 3 cross-domain connections
  • 95 total queries
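To make the dataset shape concrete, a single fabricated fact record might look like the following. All field names and values here are assumptions for illustration; the benchmark's real schema may differ.

```python
# Illustrative shape of one synthetic novel-fact record (hypothetical schema).
fact = {
    "id": "bio-007",
    "domain": "biology",            # one of the 8 domains
    "kind": "base",                 # "base" (45 facts) or "correction" (5)
    "cluster": "lake-vandrel",      # related-fact cluster, used for spreading activation
    "text": "The glimmerfish of Lake Vandrel glows amber during rainstorms.",
    "queries": [                    # queries probing this fact (95 total in the set)
        "What does the glimmerfish do during rainstorms?",
        "Where does the glimmerfish live?",
    ],
}
assert fact["kind"] in {"base", "correction"}
```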

Simulated Time

Benchmarks use a SimulatedClock that advances timestamps by a configurable amount per turn (default 30-60s). This allows temporal decay and rehearsal to operate realistically.
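A minimal sketch of such a clock, assuming a jittered per-turn step within the default 30-60s range; the benchmark's real `SimulatedClock` interface may differ.

```python
import random

class SimulatedClock:
    """Toy simulated clock: advances a random 30-60s per turn (illustrative only)."""

    def __init__(self, start=0.0, min_step=30.0, max_step=60.0, seed=42):
        self.now = start
        self.min_step = min_step
        self.max_step = max_step
        self.rng = random.Random(seed)  # seeded for reproducible runs

    def tick(self):
        """Advance one conversational turn; return the new timestamp."""
        self.now += self.rng.uniform(self.min_step, self.max_step)
        return self.now

clock = SimulatedClock()
t1, t2 = clock.tick(), clock.tick()
assert 30.0 <= t1 <= 60.0 and t1 < t2 <= 120.0
```

Feeding these timestamps to the memory system (instead of wall-clock time) lets hours of simulated decay elapse in a benchmark run that takes seconds.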

Ablation Conditions

7 conditions, progressively adding features:

  1. simple_retrieval -- FAISS only, no cognitive features
  2. with_decay -- add grace-period decay + frequency modulation
  3. with_working_memory -- add turn-based working memory buffer
  4. with_spreading -- add spreading activation
  5. with_priming -- add priming boost
  6. with_importance -- add importance scoring
  7. full_system -- all features (default pipeline)
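The progressive layering above can be expressed as feature flags, each condition enabling one more feature than the last. The flag names below are illustrative, not the project's actual configuration keys.

```python
# Sketch of the 7 ablation conditions as cumulative feature flags (hypothetical keys).
BASE = {"decay": False, "working_memory": False, "spreading": False,
        "priming": False, "importance": False}

CONDITIONS = {"simple_retrieval": dict(BASE)}   # FAISS only
cfg = dict(BASE)
for name, flag in zip(
    ["with_decay", "with_working_memory", "with_spreading",
     "with_priming", "with_importance"],
    ["decay", "working_memory", "spreading", "priming", "importance"],
):
    cfg = {**cfg, flag: True}                   # each condition adds one feature
    CONDITIONS[name] = dict(cfg)
CONDITIONS["full_system"] = dict(cfg)           # all features on (default pipeline)

assert len(CONDITIONS) == 7
assert all(CONDITIONS["full_system"].values())
```

The cumulative design means each condition's score isolates the marginal contribution of the feature it adds.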

Scenarios

Scenario              What It Tests
Basic Recall          Inject a fact, add filler turns, query at various distances
Correction Handling   Inject the original, then a correction, then query
Spreading Activation  Inject related facts, query one, check whether the related facts activate
Rehearsal             Pair facts, rehearse one, test both after a gap

python -m benchmarks.harness.runner                              # all scenarios
python -m benchmarks.harness.runner --scenario recall            # single scenario
python -m benchmarks.harness.runner --condition simple_retrieval full_system
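The Basic Recall scenario reduces to a simple loop: inject the fact, advance through filler turns, then query. The sketch below is hypothetical; `ToyPipeline` is a trivial stand-in for the real memory system, and the naive substring `recall` exists only to make the sketch runnable.

```python
class ToyPipeline:
    """Trivial stand-in memory store (illustration only, not the real system)."""
    def __init__(self):
        self.items = []

    def store(self, text, timestamp):
        self.items.append(text)

    def recall(self, query, timestamp, k=3):
        # Naive word-overlap match in place of real embedding retrieval.
        return [t for t in self.items if any(w in t for w in query.split())][:k]

def run_basic_recall(pipeline, fact, query, distance=10):
    """Inject fact, add `distance` filler turns, then query."""
    pipeline.store(fact, timestamp=0)
    for i in range(1, distance + 1):
        pipeline.store(f"filler turn {i}", timestamp=i)
    return fact in pipeline.recall(query, timestamp=distance + 1)

hit = run_basic_recall(ToyPipeline(), "The glimmerfish glows amber.",
                       "glimmerfish glows", distance=10)
assert hit  # the fact should survive 10 filler turns
```

Varying `distance` is what lets the scenario measure recall as a function of how long ago the fact was stored.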

Tier 2: Multi-Session Dialogue

7 simulated sessions with 4-12 hour gaps. Tests cross-session recall, preference recall, cross-domain spreading, correction accuracy, and broad domain recall.

Tier 3: End-to-End LLM Response Evaluation

Two evaluation methods run in parallel:

  • Keyword matching: fraction of expected keywords found in the response
  • LLM-as-judge: the same LLM rates response accuracy from 0.0 to 1.0 against ground truth
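The keyword-matching metric is straightforward to sketch. Case-insensitive substring matching is an assumption of this sketch, not necessarily what the evaluator implements.

```python
# Sketch of the keyword-matching evaluator: fraction of expected keywords
# found in the model's response (case-insensitive substring match assumed).

def keyword_score(response, keywords):
    text = response.lower()
    found = sum(1 for kw in keywords if kw.lower() in text)
    return found / len(keywords) if keywords else 0.0

score = keyword_score("The glimmerfish glows amber in rainstorms.",
                      ["glimmerfish", "amber", "Lake Vandrel"])
assert abs(score - 2 / 3) < 1e-9  # 2 of 3 keywords present
```

Running this alongside the LLM-as-judge score guards against each metric's blind spots: keyword matching is deterministic but shallow, while the judge captures paraphrase at the cost of some noise.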

Two LLM backends:

  • Mistral 7B Instruct (q4_K_M) via Ollama -- local, free, represents small/edge models
  • Claude Opus 4.6 via Anthropic API -- state-of-the-art, represents frontier models

Design:

  1. Condition A (baseline): LLM receives query alone
  2. Condition B (augmented): pipeline.recall(query) -> format memories -> prepend to prompt -> LLM generates answer

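Condition B's prompt assembly can be sketched as follows. The template wording and the fallback behavior are assumptions of this sketch; only the recall-format-prepend flow comes from the design above.

```python
# Sketch of Condition B: recalled memories are formatted as a bulleted
# context block and prepended to the user's query (template is illustrative).

def build_augmented_prompt(query, memories):
    if not memories:
        return query  # nothing recalled: degenerates to Condition A
    context = "\n".join(f"- {m}" for m in memories)
    return (f"Relevant memories:\n{context}\n\n"
            f"Using the memories above when helpful, answer:\n{query}")

prompt = build_augmented_prompt("What glows amber?",
                                ["The glimmerfish glows amber."])
assert prompt.startswith("Relevant memories:")
assert prompt.endswith("What glows amber?")
```

Keeping the query text identical across both conditions means any scoring difference is attributable to the prepended memories alone.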
python -m benchmarks.llm_eval.runner                                    # Mistral (local)
python -m benchmarks.llm_eval.runner --backend anthropic                # Claude Opus
python -m benchmarks.llm_eval.runner --backend anthropic --verbose      # with detail

Tier 4: Comparison with Existing Systems

Architectural comparison with published LLM memory approaches. See the results page for the full feature comparison table.