Benchmark Results

Tier 0: MTEB/BEIR Embedding Sanity

| Task | Our nDCG@10 | Published Range | Status |
|---|---|---|---|
| SciFact | 0.645 | 0.64-0.67 | Matches |
| NFCorpus | 0.316 | 0.30-0.33 | Matches |
| ArguAna | 0.502 | 0.49-0.52 | Matches |

Embedding layer is standard. These scores validate the foundation.
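The Tier 0 scores use nDCG@10, the standard BEIR ranking metric. For reference, here is a minimal sketch of the textbook computation; this is the general definition, not the project's actual evaluation harness:

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: graded relevance, log-discounted by rank.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_ranked_order, k=10):
    # Normalize by the DCG of the ideal (relevance-sorted) ordering.
    ideal = dcg_at_k(sorted(relevances_in_ranked_order, reverse=True), k)
    return dcg_at_k(relevances_in_ranked_order, k) / ideal if ideal else 0.0
```

A perfect ranking scores 1.0; pushing relevant documents down the list discounts their contribution logarithmically.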

Tier 1: Synthetic Retrieval

Basic Recall (retrieval distances of 10, 25, 50, and 100 turns)

| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 99.4% | 0.94 |
| with_decay | 99.4% | 0.94 |
| with_importance | 100.0% | 1.24 |
| full_system | 100.0% | 1.24 |

Correction Handling (52-turn gap)

| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 100.0% | 0.80 |
| with_importance | 100.0% | 1.29 |
| full_system | 100.0% | 1.29 |

Rehearsal (110-turn gap)

| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 87.5% | 0.68 |
| with_decay | 87.5% | 0.63 |
| full_system | 93.8% | 0.96 |

This is the strongest differentiator: the full system recovers a memory that the simpler conditions miss, and importance scoring plus priming raise the average score of rehearsed facts by roughly 40%.
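Recall@k in these tables is the standard retrieval metric. A minimal sketch of the usual computation follows; the harness's exact definition, including how k is chosen per condition, is an assumption here:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the relevant memories that appear in the top-k retrieved.
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / len(relevant)
```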

Tier 2: Multi-Session Dialogue

| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 97.0% | 0.75 |
| with_decay | 97.0% | 0.75 |
| full_system | 100.0% | 1.13 |

Tier 3: End-to-End LLM Evaluation

Mistral 7B (Local)

LLM Evaluation: Baseline vs CMM

| Scenario | Baseline (kw) | Augmented (kw) | Delta (kw) | Baseline (judge) | Augmented (judge) | Delta (judge) |
|---|---|---|---|---|---|---|
| Novel fact recall | 17% | 72% | +55% | 18% | 90% | +72% |
| Correction accuracy | 40% | 80% | +40% | 30% | 70% | +40% |
| Multi-memory synthesis | 29% | 90% | +61% | 20% | 90% | +70% |
| User preferences | 0% | 88% | +88% | 62% | 75% | +12% |
| Overall | 21% | 82% | +61% | 32% | 81% | +49% |

Claude Opus 4.6

| Scenario | Baseline (kw) | Augmented (kw) | Delta (kw) | Baseline (judge) | Augmented (judge) | Delta (judge) |
|---|---|---|---|---|---|---|
| Novel fact recall | 17% | 98% | +81% | 0% | 85% | +85% |
| Correction accuracy | 40% | 100% | +60% | 0% | 88% | +88% |
| Multi-memory synthesis | 36% | 95% | +59% | 0% | 61% | +61% |
| User preferences | 33% | 100% | +67% | 2% | 80% | +78% |
| Overall | 31% | 98% | +67% | 1% | 78% | +78% |

Cross-Model Comparison

| Metric | Mistral Baseline | Mistral Augmented | Opus Baseline | Opus Augmented |
|---|---|---|---|---|
| Keyword avg | 21% | 82% | 31% | 98% |
| LLM judge avg | 32% | 81% | 1% | 78% |
| Correction kw | 80% | 80% | 100% | 100% |
| Synthesis kw | 90% | 90% | 95% | 95% |

Tier 4: Feature Comparison with Existing Systems

| Feature | CMM (ours) | Generative Agents | MemGPT | Standard RAG | LangChain | ChatGPT Memory | Mem0 |
|---|---|---|---|---|---|---|---|
| Retrieval trigger | Automatic (passive) | Automatic | Explicit (function calls) | Automatic | Automatic | Automatic (all injected) | Explicit (API) |
| Temporal decay | Grace-period + frequency-adjusted | Exponential (0.995/hr) | No | No | No | Weak (recency heuristic) | Yes (undocumented) |
| Importance scoring | 4-level rule-based | LLM-assigned 1-10 | No | No | No | No | Yes (undocumented) |
| Spreading activation | Yes (dual-path: FAISS + entity links) | No | No | No | No | No | No |
| Entity linking | Yes (spaCy NER, cross-domain) | No | No | No | No | No | No |
| Priming | Yes (turn-decaying boost) | No | No | No | No | No | No |
| Working memory | Fixed-size TTL buffer | No | Core memory blocks | No | Buffer window | No | Session layer |
| Consolidation | Episodic-to-semantic clustering | Reflection (importance threshold) | No | No | Summary compression | No | Layer promotion |
| Metamemory | Confidence + tip-of-tongue | No | No | No | No | No | No |
| Gist compression | LLM-compressed summaries | No (verbatim) | No (verbatim) | No (verbatim chunks) | Optional summary | Conversation summaries | LLM-extracted facts |
| Scoring | sim × decay × importance × priming | recency + importance + relevance | Similarity only | Similarity only | Similarity only | All injected | Similarity + recency |
| Agent controls memory? | No (fully passive) | No (auto) | Yes (explicit) | N/A (no write) | No (auto per turn) | Hybrid | Hybrid |
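The CMM scoring row can be made concrete. The sketch below is illustrative only: the multiplicative form (sim × decay × importance × priming) comes from the table, but every parameter name, the grace-period decay curve, and the frequency adjustment are assumptions rather than the project's actual implementation:

```python
import math

def memory_score(similarity, age_hours, access_count, importance, turns_since_prime,
                 grace_hours=24.0, half_life_hours=168.0,
                 prime_boost=0.5, prime_decay=0.8):
    # Grace period: no decay penalty for recent memories (hypothetical window).
    effective_age = max(0.0, age_hours - grace_hours)
    # Frequency adjustment: often-accessed memories decay more slowly (assumed form).
    half_life = half_life_hours * (1.0 + math.log1p(access_count))
    decay = 0.5 ** (effective_age / half_life)
    # Priming: a boost that fades with each turn since the memory was last cued.
    priming = 1.0 + prime_boost * (prime_decay ** turns_since_prime)
    return similarity * decay * importance * priming
```

With these made-up defaults, a fresh, just-primed memory scores sim × importance × 1.5, and the boost fades back toward sim × importance as turns pass.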

Key Observations

  1. Memory augmentation benefits both small and frontier models. Mistral 7B: 21% -> 82% (keyword); Claude Opus: 31% -> 98% (keyword).
  2. Frontier models produce cleaner baselines. Opus baseline LLM-judge score is 0-2% (honest "I don't know") vs Mistral's 18-32% (confident hallucination).
  3. Frontier models utilize memory context better. Augmented keyword scores: Opus 98% vs Mistral 82%.
  4. Frontier judges are stricter. Opus's LLM-judge score for augmented responses (78%) is lower than Mistral's (81%) despite higher keyword scores.
  5. Correction accuracy is perfect with Opus. 100% keyword score across all 5 corrections.
  6. The autoassociative recall thesis holds across model scales. Both models correctly surfaced the shellfish allergy when asked about lunch ordering.
  7. One persistent retrieval limitation. The Osei-Petrov cross-domain collaboration query scores 0% on both models: a retrieval failure, not an LLM failure.
  8. Spreading activation and priming are unique to CMM. No other system implements these features.
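Observation 8's spreading activation (the dual-path FAISS + entity-link retrieval from the feature table) can be sketched as graph propagation. Everything below, from the function name to the decay factor, is a hypothetical illustration of the general technique, not CMM's code:

```python
def spread_activation(seed_scores, entity_links, fanout_decay=0.5, hops=2):
    # seed_scores: {memory_id: score} from the similarity (e.g. FAISS) path.
    # entity_links: {memory_id: [memory_ids sharing an entity]}.
    activation = dict(seed_scores)
    frontier = dict(seed_scores)
    for _ in range(hops):
        next_frontier = {}
        for mem_id, score in frontier.items():
            for linked in entity_links.get(mem_id, ()):
                boost = score * fanout_decay  # activation weakens per hop
                if boost > activation.get(linked, 0.0):
                    activation[linked] = boost
                    next_frontier[linked] = boost
        frontier = next_frontier
    return activation
```

A memory with no textual similarity to the query (e.g. the shellfish allergy in observation 6) can still surface if it shares an entity link with a high-similarity seed.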