Benchmark Results

Tier 0: MTEB/BEIR Embedding Sanity

| Task | Our nDCG@10 | Published Range | Status |
|---|---|---|---|
| SciFact | 0.645 | 0.64-0.67 | Matches |
| NFCorpus | 0.316 | 0.30-0.33 | Matches |
| ArguAna | 0.502 | 0.49-0.52 | Matches |

Embedding layer is standard. These scores validate the foundation.
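The Tier 0 scores use nDCG@10, the standard BEIR ranking metric. For reference, here is a minimal sketch of the textbook computation; this is the general definition, not the project's actual evaluation harness:

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: graded relevance, log-discounted by rank.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_ranked_order, k=10):
    # Normalize by the DCG of the ideal (relevance-sorted) ordering.
    ideal = dcg_at_k(sorted(relevances_in_ranked_order, reverse=True), k)
    return dcg_at_k(relevances_in_ranked_order, k) / ideal if ideal else 0.0
```

A perfect ranking scores 1.0; pushing relevant documents down the list discounts their contribution logarithmically.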

Tier 1: Synthetic Retrieval

Basic Recall (retrieval distances of 10, 25, 50, and 100 turns)

| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 99.4% | 0.94 |
| with_decay | 99.4% | 0.94 |
| with_importance | 100.0% | 1.24 |
| full_system | 100.0% | 1.24 |

Correction Handling (52-turn gap)

| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 100.0% | 0.80 |
| with_importance | 100.0% | 1.29 |
| full_system | 100.0% | 1.29 |

Rehearsal (110-turn gap)

| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 87.5% | 0.68 |
| with_decay | 87.5% | 0.63 |
| full_system | 93.8% | 0.96 |

This is the strongest differentiator: the full system recovers a memory that the simpler conditions miss, and importance scoring plus priming raise the average score of rehearsed facts by roughly 40%.
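Recall@k in these tables is the standard retrieval metric. A minimal sketch of the usual computation follows; the harness's exact definition, including how k is chosen per condition, is an assumption here:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the relevant memories that appear in the top-k retrieved.
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / len(relevant)
```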

Tier 2: Multi-Session Dialogue

| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 97.0% | 0.75 |
| with_decay | 97.0% | 0.75 |
| full_system | 100.0% | 1.13 |

Tier 3: End-to-End LLM Evaluation

Mistral 7B (Local)

LLM Evaluation: Baseline vs CMM

| Scenario | Baseline (kw) | Augmented (kw) | Delta (kw) | Baseline (judge) | Augmented (judge) | Delta (judge) |
|---|---|---|---|---|---|---|
| Novel fact recall | 17% | 72% | +55% | 18% | 90% | +72% |
| Correction accuracy | 40% | 80% | +40% | 30% | 70% | +40% |
| Multi-memory synthesis | 29% | 90% | +61% | 20% | 90% | +70% |
| User preferences | 0% | 88% | +88% | 62% | 75% | +12% |
| Overall | 21% | 82% | +61% | 32% | 81% | +49% |

Claude Opus 4.6

| Scenario | Baseline (kw) | Augmented (kw) | Delta (kw) | Baseline (judge) | Augmented (judge) | Delta (judge) |
|---|---|---|---|---|---|---|
| Novel fact recall | 17% | 98% | +81% | 0% | 85% | +85% |
| Correction accuracy | 40% | 100% | +60% | 0% | 88% | +88% |
| Multi-memory synthesis | 36% | 95% | +59% | 0% | 61% | +61% |
| User preferences | 33% | 100% | +67% | 2% | 80% | +78% |
| Overall | 31% | 98% | +67% | 1% | 78% | +78% |

Cross-Model Comparison

| Metric | Mistral Baseline | Mistral Augmented | Opus Baseline | Opus Augmented |
|---|---|---|---|---|
| Keyword avg | 21% | 82% | 31% | 98% |
| LLM judge avg | 32% | 81% | 1% | 78% |
| Correction kw | 80% | 80% | 100% | 100% |
| Synthesis kw | 90% | 90% | 95% | 95% |

Tier 4: Feature Comparison with Existing Systems

| Feature | CMM (ours) | Generative Agents | MemGPT | Standard RAG | LangChain | ChatGPT Memory | Mem0 |
|---|---|---|---|---|---|---|---|
| Retrieval trigger | Automatic (passive) | Automatic | Explicit (function calls) | Automatic | Automatic | Automatic (all injected) | Explicit (API) |
| Temporal decay | Grace-period + frequency-adjusted | Exponential (0.995/hr) | No | No | No | Weak (recency heuristic) | Yes (undocumented) |
| Importance scoring | 4-level rule-based | LLM-assigned 1-10 | No | No | No | No | Yes (undocumented) |
| Spreading activation | Yes (dual-path: FAISS + entity links) | No | No | No | No | No | No |
| Entity linking | Yes (spaCy NER, cross-domain) | No | No | No | No | No | No |
| Priming | Yes (turn-decaying boost) | No | No | No | No | No | No |
| Working memory | Fixed-size TTL buffer | No | Core memory blocks | No | Buffer window | No | Session layer |
| Consolidation | Episodic-to-semantic clustering | Reflection (importance threshold) | No | No | Summary compression | No | Layer promotion |
| Metamemory | Confidence + tip-of-tongue | No | No | No | No | No | No |
| Gist compression | LLM-compressed summaries | No (verbatim) | No (verbatim) | No (verbatim chunks) | Optional summary | Conversation summaries | LLM-extracted facts |
| Scoring | sim × decay × importance × priming | recency + importance + relevance | Similarity only | Similarity only | Similarity only | All injected | Similarity + recency |
| Agent controls memory? | No (fully passive) | No (auto) | Yes (explicit) | N/A (no write) | No (auto per turn) | Hybrid | Hybrid |
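The CMM scoring row can be made concrete. The sketch below is illustrative only: the multiplicative form (sim × decay × importance × priming) comes from the table, but every parameter name, the grace-period decay curve, and the frequency adjustment are assumptions rather than the project's actual implementation:

```python
import math

def memory_score(similarity, age_hours, access_count, importance, turns_since_prime,
                 grace_hours=24.0, half_life_hours=168.0,
                 prime_boost=0.5, prime_decay=0.8):
    # Grace period: no decay penalty for recent memories (hypothetical window).
    effective_age = max(0.0, age_hours - grace_hours)
    # Frequency adjustment: often-accessed memories decay more slowly (assumed form).
    half_life = half_life_hours * (1.0 + math.log1p(access_count))
    decay = 0.5 ** (effective_age / half_life)
    # Priming: a boost that fades with each turn since the memory was last cued.
    priming = 1.0 + prime_boost * (prime_decay ** turns_since_prime)
    return similarity * decay * importance * priming
```

With these made-up defaults, a fresh, just-primed memory scores sim × importance × 1.5, and the boost fades back toward sim × importance as turns pass.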

Key Observations

  1. Memory augmentation benefits both small and frontier models. Mistral 7B: 21% -> 82% (keyword); Claude Opus: 31% -> 98% (keyword).
  2. Frontier models produce cleaner baselines. Opus baseline LLM-judge score is 0-2% (honest "I don't know") vs Mistral's 18-32% (confident hallucination).
  3. Frontier models utilize memory context better. Augmented keyword scores: Opus 98% vs Mistral 82%.
  4. Frontier judges are stricter. Opus's LLM-judge score for augmented responses (78%) is lower than Mistral's (81%) despite higher keyword scores.
  5. Correction accuracy is perfect with Opus. 100% keyword score across all 5 corrections.
  6. The autoassociative recall thesis holds across model scales. Both models correctly surfaced the shellfish allergy when asked about lunch ordering.
  7. One persistent retrieval limitation. The Osei-Petrov cross-domain collaboration query scores 0% on both models: a retrieval failure, not an LLM failure.
  8. Spreading activation and priming are unique to CMM. No other system implements these features.
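Observation 8's spreading activation (the dual-path FAISS + entity-link retrieval from the feature table) can be sketched as graph propagation. Everything below, from the function name to the decay factor, is a hypothetical illustration of the general technique, not CMM's code:

```python
def spread_activation(seed_scores, entity_links, fanout_decay=0.5, hops=2):
    # seed_scores: {memory_id: score} from the similarity (e.g. FAISS) path.
    # entity_links: {memory_id: [memory_ids sharing an entity]}.
    activation = dict(seed_scores)
    frontier = dict(seed_scores)
    for _ in range(hops):
        next_frontier = {}
        for mem_id, score in frontier.items():
            for linked in entity_links.get(mem_id, ()):
                boost = score * fanout_decay  # activation weakens per hop
                if boost > activation.get(linked, 0.0):
                    activation[linked] = boost
                    next_frontier[linked] = boost
        frontier = next_frontier
    return activation
```

A memory with no textual similarity to the query (e.g. the shellfish allergy in observation 6) can still surface if it shares an entity link with a high-similarity seed.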