# Benchmark Results

## Tier 0: MTEB/BEIR Embedding Sanity
| Task | Our nDCG@10 | Published Range | Status |
|---|---|---|---|
| SciFact | 0.645 | 0.64-0.67 | Matches |
| NFCorpus | 0.316 | 0.30-0.33 | Matches |
| ArguAna | 0.502 | 0.49-0.52 | Matches |
The embedding layer performs within published ranges, validating the foundation before any memory-specific machinery is layered on top.
## Tier 1: Synthetic Retrieval

### Basic Recall (distances 10, 25, 50, 100 turns)
| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 99.4% | 0.94 |
| with_decay | 99.4% | 0.94 |
| with_importance | 100.0% | 1.24 |
| full_system | 100.0% | 1.24 |
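The with_decay condition down-weights older memories before scoring. A minimal sketch of a grace-period, frequency-adjusted decay of the kind named in the Tier 4 feature table; all constants, names, and the exact functional form here are illustrative assumptions, not the system's actual parameters:

```python
def decay_weight(age_turns: float, retrieval_count: int,
                 grace_period: float = 20.0, half_life: float = 100.0) -> float:
    """Multiplicative decay factor in (0, 1].

    Memories younger than the grace period keep full weight; after that,
    weight halves every `half_life` turns. Each past retrieval stretches
    the effective half-life (the frequency adjustment), so often-recalled
    memories fade more slowly. Constants are assumed for illustration.
    """
    if age_turns <= grace_period:
        return 1.0
    effective_half_life = half_life * (1 + retrieval_count)
    return 0.5 ** ((age_turns - grace_period) / effective_half_life)
```

For example, a never-retrieved memory 120 turns old gets weight 0.5, while the same memory retrieved three times before keeps roughly 0.84 of its weight.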
### Correction Handling (52-turn gap)
| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 100.0% | 0.80 |
| with_importance | 100.0% | 1.29 |
| full_system | 100.0% | 1.29 |
### Rehearsal (110-turn gap)
| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 87.5% | 0.68 |
| with_decay | 87.5% | 0.63 |
| full_system | 93.8% | 0.96 |
Strongest differentiator. The full system recovers a memory that the simpler conditions miss, and importance scoring plus priming lift the average score of rehearsed facts by roughly 40% (0.68 -> 0.96).
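The full_system scores above come from multiplicative scoring (sim x decay x importance x priming, per the Tier 4 table), which is why average scores can exceed 1.0. A minimal sketch; the field names and example values are assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    similarity: float  # cosine similarity to the query, 0..1
    decay: float       # temporal decay factor, 0..1
    importance: float  # rule-based weight; values > 1.0 boost the score
    priming: float     # turn-decaying boost; 1.0 means no boost

def retrieval_score(m: Memory) -> float:
    """Multiplicative scoring: sim x decay x importance x priming.

    Because the factors multiply, importance and priming can push a
    rehearsed memory above 1.0 even when raw similarity is modest.
    """
    return m.similarity * m.decay * m.importance * m.priming
```

As an illustrative (not actual) decomposition: a raw similarity of 0.68 with an importance weight of 1.2 and a priming boost of 1.18 lands near the 0.96 full-system figure in the rehearsal table.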
## Tier 2: Multi-Session Dialogue
| Condition | Recall@k | Avg Score |
|---|---|---|
| simple_retrieval | 97.0% | 0.75 |
| with_decay | 97.0% | 0.75 |
| full_system | 100.0% | 1.13 |
## Tier 3: End-to-End LLM Evaluation

### Mistral 7B (Local)
| Scenario | Baseline (kw) | Augmented (kw) | Delta (kw) | Baseline (judge) | Augmented (judge) | Delta (judge) |
|---|---|---|---|---|---|---|
| Novel fact recall | 17% | 72% | +55% | 18% | 90% | +72% |
| Correction accuracy | 40% | 80% | +40% | 30% | 70% | +40% |
| Multi-memory synthesis | 29% | 90% | +61% | 20% | 90% | +70% |
| User preferences | 0% | 88% | +88% | 62% | 75% | +12% |
| Overall | 21% | 82% | +61% | 32% | 81% | +49% |
### Claude Opus 4.6
| Scenario | Baseline (kw) | Augmented (kw) | Delta (kw) | Baseline (judge) | Augmented (judge) | Delta (judge) |
|---|---|---|---|---|---|---|
| Novel fact recall | 17% | 98% | +81% | 0% | 85% | +85% |
| Correction accuracy | 40% | 100% | +60% | 0% | 88% | +88% |
| Multi-memory synthesis | 36% | 95% | +59% | 0% | 61% | +61% |
| User preferences | 33% | 100% | +67% | 2% | 80% | +78% |
| Overall | 31% | 98% | +67% | 1% | 78% | +78% |
### Cross-Model Comparison
| Metric | Mistral Baseline | Mistral Augmented | Opus Baseline | Opus Augmented |
|---|---|---|---|---|
| Keyword avg | 21% | 82% | 31% | 98% |
| LLM judge avg | 32% | 81% | 1% | 78% |
| Correction kw | 80% | 80% | 100% | 100% |
| Synthesis kw | 90% | 90% | 95% | 95% |
## Tier 4: Feature Comparison with Existing Systems
| Feature | CMM (ours) | Generative Agents | MemGPT | Standard RAG | LangChain | ChatGPT Memory | Mem0 |
|---|---|---|---|---|---|---|---|
| Retrieval trigger | Automatic (passive) | Automatic | Explicit (function calls) | Automatic | Automatic | Automatic (all injected) | Explicit (API) |
| Temporal decay | Grace-period + frequency-adjusted | Exponential (0.995/hr) | No | No | No | Weak (recency heuristic) | Yes (undocumented) |
| Importance scoring | 4-level rule-based | LLM-assigned 1-10 | No | No | No | No | Yes (undocumented) |
| Spreading activation | Yes (dual-path: FAISS + entity links) | No | No | No | No | No | No |
| Entity linking | Yes (spaCy NER, cross-domain) | No | No | No | No | No | No |
| Priming | Yes (turn-decaying boost) | No | No | No | No | No | No |
| Working memory | Fixed-size TTL buffer | No | Core memory blocks | No | Buffer window | No | Session layer |
| Consolidation | Episodic-to-semantic clustering | Reflection (importance threshold) | No | No | Summary compression | No | Layer promotion |
| Metamemory | Confidence + tip-of-tongue | No | No | No | No | No | No |
| Gist compression | LLM-compressed summaries | No (verbatim) | No (verbatim) | No (verbatim chunks) | Optional summary | Conversation summaries | LLM-extracted facts |
| Scoring | sim x decay x importance x priming | recency + importance + relevance | Similarity only | Similarity only | Similarity only | All injected | Similarity + recency |
| Agent controls memory? | No (fully passive) | No (auto) | Yes (explicit) | N/A (no write) | No (auto per turn) | Hybrid | Hybrid |
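The spreading-activation row above describes dual-path retrieval: seed scores from vector search (FAISS), then expansion along entity links. A minimal sketch with the vector-search step stubbed out as precomputed seed scores; the function name, the adjacency-map representation of entity links, and the spread factor are all assumptions for illustration:

```python
from collections import defaultdict

def spreading_activation(seed_scores: dict[int, float],
                         entity_links: dict[int, list[int]],
                         spread: float = 0.5) -> dict[int, float]:
    """Dual-path activation over a memory store.

    Path 1: `seed_scores` maps memory IDs to similarity scores from a
    vector index (e.g. FAISS top-k), stubbed here as a plain dict.
    Path 2: each seed passes `spread` times its activation to every
    memory that shares an entity link with it, so related memories
    surface even when they don't match the query directly.
    """
    activation = defaultdict(float)
    activation.update(seed_scores)
    for mem_id, score in seed_scores.items():
        for neighbor in entity_links.get(mem_id, []):
            activation[neighbor] += spread * score
    return dict(activation)
```

For example, if memory 1 matches the query at 0.9 and is entity-linked to memory 2, memory 2 receives activation 0.45 without any direct query match; this is the mechanism that lets an allergy memory surface for a lunch-ordering query.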
## Key Observations
- Memory augmentation benefits both small and frontier models. Mistral 7B: 21% -> 82% (keyword). Claude Opus: 31% -> 98%.
- Frontier models produce cleaner baselines. Opus baseline LLM-judge score is 0-2% (honest "I don't know") vs Mistral's 18-32% (confident hallucination).
- Frontier models utilize memory context better. Augmented keyword scores: Opus 98% vs Mistral 82%.
- Frontier judges are stricter. Opus's LLM-judge score for augmented responses (78%) is lower than Mistral's (81%) despite higher keyword scores.
- Correction accuracy is perfect with Opus. 100% keyword score across all 5 corrections.
- The autoassociative recall thesis holds across model scales. Both models correctly surfaced the shellfish allergy when asked about lunch ordering.
- One persistent retrieval limitation. The Osei-Petrov cross-domain collaboration query scores 0% on both models -- a retrieval failure, not an LLM failure.
- Spreading activation and priming are unique to CMM. No other system implements these features.