Blog 2026.03.11
AI Memory System

AI Memory Latest Paper Review and MoltMem Comparison Report

March 11, 2026 · ~30-minute read · Covered period: late 2025 to early 2026

Verification approach: the initial web search cast a wide net, but this report adopts only papers whose existence, dates, and abstracts could be directly confirmed via the arXiv API.


0. Conclusions First

Four Truly Important Points for MoltMem

  1. Make pre-storage gating (admission control) explicit
    In the 2026 literature, "what to store" has become an independent design consideration.
  2. Acknowledge that facts alone are insufficient
    Fact/summary memory is cheap, but it risks losing evidence needed for future queries; a path back to raw evidence is required.
  3. Evaluate retrieval separately from write
    Recent papers show that differences in retrieval dominate over cleverness in write methods.
  4. Evaluate across multiple sessions, over time, and under updates
    Simple recall benchmarks alone miss the real-world question: "Does the remembered information improve subsequent behavior?"

In short, MoltMem is already on quite a good track. But the next battleground is not extraction cleverness itself, but rather admission / evidence tier / retrieval eval.


1. What MoltMem Is Aiming For Now (Basis for Comparison)

1.1 Purpose as Stated in README

"MoltMem solves the following problems:"
"1. Conversational memory loss: Problem of conversation content being lost during compaction"
"2. Poor compression quality: Problem of important information not being properly retained"
"3. Difficulty tracking conversation history: Problem of being unable to track what happened in which turn"

From these three points, MoltMem's core lies not in mere "long-term memory" but in the combination of memory + auditability.

1.2 Current State of the Self-Learning Pipeline

"Goal: Repair all unimplemented and incomplete parts of the self-learning pipeline (fact_extractor occurrence coordination, procedural fact creation, NLI integration, source_type filter relaxation, Admin UI settings) so all learning types become promotion candidates."

In other words, MoltMem is already moving toward occurrence / recurrence, source_type, NLI gate, and Admin UI auditing and adjustment. This aligns quite well with recent paper trends.

1.3 Fact Ledger Design Philosophy

"Facts replace the legacy Memory/EpisodicMemory system with a unified, typed, auditable fact store."

Self-learning and provenance fields: source_type / fingerprint / recurrence_count / first_seen_at / source_session_id / source_timestamp / source_turn_ids

This is significant. The admission basis, recurrence, provenance, and traceability that have become important in recent papers already have a foundation, at least at the data-model level.

1.4 Current Injection Method

"Tier 1: safety/pinned facts (full text, no item limit)"
"Tier 2: behavior-related facts (full text, max 10 items)"

This is simple and robust, but it is not query-adaptive retrieval, which makes it a candidate gap going forward.
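The two-tier rule quoted above can be sketched in a few lines; the function name and the dict shape of `facts` are illustrative assumptions:

```python
def build_context(facts, tier2_limit=10):
    """Static tiered injection: Tier 1 (safety/pinned) is unbounded,
    Tier 2 (behavior-related) is capped at tier2_limit items.
    `facts` is assumed to be a list of dicts with "tier" and "text" keys."""
    tier1 = [f["text"] for f in facts if f["tier"] == 1]
    tier2 = [f["text"] for f in facts if f["tier"] == 2][:tier2_limit]
    return tier1 + tier2

facts = [{"tier": 2, "text": f"behavior-{i}"} for i in range(12)]
facts.append({"tier": 1, "text": "pinned: never delete user data"})
ctx = build_context(facts)  # Tier 1 first, then at most 10 Tier-2 items
```

Note that the query never appears as an input, which is exactly the "not query-adaptive" property flagged above.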


2. Adopted Papers (Existence Confirmed)

All of the following had their titles, dates, and abstracts directly confirmed via arXiv API.

 # | Paper Title                               | arXiv
---|-------------------------------------------|-----------
 1 | Memory for Autonomous LLM Agents          | 2603.07670
 2 | Adaptive Memory Admission Control (A-MAC) | 2603.04549
 3 | Memex(RL)                                 | 2603.04257
 4 | Diagnosing Retrieval vs. Utilization      | 2603.02473
 5 | Fact-Based Memory vs. Long-Context LLMs   | 2603.04814
 6 | TierMem (Provenance-Aware)                | 2602.17913
 7 | MemoryArena                               | 2602.16313
 8 | Mnemis (Dual-Route Retrieval)             | 2602.15313
 9 | Neuromem                                  | 2602.13967
10 | SimpleMem                                 | 2601.02553
11 | Memory in the Age of AI Agents            | 2512.13564
12 | Memoria                                   | 2512.12686

3. Major Trends Visible in the Paper Collection

[Figure: AI Memory System Architecture — Write / Retrieval / Evaluation emerge as major focal points]

3.1 Write Policy Becomes the Lead Role

A-MAC says this:

"memory admission remains a poorly specified and weakly controlled component"

"A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior."

In other words, "whether to store" is no longer a side effect of extraction. It must be designed independently and be explainable.

3.2 Summary-Only Is Risky; A Path Back to Raw Evidence Is Needed

TierMem says this:

"This creates a fundamental write-before-query barrier"

"summaries can cause unverifiable omissions"

This observation is quite fundamental. Since summaries must be created without knowing future queries, there is a risk of losing evidence those queries will need.

3.3 Retrieval Can Be More Dominant Than Write

[Figure: Search and Retrieval — even clever summarization loses if retrieval is weak]

The key point from Diagnosing Retrieval vs. Utilization is quite painful:

"On LoCoMo, retrieval method is the dominant factor"

"Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives"

Roughly speaking, even with clever summarization, you lose if retrieval is weak.

3.4 Evaluation Shifts from Recall Alone to Multi-Session Action

[Figure: Evaluation and Benchmarking — from traditional recall benchmarks to multi-session action evaluation]

MemoryArena cuts it this way:

"Existing evaluations of agents with memory typically assess memorization and action in isolation."

In other words, just "do you remember?" is insufficient. We need to see if remembered information led to improved subsequent behavior.

3.5 Breaking Down Streaming Memory Lifecycle for Evaluation

Neuromem:

"In practice, memory is streaming"

"accuracy and cost are governed by the full memory lifecycle, which encompasses the ingestion, maintenance, retrieval, and integration of information into generation"

This resonates strongly with MoltMem. Why? Because MoltMem is not just a search database but a pipeline of creation → promotion → integration → injection → audit.


4. Per-Paper Evaluation (Strengths / Weaknesses / MoltMem Comparison)

4.1 Adaptive Memory Admission Control (A-MAC) 2026-03-04

"agents either accumulate large volumes of conversational content, including hallucinated or obsolete facts, or depend on opaque, fully LLM-driven memory policies that are costly and difficult to audit."

"A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior."

Strengths
  • Makes admission an explicit design target
  • Interpretable factors make auditing easier
  • Good cost balance by using lightweight features rather than all-LLM approach
  • Directly addresses prevention of hallucinated / obsolete fact inflow
Weaknesses
  • Benchmarks are LoCoMo-centric, unclear if generalizable to complex real-world conditions
  • Addresses admission but retrieval-side optimization is a separate issue
  • Factor design tends to be domain-dependent, risking tuning hell
MoltMem Comparison

Very compatible. MoltMem has source_type, fingerprint, recurrence_count, first_seen_at, making it easy to add A-MAC-style admission scores. Conversely, while MoltMem currently has promotion gates/NLI, the systematization of admission scoring is still weak. Top priority for near-term adoption.

If Adopted
  • future_utility: type / pinned candidate / action relevance
  • factual_confidence: NLI result / source_type / human approval presence
  • semantic_novelty: fingerprint proximity / duplicate degree with similar facts
  • temporal_recency: first_seen / update timestamp
  • content_type_prior: prior for each of decision / preference / learning / procedural
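These factor mappings could feed a weighted admission score along A-MAC's lines. The weights and threshold below are placeholder values that would need tuning, not numbers from the paper:

```python
# Hypothetical weights over the five A-MAC-style factors; tuning required.
WEIGHTS = {"utility": 0.3, "confidence": 0.25, "novelty": 0.2,
           "recency": 0.1, "type_prior": 0.15}

def admission_score(factors: dict, threshold: float = 0.5) -> dict:
    """Combine interpretable per-factor scores into one admission decision."""
    score = sum(WEIGHTS[k] * factors[k] for k in WEIGHTS)
    return {"admit": score >= threshold, "score": round(score, 3), "factors": factors}

decision = admission_score({"utility": 0.9, "confidence": 0.7,
                            "novelty": 0.8, "recency": 0.6, "type_prior": 0.95})
```

Because every factor survives into the output, the decision stays auditable: the Admin UI can show exactly why a candidate fact was admitted or rejected.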
4.2 Diagnosing Retrieval vs. Utilization Bottlenecks 2026-03-02

"On LoCoMo, retrieval method is the dominant factor: average accuracy spans 20 points across retrieval methods (57.1% to 77.2%) but only 3-8 points across write strategies."

"Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives"

Strengths
  • Separates write / retrieval / utilization for independent examination
  • Challenges the "clever extraction alone wins" narrative
  • BM25 / cosine / hybrid reranking comparisons directly inform implementation decisions
Weaknesses
  • Diagnostic rather than proposing new architecture
  • Many results are LoCoMo-dependent
  • Raw chunk superiority may depend on immaturity of retrieval implementation
MoltMem Comparison

MoltMem invests heavily in fact-ification, NLI, and reconciliation, but the next battleground is strengthening retrieval baselines. fact_context_builder uses static Tier1/Tier2 injection, potentially weak on query-intent-aware retrieval. Should first conduct comparison experiments on BM25 / vector / hybrid rerank.

4.3 TierMem — Provenance-Aware Tiered Memory 2026-02-20

"This creates a fundamental write-before-query barrier"

"summaries can cause unverifiable omissions"

"TierMem uses a two-tier memory hierarchy to answer with the cheapest sufficient evidence"

Strengths
  • Rather than opposing summaries and raw evidence, stratifies them
  • Addresses provenance frontally with strong traceability philosophy
  • Sufficiency router escalates to raw only when necessary
Weaknesses
  • Router design is difficult; misclassification means answering from a summary when raw evidence was needed
  • Maintaining raw stores makes implementation and operations heavier
  • A human-readable audit UI is required at the same time
MoltMem Comparison

Very close to MoltMem's core. Already has source_session_id, source_timestamp, source_turn_ids, making evolution toward provenance-awareness straightforward. However, runtime retrieval isn't yet systematized for evidence escalation. "Data model prepared but runtime system incomplete."

4.4 Memex(RL) 2026-03-04

"Existing solutions typically shorten context through truncation or running summaries, but these methods are fundamentally lossy"

"Memex maintains a compact working context consisting of concise structured summaries and stable indices, while storing full-fidelity underlying interactions in an external experience database under those indices."

Strengths
  • Avoids summary-only brittleness while dodging full context explosion
  • Index and dereference approach is clear
  • Strong structure for "return to raw records only when needed"
Weaknesses
  • Assumes RL, high implementation cost
  • Too heavy to directly apply to personal memory / Fact Ledger
  • Heuristic router may be sufficient in practice
MoltMem Comparison

What MoltMem should immediately adopt is not RL but indexed evidence dereference; going straight to an RL controller is premature.

4.5 Fact-Based Memory vs. Long-Context LLMs 2026-03-05

"Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction."

"At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns"

Strengths
  • Production-oriented with practical cost comparisons
  • Distinguishes strengths and weaknesses of fact-based memory by domain
  • Avoids crude dichotomy
MoltMem Comparison

The takeaway for MoltMem: stable attributes go to the fact ledger, complex history to the evidence tier, and nothing gets injected all at once. Which queries should be handled fact-first should be defined explicitly.

4.6 Mnemis — Dual-Route Retrieval 2026-02-17

"existing methods (RAG and Graph-RAG) primarily retrieve memory through similarity-based mechanisms"

"Mnemis integrates System-1 similarity search with a complementary System-2 mechanism, termed Global Selection."

MoltMem Comparison

At MoltMem's current stage, refining the retrieval baseline comes before graphs. However, if preference / identity / relationship data grows, a limited entity graph has high value. Full graphification is premature, but local graphs for representing relationships are worth considering.
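Such a local graph need not be heavyweight. A minimal sketch over (subject, relation, object) triples, with purely illustrative data, might look like:

```python
from collections import defaultdict

# Adjacency over (subject, relation, object) triples — an assumed shape for
# preference / identity / relationship facts, deliberately not full Graph-RAG.
edges = defaultdict(list)

def add_triple(subj, rel, obj):
    edges[subj].append((rel, obj))

add_triple("user", "prefers", "dark mode")
add_triple("user", "works_with", "alice")
add_triple("alice", "role", "designer")

def neighbors(entity, depth=1):
    """Collect triples reachable from an entity within `depth` hops."""
    seen, frontier, out = {entity}, [entity], []
    for _ in range(depth):
        nxt = []
        for e in frontier:
            for rel, obj in edges[e]:
                out.append((e, rel, obj))
                if obj not in seen:
                    seen.add(obj)
                    nxt.append(obj)
        frontier = nxt
    return out
```

One hop from "user" surfaces preferences and direct relationships; two hops pull in facts about related entities, which is the kind of "Global Selection"-adjacent recall that pure similarity search misses.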

4.7 MemoryArena 2026-02-18

"Existing evaluations of agents with memory typically assess memorization and action in isolation."

"agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting"

MoltMem Comparison

MoltMem needs this badly. If the current success metrics center on fact creation counts and retrieval accuracy, actual improvement in downstream conversation quality stays invisible. Going forward, the focus should be on "downstream task success rate" rather than "memory success rate."

4.8 Neuromem 2026-02-15

"memory is streaming"

"the full memory lifecycle, which encompasses the ingestion, maintenance, retrieval, and integration of information into generation"

MoltMem Comparison

MoltMem has precisely the ingestion / maintenance / retrieval / injection / audit stages, so Neuromem's decomposition maps well onto it. What MoltMem needs now is not more feature additions but decomposed measurement of where performance, cost, and failures emerge at each stage.

4.9 SimpleMem 2026-01-29

"semantic lossless compression"

"Semantic Structured Compression" / "Online Semantic Synthesis" / "Intent-Aware Retrieval Planning"

MoltMem Comparison

Useful as a source of ideas for addressing MoltMem's "poor compression quality." However, since auditability is prioritized, a hybrid of compression plus provenance preservation is necessary.

4.10 Surveys / Framework Papers

5. Comprehensive Evaluation Compared to MoltMem

[Figure: MoltMem Data Flow and Gap Analysis vs. Papers]

5.1 Already Strong Alignments

A. Auditable Fact Store

Proclaims "unified, typed, auditable fact store" — among recent memory papers, quite the right direction.

B. Provenance Foundation

source_session_id, source_timestamp, source_turn_ids are strong. Foundation for TierMem-like evolution already exists.

C. Recurrence / Source_type

Candidates for A-MAC-style admission signals already in data model.

D. Admin UI / Review

Having HITL and audit flows — often overlooked by papers — is strong differentiation.

5.2 Still Weak Points and Gaps

1. Admission Control Not Systematized

Gates exist, but there is no explicit, multi-factor scoring of storage value.

2. Relatively Weak Research on Retrieval

Clever write sometimes loses to retrieval strength. MoltMem should make this the main battlefield.

3. Evidence Escalation Incomplete as Runtime System

Provenance fields exist, but the runtime policy for escalating from a fact to raw turns when the fact is insufficient is unclear.

4. Query-Adaptive Context Building Weak

Current fact_context_builder uses static Tier1/Tier2. Weak on search-intent optimization.

5. Evaluation Not Dimensionalized

Decomposed measurement across recall, overwriting, contradiction handling, downstream task success, and token/latency cost remains future work.


6. What Should Be Done Next in MoltMem (Priority Order)

P0 Do Immediately

P0-1. Add Admission Score Layer

Following A-MAC, numerically quantify before fact creation/promotion:

Example output:

{
  "admit": true,
  "score": 0.81,
  "factors": {
    "utility": 0.9,
    "confidence": 0.7,
    "novelty": 0.8,
    "recency": 0.6,
    "type_prior": 0.95
  },
  "reason": "high-utility procedural learning with recurrence"
}

P0-2. Create Retrieval Bake-Off

Compare on the same query set:

Metrics to observe: hit@k, exact answer accuracy, contradiction rate, latency, token cost
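A minimal bake-off harness for the hit@k metric might look like the following; the `runs`/`gold` shapes are simplifying assumptions, and real runs would add the accuracy, latency, and cost columns listed above:

```python
def hit_at_k(retrieved_ids, gold_id, k):
    """1 if the gold item appears in the top-k retrieved ids, else 0."""
    return int(gold_id in retrieved_ids[:k])

def bake_off(runs, gold, k=3):
    """Compare retrieval methods on the same query set.
    `runs` maps method name -> {query: ranked id list}; `gold` maps query -> id."""
    return {method: sum(hit_at_k(ranked[q], gold[q], k) for q in gold) / len(gold)
            for method, ranked in runs.items()}

gold = {"q1": "f3", "q2": "f7"}
runs = {
    "bm25":   {"q1": ["f3", "f1", "f2"], "q2": ["f9", "f8", "f6"]},
    "vector": {"q1": ["f1", "f3", "f2"], "q2": ["f7", "f2", "f5"]},
}
scores = bake_off(runs, gold)
```

Keeping the gold set fixed across methods is the whole point: it isolates the retrieval variable the diagnosing paper identifies as dominant.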

P0-3. Implement Evidence Escalation

Don't answer from facts alone. Escalate to original text when needed via source_turn_ids.

  1. fact hit
  2. sufficiency check
  3. if insufficient, retrieve source turns
  4. answer with citation
  5. feed verified result back into fact/synopsis
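The five-step flow above can be sketched as follows; the hook functions `fetch_turns` and `sufficient` stand in for MoltMem internals and are assumptions:

```python
def answer_with_escalation(fact, fetch_turns, sufficient):
    """Escalate from a compact fact to raw source turns only when needed.
    `fetch_turns` and `sufficient` are assumed hooks into the MoltMem store."""
    if sufficient(fact):                             # step 2: sufficiency check
        return {"answer": fact["content"], "evidence": "fact", "citations": []}
    turns = fetch_turns(fact["source_turn_ids"])     # step 3: dereference provenance
    return {"answer": " ".join(turns), "evidence": "raw",
            "citations": fact["source_turn_ids"]}    # step 4: answer with citation

# Toy store: the fact is too terse, so the router escalates to raw turns.
store = {"t1": "User said the deploy runs Fridays at 17:00 UTC.",
         "t2": "Confirmed by ops in the same thread."}
fact = {"content": "deploys on Friday", "source_turn_ids": ["t1", "t2"]}
result = answer_with_escalation(fact, lambda ids: [store[i] for i in ids],
                                lambda f: len(f["content"]) > 50)
```

Step 5 (feeding the verified result back into the fact/synopsis) is omitted here; the key design point is that `source_turn_ids` makes the escalation cheap and the citation exact.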

P1 Next Phase

P1-1. Query-Adaptive Context Builder

Keep the current Tier1/Tier2, and add intent-aware routes on top:
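One way to layer intent-aware routes on top of the existing tiers is a simple keyword router; the intent labels and cue words below are illustrative, not MoltMem's actual taxonomy:

```python
# Hypothetical intent router layered on the static tiers.
ROUTES = {
    "preference": ["prefer", "like", "always", "never"],
    "procedural": ["how", "steps", "run", "deploy"],
}

def classify_intent(query: str) -> str:
    q = query.lower()
    for intent, cues in ROUTES.items():
        if any(cue in q for cue in cues):
            return intent
    return "general"

def build_adaptive_context(query, facts, tier2_limit=10):
    """Tier 1 stays unconditional; Tier 2 is filtered by query intent first."""
    intent = classify_intent(query)
    tier1 = [f for f in facts if f["tier"] == 1]
    tier2 = [f for f in facts if f["tier"] == 2
             and (intent == "general" or f["type"] == intent)]
    return tier1 + tier2[:tier2_limit]

facts = [{"tier": 1, "type": "safety", "text": "never store secrets"},
         {"tier": 2, "type": "preference", "text": "prefers dark mode"},
         {"tier": 2, "type": "procedural", "text": "deploy checklist"}]
ctx = build_adaptive_context("how do I deploy to staging?", facts)
```

A keyword router is obviously crude; the point is that the tier structure survives unchanged while Tier 2 becomes query-conditional, so the change is low-risk to roll out.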

P1-2. Establish Evaluation Harness

Referencing MemoryArena / Neuromem, measure: multi-session consistency, stale fact correction success, provenance trace success, retrieval vs utilization failure split, insertion / retrieval latency
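The retrieval-vs-utilization failure split in particular is easy to operationalize. A sketch, with an assumed per-case record shape:

```python
def failure_split(cases):
    """Split failures into retrieval vs utilization, per the diagnosing-paper
    framing: if gold evidence was never retrieved it is a retrieval failure,
    otherwise the model failed to use it. The `cases` shape is an assumption."""
    split = {"ok": 0, "retrieval": 0, "utilization": 0}
    for c in cases:
        if c["correct"]:
            split["ok"] += 1
        elif c["gold_id"] not in c["retrieved_ids"]:
            split["retrieval"] += 1
        else:
            split["utilization"] += 1
    return split

cases = [
    {"correct": True,  "gold_id": "f1", "retrieved_ids": ["f1"]},
    {"correct": False, "gold_id": "f2", "retrieved_ids": ["f5", "f6"]},
    {"correct": False, "gold_id": "f3", "retrieved_ids": ["f3", "f4"]},
]
split = failure_split(cases)
```

Tracking this split over time tells you whether to invest in the retriever or in prompting/injection, which is exactly the decision the harness should inform.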

P1-3. Provenance-Aware UI

Make the Admin UI show at a glance which source turn each fact comes from. This is a strong differentiator relative to the papers.

P2 Can Defer

P2-1. Full Graph Memory

Mnemis is interesting but full graph is premature. Higher value in solidifying retrieval baseline and provenance tier first.

P2-2. RL-Based Memory Controller

Memex(RL)'s controller is romantic, but what MoltMem currently needs is explicit policy, measurement, and auditability, not a learned controller.


7. Summary in One Sentence

MoltMem's direction is quite sound. In particular, the typed fact ledger, recurrence / provenance, the NLI gate, and the audit/Admin UI align with recent paper interests.

But given the 2026 papers, the next core to strengthen is not "extraction cleverness" but rather admission control, the evidence tier, and retrieval evaluation.

Get these right and MoltMem becomes an auditable, operable memory system rather than just "yet another memory DB."


8. References


The "paper basis" sections of this report are direct quotations from or summaries based on each paper's arXiv abstract. The "strengths / weaknesses / MoltMem comparison" sections are analyses conducted by cross-referencing the abstract with current MoltMem docs / code, and are not full implementation verification reviews of each paper.
