AI Memory Latest Paper Review and MoltMem Comparison Report
Verification approach: the initial web search cast a wide net, but this report adopts only papers whose existence, dates, and abstracts were directly confirmed via the arXiv API.
0. Conclusions First
Four Truly Important Points for MoltMem
- Explicitly possess pre-storage gating (admission control)
  In the 2026 flow, "what to store" has become an independent design consideration.
- Acknowledge that facts alone are insufficient
  Fact/summarized memory is cheap, but risks losing the evidence needed for future queries. A path back to raw evidence is needed.
- Evaluate retrieval separately from write
  Recent papers show that retrieval differences are more dominant than cleverness in writing methods.
- Run evaluation across multiple sessions, time series, and with updates
  Simple recall benchmarks alone miss the real-world question: "Does the remembered information improve subsequent behavior?"
In short, MoltMem is already on quite a good track. But the next battleground is not extraction cleverness itself, but rather admission / evidence tier / retrieval eval.
1. What MoltMem Is Aiming For Now (Basis for Comparison)
1.1 Purpose as Stated in README
"MoltMem solves the following problems:"
"1. Conversational memory loss: Problem of conversation content being lost during compaction"
"2. Poor compression quality: Problem of important information not being properly retained"
"3. Difficulty tracking conversation history: Problem of being unable to track what happened in which turn"
From these three points, MoltMem's core lies not in mere "long-term memory" but in the combination of memory + auditability.
1.2 Current State of the Self-Learning Pipeline
"Goal: Repair all unimplemented and incomplete parts of the self-learning pipeline (fact_extractor occurrence coordination, procedural fact creation, NLI integration, source_type filter relaxation, Admin UI settings) so all learning types become promotion candidates."
In other words, MoltMem is already moving toward occurrence / recurrence, source_type, NLI gate, and Admin UI auditing and adjustment. This aligns quite well with recent paper trends.
1.3 Fact Ledger Design Philosophy
"Facts replace the legacy Memory/EpisodicMemory system with a unified, typed, auditable fact store."
Self-learning and provenance fields: source_type / fingerprint / recurrence_count / first_seen_at / source_session_id / source_timestamp / source_turn_ids
This is significant. The admission basis, recurrence, provenance, and traceability that have become important in recent papers already have a foundation in at least the data model.
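For concreteness, the fields above can be sketched as a single typed record. Only the field names come from the Fact Ledger docs; the dataclass shape and the `is_recurrent` helper are illustrative assumptions, not MoltMem code:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FactRecord:
    """Illustrative shape of a Fact Ledger entry with provenance fields."""
    text: str
    fact_type: str                       # e.g. "preference", "decision", "procedural"
    source_type: str                     # how the fact entered the system
    fingerprint: str                     # dedup key for near-identical facts
    recurrence_count: int = 1            # how often this fact has re-appeared
    first_seen_at: Optional[str] = None  # ISO timestamp of first observation
    source_session_id: Optional[str] = None
    source_timestamp: Optional[str] = None
    source_turn_ids: list[str] = field(default_factory=list)

    def is_recurrent(self, threshold: int = 2) -> bool:
        """A fact seen repeatedly is a stronger admission candidate."""
        return self.recurrence_count >= threshold
```

Note that `source_turn_ids` is exactly what a TierMem-style evidence dereference would follow back to raw turns.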
1.4 Current Injection Method
"Tier 1: safety/pinned facts (full text, no item limit)"
"Tier 2: behavior-related facts (full text, max 10 items)"
This is simple and robust, but it is not query-adaptive retrieval. This is a likely gap to address later.
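The quoted policy is simple enough to sketch directly. Only the "Tier 1 uncapped, Tier 2 capped at 10 items" behavior comes from the docs; the `tier` key and list-of-dicts shape are assumptions for illustration:

```python
def build_context(facts: list[dict], tier2_limit: int = 10) -> list[str]:
    """Static two-tier injection: Tier 1 (safety/pinned) facts go in
    full with no item limit; Tier 2 (behavior-related) facts are capped.
    Note there is no dependence on the query -- this is the non-adaptive
    part flagged above."""
    tier1 = [f["text"] for f in facts if f["tier"] == 1]
    tier2 = [f["text"] for f in facts if f["tier"] == 2]
    return tier1 + tier2[:tier2_limit]
```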
2. Adopted Papers (Existence Confirmed)
All of the following had their titles, dates, and abstracts directly confirmed via arXiv API.
| # | Paper Title | arXiv |
|---|---|---|
| 1 | Memory for Autonomous LLM Agents | 2603.07670 |
| 2 | Adaptive Memory Admission Control (A-MAC) | 2603.04549 |
| 3 | Memex(RL) | 2603.04257 |
| 4 | Diagnosing Retrieval vs. Utilization | 2603.02473 |
| 5 | Fact-Based Memory vs. Long-Context LLMs | 2603.04814 |
| 6 | TierMem (Provenance-Aware) | 2602.17913 |
| 7 | MemoryArena | 2602.16313 |
| 8 | Mnemis (Dual-Route Retrieval) | 2602.15313 |
| 9 | Neuromem | 2602.13967 |
| 10 | SimpleMem | 2601.02553 |
| 11 | Memory in the Age of AI Agents | 2512.13564 |
| 12 | Memoria | 2512.12686 |
3. Major Trends Visible in the Paper Collection
3.1 Write Policy Becomes the Lead Role
A-MAC says this:
"memory admission remains a poorly specified and weakly controlled component"
"A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior."
In other words, "whether to store" is no longer a side effect of extraction. It must be designed independently and be explainable.
3.2 Summary-Only Is Risky; A Path Back to Raw Evidence Is Needed
TierMem says this:
"This creates a fundamental write-before-query barrier"
"summaries can cause unverifiable omissions"
This observation is quite fundamental. Since summaries must be created without knowing future queries, there's a risk of losing necessary conditions.
3.3 Retrieval Can Be More Dominant Than Write
The key point from Diagnosing Retrieval vs. Utilization is quite painful:
"On LoCoMo, retrieval method is the dominant factor"
"Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives"
Roughly speaking, even with clever summarization, you lose if retrieval is weak.
3.4 Evaluation Shifts from Recall Alone to Multi-Session Action
MemoryArena cuts it this way:
"Existing evaluations of agents with memory typically assess memorization and action in isolation."
In other words, just "do you remember?" is insufficient. We need to see if remembered information led to improved subsequent behavior.
3.5 Breaking Down Streaming Memory Lifecycle for Evaluation
Neuromem:
"In practice, memory is streaming"
"accuracy and cost are governed by the full memory lifecycle, which encompasses the ingestion, maintenance, retrieval, and integration of information into generation"
This resonates strongly with MoltMem. Why? Because MoltMem is not just a search database but a pipeline of creation → promotion → integration → injection → audit.
4. Per-Paper Evaluation (Strengths / Weaknesses / MoltMem Comparison)
4.1 Adaptive Memory Admission Control (A-MAC, 2603.04549)
"agents either accumulate large volumes of conversational content, including hallucinated or obsolete facts, or depend on opaque, fully LLM-driven memory policies that are costly and difficult to audit."
"A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior."
Strengths:
- Makes admission an explicit design target
- Interpretable factors make auditing easier
- Good cost balance by using lightweight features rather than all-LLM approach
- Directly addresses prevention of hallucinated / obsolete fact inflow
Weaknesses:
- Benchmarks are LoCoMo-centric; unclear whether the approach generalizes to complex real-world conditions
- Addresses admission but retrieval-side optimization is a separate issue
- Factor design tends to be domain-dependent, risking tuning hell
MoltMem comparison: Very compatible. MoltMem already has source_type, fingerprint, recurrence_count, and first_seen_at, making it easy to add A-MAC-style admission scores. Conversely, while MoltMem currently has promotion gates and NLI, admission scoring is not yet systematized. Top priority for near-term adoption.
Candidate mapping of A-MAC factors onto existing MoltMem signals:
- future_utility: type / pinned candidate / action relevance
- factual_confidence: NLI result / source_type / presence of human approval
- semantic_novelty: fingerprint proximity / degree of duplication with similar facts
- temporal_recency: first_seen / update timestamp
- content_type_prior: a prior for each of decision / preference / learning / procedural
4.2 Diagnosing Retrieval vs. Utilization (2603.02473)
"On LoCoMo, retrieval method is the dominant factor: average accuracy spans 20 points across retrieval methods (57.1% to 77.2%) but only 3-8 points across write strategies."
"Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives"
Strengths:
- Separates write / retrieval / utilization for independent examination
- Challenges the "clever extraction alone wins" narrative
- BM25 / cosine / hybrid reranking comparisons directly inform implementation decisions
Weaknesses:
- Diagnostic rather than proposing a new architecture
- Many results are LoCoMo-dependent
- Raw chunk superiority may depend on immaturity of retrieval implementation
MoltMem comparison: MoltMem invests heavily in fact-ification, NLI, and reconciliation, but the next battleground is a stronger retrieval baseline. fact_context_builder uses static Tier1/Tier2 injection and is likely weak on query-intent-aware retrieval. The first step should be comparison experiments across BM25 / vector / hybrid rerank.
4.3 TierMem: Provenance-Aware Tiered Memory (2602.17913)
"This creates a fundamental write-before-query barrier"
"summaries can cause unverifiable omissions"
"TierMem uses a two-tier memory hierarchy to answer with the cheapest sufficient evidence"
Strengths:
- Rather than opposing summaries and raw evidence, stratifies them
- Addresses provenance frontally with strong traceability philosophy
- Sufficiency router escalates to raw only when necessary
Weaknesses:
- Router design is difficult; a misrouted query gets answered from summaries when raw evidence was needed
- Maintaining raw stores makes implementation and operations heavier
- Human-readable audit UI simultaneously required
MoltMem comparison: Very close to MoltMem's core. source_session_id, source_timestamp, and source_turn_ids already exist, so evolving toward provenance-awareness is straightforward. However, runtime retrieval does not yet systematize evidence escalation: the data model is prepared, but the runtime system is incomplete.
4.4 Memex(RL): Indexed Experience Memory (2603.04257)
"Existing solutions typically shorten context through truncation or running summaries, but these methods are fundamentally lossy"
"Memex maintains a compact working context consisting of concise structured summaries and stable indices, while storing full-fidelity underlying interactions in an external experience database under those indices."
Strengths:
- Avoids summary-only brittleness while dodging full context explosion
- Index and dereference approach is clear
- Strong structure for "return to raw records only when needed"
Weaknesses:
- Assumes RL, so implementation cost is high
- Too heavy to directly apply to personal memory / Fact Ledger
- Heuristic router may be sufficient in practice
MoltMem comparison: What MoltMem should adopt immediately is not RL but indexed evidence dereference. Jumping straight to an RL controller is premature.
4.5 Fact-Based Memory vs. Long-Context LLMs (2603.04814)
"Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction."
"At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns"
Strengths:
- Production-oriented with practical cost comparisons
- Distinguishes strengths and weaknesses of fact-based memory by domain
- Avoids crude dichotomy
MoltMem comparison: The takeaway for MoltMem: stable attributes go to the fact ledger, complex history to an evidence tier, and nothing gets injected all at once. MoltMem should define which query classes are handled fact-first.
4.6 Mnemis: Dual-Route Retrieval (2602.15313)
"existing methods (RAG and Graph-RAG) primarily retrieve memory through similarity-based mechanisms"
"Mnemis integrates System-1 similarity search with a complementary System-2 mechanism, termed Global Selection."
MoltMem comparison: At MoltMem's current stage, refining the retrieval baseline comes before graphs. However, if preference / identity / relationship data grows, a limited entity graph becomes valuable. Full graphification is premature, but local graphs for representing relationships are worth considering.
4.7 MemoryArena (2602.16313)
"Existing evaluations of agents with memory typically assess memorization and action in isolation."
"agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting"
MoltMem comparison: MoltMem needs this badly. If current success metrics center on fact-creation counts and retrieval accuracy, actual improvement in downstream conversation quality stays invisible. The focus going forward should be downstream task success rate rather than memory success rate.
4.8 Neuromem (2602.13967)
"memory is streaming"
"the full memory lifecycle, which encompasses the ingestion, maintenance, retrieval, and integration of information into generation"
MoltMem comparison: MoltMem has precisely ingestion / maintenance / retrieval / injection / audit, so Neuromem's decomposition maps onto it well. What MoltMem needs now is not more feature additions but stage-by-stage measurement of where performance, cost, and failures arise.
4.9 SimpleMem (2601.02553)
"semantic lossless compression"
"Semantic Structured Compression" / "Online Semantic Synthesis" / "Intent-Aware Retrieval Planning"
MoltMem comparison: Useful as idea fodder for MoltMem's "poor compression quality" problem. However, since auditability takes priority, compression must be hybridized with provenance preservation.
4.10 Brief Notes (Surveys and Frameworks)
- Memory in the Age of AI Agents (2512.13564) — Terminology organization; a three-part breakdown into forms / functions / dynamics. Useful for conceptual clarification in docs.
- Memory for Autonomous LLM Agents (2603.07670) — Latest survey viewable through write-manage-read loop. Organization axes: ingestion / consolidation / retrieval / evaluation.
- Memoria (2512.12686) — Hybrid of summarization + KG user model. Idea of thickening structure only for preference / identity categories.
5. Comprehensive Evaluation Compared to MoltMem
5.1 Already Strong Alignments
A. Auditable Fact Store
The README's "unified, typed, auditable fact store" is precisely the direction recent memory papers point to.
B. Provenance Foundation
source_session_id, source_timestamp, source_turn_ids are strong. Foundation for TierMem-like evolution already exists.
C. Recurrence / Source_type
Candidates for A-MAC-style admission signals already in data model.
D. Admin UI / Review
Having HITL and audit flows — often overlooked by papers — is strong differentiation.
5.2 Still Weak Points and Gaps
1. Admission Control Not Systematized
Gates exist, but there is no explicit multi-factor scoring of storage value.
2. Relatively Weak Research on Retrieval
A clever write path can still lose to a stronger retrieval path. MoltMem should make retrieval the main battlefield.
3. Evidence Escalation Incomplete as Runtime System
Provenance fields exist but runtime policy for escalating from fact to raw turn when fact insufficient is unclear.
4. Query-Adaptive Context Building Weak
Current fact_context_builder uses static Tier1/Tier2. Weak on search-intent optimization.
5. Evaluation Not Dimensionalized
Decomposed measurement across recall, overwrite behavior, contradiction handling, downstream task success, and token/latency cost is still future work.
6. What Should Be Done Next in MoltMem (Priority Order)
P0 Do Immediately
P0-1. Add Admission Score Layer
Following A-MAC, numerically quantify before fact creation/promotion:
- utility
- confidence
- novelty
- recency
- content-type prior
Example output:

```json
{
  "admit": true,
  "score": 0.81,
  "factors": {
    "utility": 0.9,
    "confidence": 0.7,
    "novelty": 0.8,
    "recency": 0.6,
    "type_prior": 0.95
  },
  "reason": "high-utility procedural learning with recurrence"
}
```
P0-2. Create Retrieval Bake-Off
Compare on the same query set:
- vector only
- BM25 only
- hybrid rerank
- fact only
- fact + source_turn evidence
Metrics to observe: hit@k, exact answer accuracy, contradiction rate, latency, token cost
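A minimal bake-off harness needs only a shared retriever interface and a hit@k metric. Everything below is an illustrative sketch: `retriever(query) -> ranked doc ids` is an assumed interface, and the toy keyword retriever stands in for real BM25 / vector / hybrid backends:

```python
def keyword_retriever(corpus: dict[str, str]):
    """Toy lexical baseline: rank memory ids by shared-token count with
    the query. Swap in BM25, a vector index, or a hybrid reranker
    behind the same retrieve() interface for the real bake-off."""
    def retrieve(query: str) -> list[str]:
        q = set(query.lower().split())
        return sorted(corpus, key=lambda d: -len(q & set(corpus[d].lower().split())))
    return retrieve

def hit_at_k(retriever, queries: list[dict], k: int = 5) -> float:
    """Fraction of queries whose gold memory id appears in the top-k."""
    hits = sum(1 for q in queries if q["gold_id"] in retriever(q["text"])[:k])
    return hits / len(queries)
```

Usage: build one `queries` list with gold labels, then run every retriever variant through `hit_at_k` so the numbers are directly comparable.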
P0-3. Implement Evidence Escalation
Don't force answers from facts alone. When facts are insufficient, escalate to the original text via source_turn_ids.
- fact hit
- sufficiency check
- if insufficient, retrieve source turns
- answer with citation
- feed verified result back into fact/synopsis
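The escalation loop above can be sketched as follows. All of the interfaces (`fact_store.search`, `turn_store.get`, the `is_sufficient` judge) are assumed names for illustration, not existing MoltMem APIs:

```python
def answer_with_escalation(query, fact_store, turn_store, is_sufficient):
    """Fact-first answering with escalation to raw evidence.
    Facts are assumed to carry source_turn_ids, as in the Fact Ledger."""
    facts = fact_store.search(query)
    if facts and is_sufficient(query, facts):
        return {"answer_from": "facts", "evidence": facts, "citations": []}
    # Insufficient: dereference provenance back to the original turns.
    turn_ids = [tid for f in facts for tid in f["source_turn_ids"]]
    turns = [turn_store.get(tid) for tid in turn_ids]
    return {"answer_from": "source_turns", "evidence": turns, "citations": turn_ids}
```

The final feedback step (writing the verified result back into the fact/synopsis store) is left out for brevity.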
P1 Next Phase
P1-1. Query-Adaptive Context Builder
Keep current Tier1/Tier2, add intent-aware routes:
- persona/preference query → fact-first
- why/how/when query → source-turn/evidence-first
- action/task query → procedural/decision priority
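The three routes above can start as a plain heuristic router. The keyword sets below are illustrative assumptions; a small classifier could replace them later without changing the route names:

```python
import re

def route_query(query: str) -> str:
    """Heuristic intent router for the three routes above."""
    tokens = set(re.findall(r"[a-z']+", query.lower()))
    if tokens & {"why", "how", "when"}:
        return "evidence-first"    # needs source turns / raw history
    if tokens & {"run", "deploy", "fix", "implement", "task"}:
        return "procedural-first"  # prioritize procedural/decision facts
    return "fact-first"            # persona / preference default
```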
P1-2. Establish Evaluation Harness
Referencing MemoryArena / Neuromem, measure:
- multi-session consistency
- stale-fact correction success
- provenance trace success
- retrieval vs. utilization failure split
- insertion / retrieval latency
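The retrieval vs. utilization failure split can be computed mechanically once each eval result records what was retrieved. The result schema (`correct`, `gold_id`, `retrieved_ids` keys) is an assumption for illustration:

```python
def failure_split(results: list[dict]) -> dict:
    """Split failures per the Diagnosing paper's framing: a retrieval
    failure means the gold evidence never surfaced; a utilization
    failure means it was retrieved but the answer was still wrong."""
    retrieval, utilization = 0, 0
    for r in results:
        if r["correct"]:
            continue
        if r["gold_id"] not in r["retrieved_ids"]:
            retrieval += 1
        else:
            utilization += 1
    return {"retrieval_failures": retrieval, "utilization_failures": utilization}
```

This split tells you whether to invest next in the retriever or in the context builder / prompting.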
P1-3. Provenance-Aware UI
Make the Admin UI show at a glance which source turns each fact came from. This is a strong differentiator relative to the surveyed systems.
P2 Can Defer
P2-1. Full Graph Memory
Mnemis is interesting but full graph is premature. Higher value in solidifying retrieval baseline and provenance tier first.
P2-2. RL-Based Memory Controller
Memex(RL)'s controller is appealing, but what MoltMem currently needs is explicit policy, measurement, and auditability, not a learned controller.
7. Summary in One Sentence
MoltMem's direction is quite sound. Especially typed fact ledger, recurrence / provenance, NLI gate, audit/admin UI align with recent paper interests.
But given the 2026 papers, the next things to strengthen are not "extraction cleverness" but rather:
- What to store (admission control)
- How to retrieve (retrieval policy)
- Escalate to evidence when facts insufficient (evidence tier)
- Measure by downstream behavior improvement (evaluation)
Get these right and MoltMem becomes an auditable, operable memory system rather than just "yet another memory DB."
8. References
- Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers — arxiv.org/abs/2603.07670
- Adaptive Memory Admission Control for LLM Agents — arxiv.org/abs/2603.04549
- Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory — arxiv.org/abs/2603.04257
- Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory — arxiv.org/abs/2603.02473
- Beyond the Context Window: Fact-Based Memory vs. Long-Context LLMs — arxiv.org/abs/2603.04814
- From Lossy to Verified: A Provenance-Aware Tiered Memory for Agents — arxiv.org/abs/2602.17913
- MemoryArena: Benchmarking Agent Memory in Multi-Session Agentic Tasks — arxiv.org/abs/2602.16313
- Mnemis: Dual-Route Retrieval on Hierarchical Graphs — arxiv.org/abs/2602.15313
- Neuromem: Granular Decomposition of the Streaming Lifecycle — arxiv.org/abs/2602.13967
- SimpleMem: Efficient Lifelong Memory for LLM Agents — arxiv.org/abs/2601.02553
- Memory in the Age of AI Agents — arxiv.org/abs/2512.13564
- Memoria: A Scalable Agentic Memory Framework — arxiv.org/abs/2512.12686
The "paper basis" sections of this report are direct quotations from or summaries based on each paper's arXiv abstract. The "strengths / weaknesses / MoltMem comparison" sections are analyses conducted by cross-referencing the abstract with current MoltMem docs / code, and are not full implementation verification reviews of each paper.