arXiv:2511.21678

Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

Published on Nov 26 · Submitted by yanpeng_sun on Nov 28

AI-generated summary

ViLoMem, a dual-stream memory framework, enhances MLLMs by preserving multimodal semantic knowledge, reducing errors, and improving accuracy across benchmarks.

Abstract
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo -- solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge -- preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction--hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
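The grow-and-refine principle described in the abstract can be pictured as a similarity-gated update over compact, schema-based memory entries. The Python sketch below is only one way to realize that idea; the entry schema, the `SIM_THRESHOLD` value, and the merge rule are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    """Hypothetical schema for one semantic-memory item (field names are assumed)."""
    lesson: str             # distilled strategy or error pattern, in text form
    embedding: list[float]  # vector used to match new experiences to this entry
    support: int = 1        # how many experiences back this entry

SIM_THRESHOLD = 0.85        # assumed similarity gate deciding merge vs. create

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def grow_and_refine(memory: list[MemoryEntry], new: MemoryEntry) -> None:
    """Merge a new lesson into its closest existing entry if similar enough,
    otherwise grow the memory with a new entry; old entries are never erased."""
    best = max(memory, key=lambda e: cosine(e.embedding, new.embedding), default=None)
    if best is not None and cosine(best.embedding, new.embedding) >= SIM_THRESHOLD:
        best.lesson += "\n- " + new.lesson   # refine: fold the new lesson in
        best.support += 1
    else:
        memory.append(new)                   # grow: add distinct knowledge
```

Because entries are only appended or lightly merged, stable strategies persist across tasks, which is one simple reading of the abstract's claim about avoiding catastrophic forgetting.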

Community

yanpeng_sun (paper author, paper submitter)

ViLoMem is a plug-in dual-stream memory framework for multimodal reasoning, featuring a closed-loop Memory Cycle that enables continuous learning from reasoning and perception errors.
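As a rough illustration of that closed loop, the sketch below wires retrieval, solving, verification, and memory writing together; every callable (`retrieve_visual`, `retrieve_logical`, `solver`, `verifier`, `writer`) and the `Verdict` schema are placeholders supplied by the caller, not APIs from the ViLoMem codebase.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    """Assumed verifier output: correctness plus per-stream error attribution."""
    correct: bool
    redundant: bool               # nothing new to learn from this attempt
    visual_lesson: Optional[str]  # e.g. "was distracted by the legend, not the bars"
    logical_lesson: Optional[str] # e.g. "skipped the unit conversion step"

def memory_cycle(task,
                 visual_memory: list,
                 logical_memory: list,
                 retrieve_visual: Callable,
                 retrieve_logical: Callable,
                 solver: Callable,
                 verifier: Callable,
                 writer: Callable):
    """One pass of a hypothetical closed loop: retrieve -> solve -> verify -> update."""
    # 1. Pull guidance from both memory streams for the current task.
    visual_hints = retrieve_visual(visual_memory, task)
    logical_hints = retrieve_logical(logical_memory, task)

    # 2. The solver conditions its reasoning on the retrieved memories.
    answer, trace = solver(task, visual_hints, logical_hints)

    # 3. The verifier evaluates the attempt and attributes any error to the
    #    visual (distraction) or logical (reasoning) stream.
    verdict: Verdict = verifier(task, answer, trace)

    # 4. Redundant trajectories are filtered out; genuine lessons are written
    #    back into the corresponding memory stream.
    if not verdict.redundant:
        writer(visual_memory, logical_memory, verdict)
    return answer
```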

Key Components
(a) Memory Cycle: A closed-loop learning mechanism where both logical and visual memories are retrieved and utilized by the solver. The verifier evaluates actions to filter redundant trajectories and update both memory streams.
(b) Memory Generation: An error-attribution framework that uses an LLM for logical analysis and an MLLM for visual analysis, producing structured memory schemas through similarity-based merge and create operations.
(c) Memory Retrieval: Specialized dual-stream retrieval—visual memories undergo image-embedding retrieval followed by question-specific filtering; logical memories are retrieved through problem analysis and text-embedding similarity.
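To make the retrieval step in (c) concrete, here is one possible shape for the two retrieval paths, assuming pre-computed image and question embeddings and memory entries with `embedding` and `lesson` fields as in the earlier sketch; the keyword filter is a crude, purely illustrative stand-in for question-specific filtering.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve_visual(visual_memory, image_emb, question, top_k=3):
    """Stage 1: rank visual memories by image-embedding similarity.
    Stage 2: keep entries whose lesson shares a content word with the question
    (a simple stand-in for question-specific filtering)."""
    ranked = sorted(visual_memory,
                    key=lambda e: cosine(e.embedding, image_emb),
                    reverse=True)[:top_k]
    keywords = {w.lower() for w in question.split() if len(w) > 3}
    filtered = [e for e in ranked
                if keywords & {w.lower() for w in e.lesson.split()}]
    return filtered or ranked  # fall back to the unfiltered top-k

def retrieve_logical(logical_memory, question_emb, top_k=3):
    """Logical memories are matched by text-embedding similarity between the
    analysed problem and the stored reasoning lessons."""
    return sorted(logical_memory,
                  key=lambda e: cosine(e.embedding, question_emb),
                  reverse=True)[:top_k]
```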
