LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
Abstract
LaMP-Cap introduces a dataset for personalized figure caption generation using multimodal profiles to improve the quality of AI-generated captions.
Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite language models' personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document--each with its image, caption, and figure-mentioning paragraphs--as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TextTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation (2025)
- Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation (2025)
- Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model (2025)
- EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits (2025)
- Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline (2025)
- Generate, Not Recommend: Personalized Multimodal Content Generation (2025)
- VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper