PresentAgent: Multimodal Agent for Presentation Video Generation
Abstract
A multimodal agent transforms long-form documents into narrated presentation videos and is evaluated with a vision-language-model-based framework.
We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method moves beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and text-to-speech models, and composes the final video with precise audio-visual alignment. Because such multimodal outputs are difficult to evaluate, we introduce PresentEval, a unified assessment framework powered by vision-language models that scores videos via prompt-based evaluation along three dimensions: content fidelity, visual clarity, and audience comprehension. Experiments on a curated dataset of 30 document-presentation pairs show that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the potential of controllable multimodal agents for transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.
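To make the pipeline described in the abstract more concrete, below is a minimal, self-contained Python sketch of the document → slides → narration → video flow. Everything in it (the Section and Slide types, the blank-line segmentation heuristic, the words-per-second timing estimate, and the stubbed LLM/TTS/rendering steps) is an illustrative assumption, not the released PresentAgent implementation; see the repository linked above for the actual code.

```python
# Illustrative sketch only: stage names and heuristics are assumptions,
# not the authors' implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    title: str
    body: str


@dataclass
class Slide:
    title: str
    bullets: List[str]
    narration: str  # spoken script for this slide


def segment_document(text: str) -> List[Section]:
    """Split the input document into coarse sections (here: blank-line blocks)."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    return [Section(title=b.splitlines()[0][:60], body=b) for b in blocks]


def plan_slide(section: Section) -> Slide:
    """Plan one slide per section. A real system would prompt an LLM here;
    this stub turns sentences into bullets and reuses the body as narration."""
    sentences = [s.strip() for s in section.body.replace("\n", " ").split(".") if s.strip()]
    return Slide(title=section.title, bullets=sentences[:4], narration=section.body)


def estimate_narration_seconds(narration: str) -> float:
    """Placeholder for TTS: a real pipeline would synthesize audio and use its
    true duration for audio-visual alignment; here we assume ~150 words/minute."""
    return len(narration.split()) / 2.5


def compose_video(slides: List[Slide]) -> None:
    """Placeholder composition: show each slide for the length of its narration
    and concatenate (in practice, rendered frames plus audio, e.g. via ffmpeg)."""
    t = 0.0
    for i, slide in enumerate(slides):
        dur = estimate_narration_seconds(slide.narration)
        print(f"[{t:7.1f}s] slide {i}: {slide.title!r} narrated for {dur:.1f}s")
        t += dur
    print(f"total video length ~ {t:.1f}s")


if __name__ == "__main__":
    doc = (
        "PresentAgent Overview\n"
        "PresentAgent turns long-form documents into narrated presentation videos.\n\n"
        "Pipeline\n"
        "It segments the document, plans slide frames, generates narration with an LLM "
        "and a TTS model, and composes the final video with audio-visual alignment."
    )
    slides = [plan_slide(sec) for sec in segment_document(doc)]
    compose_video(slides)
```

Running the sketch prints a timeline of slides with their estimated narration durations; swapping the stubs for real LLM, TTS, and rendering calls is where an actual system would do the work.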
Community
Want to create a presentation video for an oral talk but tired of recording?
With just one click, generate a presentation video using PresentAgent!
Our code is open-sourced, and an online demo is available here:
https://github.com/AIGeeksGroup/PresentAgent
Paper: https://arxiv.org/abs/2507.04036
Have fun exploring it!