Papers
arxiv:2308.06685

Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion

Published on Aug 13, 2023
Authors:
,
,

Abstract

The application of video captioning models aims at translating the content of videos by using accurate natural language. Due to the complex nature inbetween object interaction in the video, the comprehensive understanding of spatio-temporal relations of objects remains a challenging task. Existing methods often fail in generating sufficient feature representations of video content. In this paper, we propose a video captioning model based on dual graphs and gated fusion: we adapt two types of graphs to generate feature representations of video content and utilize gated fusion to further understand these different levels of information. Using a dual-graphs model to generate appearance features and motion features respectively can utilize the content correlation in frames to generate various features from multiple perspectives. Among them, dual-graphs reasoning can enhance the content correlation in frame sequences to generate advanced semantic features; The gated fusion, on the other hand, aggregates the information in multiple feature representations for comprehensive video content understanding. The experiments conducted on worldly used datasets MSVD and MSR-VTT demonstrate state-of-the-art performance of our proposed approach.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2308.06685 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2308.06685 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2308.06685 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.