What libraries can I use for Text-to-Video?

The diffusers library is compatible with Text-to-Video.

What models can I use for Text-to-Video?

The tencent/HunyuanVideo, Lightricks/LTX-Video, nvidia/Cosmos-1.0-Diffusion-7B-Text2World, and Lightricks/LTX-Video-0.9.8-13B-distilled models can be used for Text-to-Video.

What datasets can I use for Text-to-Video?

The iejMac/CLIP-MSR-VTT, quchenyuan/UCF101-ZIP, nateraw/kinetics, HuggingFaceM4/something_something_v2, HuggingFaceM4/webvid, and iejMac/CLIP-DiDeMo datasets can be used for Text-to-Video.

What metrics can I use for Text-to-Video?

The is, fid, fvd, and clipsim metrics can be used for Text-to-Video.

Tasks

Text-to-Video

Text-to-video models can be used in any application that requires generating consistent sequence of images from text.

Inputs

Input

Darth Vader is surfing on the waves.

Text-to-Video Model

Output

About Text-to-Video

Use Cases

Script-based Video Generation

Text-to-video models can be used to create short-form video content from a provided text script. These models can be used to create engaging and informative marketing videos. For example, a company could use a text-to-video model to create a video that explains how their product works.

Content format conversion

Text-to-video models can be used to generate videos from long-form text, including blog posts, articles, and text files. Text-to-video models can be used to create educational videos that are more engaging and interactive. An example of this is creating a video that explains a complex concept from an article.

Voice-overs and Speech

Text-to-video models can be used to create an AI newscaster to deliver daily news, or for a film-maker to create a short film or a music video.

Task Variants

Text-to-video models have different variants based on inputs and outputs.

Text-to-video Editing

One text-to-video task is generating text-based video style and local attribute editing. Text-to-video editing models can make it easier to perform tasks like cropping, stabilization, color correction, resizing and audio editing consistently.

Text-to-video Search

Text-to-video search is the task of retrieving videos that are relevant to a given text query. This can be challenging, as videos are a complex medium that can contain a lot of information. By using semantic analysis to extract the meaning of the text query, visual analysis to extract features from the videos, such as the objects and actions that are present in the video, and temporal analysis to categorize relationships between the objects and actions in the video, we can determine which videos are most likely to be relevant to the text query.

Text-driven Video Prediction

Text-driven video prediction is the task of generating a video sequence from a text description. Text description can be anything from a simple sentence to a detailed story. The goal of this task is to generate a video that is both visually realistic and semantically consistent with the text description.

Video Translation

Text-to-video translation models can translate videos from one language to another or allow to query the multilingual text-video model with non-English sentences. This can be useful for people who want to watch videos in a language that they don't understand, especially when multi-lingual captions are available for training.

Inference

Contribute an inference snippet for text-to-video here!

Useful Resources

In this area, you can insert useful resources about how to train or use a model for this task.

Text-to-Video: The Task, Challenges and the Current State

Compatible libraries

Diffusers

using Wan-AI/Wan2.2-TI2V-5B

Models for Text-to-Video

Browse Models (2,142)

tencent/HunyuanVideo

Text-to-Video • Updated Mar 6, 2025 • 831 • • 2.2k

Note A strong model for consistent video generation.

Lightricks/LTX-Video

Image-to-Video • Updated Jul 16, 2025 • 449k • 2.22k

Note A text-to-video model with high fidelity motion and strong prompt adherence.

nvidia/Cosmos-1.0-Diffusion-7B-Text2World

Text-to-Video • Updated May 7, 2025 • 2.43k • 234

Note A text-to-video model focusing on physics-aware applications like robotics.

Lightricks/LTX-Video-0.9.8-13B-distilled

Image-to-Video • Updated Jul 17, 2025 • 2.27k • • 30

Note Very fast model for video generation.

Datasets for Text-to-Video

Browse Datasets (365)

iejMac/CLIP-MSR-VTT

Preview • Updated Oct 31, 2022 • 111

Note Microsoft Research Video to Text is a large-scale dataset for open domain video captioning

quchenyuan/UCF101-ZIP

Viewer • Updated Apr 26, 2023 • 13.3k • 596 • 6

Note UCF101 Human Actions dataset consists of 13,320 video clips from YouTube, with 101 classes.

nateraw/kinetics

Viewer • Updated Jun 16, 2022 • 4k • 163 • 1

Note A high-quality dataset for human action recognition in YouTube videos.

HuggingFaceM4/something_something_v2

Updated Oct 20, 2022 • 208 • 21

Note A dataset of video clips of humans performing pre-defined basic actions with everyday objects.

HuggingFaceM4/webvid

private

Updated May 13, 2022 • 13

Note This dataset consists of text-video pairs and contains noisy samples with irrelevant video descriptions

iejMac/CLIP-DiDeMo

Preview • Updated Oct 21, 2022 • 25 • 1

Note A dataset of short Flickr videos for the temporal localization of events with descriptions.

Spaces using Text-to-Video

🌍

VideoCrafter/VideoCrafter

Note An application that generates video from text.

💻

Wan-AI/Wan2.1

Note Consistent video generation application.

⚱️

Pyramid-Flow/pyramid-flow

Note A cutting edge video generation application.

Metrics for Text-to-Video

is: Inception Score uses an image classification model that predicts class labels and evaluates how distinct and diverse the images are. A higher score indicates better video generation.

fid: Frechet Inception Distance uses an image classification model to obtain image embeddings. The metric compares mean and standard deviation of the embeddings of real and generated images. A smaller score indicates better video generation.

fvd: Frechet Video Distance uses a model that captures coherence for changes in frames and the quality of each frame. A smaller score indicates better video generation.

clipsim: CLIPSIM measures similarity between video frames and text using an image-text similarity model. A higher score indicates better video generation.