What libraries can I use for Text-to-Image?

The diffusers library is compatible with Text-to-Image.

What models can I use for Text-to-Image?

The black-forest-labs/FLUX.1-Krea-dev, Qwen/Qwen-Image, ByteDance/SDXL-Lightning, and ByteDance/Hyper-SD models can be used for Text-to-Image.

What datasets can I use for Text-to-Image?

The red_caps, conceptual_captions, and Spawning/PD12M datasets can be used for Text-to-Image.

What metrics can I use for Text-to-Image?

The IS, FID, and R-Precision metrics can be used for Text-to-Image.


Text-to-Image

Text-to-image is the task of generating images from input text. These pipelines can also be used to modify and edit images based on text prompts.

Example input: A city above clouds, pastel colors, Victorian style

Example output: an image generated by a text-to-image model (image not shown)

About Text-to-Image

Use Cases

Data Generation

Businesses can generate data for their use cases by inputting text and getting image outputs.

Immersive Conversational Chatbots

Chatbots can be made more immersive if they provide contextual images based on the input provided by the user.

Creative Ideas for Fashion Industry

Different patterns can be generated to obtain unique pieces of fashion. Text-to-image models make it easier for designers to conceptualize a design before actually implementing it.

Architecture Industry

Architects can use the models to construct an environment based on the requirements of the floor plan. This can also include the furniture that has to be placed in that environment.

Task Variants

Image Editing

Image editing with text-to-image models involves modifying an image following edit instructions provided in a text prompt.

Synthetic image editing: Adjusting images that were initially created using an input prompt while preserving the overall meaning or context of the original image.

Figure taken from "InstructPix2Pix: Learning to Follow Image Editing Instructions"

Real image editing: Similar to synthetic image editing, except that it operates on real photos or images. This task is usually more complex.

Figure taken from "Prompt-to-Prompt Image Editing with Cross-Attention Control"

Personalization

Personalization refers to techniques used to customize text-to-image models. We introduce new subjects or concepts to the model, which the model can then generate when we refer to them with a text prompt.

For example, you can use these techniques to generate images of your dog in imaginary settings, after you have taught the model using a few reference images of the subject (or just one in some cases). Teaching the model a new concept can be achieved through fine-tuning, or by using training-free techniques.

Inference

You can use diffusers pipelines to run inference with text-to-image models.

import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

model_id = "stabilityai/stable-diffusion-2"
# Load the scheduler configuration shipped with the model repository.
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
# Load the pipeline in half precision to reduce GPU memory use.
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # requires a CUDA-capable GPU

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  # first generated image
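The pipeline call above also accepts generation parameters. As a minimal sketch, the hypothetical helper below (`generate_image` is not part of diffusers) forwards three keyword arguments that `StableDiffusionPipeline` accepts when called: `negative_prompt`, `num_inference_steps` and `guidance_scale`. The default values are illustrative assumptions, not recommendations from this page.

```python
def generate_image(pipe, prompt, negative_prompt=None,
                   num_inference_steps=30, guidance_scale=7.5):
    """Hypothetical convenience wrapper around a diffusers-style pipeline.

    `pipe` is assumed to be a loaded text-to-image pipeline (for example the
    `pipe` object created above) whose call returns an object with `.images`.
    """
    result = pipe(
        prompt,
        negative_prompt=negative_prompt,          # what the image should avoid
        num_inference_steps=num_inference_steps,  # more steps: slower, often finer detail
        guidance_scale=guidance_scale,            # higher: follows the prompt more closely
    )
    return result.images[0]


# Example usage, assuming `pipe` was loaded as shown above:
# image = generate_image(pipe, "a photo of an astronaut riding a horse on mars",
#                        negative_prompt="blurry")
# image.save("astronaut.png")
```

Keeping the pipeline as an argument means the same helper can be reused with other pipelines that accept these keyword arguments.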

You can use huggingface.js to run inference with text-to-image models hosted on the Hugging Face Hub.

import { InferenceClient } from "@huggingface/inference";

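// HF_TOKEN is assumed to hold your Hugging Face access token.
const inference = new InferenceClient(HF_TOKEN);
// Keep the result so the generated image can be saved or displayed.
const image = await inference.textToImage({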
    model: "stabilityai/stable-diffusion-2",
    inputs: "award winning high resolution photo of a giant tortoise/((ladybird)) hybrid, [trending on artstation]",
    parameters: {
        negative_prompt: "blurry",
    },
});

Useful Resources

Model Inference

Model Fine-tuning

Finetune Stable Diffusion Models with DDPO via TRL
LoRA training scripts of the world, unite!
Using LoRA for Efficient Stable Diffusion Fine-Tuning
LoRA fine tuning Spaces: FLUX.1 finetuning, SDXL finetuning

This page was made possible thanks to the efforts of Ishan Dutta, Enrique Elias Ubaldo and Oğuz Akif.


Compatible libraries

Diffusers

using black-forest-labs/FLUX.1-dev

Models for Text-to-Image

Browse Models (91,651)

black-forest-labs/FLUX.1-Krea-dev

Text-to-Image • Updated Jul 31, 2025 • 10.1k • 842

Note One of the most powerful image generation models that can generate realistic outputs.

Qwen/Qwen-Image

Text-to-Image • Updated Aug 18, 2025 • 161k • 2.39k

Note A powerful image generation model.

ByteDance/SDXL-Lightning

Text-to-Image • Updated Apr 3, 2024 • 40k • 2.13k

Note Powerful and fast image generation model.

ByteDance/Hyper-SD

Text-to-Image • Updated Dec 5, 2024 • 61k • 1.33k

Note A powerful text-to-image model.

Datasets for Text-to-Image

Browse Datasets (5,462)

Spawning/PD12M

Viewer • Updated Jan 9, 2025 • 12.4M • 997 • 169

Note 12M image-caption pairs.

Spaces using Text-to-Image

🎨

stabilityai/stable-diffusion-3-medium

Note A powerful text-to-image application.

👩‍🎨

jbilcke-hf/ai-comic-factory

Note A text-to-image application to generate comics.

🧪

multimodalart/flux-lora-lab

Note An application to match multiple custom image generation models.

📚

latent-consistency/lcm-lora-for-sdxl

Note A powerful yet very fast image generation application.

🔎 🖼️

multimodalart/LoraTheExplorer

Note A gallery to explore various text-to-image models.

😻

InstantX/InstantID

Note An application to generate realistic images given photos of a person and a prompt.

Metrics for Text-to-Image

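IS: The Inception Score (IS) assesses the diversity and meaningfulness of generated images. It passes generated samples through a pretrained Inception classifier and compares the label distribution predicted for each image with the label distribution over the whole sample. A higher score signifies more diverse and meaningful images.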
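As a rough illustration of the Inception Score, the sketch below computes the standard formula exp(mean KL(p(y|x) || p(y))) from class probabilities that are assumed to have already been produced by a classifier. The function name `inception_score` and the toy probabilities are illustrative assumptions; a real evaluation would use predictions from a pretrained Inception network over many images.

```python
import math


def inception_score(probs):
    """Inception Score from per-image class probabilities.

    probs: list of lists, one row per generated image, each row a
    probability distribution over classes (assumed to sum to 1).
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal label distribution p(y): average prediction over all images.
    marginal = [sum(row[k] for row in probs) / n for k in range(num_classes)]
    # Mean KL divergence between each p(y|x) and the marginal p(y).
    total_kl = 0.0
    for row in probs:
        total_kl += sum(
            p * math.log(p / marginal[k]) for k, p in enumerate(row) if p > 0
        )
    return math.exp(total_kl / n)


# Confident predictions spread over two classes score higher than
# identical, uncertain predictions.
diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
uniform = inception_score([[0.5, 0.5], [0.5, 0.5]])
print(diverse, uniform)
```

The score is bounded below by 1 (every image gets the same prediction) and above by the number of classes (confident predictions spread evenly over all classes).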

FID: The Fréchet Inception Distance (FID) calculates the distance between the distributions of synthetic and real samples. A lower FID score indicates better similarity between the distributions of real and generated images.
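To make the FID definition concrete, the sketch below computes the Fréchet distance between two Gaussians fitted to feature vectors. It is a simplification and not the full metric: it assumes diagonal covariances (only per-dimension variances), whereas the usual FID uses full covariance matrices of Inception features and a matrix square root. The names `feature_stats` and `frechet_distance_diag` are hypothetical.

```python
import math


def feature_stats(features):
    """Per-dimension mean and (unbiased) variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / (n - 1) for d in range(dim)]
    return mean, var


def frechet_distance_diag(mu_real, var_real, mu_gen, var_gen):
    """Frechet distance between two Gaussians with DIAGONAL covariances.

    General formula: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 * S2)).
    With diagonal S1 and S2, the trace term reduces to a per-dimension sum.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_real, mu_gen))
    cov_term = sum(
        v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var_real, var_gen)
    )
    return mean_term + cov_term


# Toy features standing in for Inception activations (illustrative values only).
real = [[0.0, 0.0], [2.0, 4.0]]
generated = [[1.0, 1.0], [3.0, 5.0]]
mu_r, var_r = feature_stats(real)
mu_g, var_g = feature_stats(generated)
print(frechet_distance_diag(mu_r, var_r, mu_g, var_g))
```

Identical statistics give a distance of zero, and the distance grows as the means or variances of the two feature sets drift apart.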

R-Precision: R-precision assesses how well the generated image aligns with the provided text description. It uses the generated images as queries to retrieve text descriptions from a pool of candidates. For each image, the top R retrieved descriptions are examined, where R is the number of ground-truth descriptions associated with that image; if r of them are relevant, R-precision is r/R. A higher R-precision value indicates a better model.
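A minimal sketch of the R-precision computation follows, assuming the retrieval step has already produced a ranked list of candidate description IDs for each generated image (most similar first). The function names and the toy IDs are hypothetical; in practice the ranking would come from an image-text similarity model.

```python
def r_precision(ranked_ids, relevant_ids):
    """R-precision for one query image.

    ranked_ids: candidate description IDs, sorted from most to least similar.
    relevant_ids: set of ground-truth description IDs for this image.
    """
    big_r = len(relevant_ids)
    if big_r == 0:
        raise ValueError("each query needs at least one ground-truth description")
    # Count how many of the top-R retrieved descriptions are ground truth.
    hits = sum(1 for cand in ranked_ids[:big_r] if cand in relevant_ids)
    return hits / big_r


def mean_r_precision(queries):
    """Average R-precision over (ranked_ids, relevant_ids) pairs."""
    scores = [r_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores)


# Toy example: two ground-truth captions, one of which is ranked in the top 2.
score = r_precision(["cap_a", "cap_x", "cap_b", "cap_y"], {"cap_a", "cap_b"})
print(score)
```

Averaging the per-image values over all generated images gives a single score for the model.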