Model Card for Cityscapes ControlNet with Stable Diffusion v1.5
This model is a fine-tuned version of ControlNet built on top of Stable Diffusion v1.5, specifically conditioned on semantic segmentation maps from the Cityscapes dataset. It enables structure-aware image generation by combining natural language prompts with dense pixel-level guidance in the form of segmentation masks. The result is highly controllable generation of realistic urban street scenes that align with both spatial layouts and descriptive context.
Model Description
Base Model: stable-diffusion-v1-5/stable-diffusion-v1-5
Control Type: Semantic segmentation maps (Cityscapes-style RGB masks)
Architecture: U-Net + ControlNet adapter + Variational Autoencoder (VAE) + CLIP Text Encoder (ViT-L/14)
Training Epochs: 50 full passes over the training data
Training Dataset: 3475 annotated image-label pairs from the Cityscapes dataset (train + val)
Resolution: Trained at 256×256 resolution
Hardware: NVIDIA A100 40GB GPU; total training time was approximately 2 hours
Loss Function: Mean Squared Error (MSE) between predicted and true noise vectors (used in DDPM training)
Only the ControlNet branch was trained; the base Stable Diffusion weights were kept frozen. This setup preserves the prior knowledge of the original diffusion model while specializing its structural conditioning on segmentation maps.
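For reference, the noise-prediction objective listed above can be written in the notation of the ControlNet paper. The symbols are not defined in this model card and are assumptions for illustration: $z_0$ is the VAE latent of a training image, $z_t$ its noised version at timestep $t$, $c_t$ the text-prompt embedding, $c_f$ the segmentation-map condition, and $\epsilon_\theta$ the noise predictor. With the base weights frozen, only the ControlNet parameters in $\theta$ would receive gradients.

```latex
\mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0, I)}
\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \right\rVert_2^2 \right]
```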
Usage
This model is available via the diffusers library. Here's how to load and use it:
import torch
from diffusers import StableDiffusionControlNetPipeline
from PIL import Image

# Load the pipeline from the Hugging Face Hub
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "doguilmak/cityscapes-controlnet-sd15",
    torch_dtype=torch.float32,
    safety_checker=None,  # disables the built-in safety checker
)
pipe.to("cuda")

# Load your segmentation map (RGB format expected)
control = Image.open("segmentation_map.png").convert("RGB")

# Run generation; in this pipeline, `image` is the ControlNet conditioning input
result = pipe(
    prompt="a detailed urban street, cinematic lighting",
    negative_prompt="blurry, distorted",
    image=control,
    num_inference_steps=50,
    guidance_scale=9,
    output_type="pil",
).images[0]
result.save("result.png")
Example Outputs
Input Segmentation Map
Limitations
The model was trained at 256×256 resolution; inference at higher resolutions may produce artifacts unless inputs are resized to the training resolution.
It performs best on scenes that resemble urban environments, such as city streets and buildings.
The input control image must closely resemble Cityscapes segmentation formats (classes and layout).
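The resolution and format limitations above suggest preprocessing the control image before inference. The following is a minimal sketch, not the author's pipeline: it assumes the standard 19-class Cityscapes color palette (the model card does not state which colors were used in training), and the helper names `resize_nearest`, `off_palette_fraction`, and `snap_to_palette` are hypothetical. It works on a plain row-major grid of RGB tuples so that it has no dependencies.

```python
# Assumption: standard 19-class Cityscapes palette plus "unlabeled".
# The model card does not confirm that training used exactly these colors.
CITYSCAPES_PALETTE = {
    "road": (128, 64, 128),
    "sidewalk": (244, 35, 232),
    "building": (70, 70, 70),
    "wall": (102, 102, 156),
    "fence": (190, 153, 153),
    "pole": (153, 153, 153),
    "traffic light": (250, 170, 30),
    "traffic sign": (220, 220, 0),
    "vegetation": (107, 142, 35),
    "terrain": (152, 251, 152),
    "sky": (70, 130, 180),
    "person": (220, 20, 60),
    "rider": (255, 0, 0),
    "car": (0, 0, 142),
    "truck": (0, 0, 70),
    "bus": (0, 60, 100),
    "train": (0, 80, 100),
    "motorcycle": (0, 0, 230),
    "bicycle": (119, 11, 32),
    "unlabeled": (0, 0, 0),
}
ALLOWED_COLORS = set(CITYSCAPES_PALETTE.values())


def resize_nearest(grid, new_w, new_h):
    """Nearest-neighbor resize of a row-major grid of RGB tuples.

    Nearest-neighbor sampling copies existing pixels and never blends them,
    so every output color is a class color that was already in the input.
    """
    old_h, old_w = len(grid), len(grid[0])
    return [
        [grid[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]


def off_palette_fraction(grid):
    """Fraction of pixels whose color is not in the assumed palette."""
    pixels = [px for row in grid for px in row]
    return sum(px not in ALLOWED_COLORS for px in pixels) / len(pixels)


def snap_to_palette(grid):
    """Replace each pixel with the closest palette color (squared RGB distance)."""
    def nearest(px):
        return min(
            ALLOWED_COLORS,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(px, c)),
        )
    return [[nearest(px) for px in row] for row in grid]
```

In practice the same idea applies to a Pillow image: resizing the segmentation map to 256×256 with a nearest-neighbor filter avoids the blended boundary colors that bilinear or bicubic resampling would introduce, and a high off-palette fraction is a hint that the map does not follow the assumed color coding.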
License
The Stable Diffusion base model is distributed under the CreativeML Open RAIL-M license, which allows commercial and non-commercial use with certain restrictions.
Our model is distributed under the MIT license.
References
ControlNet Segmentation Model: lllyasviel/sd-controlnet-seg @ Hugging Face
ControlNet Paper: L. Zhang, A. Rao, and M. Agrawala, “Adding Conditional Control to Text-to-Image Diffusion Models,” arXiv preprint arXiv:2302.05543, 2023.