Model Card for Facades ControlNet with Stable Diffusion v1.5
This model is a fine-tuned version of ControlNet built on top of Stable Diffusion v1.5, specifically conditioned on semantic segmentation maps from the Facades dataset. It enables structure-aware image generation by combining natural language prompts with pixel-level guidance in the form of building façade segmentation masks. The result is highly controllable generation of realistic architectural scenes that reflect both structural layout and textual context.
Model Description
Base Model: stable-diffusion-v1-5/stable-diffusion-v1-5
Control Type: Semantic segmentation maps (Facades-style RGB masks)
Architecture: U-Net + ControlNet adapter + Variational Autoencoder (VAE) + CLIP Text Encoder (ViT-L/14)
Training Epochs: 30 full passes over the training data
Training Dataset: Facades dataset
Resolution: Trained at 512×512 resolution
Hardware: NVIDIA A100 40GB GPU; total training time was approximately 1 hour
Loss Function: Mean Squared Error (MSE) between predicted and true noise vectors (used in DDPM training)
The ControlNet branches were trained while freezing the base Stable Diffusion weights. This retains the generative capabilities of the original model while specializing it to generate façade-aligned structures.
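The loss listed above is the usual noise-prediction objective for latent diffusion. As a sketch, it can be written in the notation of the ControlNet paper. The symbols are assumptions introduced here, not taken from this card: $z_t$ is the noisy latent at timestep $t$, $c_t$ the text-prompt conditioning, $c_f$ the segmentation-map conditioning, $\epsilon$ the sampled Gaussian noise, and $\epsilon_\theta$ the network's noise prediction.

```latex
\mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0, I)}
\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \right\rVert_2^2 \right]
```

With the base Stable Diffusion weights frozen, only the ControlNet parameters inside $\epsilon_\theta$ receive gradient updates from this loss.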
Usage
This model is available via the diffusers library. Here's how to load and use it:
import torch
from diffusers import StableDiffusionControlNetPipeline
from PIL import Image

# Load the pipeline from the Hugging Face Hub
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "doguilmak/facade-controlnet-sd15",
    torch_dtype=torch.float32,
    safety_checker=None,
)
pipe.to("cuda")

# Load your segmentation map (RGB format expected)
control = Image.open("facades_segmentation_map.png").convert("RGB")

# Run generation; `image` is the ControlNet conditioning input
result = pipe(
    prompt="a modern building with large glass windows",
    negative_prompt="blurry, distorted",
    image=control,
    num_inference_steps=50,
    guidance_scale=9,
    output_type="pil",
).images[0]
result.save("facade_result.png")
Example Outputs
These examples illustrate the model’s ability to generate photorealistic urban scenes guided by semantic segmentation maps. The outputs demonstrate strong spatial alignment between the input masks and the synthesized content.
Limitations
The model was trained at 512×512 resolution; using higher-resolution inputs without resizing them may cause artifacts.
It performs best on scenes resembling architectural façades.
The control image should resemble Facades-style segmentation formats for optimal results.
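Since the model expects 512×512 Facades-style inputs, one way to prepare an arbitrary segmentation map is to center-crop it to a square and resize it with nearest-neighbor resampling, which avoids blending label colors at region boundaries. The following is a minimal sketch under those assumptions. The helper names `center_square_box` and `prepare_control_image` are hypothetical and not part of this model or of diffusers, and center-cropping is only one possible choice (padding is another).

```python
from typing import Tuple


def center_square_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return the (left, top, right, bottom) box of the largest centered square."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def prepare_control_image(path: str, size: int = 512):
    """Load a segmentation map, center-crop it to a square, and resize it.

    Hypothetical helper: nearest-neighbor resampling is an assumption made
    here so that label colors are not blended at region boundaries.
    """
    from PIL import Image  # imported lazily; requires Pillow

    image = Image.open(path).convert("RGB")
    image = image.crop(center_square_box(*image.size))
    # Pillow >= 9.1 exposes resampling filters under Image.Resampling;
    # older versions keep them directly on the Image module.
    nearest = getattr(Image, "Resampling", Image).NEAREST
    return image.resize((size, size), nearest)
```

The returned image can then be passed as the `image` argument of the pipeline call in the Usage section, for example `control = prepare_control_image("facades_segmentation_map.png")`.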
License
The Stable Diffusion base model is distributed under the CreativeML Open RAIL-M license.
Our model is distributed under the MIT license.
References
ControlNet Segmentation Model: lllyasviel/sd-controlnet-seg @ Hugging Face
ControlNet Paper: L. Zhang, A. Rao, and M. Agrawala, “Adding Conditional Control to Text-to-Image Diffusion Models,” arXiv preprint arXiv:2302.05543, 2023.
Facades Dataset: Facades Dataset (Kaggle)
Model tree for doguilmak/facade-controlnet-sd15
Base model: runwayml/stable-diffusion-v1-5
Evaluation results
- Mean Squared Error on CMP Facades Dataset (custom evaluation): 0.018