Kalaido-qwen-image-lora — Reinforcement Learning Enhanced Qwen-Image

🌟 Introduction

Kalaido-qwen-image-lora is a LoRA fine-tune of the original Qwen-Image model, trained with state-of-the-art RL techniques to build upon the strong foundation of Qwen-Image.
The resulting model demonstrates:

Sharper and more readable text rendering.
Better aesthetic composition and lighting balance.
Improved semantic alignment between textual prompts and visual generations.

⚙️ Example Usage

Install the latest version of diffusers

pip install git+https://github.com/huggingface/diffusers

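Note: The LoRA works best when it is applied to only part of the denoising process. In the example below it is therefore active for only the first 10 of the 50 denoising steps, after which it is disabled and the base model completes the generation.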

import torch
from diffusers import QwenImagePipeline, QwenImageTransformer2DModel

model_id = "Qwen/Qwen-Image"  # Replace with your HF model ID
lora_id = 'FractalAIResearch/Kalaido-qwen-image-lora'

# Load the base model
transformer = QwenImageTransformer2DModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, subfolder="transformer"
)

# Load the pipeline
pipe = QwenImagePipeline.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, transformer=transformer
)
# Attach the LoRA weights from the repository named in lora_id
pipe.load_lora_weights(lora_id, weight_name="pytorch_lora_weights.safetensors", adapter_name="aes")

pipe.to("cuda")

pipe.enable_vae_tiling()
pipe.enable_vae_slicing()

prompt = "A blackboard that says 'AI research FRACTAL'"
negative_prompt = " "  # use a single space if there is no specific concept to remove

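# Step-end callback: the pipeline calls it as callback(pipeline, step_index, timestep, callback_kwargs)
def callback_on_step_end(self, i, t, callback_kwargs):
    # Turn the LoRA off at step index 10 so the remaining steps run on the base model
    if i == 10:
        self.disable_lora()
    return {}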

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=50,
    true_cfg_scale=4.0,
    generator=torch.Generator(device='cuda').manual_seed(42),
    callback_on_step_end = callback_on_step_end
).images[0]

image.save("output.png")
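The cutoff logic in the callback above can also be written as a small reusable factory. The following is a minimal sketch, not part of the released code: the names `make_lora_cutoff_callback` and `FakePipeline` are hypothetical, and the fake pipeline exists only so the callback can be exercised without loading a model. It assumes the diffusers convention that the callback fires at the end of each step with a zero-based step index.

```python
def make_lora_cutoff_callback(cutoff_step):
    """Return a step-end callback that disables the LoRA at `cutoff_step`."""
    def callback_on_step_end(pipeline, step_index, timestep, callback_kwargs):
        # step_index is zero-based; the callback runs after the step has finished.
        if step_index == cutoff_step:
            pipeline.disable_lora()
        # Hand the kwargs back unchanged so the pipeline keeps its current tensors.
        return callback_kwargs
    return callback_on_step_end


class FakePipeline:
    """Hypothetical stand-in that only tracks whether the LoRA is enabled."""
    def __init__(self):
        self.lora_enabled = True

    def disable_lora(self):
        self.lora_enabled = False


fake = FakePipeline()
callback = make_lora_cutoff_callback(cutoff_step=10)

history = []
for step in range(50):
    history.append(fake.lora_enabled)  # LoRA state seen by this denoising step
    callback(fake, step, None, {})

print(f"Steps run with the LoRA enabled: {sum(history)}")
```

Under that convention, `cutoff_step=10` leaves the LoRA active for step indices 0 through 10. If exactly ten LoRA steps are wanted, `cutoff_step=9` would be the value to pass.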

🧪 Evaluation

Kalaido-qwen-image-lora was evaluated on multiple benchmarks that measure text rendering, visual aesthetics, and alignment with human preferences. For all evaluations, the LoRA is applied for only 10 steps:

OneIG-Bench: A comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including subject-element alignment, text rendering precision, reasoning-generated content, stylization, and diversity. Evaluation was conducted only on the English subset of OneIG-Bench.

Model	Alignment	Text
FLUX.1 [Dev]	0.786	0.523
HiDream-I1-Full	0.829	0.707
Seedream 3.0	0.818	0.865
GPT Image 1 [High]	0.851	0.857
Qwen-Image	0.882	0.891
Kalaido-qwen-image-lora	0.889	0.979

LongText-Bench: Proposed in X-Omni, LongText-Bench evaluates how well models render longer texts in both English and Chinese. We evaluate our model only on the English subset of this benchmark.

Model	LongText-Bench-EN
HiDream-I1-Full (Cai et al., 2025)	0.543
FLUX.1 [Dev] (BlackForest, 2024)	0.607
Seedream 3.0 (Gao et al., 2025)	0.896
GPT Image 1 [High] (OpenAI, 2025)	0.956
Qwen-Image	0.935
Kalaido-qwen-image-lora	0.939

Aesthetic score: 1,000 prompts were randomly sampled from the HPSv3 test set.
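A reproducible way to draw such a subset might look like the sketch below. The seed value, the sampling without replacement, and the placeholder prompt list are all assumptions for illustration; the source does not state how the sampling was implemented.

```python
import random


def sample_prompts(prompts, k=1000, seed=0):
    """Draw k distinct prompts uniformly at random, reproducibly."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return rng.sample(prompts, k)


# Hypothetical stand-in for the test-set prompts (not the real data).
all_prompts = [f"prompt {i}" for i in range(5000)]
subset = sample_prompts(all_prompts)
print(len(subset))
```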

Model	Aesthetic Score
Qwen-Image	6.62
FLUX.1 [Dev]	6.71
HiDream-I1-Full	6.70
GPT-Image-1	6.79
Kalaido-qwen-image-lora	6.88
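To put the differences from the Qwen-Image baseline in perspective, the short sketch below computes absolute and relative gains from the scores reported in the tables above. The helper name `gains` is hypothetical, and the numbers are copied from the tables rather than re-measured.

```python
# (Qwen-Image baseline, Kalaido-qwen-image-lora), copied from the tables above.
scores = {
    "OneIG Alignment": (0.882, 0.889),
    "OneIG Text": (0.891, 0.979),
    "LongText-Bench-EN": (0.935, 0.939),
    "Aesthetic Score": (6.62, 6.88),
}


def gains(baseline, tuned):
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = tuned - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 3), round(relative_pct, 2)


summary = {name: gains(b, t) for name, (b, t) in scores.items()}
for name, (absolute, relative_pct) in summary.items():
    print(f"{name}: +{absolute:.3f} absolute, +{relative_pct:.2f}% relative")
```

By this arithmetic the largest relative change is on the OneIG text score, while the LongText-Bench-EN and OneIG alignment scores move by less than one percent.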

Qualitative comparisons

The figure below compares image generations from the baseline Qwen-Image model (left) and our Kalaido-qwen-image-lora model (right). Kalaido-qwen-image-lora consistently produces outputs with improved aesthetics and better semantic alignment to the given prompts.

Comparison Grid 1 Comparison Grid 2