# DFloat11 Compressed Model: HiDream-ai/HiDream-I1-Full
This is a DFloat11 losslessly compressed version of the original HiDream-ai/HiDream-I1-Full model. It reduces the model size by about 30% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.
🔥🔥🔥 Thanks to DFloat11 compression, HiDream-I1-Full can now run smoothly on a single 32GB GPU without any quality loss. 🔥🔥🔥
## 📊 Performance Comparison
| Metric | HiDream-I1-Full (BFloat16) | HiDream-I1-Full (DFloat11) |
|---|---|---|
| Model Size | 34.21 GB | 24.19 GB |
| Peak GPU Memory (1024×1024 image generation) | 35.61 GB | 26.42 GB |
| Generation Time (A100 GPU) | 140 seconds | 161 seconds |
## 🔧 How to Use
Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):

```bash
pip install -U dfloat11[cuda12]
# or if you have CUDA version 11:
# pip install -U dfloat11[cuda11]
```
Install or upgrade the diffusers library:

```bash
pip install -U diffusers
```
To use the DFloat11 model, run the following example code in Python:
```python
import torch
from transformers import AutoTokenizer
from diffusers import HiDreamImagePipeline
from dfloat11 import DFloat11Model

# Load the DFloat11-compressed Llama-3.1-8B text encoder on CPU.
tokenizer_4 = AutoTokenizer.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11")
text_encoder_4 = DFloat11Model.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11", device="cpu")
text_encoder_4.config.output_hidden_states = True
text_encoder_4.config.output_attentions = True

# Build the HiDream pipeline with the compressed text encoder.
pipe = HiDreamImagePipeline.from_pretrained(
    "HiDream-ai/HiDream-I1-Full",
    tokenizer_4=tokenizer_4,
    text_encoder_4=text_encoder_4,
    torch_dtype=torch.bfloat16,
)

# Swap in the DFloat11-compressed transformer weights.
DFloat11Model.from_pretrained(
    "DFloat11/HiDream-I1-Full-DF11",
    device="cpu",
    bfloat16_model=pipe.transformer,
)

# Offload idle components to CPU to reduce peak GPU memory.
pipe.enable_model_cpu_offload()

image = pipe(
    "A cat wearing a vintage astronaut suit, floating inside a spaceship and gazing out the window at Earth.",
    height=1024,
    width=1024,
    guidance_scale=5.0,
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("output.png")
```
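If you want to check the peak-memory figures from the table above on your own hardware, you can query PyTorch's allocator high-water mark after generation. This is a minimal sketch; it only counts memory allocated through PyTorch's caching allocator, so the true device footprint may be slightly higher:

```python
import torch

# Report PyTorch's peak allocated GPU memory after image generation.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.2f} GB")
```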
## 🔍 How It Works
We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
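To see why the exponent field compresses so well, you can measure its empirical entropy directly. The sketch below uses random Gaussian weights as a stand-in for real model weights (for which the ~2.6-bit figure above was measured), so the exact number will differ:

```python
import torch

# Stand-in for real BFloat16 model weights.
weights = torch.randn(1_000_000, dtype=torch.bfloat16)

# Reinterpret the 16-bit patterns; bits 7..14 hold the 8-bit exponent.
bits = weights.view(torch.int16).to(torch.int32) & 0xFFFF
exponents = (bits >> 7) & 0xFF

# Empirical entropy of the exponent distribution.
counts = torch.bincount(exponents, minlength=256).float()
probs = counts[counts > 0] / counts.sum()
entropy = -(probs * probs.log2()).sum().item()
print(f"Exponent entropy: {entropy:.2f} bits (vs. 8 stored bits)")
```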
The result is a model that is ~30% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.
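As a toy illustration of why Huffman coding is lossless, here is a minimal encode/decode round trip over the exponent symbols. This is a didactic sketch, not DFloat11's actual storage format or CUDA decompression kernel:

```python
import heapq
from collections import Counter

import torch

# Exponent symbols from a stand-in weight tensor.
w = torch.randn(10_000, dtype=torch.bfloat16)
exps = (((w.view(torch.int16).to(torch.int32) & 0xFFFF) >> 7) & 0xFF).tolist()

# Build a Huffman code: repeatedly merge the two least frequent groups,
# prefixing '0'/'1' to every symbol's code in each merged group.
heap = [(n, i, [s]) for i, (s, n) in enumerate(Counter(exps).items())]
heapq.heapify(heap)
code = {s: "" for s in set(exps)}
while len(heap) > 1:
    n1, i1, g1 = heapq.heappop(heap)
    n2, i2, g2 = heapq.heappop(heap)
    for s in g1:
        code[s] = "0" + code[s]
    for s in g2:
        code[s] = "1" + code[s]
    heapq.heappush(heap, (n1 + n2, i1, g1 + g2))

bitstream = "".join(code[s] for s in exps)
print(f"{len(bitstream) / len(exps):.2f} bits/exponent vs. 8 uncompressed")

# Decode by greedily matching the prefix-free codes, then verify the round trip.
decode = {v: k for k, v in code.items()}
out, buf = [], ""
for b in bitstream:
    buf += b
    if buf in decode:
        out.append(decode[buf])
        buf = ""
assert out == exps  # losslessly recovered, bit for bit
```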
Learn more in our research paper.