DFloat11 Compressed Model: HiDream-ai/HiDream-I1-Full

This is a DFloat11 losslessly compressed version of the original HiDream-ai/HiDream-I1-Full model. It reduces model size by 30% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

πŸ”₯πŸ”₯πŸ”₯ Thanks to DFloat11 compression, HiDream-I1-Full can now run smoothly on a single 32GB GPU without any quality loss. πŸ”₯πŸ”₯πŸ”₯

πŸ“Š Performance Comparison

| Metric | HiDream-I1-Full (BFloat16) | HiDream-I1-Full (DFloat11) |
|---|---|---|
| Model Size | 34.21 GB | 24.19 GB |
| Peak GPU Memory (1024×1024 image generation) | 35.61 GB | 26.42 GB |
| Generation Time (A100 GPU) | 140 seconds | 161 seconds |
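
For context, both columns can be reproduced with standard PyTorch instrumentation. Below is a minimal sketch; the measure_generation helper is illustrative, not part of the DFloat11 API, and assumes a pipe built as in the usage example further down:

    import time
    import torch

    def measure_generation(pipe, prompt):
        # Reset the peak-memory counter, then time a single 1024x1024 generation
        torch.cuda.reset_peak_memory_stats()
        start = time.time()
        image = pipe(prompt, height=1024, width=1024, num_inference_steps=50).images[0]
        elapsed = time.time() - start
        peak_gb = torch.cuda.max_memory_allocated() / 1024**3
        print(f"Generation time: {elapsed:.0f} s | peak GPU memory: {peak_gb:.2f} GB")
        return image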

πŸ”§ How to Use

  1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and an existing PyTorch installation):

    pip install -U dfloat11[cuda12]
    # or if you have CUDA version 11:
    # pip install -U dfloat11[cuda11]
    
  2. Install or upgrade the diffusers library:

    pip install -U diffusers
    
  3. Run the following Python example to generate an image with the DFloat11 model:

    import torch
    from transformers import AutoTokenizer
    from diffusers import HiDreamImagePipeline
    from dfloat11 import DFloat11Model
    
    # Load the DFloat11-compressed Llama-3.1-8B text encoder (used as text_encoder_4)
    tokenizer_4 = AutoTokenizer.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11")
    text_encoder_4 = DFloat11Model.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11", device="cpu")
    text_encoder_4.config.output_hidden_states = True
    text_encoder_4.config.output_attentions = True
    
    # Build the pipeline with the compressed text encoder
    pipe = HiDreamImagePipeline.from_pretrained(
        "HiDream-ai/HiDream-I1-Full",
        tokenizer_4=tokenizer_4,
        text_encoder_4=text_encoder_4,
        torch_dtype=torch.bfloat16,
    )
    # Swap the transformer's BFloat16 weights for their DFloat11 counterparts in place
    DFloat11Model.from_pretrained(
        "DFloat11/HiDream-I1-Full-DF11",
        device="cpu",
        bfloat16_model=pipe.transformer,
    )
    # Keep modules on CPU and move each to the GPU only while it is in use
    pipe.enable_model_cpu_offload()
    
    image = pipe(
        'A cat wearing a vintage astronaut suit, floating inside a spaceship and gazing out the window at Earth.',
        height=1024,
        width=1024,
        guidance_scale=5.0,
        num_inference_steps=50,
        generator=torch.Generator("cuda").manual_seed(0),
    ).images[0]
    image.save("output.png")
    

πŸ” How It Works

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
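
As a rough illustration of that figure, the sketch below estimates the empirical entropy of the exponent byte for a Gaussian tensor, used here as a stand-in for trained BFloat16 weights (real checkpoints land near the ~2.6 bits cited above):

    import torch

    # Gaussian tensor as a stand-in for trained BFloat16 weights
    w = torch.randn(1_000_000).to(torch.bfloat16)

    # BFloat16 layout: 1 sign bit, 8 exponent bits (bits 14..7), 7 mantissa bits
    bits = w.view(torch.int16)
    exponent = (bits >> 7) & 0xFF

    # Shannon entropy of the empirical exponent distribution
    counts = torch.bincount(exponent.to(torch.int64), minlength=256).float()
    p = counts[counts > 0] / counts.sum()
    entropy = -(p * p.log2()).sum()
    print(f"Exponent entropy: {entropy:.2f} bits (vs. 8 bits stored)")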

The result is a model that is ~30% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.
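
If you want to spot-check the bit-identical claim yourself, one simple approach (a sketch, not part of the DFloat11 API) is to generate the same image twice, once with the original BFloat16 pipeline and once with the DFloat11 pipeline above, using the same seed, settings, and hardware, and compare the results byte for byte:

    import numpy as np
    from PIL import Image

    def images_bit_identical(a: Image.Image, b: Image.Image) -> bool:
        # True only if the two images have byte-identical pixel data
        return np.array_equal(np.array(a), np.array(b))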

Learn more in our research paper.

πŸ“„ Learn More

Downloads last month
292
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for DFloat11/HiDream-I1-Full-DF11

Quantized
(5)
this model