DFloat11 Compressed Model: HiDream-ai/HiDream-I1-Full

This is a DFloat11 losslessly compressed version of the original HiDream-ai/HiDream-I1-Full model. It reduces model size by 30% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

πŸ”₯πŸ”₯πŸ”₯ Thanks to DFloat11 compression, HiDream-I1-Full can now run smoothly on a single 32GB GPU without any quality loss. πŸ”₯πŸ”₯πŸ”₯

πŸ“Š Performance Comparison

| Metric | HiDream-I1-Full (BFloat16) | HiDream-I1-Full (DFloat11) |
|---|---|---|
| Model Size | 34.21 GB | 24.19 GB |
| Peak GPU Memory (1024×1024 image generation) | 35.61 GB | 26.42 GB |
| Generation Time (A100 GPU) | 140 seconds | 161 seconds |
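
For context, both columns can be reproduced with standard PyTorch instrumentation. Below is a minimal sketch; the measure_generation helper is illustrative, not part of the DFloat11 API, and assumes a pipe built as in the usage example further down:

    import time
    import torch

    def measure_generation(pipe, prompt):
        # Reset the peak-memory counter, then time a single 1024x1024 generation
        torch.cuda.reset_peak_memory_stats()
        start = time.time()
        image = pipe(prompt, height=1024, width=1024, num_inference_steps=50).images[0]
        elapsed = time.time() - start
        peak_gb = torch.cuda.max_memory_allocated() / 1024**3
        print(f"Generation time: {elapsed:.0f} s | peak GPU memory: {peak_gb:.2f} GB")
        return image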

πŸ”§ How to Use

  1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and an existing PyTorch installation):

    pip install -U dfloat11[cuda12]
    # or if you have CUDA version 11:
    # pip install -U dfloat11[cuda11]
    
  2. Install or upgrade the diffusers library:

    pip install -U diffusers
    
  3. Run the following Python example to generate an image with the DFloat11 model:

    import torch
    from transformers import AutoTokenizer
    from diffusers import HiDreamImagePipeline
    from dfloat11 import DFloat11Model
    
    # Load the DFloat11-compressed Llama-3.1-8B text encoder (used as text_encoder_4)
    tokenizer_4 = AutoTokenizer.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11")
    text_encoder_4 = DFloat11Model.from_pretrained("DFloat11/Llama-3.1-8B-Instruct-DF11", device="cpu")
    text_encoder_4.config.output_hidden_states = True
    text_encoder_4.config.output_attentions = True
    
    # Build the pipeline with the compressed text encoder
    pipe = HiDreamImagePipeline.from_pretrained(
        "HiDream-ai/HiDream-I1-Full",
        tokenizer_4=tokenizer_4,
        text_encoder_4=text_encoder_4,
        torch_dtype=torch.bfloat16,
    )
    # Swap the transformer's BFloat16 weights for their DFloat11 counterparts in place
    DFloat11Model.from_pretrained(
        "DFloat11/HiDream-I1-Full-DF11",
        device="cpu",
        bfloat16_model=pipe.transformer,
    )
    # Keep modules on CPU and move each to the GPU only while it is in use
    pipe.enable_model_cpu_offload()
    
    image = pipe(
        'A cat wearing a vintage astronaut suit, floating inside a spaceship and gazing out the window at Earth.',
        height=1024,
        width=1024,
        guidance_scale=5.0,
        num_inference_steps=50,
        generator=torch.Generator("cuda").manual_seed(0),
    ).images[0]
    image.save("output.png")
    

πŸ” How It Works

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
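
As a rough illustration of that figure, the sketch below estimates the empirical entropy of the exponent byte for a Gaussian tensor, used here as a stand-in for trained BFloat16 weights (real checkpoints land near the ~2.6 bits cited above):

    import torch

    # Gaussian tensor as a stand-in for trained BFloat16 weights
    w = torch.randn(1_000_000).to(torch.bfloat16)

    # BFloat16 layout: 1 sign bit, 8 exponent bits (bits 14..7), 7 mantissa bits
    bits = w.view(torch.int16)
    exponent = (bits >> 7) & 0xFF

    # Shannon entropy of the empirical exponent distribution
    counts = torch.bincount(exponent.to(torch.int64), minlength=256).float()
    p = counts[counts > 0] / counts.sum()
    entropy = -(p * p.log2()).sum()
    print(f"Exponent entropy: {entropy:.2f} bits (vs. 8 bits stored)")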

The result is a model that is ~30% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.
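
If you want to spot-check the bit-identical claim yourself, one simple approach (a sketch, not part of the DFloat11 API) is to generate the same image twice, once with the original BFloat16 pipeline and once with the DFloat11 pipeline above, using the same seed, settings, and hardware, and compare the results byte for byte:

    import numpy as np
    from PIL import Image

    def images_bit_identical(a: Image.Image, b: Image.Image) -> bool:
        # True only if the two images have byte-identical pixel data
        return np.array_equal(np.array(a), np.array(b))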

Learn more in our research paper.

πŸ“„ Learn More

Downloads last month
292
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for DFloat11/HiDream-I1-Full-DF11

Quantized
(5)
this model