---
base_model:
- black-forest-labs/FLUX.1-schnell
base_model_relation: quantized
pipeline_tag: text-to-image
tags:
- dfloat11
- df11
- lossless compression
- 70% size, 100% accuracy
---

## DFloat11 Compressed Model: `black-forest-labs/FLUX.1-schnell`

This is a **losslessly compressed** version of [`black-forest-labs/FLUX.1-schnell`](https://huggingface.co/black-forest-labs/FLUX.1-schnell) using our custom **DFloat11** format. The outputs of this compressed model are **bit-for-bit identical** to the original BFloat16 model, while reducing GPU memory consumption by approximately **30%**.

### 🔍 How It Works

DFloat11 compresses model weights using **Huffman coding** of BFloat16 exponent bits, combined with **hardware-aware algorithmic designs** that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are **decompressed just before matrix multiplications**, then **immediately discarded after use** to minimize memory footprint. (A minimal pure-Python sketch of the exponent-coding idea is included at the end of this card.)

Key benefits:

* **No CPU decompression or host-device data transfer** --- all operations are handled entirely on the GPU.
* DFloat11 is **much faster than CPU-offloading approaches**, enabling practical deployment in memory-constrained environments.
* The compression is **fully lossless**, guaranteeing that the model's outputs are **bit-for-bit identical** to those of the original model.

### 🔧 How to Use

1. Install the DFloat11 pip package *(installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed)*:

    ```bash
    pip install dfloat11[cuda12]
    # or if you have CUDA version 11:
    # pip install dfloat11[cuda11]
    ```

2. To use the DFloat11 model, run the following example code in Python:

    ```python
    import torch
    from diffusers import FluxPipeline
    from dfloat11 import DFloat11Model

    # Load the original pipeline; its BF16 transformer weights are replaced below.
    pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()

    # Swap in the DFloat11-compressed weights for the transformer.
    DFloat11Model.from_pretrained('DFloat11/FLUX.1-schnell-DF11', device='cpu', bfloat16_model=pipe.transformer)

    prompt = "A futuristic cityscape at sunset, with flying cars, neon lights, and reflective water canals"
    image = pipe(
        prompt,
        guidance_scale=0.0,
        num_inference_steps=4,
        max_sequence_length=256,
        generator=torch.Generator("cpu").manual_seed(0)
    ).images[0]
    image.save("flux-schnell.png")
    ```

### 📄 Learn More

* **Paper**: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
* **GitHub**: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
* **HuggingFace**: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)
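
### 🧩 Appendix: A Minimal Exponent-Coding Sketch

To make the idea in "How It Works" concrete, below is a minimal, illustrative pure-Python sketch, not the actual DFloat11 implementation (which uses custom CUDA kernels for massively parallel decoding). It extracts the 8 exponent bits from a toy BF16 tensor, Huffman-codes them, keeps the sign and mantissa bits verbatim, and verifies a bit-for-bit round trip. The `build_huffman_code` helper and the random weight tensor are assumptions for illustration only.

```python
import heapq
from collections import Counter

import numpy as np
import torch

def build_huffman_code(symbol_counts):
    """Build a Huffman code (symbol -> bitstring) from symbol frequencies.
    Assumes at least two distinct symbols."""
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(symbol_counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

# Toy stand-in for one BF16 weight tensor.
w = torch.randn(4096, dtype=torch.bfloat16)
raw = w.view(torch.int16).numpy().view(np.uint16)   # reinterpret the 16 raw bits of each weight

exponents = ((raw >> 7) & 0xFF).tolist()            # 8 exponent bits: highly non-uniform, entropy-coded
sign_mantissa = raw & np.uint16(0x807F)             # 1 sign + 7 mantissa bits: kept verbatim

code = build_huffman_code(Counter(exponents))
bitstream = "".join(code[e] for e in exponents)

orig_bits = 16 * len(exponents)
comp_bits = len(bitstream) + 8 * len(exponents)     # coded exponents + raw sign/mantissa bits
print(f"~{100 * comp_bits / orig_bits:.1f}% of the original BF16 size")

# Lossless round trip: decode the exponent stream and reassemble the exact bit patterns.
decode = {v: k for k, v in code.items()}
decoded, buf = [], ""
for bit in bitstream:
    buf += bit
    if buf in decode:
        decoded.append(decode[buf])
        buf = ""
restored = (np.array(decoded, dtype=np.uint16) << 7) | sign_mantissa
assert np.array_equal(restored, raw)                # bit-for-bit identical weights
```

Because the exponent bits of trained weights are highly skewed, entropy coding them brings the average weight to roughly 11 bits. In the real DFloat11 kernel, decoding runs on the GPU in parallel just before each matrix multiplication, and the decompressed BF16 weights are discarded immediately afterwards, which is why GPU memory stays close to the compressed size.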