mingyi456 committed
Commit c6a062f (verified)
1 Parent(s): bf6efa0

Update README.md

Files changed (1)
  1. README.md +57 -1
README.md CHANGED
@@ -6,4 +6,60 @@ license: apache-2.0
  language: en
  pipeline_tag: text-generation
  ---
- ## **Important: This model is recommended for use with the stock Flux, Chroma and HiDream pipelines. For SD3.5 and Bria, it is recommended to use [this version made from BF16](https://huggingface.co/mingyi456/t5-v1_1-xxl-DF11) instead. <u>Also, currently the original T5XXL weights are required to initialize the model correctly. Random initialization will lead to unpredictable results, even with the same seed.</u>
+ ## Important: This model is recommended for use with the stock SD3.5 and Bria pipelines. For Flux, Chroma and HiDream, it is recommended to use [this version made from BF16](https://huggingface.co/mingyi456/t5-v1_1-xxl-DF11) instead. <u>Also, currently the original T5XXL weights are required to initialize the model correctly. Random initialization will lead to unpredictable results, even with the same seed.</u>
+
+ For more information (including how to compress models yourself), check out https://huggingface.co/DFloat11 and https://github.com/LeanModels/DFloat11
+
+ After successfully compressing Cosmos-Predict2-14B-Text2Image and Chroma, I wanted to try compressing the text encoders used in diffusion pipelines. The benefits of doing so are as follows:
+
+ 1. It provides a further reduction in VRAM footprint if the entire pipeline is loaded onto the GPU, or, when using `enable_model_cpu_offload()`, a reduction in the total system RAM footprint. Some diffusion models, like SD3.5 Medium, Cosmos-Predict2-2B and Bria 3.2, are actually smaller than the text encoder they use, so compressing the text encoder yields the larger benefit (a rough way to check peak VRAM usage is sketched after this list).
+ 2. The text encoder stage of the pipeline is very fast in my experience, so with `enable_model_cpu_offload()` the (almost insignificant) speed penalty during text encoding is often more than outweighed by the significantly faster loading and unloading of the compressed text encoder, since less data has to be shuffled between VRAM and system RAM.
+
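+ A minimal sketch for checking peak VRAM usage yourself (assuming a CUDA GPU and a `pipe` object constructed as in the example further below; the prompt and step count are only illustrative):
+
+ ```python
+ import torch
+
+ # Reset the peak-memory counter, run the pipeline once, then read the peak back.
+ torch.cuda.reset_peak_memory_stats()
+ image = pipe("A quick test prompt", num_inference_steps=4).images[0]
+ print(f"Peak VRAM allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
+ ```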
+ Unfortunately, this was an absolute nightmare to get working. It took many failed attempts to get the compression code working, and then many more attempts to produce a compressed model that loads successfully and produces outputs identical to the uncompressed model. For T5XXL, I was unable to save it as a single file due to a complaint about shared tensors (most likely due to my own incompetence and inexperience, so I welcome any advice in this area). Also, the compressed weights cannot be loaded directly via `text_encoder_df11 = DFloat11Model.from_pretrained()`; they require an existing BF16 model to be passed as the `bfloat16_model` argument for the weights to be loaded into.
+
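+ A minimal sketch of that loading pattern for a standalone text encoder (assuming the standard diffusers repo layout with a `text_encoder` subfolder; adjust the repo id to the pipeline you are using):
+
+ ```python
+ import torch
+ from transformers import T5EncoderModel
+ from dfloat11 import DFloat11Model
+
+ # First load the original T5XXL encoder in BF16 (required for correct initialization)...
+ text_encoder = T5EncoderModel.from_pretrained(
+     "briaai/BRIA-3.2", subfolder="text_encoder", torch_dtype=torch.bfloat16
+ )
+ # ...then inject the DF11-compressed weights into it. Calling
+ # DFloat11Model.from_pretrained() without `bfloat16_model=` does not work here.
+ DFloat11Model.from_pretrained(
+     "mingyi456/t5-v1_1-xxl-fp16-DF11",
+     device="cpu",
+     bfloat16_model=text_encoder,
+ )
+ ```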
+ At least for now, using the DF11 compression of the T5XXL weights saves ~2.5 GB of VRAM/RAM. This allows pipelines like Bria to run with `pipe.to("cuda")` instead of `pipe.enable_model_cpu_offload()` on 24 GB VRAM setups; otherwise, the uncompressed pipeline exceeds 24 GB during the VAE decode stage. SD3.5 Medium also exceeds 24 GB when generating 1440x1440 images (which it seems somewhat capable of doing). As usual, do let me know if you run into any problems.
+
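+ In that case, the offload call in the example below is simply replaced with a plain device move (a sketch; whether the DF11 weights should then be loaded with `device="cuda"` instead of `device="cpu"` is an assumption to verify for your setup):
+
+ ```python
+ # Keep the entire pipeline resident on the GPU instead of offloading it to system RAM.
+ pipe.to("cuda")
+ ```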
+ ### How to Use
+
+ #### `diffusers`
+
+ 1. Install the DFloat11 pip package *(installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed)*:
+
+ ```bash
+ pip install dfloat11[cuda12]
+ # or if you have CUDA version 11:
+ # pip install dfloat11[cuda11]
+ ```
+ 2. To use the DFloat11 model, run the following example code in Python:
+ ```python
+ import torch
+ from diffusers import BriaPipeline, BriaTransformer2DModel
+ from dfloat11 import DFloat11Model
+ from transformers.modeling_utils import no_init_weights
+
+ # IMPORTANT! Only the transformer should be initialized this way! The text_encoder currently requires full bf16 weights to load correctly!
+ with no_init_weights():
+     transformer = BriaTransformer2DModel.from_config(
+         BriaTransformer2DModel.load_config(
+             "briaai/BRIA-3.2",
+             subfolder="transformer"
+         ),
+         torch_dtype=torch.bfloat16
+     ).to(torch.bfloat16)
+
+ pipe = BriaPipeline.from_pretrained(
+     "briaai/BRIA-3.2",
+     transformer=transformer,
+     torch_dtype=torch.bfloat16
+ )
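+ # NOTE (assumption about the intended workflow): at this point `transformer` still holds
+ # uninitialized weights. Load its DF11-compressed weights here as well, in the same way as
+ # the text encoder below, e.g.
+ #   DFloat11Model.from_pretrained(<your DF11 Bria transformer repo>, device="cpu", bfloat16_model=pipe.transformer)
+ # Otherwise the transformer runs with random weights and the output will be meaningless.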
+ DFloat11Model.from_pretrained("mingyi456/t5-v1_1-xxl-fp16-DF11", device="cpu", bfloat16_model=pipe.text_encoder)
+ pipe.enable_model_cpu_offload()
+ prompt = "A futuristic cityscape at sunset, with flying cars, neon lights, and reflective water canals"
+ image = pipe(
+     prompt,
+     guidance_scale=3.5,
+     num_inference_steps=30,
+     max_sequence_length=256,
+     generator=torch.Generator("cpu").manual_seed(0)
+ ).images[0]
+ image.save("bria.png")
+ ```
+ #### ComfyUI
+ Unfortunately, this is unlikely to be supported in the near future. Due to my limited experience in this field, I do not think I can make this work unless the original developer steps in.