Update README.md

README.md (changed)

N_N is the average number of bits per parameter.

- 5_9 is comfortable on 12 GB cards
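As a rough illustration of what those averages mean for VRAM, here is a back-of-the-envelope size estimate. It is only a sketch: the ~12B parameter count assumed for the Flux.1-dev transformer is not stated in this README, and loader and inference overhead are ignored.

```python
# Back-of-the-envelope file size from average bits per parameter (illustrative only).
PARAMS = 12e9  # assumed approximate parameter count of the Flux.1-dev transformer

for name, bits in {"9_6": 9.6, "8_4": 8.4, "5_9": 5.9, "5_1": 5.1}.items():
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB")  # e.g. 5_9 comes out around 8 GiB, hence the 12 GB note above
```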
## How is this optimised?

The process for optimisation is as follows:

- 240 prompts, taken from popular flux images on civit.ai, were run through the full Flux.1-dev model with randomised resolution and step count.
- For a randomly selected step in the inference, the hidden states before and after the layer stack were captured.
- For each layer in turn, and for each quantization:
  - A single layer was quantized
  - The initial hidden states were processed by the modified layer stack
  - The error (MSE) in the final hidden state was calculated
- This gives a 'cost' for each possible layer quantization - how much it differs from the full model (see the sketch after this list)
- An optimised quantization is one that gives the desired reduction in size for the smallest total cost
- A series of recipes for optimisation has been created from the calculated costs
- The various 'in' blocks, the final layer blocks, and all normalization scale parameters are stored in float32
- Using different quantizations for different parts of a single layer gave significantly worse results
- Leaving bias in 16 bit made no relevant difference
- Costs were evaluated for the original Flux.1-dev model; they are assumed to be essentially the same for finetunes
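A minimal sketch of that costing loop, purely to make the bullets above concrete. The helper names (quantize_layer, run_layer_stack) and the model.layers structure are assumptions, not the project's actual measurement code.

```python
import torch

# Sketch of the per-layer costing described above. Helper names and the
# model.layers structure are assumptions, not the project's actual code.

QUANTS = ["BF16", "Q8_0", "Q5_1", "Q4_1"]   # candidate casts for each layer

def layer_cost(model, idx, quant, h_in, h_ref):
    """Quantize a single layer, push the captured pre-stack hidden states
    through the modified layer stack, and return the MSE of the final
    hidden state against the full-precision reference."""
    original = model.layers[idx]
    model.layers[idx] = quantize_layer(original, quant)   # hypothetical helper
    try:
        h_out = run_layer_stack(model, h_in)               # hypothetical helper
        return torch.mean((h_out - h_ref) ** 2).item()
    finally:
        model.layers[idx] = original                       # restore full precision

def measure_all_costs(model, h_in, h_ref):
    """Build the (layer, quant) -> cost table from which recipes are chosen."""
    return {(idx, q): layer_cost(model, idx, q, h_in, h_ref)
            for idx in range(len(model.layers))
            for q in QUANTS}
```

Given such a cost table, choosing a recipe is a knapsack-style selection: one cast per layer, so that the summed size hits the target average bits per parameter while the summed cost stays as low as possible.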
## Details

The optimisation recipes are as follows (layers 0-18 are the double_block_layers, 19-56 are the single_block_layers):

```python
CONFIGURATIONS = {
    "9_6" : {
        'casts': [
            {'layers': '0-10', 'castto': 'BF16'},
            {'layers': '11-14, 54', 'castto': 'Q8_0'},
            {'layers': '15-36, 39-53, 55', 'castto': 'Q5_1'},
            {'layers': '37-38, 56', 'castto': 'Q4_1'},
        ]
    },
    "9_2" : {
        'casts': [
            {'layers': '0-8, 10, 12', 'castto': 'BF16'},
            {'layers': '9, 11, 13-21, 49-54', 'castto': 'patch:flux1-dev-Q6_K.gguf'},
            {'layers': '22-34, 41-48, 55', 'castto': 'patch:flux1-dev-Q5_K_S.gguf'},
            {'layers': '35-40', 'castto': 'patch:flux1-dev-Q4_K_S.gguf'},
            {'layers': '56', 'castto': 'Q4_1'},
        ]
    },
    "8_4" : {
        'casts': [
            {'layers': '0-4, 10', 'castto': 'BF16'},
            {'layers': '5-9, 11-14', 'castto': 'Q8_0'},
            {'layers': '15-35, 41-55', 'castto': 'Q5_1'},
            {'layers': '36-40, 56', 'castto': 'Q4_1'},
        ]
    },
    "7_4" : {
        'casts': [
            {'layers': '0-2', 'castto': 'BF16'},
            {'layers': '5, 7-12', 'castto': 'Q8_0'},
            {'layers': '3-4, 6, 13-33, 42-55', 'castto': 'Q5_1'},
            {'layers': '34-41, 56', 'castto': 'Q4_1'},
        ]
    },
    "5_9" : {
        'casts': [
            {'layers': '0-25, 27-28, 44-54', 'castto': 'Q5_1'},
            {'layers': '26, 29-43, 55-56', 'castto': 'Q4_1'},
        ]
    },
    "5_1" : {
        'casts': [
            {'layers': '0-56', 'castto': 'Q4_1'},
        ]
    },
}
```
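To make the 'layers' strings concrete, here is a sketch of how a recipe could be expanded into a per-layer cast map. The helpers are illustrative rather than the converter's actual code, and the 'patch:' entries (which presumably take the corresponding layers from an existing GGUF file) are simply passed through as strings.

```python
def parse_layers(spec: str) -> list[int]:
    """Expand a 'layers' string like '11-14, 54' into [11, 12, 13, 14, 54]."""
    layers = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            layers.extend(range(lo, hi + 1))
        else:
            layers.append(int(part))
    return layers

def cast_map(config: dict) -> dict[int, str]:
    """Map each layer index (0-18 double blocks, 19-56 single blocks) to its cast."""
    return {layer: cast["castto"]
            for cast in config["casts"]
            for layer in parse_layers(cast["layers"])}

# e.g. cast_map(CONFIGURATIONS["5_9"])[30] -> 'Q4_1'
```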
## Speed?

On an A40 (plenty of VRAM), with everything except the model identical, the time taken to generate an image (30 steps, deis sampler) was:

- 5_1 => 40.1s
- 5_9 => 55.4s
- 6_9 => 52.1s
- 7_4 => 49.7s
- 7_6 => 43.6s
- 8_4 => 46.8s
- 9_2 => 42.8s
- 9_6 => 48.2s

For comparison:

- bfloat16 (default) =>
- fp8_e4m3fn =>
- fp8_e5m2 =>