Update README.md

README.md (changed)

N_N is the average number of bits per parameter.

- 5_9 is comfortable on 12 GB cards
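As a rough illustration of what those averages mean for VRAM, here is a back-of-the-envelope size estimate. It is only a sketch: the ~12B parameter count assumed for the Flux.1-dev transformer is not stated in this README, and loader and inference overhead are ignored.

```python
# Back-of-the-envelope file size from average bits per parameter (illustrative only).
PARAMS = 12e9  # assumed approximate parameter count of the Flux.1-dev transformer

for name, bits in {"9_6": 9.6, "8_4": 8.4, "5_9": 5.9, "5_1": 5.1}.items():
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB")  # e.g. 5_9 comes out around 8 GiB, hence the 12 GB note above
```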
## How is this optimised?

The process for optimisation is as follows:

- 240 prompts, taken from popular flux images on civit.ai, were run through the full Flux.1-dev model with randomised resolution and step count.
- For a randomly selected step in the inference, the hidden states before and after the layer stack were captured.
- For each layer in turn, and for each quantization:
  - A single layer was quantized
  - The initial hidden states were processed by the modified layer stack
  - The error (MSE) in the final hidden state was calculated
- This gives a 'cost' for each possible layer quantization - how much it differs from the full model (see the sketch after this list)
- An optimised quantization is one that gives the desired reduction in size for the smallest total cost
- A series of recipes for optimisation has been created from the calculated costs
- The various 'in' blocks, the final layer blocks, and all normalization scale parameters are stored in float32
- Using different quantizations for different parts of a single layer gave significantly worse results
- Leaving bias in 16 bit made no relevant difference
- Costs were evaluated for the original Flux.1-dev model; they are assumed to be essentially the same for finetunes
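A minimal sketch of that costing loop, purely to make the bullets above concrete. The helper names (quantize_layer, run_layer_stack) and the model.layers structure are assumptions, not the project's actual measurement code.

```python
import torch

# Sketch of the per-layer costing described above. Helper names and the
# model.layers structure are assumptions, not the project's actual code.

QUANTS = ["BF16", "Q8_0", "Q5_1", "Q4_1"]   # candidate casts for each layer

def layer_cost(model, idx, quant, h_in, h_ref):
    """Quantize a single layer, push the captured pre-stack hidden states
    through the modified layer stack, and return the MSE of the final
    hidden state against the full-precision reference."""
    original = model.layers[idx]
    model.layers[idx] = quantize_layer(original, quant)   # hypothetical helper
    try:
        h_out = run_layer_stack(model, h_in)               # hypothetical helper
        return torch.mean((h_out - h_ref) ** 2).item()
    finally:
        model.layers[idx] = original                       # restore full precision

def measure_all_costs(model, h_in, h_ref):
    """Build the (layer, quant) -> cost table from which recipes are chosen."""
    return {(idx, q): layer_cost(model, idx, q, h_in, h_ref)
            for idx in range(len(model.layers))
            for q in QUANTS}
```

Given such a cost table, choosing a recipe is a knapsack-style selection: one cast per layer, so that the summed size hits the target average bits per parameter while the summed cost stays as low as possible.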
## Details

The optimisation recipes are as follows (layers 0-18 are the double_block_layers, 19-56 are the single_block_layers):

```python
CONFIGURATIONS = {
    "9_6" : {
        'casts': [
            {'layers': '0-10', 'castto': 'BF16'},
            {'layers': '11-14, 54', 'castto': 'Q8_0'},
            {'layers': '15-36, 39-53, 55', 'castto': 'Q5_1'},
            {'layers': '37-38, 56', 'castto': 'Q4_1'},
        ]
    },
    "9_2" : {
        'casts': [
            {'layers': '0-8, 10, 12', 'castto': 'BF16'},
            {'layers': '9, 11, 13-21, 49-54', 'castto': 'patch:flux1-dev-Q6_K.gguf'},
            {'layers': '22-34, 41-48, 55', 'castto': 'patch:flux1-dev-Q5_K_S.gguf'},
            {'layers': '35-40', 'castto': 'patch:flux1-dev-Q4_K_S.gguf'},
            {'layers': '56', 'castto': 'Q4_1'},
        ]
    },
    "8_4" : {
        'casts': [
            {'layers': '0-4, 10', 'castto': 'BF16'},
            {'layers': '5-9, 11-14', 'castto': 'Q8_0'},
            {'layers': '15-35, 41-55', 'castto': 'Q5_1'},
            {'layers': '36-40, 56', 'castto': 'Q4_1'},
        ]
    },
    "7_4" : {
        'casts': [
            {'layers': '0-2', 'castto': 'BF16'},
            {'layers': '5, 7-12', 'castto': 'Q8_0'},
            {'layers': '3-4, 6, 13-33, 42-55', 'castto': 'Q5_1'},
            {'layers': '34-41, 56', 'castto': 'Q4_1'},
        ]
    },
    "5_9" : {
        'casts': [
            {'layers': '0-25, 27-28, 44-54', 'castto': 'Q5_1'},
            {'layers': '26, 29-43, 55-56', 'castto': 'Q4_1'},
        ]
    },
    "5_1" : {
        'casts': [
            {'layers': '0-56', 'castto': 'Q4_1'},
        ]
    },
}
```
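To make the 'layers' strings concrete, here is a sketch of how a recipe could be expanded into a per-layer cast map. The helpers are illustrative rather than the converter's actual code, and the 'patch:' entries (which presumably take the corresponding layers from an existing GGUF file) are simply passed through as strings.

```python
def parse_layers(spec: str) -> list[int]:
    """Expand a 'layers' string like '11-14, 54' into [11, 12, 13, 14, 54]."""
    layers = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            layers.extend(range(lo, hi + 1))
        else:
            layers.append(int(part))
    return layers

def cast_map(config: dict) -> dict[int, str]:
    """Map each layer index (0-18 double blocks, 19-56 single blocks) to its cast."""
    return {layer: cast["castto"]
            for cast in config["casts"]
            for layer in parse_layers(cast["layers"])}

# e.g. cast_map(CONFIGURATIONS["5_9"])[30] -> 'Q4_1'
```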
## Speed?

On an A40 (plenty of VRAM), with everything except the model identical, the time taken to generate an image (30 steps, deis sampler) was:

- 5_1 => 40.1s
- 5_9 => 55.4s
- 6_9 => 52.1s
- 7_4 => 49.7s
- 7_6 => 43.6s
- 8_4 => 46.8s
- 9_2 => 42.8s
- 9_6 => 48.2s

For comparison:

- bfloat16 (default) =>
- fp8_e4m3fn =>
- fp8_e5m2 =>