ChrisGoringe
/

MixedQuantFlux

GGUF


ChrisGoringe committed on Sep 6, 2024

Commit

c2a6fbc

verified ·

1 Parent(s): 3c577fb

Create README.md


README.md +92 -0


	@@ -0,0 +1,92 @@

+---
+base_model: black-forest-labs/FLUX.1-dev
+---
+*Note that all these models are derivatives of black-forest-labs/FLUX.1-dev and are therefore covered by the
+[FLUX.1 [dev] Non-Commercial License](https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md).*
+*Some models are derivatives of finetunes and are included with the permission of the finetuner.*
+# Optimised Flux GGUF models
+A collection of GGUF models using mixed quantization (different layers are quantized to different precisions to optimise the trade-off between fidelity and memory).
+They can be loaded in ComfyUI using the [ComfyUI GGUF Nodes](https://github.com/city96/ComfyUI-GGUF). Put the gguf files in your
+models/unet directory.
+## Naming convention (mx for 'mixed')
+`[original_model_name]_mxNN_N.gguf`
+where NN_N is the approximate reduction in VRAM usage compared to the full 16-bit version.
+- 9_0 might just fit on a 16 GB card
+- 10_6 is a good balance for 16 GB cards
+- 12_0 is roughly the size of an 8-bit model
+- 14_1 should work for 12 GB cards
+- 15_2 is fully quantised to Q4_1
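
The naming convention above can be handled programmatically. The following is a minimal sketch, assuming the `NN_N` tag is read as the decimal number `NN.N`; the function names and the file name used in the usage note are hypothetical and do not refer to specific files in this repository.

```python
import re

# Assumed pattern: <original_model_name>_mx<NN>_<N>.gguf
_MX_PATTERN = re.compile(r"^(?P<base>.+)_mx(?P<whole>\d+)_(?P<frac>\d)\.gguf$")


def parse_mx_filename(filename: str) -> tuple[str, float]:
    """Split a file name into (original model name, approximate reduction)."""
    match = _MX_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not an mx-style GGUF file name: {filename!r}")
    # Assumption: "10_6" stands for the number 10.6
    reduction = float(f"{match['whole']}.{match['frac']}")
    return match["base"], reduction


def mx_filename(base: str, tag: str) -> str:
    """Build a file name from a model name and a tag such as '10_6'."""
    return f"{base}_mx{tag}.gguf"
```

For example, `parse_mx_filename("flux1-dev_mx10_6.gguf")` would split the (hypothetical) file name into the base name `flux1-dev` and the reduction `10.6`.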
+## How is this optimised?
+The process for optimisation is as follows:
+- 240 prompts taken from Flux images popular at civit.ai were run through the full Flux.1-dev model
+- The hidden states before the start of the double_layer_blocks and after the end of the single_layer_blocks were captured
+- The layer stack was then modified by quantizing a single layer to one of Q8_0, Q5_1 or Q4_1
+- The initial hidden states were then processed by the modified layer stack, and the error (MSE) in the final hidden state was calculated
+- This gives a 'cost' for each possible layer quantization
+- An optimised quantization is one that gives the desired reduction in size for the smallest total cost
+  - A series of recipes for optimisation has been created from the calculated costs
+- The various 'in' blocks, the final layer blocks, and all normalization scale parameters are stored in float32
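
The cost and selection steps above can be sketched in Python. This is an illustration only, not the procedure actually used here: the README does not say how the recipes were derived from the costs, so the dynamic-programming search below is an assumed stand-in. The function names (`mse`, `choose_quants`), the integer "saving" units and the toy numbers in the usage note are all hypothetical, and the sketch assumes that per-layer costs simply add up to a total cost.

```python
def mse(reference, candidate):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)


def choose_quants(options, target):
    """Pick one quantization (or none) per layer at minimum total cost.

    options: one dict per layer, mapping quant name -> (saving, cost),
             where saving is an integer size reduction and cost is the
             measured error for quantizing that layer alone.
    target:  required total saving, in the same integer units.
    Returns (total_cost, choices); a choice of None leaves a layer unquantized.
    """
    # best[s] = (cost, choices) for the cheapest way to save s units,
    # with s capped at target because extra savings are not needed.
    best = {0: (0.0, [])}
    for layer_options in options:
        nxt = {}
        candidates = [(None, 0, 0.0)]
        candidates += [(q, s, c) for q, (s, c) in layer_options.items()]
        for saved, (cost, picks) in best.items():
            for quant, saving, extra in candidates:
                key = min(target, saved + saving)
                new_cost = cost + extra
                if key not in nxt or new_cost < nxt[key][0]:
                    nxt[key] = (new_cost, picks + [quant])
        best = nxt
    if target not in best:
        raise ValueError("target saving cannot be reached with these options")
    return best[target]
```

With two toy layers, `choose_quants([{"Q8_0": (2, 0.1), "Q4_1": (4, 1.0)}, {"Q8_0": (2, 0.5), "Q4_1": (4, 0.6)}], target=6)` picks `Q8_0` for the first layer and `Q4_1` for the second, because that combination reaches the target saving with the lowest summed cost.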
+## Also note
+- Tests using bitsandbytes quantizations showed that they did not perform as well as equivalently sized GGUF quants
+- Quantizing different parts of a layer to different precisions gave significantly worse results
+- Leaving biases in 16-bit made no meaningful difference
+- Costs were evaluated for the original Flux.1-dev model. They are assumed to be essentially the same for finetunes
+## Details
+The optimisation recipes are as follows (layers 0-18 are the double_layer_blocks, 19-56 are the single_layer_blocks):
+```python
+CONFIGURATIONS = {
+    "9_0" : {
+        'casts': [
+            {'layers': '0-10',             'castto': 'BF16'},
+            {'layers': '11-14, 54',        'castto': 'Q8_0'},
+            {'layers': '15-36, 39-53, 55', 'castto': 'Q5_1'},
+            {'layers': '37-38, 56',        'castto': 'Q4_1'},
+        ]
+    },
+    "10_6" : {
+        'casts': [
+            {'layers': '0-4, 10',      'castto': 'BF16'},
+            {'layers': '5-9, 11-14',   'castto': 'Q8_0'},
+            {'layers': '15-35, 41-55', 'castto': 'Q5_1'},
+            {'layers': '36-40, 56',    'castto': 'Q4_1'},
+        ]
+    },
+    "12_0" : {
+        'casts': [
+            {'layers': '0-2',                  'castto': 'BF16'},
+            {'layers': '5, 7-12',              'castto': 'Q8_0'},
+            {'layers': '3-4, 6, 13-33, 42-55', 'castto': 'Q5_1'},
+            {'layers': '34-41, 56',            'castto': 'Q4_1'},
+        ]
+    },
+    "14_1" : {
+        'casts': [
+            {'layers': '0-25, 27-28, 44-54', 'castto': 'Q5_1'},
+            {'layers': '26, 29-43, 55-56',   'castto': 'Q4_1'},
+        ]
+    },
+    "15_2" : {
+        'casts': [
+            {'layers': '0-56', 'castto': 'Q4_1'},
+        ]
+    },
+}
+```
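
The layer specifications in the recipes (for example `'0-4, 10'`) can be expanded into a per-layer list of target types. The sketch below is one possible way to do that, assuming inclusive ranges and 57 layers in total (0-56, as stated above). The helper names `expand_layers` and `layer_types` are hypothetical, and `RECIPE_10_6` is simply a copy of the "10_6" entry so that the block runs on its own.

```python
def expand_layers(spec: str) -> list[int]:
    """Expand a spec such as '0-4, 10' into [0, 1, 2, 3, 4, 10]."""
    layers = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            layers.extend(range(int(first), int(last) + 1))  # inclusive range
        else:
            layers.append(int(part))
    return layers


def layer_types(config: dict, n_layers: int = 57) -> list[str]:
    """Return the target type of every layer, checking for gaps and overlaps."""
    assigned = {}
    for cast in config["casts"]:
        for layer in expand_layers(cast["layers"]):
            if layer in assigned:
                raise ValueError(f"layer {layer} is assigned twice")
            assigned[layer] = cast["castto"]
    missing = [i for i in range(n_layers) if i not in assigned]
    if missing:
        raise ValueError(f"layers without a type: {missing}")
    return [assigned[i] for i in range(n_layers)]


# Copy of the "10_6" recipe from CONFIGURATIONS
RECIPE_10_6 = {
    'casts': [
        {'layers': '0-4, 10',      'castto': 'BF16'},
        {'layers': '5-9, 11-14',   'castto': 'Q8_0'},
        {'layers': '15-35, 41-55', 'castto': 'Q5_1'},
        {'layers': '36-40, 56',    'castto': 'Q4_1'},
    ]
}

types_10_6 = layer_types(RECIPE_10_6)
```

A check of this kind also confirms that a recipe covers each of the 57 layers exactly once.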