gguf quantized version of kontext

  • drag kontext to > ./ComfyUI/models/diffusion_models
  • drag clip-l, t5xxl to > ./ComfyUI/models/text_encoders
  • drag pig to > ./ComfyUI/models/vae
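  • optional: a minimal python sketch to script those downloads with huggingface_hub is below; the gguf filenames used are placeholders/assumptions (except the t5xxl one quoted in the diffusers example later), so pick the actual quant you want from the repo file list
# sketch only: pull the gguf files into the comfyui folders via huggingface_hub
from huggingface_hub import hf_hub_download

repo = "calcuis/kontext-gguf"
hf_hub_download(repo, "kontext-q4_0.gguf", local_dir="./ComfyUI/models/diffusion_models")  # model (assumed filename)
hf_hub_download(repo, "clip_l.gguf", local_dir="./ComfyUI/models/text_encoders")           # clip-l (assumed filename)
hf_hub_download(repo, "t5xxl_fp16-q4_0.gguf", local_dir="./ComfyUI/models/text_encoders")  # t5xxl
hf_hub_download(repo, "pig.gguf", local_dir="./ComfyUI/models/vae")                        # vae (assumed filename)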

screenshot

Prompt
the anime girl with massive fennec ears is wearing cargo pants while sitting on a log in the woods biting into a sandwich beside a beautiful alpine lake
Prompt
the anime girl with massive fennec ears is wearing a maid outfit with a long black gold leaf pattern dress and a white apron mouth open holding a fancy black forest cake with candles on top in the kitchen of an old dark Victorian mansion lit by candlelight with a bright window to the foggy forest and very expensive stuff everywhere
Prompt
add a hat to the pig
  • no need for safetensors anymore; all gguf (model + encoder + vae)
  • the full gguf set works on gguf-node (see the last item in the reference at the very end)
  • get more t5xxl gguf encoders either here or here

screenshot

extra: scaled safetensors (alternative 1)

  • get all-in-one checkpoint here (model, clips and vae embedded) screenshot
  • another option: get the multi-matrix scaled fp8 from comfyui here, or the e4m3fn fp8 here, along with separate scaled versions of clip-l, t5xxl and vae

run it with diffusers🧨 (alternative 2)

  • you might need the most updated diffusers (git version) for FluxKontextPipeline to work; upgrade your diffusers with:
pip install git+https://github.com/huggingface/diffusers.git
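  • a quick sanity check (optional): if the import below fails, your diffusers release is still too old and you need the git install above
from diffusers import FluxKontextPipeline  # available only in recent/git versions of diffusers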
  • see example inference below:
import torch
from transformers import T5EncoderModel
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

# load the quantized t5xxl text encoder straight from the gguf file in this repo
text_encoder = T5EncoderModel.from_pretrained(
    "calcuis/kontext-gguf",
    gguf_file="t5xxl_fp16-q4_0.gguf",
    torch_dtype=torch.bfloat16,
)

# build the kontext pipeline around it and move it to the gpu
pipe = FluxKontextPipeline.from_pretrained(
    "calcuis/kontext-gguf",
    text_encoder_2=text_encoder,
    torch_dtype=torch.bfloat16,
).to("cuda")

input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

# edit the input image according to the prompt
image = pipe(
    image=input_image,
    prompt="Add a hat to the cat",
    guidance_scale=2.5,
).images[0]
image.save("output.png")
  • tip: if your machine doesn't have enough vram, we'd suggest running it with gguf-node via comfyui (plan a); otherwise expect a very long wait as it falls back to a slow mode; this is always a winner-takes-all game
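  • if you still want to stay in diffusers on a low-vram machine, a minimal variant of the example above using diffusers' built-in cpu offload is sketched below (same model/encoder names as above); it trades speed for memory, so no guarantee it beats the comfyui route
# low-vram sketch: instead of .to("cuda"), let diffusers offload submodules to cpu
# and move each one to the gpu only when needed (slower, but needs less vram)
pipe = FluxKontextPipeline.from_pretrained(
    "calcuis/kontext-gguf",
    text_encoder_2=text_encoder,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()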

run it with gguf-connector (alternative 3)

  • simply execute the command below in console/terminal
ggc k2

screenshot

  • note: on first launch it will pull the required model file(s) from this repo into the local cache automatically; after that you can opt to run it entirely offline, i.e., from the local URL http://127.0.0.1:7860 with the lazy webui

screenshot

  • for the bot lora embedded version, run:
ggc k1

screenshot

  • new plushie style screenshot

additional chapter for lora conversion via gguf-connector

  • convert a lora from base to unet format, e.g., plushie; then it can be used in comfyui as well
ggc la

screenshot

  • you can also swap the lora back (from unet to base; auto-detection logic applied), so it can be used for inference again (a small format-detection sketch follows below)
ggc la

screenshot
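
  • to give a rough idea of the auto-detection part, a tiny illustrative sketch is below; the key prefixes checked are assumptions, not the actual ggc la logic
# illustrative sketch only: guess whether a lora file is in base (diffusers-style)
# or unet (comfyui-style) format by peeking at its key names
from safetensors.torch import load_file

def guess_lora_format(path):
    keys = load_file(path).keys()
    if any(k.startswith("lora_unet_") for k in keys):    # assumed comfyui-style prefix
        return "unet"
    if any(k.startswith("transformer.") for k in keys):  # assumed diffusers-style prefix
        return "base"
    return "unknown"

print(guess_lora_format("plushie.safetensors"))  # hypothetical filename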

update

  • clip-l-v2: missing tensor text_projection.weight added
  • kontext-v2: s-quant and k-quant; except the single and double blocks, all tensors are kept in f32 (a small inspection sketch follows after the table below)
    • pros:
      1. loads faster (no dequant needed for those tensors);
      2. avoids key-breaking issues, since some inference engines only dequant blocks;
      3. compatible with non-cuda machines, as most of them cannot run bf16 tensors
    • cons: slightly larger file size
  • kontext-v3: i-quant attempt (upgrade your node to the latest version for full quant support)
  • kontext-v4: t-quant; runnable (extremely fast); for speed test/experimental purposes
rank  quant    s/it       loading speed
1     q2_k     6.40±.7    🐖💨💨💨💨💨💨
2     q4_0     8.58±.5    🐖🐖💨💨💨💨💨
3     q4_1     9.12±.5    🐖🐖🐖💨💨💨💨
4     q8_0     9.45±.3    🐖🐖🐖🐖💨💨💨
5     q3_k     9.50±.3    🐖🐖🐖🐖💨💨💨
6     q5_0     10.48±.5   🐖🐖🐖🐖🐖💨💨
7     iq4_nl   10.55±.5   🐖🐖🐖🐖🐖💨💨
8     q5_1     10.65±.5   🐖🐖🐖🐖🐖💨💨
9     iq4_xs   11.45±.7   🐖🐖🐖🐖🐖🐖💨
10    iq3_s    11.62±.9   🐢🐢🐢🐢🐢🐢💨
11    iq3_xxs  12.08±.9   🐢🐢🐢🐢🐢🐢🐢

not all quants were included in the initial test (*tested with a beginner laptop gpu only; if you have a high-end card, you might find q8_0 running surprisingly faster than the others); test the rest yourself; btw, the interesting part: the loading time required did not align with file size, due to the complexity of each dequant calculation, and it might vary between models
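
for the curious, a small sketch to check which tensors in a gguf file are quantized and which stay in f32 (uses the gguf python package; the filename is a placeholder):
# minimal sketch: print each tensor's quantization type, e.g. to confirm that only
# the single/double blocks are quantized in kontext-v2 while the rest stays f32
from gguf import GGUFReader

reader = GGUFReader("kontext-v2-q4_0.gguf")  # placeholder filename
for t in reader.tensors:
    print(t.name, t.tensor_type.name)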

reference
