Omer Karisman (okaris) · PRO
46 followers · 6 following
https://okaris.ai · okarisman · okaris
AI & ML interests: GenAI
Recent Activity
replied to a-r-r-o-w's post · about 5 hours ago
Caching is an essential technique used in diffusion inference serving for speeding up image/video generation. Diffusers just added support for another caching method: First Block Cache, a technique developed by @chengzeyi that builds on the ideas of TeaCache.

The idea, in short: if the model predictions do not vary much over successive inference steps, we can skip the steps where the prediction difference is small. To figure out whether an inference step will make a significant improvement to the overall velocity/noise prediction, we calculate the relative difference of the output of the first transformer block at timestep $t$ with $t-1$, and compare it against a selected threshold. If the difference is lower than the threshold, we skip the step. A higher threshold leads to more steps being skipped. However, skipping too many steps can throw off the model predictions, so the threshold needs to be tested and selected per model for the desired quality-speed tradeoff.

Diffusers usage with CogView4:

```python
import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")
apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")
```

Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.

Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180

References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache
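The skip decision described in the post can be sketched as follows. This is an illustrative helper, not the Diffusers internals: the function name, the mean-absolute relative-difference metric, and the use of plain Python lists are all assumptions made for clarity.

```python
def should_skip_step(prev_out, curr_out, threshold=0.2):
    """Decide whether to skip a denoising step, First Block Cache style.

    Compares the first transformer block's output at timestep t (curr_out)
    against t-1 (prev_out) via a mean relative difference. If the outputs
    barely changed, the remaining blocks are unlikely to change the
    prediction much, so the step can be skipped and the cached result reused.
    Hypothetical sketch; the actual Diffusers metric may differ.
    """
    mean_abs_diff = sum(abs(c - p) for c, p in zip(curr_out, prev_out)) / len(prev_out)
    mean_abs_prev = sum(abs(p) for p in prev_out) / len(prev_out)
    relative_diff = mean_abs_diff / mean_abs_prev
    return relative_diff < threshold
```

A higher `threshold` makes the condition easier to satisfy, so more steps get skipped, matching the tradeoff described above.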
reacted to a-r-r-o-w's post with 🚀 · about 5 hours ago (same post as above)
reacted to a-r-r-o-w's post with 🧠 · about 5 hours ago (same post as above)
Spaces (2)
Omni-Zero 🧛 · Restylize & repose person ID · Running on Zero · 455
Omni-Zero-Couples 🧛 · Create stylized portraits from images of couples · Running on Zero · 33
Models (22)
okaris/big-lama · Updated Apr 7
okaris/flux-hires · Updated Dec 23, 2024
okaris/sam2.1-onnx · Updated Nov 15, 2024
okaris/rdjgxlv8ip · Image-to-Image · Updated Nov 7, 2024 · 283
okaris/rdjgxlv8 · Image-to-Image · Updated Oct 21, 2024 · 2
okaris/inpaint-patch · Updated Oct 21, 2024
okaris/head-segmentation · Updated Oct 1, 2024 · 22
okaris/simple-lama · Updated Sep 30, 2024
okaris/BiRefNet-legacy · Image Segmentation · 0.2B · Updated Aug 24, 2024 · 4
okaris/female-faces · Updated May 29, 2024 · 1
Datasets (2)
okaris/man-woman-reg · Updated Sep 27, 2023 · 3
okaris/ucberkeley-dlab-measuring-hate-speech · Viewer · Updated Sep 17, 2023 · 136k · 6