dynamic-hfspaces / auto-diffuser.md

You are an expert in optimizing diffusers library code for different hardware configurations.

NOTE: This system includes curated optimization knowledge from HuggingFace documentation.

TASK: Generate optimized Python code for running a diffusion model with the following specifications:

  • Model: LPX55/FLUX.1-merged_lightning_v2
  • Prompt: "A cat holding a sign that says hello world"
  • Image size: 768x1152
  • Inference steps: 8

HARDWARE SPECIFICATIONS:

  • Platform: Linux (manual_input)
  • CPU Cores: 8
  • CUDA Available: False
  • MPS Available: False
  • Optimization Profile: balanced
  • GPU: Custom GPU (20.0 GB VRAM)

OPTIMIZATION KNOWLEDGE BASE:

DIFFUSERS OPTIMIZATION TECHNIQUES

Memory Optimization Techniques

1. Model CPU Offloading

Use enable_model_cpu_offload() to move models between GPU and CPU automatically:

pipe.enable_model_cpu_offload()
  • Saves significant VRAM by keeping only active models on GPU
  • Automatic management, no manual intervention needed
  • Compatible with all pipelines

2. Sequential CPU Offloading

Use enable_sequential_cpu_offload() for more aggressive memory saving:

pipe.enable_sequential_cpu_offload()
  • More memory efficient than model offloading
  • Moves models to CPU after each forward pass
  • Best for very limited VRAM scenarios

3. Attention Slicing

Use enable_attention_slicing() to reduce memory during attention computation:

pipe.enable_attention_slicing()
# or specify slice size
pipe.enable_attention_slicing("max")  # maximum slicing
pipe.enable_attention_slicing(1)      # slice_size = 1
  • Trades compute time for memory
  • Most effective for high-resolution images
  • Can be combined with other techniques

4. VAE Slicing

Use enable_vae_slicing() for large batch processing:

pipe.enable_vae_slicing()
  • Decodes images one at a time instead of all at once
  • Essential for batch sizes > 4
  • Minimal performance impact on single images

5. VAE Tiling

Use enable_vae_tiling() for high-resolution image generation:

pipe.enable_vae_tiling()
  • Enables 4K+ image generation on 8GB VRAM
  • Splits images into overlapping tiles
  • Automatically disabled for 512x512 or smaller images

6. Memory Efficient Attention (xFormers)

Use enable_xformers_memory_efficient_attention() if xFormers is installed:

pipe.enable_xformers_memory_efficient_attention()
  • Significantly reduces memory usage and improves speed
  • Requires xformers library installation
  • Compatible with most models

Performance Optimization Techniques

1. Half Precision (FP16/BF16)

Use lower precision for better memory and speed:

# FP16 (widely supported)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

# BF16 (better numerical stability, newer hardware)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
  • FP16: Halves memory usage, widely supported
  • BF16: Better numerical stability, requires newer GPUs
  • Essential for most optimization scenarios

2. Torch Compile (PyTorch 2.0+)

Use torch.compile() for significant speed improvements:

pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# For transformer-based pipelines (e.g., FLUX), compile pipe.transformer instead of pipe.unet
# For some models, compile the VAE decode step too:
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
  • 5-50% speed improvement
  • Requires PyTorch 2.0+
  • First run is slower due to compilation

3. Fast Schedulers

Use faster schedulers for fewer steps:

from diffusers import LMSDiscreteScheduler, UniPCMultistepScheduler

# LMS Scheduler (good quality, fast)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

# UniPC Scheduler (fastest)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

Hardware-Specific Optimizations

NVIDIA GPU Optimizations

# Enable cuDNN autotuning (picks the fastest convolution kernels for fixed input sizes)
torch.backends.cudnn.benchmark = True

# Allow TF32 on Ampere and newer GPUs so matmuls can use Tensor Cores
torch.backends.cuda.matmul.allow_tf32 = True

# Optimal data type for NVIDIA
torch_dtype = torch.float16  # or torch.bfloat16 for RTX 30/40 series

Apple Silicon (MPS) Optimizations

# Use MPS device
device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = pipe.to(device)

# Recommended dtype for Apple Silicon
torch_dtype = torch.bfloat16  # Better than float16 on Apple Silicon

# Attention slicing often helps on MPS
pipe.enable_attention_slicing()

CPU Optimizations

# Use float32 for CPU
torch_dtype = torch.float32

# Enable optimized attention
pipe.enable_attention_slicing()
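
On CPU-only hosts where automatic thread detection is unreliable (for example, inside containers), explicitly pinning the thread count to the available cores can also help; a minimal sketch, assuming the 8-core profile above:

import torch

# Match intra-op threads to the CPU cores reported in the hardware profile (8 here)
torch.set_num_threads(8)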

Model-Specific Guidelines

FLUX Models

  • Do NOT pass guidance_scale to distilled FLUX checkpoints such as FLUX.1-schnell (it is not needed there)
  • Use 4-8 inference steps maximum
  • BF16 dtype recommended
  • Enable attention slicing for memory optimization
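
Taken together, a minimal sketch of these FLUX guidelines (the checkpoint, step count, and prompt are illustrative, and a CUDA GPU with sufficient VRAM is assumed):

import torch
from diffusers import FluxPipeline

# BF16 recommended for FLUX; guidance_scale is omitted for distilled/schnell-style checkpoints
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()   # keeps VRAM usage low on a CUDA machine
pipe.enable_attention_slicing()   # memory optimization per the guideline above

image = pipe("A cat holding a sign that says hello world", num_inference_steps=4).images[0]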

Stable Diffusion XL

  • Enable attention slicing for high resolutions
  • Use refiner model sparingly to save memory
  • Consider VAE tiling for >1024px images
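
A minimal sketch of the SDXL guidelines above (base model only, refiner omitted to save memory; the model id, resolution, and step count are illustrative, and a CUDA GPU is assumed):

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()  # helps at high resolutions
pipe.enable_vae_tiling()         # recommended above for >1024px outputs

image = pipe("a photo of an astronaut riding a horse", height=1536, width=1536, num_inference_steps=30).images[0]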

Stable Diffusion 1.5/2.1

  • Very memory efficient base models
  • Can often run without optimizations on 8GB+ VRAM
  • Enable VAE slicing for batch processing
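
For batch generation with SD 1.5, VAE slicing as recommended above might look like this sketch (the model id and batch size are illustrative assumptions):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.enable_vae_slicing()  # decode the batch one image at a time to cap peak VRAM

images = pipe(["a watercolor landscape"] * 6, num_inference_steps=25).images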

Memory Usage Estimation

  • FLUX.1: ~24GB for full precision, ~12GB for FP16
  • SDXL: ~7GB for FP16, ~14GB for FP32
  • SD 1.5: ~2GB for FP16, ~4GB for FP32

Optimization Combinations by VRAM

24GB+ VRAM (High-end)

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

12-24GB VRAM (Mid-range)

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# Do not call pipe.to("cuda") here; enable_model_cpu_offload() manages device placement
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()

8-12GB VRAM (Entry-level)

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_xformers_memory_efficient_attention()

<8GB VRAM (Low-end)

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing("max")
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

IMPORTANT: For FLUX.1-schnell models, do NOT include guidance_scale parameter as it's not needed.

Using the OPTIMIZATION KNOWLEDGE BASE above, generate Python code that:

  1. Selects the best optimization techniques for the specific hardware profile
  2. Applies appropriate memory optimizations based on available VRAM
  3. Uses optimal data types for the target hardware:
    • User specified dtype (if provided): Use exactly as specified
    • Apple Silicon (MPS): prefer torch.bfloat16
    • NVIDIA GPUs: prefer torch.float16 or torch.bfloat16
    • CPU only: use torch.float32
  4. Implements hardware-specific optimizations (CUDA, MPS, CPU)
  5. Follows model-specific guidelines (e.g., FLUX guidance_scale handling)
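
As an illustration of points 3 and 4, a small hardware-dispatch helper might look like the following sketch (the function name and fallback choices are assumptions, not part of the task spec):

import torch

def pick_device_and_dtype(user_dtype=None):
    # A user-specified dtype always wins; otherwise choose per hardware
    if torch.cuda.is_available():
        device, dtype = "cuda", torch.float16      # or torch.bfloat16 on Ampere and newer
    elif torch.backends.mps.is_available():
        device, dtype = "mps", torch.bfloat16      # preferred on Apple Silicon
    else:
        device, dtype = "cpu", torch.float32       # CPU-only fallback
    return device, user_dtype or dtype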

IMPORTANT GUIDELINES:

  • Reference the OPTIMIZATION KNOWLEDGE BASE to select appropriate techniques
  • Include all necessary imports
  • Add brief comments explaining optimization choices
  • Generate compact, production-ready code
  • Inline values where possible for concise code
  • Generate ONLY the Python code, no explanations before or after the code block
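
For reference, one hedged sketch of code satisfying the profile above (CPU-only, balanced); it assumes the model repo loads as a full FluxPipeline, interprets 768x1152 as width x height, and treats the technique mix as a judgment call rather than the canonical answer:

import torch
from diffusers import FluxPipeline

# CPU-only host: float32 plus attention slicing, per the CPU guidelines above
pipe = FluxPipeline.from_pretrained("LPX55/FLUX.1-merged_lightning_v2", torch_dtype=torch.float32)
pipe.enable_attention_slicing()

# Lightning-distilled FLUX variant: few steps, no guidance_scale
image = pipe(
    "A cat holding a sign that says hello world",
    width=768, height=1152,
    num_inference_steps=8,
).images[0]
image.save("output.png")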