You are an expert in optimizing diffusers library code for different hardware configurations.
NOTE: This system includes curated optimization knowledge from HuggingFace documentation.
TASK: Generate optimized Python code for running a diffusion model with the following specifications:
- Model: LPX55/FLUX.1-merged_lightning_v2
- Prompt: "A cat holding a sign that says hello world"
- Image size: 768x1152
- Inference steps: 8
HARDWARE SPECIFICATIONS:
- Platform: Linux (manual_input)
- CPU Cores: 8
- CUDA Available: False
- MPS Available: False
- Optimization Profile: balanced
- GPU: Custom GPU (20.0 GB VRAM)
OPTIMIZATION KNOWLEDGE BASE:
DIFFUSERS OPTIMIZATION TECHNIQUES
Memory Optimization Techniques
1. Model CPU Offloading
Use enable_model_cpu_offload() to move models between GPU and CPU automatically:
pipe.enable_model_cpu_offload()
- Saves significant VRAM by keeping only active models on GPU
- Automatic management, no manual intervention needed
- Compatible with all pipelines
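In context, a minimal sketch (assumes a CUDA machine and an SD 1.5-class checkpoint; the repo id is illustrative):
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # no pipe.to("cuda") beforehand; offloading places modules itself
image = pipe("a lighthouse at dusk").images[0]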
2. Sequential CPU Offloading
Use enable_sequential_cpu_offload() for more aggressive memory saving:
pipe.enable_sequential_cpu_offload()
- More memory efficient than model offloading
- Moves each submodule to the GPU only for its forward pass, then back to CPU
- Best for very limited VRAM scenarios
3. Attention Slicing
Use enable_attention_slicing() to reduce memory during attention computation:
pipe.enable_attention_slicing()
# or specify slice size
pipe.enable_attention_slicing("max") # maximum slicing
pipe.enable_attention_slicing(1) # slice_size = 1
- Trades compute time for memory
- Most effective for high-resolution images
- Can be combined with other techniques
4. VAE Slicing
Use enable_vae_slicing() for large batch processing:
pipe.enable_vae_slicing()
- Decodes images one at a time instead of all at once
- Essential for batch sizes > 4
- Minimal performance impact on single images
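A minimal batch sketch (assumes a CUDA GPU and an SD 1.5-class checkpoint; the repo id is illustrative):
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.enable_vae_slicing()
# Eight prompts: latents are decoded one image at a time, capping VAE peak memory
images = pipe(["a watercolor fox"] * 8, num_inference_steps=25).images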
5. VAE Tiling
Use enable_vae_tiling() for high-resolution image generation:
pipe.enable_vae_tiling()
- Enables 4K+ image generation on 8GB VRAM
- Splits images into overlapping tiles
- Automatically disabled for 512x512 or smaller images
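A high-resolution sketch (assumes the SDXL base checkpoint on a CUDA GPU):
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_vae_tiling()
# The 2048x2048 decode now runs tile-by-tile instead of as one huge tensor
image = pipe("aerial view of a coastline", height=2048, width=2048).images[0]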
6. Memory Efficient Attention (xFormers)
Use enable_xformers_memory_efficient_attention() if xFormers is installed:
pipe.enable_xformers_memory_efficient_attention()
- Significantly reduces memory usage and improves speed
- Requires xformers library installation
- Compatible with most models
Performance Optimization Techniques
1. Half Precision (FP16/BF16)
Use lower precision for better memory and speed:
# FP16 (widely supported)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# BF16 (better numerical stability, newer hardware)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
- FP16: Halves memory usage, widely supported
- BF16: Better numerical stability, requires newer GPUs
- Essential for most optimization scenarios
2. Torch Compile (PyTorch 2.0+)
Use torch.compile() for significant speed improvements:
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# For transformer-based pipelines (e.g. FLUX), compile pipe.transformer instead
# For some models, compile the VAE decode step too:
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
- 5-50% speed improvement
- Requires PyTorch 2.0+
- First run is slower due to compilation
3. Fast Schedulers
Use faster schedulers to cut step counts (applies to UNet-based pipelines; FLUX ships its own flow-matching scheduler):
from diffusers import LMSDiscreteScheduler, UniPCMultistepScheduler
# LMS Scheduler (good quality, fast)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
# UniPC Scheduler (fastest)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
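Usage sketch (assumes an SD 1.5-class checkpoint; with UniPC, ~20 steps typically replaces the default 50):
import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("a lighthouse at dusk", num_inference_steps=20).images[0]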
Hardware-Specific Optimizations
NVIDIA GPU Optimizations
# Let cuDNN autotune convolution algorithms
torch.backends.cudnn.benchmark = True
# Allow TF32 on Ampere+ Tensor Cores for faster float32 matmuls
torch.backends.cuda.matmul.allow_tf32 = True
# Optimal data type for NVIDIA
torch_dtype = torch.float16 # or torch.bfloat16 for RTX 30/40 series
Apple Silicon (MPS) Optimizations
# Use MPS device
device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = pipe.to(device)
# Recommended dtype for Apple Silicon
torch_dtype = torch.bfloat16 # Better than float16 on Apple Silicon
# Attention slicing often helps on MPS
pipe.enable_attention_slicing()
CPU Optimizations
# Use float32 for CPU
torch_dtype = torch.float32
# Slice attention to reduce peak memory
pipe.enable_attention_slicing()
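A complete CPU-only sketch (the thread count and checkpoint id are assumptions; match threads to your physical cores):
import torch
from diffusers import StableDiffusionPipeline

torch.set_num_threads(8)  # assumed 8 physical cores
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
)
pipe.enable_attention_slicing()
image = pipe("a lighthouse at dusk", num_inference_steps=25).images[0]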
Model-Specific Guidelines
FLUX Models
- Do NOT pass guidance_scale to timestep-distilled FLUX variants (e.g. schnell, lightning merges); it has no effect there
- Use 4-8 inference steps maximum
- BF16 dtype recommended
- Enable attention slicing for memory optimization
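A minimal FLUX sketch (assumes the FLUX.1-schnell checkpoint and enough system RAM for offloading; see the VRAM combinations below for tighter budgets):
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # FLUX is large; offload unless VRAM is ample
# Distilled FLUX: few steps, and no guidance_scale argument
image = pipe("a cat reading a newspaper", num_inference_steps=4).images[0]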
Stable Diffusion XL
- Enable attention slicing for high resolutions
- Use refiner model sparingly to save memory
- Consider VAE tiling for >1024px images
Stable Diffusion 1.5/2.1
- Very memory efficient base models
- Can often run without optimizations on 8GB+ VRAM
- Enable VAE slicing for batch processing
Memory Usage Estimation
- FLUX.1: ~24GB for full precision, ~12GB for FP16
- SDXL: ~7GB for FP16, ~14GB for FP32
- SD 1.5: ~2GB for FP16, ~4GB for FP32
Optimization Combinations by VRAM
24GB+ VRAM (High-end)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
12-24GB VRAM (Mid-range)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# Do not call pipe.to("cuda") first; offloading manages device placement itself
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()
8-12GB VRAM (Entry-level)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_xformers_memory_efficient_attention()
<8GB VRAM (Low-end)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing("max")
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
IMPORTANT: For FLUX.1-schnell models, do NOT include guidance_scale parameter as it's not needed.
Using the OPTIMIZATION KNOWLEDGE BASE above, generate Python code that:
- Selects the best optimization techniques for the specific hardware profile
- Applies appropriate memory optimizations based on available VRAM
- Uses optimal data types for the target hardware (a helper sketch follows this list):
- User specified dtype (if provided): Use exactly as specified
- Apple Silicon (MPS): prefer torch.bfloat16
- NVIDIA GPUs: prefer torch.float16 or torch.bfloat16
- CPU only: use torch.float32
- Implements hardware-specific optimizations (CUDA, MPS, CPU)
- Follows model-specific guidelines (e.g., FLUX guidance_scale handling)
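One way to encode the dtype rules above (pick_dtype is a hypothetical helper, not part of any library):
import torch

def pick_dtype(user_dtype=None):
    # A user-specified dtype always wins
    if user_dtype is not None:
        return user_dtype
    if torch.cuda.is_available():
        # BF16 on GPUs that support it, otherwise FP16
        return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    if torch.backends.mps.is_available():
        return torch.bfloat16  # preferred on Apple Silicon
    return torch.float32  # CPU fallback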
IMPORTANT GUIDELINES:
- Reference the OPTIMIZATION KNOWLEDGE BASE to select appropriate techniques
- Include all necessary imports
- Add brief comments explaining optimization choices
- Generate compact, production-ready code
- Inline values where possible for concise code
- Generate ONLY the Python code, no explanations before or after the code block
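For reference, a hedged sketch of what the generated code might look like for this CPU-only profile (assumes LPX55/FLUX.1-merged_lightning_v2 loads via the standard FluxPipeline, that enough system RAM is available for float32 weights, and that 768x1152 means width x height):
import torch
from diffusers import FluxPipeline

torch.set_num_threads(8)  # match the 8 CPU cores in the hardware profile

# CUDA and MPS are unavailable, so run on CPU with float32 (see CPU Optimizations)
pipe = FluxPipeline.from_pretrained(
    "LPX55/FLUX.1-merged_lightning_v2", torch_dtype=torch.float32
)
pipe.enable_attention_slicing()  # trade compute for memory at this resolution
pipe.vae.enable_tiling()         # decode 768x1152 in tiles to cap peak RAM

# Distilled FLUX merge: 8 steps, no guidance_scale
image = pipe(
    "A cat holding a sign that says hello world",
    width=768, height=1152,
    num_inference_steps=8,
).images[0]
image.save("output.png")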