dynamic-hfspaces / auto-diffuser.md

You are an expert in optimizing diffusers library code for different hardware configurations.

NOTE: This system includes curated optimization knowledge from HuggingFace documentation.

TASK: Generate optimized Python code for running a diffusion model with the following specifications:

  • Model: LPX55/FLUX.1-merged_lightning_v2
  • Prompt: "A cat holding a sign that says hello world"
  • Image size: 768x1152
  • Inference steps: 8

HARDWARE SPECIFICATIONS:

  • Platform: Linux (manual_input)
  • CPU Cores: 8
  • CUDA Available: False
  • MPS Available: False
  • Optimization Profile: balanced
  • GPU: Custom GPU (20.0 GB VRAM)

OPTIMIZATION KNOWLEDGE BASE:

DIFFUSERS OPTIMIZATION TECHNIQUES

Memory Optimization Techniques

1. Model CPU Offloading

Use enable_model_cpu_offload() to move models between GPU and CPU automatically:

pipe.enable_model_cpu_offload()
  • Saves significant VRAM by keeping only active models on GPU
  • Automatic management, no manual intervention needed
  • Compatible with all pipelines

2. Sequential CPU Offloading

Use enable_sequential_cpu_offload() for more aggressive memory saving:

pipe.enable_sequential_cpu_offload()
  • More memory efficient than model offloading
  • Moves models to CPU after each forward pass
  • Best for very limited VRAM scenarios

3. Attention Slicing

Use enable_attention_slicing() to reduce memory during attention computation:

pipe.enable_attention_slicing()
# or specify slice size
pipe.enable_attention_slicing("max")  # maximum slicing
pipe.enable_attention_slicing(1)      # slice_size = 1
  • Trades compute time for memory
  • Most effective for high-resolution images
  • Can be combined with other techniques

4. VAE Slicing

Use enable_vae_slicing() for large batch processing:

pipe.enable_vae_slicing()
  • Decodes images one at a time instead of all at once
  • Essential for batch sizes > 4
  • Minimal performance impact on single images

5. VAE Tiling

Use enable_vae_tiling() for high-resolution image generation:

pipe.enable_vae_tiling()
  • Enables 4K+ image generation on 8GB VRAM
  • Splits images into overlapping tiles
  • Automatically disabled for 512x512 or smaller images

6. Memory Efficient Attention (xFormers)

Use enable_xformers_memory_efficient_attention() if xFormers is installed:

pipe.enable_xformers_memory_efficient_attention()
  • Significantly reduces memory usage and improves speed
  • Requires xformers library installation
  • Compatible with most models

Performance Optimization Techniques

1. Half Precision (FP16/BF16)

Use lower precision for better memory and speed:

# FP16 (widely supported)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

# BF16 (better numerical stability, newer hardware)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
  • FP16: Halves memory usage, widely supported
  • BF16: Better numerical stability, requires newer GPUs
  • Essential for most optimization scenarios

2. Torch Compile (PyTorch 2.0+)

Use torch.compile() for significant speed improvements:

pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# For transformer-based pipelines (e.g., FLUX), compile pipe.transformer instead of pipe.unet
# For some models, compile the VAE decode step too:
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
  • 5-50% speed improvement
  • Requires PyTorch 2.0+
  • First run is slower due to compilation

3. Fast Schedulers

Use faster schedulers for fewer steps:

from diffusers import LMSDiscreteScheduler, UniPCMultistepScheduler

# LMS Scheduler (good quality, fast)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

# UniPC Scheduler (fastest)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

Hardware-Specific Optimizations

NVIDIA GPU Optimizations

# Enable cuDNN autotuning (picks the fastest convolution kernels for fixed input sizes)
torch.backends.cudnn.benchmark = True

# Allow TF32 on Ampere and newer GPUs so matmuls can use Tensor Cores
torch.backends.cuda.matmul.allow_tf32 = True

# Optimal data type for NVIDIA
torch_dtype = torch.float16  # or torch.bfloat16 for RTX 30/40 series

Apple Silicon (MPS) Optimizations

# Use MPS device
device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = pipe.to(device)

# Recommended dtype for Apple Silicon
torch_dtype = torch.bfloat16  # Better than float16 on Apple Silicon

# Attention slicing often helps on MPS
pipe.enable_attention_slicing()

CPU Optimizations

# Use float32 for CPU
torch_dtype = torch.float32

# Enable optimized attention
pipe.enable_attention_slicing()
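
On CPU-only hosts where automatic thread detection is unreliable (for example, inside containers), explicitly pinning the thread count to the available cores can also help; a minimal sketch, assuming the 8-core profile above:

import torch

# Match intra-op threads to the CPU cores reported in the hardware profile (8 here)
torch.set_num_threads(8)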

Model-Specific Guidelines

FLUX Models

  • Do NOT pass guidance_scale to distilled FLUX checkpoints such as FLUX.1-schnell (it is not needed there)
  • Use 4-8 inference steps maximum
  • BF16 dtype recommended
  • Enable attention slicing for memory optimization
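
Taken together, a minimal sketch of these FLUX guidelines (the checkpoint, step count, and prompt are illustrative, and a CUDA GPU with sufficient VRAM is assumed):

import torch
from diffusers import FluxPipeline

# BF16 recommended for FLUX; guidance_scale is omitted for distilled/schnell-style checkpoints
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()   # keeps VRAM usage low on a CUDA machine
pipe.enable_attention_slicing()   # memory optimization per the guideline above

image = pipe("A cat holding a sign that says hello world", num_inference_steps=4).images[0]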

Stable Diffusion XL

  • Enable attention slicing for high resolutions
  • Use refiner model sparingly to save memory
  • Consider VAE tiling for >1024px images
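
A minimal sketch of the SDXL guidelines above (base model only, refiner omitted to save memory; the model id, resolution, and step count are illustrative, and a CUDA GPU is assumed):

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()  # helps at high resolutions
pipe.enable_vae_tiling()         # recommended above for >1024px outputs

image = pipe("a photo of an astronaut riding a horse", height=1536, width=1536, num_inference_steps=30).images[0]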

Stable Diffusion 1.5/2.1

  • Very memory efficient base models
  • Can often run without optimizations on 8GB+ VRAM
  • Enable VAE slicing for batch processing
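
For batch generation with SD 1.5, VAE slicing as recommended above might look like this sketch (the model id and batch size are illustrative assumptions):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.enable_vae_slicing()  # decode the batch one image at a time to cap peak VRAM

images = pipe(["a watercolor landscape"] * 6, num_inference_steps=25).images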

Memory Usage Estimation

  • FLUX.1: ~24GB for full precision, ~12GB for FP16
  • SDXL: ~7GB for FP16, ~14GB for FP32
  • SD 1.5: ~2GB for FP16, ~4GB for FP32

Optimization Combinations by VRAM

24GB+ VRAM (High-end)

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

12-24GB VRAM (Mid-range)

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# Do not call pipe.to("cuda") here; enable_model_cpu_offload() manages device placement
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()

8-12GB VRAM (Entry-level)

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_xformers_memory_efficient_attention()

<8GB VRAM (Low-end)

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing("max")
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

IMPORTANT: For FLUX.1-schnell models, do NOT include guidance_scale parameter as it's not needed.

Using the OPTIMIZATION KNOWLEDGE BASE above, generate Python code that:

  1. Selects the best optimization techniques for the specific hardware profile
  2. Applies appropriate memory optimizations based on available VRAM
  3. Uses optimal data types for the target hardware:
    • User specified dtype (if provided): Use exactly as specified
    • Apple Silicon (MPS): prefer torch.bfloat16
    • NVIDIA GPUs: prefer torch.float16 or torch.bfloat16
    • CPU only: use torch.float32
  4. Implements hardware-specific optimizations (CUDA, MPS, CPU)
  5. Follows model-specific guidelines (e.g., FLUX guidance_scale handling)
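
As an illustration of points 3 and 4, a small hardware-dispatch helper might look like the following sketch (the function name and fallback choices are assumptions, not part of the task spec):

import torch

def pick_device_and_dtype(user_dtype=None):
    # A user-specified dtype always wins; otherwise choose per hardware
    if torch.cuda.is_available():
        device, dtype = "cuda", torch.float16      # or torch.bfloat16 on Ampere and newer
    elif torch.backends.mps.is_available():
        device, dtype = "mps", torch.bfloat16      # preferred on Apple Silicon
    else:
        device, dtype = "cpu", torch.float32       # CPU-only fallback
    return device, user_dtype or dtype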

IMPORTANT GUIDELINES:

  • Reference the OPTIMIZATION KNOWLEDGE BASE to select appropriate techniques
  • Include all necessary imports
  • Add brief comments explaining optimization choices
  • Generate compact, production-ready code
  • Inline values where possible for concise code
  • Generate ONLY the Python code, no explanations before or after the code block
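
For reference, one hedged sketch of code satisfying the profile above (CPU-only, balanced); it assumes the model repo loads as a full FluxPipeline, interprets 768x1152 as width x height, and treats the technique mix as a judgment call rather than the canonical answer:

import torch
from diffusers import FluxPipeline

# CPU-only host: float32 plus attention slicing, per the CPU guidelines above
pipe = FluxPipeline.from_pretrained("LPX55/FLUX.1-merged_lightning_v2", torch_dtype=torch.float32)
pipe.enable_attention_slicing()

# Lightning-distilled FLUX variant: few steps, no guidance_scale
image = pipe(
    "A cat holding a sign that says hello world",
    width=768, height=1152,
    num_inference_steps=8,
).images[0]
image.save("output.png")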