jbilcke-hf (HF Staff) committed
Commit b2c19b1 · 1 Parent(s): d1a3122
Files changed (3):
  1. README.md +10 -105
  2. app.py +43 -15
  3. handler.py +0 -545
README.md CHANGED
@@ -1,116 +1,21 @@
1
  ---
2
  emoji: 🎥
3
- title: 'Self Forcing Wan 2.1 '
4
- short_description: Real-time video generation
5
  sdk: gradio
6
- sdk_version: 5.34.2
7
  ---
8
- <p align="center">
9
- <h1 align="center">Self Forcing</h1>
10
- <h3 align="center">Bridging the Train-Test Gap in Autoregressive Video Diffusion</h3>
11
- </p>
12
- <p align="center">
13
- <p align="center">
14
- <a href="https://www.xunhuang.me/">Xun Huang</a><sup>1</sup>
15
- ·
16
- <a href="https://zhengqili.github.io/">Zhengqi Li</a><sup>1</sup>
17
- ·
18
- <a href="https://guandehe.github.io/">Guande He</a><sup>2</sup>
19
- ·
20
- <a href="https://mingyuanzhou.github.io/">Mingyuan Zhou</a><sup>2</sup>
21
- ·
22
- <a href="https://research.adobe.com/person/eli-shechtman/">Eli Shechtman</a><sup>1</sup><br>
23
- <sup>1</sup>Adobe Research <sup>2</sup>UT Austin
24
- </p>
25
- <h3 align="center"><a href="https://arxiv.org/abs/2506.08009">Paper</a> | <a href="https://self-forcing.github.io">Website</a> | <a href="https://huggingface.co/gdhe17/Self-Forcing/tree/main">Models (HuggingFace)</a></h3>
26
- </p>
27
-
28
- ---
29
-
30
- Self Forcing trains autoregressive video diffusion models by **simulating the inference process during training**, performing autoregressive rollout with KV caching. It resolves the train-test distribution mismatch and enables **real-time, streaming video generation on a single RTX 4090** while matching the quality of state-of-the-art diffusion models.
31
-
32
- ---
33
-
34
-
35
- https://github.com/user-attachments/assets/7548c2db-fe03-4ba8-8dd3-52d2c6160739
36
 
 
37
 
38
- ## Requirements
39
- We tested this repo on the following setup:
40
- * Nvidia GPU with at least 24 GB memory (RTX 4090, A100, and H100 are tested).
41
- * Linux operating system.
42
- * 64 GB RAM.
43
 
44
- Other hardware setup could also work but hasn't been tested.
45
 
46
- ## Installation
47
- Create a conda environment and install dependencies:
48
- ```
49
- conda create -n self_forcing python=3.10 -y
50
- conda activate self_forcing
51
- pip install -r requirements.txt
52
- pip install flash-attn --no-build-isolation
53
- python setup.py develop
54
- ```
55
-
56
- ## Quick Start
57
- ### Download checkpoints
58
- ```
59
- huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B
60
- huggingface-cli download gdhe17/Self-Forcing checkpoints/self_forcing_dmd.pt --local-dir .
61
- ```
62
-
63
- ### GUI demo
64
- ```
65
- python demo.py
66
- ```
67
- Note:
68
- * **Our model works better with long, detailed prompts** since it's trained with such prompts. We will integrate prompt extension into the codebase (similar to [Wan2.1](https://github.com/Wan-Video/Wan2.1/tree/main?tab=readme-ov-file#2-using-prompt-extention)) in the future. For now, it is recommended to use third-party LLMs (such as GPT-4o) to extend your prompt before providing to the model.
69
- * You may want to adjust FPS so it plays smoothly on your device.
70
- * The speed can be improved by enabling `torch.compile`, [TAEHV-VAE](https://github.com/madebyollin/taehv/), or using FP8 Linear layers, although the latter two options may sacrifice quality. It is recommended to use `torch.compile` if possible and enable TAEHV-VAE if further speedup is needed.
71
-
72
- ### CLI Inference
73
- Example inference script using the chunk-wise autoregressive checkpoint trained with DMD:
74
- ```
75
- python inference.py \
76
- --config_path configs/self_forcing_dmd.yaml \
77
- --output_folder videos/self_forcing_dmd \
78
- --checkpoint_path checkpoints/self_forcing_dmd.pt \
79
- --data_path prompts/MovieGenVideoBench_extended.txt \
80
- --use_ema
81
- ```
82
- Other config files and corresponding checkpoints can be found in [configs](configs) folder and our [huggingface repo](https://huggingface.co/gdhe17/Self-Forcing/tree/main/checkpoints).
83
-
84
- ## Training
85
- ### Download text prompts and ODE initialized checkpoint
86
- ```
87
- huggingface-cli download gdhe17/Self-Forcing checkpoints/ode_init.pt --local-dir .
88
- huggingface-cli download gdhe17/Self-Forcing vidprom_filtered_extended.txt --local-dir prompts
89
- ```
90
- Note: Our training algorithm (except for the GAN version) is data-free (**no video data is needed**). For now, we directly provide the ODE initialization checkpoint and will add more instructions on how to perform ODE initialization in the future (which is identical to the process described in the [CausVid](https://github.com/tianweiy/CausVid) repo).
91
 
92
- ### Self Forcing Training with DMD
93
- ```
94
- torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
95
- --rdzv_backend=c10d \
96
- --rdzv_endpoint $MASTER_ADDR \
97
- train.py \
98
- --config_path configs/self_forcing_dmd.yaml \
99
- --logdir logs/self_forcing_dmd \
100
- --disable-wandb
101
- ```
102
- Our training run uses 600 iterations and completes in under 2 hours using 64 H100 GPUs. By implementing gradient accumulation, it should be possible to reproduce the results in less than 16 hours using 8 H100 GPUs.
103
 
104
- ## Acknowledgements
105
  This codebase is built on top of the open-source implementation of [CausVid](https://github.com/tianweiy/CausVid) by [Tianwei Yin](https://tianweiy.github.io/) and the [Wan2.1](https://github.com/Wan-Video/Wan2.1) repo.
106
-
107
- ## Citation
108
- If you find this codebase useful for your research, please kindly cite our paper:
109
- ```
110
- @article{huang2025selfforcing,
111
- title={Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion},
112
- author={Huang, Xun and Li, Zhengqi and He, Guande and Zhou, Mingyuan and Shechtman, Eli},
113
- journal={arXiv preprint arXiv:2506.08009},
114
- year={2025}
115
- }
116
- ```
 
1
  ---
2
  emoji: 🎥
3
+ title: 'Self-Forcing Wan2.1-1.3B'
4
+ short_description: MCP server for real-time video generation
5
  sdk: gradio
6
+ sdk_version: 5.35.0
7
  ---
8
 
9
+ This Space is meant to be used as an API/MCP server by generative AI video apps.
10
 
11
+ It is best to run it on a large GPU such as an Nvidia A100, H100, or H200.
12
 
13
+ It can run on an Nvidia L40S, but not in real time.
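For illustration, a client app could call this Space with `gradio_client` along the lines of the sketch below. The Space id is a placeholder, and the endpoint name assumes Gradio's default naming for the `video_generation_handler_streaming` handler wired up in `app.py`; check the Space's "Use via API" page (or its MCP endpoint, if enabled) for the exact name and signature.

```python
# Rough sketch of calling the streaming endpoint from a client app.
# Assumptions: placeholder Space id, default Gradio api_name derived from the
# handler function, and the parameter order (prompt, seed, fps, width, height,
# duration, buffering) used by app.py in this commit.
import time
from gradio_client import Client

client = Client("your-username/your-space-name")  # placeholder Space id

job = client.submit(
    "A cinematic shot of a sailboat gliding across a calm sea at sunset",  # prompt
    -1,    # seed (-1 = random)
    15,    # fps
    400,   # width
    224,   # height
    5,     # duration in seconds
    2,     # buffering in seconds
    api_name="/video_generation_handler_streaming",  # assumed default name
)

# The endpoint is a generator, so (video chunk, status HTML) outputs arrive as a series.
while not job.done():
    time.sleep(0.5)
for output in job.outputs():
    print(output)
```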
14
 
15
+ ## Acknowledgements
16
 
17
+ If you want to use this codebase for your project, I invite you to use the [original repository](https://huggingface.co/spaces/multimodalart/self-forcing) instead; it will fit your needs better, since it still contains features that I removed here because I didn't need them.
18
+
19
+ A big thank you to [Multimodalart](https://huggingface.co/multimodalart/), who created the original repo!
20
 
 
21
  This codebase is built on top of the open-source implementation of [CausVid](https://github.com/tianweiy/CausVid) by [Tianwei Yin](https://tianweiy.github.io/) and the [Wan2.1](https://github.com/Wan-Video/Wan2.1) repo.
app.py CHANGED
@@ -91,8 +91,11 @@ APP_STATE = {
91
  "current_vae_decoder": None,
92
  }
93

94
  # Apply torch.compile for maximum performance
95
- if not APP_STATE["torch_compile_applied"]:
96
  print("🚀 Applying torch.compile for speed optimization...")
97
  transformer.compile(mode="max-autotune-no-cudagraphs")
98
  APP_STATE["torch_compile_applied"] = True
@@ -213,7 +216,7 @@ pipeline = CausalInferencePipeline(
213
  pipeline.to(dtype=torch.float16).to(gpu)
214
 
215
  @torch.no_grad()
216
- def video_generation_handler_streaming(prompt, seed=42, fps=15, width=400, height=224):
217
  """
218
  Generator function that yields .ts video chunks using PyAV for streaming.
219
  Now optimized for block-based processing.
@@ -221,7 +224,18 @@ def video_generation_handler_streaming(prompt, seed=42, fps=15, width=400, heigh
221
  if seed == -1:
222
  seed = random.randint(0, 2**32 - 1)
223
 
224
- print(f"🎬 Starting PyAV streaming: '{prompt}', seed: {seed}")
225
 
226
  # Setup
227
  conditional_dict = text_encoder(text_prompts=[prompt])
@@ -237,7 +251,13 @@ def video_generation_handler_streaming(prompt, seed=42, fps=15, width=400, heigh
237
  if not APP_STATE["current_use_taehv"] and not args.trt:
238
  vae_cache = [c.to(device=gpu, dtype=torch.float16) for c in ZERO_VAE_CACHE]
239
 
240
- num_blocks = 7
241
  current_start_frame = 0
242
  all_num_frames = [pipeline.num_frame_per_block] * num_blocks
243
 
@@ -394,16 +414,6 @@ with gr.Blocks(title="Self-Forcing Streaming Demo") as demo:
394
 
395
  start_btn = gr.Button("🎬 Start Streaming", variant="primary", size="lg")
396
 
397
- gr.Markdown("### 🎯 Examples")
398
- gr.Examples(
399
- examples=[
400
- "A close-up shot of a ceramic teacup slowly pouring water into a glass mug.",
401
- "A playful cat is seen playing an electronic guitar, strumming the strings with its front paws. The cat has distinctive black facial markings and a bushy tail. It sits comfortably on a small stool, its body slightly tilted as it focuses intently on the instrument. The setting is a cozy, dimly lit room with vintage posters on the walls, adding a retro vibe. The cat's expressive eyes convey a sense of joy and concentration. Medium close-up shot, focusing on the cat's face and hands interacting with the guitar.",
402
- "A dynamic over-the-shoulder perspective of a chef meticulously plating a dish in a bustling kitchen. The chef, a middle-aged woman, deftly arranges ingredients on a pristine white plate. Her hands move with precision, each gesture deliberate and practiced. The background shows a crowded kitchen with steaming pots, whirring blenders, and the clatter of utensils. Bright lights highlight the scene, casting shadows across the busy workspace. The camera angle captures the chef's detailed work from behind, emphasizing his skill and dedication.",
403
- ],
404
- inputs=[prompt],
405
- )
406
-
407
  gr.Markdown("### ⚙️ Settings")
408
  with gr.Row():
409
  seed = gr.Number(
@@ -422,6 +432,24 @@ with gr.Blocks(title="Self-Forcing Streaming Demo") as demo:
422
  info="Frames per second for playback"
423
  )
424

425
  with gr.Row():
426
  width = gr.Slider(
427
  label="Width",
@@ -465,7 +493,7 @@ with gr.Blocks(title="Self-Forcing Streaming Demo") as demo:
465
  # Connect the generator to the streaming video
466
  start_btn.click(
467
  fn=video_generation_handler_streaming,
468
- inputs=[prompt, seed, fps, width, height],
469
  outputs=[streaming_video, status_display]
470
  )
471
 
 
91
  "current_vae_decoder": None,
92
  }
93
 
94
+ # I've tried enabling torch.compile, but didn't notice a significant performance improvement.
95
+ ENABLE_TORCH_COMPILATION = False
96
+
97
  # Apply torch.compile for maximum performance
98
+ if not APP_STATE["torch_compile_applied"] and ENABLE_TORCH_COMPILATION:
99
  print("🚀 Applying torch.compile for speed optimization...")
100
  transformer.compile(mode="max-autotune-no-cudagraphs")
101
  APP_STATE["torch_compile_applied"] = True
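If the toggle needs to change per deployment, a possible variation is to read it from an environment variable instead of a hard-coded constant; a minimal sketch (the variable name is hypothetical):

```python
# Minimal sketch: drive the toggle from an environment variable
# (the ENABLE_TORCH_COMPILATION variable name is hypothetical).
import os

ENABLE_TORCH_COMPILATION = os.getenv("ENABLE_TORCH_COMPILATION", "false").lower() in ("1", "true", "yes")
```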
 
216
  pipeline.to(dtype=torch.float16).to(gpu)
217
 
218
  @torch.no_grad()
219
+ def video_generation_handler_streaming(prompt, seed=42, fps=15, width=400, height=224, duration=5, buffering=2):
220
  """
221
  Generator function that yields .ts video chunks using PyAV for streaming.
222
  Now optimized for block-based processing.
 
224
  if seed == -1:
225
  seed = random.randint(0, 2**32 - 1)
226
 
227
+ print(f"🎬 Starting PyAV streaming: '{prompt}', seed: {seed}, duration: {duration}s, buffering: {buffering}s")
228
+
229
+ # Buffering delay
230
+ if buffering > 0:
231
+ buffering_status_html = (
232
+ f"<div style='padding: 10px; border: 1px solid #ffc107; background: #fff3cd; border-radius: 8px; font-family: sans-serif;'>"
233
+ f" <p style='margin: 0 0 8px 0; font-size: 16px; font-weight: bold;'>⏳ Buffering...</p>"
234
+ f" <p style='margin: 0; color: #856404; font-size: 14px;'>Waiting {buffering} seconds before starting stream</p>"
235
+ f"</div>"
236
+ )
237
+ yield None, buffering_status_html
238
+ time.sleep(buffering)
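While buffering, the generator yields `None` in the video slot and puts all information in the status HTML, so a consumer can tell the warm-up phase apart from real chunks; a small, hypothetical consumer sketch:

```python
# Hypothetical consumer of the (video_chunk, status_html) pairs yielded by the
# generator above: during buffering the video slot is None.
def consume(stream):
    for video_chunk, status_html in stream:
        if video_chunk is None:
            print("status update:", status_html)         # buffering / progress messages
        else:
            print("received video chunk:", video_chunk)  # a .ts segment ready to play
```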
239
 
240
  # Setup
241
  conditional_dict = text_encoder(text_prompts=[prompt])
 
251
  if not APP_STATE["current_use_taehv"] and not args.trt:
252
  vae_cache = [c.to(device=gpu, dtype=torch.float16) for c in ZERO_VAE_CACHE]
253
 
254
+ # Calculate number of blocks based on duration
255
+ # Current setup generates approximately 5 seconds with 7 blocks
256
+ # So we scale proportionally
257
+ base_duration = 5.0 # seconds
258
+ base_blocks = 7
259
+ num_blocks = max(1, int(base_blocks * duration / base_duration))
260
+
261
  current_start_frame = 0
262
  all_num_frames = [pipeline.num_frame_per_block] * num_blocks
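For reference, the mapping above scales linearly from the baseline of 7 blocks for roughly 5 seconds; because the result is truncated with `int()`, intermediate durations round down. A quick sanity check:

```python
# Sanity check of the duration -> num_blocks mapping used above
# (baseline from this commit: 7 blocks for roughly 5 seconds of video).
base_duration, base_blocks = 5.0, 7
for duration in (1, 3, 5, 10):
    num_blocks = max(1, int(base_blocks * duration / base_duration))
    print(f"{duration:>2}s -> {num_blocks} blocks")
# prints: 1s -> 1, 3s -> 4, 5s -> 7, 10s -> 14
```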
263
 
 
414
 
415
  start_btn = gr.Button("🎬 Start Streaming", variant="primary", size="lg")
416
 
 
 
 
 
 
 
 
 
 
 
417
  gr.Markdown("### ⚙️ Settings")
418
  with gr.Row():
419
  seed = gr.Number(
 
432
  info="Frames per second for playback"
433
  )
434
 
435
+ with gr.Row():
436
+ duration = gr.Slider(
437
+ label="Duration (seconds)",
438
+ minimum=1,
439
+ maximum=10,
440
+ value=5,
441
+ step=1,
442
+ info="Video duration in seconds"
443
+ )
444
+ buffering = gr.Slider(
445
+ label="Buffering (seconds)",
446
+ minimum=0,
447
+ maximum=5,
448
+ value=2,
449
+ step=0.5,
450
+ info="Wait time before starting stream"
451
+ )
452
+
453
  with gr.Row():
454
  width = gr.Slider(
455
  label="Width",
 
493
  # Connect the generator to the streaming video
494
  start_btn.click(
495
  fn=video_generation_handler_streaming,
496
+ inputs=[prompt, seed, fps, width, height, duration, buffering],
497
  outputs=[streaming_video, status_display]
498
  )
499
 
handler.py DELETED
@@ -1,545 +0,0 @@
1
- from dataclasses import dataclass
2
- from pathlib import Path
3
- import logging
4
- import base64
5
- import random
6
- import gc
7
- import os
8
- import numpy as np
9
- import torch
10
- from typing import Dict, Any, Optional, List, Union, Tuple
11
- import json
12
- from omegaconf import OmegaConf
13
- from PIL import Image
14
- import io
15
-
16
- from pipeline import CausalInferencePipeline
17
- from demo_utils.constant import ZERO_VAE_CACHE
18
- from demo_utils.vae_block3 import VAEDecoderWrapper
19
- from utils.wan_wrapper import WanDiffusionWrapper, WanTextEncoder
20
-
21
- # Configure logging
22
- logging.basicConfig(level=logging.INFO)
23
- logger = logging.getLogger(__name__)
24
-
25
- # Get token from environment
26
- hf_token = os.getenv("HF_API_TOKEN")
27
-
28
- # Constraints
29
- MAX_LARGE_SIDE = 1280
30
- MAX_SMALL_SIDE = 768
31
- MAX_FRAMES = 169 # Based on Wan model capabilities
32
-
33
- @dataclass
34
- class GenerationConfig:
35
- """Configuration for video generation using Wan model"""
36
-
37
- # general content settings
38
- prompt: str = ""
39
- negative_prompt: str = "worst quality, lowres, blurry, distorted, cropped, watermarked, watermark, logo, subtitle, subtitles"
40
-
41
- # video model settings
42
- width: int = 960 # Wan model default width
43
- height: int = 576 # Wan model default height
44
-
45
- # number of frames (based on Wan model block structure)
46
- num_frames: int = 105 # 7 blocks * 15 frames per block
47
-
48
- # guidance and sampling settings
49
- guidance_scale: float = 7.5
50
- num_inference_steps: int = 4 # Distilled model uses fewer steps
51
-
52
- # reproducible generation settings
53
- seed: int = -1 # -1 means random seed
54
-
55
- # output settings
56
- fps: int = 15 # FPS of the final video
57
- quality: int = 18 # Video quality (CRF)
58
-
59
- # advanced settings
60
- mixed_precision: bool = True
61
- use_taehv: bool = False # Whether to use TAEHV decoder
62
- use_trt: bool = False # Whether to use TensorRT optimized decoder
63
-
64
- def validate_and_adjust(self) -> 'GenerationConfig':
65
- """Validate and adjust parameters to meet constraints"""
66
- # Ensure dimensions are multiples of 32 and within limits
67
- self.width = max(128, min(MAX_LARGE_SIDE, round(self.width / 32) * 32))
68
- self.height = max(128, min(MAX_LARGE_SIDE, round(self.height / 32) * 32))
69
-
70
- # Ensure frame count is reasonable
71
- self.num_frames = min(self.num_frames, MAX_FRAMES)
72
-
73
- # Set random seed if not specified
74
- if self.seed == -1:
75
- self.seed = random.randint(0, 2**32 - 1)
76
-
77
- return self
78
-
79
- def load_image_to_tensor_with_resize_and_crop(
80
- image_input: Union[str, bytes],
81
- target_height: int = 576,
82
- target_width: int = 960,
83
- quality: int = 100
84
- ) -> torch.Tensor:
85
- """Load and process an image into a tensor for Wan model.
86
-
87
- Args:
88
- image_input: Either a file path (str) or image data (bytes)
89
- target_height: Desired height of output tensor
90
- target_width: Desired width of output tensor
91
- quality: JPEG quality to use when re-encoding
92
- """
93
- # Handle base64 data URI
94
- if isinstance(image_input, str) and image_input.startswith('data:'):
95
- header, encoded = image_input.split(",", 1)
96
- image_data = base64.b64decode(encoded)
97
- image = Image.open(io.BytesIO(image_data)).convert("RGB")
98
- # Handle raw bytes
99
- elif isinstance(image_input, bytes):
100
- image = Image.open(io.BytesIO(image_input)).convert("RGB")
101
- # Handle file path
102
- elif isinstance(image_input, str):
103
- image = Image.open(image_input).convert("RGB")
104
- else:
105
- raise ValueError("image_input must be either a file path, bytes, or base64 data URI")
106
-
107
- # Apply JPEG compression if quality < 100
108
- if quality < 100:
109
- buffer = io.BytesIO()
110
- image.save(buffer, format="JPEG", quality=quality)
111
- buffer.seek(0)
112
- image = Image.open(buffer).convert("RGB")
113
-
114
- # Resize and crop to target dimensions
115
- input_width, input_height = image.size
116
- aspect_ratio_target = target_width / target_height
117
- aspect_ratio_frame = input_width / input_height
118
-
119
- if aspect_ratio_frame > aspect_ratio_target:
120
- new_width = int(input_height * aspect_ratio_target)
121
- new_height = input_height
122
- x_start = (input_width - new_width) // 2
123
- y_start = 0
124
- else:
125
- new_width = input_width
126
- new_height = int(input_width / aspect_ratio_target)
127
- x_start = 0
128
- y_start = (input_height - new_height) // 2
129
-
130
- image = image.crop((x_start, y_start, x_start + new_width, y_start + new_height))
131
- image = image.resize((target_width, target_height))
132
-
133
- # Convert to tensor format expected by Wan model
134
- frame_tensor = torch.tensor(np.array(image)).permute(2, 0, 1).float()
135
- frame_tensor = (frame_tensor / 127.5) - 1.0
136
-
137
- return frame_tensor.unsqueeze(0)
138
-
139
- def initialize_vae_decoder(use_taehv=False, use_trt=False, device="cuda"):
140
- """Initialize VAE decoder based on configuration"""
141
- if use_trt:
142
- from demo_utils.vae import VAETRTWrapper
143
- print("Initializing TensorRT VAE Decoder...")
144
- vae_decoder = VAETRTWrapper()
145
- elif use_taehv:
146
- print("Initializing TAEHV VAE Decoder...")
147
- from demo_utils.taehv import TAEHV
148
- taehv_checkpoint_path = "/repository/taehv/taew2_1.pth"
149
-
150
- if not os.path.exists(taehv_checkpoint_path):
151
- print(f"Downloading TAEHV checkpoint to {taehv_checkpoint_path}...")
152
- os.makedirs("checkpoints", exist_ok=True)
153
- import urllib.request
154
- download_url = "https://github.com/madebyollin/taehv/raw/main/taew2_1.pth"
155
- try:
156
- urllib.request.urlretrieve(download_url, taehv_checkpoint_path)
157
- except Exception as e:
158
- raise RuntimeError(f"Failed to download taew2_1.pth: {e}")
159
-
160
- class DotDict(dict):
161
- __getattr__ = dict.get
162
-
163
- class TAEHVDiffusersWrapper(torch.nn.Module):
164
- def __init__(self):
165
- super().__init__()
166
- self.dtype = torch.float16
167
- self.taehv = TAEHV(checkpoint_path=taehv_checkpoint_path).to(self.dtype)
168
- self.config = DotDict(scaling_factor=1.0)
169
-
170
- def decode(self, latents, return_dict=None):
171
- return self.taehv.decode_video(latents, parallel=True).mul_(2).sub_(1)
172
-
173
- vae_decoder = TAEHVDiffusersWrapper()
174
- else:
175
- print("Initializing Default VAE Decoder...")
176
- vae_decoder = VAEDecoderWrapper()
177
- try:
178
- # I should have called the folder "Wan2.1-T2V-1.3B" instead of "wan2.1"
179
- #vae_state_dict = torch.load('/repository/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth', map_location="cpu")
180
- vae_state_dict = torch.load('/repository/wan2.1/Wan2.1_VAE.pth', map_location="cpu")
181
- decoder_state_dict = {k: v for k, v in vae_state_dict.items() if 'decoder.' in k or 'conv2' in k}
182
- vae_decoder.load_state_dict(decoder_state_dict)
183
- except FileNotFoundError:
184
- print("Warning: Default VAE weights not found.")
185
-
186
- vae_decoder.eval().to(dtype=torch.float16).requires_grad_(False).to(device)
187
- return vae_decoder
188
-
189
- def create_wan_pipeline(
190
- config: GenerationConfig,
191
- device: str = "cuda"
192
- ) -> CausalInferencePipeline:
193
- """Create and configure the Wan video pipeline"""
194
-
195
- # Load configuration
196
- try:
197
- wan_config = OmegaConf.load("/repository/configs/self_forcing_dmd.yaml")
198
- default_config = OmegaConf.load("/repository/configs/default_config.yaml")
199
- wan_config = OmegaConf.merge(default_config, wan_config)
200
- except FileNotFoundError as e:
201
- logger.error(f"Error loading config file: {e}")
202
- raise RuntimeError(f"Config files not found: {e}")
203
-
204
- # Initialize model components
205
- text_encoder = WanTextEncoder()
206
- transformer = WanDiffusionWrapper(is_causal=True)
207
-
208
- # Load checkpoint
209
- checkpoint_path = "/repository/self-forcing/checkpoints/self_forcing_dmd.pt"
210
- try:
211
- state_dict = torch.load(checkpoint_path, map_location="cpu")
212
- transformer.load_state_dict(state_dict.get('generator_ema', state_dict.get('generator')))
213
- except FileNotFoundError as e:
214
- logger.error(f"Error loading checkpoint: {e}")
215
- raise RuntimeError(f"Checkpoint not found: {checkpoint_path}")
216
-
217
- # Move to device and set precision
218
- text_encoder.eval().to(dtype=torch.float16).requires_grad_(False).to(device)
219
- transformer.eval().to(dtype=torch.float16).requires_grad_(False).to(device)
220
-
221
- # Initialize VAE decoder
222
- vae_decoder = initialize_vae_decoder(
223
- use_taehv=config.use_taehv,
224
- use_trt=config.use_trt,
225
- device=device
226
- )
227
-
228
- # Create pipeline
229
- pipeline = CausalInferencePipeline(
230
- wan_config,
231
- device=device,
232
- generator=transformer,
233
- text_encoder=text_encoder,
234
- vae=vae_decoder
235
- )
236
-
237
- pipeline.to(dtype=torch.float16).to(device)
238
-
239
- return pipeline
240
-
241
- def frames_to_video_bytes(frames: List[np.ndarray], fps: int = 15, quality: int = 18) -> bytes:
242
- """Convert frames to MP4 video bytes"""
243
- import tempfile
244
- import subprocess
245
-
246
- with tempfile.TemporaryDirectory() as temp_dir:
247
- # Save frames as images
248
- frame_paths = []
249
- for i, frame in enumerate(frames):
250
- frame_path = os.path.join(temp_dir, f"frame_{i:06d}.png")
251
- Image.fromarray(frame).save(frame_path)
252
- frame_paths.append(frame_path)
253
-
254
- # Create video using ffmpeg
255
- output_path = os.path.join(temp_dir, "output.mp4")
256
- cmd = [
257
- "ffmpeg", "-y", "-framerate", str(fps),
258
- "-i", os.path.join(temp_dir, "frame_%06d.png"),
259
- "-c:v", "libx264", "-crf", str(quality),
260
- "-pix_fmt", "yuv420p", "-movflags", "faststart",
261
- output_path
262
- ]
263
-
264
- try:
265
- subprocess.run(cmd, check=True, capture_output=True)
266
- with open(output_path, "rb") as f:
267
- return f.read()
268
- except subprocess.CalledProcessError as e:
269
- logger.error(f"FFmpeg error: {e}")
270
- raise RuntimeError(f"Video encoding failed: {e}")
271
-
272
- class EndpointHandler:
273
- """Handler for the Wan Video endpoint"""
274
-
275
- def __init__(self, model_path: str = "./"):
276
- """Initialize the endpoint handler
277
-
278
- Args:
279
- model_path: Path to model weights
280
- """
281
- # Enable TF32 for potential speedup on Ampere GPUs
282
- torch.backends.cuda.matmul.allow_tf32 = True
283
-
284
- # The pipeline will be loaded during inference to save memory
285
- self.pipeline = None
286
- self.device = "cuda" if torch.cuda.is_available() else "cpu"
287
-
288
- # Perform warm-up inference if GPU is available
289
- if self.device == "cuda":
290
- logger.info("Performing warm-up inference...")
291
- self._warmup()
292
- logger.info("Warm-up completed!")
293
- else:
294
- logger.info("CPU device detected, skipping warm-up")
295
-
296
- def _warmup(self):
297
- """Perform a warm-up inference to prepare the model for future requests"""
298
- try:
299
- # Create a simple test configuration
300
- test_config = GenerationConfig(
301
- prompt="a cat walking",
302
- negative_prompt="worst quality, lowres",
303
- width=480, # Smaller resolution for faster warm-up
304
- height=320,
305
- num_frames=33, # Fewer frames for faster warm-up
306
- guidance_scale=7.5,
307
- num_inference_steps=2, # Fewer steps for faster warm-up
308
- seed=42, # Fixed seed for consistent warm-up
309
- fps=15,
310
- mixed_precision=True,
311
- ).validate_and_adjust()
312
-
313
- # Create the pipeline if it doesn't exist
314
- if self.pipeline is None:
315
- self.pipeline = create_wan_pipeline(test_config, self.device)
316
-
317
- # Run a quick inference
318
- with torch.no_grad():
319
- # Set seeds for reproducibility
320
- random.seed(test_config.seed)
321
- np.random.seed(test_config.seed)
322
- torch.manual_seed(test_config.seed)
323
-
324
- # Generate video frames (simplified version)
325
- conditional_dict = self.pipeline.text_encoder(text_prompts=[test_config.prompt])
326
- for key, value in conditional_dict.items():
327
- conditional_dict[key] = value.to(dtype=torch.float16)
328
-
329
- rnd = torch.Generator(self.device).manual_seed(int(test_config.seed))
330
- self.pipeline._initialize_kv_cache(1, torch.float16, device=self.device)
331
- self.pipeline._initialize_crossattn_cache(1, torch.float16, device=self.device)
332
-
333
- # Generate a small noise tensor for testing
334
- noise = torch.randn([1, 3, 8, 20, 32], device=self.device, dtype=torch.float16, generator=rnd)
335
-
336
- # Clean up
337
- del noise, conditional_dict
338
- torch.cuda.empty_cache()
339
- gc.collect()
340
-
341
- logger.info("Warm-up successful!")
342
-
343
- except Exception as e:
344
- # Log the error but don't fail initialization
345
- import traceback
346
- error_message = f"Warm-up failed (but this is non-critical): {str(e)}\n{traceback.format_exc()}"
347
- logger.warning(error_message)
348
-
349
- def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
350
- """Process inference requests
351
-
352
- Args:
353
- data: Request data containing inputs and parameters
354
-
355
- Returns:
356
- Dictionary with generated video and metadata
357
- """
358
- # Extract inputs and parameters
359
- inputs = data.get("inputs", {})
360
-
361
- # Support both formats:
362
- # 1. {"inputs": {"prompt": "...", "image": "..."}}
363
- # 2. {"inputs": "..."} (prompt only)
364
- if isinstance(inputs, str):
365
- input_prompt = inputs
366
- input_image = None
367
- else:
368
- input_prompt = inputs.get("prompt", "")
369
- input_image = inputs.get("image")
370
-
371
- params = data.get("parameters", {})
372
-
373
- if not input_prompt:
374
- raise ValueError("Prompt must be provided")
375
-
376
- # Create and validate configuration
377
- config = GenerationConfig(
378
- # general content settings
379
- prompt=input_prompt,
380
- negative_prompt=params.get("negative_prompt", GenerationConfig.negative_prompt),
381
-
382
- # video model settings
383
- width=params.get("width", GenerationConfig.width),
384
- height=params.get("height", GenerationConfig.height),
385
- num_frames=params.get("num_frames", GenerationConfig.num_frames),
386
- guidance_scale=params.get("guidance_scale", GenerationConfig.guidance_scale),
387
- num_inference_steps=params.get("num_inference_steps", GenerationConfig.num_inference_steps),
388
-
389
- # reproducible generation settings
390
- seed=params.get("seed", GenerationConfig.seed),
391
-
392
- # output settings
393
- fps=params.get("fps", GenerationConfig.fps),
394
- quality=params.get("quality", GenerationConfig.quality),
395
-
396
- # advanced settings
397
- mixed_precision=params.get("mixed_precision", GenerationConfig.mixed_precision),
398
- use_taehv=params.get("use_taehv", GenerationConfig.use_taehv),
399
- use_trt=params.get("use_trt", GenerationConfig.use_trt),
400
- ).validate_and_adjust()
401
-
402
- try:
403
- with torch.no_grad():
404
- # Set random seeds for reproducibility
405
- random.seed(config.seed)
406
- np.random.seed(config.seed)
407
- torch.manual_seed(config.seed)
408
-
409
- # Create pipeline if not already created
410
- if self.pipeline is None:
411
- self.pipeline = create_wan_pipeline(config, self.device)
412
-
413
- # Prepare text conditioning
414
- conditional_dict = self.pipeline.text_encoder(text_prompts=[config.prompt])
415
- for key, value in conditional_dict.items():
416
- conditional_dict[key] = value.to(dtype=torch.float16)
417
-
418
- # Initialize caches
419
- rnd = torch.Generator(self.device).manual_seed(int(config.seed))
420
- self.pipeline._initialize_kv_cache(1, torch.float16, device=self.device)
421
- self.pipeline._initialize_crossattn_cache(1, torch.float16, device=self.device)
422
-
423
- # Generate noise tensor
424
- noise = torch.randn(
425
- [1, 21, 16, config.height // 16, config.width // 16],
426
- device=self.device,
427
- dtype=torch.float16,
428
- generator=rnd
429
- )
430
-
431
- # Initialize VAE cache
432
- vae_cache = None
433
- latents_cache = None
434
- if not config.use_taehv and not config.use_trt:
435
- vae_cache = [c.to(device=self.device, dtype=torch.float16) for c in ZERO_VAE_CACHE]
436
-
437
- # Generation parameters
438
- num_blocks = 7
439
- current_start_frame = 0
440
- all_num_frames = [self.pipeline.num_frame_per_block] * num_blocks
441
-
442
- all_frames = []
443
-
444
- # Generate video blocks
445
- for idx, current_num_frames in enumerate(all_num_frames):
446
- logger.info(f"Processing block {idx+1}/{num_blocks}")
447
-
448
- noisy_input = noise[:, current_start_frame : current_start_frame + current_num_frames]
449
-
450
- # Denoising steps
451
- for step_idx, current_timestep in enumerate(self.pipeline.denoising_step_list):
452
- timestep = torch.ones([1, current_num_frames], device=noise.device, dtype=torch.int64) * current_timestep
453
- _, denoised_pred = self.pipeline.generator(
454
- noisy_image_or_video=noisy_input,
455
- conditional_dict=conditional_dict,
456
- timestep=timestep,
457
- kv_cache=self.pipeline.kv_cache1,
458
- crossattn_cache=self.pipeline.crossattn_cache,
459
- current_start=current_start_frame * self.pipeline.frame_seq_length
460
- )
461
-
462
- if step_idx < len(self.pipeline.denoising_step_list) - 1:
463
- next_timestep = self.pipeline.denoising_step_list[step_idx + 1]
464
- noisy_input = self.pipeline.scheduler.add_noise(
465
- denoised_pred.flatten(0, 1),
466
- torch.randn_like(denoised_pred.flatten(0, 1)),
467
- next_timestep * torch.ones([1 * current_num_frames], device=noise.device, dtype=torch.long)
468
- ).unflatten(0, denoised_pred.shape[:2])
469
-
470
- # Update cache for next block
471
- if idx < len(all_num_frames) - 1:
472
- self.pipeline.generator(
473
- noisy_image_or_video=denoised_pred,
474
- conditional_dict=conditional_dict,
475
- timestep=torch.zeros_like(timestep),
476
- kv_cache=self.pipeline.kv_cache1,
477
- crossattn_cache=self.pipeline.crossattn_cache,
478
- current_start=current_start_frame * self.pipeline.frame_seq_length,
479
- )
480
-
481
- # Decode to pixels
482
- if config.use_trt:
483
- pixels, vae_cache = self.pipeline.vae.forward(denoised_pred.half(), *vae_cache)
484
- elif config.use_taehv:
485
- if latents_cache is None:
486
- latents_cache = denoised_pred
487
- else:
488
- denoised_pred = torch.cat([latents_cache, denoised_pred], dim=1)
489
- latents_cache = denoised_pred[:, -3:]
490
- pixels = self.pipeline.vae.decode(denoised_pred)
491
- else:
492
- pixels, vae_cache = self.pipeline.vae(denoised_pred.half(), *vae_cache)
493
-
494
- # Handle frame skipping
495
- if idx == 0 and not config.use_trt:
496
- pixels = pixels[:, 3:]
497
- elif config.use_taehv and idx > 0:
498
- pixels = pixels[:, 12:]
499
-
500
- # Convert frames to numpy
501
- for frame_idx in range(pixels.shape[1]):
502
- frame_tensor = pixels[0, frame_idx]
503
- frame_np = torch.clamp(frame_tensor.float(), -1., 1.) * 127.5 + 127.5
504
- frame_np = frame_np.to(torch.uint8).cpu().numpy()
505
- frame_np = np.transpose(frame_np, (1, 2, 0)) # CHW -> HWC
506
- all_frames.append(frame_np)
507
-
508
- current_start_frame += current_num_frames
509
-
510
- # Convert frames to video
511
- video_bytes = frames_to_video_bytes(all_frames, fps=config.fps, quality=config.quality)
512
-
513
- # Convert to base64 data URI
514
- video_b64 = base64.b64encode(video_bytes).decode('utf-8')
515
- video_uri = f"data:video/mp4;base64,{video_b64}"
516
-
517
- # Prepare metadata
518
- metadata = {
519
- "width": config.width,
520
- "height": config.height,
521
- "num_frames": len(all_frames),
522
- "fps": config.fps,
523
- "duration": len(all_frames) / config.fps,
524
- "seed": config.seed,
525
- "prompt": config.prompt,
526
- }
527
-
528
- # Clean up to prevent CUDA OOM errors
529
- del noise, conditional_dict, pixels
530
- if self.device == "cuda":
531
- torch.cuda.empty_cache()
532
- gc.collect()
533
-
534
- return {
535
- "video": video_uri,
536
- "content-type": "video/mp4",
537
- "metadata": metadata
538
- }
539
-
540
- except Exception as e:
541
- # Log the error and reraise
542
- import traceback
543
- error_message = f"Error generating video: {str(e)}\n{traceback.format_exc()}"
544
- logger.error(error_message)
545
- raise RuntimeError(error_message)