Self-Forcing2.1-T2V-1.3B-GGUF

📄 Self-Forcing    |    🧬 Wan2.1    |    🤖 GGUF


Developed by Nichonauta.

This repository contains quantized GGUF versions of the Self-Forcing video generation model.

The Self-Forcing model builds on Wan2.1-T2V-1.3B, post-trained with the "self-forcing" technique: during training, the model conditions on its own previously generated frames rather than on ground-truth frames, which reduces error accumulation during autoregressive generation. The result is more coherent, higher-quality video.

These GGUF files allow the model to be run efficiently on GPU/CPU, drastically reducing VRAM consumption and making video generation accessible without the need for high-end GPUs.

✨ Key Features

  • ⚡️ GPU/CPU Inference: Thanks to the GGUF format, the model can run on a wide range of hardware with optimized performance.
  • 🧠 Self-Forcing Technique: During training, the model conditions on its own predictions, improving the temporal consistency and visual quality of the generated video.
  • 🖼️ Image-guided Generation: Can generate smooth video transitions between a start and an end image, guided by a text prompt.
  • 📉 Low Memory Consumption: Quantization significantly reduces the RAM/VRAM footprint compared to the original FP16/FP32 weights (see the estimate after this list).
  • 🧬 Solid Base Architecture: It inherits the powerful Wan2.1-T2V-1.3B base model, known for its efficiency and quality.
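As a rough illustration of the memory savings, here is a back-of-the-envelope estimate of the weight-only footprint at different precisions (activations, text encoder, and VAE excluded; the effective bits per weight for the quantized formats are approximations that include block-scale overhead):

```python
# Rough weight-only footprint of a 1.42B-parameter model at several
# precisions. Effective bits per weight (bpw) for the quantized formats
# include block-scale overhead, so these are estimates, not exact sizes.
PARAMS = 1.42e9

for name, bpw in [("FP32", 32), ("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name:7s} ~{gib:.2f} GiB")
```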

Usage

The model files can be used in ComfyUI with the ComfyUI-GGUF custom node, as sketched below.
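A minimal setup, assuming the widely used city96/ComfyUI-GGUF node (defer to that project's README if these steps have changed):

  • Clone https://github.com/city96/ComfyUI-GGUF into ComfyUI/custom_nodes and install its dependencies (pip install --upgrade gguf).
  • Place the downloaded .gguf file in ComfyUI/models/unet.
  • In your workflow, swap the standard diffusion model loader for the "Unet Loader (GGUF)" node and select the quantized file; the text encoder and VAE are loaded separately, as usual.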


🧐 What is GGUF?

GGUF is a binary file format designed to store large language models (and other architectures) for fast loading and inference. Its key advantages are:

  • Fast Loading: Weights can be memory-mapped directly, with no complex deserialization step.
  • Quantization: Model weights can be stored at reduced precision (e.g., 4 or 8 bits instead of 16 or 32), which shrinks file size and RAM usage (see the snippet after this list).
  • GPU/CPU Execution: It is optimized for general-purpose processors through libraries like llama.cpp, with optional GPU offloading.
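As an example of this self-describing layout, a GGUF file can be inspected with the official gguf Python package (pip install gguf); the filename below is a placeholder for whichever quant you download:

```python
# Minimal sketch: read GGUF metadata with the `gguf` Python package.
# The filename is a placeholder; point it at the quant you downloaded.
from gguf import GGUFReader

reader = GGUFReader("Self-Forcing2.1-T2V-1.3B-Q8_0.gguf")

# Key/value metadata: architecture name, quantization version, etc.
for field in reader.fields.values():
    print(field.name)

# Each tensor records its own quantization type (e.g. Q4_K, Q8_0, F16).
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type.name, tensor.shape)
```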

Note: Running this video model in GGUF format requires software that implements the video diffusion transformer architecture (such as ComfyUI with ComfyUI-GGUF); llama.cpp itself does not run diffusion models.


📚 Model Details and Attribution

This work would not be possible without the open-source projects that precede it.

Base Model: Wan2.1

This model is based on Wan2.1-T2V-1.3B, a powerful 1.3 billion parameter text-to-video model. It uses a Diffusion Transformer (DiT) architecture and a 3D VAE (Wan-VAE) optimized to preserve temporal information, making it ideal for video generation.

  • Original Repository: Wan-AI/Wan2.1-T2V-1.3B
  • Architecture: Diffusion Transformer (DiT) with a T5 text encoder.

Optimization Technique: Self-Forcing

The Wan2.1 model was post-trained with the Self-Forcing method: instead of always conditioning on ground-truth frames (teacher forcing), the model is trained on its own autoregressive rollouts, so the frames it conditions on during training match what it actually produces at inference time. This closes the train-test gap, reduces error accumulation, and improves the fidelity and temporal coherence of generated video.
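A conceptual sketch of the training-time rollout (not the authors' implementation; model.denoise and the chunked interface are illustrative assumptions):

```python
# Conceptual sketch of a Self-Forcing rollout (illustrative only).
# Each chunk of frames is denoised conditioned on the model's OWN
# previously generated chunks rather than ground-truth frames, so the
# training-time context matches test-time autoregressive generation.
import torch

def self_forcing_rollout(model, text_emb, num_chunks, chunk_shape):
    history = []                              # self-generated context
    for _ in range(num_chunks):
        noise = torch.randn(chunk_shape)      # fresh latent noise
        # Few-step denoising conditioned on self-generated history.
        chunk = model.denoise(noise, context=history, text=text_emb)
        history.append(chunk)                 # feed predictions back in
    return torch.cat(history, dim=1)          # training loss is applied here
```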


πŸ™ Acknowledgements

We thank the teams behind Wan2.1, Self-Forcing, Stable Diffusion, diffusers, and the entire Hugging Face community for their contributions to the open-source ecosystem.

✍️ Citation

If you find our work useful, please cite the original projects:

@article{wan2.1,
    title   = {Wan: Open and Advanced Large-Scale Video Generative Models},
    author  = {Wan Team},
    journal = {arXiv preprint arXiv:2503.20314},
    year    = {2025}
}

@article{huang2025selfforcing,
    title   = {Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion},
    author  = {Xun Huang and Zhengqi Li and Guande He and Mingyuan Zhou and Eli Shechtman},
    journal = {arXiv preprint arXiv:2506.08009},
    year    = {2025}
}