Self-Forcing2.1-T2V-1.3B-GGUF

📄 Self-Forcing    |    🧬 Wan2.1    |    🤖 GGUF


Developed by Nichonauta.

This repository contains quantized GGUF versions of the Self-Forcing video generation model.

The Self-Forcing model builds on Wan2.1-T2V-1.3B, post-trained with the "self-forcing" technique: during training, the model conditions on its own previously generated frames rather than on ground-truth frames, which reduces error accumulation during autoregressive generation. The result is more coherent, higher-quality video.

These GGUF files allow the model to be run efficiently on GPU/CPU, drastically reducing VRAM consumption and making video generation accessible without the need for high-end GPUs.

✨ Key Features

  • ⚡️ GPU/CPU Inference: Thanks to the GGUF format, the model can run on a wide range of hardware with optimized performance.
  • 🧠 Self-Forcing Technique: During training, the model conditions on its own predictions, improving the temporal consistency and visual quality of the generated video.
  • 🖼️ Image-guided Generation: Can generate smooth video transitions between a start and an end image, guided by a text prompt.
  • 📉 Low Memory Consumption: Quantization significantly reduces the RAM/VRAM footprint compared to the original FP16/FP32 weights (see the estimate after this list).
  • 🧬 Solid Base Architecture: It inherits the powerful Wan2.1-T2V-1.3B base model, known for its efficiency and quality.
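As a rough illustration of the memory savings, here is a back-of-the-envelope estimate of the weight-only footprint at different precisions (activations, text encoder, and VAE excluded; the effective bits per weight for the quantized formats are approximations that include block-scale overhead):

```python
# Rough weight-only footprint of a 1.42B-parameter model at several
# precisions. Effective bits per weight (bpw) for the quantized formats
# include block-scale overhead, so these are estimates, not exact sizes.
PARAMS = 1.42e9

for name, bpw in [("FP32", 32), ("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name:7s} ~{gib:.2f} GiB")
```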

Usage

The model files can be used in ComfyUI with the ComfyUI-GGUF custom node, as sketched below.
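A minimal setup, assuming the widely used city96/ComfyUI-GGUF node (defer to that project's README if these steps have changed):

  • Clone https://github.com/city96/ComfyUI-GGUF into ComfyUI/custom_nodes and install its dependencies (pip install --upgrade gguf).
  • Place the downloaded .gguf file in ComfyUI/models/unet.
  • In your workflow, swap the standard diffusion model loader for the "Unet Loader (GGUF)" node and select the quantized file; the text encoder and VAE are loaded separately, as usual.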


🧐 What is GGUF?

GGUF is a binary file format designed to store large language models (and other architectures) for fast loading and inference. Its key advantages are:

  • Fast Loading: Weights can be memory-mapped directly, with no complex deserialization step.
  • Quantization: Model weights can be stored at reduced precision (e.g., 4 or 8 bits instead of 16 or 32), which shrinks file size and RAM usage (see the snippet after this list).
  • GPU/CPU Execution: It is optimized for general-purpose processors through libraries like llama.cpp, with optional GPU offloading.
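As an example of this self-describing layout, a GGUF file can be inspected with the official gguf Python package (pip install gguf); the filename below is a placeholder for whichever quant you download:

```python
# Minimal sketch: read GGUF metadata with the `gguf` Python package.
# The filename is a placeholder; point it at the quant you downloaded.
from gguf import GGUFReader

reader = GGUFReader("Self-Forcing2.1-T2V-1.3B-Q8_0.gguf")

# Key/value metadata: architecture name, quantization version, etc.
for field in reader.fields.values():
    print(field.name)

# Each tensor records its own quantization type (e.g. Q4_K, Q8_0, F16).
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.tensor_type.name, tensor.shape)
```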

Note: Running this video model in GGUF format requires software that implements the video diffusion transformer architecture (such as ComfyUI with ComfyUI-GGUF); llama.cpp itself does not run diffusion models.


📚 Model Details and Attribution

This work would not be possible without the open-source projects that precede it.

Base Model: Wan2.1

This model is based on Wan2.1-T2V-1.3B, a powerful 1.3 billion parameter text-to-video model. It uses a Diffusion Transformer (DiT) architecture and a 3D VAE (Wan-VAE) optimized to preserve temporal information, making it ideal for video generation.

  • Original Repository: Wan-AI/Wan2.1-T2V-1.3B
  • Architecture: Diffusion Transformer (DiT) with a T5 text encoder.

Optimization Technique: Self-Forcing

The Wan2.1 model was post-trained with the Self-Forcing method: instead of always conditioning on ground-truth frames (teacher forcing), the model is trained on its own autoregressive rollouts, so the frames it conditions on during training match what it actually produces at inference time. This closes the train-test gap, reduces error accumulation, and improves the fidelity and temporal coherence of generated video.
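A conceptual sketch of the training-time rollout (not the authors' implementation; model.denoise and the chunked interface are illustrative assumptions):

```python
# Conceptual sketch of a Self-Forcing rollout (illustrative only).
# Each chunk of frames is denoised conditioned on the model's OWN
# previously generated chunks rather than ground-truth frames, so the
# training-time context matches test-time autoregressive generation.
import torch

def self_forcing_rollout(model, text_emb, num_chunks, chunk_shape):
    history = []                              # self-generated context
    for _ in range(num_chunks):
        noise = torch.randn(chunk_shape)      # fresh latent noise
        # Few-step denoising conditioned on self-generated history.
        chunk = model.denoise(noise, context=history, text=text_emb)
        history.append(chunk)                 # feed predictions back in
    return torch.cat(history, dim=1)          # training loss is applied here
```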


πŸ™ Acknowledgements

We thank the teams behind Wan2.1, Self-Forcing, Stable Diffusion, diffusers, and the entire Hugging Face community for their contributions to the open-source ecosystem.

✍️ Citation

If you find our work useful, please cite the original projects:

@article{wan2.1,
    title   = {Wan: Open and Advanced Large-Scale Video Generative Models},
    author  = {Wan Team},
    journal = {arXiv preprint arXiv:2503.20314},
    year    = {2025}
}

@article{huang2025selfforcing,
    title   = {Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion},
    author  = {Xun Huang and Zhengqi Li and Guande He and Mingyuan Zhou and Eli Shechtman},
    journal = {arXiv preprint arXiv:2506.08009},
    year    = {2025}
}