Self-Forcing2.1-T2V-1.3B-GGUF
Self-Forcing | Wan2.1 | GGUF
Developed by Nichonauta.
This repository contains quantized GGUF versions of the Self-Forcing video generation model.
The Self-Forcing model is an evolution of Wan2.1-T2V-1.3B, optimized with an innovative "self-forcing" technique that allows it to correct its own generation errors in real time. This results in more coherent, higher-quality videos.
These GGUF files allow the model to be run efficiently on GPU/CPU, drastically reducing VRAM consumption and making video generation accessible without the need for high-end GPUs.
Key Features
- GPU/CPU Inference: Thanks to the GGUF format, the model can run on a wide range of hardware with optimized performance.
- Self-Forcing Technique: The model learns from its own predictions during generation to improve the temporal consistency and visual quality of the video.
- Image-guided Generation: Ability to generate smooth video transitions between a start and an end image, guided by a text prompt.
- Low Memory Consumption: Quantization significantly reduces the RAM/VRAM footprint compared to the original FP16/FP32 models (a rough size estimate is sketched after this list).
- Based on a Solid Architecture: It inherits the powerful base of the Wan2.1-T2V-1.3B model, known for its efficiency and quality.
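As a rough illustration of the memory savings, the sketch below estimates the weight size of a 1.3B-parameter model at different precisions. The bits-per-weight figures for the quantized cases are assumptions (GGUF quantization formats add per-block scale metadata), so treat the output as an order-of-magnitude estimate rather than the actual file sizes in this repository.

```python
# Back-of-the-envelope weight-size estimate for a 1.3B-parameter model.
# The quantized bits-per-weight values are assumptions that include some
# per-block overhead; real GGUF file sizes will differ.
PARAMS = 1.3e9  # parameter count of Wan2.1-T2V-1.3B

precisions = {
    "FP32": 32.0,
    "FP16": 16.0,
    "8-bit (assumed)": 8.5,
    "4-bit (assumed)": 4.5,
}

for name, bits_per_weight in precisions.items():
    size_gib = PARAMS * bits_per_weight / 8 / 1024**3
    print(f"{name:>16}: ~{size_gib:.2f} GiB")
```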
Usage
The model files can be used in ComfyUI with the ComfyUI-GGUF custom node.
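A minimal sketch of fetching a quantized file from this repository with huggingface_hub and placing it where the ComfyUI-GGUF node typically expects diffusion-model weights (ComfyUI/models/unet). The .gguf filename below is hypothetical; substitute the actual file name from this repository and adjust the path to your ComfyUI installation.

```python
# Sketch: download a GGUF file from this repository for use with ComfyUI-GGUF.
# The filename is hypothetical -- replace it with an actual .gguf file from
# this repository; local_dir should point at your ComfyUI models folder.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Nichonauta/Self-Forcing2.1-T2V-1.3B-GGUF",
    filename="self-forcing2.1-t2v-1.3b-Q4_K_M.gguf",  # hypothetical name
    local_dir="ComfyUI/models/unet",
)
print("Saved to:", path)
```

Once the file is in place, restart ComfyUI and load it with the GGUF loader node provided by ComfyUI-GGUF.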
What is GGUF?
GGUF is a file format designed to store large language models (and other architectures) for fast loading and inference. The key advantages are:
- Fast Loading: Does not require complex deserialization.
- Quantization: Allows model weights to be stored with reduced precision (e.g., 4 or 8 bits instead of 16 or 32), which reduces file size and RAM usage.
- GPU/CPU Execution: It is optimized to run on general-purpose processors through libraries like llama.cpp.
Note: Running this video model in GGUF format requires compatible software that can interpret the video diffusion transformer architecture.
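To check which quantization a downloaded file actually uses, the gguf Python package (published from the llama.cpp repository) can read the header and tensor metadata without loading the weights into memory. A minimal sketch, assuming a locally downloaded file:

```python
# Sketch: inspect a GGUF file's header metadata and tensor quantization types
# with the `gguf` package (pip install gguf). The path is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("self-forcing2.1-t2v-1.3b-Q4_K_M.gguf")  # placeholder path

# Key/value metadata stored in the file header
for key in list(reader.fields)[:10]:
    print(key)

# Name, shape, and quantization type of the first few tensors
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type.name)
```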
Model Details and Attribution
This work would not be possible without the open-source projects that precede it.
Base Model: Wan2.1
This model is based on Wan2.1-T2V-1.3B, a powerful 1.3-billion-parameter text-to-video model. It uses a Diffusion Transformer (DiT) architecture and a 3D VAE (Wan-VAE) optimized to preserve temporal information, making it ideal for video generation.
- Original Repository: Wan-AI/Wan2.1-T2V-1.3B
- Architecture: Diffusion Transformer (DiT) with a T5 text encoder.
Optimization Technique: Self-Forcing
The Wan2.1 model was enhanced with the Self-Forcing method, which trains the model to recognize and correct its own diffusion errors in a single forward pass. This improves fidelity and coherence without the need for costly additional training.
- Project Page: self-forcing.github.io
Acknowledgements
We thank the teams behind Wan2.1, Self-Forcing, Stable Diffusion, diffusers, and the entire Hugging Face community for their contribution to the open-source ecosystem.
Citation
If you find our work useful, please cite the original projects:
@article{wan2.1,
title = {Wan: Open and Advanced Large-Scale Video Generative Models},
author = {Wan Team},
journal = {},
year = {2025}
}
@misc{bar2024self,
title={Self-Forcing for Real-Time Video Generation},
author={Tal Bar and Roy Vovers and Yael Vinker and Eliahu Horwitz and Mark B. Zkharya and Yedid Hoshen},
year={2024},
eprint={2405.03358},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Available quantizations: 4-bit, 8-bit, 16-bit
Base model: gdhe17/Self-Forcing