OOMs
I tried to run this on an A6000 (48GB VRAM), but it always OOMs.
However, I read on Reddit that people are able to run this on a 4090, which has less VRAM.
Is there anything I'm missing?
I'm not familiar with the lightx2v software specifically, but I did find this page, https://github.com/ModelTC/lightx2v/blob/main/docs/en_US/03.quantization.md , which may help. People on Reddit may also be using other software. For instance, I run it in 16GB VRAM on my 4070 Ti Super using my own https://github.com/Sarania/blissful-tuner/ with features like SageAttention ( https://github.com/thu-ml/SageAttention ), fp8 scaled quantization, torch.compile optimization, transformer block swap, and more. Any software capable of loading a Wan architecture model and providing an LCM sampler (e.g. ComfyUI, WanVideoWrapper) should also work, and many are optimized for consumer hardware and thus lower VRAM.
The resolution and length of the requested output also play a large role in how much VRAM is consumed during inference. Using an unquantized model and trying to create high-resolution, long videos can eat VRAM exceptionally quickly. With these considerations in mind, I hope you find success!
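To make the scaling concrete, here is a rough back-of-the-envelope sketch. It assumes a Wan 2.1-style 3D VAE with 8x spatial / 4x temporal compression and 16 latent channels; those numbers are assumptions on my part, not read from lightx2v, so check your model's config.

```python
# Rough latent size estimate, assuming 8x spatial / 4x temporal VAE
# compression and 16 latent channels (assumptions, verify for your model).
def latent_elements(width, height, frames, channels=16, spatial=8, temporal=4):
    t = (frames - 1) // temporal + 1          # temporal latent length
    return channels * t * (height // spatial) * (width // spatial)

for w, h, f in [(832, 480, 81), (1280, 720, 81)]:
    n = latent_elements(w, h, f)
    print(f"{w}x{h}, {f} frames: {n / 1e6:.1f}M latent elements")

# Under these assumptions, 720p has roughly 2.3x the latent elements of 480p,
# and attention cost grows roughly with the square of the token count,
# so VRAM climbs quickly as resolution and length increase.
```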
Thanks! I tried CPU offloading, VAE tiling, quantization, etc. With those I was able to get around the OOMs, but now it takes 6 minutes to generate a 5-second 480p video, and the quality is subpar.
Is that kind of speed and quality expected? I feel like I probably missed something important.
Unfortunately that's fairly typical: video diffusion is HEAVY, though I think you could do a little better with optimized settings. Looking at your GPU's specs, its raw compute is about 90% of mine, but you have much more VRAM. With fully optimized settings I can make a 1280x720, 5-second video at 16 fps (81 frames) in about 5 minutes (6 steps), so I think you should be able to achieve something similar. I am very much a power user though, optimizing every last bit of code to maximize what I can do with my hardware; that's what gave birth to the Blissful Tuner I mentioned above. I'll go over getting the most out of these models in more depth here. If it's too much, feel free to skip to the last paragraph for the TL;DR!
The big speed wins in video diffusion come from a few places. The number one boon is SageAttention: a highly optimized, quantized attention kernel that nearly doubles inference speed for Wan on the right hardware (Ada), in exchange for a minor difference in output compared to SDPA/Flash. You have an Ampere card, so the gain won't be quite as large, but Ampere IS supported, so you definitely want it. The next big boost is fp16 accumulation: simply enabling a flag in PyTorch that allows GEMM accumulations to be done in fp16, which provides another ~25% speed for Wan with no noticeable quality hit. Then torch.compile, a PyTorch function that optimizes a model and workload specifically for your hardware, will net you another ~20% speed and ~20% less VRAM.
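As a minimal sketch of what wiring those up looks like, assuming a PyTorch 2.7+ environment and the sageattention package installed (the exact integration points differ between inference frameworks, so treat this as illustrative only):

```python
import torch
import torch.nn.functional as F

# fp16 accumulation for GEMMs; the flag exists in PyTorch 2.7+, so it is
# guarded here to avoid errors on older versions.
if hasattr(torch.backends.cuda.matmul, "allow_fp16_accumulation"):
    torch.backends.cuda.matmul.allow_fp16_accumulation = True

# SageAttention as a drop-in for scaled_dot_product_attention; many wrappers
# simply swap their attention call like this. Verify the expected tensor
# layout and arguments against the SageAttention docs for your version.
try:
    from sageattention import sageattn as attention
except ImportError:
    attention = F.scaled_dot_product_attention

# torch.compile the transformer, the heaviest part of a Wan-style pipeline.
# "model" here is a hypothetical placeholder for the loaded DiT backbone.
# model = torch.compile(model, mode="max-autotune")
```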
Now, getting into REALLY advanced territory: the default rotary position embedding (RoPE) function for Wan uses complex numbers, which makes it not play nicely with torch.compile. Some software, like WanVideoWrapper or my own, implements an alternative RoPE function that avoids complex numbers and thus compiles much better. That saves significant VRAM, which means you need less CPU offloading and therefore gain speed, and it comes with no quality penalty.
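Conceptually, the complex-free variant just expands the complex multiplication into its real cos/sin form. The following is a hand-written illustration of that idea, not the actual Wan or WanVideoWrapper code: every tensor stays real, so torch.compile can trace it without complex-dtype fallbacks.

```python
import torch

def rope_real(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Apply rotary position embedding using only real-valued tensors.

    x:      (..., seq, dim) activations, dim must be even
    angles: (seq, dim // 2) per-position rotation angles
    """
    cos = angles.cos().repeat_interleave(2, dim=-1)   # (seq, dim)
    sin = angles.sin().repeat_interleave(2, dim=-1)
    # Pair up dimensions (x0, x1), (x2, x3), ... and rotate each pair,
    # which is equivalent to multiplying the pair by e^(i*angle).
    x_even = x[..., 0::2]
    x_odd = x[..., 1::2]
    rotated = torch.stack((-x_odd, x_even), dim=-1).flatten(-2)
    return x * cos + rotated * sin

# Example usage with hypothetical shapes:
q = torch.randn(1, 8, 256, 64)       # (batch, heads, seq, head_dim)
angles = torch.randn(256, 32)        # one angle per rotated pair
q_rotated = rope_real(q, angles)
```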
All that said, if you JUST want to make the highest-quality videos your hardware can produce at the best possible speed, you might be better served by software with more features aimed at exactly that (no shade on lightx2v). If you're comfortable with a CLI, my own Blissful Tuner is a possibility. I even provide an --optimized flag that enables all the optimizations I just described in one shot (it requires SageAttention, Triton, and PyTorch 2.7.0 or higher in the venv). Blissful also supports advanced fp8 scaled quantization to save VRAM while maximizing quality (also enabled by --optimized). If you prefer a GUI, ComfyUI, either natively or with the likes of https://github.com/kijai/ComfyUI-WanVideoWrapper , is what most people use for inference with these models. WanVideoWrapper has a LOT of options for trading speed against quality, even more than Blissful, but it can be a bit overwhelming for beginners not used to Comfy's visual programming style. I honestly think it's super impressive what has been created to let us run these huge models on consumer hardware! I hope you find what you're looking for!