---
pipeline_tag: image-to-video
library_name: diffusers
license: mit
---

# TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

This repository contains the official model weights for **TLB-VFI**, an efficient video-based diffusion model presented in the paper [TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation](https://huggingface.co/papers/2507.04984).

- 🌐 **Project Page**: [https://zonglinl.github.io/tlbvfi_page](https://zonglinl.github.io/tlbvfi_page)
- 💻 **Code**: [https://github.com/ZonglinL/TLB-VFI](https://github.com/ZonglinL/TLB-VFI)

<div align="center">
<img src="https://github.com/ZonglinL/TLB-VFI/raw/main/images/visual1.png" width="95%">
</div>

## Abstract

Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use $n$ to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) to this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate these issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves a 20% improvement in FID on the most challenging datasets over the recent SOTA among image-based diffusion models. Meanwhile, thanks to this rich temporal information, our method achieves strong performance with 3x fewer parameters. Such a parameter reduction results in a 2.3x speedup. By incorporating optical flow guidance, our method requires 9000x less training data and uses over 20x fewer parameters than video-based diffusion models.
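
For background, the Brownian bridge at the core of TLB-VFI is a diffusion process pinned at both endpoints rather than one that terminates in pure noise. Below is a minimal sketch of the standard Brownian bridge transition used in BBDM-style models, in our own notation; see the paper for TLB-VFI's exact latent-space parameterization:

$$
q(x_t \mid x_0, y) = \mathcal{N}\left(x_t;\ (1 - m_t)\,x_0 + m_t\,y,\ \delta_t \mathbf{I}\right), \qquad m_t = \frac{t}{T}, \qquad \delta_t = 2s\left(m_t - m_t^2\right)
$$

The mean interpolates from $x_0$ at $t = 0$ to $y$ at $t = T$, and the variance $\delta_t$ vanishes at both endpoints ($s$ is a variance scale). In a VFI setting, $x_0$ would be the latent of the target intermediate frame and $y$ a condition derived from the two neighboring frames, so sampling starts from the condition rather than from pure noise.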
## Overview

TLB-VFI extracts temporal information in both the pixel space (3D wavelet gating) and the latent space (3D convolution and attention) to improve the temporal consistency of the model.

<div align="center">
<img src="https://github.com/ZonglinL/TLB-VFI/raw/main/images/overview.jpg" width="95%">
</div>

## Quantitative Results

Our method achieves state-of-the-art LPIPS, FloLPIPS, and FID scores compared with recent methods.

<div align="center">
<img src="https://github.com/ZonglinL/TLB-VFI/raw/main/images/quant.png" width="95%">
</div>

## Qualitative Results

Our method achieves the best visual quality among recent state-of-the-art methods. For more visualizations, please refer to our [project page](https://zonglinl.github.io/tlbvfi_page).

<div align="center">
<img src="https://github.com/ZonglinL/TLB-VFI/raw/main/images/visual3.png" width="95%">
</div>

## Usage

For detailed instructions on setup, training, and evaluation, please refer to the [official GitHub repository](https://github.com/ZonglinL/TLB-VFI).

### Inference Example

You can perform inference using the scripts provided in the GitHub repository. Please ensure you have downloaded the trained model weights first.
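
For example, a minimal setup sketch using `git` and `huggingface-cli`; the repo id placeholder and the `./weights` directory below are our assumptions, not part of the official instructions:

```bash
# Get the official code; the interpolation scripts live in the repo root.
git clone https://github.com/ZonglinL/TLB-VFI
cd TLB-VFI

# Fetch the trained weights from the Hub. Replace <this-repo-id> with the
# repo id shown at the top of this model page; ./weights is an arbitrary dir.
huggingface-cli download <this-repo-id> --local-dir ./weights
```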
To interpolate 7 frames between `frame0` and `frame1`:

```bash
python interpolate.py --resume_model path_to_model_weights --frame0 path_to_the_previous_frame --frame1 path_to_the_next_frame
```

To interpolate a single frame in between:

```bash
python interpolate_one.py --resume_model path_to_model_weights --frame0 path_to_the_previous_frame --frame1 path_to_the_next_frame
```
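
As a hypothetical batch sketch (not from the official repo), the same script can be driven over a sorted directory of frames, inserting one frame between each consecutive pair; the frame layout and checkpoint path below are placeholders:

```bash
# Hypothetical batch loop: interpolate one frame between every consecutive
# pair of frames in ./frames (glob expansion sorts lexicographically).
frames=(frames/*.png)
for ((i = 0; i < ${#frames[@]} - 1; i++)); do
  python interpolate_one.py \
    --resume_model ./weights/model.ckpt \
    --frame0 "${frames[i]}" \
    --frame1 "${frames[i+1]}"
done
```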
## Citation

If you find this repository helpful for your research, please cite the paper:

```bibtex
@misc{lyu2025tlbvfitemporalawarelatentbrownian,
  title={TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation},
  author={Zonglin Lyu and Chen Chen},
  year={2025},
  eprint={2507.04984},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.04984},
}
```