|
--- |
|
license: openrail++ |
|
library_name: torchgeo |
|
tags: |
|
- lipsync |
|
- video-editing |
|
--- |
|
|
|
Paper: https://arxiv.org/abs/2412.09262 |
|
|
|
Code: https://github.com/bytedance/LatentSync |
|
|
|
# What's new in LatentSync 1.5? |
|
|
|
1. Add temporal layer: Our previous claim that the [temporal layer](https://arxiv.org/abs/2307.04725) severely impairs lip-sync accuracy was incorrect; the issue was actually caused by a bug in the code implementation. We have corrected our [paper](https://arxiv.org/abs/2412.09262) and updated the code. After incorporating the temporal layer, LatentSync 1.5 demonstrates significantly improved temporal consistency compared to version 1.0. |
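To make the idea concrete: a temporal layer in the [AnimateDiff](https://arxiv.org/abs/2307.04725) sense is self-attention applied along the frame axis, so each spatial position can attend across time. The sketch below is illustrative only (module and names are hypothetical, not LatentSync's actual implementation):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Minimal sketch of an AnimateDiff-style temporal layer:
    self-attention over the frame axis, with a residual connection."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        # Fold spatial positions into the batch; attend over frames only.
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, f, c)
        t = self.norm(tokens)
        attn_out, _ = self.attn(t, t, t)
        tokens = tokens + attn_out  # residual keeps spatial features intact
        return tokens.reshape(b, h, w, f, c).permute(0, 4, 3, 1, 2)

x = torch.randn(1, 32, 8, 4, 4)
layer = TemporalAttention(32)
print(layer(x).shape)  # torch.Size([1, 32, 8, 4, 4])
```

Because the attention runs over the frame dimension with spatial positions folded into the batch, it smooths features across time without changing the tensor layout the spatial U-Net blocks expect.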
|
|
|
2. Improve performance on Chinese videos: many GitHub issues reported poor performance on Chinese videos, so we added Chinese data to the training of the new model version.
|
|
|
3. Reduce the VRAM requirement of stage2 training to **20 GB** through the following optimizations:
|
|
|
    1. Implement gradient checkpointing in the U-Net, VAE, SyncNet, and VideoMAE.

    2. Replace xFormers with PyTorch's native implementation of FlashAttention-2.

    3. Clear the CUDA cache after loading checkpoints.

    4. Stage2 training now updates only the temporal layers and audio cross-attention layers, which significantly reduces VRAM requirements compared to the previous full-parameter fine-tuning.
|
|
|
    Now you can train LatentSync on a single **RTX 3090**! Start the stage2 training with `configs/unet/stage2_efficient.yaml`.
|
|
|
4. Other code optimizations: |
|
|
|
    1. Remove the dependency on xFormers and Triton.

    2. Upgrade the diffusers version to `0.32.2`.
|
|
|
## LatentSync 1.5 Demo |
|
|
|
<table class="center"> |
|
<tr style="font-weight: bolder;text-align:center;"> |
|
<td width="50%"><b>Original video</b></td> |
|
<td width="50%"><b>Lip-synced video</b></td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/b0c8d1da-3fdc-4946-9800-1b2fd0ef9c7f" controls preload="auto"></video>
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/25dd1733-44c7-42fe-805a-d612d4bc30e0" controls preload="auto"></video>
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/4e48e501-64b4-4b4f-a69c-ed18dd987b1f" controls preload="auto"></video>
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/e690d91b-9fe5-4323-a60e-2b7f546f01bc" controls preload="auto"></video>
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/e84e2c13-1deb-41f7-8382-048ba1922b71" controls preload="auto"></video>
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/5a5ba09f-590b-4eb3-8dfb-a199d8d1e276" controls preload="auto"></video>
|
</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/11e4b2b6-64f4-4617-b005-059209fcaea5" controls preload="auto"></video>
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/38437475-3c90-4d08-b540-c8e819e93e0d" controls preload="auto"></video>
|
</td> |
|
</tr> |
|
</table> |