Add pipeline tag and library name + link to Space

#3
by nielsr (HF Staff), opened
Files changed (1)
  1. README.md +107 -32
README.md CHANGED
@@ -1,8 +1,11 @@
  ---
- license: creativeml-openrail-m
  language:
  - en
  ---
  # MuseTalk

  MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
@@ -11,31 +14,42 @@ Yue Zhang <sup>\*</sup>,
  Minhao Liu<sup>\*</sup>,
  Zhaokang Chen,
  Bin Wu<sup>†</sup>,
- Yingjie He,
  Chao Zhan,
  Wenjiang Zhou
  (<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, [email protected])

- **[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **Project(comming soon)** **Technical report (comming soon)**

  We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied with input videos, e.g., generated by [MuseV](https://github.com/TMElyralab/MuseV), as a complete virtual human solution.

  # Overview
  `MuseTalk` is a real-time high quality audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`, which

  1. modifies an unseen face according to the input audio, with a size of face region of `256 x 256`.
- 1. supports audio in various languages, such as Chinese, English, and Japanese.
- 1. supports real-time inference with 30fps+ on an NVIDIA Tesla V100.
- 1. supports modification of the center point of the face region proposes, which **SIGNIFICANTLY** affects generation results.
- 1. checkpoint available trained on the HDTF dataset.
- 1. training codes (comming soon).

  # News
- [04/02/2024] Released MuseTalk project and pretrained models.

  ## Model
  ![Model Structure](assets/figs/musetalk_arc.jpg)
- MuseTalk was trained in latent spaces, where the images were encoded by a freezed VAE. The audio was encoded by a freezed `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of the `stable-diffusion-v1-4`, where the audio embeddings were fused to the image embeddings by cross-attention.

  ## Cases
  ### MuseV + MuseTalk make human photos alive!
@@ -50,10 +64,10 @@ MuseTalk was trained in latent spaces, where the images were encoded by a freeze
  <img src=assets/demo/musk/musk.png width="95%">
  </td>
  <td >
- <video src=assets/demo/yongen/yongen_musev.mp4 controls preload></video>
  </td>
  <td >
- <video src=assets/demo/yongen/yongen_musetalk.mp4 controls preload></video>
  </td>
  </tr>
  <tr>
@@ -67,6 +81,28 @@ MuseTalk was trained in latent spaces, where the images were encoded by a freeze
  <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/94d8dcba-1bcd-4b54-9d1d-8b6fc53228f0 controls preload></video>
  </td>
  </tr>
  <tr>
  <td>
  <img src=assets/demo/monalisa/monalisa.png width="95%">
@@ -121,19 +157,42 @@ MuseTalk was trained in latent spaces, where the images were encoded by a freeze
  </tr>
  </table>

- * For video dubbing, we applied a self-developed tool which can detect the talking person.


  # TODO:
  - [x] trained models and inference codes.
  - [ ] technical report.
  - [ ] training codes.
- - [ ] online UI.
  - [ ] a better model (may take longer).


  # Getting Started
  We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:
  ## Installation
  To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
  ### Build environment
@@ -143,11 +202,6 @@ We recommend a python version >=3.10 and cuda version =11.7. Then build environm
  ```shell
  pip install -r requirements.txt
  ```
- ### whisper
- install whisper to extract audio feature (only encoder)
- ```
- pip install --editable ./musetalk/whisper
- ```

  ### mmlab packages
  ```bash
@@ -205,10 +259,12 @@ Here, we provide the inference script.
  python -m scripts.inference --inference_config configs/inference/test.yaml
  ```
  configs/inference/test.yaml is the path to the inference configuration file, including video_path and audio_path.
- The video_path should be either a video file or a directory of images.

  #### Use of bbox_shift to have adjustable results
- :mag_right: We have found that upper-bound of the mask has an important impact on mouth openness. Thus, to control the mask region, we suggest using the `bbox_shift` parameter. Positive values (moving towards the lower half) increase mouth openness, while negative values (moving towards the upper half) decrease mouth openness.

  You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range.

@@ -220,17 +276,36 @@ python -m scripts.inference --inference_config configs/inference/test.yaml --bbo

  #### Combining MuseV and MuseTalk

- As a complete solution to virtual human generation, you are suggested to first apply [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video or pose-to-video) by referring [this](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Then, you can use `MuseTalk` to generate a lip-sync video by referring [this](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).

- # Note

- If you want to launch online video chats, you are suggested to generate videos using MuseV and apply necessary pre-processing such as face detection in advance. During online chatting, only UNet and the VAE decoder are involved, which makes MuseTalk real-time.


  # Acknowledgement
- 1. We thank open-source components like [whisper](https://github.com/isaacOnline/whisper/tree/extract-embeddings), [dwpose](https://github.com/IDEA-Research/DWPose), [face-alignment](https://github.com/1adrianb/face-alignment), [face-parsing](https://github.com/zllrunning/face-parsing.PyTorch), [S3FD](https://github.com/yxlijun/S3FD.pytorch).
- 1. MuseTalk has referred much to [diffusers](https://github.com/huggingface/diffusers).
- 1. MuseTalk has been built on `HDTF` datasets.

  Thanks for open-sourcing!

@@ -246,14 +321,14 @@ If you need higher resolution, you could apply super resolution models such as [
  ```bib
  @article{musetalk,
  title={MuseTalk: Real-Time High Quality Lip Synchorization with Latent Space Inpainting},
- author={Zhang, Yue and Liu, Minhao and Chen, Zhaokang and Wu, Bin and He, Yingjie and Zhan, Chao and Zhou, Wenjiang},
  journal={arxiv},
  year={2024}
  }
  ```
  # Disclaimer/License
  1. `code`: The code of MuseTalk is released under the MIT License. There is no limitation for both academic and commercial usage.
- 1. `model`: The trained model are available for any purpose, even commercially.
- 1. `other opensource model`: Other open-source models used must comply with their license, such as `whisper`, `ft-mse-vae`, `dwpose`, `S3FD`, etc..
- 1. The testdata are collected from internet, which are available for non-commercial research purposes only.
- 1. `AIGC`: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and utilize it responsibly. The developers do not assume any responsibility for potential misuse by users.
 
  ---
  language:
  - en
+ license: creativeml-openrail-m
+ pipeline_tag: image-to-video
+ library_name: diffusers
  ---
+
  # MuseTalk

  MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
 
  Minhao Liu<sup>\*</sup>,
  Zhaokang Chen,
  Bin Wu<sup>†</sup>,
+ Yubin Zeng,
  Chao Zhan,
+ Yingjie He,
+ Junxin Huang,
  Wenjiang Zhou
  (<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, [email protected])

+ Lyra Lab, Tencent Music Entertainment
+
+ **[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **[space](https://huggingface.co/spaces/TMElyralab/MuseTalk)** **[Technical report](https://arxiv.org/abs/2410.10122)**

  We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied with input videos, e.g., generated by [MuseV](https://github.com/TMElyralab/MuseV), as a complete virtual human solution.

+ :new: Update: We are thrilled to announce that [MusePose](https://github.com/TMElyralab/MusePose/) has been released. MusePose is an image-to-video generation framework for virtual humans under control signals such as pose. Together with MuseV and MuseTalk, we hope the community can join us and march towards the vision where a virtual human can be generated end-to-end, with native abilities of full-body movement and interaction.
+
  # Overview
  `MuseTalk` is a real-time high quality audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`, which

  1. modifies an unseen face according to the input audio, with a face region of size `256 x 256`.
+ 2. supports audio in various languages, such as Chinese, English, and Japanese.
+ 3. supports real-time inference at 30fps+ on an NVIDIA Tesla V100.
+ 4. supports modification of the center point of the proposed face region, which **SIGNIFICANTLY** affects the generation results.
+ 5. provides a checkpoint trained on the HDTF dataset.
+ 6. training codes (coming soon).

  # News
+ - [04/02/2024] Released the MuseTalk project and pretrained models.
+ - [04/16/2024] Released a Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk) on HuggingFace Spaces (thanks to the HF team for their community grant).
+ - [04/17/2024] Released a pipeline that utilizes MuseTalk for real-time inference.
+ - [10/18/2024] :mega: Released the [technical report](https://arxiv.org/abs/2410.10122). The report details a model superior to the open-source L1-loss version: it adds GAN and perceptual losses for improved clarity, and a sync loss for better lip-sync performance.

  ## Model
  ![Model Structure](assets/figs/musetalk_arc.jpg)
+ MuseTalk was trained in the latent space, where the images were encoded by a frozen VAE and the audio was encoded by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`, with the audio embeddings fused into the image embeddings by cross-attention.
+
+ Note that although we use an architecture very similar to Stable Diffusion, MuseTalk is distinct in that it is **NOT** a diffusion model. Instead, MuseTalk operates by inpainting in the latent space in a single step.

  ## Cases
  ### MuseV + MuseTalk make human photos alive!
 
  <img src=assets/demo/musk/musk.png width="95%">
  </td>
  <td >
+ <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/4a4bb2d1-9d14-4ca9-85c8-7f19c39f712e controls preload></video>
  </td>
  <td >
+ <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/b2a879c2-e23a-4d39-911d-51f0343218e4 controls preload></video>
  </td>
  </tr>
  <tr>
 
  <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/94d8dcba-1bcd-4b54-9d1d-8b6fc53228f0 controls preload></video>
  </td>
  </tr>
+ <tr>
+ <td>
+ <img src=assets/demo/sit/sit.jpeg width="95%">
+ </td>
+ <td >
+ <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/5fbab81b-d3f2-4c75-abb5-14c76e51769e controls preload></video>
+ </td>
+ <td >
+ <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/f8100f4a-3df8-4151-8de2-291b09269f66 controls preload></video>
+ </td>
+ </tr>
+ <tr>
+ <td>
+ <img src=assets/demo/man/man.png width="95%">
+ </td>
+ <td >
+ <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/a6e7d431-5643-4745-9868-8b423a454153 controls preload></video>
+ </td>
+ <td >
+ <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/6ccf7bc7-cb48-42de-85bd-076d5ee8a623 controls preload></video>
+ </td>
+ </tr>
  <tr>
  <td>
  <img src=assets/demo/monalisa/monalisa.png width="95%">
 
  </tr>
  </table>

+ * For video dubbing, we applied a self-developed tool which can identify the talking person.

+ ## Some interesting videos!
+ <table class="center">
+ <tr style="font-weight: bolder;text-align:center;">
+ <td width="50%">Image</td>
+ <td width="50%">MuseV + MuseTalk</td>
+ </tr>
+ <tr>
+ <td>
+ <img src=assets/demo/video1/video1.png width="95%">
+ </td>
+ <td>
+ <video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/1f02f9c6-8b98-475e-86b8-82ebee82fe0d controls preload></video>
+ </td>
+ </tr>
+ </table>

  # TODO:
  - [x] trained models and inference codes.
+ - [x] Huggingface Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk).
+ - [x] codes for real-time inference.
  - [ ] technical report.
  - [ ] training codes.
  - [ ] a better model (may take longer).


  # Getting Started
  We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:
+
+ ## Third-party integration
+ Thanks to the community for the third-party integrations, which make installation and use more convenient for everyone.
+ Please note that we have not verified, maintained, or updated these third-party integrations; refer to the respective projects for their results.
+
+ ### [ComfyUI](https://github.com/chaojie/ComfyUI-MuseTalk)
+
  ## Installation
  To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
  ### Build environment
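If you still need to create the base environment first (the repository recommends a Python version >= 3.10 and CUDA 11.7), one common way is conda. This is only an illustration; the environment name is arbitrary:

```shell
# Illustrative only: create and activate a Python 3.10 environment with conda
conda create -n musetalk python=3.10 -y
conda activate musetalk
```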
 
  ```shell
  pip install -r requirements.txt
  ```

  ### mmlab packages
  ```bash
 
  python -m scripts.inference --inference_config configs/inference/test.yaml
  ```
  configs/inference/test.yaml is the path to the inference configuration file, including video_path and audio_path.
+ The video_path should be either a video file, an image file or a directory of images.
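For orientation, a minimal sketch of what such a config could look like is shown below. Only `video_path` and `audio_path` are named here; the `task_0` grouping and the paths are illustrative assumptions, so check the `configs/inference/test.yaml` shipped with the repository for the exact schema.

```shell
# Hypothetical sketch of an inference config; the task_0 grouping and paths are assumptions.
cat > configs/inference/my_test.yaml << 'EOF'
task_0:
  video_path: "data/video/input.mp4"   # a video file, an image file, or a directory of images
  audio_path: "data/audio/speech.wav"  # the driving audio
EOF
```

The resulting file is then passed to `--inference_config` as in the command above.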
+
+ We recommend using input video at `25fps`, the same frame rate used when training the model. If your video has a much lower frame rate, apply frame interpolation or directly convert it to 25fps with ffmpeg.
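As an illustration, a minimal ffmpeg conversion could look like this (the file names are placeholders):

```shell
# Hypothetical example: re-encode a video at 25 fps with ffmpeg
ffmpeg -i input.mp4 -r 25 input_25fps.mp4
```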

  #### Use of bbox_shift to have adjustable results
+ :mag_right: We have found that the upper bound of the mask has an important impact on mouth openness. Thus, to control the mask region, we suggest using the `bbox_shift` parameter. Positive values (moving towards the lower half) increase mouth openness, while negative values (moving towards the upper half) decrease it.

  You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range.
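For instance, once the default run has printed the suggested range, you might re-run with an explicit shift. The value below is only an illustration; use a value inside the range reported for your own video:

```shell
# Hypothetical example: re-run inference with an explicit bbox_shift value
python -m scripts.inference --inference_config configs/inference/test.yaml --bbox_shift -7
```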
 
 

  #### Combining MuseV and MuseTalk

+ As a complete solution to virtual human generation, we suggest first applying [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video or pose-to-video) by referring to [this](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Frame interpolation is suggested to increase the frame rate. Then, you can use `MuseTalk` to generate a lip-sync video by referring to [this](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).

+ #### :new: Real-time inference

+ Here, we provide the real-time inference script. It first applies the necessary pre-processing, such as face detection, face parsing and VAE encoding, in advance. During inference, only the UNet and the VAE decoder are involved, which makes MuseTalk real-time.

+ ```
+ python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --batch_size 4
+ ```
+ configs/inference/realtime.yaml is the path to the real-time inference configuration file, including `preparation`, `video_path`, `bbox_shift` and `audio_clips`.
+
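As a rough illustration only (the top-level key, the `audio_clips` layout, and the example values are assumptions; the authoritative schema is the `configs/inference/realtime.yaml` shipped with the repository), such a file might look like:

```shell
# Hypothetical sketch of a real-time inference config; field names follow the README,
# but the nesting and example values are illustrative assumptions.
cat > configs/inference/my_realtime.yaml << 'EOF'
my_avatar:
  preparation: True                  # prepare materials for a new avatar (re-run after changing bbox_shift)
  video_path: "data/video/input.mp4"
  bbox_shift: 5
  audio_clips:
    audio_0: "data/audio/yongen.wav"
EOF
```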
+ 1. Set `preparation` to `True` in `realtime.yaml` to prepare the materials for a new `avatar`. (If `bbox_shift` has changed, you also need to re-prepare the materials.)
+ 2. After that, the `avatar` will use an audio clip selected from `audio_clips` to generate video:
+ ```
+ Inferring using: data/audio/yongen.wav
+ ```
+ 3. While MuseTalk is inferring, sub-threads can simultaneously stream the results to the users. The generation process can achieve 30fps+ on an NVIDIA Tesla V100.
+ 4. Set `preparation` to `False` and re-run this script if you want to generate more videos using the same avatar.
+
+ ##### Note for Real-time inference
+ 1. If you want to generate multiple videos using the same avatar/video, you can also use this script to **SIGNIFICANTLY** expedite the generation process.
+ 2. In the previous script, the generation time is also limited by I/O (e.g. saving images). If you just want to test the generation speed without saving the images, you can run
+ ```
+ python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --skip_save_images
+ ```

  # Acknowledgement
+ 1. We thank the open-source components [whisper](https://github.com/openai/whisper), [dwpose](https://github.com/IDEA-Research/DWPose), [face-alignment](https://github.com/1adrianb/face-alignment), [face-parsing](https://github.com/zllrunning/face-parsing.PyTorch), and [S3FD](https://github.com/yxlijun/S3FD.pytorch).
+ 2. MuseTalk has drawn heavily on [diffusers](https://github.com/huggingface/diffusers) and [isaacOnline/whisper](https://github.com/isaacOnline/whisper/tree/extract-embeddings).
+ 3. MuseTalk is built on the [HDTF](https://github.com/MRzzm/HDTF) dataset.

  Thanks for open-sourcing!

 
  ```bib
  @article{musetalk,
  title={MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting},
+ author={Zhang, Yue and Liu, Minhao and Chen, Zhaokang and Wu, Bin and Zeng, Yubin and Zhan, Chao and He, Yingjie and Huang, Junxin and Zhou, Wenjiang},
  journal={arxiv},
  year={2024}
  }
  ```
  # Disclaimer/License
  1. `code`: The code of MuseTalk is released under the MIT License. There is no limitation for either academic or commercial usage.
+ 2. `model`: The trained models are available for any purpose, even commercial use.
+ 3. `other open-source models`: Other open-source models used must comply with their own licenses, such as `whisper`, `ft-mse-vae`, `dwpose`, `S3FD`, etc.
+ 4. The test data are collected from the internet and are available for non-commercial research purposes only.
+ 5. `AIGC`: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and to use it responsibly. The developers do not assume any responsibility for potential misuse by users.