Commit 9a973f2
Parent(s): 8f7f7c3
Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See raw diff.
- .gitattributes +2 -0
- README.md +70 -0
- checkpoints/renderer_checkpoint.pt +3 -0
- checkpoints/styletalk_checkpoint.pth +3 -0
- configs/__pycache__/default.cpython-37.pyc +0 -0
- configs/default.py +65 -0
- configs/renderer_conf.yaml +17 -0
- core/__pycache__/utils.cpython-37.pyc +0 -0
- core/networks/__init__.py +9 -0
- core/networks/__pycache__/__init__.cpython-37.pyc +0 -0
- core/networks/__pycache__/disentangle_decoder.cpython-37.pyc +0 -0
- core/networks/__pycache__/dynamic_conv.cpython-37.pyc +0 -0
- core/networks/__pycache__/dynamic_fc_decoder.cpython-37.pyc +0 -0
- core/networks/__pycache__/dynamic_linear.cpython-37.pyc +0 -0
- core/networks/__pycache__/generator.cpython-37.pyc +0 -0
- core/networks/__pycache__/mish.cpython-37.pyc +0 -0
- core/networks/__pycache__/self_attention_pooling.cpython-37.pyc +0 -0
- core/networks/__pycache__/styletalk.cpython-37.pyc +0 -0
- core/networks/__pycache__/transformer.cpython-37.pyc +0 -0
- core/networks/building_blocks.py +112 -0
- core/networks/disentangle_decoder.py +184 -0
- core/networks/dynamic_conv.py +149 -0
- core/networks/dynamic_fc_decoder.py +140 -0
- core/networks/dynamic_linear.py +42 -0
- core/networks/generator.py +213 -0
- core/networks/mish.py +51 -0
- core/networks/self_attention_pooling.py +43 -0
- core/networks/styletalk.py +24 -0
- core/networks/transformer.py +300 -0
- core/utils.py +228 -0
- demo.mp4 +0 -0
- demo.npy +3 -0
- demo_download.mp4 +0 -0
- demo_download.npy +3 -0
- env.yaml +0 -0
- environment.yml +91 -0
- generators/__pycache__/base_function.cpython-37.pyc +0 -0
- generators/__pycache__/face_model.cpython-37.pyc +0 -0
- generators/__pycache__/flow_util.cpython-37.pyc +0 -0
- generators/base_function.py +368 -0
- generators/face_model.py +127 -0
- generators/flow_util.py +56 -0
- inference_for_demo.py +187 -0
- media/first_page.png +3 -0
- phindex.json +1 -0
- requirements.txt +11 -0
- samples/source_video/3DMM/KristiNoem.mat +0 -0
- samples/source_video/3DMM/Obama_clip1.mat +0 -0
- samples/source_video/3DMM/Obama_clip2.mat +0 -0
- samples/source_video/3DMM/Obama_clip3.mat +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+media/first_page.png filter=lfs diff=lfs merge=lfs -text
+samples/source_video/wav/intro.wav filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,70 @@
# StyleTalk

The official repository of the AAAI 2023 paper [StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles](https://arxiv.org/abs/2301.01081).

<p align='center'>
<b>
<a href="https://arxiv.org/abs/2301.01081">Paper</a>

<a href="https://drive.google.com/file/d/19WRhBHYVWRIH8_zo332l00fLXfUE96-k/view?usp=share_link">Supp. Materials</a>

<a href="https://youtu.be/mO2Tjcwr4u8">Video</a>
</b>
</p>

<p align='center'>
<img src='media/first_page.png' width='700'/>
</p>

The proposed **StyleTalk** can generate talking head videos with speaking styles specified by arbitrary style reference videos.

# News
* April 14th, 2023. The code is available.

# Get Started

## Installation

Clone this repo, install conda, and run:

```bash
conda create -n styletalk python=3.7.0
conda activate styletalk
pip install -r requirements.txt
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
conda update ffmpeg
```

The code has been tested with CUDA 11.1 on an RTX 3090 GPU.

## Data Preprocessing

Our method takes 3DMM parameters (\*.mat) and phoneme labels (\*_seq.json) as input. Follow [PIRenderer](https://github.com/RenYurui/PIRender) to extract the 3DMM parameters and [AVCT](https://github.com/FuxiVirtualHuman/AAAI22-one-shot-talking-face) to extract the phoneme labels. Some preprocessed data can be found in the `samples` folder.
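As an illustrative sketch (not part of this repository's documented interface), the two preprocessed inputs could be inspected like this before inference; the sample file paths come from `samples`, while the structure of the `.mat` keys and the JSON contents are assumptions:

```python
import json
from scipy.io import loadmat  # assumes scipy is available in the environment

# Hypothetical quick check of the preprocessed inputs (key names are assumptions).
pose_mat = loadmat("samples/source_video/3DMM/reagan_clip1.mat")        # per-frame 3DMM parameters
with open("samples/source_video/phoneme/reagan_clip1_seq.json") as f:
    phoneme_seq = json.load(f)                                          # per-frame phoneme labels

print(sorted(pose_mat.keys()))   # inspect which 3DMM fields the .mat file provides
print(len(phoneme_seq))          # number of phoneme entries in the clip
```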
## Inference

Download the checkpoints for [StyleTalk](https://drive.google.com/file/d/1z54FymEiyPQ0mPGrVePt8GMtDe-E2RmN/view?usp=share_link) and the [Renderer](https://drive.google.com/file/d/1wFAtFQjybKI3hwRWvtcBDl4tpZzlDkja/view?usp=share_link) and put them into `./checkpoints`.

Run the demo:

```bash
python inference_for_demo.py \
    --audio_path samples/source_video/phoneme/reagan_clip1_seq.json \
    --style_clip_path samples/style_clips/3DMM/happyenglish_clip1.mat \
    --pose_path samples/source_video/3DMM/reagan_clip1.mat \
    --src_img_path samples/source_video/image/andrew_clip_1.png \
    --wav_path samples/source_video/wav/reagan_clip1.wav \
    --output_path demo.mp4
```

Change `audio_path`, `style_clip_path`, `pose_path`, `src_img_path`, `wav_path`, and `output_path` to generate more results.

# Acknowledgement

Some code is borrowed from the following projects:
* [AVCT](https://github.com/FuxiVirtualHuman/AAAI22-one-shot-talking-face)
* [PIRenderer](https://github.com/RenYurui/PIRender)
* [Deep3DFaceRecon_pytorch](https://github.com/sicxu/Deep3DFaceRecon_pytorch)
* [Speech Drives Templates](https://github.com/ShenhanQian/SpeechDrivesTemplates)
* [FOMM video preprocessing](https://github.com/AliaksandrSiarohin/video-preprocessing)

Thanks for their contributions!
checkpoints/renderer_checkpoint.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a67014839d42d592255c9fc3b3ceecbcd62c27ce0c0a89ed6628292447404242
size 335281551
checkpoints/styletalk_checkpoint.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3bd52d0080d52440e25de44930b539d21c7f1431102c2811fafb30838e9812e
size 275485145
configs/__pycache__/default.cpython-37.pyc
ADDED
Binary file (1.92 kB)
configs/default.py
ADDED
@@ -0,0 +1,65 @@
from yacs.config import CfgNode as CN


_C = CN()
_C.TAG = "style_id_emotion"
_C.DECODER_TYPE = "DisentangleDecoder"
_C.CONTENT_ENCODER_TYPE = "ContentEncoder"
_C.STYLE_ENCODER_TYPE = "StyleEncoder"
_C.DISCRIMINATOR_TYPE = "Discriminator"


_C.WIN_SIZE = 5
_C.D_MODEL = 256

_C.DATASET = CN()
_C.DATASET.FACE3D_DIM = 64

_C.CONTENT_ENCODER = CN()
_C.CONTENT_ENCODER.d_model = _C.D_MODEL
_C.CONTENT_ENCODER.nhead = 8
_C.CONTENT_ENCODER.num_encoder_layers = 3
_C.CONTENT_ENCODER.dim_feedforward = 4 * _C.D_MODEL
_C.CONTENT_ENCODER.dropout = 0.1
_C.CONTENT_ENCODER.activation = "relu"
_C.CONTENT_ENCODER.normalize_before = False
_C.CONTENT_ENCODER.pos_embed_len = 2 * _C.WIN_SIZE + 1
_C.CONTENT_ENCODER.ph_embed_dim = 128

_C.STYLE_ENCODER = CN()
_C.STYLE_ENCODER.d_model = _C.D_MODEL
_C.STYLE_ENCODER.nhead = 8
_C.STYLE_ENCODER.num_encoder_layers = 3
_C.STYLE_ENCODER.dim_feedforward = 4 * _C.D_MODEL
_C.STYLE_ENCODER.dropout = 0.1
_C.STYLE_ENCODER.activation = "relu"
_C.STYLE_ENCODER.normalize_before = False
_C.STYLE_ENCODER.pos_embed_len = 256
_C.STYLE_ENCODER.aggregate_method = "self_attention_pooling"  # average | self_attention_pooling
# _C.STYLE_ENCODER.input_dim = _C.DATASET.FACE3D_DIM

_C.DECODER = CN()
_C.DECODER.d_model = _C.D_MODEL
_C.DECODER.nhead = 8
_C.DECODER.num_decoder_layers = 3
_C.DECODER.dim_feedforward = 4 * _C.D_MODEL
_C.DECODER.dropout = 0.1
_C.DECODER.activation = "relu"
_C.DECODER.normalize_before = False
_C.DECODER.return_intermediate_dec = False
_C.DECODER.pos_embed_len = 2 * _C.WIN_SIZE + 1
_C.DECODER.network_type = "DynamicFCDecoder"
_C.DECODER.dynamic_K = 8
_C.DECODER.dynamic_ratio = 4
# fmt: off
_C.DECODER.upper_face3d_indices = [6, 8, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]
# fmt: on
_C.DECODER.lower_face3d_indices = [0, 1, 2, 3, 4, 5, 7, 9, 10, 11, 12, 13, 14]

_C.INFERENCE = CN()
_C.INFERENCE.CHECKPOINT = ""


def get_cfg_defaults():
    """Get a yacs CfgNode object with default values for my_project."""
    return _C.clone()
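A small usage sketch (not part of the commit): this is the standard yacs pattern for taking the defaults above and overriding individual values, for example the inference checkpoint path; the override values shown here are illustrative only.

```python
from configs.default import get_cfg_defaults

cfg = get_cfg_defaults()                       # clone of the _C tree defined above
cfg.merge_from_list(["INFERENCE.CHECKPOINT",   # override a default from code
                     "checkpoints/styletalk_checkpoint.pth"])
cfg.freeze()                                   # make the config immutable before use

print(cfg.DECODER.network_type)                # "DynamicFCDecoder"
print(cfg.WIN_SIZE, cfg.D_MODEL)               # 5 256
```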
configs/renderer_conf.yaml
ADDED
@@ -0,0 +1,17 @@
common:
  descriptor_nc: 256
  image_nc: 3
  max_nc: 256
  use_spect: false
editing_net:
  base_nc: 64
  layer: 3
  num_res_blocks: 2
mapping_net:
  coeff_nc: 73
  descriptor_nc: 256
  layer: 3
warpping_net:
  base_nc: 32
  decoder_layer: 3
  encoder_layer: 5
core/__pycache__/utils.cpython-37.pyc
ADDED
Binary file (6.26 kB)
core/networks/__init__.py
ADDED
@@ -0,0 +1,9 @@
from core.networks.generator import ContentEncoder, StyleEncoder, Decoder
from core.networks.disentangle_decoder import DisentangleDecoder


def get_network(name: str):
    obj = globals().get(name)
    if obj is None:
        raise KeyError("Unknown Network: %s" % name)
    else:
        return obj
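For illustration (an assumption about intended usage, not code in this commit), `get_network` simply resolves a class name from this module's globals, which is how the type strings in `configs/default.py` are turned into network classes:

```python
from core.networks import get_network

decoder_cls = get_network("DisentangleDecoder")   # looks the name up in core.networks' globals
decoder = decoder_cls(d_model=256, nhead=8, network_type="DynamicFCDecoder",
                      dynamic_K=8, dynamic_ratio=4)

# get_network("NoSuchNetwork")  # would raise KeyError("Unknown Network: NoSuchNetwork")
```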
core/networks/__pycache__/__init__.cpython-37.pyc
ADDED
Binary file (533 Bytes)
core/networks/__pycache__/disentangle_decoder.cpython-37.pyc
ADDED
Binary file (3.17 kB)
core/networks/__pycache__/dynamic_conv.cpython-37.pyc
ADDED
Binary file (3.78 kB)
core/networks/__pycache__/dynamic_fc_decoder.cpython-37.pyc
ADDED
Binary file (3.16 kB)
core/networks/__pycache__/dynamic_linear.cpython-37.pyc
ADDED
Binary file (1.3 kB)
core/networks/__pycache__/generator.cpython-37.pyc
ADDED
Binary file (4.85 kB)
core/networks/__pycache__/mish.cpython-37.pyc
ADDED
Binary file (1.7 kB)
core/networks/__pycache__/self_attention_pooling.cpython-37.pyc
ADDED
Binary file (1.61 kB)
core/networks/__pycache__/styletalk.cpython-37.pyc
ADDED
Binary file (998 Bytes)
core/networks/__pycache__/transformer.cpython-37.pyc
ADDED
Binary file (9.3 kB)
core/networks/building_blocks.py
ADDED
@@ -0,0 +1,112 @@
from torch import nn


class ADAIN(nn.Module):
    def __init__(self, content_nc, condition_nc, hidden_nc):
        super().__init__()

        self.param_free_norm = nn.InstanceNorm1d(content_nc, affine=False)

        use_bias = True

        self.mlp_shared = nn.Sequential(
            nn.Linear(condition_nc, hidden_nc, bias=use_bias),
            nn.ReLU(),
        )
        self.mlp_gamma = nn.Linear(hidden_nc, content_nc, bias=use_bias)
        self.mlp_beta = nn.Linear(hidden_nc, content_nc, bias=use_bias)

    def forward(self, content, condition):
        # Part 1. generate parameter-free normalized activations
        normalized = self.param_free_norm(content)

        # Part 2. produce scaling and bias conditioned on feature
        actv = self.mlp_shared(condition)
        gamma = self.mlp_gamma(actv)
        beta = self.mlp_beta(actv)

        # apply scale and bias
        gamma = gamma.unsqueeze(-1)
        beta = beta.unsqueeze(-1)
        out = normalized * (1 + gamma) + beta
        return out


class ConvNormRelu(nn.Module):
    def __init__(
        self,
        conv_type="1d",
        in_channels=3,
        out_channels=64,
        downsample=False,
        kernel_size=None,
        stride=None,
        padding=None,
        norm="IN",
        leaky=False,
        adain_condition_nc=None,
        adain_hidden_nc=None,
    ):
        super().__init__()
        if kernel_size is None:
            if downsample:
                kernel_size, stride, padding = 4, 2, 1
            else:
                kernel_size, stride, padding = 3, 1, 1

        if conv_type == "1d":
            self.conv = nn.Conv1d(
                in_channels,
                out_channels,
                kernel_size,
                stride,
                padding,
                bias=False,
            )
        if norm == "IN":
            self.norm = nn.InstanceNorm1d(out_channels, affine=True)
        elif norm == "ADAIN":
            self.norm = ADAIN(out_channels, adain_condition_nc, adain_hidden_nc)
        elif norm == "NONE":
            self.norm = nn.Identity()
        else:
            raise NotImplementedError
        nn.init.kaiming_normal_(self.conv.weight)

        self.act = nn.LeakyReLU(negative_slope=0.2, inplace=True) if leaky else nn.ReLU(inplace=True)

    def forward(self, x, condition=None):
        """

        Args:
            x (_type_): (B, C, L)
            condition (_type_, optional): (B, C)

        Returns:
            _type_: _description_
        """
        x = self.conv(x)
        if isinstance(self.norm, ADAIN):
            x = self.norm(x, condition)
        else:
            x = self.norm(x)
        x = self.act(x)
        return x


class MyConv1d(nn.Module):
    def __init__(self, cin, cout, kernel_size, stride, padding, residual=False, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.conv_block = nn.Sequential(
            nn.Conv1d(cin, cout, kernel_size, stride, padding),
            nn.BatchNorm1d(cout),
        )
        self.act = nn.ReLU()
        self.residual = residual

    def forward(self, x):
        out = self.conv_block(x)
        if self.residual:
            out += x
        return self.act(out)
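A shape-level sketch of how `ADAIN` and `ConvNormRelu` above are meant to be called; the tensor sizes are illustrative assumptions, not values taken from this commit:

```python
import torch
from core.networks.building_blocks import ADAIN, ConvNormRelu

feat = torch.randn(2, 64, 50)          # (B, C, L) content features
cond = torch.randn(2, 256)             # (B, C_style) conditioning vector

adain = ADAIN(content_nc=64, condition_nc=256, hidden_nc=128)
out = adain(feat, cond)                # (2, 64, 50): instance-normed, then scaled/shifted per style

block = ConvNormRelu(conv_type="1d", in_channels=64, out_channels=64,
                     norm="ADAIN", adain_condition_nc=256, adain_hidden_nc=128)
out = block(feat, condition=cond)      # conv -> ADAIN -> ReLU, same (B, C, L) layout
```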
core/networks/disentangle_decoder.py
ADDED
@@ -0,0 +1,184 @@
import torch
from torch import nn

from .transformer import (
    PositionalEncoding,
    TransformerDecoderLayer,
    TransformerDecoder,
)
from core.networks.dynamic_fc_decoder import DynamicFCDecoderLayer, DynamicFCDecoder
from core.utils import _reset_parameters


def get_decoder_network(
    network_type,
    d_model,
    nhead,
    dim_feedforward,
    dropout,
    activation,
    normalize_before,
    num_decoder_layers,
    return_intermediate_dec,
    dynamic_K,
    dynamic_ratio,
):
    decoder = None
    if network_type == "TransformerDecoder":
        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, activation, normalize_before)
        norm = nn.LayerNorm(d_model)
        decoder = TransformerDecoder(
            decoder_layer,
            num_decoder_layers,
            norm,
            return_intermediate_dec,
        )
    elif network_type == "DynamicFCDecoder":
        d_style = d_model
        decoder_layer = DynamicFCDecoderLayer(
            d_model,
            nhead,
            d_style,
            dynamic_K,
            dynamic_ratio,
            dim_feedforward,
            dropout,
            activation,
            normalize_before,
        )
        norm = nn.LayerNorm(d_model)
        decoder = DynamicFCDecoder(decoder_layer, num_decoder_layers, norm, return_intermediate_dec)
    else:
        raise ValueError(f"Invalid network_type {network_type}")

    return decoder


class DisentangleDecoder(nn.Module):
    def __init__(
        self,
        d_model=512,
        nhead=8,
        num_decoder_layers=3,
        dim_feedforward=2048,
        dropout=0.1,
        activation="relu",
        normalize_before=False,
        return_intermediate_dec=False,
        pos_embed_len=80,
        upper_face3d_indices=tuple(list(range(19)) + list(range(46, 51))),
        lower_face3d_indices=tuple(range(19, 46)),
        network_type="None",
        dynamic_K=None,
        dynamic_ratio=None,
        **_,
    ) -> None:
        super().__init__()

        self.upper_face3d_indices = upper_face3d_indices
        self.lower_face3d_indices = lower_face3d_indices

        self.upper_decoder = get_decoder_network(
            network_type,
            d_model,
            nhead,
            dim_feedforward,
            dropout,
            activation,
            normalize_before,
            num_decoder_layers,
            return_intermediate_dec,
            dynamic_K,
            dynamic_ratio,
        )
        _reset_parameters(self.upper_decoder)

        self.lower_decoder = get_decoder_network(
            network_type,
            d_model,
            nhead,
            dim_feedforward,
            dropout,
            activation,
            normalize_before,
            num_decoder_layers,
            return_intermediate_dec,
            dynamic_K,
            dynamic_ratio,
        )
        _reset_parameters(self.lower_decoder)

        self.pos_embed = PositionalEncoding(d_model, pos_embed_len)

        tail_hidden_dim = d_model // 2
        self.upper_tail_fc = nn.Sequential(
            nn.Linear(d_model, tail_hidden_dim),
            nn.ReLU(),
            nn.Linear(tail_hidden_dim, tail_hidden_dim),
            nn.ReLU(),
            nn.Linear(tail_hidden_dim, len(upper_face3d_indices)),
        )
        self.lower_tail_fc = nn.Sequential(
            nn.Linear(d_model, tail_hidden_dim),
            nn.ReLU(),
            nn.Linear(tail_hidden_dim, tail_hidden_dim),
            nn.ReLU(),
            nn.Linear(tail_hidden_dim, len(lower_face3d_indices)),
        )

    def forward(self, content, style_code):
        """

        Args:
            content (_type_): (B, num_frames, window, C_dmodel)
            style_code (_type_): (B, C_dmodel)

        Returns:
            face3d: (B, L_clip, C_3dmm)
        """
        B, N, W, C = content.shape
        style = style_code.reshape(B, 1, 1, C).expand(B, N, W, C)
        style = style.permute(2, 0, 1, 3).reshape(W, B * N, C)
        # (W, B*N, C)

        content = content.permute(2, 0, 1, 3).reshape(W, B * N, C)
        # (W, B*N, C)
        tgt = torch.zeros_like(style)
        pos_embed = self.pos_embed(W)
        pos_embed = pos_embed.permute(1, 0, 2)

        upper_face3d_feat = self.upper_decoder(tgt, content, pos=pos_embed, query_pos=style)[0]
        # (W, B*N, C)
        upper_face3d_feat = upper_face3d_feat.permute(1, 0, 2).reshape(B, N, W, C)[:, :, W // 2, :]
        # (B, N, C)
        upper_face3d = self.upper_tail_fc(upper_face3d_feat)
        # (B, N, C_exp)

        lower_face3d_feat = self.lower_decoder(tgt, content, pos=pos_embed, query_pos=style)[0]
        lower_face3d_feat = lower_face3d_feat.permute(1, 0, 2).reshape(B, N, W, C)[:, :, W // 2, :]
        lower_face3d = self.lower_tail_fc(lower_face3d_feat)
        C_exp = len(self.upper_face3d_indices) + len(self.lower_face3d_indices)
        face3d = torch.zeros(B, N, C_exp).to(upper_face3d)
        face3d[:, :, self.upper_face3d_indices] = upper_face3d
        face3d[:, :, self.lower_face3d_indices] = lower_face3d
        return face3d
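For reference, an input/output shape sketch for `DisentangleDecoder` (dimensions follow `configs/default.py`; batch size, frame count, and the upper/lower index split are illustrative assumptions):

```python
import torch
from core.networks.disentangle_decoder import DisentangleDecoder

decoder = DisentangleDecoder(
    d_model=256, nhead=8, num_decoder_layers=3, dim_feedforward=1024,
    pos_embed_len=11,                            # 2 * WIN_SIZE + 1
    network_type="DynamicFCDecoder", dynamic_K=8, dynamic_ratio=4,
    upper_face3d_indices=list(range(13, 64)),    # illustrative split into 51 + 13 coefficients
    lower_face3d_indices=list(range(0, 13)),
)

content = torch.randn(2, 32, 11, 256)            # (B, num_frames, window, d_model) from ContentEncoder
style_code = torch.randn(2, 256)                 # (B, d_model) from StyleEncoder
face3d = decoder(content, style_code)            # (2, 32, 64) predicted 3DMM expression parameters
```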
core/networks/dynamic_conv.py
ADDED
@@ -0,0 +1,149 @@
import math

import torch
from torch import nn
from torch.nn import functional as F


class Attention(nn.Module):
    def __init__(self, cond_planes, ratio, K, temperature=30, init_weight=True):
        super().__init__()
        # self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.temprature = temperature
        assert cond_planes > ratio
        hidden_planes = cond_planes // ratio
        self.net = nn.Sequential(
            nn.Conv2d(cond_planes, hidden_planes, kernel_size=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(hidden_planes, K, kernel_size=1, bias=False),
        )

        if init_weight:
            self._initialize_weights()

    def update_temprature(self):
        if self.temprature > 1:
            self.temprature -= 1

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            if isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, cond):
        """

        Args:
            cond (_type_): (B, C_style)

        Returns:
            _type_: (B, K)
        """
        # att = self.avgpool(cond)  # bs,dim,1,1
        att = cond.view(cond.shape[0], cond.shape[1], 1, 1)
        att = self.net(att).view(cond.shape[0], -1)  # bs,K
        return F.softmax(att / self.temprature, -1)


class DynamicConv(nn.Module):
    def __init__(
        self,
        in_planes,
        out_planes,
        cond_planes,
        kernel_size,
        stride,
        padding=0,
        dilation=1,
        groups=1,
        bias=True,
        K=4,
        temperature=30,
        ratio=4,
        init_weight=True,
    ):
        super().__init__()
        self.in_planes = in_planes
        self.out_planes = out_planes
        self.cond_planes = cond_planes
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        self.groups = groups
        self.bias = bias
        self.K = K
        self.init_weight = init_weight
        self.attention = Attention(
            cond_planes=cond_planes, ratio=ratio, K=K, temperature=temperature, init_weight=init_weight
        )

        self.weight = nn.Parameter(
            torch.randn(K, out_planes, in_planes // groups, kernel_size, kernel_size), requires_grad=True
        )
        if bias:
            self.bias = nn.Parameter(torch.randn(K, out_planes), requires_grad=True)
        else:
            self.bias = None

        if self.init_weight:
            self._initialize_weights()

    def _initialize_weights(self):
        for i in range(self.K):
            nn.init.kaiming_uniform_(self.weight[i], a=math.sqrt(5))
            if self.bias is not None:
                fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight[i])
                if fan_in != 0:
                    bound = 1 / math.sqrt(fan_in)
                    nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, x, cond):
        """

        Args:
            x (_type_): (B, C_in, L, 1)
            cond (_type_): (B, C_style)

        Returns:
            _type_: (B, C_out, L, 1)
        """
        bs, in_planels, h, w = x.shape
        softmax_att = self.attention(cond)  # bs,K
        x = x.view(1, -1, h, w)
        weight = self.weight.view(self.K, -1)  # K,-1
        aggregate_weight = torch.mm(softmax_att, weight).view(
            bs * self.out_planes, self.in_planes // self.groups, self.kernel_size, self.kernel_size
        )  # bs*out_p,in_p,k,k

        if self.bias is not None:
            bias = self.bias.view(self.K, -1)  # K,out_p
            aggregate_bias = torch.mm(softmax_att, bias).view(-1)  # bs*out_p
            output = F.conv2d(
                x,  # 1, bs*in_p, L, 1
                weight=aggregate_weight,
                bias=aggregate_bias,
                stride=self.stride,
                padding=self.padding,
                groups=self.groups * bs,
                dilation=self.dilation,
            )
        else:
            output = F.conv2d(
                x,
                weight=aggregate_weight,
                bias=None,
                stride=self.stride,
                padding=self.padding,
                groups=self.groups * bs,
                dilation=self.dilation,
            )

        output = output.view(bs, self.out_planes, h, w)
        return output
core/networks/dynamic_fc_decoder.py
ADDED
@@ -0,0 +1,140 @@
import torch.nn as nn
import torch

from core.networks.transformer import _get_activation_fn, _get_clones
from core.networks.dynamic_linear import DynamicLinear


class DynamicFCDecoderLayer(nn.Module):
    def __init__(
        self,
        d_model,
        nhead,
        d_style,
        dynamic_K,
        dynamic_ratio,
        dim_feedforward=2048,
        dropout=0.1,
        activation="relu",
        normalize_before=False,
    ):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        # self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear1 = DynamicLinear(d_model, dim_feedforward, d_style, K=dynamic_K, ratio=dynamic_ratio)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        # self.linear2 = DynamicLinear(dim_feedforward, d_model, d_style, K=dynamic_K, ratio=dynamic_ratio)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos):
        return tensor if pos is None else tensor + pos

    def forward_post(
        self,
        tgt,
        memory,
        style,
        tgt_mask=None,
        memory_mask=None,
        tgt_key_padding_mask=None,
        memory_key_padding_mask=None,
        pos=None,
        query_pos=None,
    ):
        # q = k = self.with_pos_embed(tgt, query_pos)
        tgt2 = self.self_attn(tgt, tgt, value=tgt, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)
        tgt2 = self.multihead_attn(
            query=tgt, key=memory, value=memory, attn_mask=memory_mask, key_padding_mask=memory_key_padding_mask
        )[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)
        # tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt, style))), style)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt, style))))
        tgt = tgt + self.dropout3(tgt2)
        tgt = self.norm3(tgt)
        return tgt

    def forward(
        self,
        tgt,
        memory,
        style,
        tgt_mask=None,
        memory_mask=None,
        tgt_key_padding_mask=None,
        memory_key_padding_mask=None,
        pos=None,
        query_pos=None,
    ):
        if self.normalize_before:
            raise NotImplementedError

        return self.forward_post(
            tgt, memory, style, tgt_mask, memory_mask, tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos
        )


class DynamicFCDecoder(nn.Module):
    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
        super().__init__()
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.return_intermediate = return_intermediate

    def forward(
        self,
        tgt,
        memory,
        tgt_mask=None,
        memory_mask=None,
        tgt_key_padding_mask=None,
        memory_key_padding_mask=None,
        pos=None,
        query_pos=None,
    ):
        style = query_pos[0]
        # (B*N, C)
        output = tgt + pos + query_pos

        intermediate = []

        for layer in self.layers:
            output = layer(
                output,
                memory,
                style,
                tgt_mask=tgt_mask,
                memory_mask=memory_mask,
                tgt_key_padding_mask=tgt_key_padding_mask,
                memory_key_padding_mask=memory_key_padding_mask,
                pos=pos,
                query_pos=query_pos,
            )
            if self.return_intermediate:
                intermediate.append(self.norm(output))

        if self.norm is not None:
            output = self.norm(output)
            if self.return_intermediate:
                intermediate.pop()
                intermediate.append(output)

        if self.return_intermediate:
            return torch.stack(intermediate)

        return output.unsqueeze(0)
core/networks/dynamic_linear.py
ADDED
@@ -0,0 +1,42 @@
import math

import torch
from torch import nn
from torch.nn import functional as F

from core.networks.dynamic_conv import DynamicConv


class DynamicLinear(nn.Module):
    def __init__(self, in_planes, out_planes, cond_planes, bias=True, K=4, temperature=30, ratio=4, init_weight=True):
        super().__init__()

        self.dynamic_conv = DynamicConv(
            in_planes,
            out_planes,
            cond_planes,
            kernel_size=1,
            stride=1,
            padding=0,
            bias=bias,
            K=K,
            ratio=ratio,
            temperature=temperature,
            init_weight=init_weight,
        )

    def forward(self, x, cond):
        """

        Args:
            x (_type_): (L, B, C_in)
            cond (_type_): (B, C_style)

        Returns:
            _type_: (L, B, C_out)
        """
        x = x.permute(1, 2, 0).unsqueeze(-1)
        out = self.dynamic_conv(x, cond)
        # (B, C_out, L, 1)
        out = out.squeeze().permute(2, 0, 1)
        return out
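A quick shape check for `DynamicLinear` (the sizes are illustrative): it behaves like a per-sample linear layer whose weights are a style-conditioned softmax mixture of `K` candidate kernels, applied over a length-first sequence.

```python
import torch
from core.networks.dynamic_linear import DynamicLinear

layer = DynamicLinear(in_planes=256, out_planes=1024, cond_planes=256, K=8, ratio=4)

x = torch.randn(11, 4, 256)      # (L, B, C_in) token-first layout, as used inside the decoder layers
cond = torch.randn(4, 256)       # (B, C_style) style code that selects the kernel mixture
y = layer(x, cond)               # (11, 4, 1024)
```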
core/networks/generator.py
ADDED
@@ -0,0 +1,213 @@
import torch
from torch import nn

from .transformer import (
    TransformerEncoder,
    TransformerEncoderLayer,
    PositionalEncoding,
    TransformerDecoderLayer,
    TransformerDecoder,
)
from core.utils import _reset_parameters
from core.networks.self_attention_pooling import SelfAttentionPooling


class ContentEncoder(nn.Module):
    def __init__(
        self,
        d_model=512,
        nhead=8,
        num_encoder_layers=6,
        dim_feedforward=2048,
        dropout=0.1,
        activation="relu",
        normalize_before=False,
        pos_embed_len=80,
        ph_embed_dim=128,
    ):
        super().__init__()

        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation, normalize_before)
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        _reset_parameters(self.encoder)

        self.pos_embed = PositionalEncoding(d_model, pos_embed_len)

        self.ph_embedding = nn.Embedding(41, ph_embed_dim)
        self.increase_embed_dim = nn.Linear(ph_embed_dim, d_model)

    def forward(self, x):
        """

        Args:
            x (_type_): (B, num_frames, window)

        Returns:
            content: (B, num_frames, window, C_dmodel)
        """
        x_embedding = self.ph_embedding(x)
        x_embedding = self.increase_embed_dim(x_embedding)
        # (B, N, W, C)
        B, N, W, C = x_embedding.shape
        x_embedding = x_embedding.reshape(B * N, W, C)
        x_embedding = x_embedding.permute(1, 0, 2)
        # (W, B*N, C)

        pos = self.pos_embed(W)
        pos = pos.permute(1, 0, 2)
        # (W, 1, C)

        content = self.encoder(x_embedding, pos=pos)
        # (W, B*N, C)
        content = content.permute(1, 0, 2).reshape(B, N, W, C)
        # (B, N, W, C)

        return content


class StyleEncoder(nn.Module):
    def __init__(
        self,
        d_model=512,
        nhead=8,
        num_encoder_layers=6,
        dim_feedforward=2048,
        dropout=0.1,
        activation="relu",
        normalize_before=False,
        pos_embed_len=80,
        input_dim=128,
        aggregate_method="average",
    ):
        super().__init__()
        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation, normalize_before)
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
        _reset_parameters(self.encoder)

        self.pos_embed = PositionalEncoding(d_model, pos_embed_len)

        self.increase_embed_dim = nn.Linear(input_dim, d_model)

        self.aggregate_method = None
        if aggregate_method == "self_attention_pooling":
            self.aggregate_method = SelfAttentionPooling(d_model)
        elif aggregate_method == "average":
            pass
        else:
            raise ValueError(f"Invalid aggregate method {aggregate_method}")

    def forward(self, x, pad_mask=None):
        """

        Args:
            x (_type_): (B, num_frames(L), C_exp)
            pad_mask: (B, num_frames)

        Returns:
            style_code: (B, C_model)
        """
        x = self.increase_embed_dim(x)
        # (B, L, C)
        x = x.permute(1, 0, 2)
        # (L, B, C)

        pos = self.pos_embed(x.shape[0])
        pos = pos.permute(1, 0, 2)
        # (L, 1, C)

        style = self.encoder(x, pos=pos, src_key_padding_mask=pad_mask)
        # (L, B, C)

        if self.aggregate_method is not None:
            permute_style = style.permute(1, 0, 2)
            # (B, L, C)
            style_code = self.aggregate_method(permute_style, pad_mask)
            return style_code

        if pad_mask is None:
            style = style.permute(1, 2, 0)
            # (B, C, L)
            style_code = style.mean(2)
            # (B, C)
        else:
            permute_style = style.permute(1, 0, 2)
            # (B, L, C)
            permute_style[pad_mask] = 0
            sum_style_code = permute_style.sum(dim=1)
            # (B, C)
            valid_token_num = (~pad_mask).sum(dim=1).unsqueeze(-1)
            # (B, 1)
            style_code = sum_style_code / valid_token_num
            # (B, C)

        return style_code


class Decoder(nn.Module):
    def __init__(
        self,
        d_model=512,
        nhead=8,
        num_decoder_layers=3,
        dim_feedforward=2048,
        dropout=0.1,
        activation="relu",
        normalize_before=False,
        return_intermediate_dec=False,
        pos_embed_len=80,
        output_dim=64,
        **_,
    ) -> None:
        super().__init__()

        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = TransformerDecoder(
            decoder_layer,
            num_decoder_layers,
            decoder_norm,
            return_intermediate=return_intermediate_dec,
        )
        _reset_parameters(self.decoder)

        self.pos_embed = PositionalEncoding(d_model, pos_embed_len)

        tail_hidden_dim = d_model // 2
        self.tail_fc = nn.Sequential(
            nn.Linear(d_model, tail_hidden_dim),
            nn.ReLU(),
            nn.Linear(tail_hidden_dim, tail_hidden_dim),
            nn.ReLU(),
            nn.Linear(tail_hidden_dim, output_dim),
        )

    def forward(self, content, style_code):
        """

        Args:
            content (_type_): (B, num_frames, window, C_dmodel)
            style_code (_type_): (B, C_dmodel)

        Returns:
            face3d: (B, num_frames, C_3dmm)
        """
        B, N, W, C = content.shape
        style = style_code.reshape(B, 1, 1, C).expand(B, N, W, C)
        style = style.permute(2, 0, 1, 3).reshape(W, B * N, C)
        # (W, B*N, C)

        content = content.permute(2, 0, 1, 3).reshape(W, B * N, C)
        # (W, B*N, C)
        tgt = torch.zeros_like(style)
        pos_embed = self.pos_embed(W)
        pos_embed = pos_embed.permute(1, 0, 2)
        face3d_feat = self.decoder(tgt, content, pos=pos_embed, query_pos=style)[0]
        # (W, B*N, C)
        face3d_feat = face3d_feat.permute(1, 0, 2).reshape(B, N, W, C)[:, :, W // 2, :]
        # (B, N, C)
        face3d = self.tail_fc(face3d_feat)
        # (B, N, C_exp)
        return face3d
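A shape-level sketch of the two encoders above, using the dimensions from `configs/default.py`; the batch and frame counts are arbitrary, and phoneme IDs are assumed to lie in `[0, 41)`.

```python
import torch
from core.networks.generator import ContentEncoder, StyleEncoder

content_enc = ContentEncoder(d_model=256, nhead=8, num_encoder_layers=3,
                             dim_feedforward=1024, pos_embed_len=11, ph_embed_dim=128)
style_enc = StyleEncoder(d_model=256, nhead=8, num_encoder_layers=3,
                         dim_feedforward=1024, pos_embed_len=256, input_dim=64,
                         aggregate_method="self_attention_pooling")

phonemes = torch.randint(0, 41, (2, 32, 11))   # (B, num_frames, window) phoneme IDs
style_clip = torch.randn(2, 100, 64)           # (B, num_frames, 64) 3DMM expression parameters

content = content_enc(phonemes)                # (2, 32, 11, 256)
style_code = style_enc(style_clip)             # (2, 256)
```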
core/networks/mish.py
ADDED
@@ -0,0 +1,51 @@
"""
Applies the mish function element-wise:
mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))
"""

# import pytorch
import torch
import torch.nn.functional as F
from torch import nn


@torch.jit.script
def mish(input):
    """
    Applies the mish function element-wise:
    mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))
    See additional documentation for mish class.
    """
    return input * torch.tanh(F.softplus(input))


class Mish(nn.Module):
    """
    Applies the mish function element-wise:
    mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))

    Shape:
        - Input: (N, *) where * means, any number of additional
          dimensions
        - Output: (N, *), same shape as the input

    Examples:
        >>> m = Mish()
        >>> input = torch.randn(2)
        >>> output = m(input)

    Reference: https://pytorch.org/docs/stable/generated/torch.nn.Mish.html
    """

    def __init__(self):
        """
        Init method.
        """
        super().__init__()

    def forward(self, input):
        """
        Forward pass of the function.
        """
        if torch.__version__ >= "1.9":
            return F.mish(input)
        else:
            return mish(input)
core/networks/self_attention_pooling.py
ADDED
@@ -0,0 +1,43 @@
import torch
import torch.nn as nn
from core.networks.mish import Mish


class SelfAttentionPooling(nn.Module):
    """
    Implementation of SelfAttentionPooling
    Original Paper: Self-Attention Encoding and Pooling for Speaker Recognition
    https://arxiv.org/pdf/2008.01077v1.pdf
    """

    def __init__(self, input_dim):
        super(SelfAttentionPooling, self).__init__()
        self.W = nn.Sequential(nn.Linear(input_dim, input_dim), Mish(), nn.Linear(input_dim, 1))
        self.softmax = nn.functional.softmax

    def forward(self, batch_rep, att_mask=None):
        """
        N: batch size, T: sequence length, H: Hidden dimension
        input:
            batch_rep : size (N, T, H)
        attention_weight:
            att_w : size (N, T, 1)
        att_mask:
            att_mask: size (N, T): if True, mask this item.
        return:
            utter_rep: size (N, H)
        """
        att_logits = self.W(batch_rep).squeeze(-1)
        # (N, T)
        if att_mask is not None:
            att_mask_logits = att_mask.to(dtype=batch_rep.dtype) * -100000.0
            # (N, T)
            att_logits = att_mask_logits + att_logits

        att_w = self.softmax(att_logits, dim=-1).unsqueeze(-1)
        utter_rep = torch.sum(batch_rep * att_w, dim=1)

        return utter_rep
core/networks/styletalk.py
ADDED
@@ -0,0 +1,24 @@
import torch.nn as nn

from core.networks import get_network


class StyleTalk(nn.Module):
    def __init__(self, cfg) -> None:
        super().__init__()
        self.cfg = cfg

        content_encoder_class = get_network(cfg.CONTENT_ENCODER_TYPE)
        self.content_encoder = content_encoder_class(**cfg.CONTENT_ENCODER)

        style_encoder_class = get_network(cfg.STYLE_ENCODER_TYPE)
        cfg.defrost()
        cfg.STYLE_ENCODER.input_dim = cfg.DATASET.FACE3D_DIM
        cfg.freeze()
        self.style_encoder = style_encoder_class(**cfg.STYLE_ENCODER)

        decoder_class = get_network(cfg.DECODER_TYPE)
        cfg.defrost()
        cfg.DECODER.output_dim = cfg.DATASET.FACE3D_DIM
        cfg.freeze()
        self.decoder = decoder_class(**cfg.DECODER)
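A minimal assembly sketch, as an assumption about intended usage (what `inference_for_demo.py` presumably wires up): build the model from the default config and run the three sub-networks end to end with dummy tensors.

```python
import torch
from configs.default import get_cfg_defaults
from core.networks.styletalk import StyleTalk

cfg = get_cfg_defaults()
model = StyleTalk(cfg).eval()

phonemes = torch.randint(0, 41, (1, 32, 11))               # (B, num_frames, 2*WIN_SIZE+1)
style_clip = torch.randn(1, 200, cfg.DATASET.FACE3D_DIM)   # (B, L_style, 64) reference 3DMM clip

with torch.no_grad():
    content = model.content_encoder(phonemes)      # (1, 32, 11, 256)
    style_code = model.style_encoder(style_clip)   # (1, 256)
    face3d = model.decoder(content, style_code)    # (1, 32, 64) stylized expression parameters
```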
core/networks/transformer.py
ADDED
@@ -0,0 +1,300 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import torch.nn as nn
|
2 |
+
import torch
|
3 |
+
import numpy as np
|
4 |
+
import torch.nn.functional as F
|
5 |
+
import copy
|
6 |
+
|
7 |
+
|
8 |
+
class PositionalEncoding(nn.Module):
|
9 |
+
def __init__(self, d_hid, n_position=200):
|
10 |
+
super(PositionalEncoding, self).__init__()
|
11 |
+
|
12 |
+
# Not a parameter
|
13 |
+
self.register_buffer("pos_table", self._get_sinusoid_encoding_table(n_position, d_hid))
|
14 |
+
|
15 |
+
def _get_sinusoid_encoding_table(self, n_position, d_hid):
|
16 |
+
"""Sinusoid position encoding table"""
|
17 |
+
# TODO: make it with torch instead of numpy
|
18 |
+
|
19 |
+
def get_position_angle_vec(position):
|
20 |
+
return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]
|
21 |
+
|
22 |
+
sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
|
23 |
+
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
|
24 |
+
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
|
25 |
+
|
26 |
+
return torch.FloatTensor(sinusoid_table).unsqueeze(0)
|
27 |
+
|
28 |
+
def forward(self, winsize):
|
29 |
+
return self.pos_table[:, :winsize].clone().detach()
|
30 |
+
|
31 |
+
|
32 |
+
def _get_activation_fn(activation):
|
33 |
+
"""Return an activation function given a string"""
|
34 |
+
if activation == "relu":
|
35 |
+
return F.relu
|
36 |
+
if activation == "gelu":
|
37 |
+
return F.gelu
|
38 |
+
if activation == "glu":
|
39 |
+
return F.glu
|
40 |
+
raise RuntimeError(f"activation should be relu/gelu, not {activation}.")
|
41 |
+
|
42 |
+
|
43 |
+
def _get_clones(module, N):
|
44 |
+
return nn.ModuleList([copy.deepcopy(module) for i in range(N)])
|
45 |
+
|
46 |
+
|
47 |
+
class Transformer(nn.Module):
|
48 |
+
def __init__(
|
49 |
+
self,
|
50 |
+
d_model=512,
|
51 |
+
nhead=8,
|
52 |
+
num_encoder_layers=6,
|
53 |
+
num_decoder_layers=6,
|
54 |
+
dim_feedforward=2048,
|
55 |
+
dropout=0.1,
|
56 |
+
activation="relu",
|
57 |
+
normalize_before=False,
|
58 |
+
return_intermediate_dec=True,
|
59 |
+
):
|
60 |
+
super().__init__()
|
61 |
+
|
62 |
+
encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation, normalize_before)
|
63 |
+
encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
|
64 |
+
self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
|
65 |
+
|
66 |
+
decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, activation, normalize_before)
|
67 |
+
decoder_norm = nn.LayerNorm(d_model)
|
68 |
+
self.decoder = TransformerDecoder(
|
69 |
+
decoder_layer, num_decoder_layers, decoder_norm, return_intermediate=return_intermediate_dec
|
70 |
+
)
|
71 |
+
|
72 |
+
self._reset_parameters()
|
73 |
+
|
74 |
+
self.d_model = d_model
|
75 |
+
self.nhead = nhead
|
76 |
+
|
77 |
+
def _reset_parameters(self):
|
78 |
+
for p in self.parameters():
|
79 |
+
if p.dim() > 1:
|
80 |
+
nn.init.xavier_uniform_(p)
|
81 |
+
|
82 |
+
def forward(self, opt, src, query_embed, pos_embed):
|
83 |
+
# flatten NxCxHxW to HWxNxC
|
84 |
+
|
85 |
+
src = src.permute(1, 0, 2)
|
86 |
+
pos_embed = pos_embed.permute(1, 0, 2)
|
87 |
+
query_embed = query_embed.permute(1, 0, 2)
|
88 |
+
|
89 |
+
tgt = torch.zeros_like(query_embed)
|
90 |
+
memory = self.encoder(src, pos=pos_embed)
|
91 |
+
|
92 |
+
hs = self.decoder(tgt, memory, pos=pos_embed, query_pos=query_embed)
|
93 |
+
return hs
|
94 |
+
|
95 |
+
|
96 |
+
class TransformerEncoder(nn.Module):
|
97 |
+
def __init__(self, encoder_layer, num_layers, norm=None):
|
98 |
+
super().__init__()
|
99 |
+
self.layers = _get_clones(encoder_layer, num_layers)
|
100 |
+
self.num_layers = num_layers
|
101 |
+
self.norm = norm
|
102 |
+
|
103 |
+
def forward(self, src, mask=None, src_key_padding_mask=None, pos=None):
|
104 |
+
output = src + pos
|
105 |
+
|
106 |
+
for layer in self.layers:
|
107 |
+
output = layer(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask, pos=pos)
|
108 |
+
|
109 |
+
if self.norm is not None:
|
110 |
+
output = self.norm(output)
|
111 |
+
|
112 |
+
return output
|
113 |
+
|
114 |
+
|
115 |
+
class TransformerDecoder(nn.Module):
|
116 |
+
def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
|
117 |
+
super().__init__()
|
118 |
+
self.layers = _get_clones(decoder_layer, num_layers)
|
119 |
+
self.num_layers = num_layers
|
120 |
+
self.norm = norm
|
121 |
+
self.return_intermediate = return_intermediate
|
122 |
+
|
123 |
+
def forward(
|
124 |
+
self,
|
125 |
+
tgt,
|
126 |
+
memory,
|
127 |
+
tgt_mask=None,
|
128 |
+
memory_mask=None,
|
129 |
+
tgt_key_padding_mask=None,
|
130 |
+
memory_key_padding_mask=None,
|
131 |
+
        pos=None,
        query_pos=None,
    ):
        output = tgt + pos + query_pos

        intermediate = []

        for layer in self.layers:
            output = layer(
                output,
                memory,
                tgt_mask=tgt_mask,
                memory_mask=memory_mask,
                tgt_key_padding_mask=tgt_key_padding_mask,
                memory_key_padding_mask=memory_key_padding_mask,
                pos=pos,
                query_pos=query_pos,
            )
            if self.return_intermediate:
                intermediate.append(self.norm(output))

        if self.norm is not None:
            output = self.norm(output)
            if self.return_intermediate:
                intermediate.pop()
                intermediate.append(output)

        if self.return_intermediate:
            return torch.stack(intermediate)

        return output.unsqueeze(0)


class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos):
        return tensor if pos is None else tensor + pos

    def forward_post(self, src, src_mask=None, src_key_padding_mask=None, pos=None):
        # q = k = self.with_pos_embed(src, pos)
        src2 = self.self_attn(src, src, value=src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src

    def forward_pre(self, src, src_mask=None, src_key_padding_mask=None, pos=None):
        src2 = self.norm1(src)
        # q = k = self.with_pos_embed(src2, pos)
        src2 = self.self_attn(src2, src2, value=src2, attn_mask=src_mask, key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src2 = self.norm2(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))
        src = src + self.dropout2(src2)
        return src

    def forward(self, src, src_mask=None, src_key_padding_mask=None, pos=None):
        if self.normalize_before:
            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
        return self.forward_post(src, src_mask, src_key_padding_mask, pos)


class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos):
        return tensor if pos is None else tensor + pos

    def forward_post(
        self,
        tgt,
        memory,
        tgt_mask=None,
        memory_mask=None,
        tgt_key_padding_mask=None,
        memory_key_padding_mask=None,
        pos=None,
        query_pos=None,
    ):
        # q = k = self.with_pos_embed(tgt, query_pos)
        tgt2 = self.self_attn(tgt, tgt, value=tgt, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)
        tgt2 = self.multihead_attn(
            query=tgt, key=memory, value=memory, attn_mask=memory_mask, key_padding_mask=memory_key_padding_mask
        )[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = tgt + self.dropout3(tgt2)
        tgt = self.norm3(tgt)
        return tgt

    def forward_pre(
        self,
        tgt,
        memory,
        tgt_mask=None,
        memory_mask=None,
        tgt_key_padding_mask=None,
        memory_key_padding_mask=None,
        pos=None,
        query_pos=None,
    ):
        tgt2 = self.norm1(tgt)
        # q = k = self.with_pos_embed(tgt2, query_pos)
        tgt2 = self.self_attn(tgt2, tgt2, value=tgt2, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt2 = self.norm2(tgt)
        tgt2 = self.multihead_attn(
            query=tgt2, key=memory, value=memory, attn_mask=memory_mask, key_padding_mask=memory_key_padding_mask
        )[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt2 = self.norm3(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
        tgt = tgt + self.dropout3(tgt2)
        return tgt

    def forward(
        self,
        tgt,
        memory,
        tgt_mask=None,
        memory_mask=None,
        tgt_key_padding_mask=None,
        memory_key_padding_mask=None,
        pos=None,
        query_pos=None,
    ):
        if self.normalize_before:
            return self.forward_pre(
                tgt, memory, tgt_mask, memory_mask, tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos
            )
        return self.forward_post(
            tgt, memory, tgt_mask, memory_mask, tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos
        )
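Editorial note on the layers above: normalize_before switches each layer between post-norm (forward_post, attention then residual then LayerNorm) and pre-norm (forward_pre, LayerNorm first) residual ordering. A minimal usage sketch, not part of the commit, assuming the class is importable from core.networks.transformer:

# Usage sketch (assumption: TransformerEncoderLayer is importable from core.networks.transformer)
import torch
from core.networks.transformer import TransformerEncoderLayer

post_norm = TransformerEncoderLayer(d_model=256, nhead=8, normalize_before=False)  # post-norm variant
pre_norm = TransformerEncoderLayer(d_model=256, nhead=8, normalize_before=True)    # pre-norm variant
src = torch.randn(75, 2, 256)   # (sequence_length, batch, d_model), the layout nn.MultiheadAttention expects
out_post = post_norm(src)       # same shape as src
out_pre = pre_norm(src)         # same shape as src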
core/utils.py
ADDED
@@ -0,0 +1,228 @@
import os
import argparse
from collections import defaultdict
import logging

import numpy as np
import torch
from torch import nn
from scipy.io import loadmat

from configs.default import get_cfg_defaults


def _reset_parameters(model):
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)


def get_video_style(video_name, style_type):
    person_id, direction, emotion, level, *_ = video_name.split("_")
    if style_type == "id_dir_emo_level":
        style = "_".join([person_id, direction, emotion, level])
    elif style_type == "emotion":
        style = emotion
    else:
        raise ValueError("Unknown style type")

    return style


def get_style_video_lists(video_list, style_type):
    style2video_list = defaultdict(list)
    for video in video_list:
        style = get_video_style(video, style_type)
        style2video_list[style].append(video)

    return style2video_list


def get_face3d_clip(video_name, video_root_dir, num_frames, start_idx, dtype=torch.float32):
    """Load a clip of 3DMM expression parameters from a .mat or .txt sequence.

    Args:
        video_name (str): file name of the 3DMM sequence inside video_root_dir
        video_root_dir (str): directory containing the 3DMM files
        num_frames (int): number of frames in the clip
        start_idx: "random", "middle", or an int frame index
        dtype (torch.dtype, optional): output dtype. Defaults to torch.float32.

    Raises:
        ValueError: if the file extension is not .mat or .txt
        ValueError: if start_idx is not "random", "middle", or an int

    Returns:
        torch.Tensor: expression clip of shape (num_frames, exp_dim)
    """
    video_path = os.path.join(video_root_dir, video_name)
    if video_path[-3:] == "mat":
        face3d_all = loadmat(video_path)["coeff"]
        face3d_exp = face3d_all[:, 80:144]  # expression 3DMM range
    elif video_path[-3:] == "txt":
        face3d_exp = np.loadtxt(video_path)
    else:
        raise ValueError("Invalid 3DMM file extension")

    length = face3d_exp.shape[0]
    clip_num_frames = num_frames
    if start_idx == "random":
        clip_start_idx = np.random.randint(low=0, high=length - clip_num_frames + 1)
    elif start_idx == "middle":
        clip_start_idx = (length - clip_num_frames + 1) // 2
    elif isinstance(start_idx, int):
        clip_start_idx = start_idx
    else:
        raise ValueError(f"Invalid start_idx {start_idx}")

    face3d_clip = face3d_exp[clip_start_idx : clip_start_idx + clip_num_frames]
    face3d_clip = torch.tensor(face3d_clip, dtype=dtype)

    return face3d_clip


def get_video_style_clip(video_path, style_max_len, start_idx="random", dtype=torch.float32):
    if video_path[-3:] == "mat":
        face3d_all = loadmat(video_path)["coeff"]
        face3d_exp = face3d_all[:, 80:144]  # expression 3DMM range
    elif video_path[-3:] == "txt":
        face3d_exp = np.loadtxt(video_path)
    else:
        raise ValueError("Invalid 3DMM file extension")

    face3d_exp = torch.tensor(face3d_exp, dtype=dtype)

    length = face3d_exp.shape[0]
    if length >= style_max_len:
        clip_num_frames = style_max_len
        if start_idx == "random":
            clip_start_idx = np.random.randint(low=0, high=length - clip_num_frames + 1)
        elif start_idx == "middle":
            clip_start_idx = (length - clip_num_frames + 1) // 2
        elif isinstance(start_idx, int):
            clip_start_idx = start_idx
        else:
            raise ValueError(f"Invalid start_idx {start_idx}")

        face3d_clip = face3d_exp[clip_start_idx : clip_start_idx + clip_num_frames]
        pad_mask = torch.tensor([False] * style_max_len)
    else:
        padding = torch.zeros(style_max_len - length, face3d_exp.shape[1])
        face3d_clip = torch.cat((face3d_exp, padding), dim=0)
        pad_mask = torch.tensor([False] * length + [True] * (style_max_len - length))

    return face3d_clip, pad_mask


def get_audio_name_from_video(video_name):
    audio_name = video_name[:-4] + "_seq.json"
    return audio_name


def get_audio_window(audio, win_size):
    """Build a sliding phoneme window for each frame, padding out-of-range frames with silence (31).

    Args:
        audio (numpy.ndarray): (N,) per-frame phoneme indices

    Returns:
        audio_wins (numpy.ndarray): (N, W) with W = 2 * win_size + 1
    """
    num_frames = len(audio)
    ph_frames = []
    for rid in range(0, num_frames):
        ph = []
        for i in range(rid - win_size, rid + win_size + 1):
            if i < 0:
                ph.append(31)
            elif i >= num_frames:
                ph.append(31)
            else:
                ph.append(audio[i])

        ph_frames.append(ph)

    audio_wins = np.array(ph_frames)

    return audio_wins


def setup_config():
    parser = argparse.ArgumentParser(description="voice2pose main program")
    parser.add_argument("--config_file", default="", metavar="FILE", help="path to config file")
    parser.add_argument("--resume_from", type=str, default=None, help="the checkpoint to resume from")
    parser.add_argument("--test_only", action="store_true", help="perform testing and evaluation only")
    parser.add_argument("--demo_input", type=str, default=None, help="path to input for demo")
    parser.add_argument("--checkpoint", type=str, default=None, help="the checkpoint to test with")
    parser.add_argument("--tag", type=str, default="", help="tag for the experiment")
    parser.add_argument(
        "opts",
        help="Modify config options using the command-line",
        default=None,
        nargs=argparse.REMAINDER,
    )
    parser.add_argument(
        "--local_rank",
        type=int,
        help="local rank for DistributedDataParallel",
    )
    parser.add_argument(
        "--master_port",
        type=str,
        default="12345",
    )
    args = parser.parse_args()

    cfg = get_cfg_defaults()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    return args, cfg


def setup_logger(base_path, exp_name):
    rootLogger = logging.getLogger()
    rootLogger.setLevel(logging.INFO)

    logFormatter = logging.Formatter("%(asctime)s [%(levelname)-0.5s] %(message)s")

    log_path = "{0}/{1}.log".format(base_path, exp_name)
    fileHandler = logging.FileHandler(log_path)
    fileHandler.setFormatter(logFormatter)
    rootLogger.addHandler(fileHandler)

    consoleHandler = logging.StreamHandler()
    consoleHandler.setFormatter(logFormatter)
    rootLogger.addHandler(consoleHandler)
    rootLogger.handlers[0].setLevel(logging.ERROR)

    logging.info("log path: %s" % log_path)


def get_pose_params(mat_path):
    """Get pose parameters from mat file

    Args:
        mat_path (str): path of mat file

    Returns:
        pose_params (numpy.ndarray): shape (L_video, 9), angle, translation, crop parameters
    """
    mat_dict = loadmat(mat_path)

    np_3dmm = mat_dict["coeff"]
    angles = np_3dmm[:, 224:227]
    translations = np_3dmm[:, 254:257]

    np_trans_params = mat_dict["transform_params"]
    crop = np_trans_params[:, -3:]

    pose_params = np.concatenate((angles, translations, crop), axis=1)

    return pose_params


def obtain_seq_index(index, num_frames, radius):
    seq = list(range(index - radius, index + radius + 1))
    seq = [min(max(item, 0), num_frames - 1) for item in seq]
    return seq
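A small sketch, not part of the commit, of the windowing helpers defined above; the values follow directly from the padding and clamping logic in the code:

# Usage sketch of get_audio_window and obtain_seq_index from core/utils.py
import numpy as np
from core.utils import get_audio_window, obtain_seq_index

phonemes = np.array([5, 12, 7, 7, 31])           # per-frame phoneme indices (31 = SIL)
wins = get_audio_window(phonemes, win_size=2)    # shape (5, 5); out-of-range frames padded with 31
print(wins[0])                                   # [31 31  5 12  7]

print(obtain_seq_index(0, num_frames=5, radius=2))  # [0, 0, 0, 1, 2], indices clamped to the valid range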
demo.mp4
ADDED
Binary file (547 kB)
demo.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d29bc2d048a0be69193daea065734a3e76abbbe37e5ae4c8903d82f14ad92cb
size 227888
demo_download.mp4
ADDED
Binary file (457 kB)
demo_download.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d29bc2d048a0be69193daea065734a3e76abbbe37e5ae4c8903d82f14ad92cb
size 227888
env.yaml
ADDED
File without changes
environment.yml
ADDED
@@ -0,0 +1,91 @@
name: styletalk
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - blas=1.0=mkl
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2023.08.22=h06a4308_0
  - certifi=2020.6.20=pyhd3eb1b0_3
  - cudatoolkit=11.1.1=ha002fc5_10
  - ffmpeg=4.2.2=h20bf706_0
  - freetype=2.12.1=h4a9f257_0
  - gmp=6.2.1=h295c915_3
  - gnutls=3.6.15=he1e5248_0
  - intel-openmp=2021.4.0=h06a4308_3561
  - jpeg=9b=h024ee3a_2
  - lame=3.100=h7b6447c_0
  - libedit=3.1.20221030=h5eee18b_0
  - libffi=3.2.1=hf484d3e_1007
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libidn2=2.3.4=h5eee18b_0
  - libopus=1.3.1=h7b6447c_0
  - libpng=1.6.39=h5eee18b_0
  - libstdcxx-ng=11.2.0=h1234567_1
  - libtasn1=4.19.0=h5eee18b_0
  - libtiff=4.0.9=he6b73bb_1
  - libunistring=0.9.10=h27cfd23_0
  - libuv=1.44.2=h5eee18b_0
  - libvpx=1.7.0=h439df22_0
  - mkl=2021.4.0=h06a4308_640
  - mkl-service=2.4.0=py37h7f8727e_0
  - mkl_fft=1.3.1=py37h3e078e5_1
  - mkl_random=1.2.2=py37h51133e4_0
  - ncurses=6.4=h6a678d5_0
  - nettle=3.7.3=hbbd107a_1
  - ninja=1.10.2=h06a4308_5
  - ninja-base=1.10.2=hd09550d_5
  - numpy=1.21.5=py37h6c91a56_3
  - numpy-base=1.21.5=py37ha15fc14_3
  - olefile=0.46=py37_0
  - openh264=2.1.1=h4ff587b_0
  - openssl=1.0.2u=h7b6447c_0
  - pip=22.3.1=py37h06a4308_0
  - python=3.7.0=h6e4f718_3
  - python_abi=3.7=2_cp37m
  - pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0
  - readline=7.0=h7b6447c_5
  - setuptools=65.6.3=py37h06a4308_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.33.0=h62c20be_0
  - tk=8.6.12=h1ccaba5_0
  - torchaudio=0.8.0=py37
  - torchvision=0.9.0=py37_cu111
  - typing_extensions=4.1.1=pyh06a4308_0
  - wheel=0.38.4=py37h06a4308_0
  - x264=1!157.20191217=h7b6447c_0
  - xz=5.4.2=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
    - av==10.0.0
    - beautifulsoup4==4.12.2
    - charset-normalizer==3.3.2
    - ffmpeg-python==0.2.0
    - filelock==3.12.2
    - future==0.18.3
    - gdown==4.7.1
    - idna==3.6
    - imageio==2.18.0
    - joblib==1.3.2
    - networkx==2.6.3
    - opencv-python==4.4.0.46
    - packaging==23.2
    - pillow==9.1.0
    - pysocks==1.7.1
    - pywavelets==1.3.0
    - pyyaml==6.0
    - requests==2.31.0
    - scikit-image==0.19.3
    - scikit-learn==1.0.2
    - scipy==1.7.3
    - soupsieve==2.4.1
    - threadpoolctl==3.1.0
    - tifffile==2021.11.2
    - tqdm==4.66.1
    - urllib3==2.0.7
    - yacs==0.1.8
prefix: /home/pixis/miniconda3/envs/styletalk
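For reference, this environment can typically be recreated with "conda env create -f environment.yml"; the machine-specific prefix line at the end can be removed or overridden with -n/--name or -p/--prefix.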
generators/__pycache__/base_function.cpython-37.pyc
ADDED
Binary file (13.9 kB)
generators/__pycache__/face_model.cpython-37.pyc
ADDED
Binary file (3.97 kB)
generators/__pycache__/flow_util.cpython-37.pyc
ADDED
Binary file (1.95 kB)
generators/base_function.py
ADDED
@@ -0,0 +1,368 @@
import sys
import math

import torch
from torch import nn
from torch.nn import functional as F
from torch.autograd import Function
from torch.nn.utils.spectral_norm import spectral_norm as SpectralNorm


class LayerNorm2d(nn.Module):
    def __init__(self, n_out, affine=True):
        super(LayerNorm2d, self).__init__()
        self.n_out = n_out
        self.affine = affine

        if self.affine:
            self.weight = nn.Parameter(torch.ones(n_out, 1, 1))
            self.bias = nn.Parameter(torch.zeros(n_out, 1, 1))

    def forward(self, x):
        normalized_shape = x.size()[1:]
        if self.affine:
            return F.layer_norm(x, normalized_shape,
                                self.weight.expand(normalized_shape),
                                self.bias.expand(normalized_shape))
        else:
            return F.layer_norm(x, normalized_shape)


class ADAINHourglass(nn.Module):
    def __init__(self, image_nc, pose_nc, ngf, img_f, encoder_layers, decoder_layers, nonlinearity, use_spect):
        super(ADAINHourglass, self).__init__()
        self.encoder = ADAINEncoder(image_nc, pose_nc, ngf, img_f, encoder_layers, nonlinearity, use_spect)
        self.decoder = ADAINDecoder(pose_nc, ngf, img_f, encoder_layers, decoder_layers, True, nonlinearity, use_spect)
        self.output_nc = self.decoder.output_nc

    def forward(self, x, z):
        return self.decoder(self.encoder(x, z), z)


class ADAINEncoder(nn.Module):
    def __init__(self, image_nc, pose_nc, ngf, img_f, layers, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(ADAINEncoder, self).__init__()
        self.layers = layers
        self.input_layer = nn.Conv2d(image_nc, ngf, kernel_size=7, stride=1, padding=3)
        for i in range(layers):
            in_channels = min(ngf * (2 ** i), img_f)
            out_channels = min(ngf * (2 ** (i + 1)), img_f)
            model = ADAINEncoderBlock(in_channels, out_channels, pose_nc, nonlinearity, use_spect)
            setattr(self, 'encoder' + str(i), model)
        self.output_nc = out_channels

    def forward(self, x, z):
        out = self.input_layer(x)
        out_list = [out]
        for i in range(self.layers):
            model = getattr(self, 'encoder' + str(i))
            out = model(out, z)
            out_list.append(out)
        return out_list


class ADAINDecoder(nn.Module):
    """docstring for ADAINDecoder"""
    def __init__(self, pose_nc, ngf, img_f, encoder_layers, decoder_layers, skip_connect=True,
                 nonlinearity=nn.LeakyReLU(), use_spect=False):

        super(ADAINDecoder, self).__init__()
        self.encoder_layers = encoder_layers
        self.decoder_layers = decoder_layers
        self.skip_connect = skip_connect
        use_transpose = True

        for i in range(encoder_layers - decoder_layers, encoder_layers)[::-1]:
            in_channels = min(ngf * (2 ** (i + 1)), img_f)
            in_channels = in_channels * 2 if i != (encoder_layers - 1) and self.skip_connect else in_channels
            out_channels = min(ngf * (2 ** i), img_f)
            model = ADAINDecoderBlock(in_channels, out_channels, out_channels, pose_nc, use_transpose, nonlinearity, use_spect)
            setattr(self, 'decoder' + str(i), model)

        self.output_nc = out_channels * 2 if self.skip_connect else out_channels

    def forward(self, x, z):
        out = x.pop() if self.skip_connect else x
        for i in range(self.encoder_layers - self.decoder_layers, self.encoder_layers)[::-1]:
            model = getattr(self, 'decoder' + str(i))
            out = model(out, z)
            out = torch.cat([out, x.pop()], 1) if self.skip_connect else out
        return out


class ADAINEncoderBlock(nn.Module):
    def __init__(self, input_nc, output_nc, feature_nc, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(ADAINEncoderBlock, self).__init__()
        kwargs_down = {'kernel_size': 4, 'stride': 2, 'padding': 1}
        kwargs_fine = {'kernel_size': 3, 'stride': 1, 'padding': 1}

        self.conv_0 = spectral_norm(nn.Conv2d(input_nc, output_nc, **kwargs_down), use_spect)
        self.conv_1 = spectral_norm(nn.Conv2d(output_nc, output_nc, **kwargs_fine), use_spect)

        self.norm_0 = ADAIN(input_nc, feature_nc)
        self.norm_1 = ADAIN(output_nc, feature_nc)
        self.actvn = nonlinearity

    def forward(self, x, z):
        x = self.conv_0(self.actvn(self.norm_0(x, z)))
        x = self.conv_1(self.actvn(self.norm_1(x, z)))
        return x


class ADAINDecoderBlock(nn.Module):
    def __init__(self, input_nc, output_nc, hidden_nc, feature_nc, use_transpose=True, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(ADAINDecoderBlock, self).__init__()
        # Attributes
        self.actvn = nonlinearity
        hidden_nc = min(input_nc, output_nc) if hidden_nc is None else hidden_nc

        kwargs_fine = {'kernel_size': 3, 'stride': 1, 'padding': 1}
        if use_transpose:
            kwargs_up = {'kernel_size': 3, 'stride': 2, 'padding': 1, 'output_padding': 1}
        else:
            kwargs_up = {'kernel_size': 3, 'stride': 1, 'padding': 1}

        # create conv layers
        self.conv_0 = spectral_norm(nn.Conv2d(input_nc, hidden_nc, **kwargs_fine), use_spect)
        if use_transpose:
            self.conv_1 = spectral_norm(nn.ConvTranspose2d(hidden_nc, output_nc, **kwargs_up), use_spect)
            self.conv_s = spectral_norm(nn.ConvTranspose2d(input_nc, output_nc, **kwargs_up), use_spect)
        else:
            self.conv_1 = nn.Sequential(spectral_norm(nn.Conv2d(hidden_nc, output_nc, **kwargs_up), use_spect),
                                        nn.Upsample(scale_factor=2))
            self.conv_s = nn.Sequential(spectral_norm(nn.Conv2d(input_nc, output_nc, **kwargs_up), use_spect),
                                        nn.Upsample(scale_factor=2))
        # define normalization layers
        self.norm_0 = ADAIN(input_nc, feature_nc)
        self.norm_1 = ADAIN(hidden_nc, feature_nc)
        self.norm_s = ADAIN(input_nc, feature_nc)

    def forward(self, x, z):
        x_s = self.shortcut(x, z)
        dx = self.conv_0(self.actvn(self.norm_0(x, z)))
        dx = self.conv_1(self.actvn(self.norm_1(dx, z)))
        out = x_s + dx
        return out

    def shortcut(self, x, z):
        x_s = self.conv_s(self.actvn(self.norm_s(x, z)))
        return x_s


def spectral_norm(module, use_spect=True):
    """use a spectral norm layer to stabilize the training process"""
    if use_spect:
        return SpectralNorm(module)
    else:
        return module


class ADAIN(nn.Module):
    def __init__(self, norm_nc, feature_nc):
        super().__init__()

        self.param_free_norm = nn.InstanceNorm2d(norm_nc, affine=False)

        nhidden = 128
        use_bias = True

        self.mlp_shared = nn.Sequential(
            nn.Linear(feature_nc, nhidden, bias=use_bias),
            nn.ReLU()
        )
        self.mlp_gamma = nn.Linear(nhidden, norm_nc, bias=use_bias)
        self.mlp_beta = nn.Linear(nhidden, norm_nc, bias=use_bias)

    def forward(self, x, feature):

        # Part 1. generate parameter-free normalized activations
        normalized = self.param_free_norm(x)

        # Part 2. produce scaling and bias conditioned on feature
        feature = feature.view(feature.size(0), -1)
        actv = self.mlp_shared(feature)
        gamma = self.mlp_gamma(actv)
        beta = self.mlp_beta(actv)

        # apply scale and bias
        gamma = gamma.view(*gamma.size()[:2], 1, 1)
        beta = beta.view(*beta.size()[:2], 1, 1)
        out = normalized * (1 + gamma) + beta
        return out


class FineEncoder(nn.Module):
    """docstring for Encoder"""
    def __init__(self, image_nc, ngf, img_f, layers, norm_layer=nn.BatchNorm2d, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(FineEncoder, self).__init__()
        self.layers = layers
        self.first = FirstBlock2d(image_nc, ngf, norm_layer, nonlinearity, use_spect)
        for i in range(layers):
            in_channels = min(ngf * (2 ** i), img_f)
            out_channels = min(ngf * (2 ** (i + 1)), img_f)
            model = DownBlock2d(in_channels, out_channels, norm_layer, nonlinearity, use_spect)
            setattr(self, 'down' + str(i), model)
        self.output_nc = out_channels

    def forward(self, x):
        x = self.first(x)
        out = [x]
        for i in range(self.layers):
            model = getattr(self, 'down' + str(i))
            x = model(x)
            out.append(x)
        return out


class FineDecoder(nn.Module):
    """docstring for FineDecoder"""
    def __init__(self, image_nc, feature_nc, ngf, img_f, layers, num_block, norm_layer=nn.BatchNorm2d, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(FineDecoder, self).__init__()
        self.layers = layers
        for i in range(layers)[::-1]:
            in_channels = min(ngf * (2 ** (i + 1)), img_f)
            out_channels = min(ngf * (2 ** i), img_f)
            up = UpBlock2d(in_channels, out_channels, norm_layer, nonlinearity, use_spect)
            res = FineADAINResBlocks(num_block, in_channels, feature_nc, norm_layer, nonlinearity, use_spect)
            jump = Jump(out_channels, norm_layer, nonlinearity, use_spect)

            setattr(self, 'up' + str(i), up)
            setattr(self, 'res' + str(i), res)
            setattr(self, 'jump' + str(i), jump)

        self.final = FinalBlock2d(out_channels, image_nc, use_spect, 'tanh')

        self.output_nc = out_channels

    def forward(self, x, z):
        out = x.pop()
        for i in range(self.layers)[::-1]:
            res_model = getattr(self, 'res' + str(i))
            up_model = getattr(self, 'up' + str(i))
            jump_model = getattr(self, 'jump' + str(i))
            out = res_model(out, z)
            out = up_model(out)
            out = jump_model(x.pop()) + out
        out_image = self.final(out)
        return out_image


class FirstBlock2d(nn.Module):
    """
    Downsampling block for use in encoder.
    """
    def __init__(self, input_nc, output_nc, norm_layer=nn.BatchNorm2d, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(FirstBlock2d, self).__init__()
        kwargs = {'kernel_size': 7, 'stride': 1, 'padding': 3}
        conv = spectral_norm(nn.Conv2d(input_nc, output_nc, **kwargs), use_spect)

        if type(norm_layer) == type(None):
            self.model = nn.Sequential(conv, nonlinearity)
        else:
            self.model = nn.Sequential(conv, norm_layer(output_nc), nonlinearity)

    def forward(self, x):
        out = self.model(x)
        return out


class DownBlock2d(nn.Module):
    def __init__(self, input_nc, output_nc, norm_layer=nn.BatchNorm2d, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(DownBlock2d, self).__init__()

        kwargs = {'kernel_size': 3, 'stride': 1, 'padding': 1}
        conv = spectral_norm(nn.Conv2d(input_nc, output_nc, **kwargs), use_spect)
        pool = nn.AvgPool2d(kernel_size=(2, 2))

        if type(norm_layer) == type(None):
            self.model = nn.Sequential(conv, nonlinearity, pool)
        else:
            self.model = nn.Sequential(conv, norm_layer(output_nc), nonlinearity, pool)

    def forward(self, x):
        out = self.model(x)
        return out


class UpBlock2d(nn.Module):
    def __init__(self, input_nc, output_nc, norm_layer=nn.BatchNorm2d, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(UpBlock2d, self).__init__()
        kwargs = {'kernel_size': 3, 'stride': 1, 'padding': 1}
        conv = spectral_norm(nn.Conv2d(input_nc, output_nc, **kwargs), use_spect)
        if type(norm_layer) == type(None):
            self.model = nn.Sequential(conv, nonlinearity)
        else:
            self.model = nn.Sequential(conv, norm_layer(output_nc), nonlinearity)

    def forward(self, x):
        out = self.model(F.interpolate(x, scale_factor=2))
        return out


class FineADAINResBlocks(nn.Module):
    def __init__(self, num_block, input_nc, feature_nc, norm_layer=nn.BatchNorm2d, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(FineADAINResBlocks, self).__init__()
        self.num_block = num_block
        for i in range(num_block):
            model = FineADAINResBlock2d(input_nc, feature_nc, norm_layer, nonlinearity, use_spect)
            setattr(self, 'res' + str(i), model)

    def forward(self, x, z):
        for i in range(self.num_block):
            model = getattr(self, 'res' + str(i))
            x = model(x, z)
        return x


class Jump(nn.Module):
    def __init__(self, input_nc, norm_layer=nn.BatchNorm2d, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(Jump, self).__init__()
        kwargs = {'kernel_size': 3, 'stride': 1, 'padding': 1}
        conv = spectral_norm(nn.Conv2d(input_nc, input_nc, **kwargs), use_spect)

        if type(norm_layer) == type(None):
            self.model = nn.Sequential(conv, nonlinearity)
        else:
            self.model = nn.Sequential(conv, norm_layer(input_nc), nonlinearity)

    def forward(self, x):
        out = self.model(x)
        return out


class FineADAINResBlock2d(nn.Module):
    """
    Define a residual block for different types
    """
    def __init__(self, input_nc, feature_nc, norm_layer=nn.BatchNorm2d, nonlinearity=nn.LeakyReLU(), use_spect=False):
        super(FineADAINResBlock2d, self).__init__()

        kwargs = {'kernel_size': 3, 'stride': 1, 'padding': 1}

        self.conv1 = spectral_norm(nn.Conv2d(input_nc, input_nc, **kwargs), use_spect)
        self.conv2 = spectral_norm(nn.Conv2d(input_nc, input_nc, **kwargs), use_spect)
        self.norm1 = ADAIN(input_nc, feature_nc)
        self.norm2 = ADAIN(input_nc, feature_nc)

        self.actvn = nonlinearity

    def forward(self, x, z):
        dx = self.actvn(self.norm1(self.conv1(x), z))
        dx = self.norm2(self.conv2(x), z)
        out = dx + x
        return out


class FinalBlock2d(nn.Module):
    """
    Define the output layer
    """
    def __init__(self, input_nc, output_nc, use_spect=False, tanh_or_sigmoid='tanh'):
        super(FinalBlock2d, self).__init__()

        kwargs = {'kernel_size': 7, 'stride': 1, 'padding': 3}
        conv = spectral_norm(nn.Conv2d(input_nc, output_nc, **kwargs), use_spect)

        if tanh_or_sigmoid == 'sigmoid':
            out_nonlinearity = nn.Sigmoid()
        else:
            out_nonlinearity = nn.Tanh()

        self.model = nn.Sequential(conv, out_nonlinearity)

    def forward(self, x):
        out = self.model(x)
        return out
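A brief sketch, not part of the commit, of the ADAIN block defined above: a per-sample descriptor vector is mapped to per-channel scale and bias that modulate instance-normalized activations:

# Usage sketch of generators.base_function.ADAIN
import torch
from generators.base_function import ADAIN

adain = ADAIN(norm_nc=64, feature_nc=256)
feature_maps = torch.randn(2, 64, 32, 32)   # (B, C, H, W) activations to modulate
descriptor = torch.randn(2, 256)            # per-sample conditioning vector (e.g. a 3DMM descriptor)
out = adain(feature_maps, descriptor)       # same shape as feature_maps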
generators/face_model.py
ADDED
@@ -0,0 +1,127 @@
import functools
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

import generators.flow_util as flow_util
from generators.base_function import LayerNorm2d, ADAINHourglass, FineEncoder, FineDecoder


class FaceGenerator(nn.Module):
    def __init__(
        self,
        mapping_net,
        warpping_net,
        editing_net,
        common
    ):
        super(FaceGenerator, self).__init__()
        self.mapping_net = MappingNet(**mapping_net)
        self.warpping_net = WarpingNet(**warpping_net, **common)
        self.editing_net = EditingNet(**editing_net, **common)

    def forward(
        self,
        input_image,
        driving_source,
        stage=None
    ):
        if stage == 'warp':
            descriptor = self.mapping_net(driving_source)
            output = self.warpping_net(input_image, descriptor)
        else:
            descriptor = self.mapping_net(driving_source)
            output = self.warpping_net(input_image, descriptor)
            output['fake_image'] = self.editing_net(input_image, output['warp_image'], descriptor)
        return output


class MappingNet(nn.Module):
    def __init__(self, coeff_nc, descriptor_nc, layer):
        super(MappingNet, self).__init__()

        self.layer = layer
        nonlinearity = nn.LeakyReLU(0.1)

        self.first = nn.Sequential(
            torch.nn.Conv1d(coeff_nc, descriptor_nc, kernel_size=7, padding=0, bias=True))

        for i in range(layer):
            net = nn.Sequential(nonlinearity,
                                torch.nn.Conv1d(descriptor_nc, descriptor_nc, kernel_size=3, padding=0, dilation=3))
            setattr(self, 'encoder' + str(i), net)

        self.pooling = nn.AdaptiveAvgPool1d(1)
        self.output_nc = descriptor_nc

    def forward(self, input_3dmm):
        out = self.first(input_3dmm)
        for i in range(self.layer):
            model = getattr(self, 'encoder' + str(i))
            out = model(out) + out[:, :, 3:-3]
        out = self.pooling(out)
        return out


class WarpingNet(nn.Module):
    def __init__(
        self,
        image_nc,
        descriptor_nc,
        base_nc,
        max_nc,
        encoder_layer,
        decoder_layer,
        use_spect
    ):
        super(WarpingNet, self).__init__()

        nonlinearity = nn.LeakyReLU(0.1)
        norm_layer = functools.partial(LayerNorm2d, affine=True)
        kwargs = {'nonlinearity': nonlinearity, 'use_spect': use_spect}

        self.descriptor_nc = descriptor_nc
        self.hourglass = ADAINHourglass(image_nc, self.descriptor_nc, base_nc,
                                        max_nc, encoder_layer, decoder_layer, **kwargs)

        self.flow_out = nn.Sequential(norm_layer(self.hourglass.output_nc),
                                      nonlinearity,
                                      nn.Conv2d(self.hourglass.output_nc, 2, kernel_size=7, stride=1, padding=3))

        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, input_image, descriptor):
        final_output = {}
        output = self.hourglass(input_image, descriptor)
        final_output['flow_field'] = self.flow_out(output)

        deformation = flow_util.convert_flow_to_deformation(final_output['flow_field'])
        final_output['warp_image'] = flow_util.warp_image(input_image, deformation)
        return final_output


class EditingNet(nn.Module):
    def __init__(
        self,
        image_nc,
        descriptor_nc,
        layer,
        base_nc,
        max_nc,
        num_res_blocks,
        use_spect):
        super(EditingNet, self).__init__()

        nonlinearity = nn.LeakyReLU(0.1)
        norm_layer = functools.partial(LayerNorm2d, affine=True)
        kwargs = {'norm_layer': norm_layer, 'nonlinearity': nonlinearity, 'use_spect': use_spect}
        self.descriptor_nc = descriptor_nc

        # encoder part
        self.encoder = FineEncoder(image_nc * 2, base_nc, max_nc, layer, **kwargs)
        self.decoder = FineDecoder(image_nc, self.descriptor_nc, base_nc, max_nc, layer, num_res_blocks, **kwargs)

    def forward(self, input_image, warp_image, descriptor):
        x = torch.cat([input_image, warp_image], 1)
        x = self.encoder(x)
        gen_image = self.decoder(x, descriptor)
        return gen_image
generators/flow_util.py
ADDED
@@ -0,0 +1,56 @@
import torch


def convert_flow_to_deformation(flow):
    r"""convert flow fields to deformations.

    Args:
        flow (tensor): Flow field obtained by the model
    Returns:
        deformation (tensor): The deformation used for warping
    """
    b, c, h, w = flow.shape
    flow_norm = 2 * torch.cat([flow[:, :1, ...] / (w - 1), flow[:, 1:, ...] / (h - 1)], 1)
    grid = make_coordinate_grid(flow)
    deformation = grid + flow_norm.permute(0, 2, 3, 1)
    return deformation


def make_coordinate_grid(flow):
    r"""obtain a coordinate grid with the same size as the flow field.

    Args:
        flow (tensor): Flow field obtained by the model
    Returns:
        grid (tensor): The grid with the same size as the input flow
    """
    b, c, h, w = flow.shape

    x = torch.arange(w).to(flow)
    y = torch.arange(h).to(flow)

    x = (2 * (x / (w - 1)) - 1)
    y = (2 * (y / (h - 1)) - 1)

    yy = y.view(-1, 1).repeat(1, w)
    xx = x.view(1, -1).repeat(h, 1)

    meshed = torch.cat([xx.unsqueeze_(2), yy.unsqueeze_(2)], 2)
    meshed = meshed.expand(b, -1, -1, -1)
    return meshed


def warp_image(source_image, deformation):
    r"""warp the input image according to the deformation

    Args:
        source_image (tensor): source images to be warped
        deformation (tensor): deformations used to warp the images; value in range (-1, 1)
    Returns:
        output (tensor): the warped images
    """
    _, h_old, w_old, _ = deformation.shape
    _, _, h, w = source_image.shape
    if h_old != h or w_old != w:
        deformation = deformation.permute(0, 3, 1, 2)
        deformation = torch.nn.functional.interpolate(deformation, size=(h, w), mode='bilinear')
        deformation = deformation.permute(0, 2, 3, 1)
    return torch.nn.functional.grid_sample(source_image, deformation)
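A short sketch, not part of the commit, of the flow utilities above: a zero flow field yields the identity sampling grid, so warping approximately reproduces the input (up to grid_sample's border and align_corners behavior):

# Usage sketch of generators.flow_util
import torch
from generators import flow_util

images = torch.randn(2, 3, 64, 64)                           # (B, C, H, W)
flow = torch.zeros(2, 2, 64, 64)                             # zero flow -> deformation equals the identity grid
deformation = flow_util.convert_flow_to_deformation(flow)    # (B, H, W, 2), values in [-1, 1]
warped = flow_util.warp_image(images, deformation)           # approximately equal to images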
inference_for_demo.py
ADDED
@@ -0,0 +1,187 @@
import argparse
import cv2
import json
import os

import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
from PIL import Image

from core.networks.styletalk import StyleTalk
from core.utils import get_audio_window, get_pose_params, get_video_style_clip, obtain_seq_index
from configs.default import get_cfg_defaults


@torch.no_grad()
def get_eval_model(cfg):
    model = StyleTalk(cfg).cuda()
    content_encoder = model.content_encoder
    style_encoder = model.style_encoder
    decoder = model.decoder
    checkpoint = torch.load(cfg.INFERENCE.CHECKPOINT)
    model_state_dict = checkpoint["model_state_dict"]
    content_encoder_dict = {k[16:]: v for k, v in model_state_dict.items() if k[:16] == "content_encoder."}
    content_encoder.load_state_dict(content_encoder_dict, strict=True)
    style_encoder_dict = {k[14:]: v for k, v in model_state_dict.items() if k[:14] == "style_encoder."}
    style_encoder.load_state_dict(style_encoder_dict, strict=True)
    decoder_dict = {k[8:]: v for k, v in model_state_dict.items() if k[:8] == "decoder."}
    decoder.load_state_dict(decoder_dict, strict=True)
    model.eval()
    return content_encoder, style_encoder, decoder


@torch.no_grad()
def render_video(
    net_G, src_img_path, exp_path, wav_path, output_path, silent=False, semantic_radius=13, fps=30, split_size=64
):
    target_exp_seq = np.load(exp_path)

    frame = cv2.imread(src_img_path)
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    src_img_raw = Image.fromarray(frame)
    image_transform = transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5), inplace=True),
        ]
    )
    src_img = image_transform(src_img_raw)

    target_win_exps = []
    for frame_idx in range(len(target_exp_seq)):
        win_indices = obtain_seq_index(frame_idx, target_exp_seq.shape[0], semantic_radius)
        win_exp = torch.tensor(target_exp_seq[win_indices]).permute(1, 0)
        # (73, 27)
        target_win_exps.append(win_exp)

    target_exp_concat = torch.stack(target_win_exps, dim=0)
    target_splited_exps = torch.split(target_exp_concat, split_size, dim=0)
    output_imgs = []
    for win_exp in target_splited_exps:
        win_exp = win_exp.cuda()
        cur_src_img = src_img.expand(win_exp.shape[0], -1, -1, -1).cuda()
        output_dict = net_G(cur_src_img, win_exp)
        output_imgs.append(output_dict["fake_image"].cpu().clamp_(-1, 1))

    output_imgs = torch.cat(output_imgs, 0)
    transformed_imgs = ((output_imgs + 1) / 2 * 255).to(torch.uint8).permute(0, 2, 3, 1)

    if silent:
        torchvision.io.write_video(output_path, transformed_imgs.cpu(), fps)
    else:
        silent_video_path = "silent.mp4"
        torchvision.io.write_video(silent_video_path, transformed_imgs.cpu(), fps)
        os.system(f"ffmpeg -loglevel quiet -y -i {silent_video_path} -i {wav_path} -shortest {output_path}")
        os.remove(silent_video_path)


@torch.no_grad()
def get_netG(checkpoint_path):
    from generators.face_model import FaceGenerator
    import yaml

    with open("configs/renderer_conf.yaml", "r") as f:
        renderer_config = yaml.load(f, Loader=yaml.FullLoader)

    renderer = FaceGenerator(**renderer_config).to(torch.cuda.current_device())

    checkpoint = torch.load(checkpoint_path, map_location=lambda storage, loc: storage)
    renderer.load_state_dict(checkpoint["net_G_ema"], strict=False)

    renderer.eval()

    return renderer


@torch.no_grad()
def generate_expression_params(
    cfg, audio_path, style_clip_path, pose_path, output_path, content_encoder, style_encoder, decoder
):
    with open(audio_path, "r") as f:
        audio = json.load(f)

    audio_win = get_audio_window(audio, cfg.WIN_SIZE)
    audio_win = torch.tensor(audio_win).cuda()
    content = content_encoder(audio_win.unsqueeze(0))

    style_clip, pad_mask = get_video_style_clip(style_clip_path, style_max_len=256, start_idx=0)
    style_code = style_encoder(
        style_clip.unsqueeze(0).cuda(), pad_mask.unsqueeze(0).cuda() if pad_mask is not None else None
    )

    gen_exp_stack = decoder(content, style_code)
    gen_exp = gen_exp_stack[0].cpu().numpy()

    pose_ext = pose_path[-3:]
    pose = None
    if pose_ext == "npy":
        pose = np.load(pose_path)
    elif pose_ext == "mat":
        pose = get_pose_params(pose_path)
        # (L, 9)

    selected_pose = None
    if len(pose) >= len(gen_exp):
        selected_pose = pose[: len(gen_exp)]
    else:
        selected_pose = pose[-1].unsqueeze(0).repeat(len(gen_exp), 1)
        selected_pose[: len(pose)] = pose

    gen_exp_pose = np.concatenate((gen_exp, selected_pose), axis=1)
    np.save(output_path, gen_exp_pose)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="inference for demo")
    parser.add_argument(
        "--styletalk_checkpoint",
        type=str,
        default="checkpoints/styletalk_checkpoint.pth",
        help="the checkpoint to test with",
    )
    parser.add_argument(
        "--renderer_checkpoint",
        type=str,
        default="checkpoints/renderer_checkpoint.pt",
        help="renderer checkpoint",
    )
    parser.add_argument("--audio_path", type=str, default="", help="path for phoneme")
    parser.add_argument("--style_clip_path", type=str, default="", help="path for style_clip_mat")
    parser.add_argument("--pose_path", type=str, default="", help="path for pose")
    parser.add_argument("--src_img_path", type=str, default="test_images/KristiNoem1_0.jpg")
    parser.add_argument("--wav_path", type=str, default="demo/data/KristiNoem_front_neutral_level1_002.wav")
    parser.add_argument("--output_path", type=str, default="demo_output.npy", help="path for output")
    args = parser.parse_args()

    cfg = get_cfg_defaults()
    cfg.INFERENCE.CHECKPOINT = args.styletalk_checkpoint
    cfg.freeze()
    print(f"checkpoint: {cfg.INFERENCE.CHECKPOINT}")

    # load checkpoint
    with torch.no_grad():
        content_encoder, style_encoder, decoder = get_eval_model(cfg)
        exp_param_path = f"{args.output_path[:-4]}.npy"
        generate_expression_params(
            cfg,
            args.audio_path,
            args.style_clip_path,
            args.pose_path,
            exp_param_path,
            content_encoder,
            style_encoder,
            decoder,
        )

        image_renderer = get_netG(args.renderer_checkpoint)
        render_video(
            image_renderer,
            args.src_img_path,
            exp_param_path,
            args.wav_path,
            args.output_path,
            split_size=4,
        )
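For reference, the script first writes the generated expression-plus-pose parameters to <output>.npy and then renders and muxes the video with ffmpeg. An illustrative invocation using the flags defined above (bracketed paths are placeholders; the same sample .mat is reused for style and pose only for illustration):

python inference_for_demo.py \
    --audio_path <phoneme_seq.json> \
    --style_clip_path samples/source_video/3DMM/Obama_clip1.mat \
    --pose_path samples/source_video/3DMM/Obama_clip1.mat \
    --src_img_path <source_image.jpg> \
    --wav_path <speech.wav> \
    --output_path demo_output.mp4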
media/first_page.png
ADDED
Git LFS Details
phindex.json
ADDED
@@ -0,0 +1 @@
{"AA": 0, "AE": 1, "AH": 2, "AO": 3, "AW": 4, "AY": 5, "B": 6, "CH": 7, "D": 8, "DH": 9, "EH": 10, "ER": 11, "EY": 12, "F": 13, "G": 14, "HH": 15, "IH": 16, "IY": 17, "JH": 18, "K": 19, "L": 20, "M": 21, "N": 22, "NG": 23, "NSN": 24, "OW": 25, "OY": 26, "P": 27, "R": 28, "S": 29, "SH": 30, "SIL": 31, "T": 32, "TH": 33, "UH": 34, "UW": 35, "V": 36, "W": 37, "Y": 38, "Z": 39, "ZH": 40}
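Note on the mapping above: it assigns ARPAbet phoneme labels to integer ids; the "SIL" entry (31) matches the silence padding value used by get_audio_window in core/utils.py. A minimal sketch, not part of the commit:

# Usage sketch: load the phoneme-to-id table
import json

with open("phindex.json") as f:
    phoneme_to_id = json.load(f)
print(phoneme_to_id["SIL"])  # 31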
requirements.txt
ADDED
@@ -0,0 +1,11 @@
yacs==0.1.8
scipy==1.7.3
scikit-image==0.19.3
scikit-learn==1.0.2
PyYAML==6.0
Pillow==9.1.0
numpy==1.21.5
opencv-python==4.4.0.46
imageio==2.18.0
ffmpeg-python==0.2.0
av==10.0.0
samples/source_video/3DMM/KristiNoem.mat
ADDED
Binary file (566 kB)
samples/source_video/3DMM/Obama_clip1.mat
ADDED
Binary file (629 kB)
samples/source_video/3DMM/Obama_clip2.mat
ADDED
Binary file (943 kB)
samples/source_video/3DMM/Obama_clip3.mat
ADDED
Binary file (629 kB)