add initial fine-tuned Vietnamese (vi) model
- README.md +112 -0
 - audio-test/t1_0001-00010.wav +0 -0
 - audio-test/t1_utt000000042.wav +0 -0
 - audio-test/t2_0000006682.wav +0 -0
 - config.json +78 -0
 - preprocessor_config.json +9 -0
 - pytorch_model.bin +3 -0
 - special_tokens_map.json +1 -0
 - tokenizer_config.json +1 -0
 - vocab.json +1 -0
 
    	
README.md ADDED
@@ -0,0 +1,112 @@
---
language: en
datasets:
- librispeech_asr
tags:
- audio
- automatic-speech-recognition
license: apache-2.0
widget:
- label: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- label: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
---

# Wav2Vec2-Base-960h

[Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)

The base model pretrained and fine-tuned on 960 hours of Librispeech 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz.

[Paper](https://arxiv.org/abs/2006.11477)

Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

**Abstract**

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.

The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

# Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import soundfile as sf
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# define function to read in sound file
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# load dummy dataset and read sound files
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# tokenize the first two audio samples (batch size 2)
input_values = processor(ds["speech"][:2], return_tensors="pt", padding="longest").input_values

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```

## Evaluation

This code snippet shows how to evaluate **facebook/wav2vec2-base-960h** on LibriSpeech's "clean" and "other" test data.

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import soundfile as sf
import torch
from jiwer import wer


librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

def map_to_pred(batch):
    input_values = processor(batch["speech"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])

print("WER:", wer(result["text"], result["transcription"]))
```

*Result (WER)*:

| "clean" | "other" |
|---|---|
| 3.4 | 8.6 |
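The two snippets above are carried over from the facebook/wav2vec2-base-960h model card, while the weights added in this commit are a Vietnamese fine-tune and the `audio-test/` folder listed below ships three Vietnamese test clips. Below is a hedged sketch (not part of the model card) of running the same pipeline on one of those clips from a local clone of this repo, assuming the WAVs are already 16 kHz mono:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch

# "./" is a local clone of this repo; the hub repo id is not stated in this diff
processor = Wav2Vec2Processor.from_pretrained("./")
model = Wav2Vec2ForCTC.from_pretrained("./")

# one of the bundled Vietnamese test clips (assumed to be 16 kHz mono)
speech, sampling_rate = sf.read("audio-test/t1_0001-00010.wav")
assert sampling_rate == 16000

input_values = processor(speech, sampling_rate=16000, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```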
    	
audio-test/t1_0001-00010.wav ADDED
Binary file (120 kB)
    	
audio-test/t1_utt000000042.wav ADDED
Binary file (76.8 kB)
    	
audio-test/t2_0000006682.wav ADDED
Binary file (49.6 kB)
    	
config.json ADDED
@@ -0,0 +1,78 @@
{
  "_name_or_path": "/content/vaw2tmp/model-bin/finetune/base/checkpoint-146596",
  "activation_dropout": 0.1,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForCTC"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "codevector_dim": 256,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": false,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": false,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "group",
  "feat_proj_dropout": 0.1,
  "feat_quantizer_dropout": 0.0,
  "final_dropout": 0.1,
  "gradient_checkpointing": true,
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.1,
  "mask_feature_length": 10,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_prob": 0.05,
  "model_type": "wav2vec2",
  "num_attention_heads": 12,
  "num_codevector_groups": 2,
  "num_codevectors_per_group": 320,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 12,
  "num_negatives": 100,
  "pad_token_id": 109,
  "proj_codevector_dim": 256,
  "torch_dtype": "float32",
  "transformers_version": "4.9.2",
  "vocab_size": 110
}
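The config pins `vocab_size` to 110 and `pad_token_id` to 109, which should line up with the `<pad>` entry in vocab.json further down. A small, hypothetical sanity check (my addition, not part of the commit), run from a local clone of this repo:

```python
import json
from transformers import Wav2Vec2Config

# load the config and character vocabulary from a local clone of this repo
config = Wav2Vec2Config.from_pretrained("./")
with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

# the CTC output layer must match the character vocabulary,
# and <pad> (id 109) doubles as the CTC blank token
assert config.vocab_size == len(vocab) == 110
assert config.pad_token_id == vocab["<pad>"] == 109
```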
    	
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
{
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}
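The feature extractor is configured for 16 kHz mono input (`sampling_rate: 16000`) and does not resample on its own. A hedged sketch of preparing audio recorded at another rate; librosa and the file name are assumptions, and any resampler that yields 16 kHz mono would do:

```python
import librosa
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("./")  # local clone of this repo

# librosa.load resamples to the requested rate and downmixes to mono
speech, _ = librosa.load("my_recording.wav", sr=16000, mono=True)
input_values = processor(speech, sampling_rate=16000, return_tensors="pt").input_values
```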
    	
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2529ffc057c90b3d576ea109999ea49f3bab9691f7a93297a91d69ce8c981f84
size 377906903
    	
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
    	
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "<pad>", "do_lower_case": false, "word_delimiter_token": "|", "tokenizer_class": "Wav2Vec2CTCTokenizer"}
    	
vocab.json ADDED
@@ -0,0 +1 @@
{"ẻ": 0, "6": 1, "ụ": 2, "í": 3, "3": 4, "ỹ": 5, "ý": 6, "ẩ": 7, "ở": 8, "ề": 9, "õ": 10, "7": 11, "ê": 12, "ứ": 13, "ỏ": 14, "v": 15, "ỷ": 16, "a": 17, "l": 18, "ự": 19, "q": 20, "ờ": 21, "j": 22, "ố": 23, "à": 24, "ỗ": 25, "n": 26, "é": 27, "ủ": 28, "у": 29, "ô": 30, "u": 31, "y": 32, "ằ": 33, "4": 34, "w": 35, "b": 36, "ệ": 37, "ễ": 38, "s": 39, "ì": 40, "ầ": 41, "ỵ": 42, "8": 43, "d": 44, "ể": 45, "r": 47, "ũ": 48, "c": 49, "ạ": 50, "9": 51, "ế": 52, "ù": 53, "ỡ": 54, "2": 55, "t": 56, "i": 57, "g": 58, "́": 59, "ử": 60, "̀": 61, "á": 62, "0": 63, "ậ": 64, "e": 65, "ộ": 66, "m": 67, "ẳ": 68, "ợ": 69, "ĩ": 70, "h": 71, "â": 72, "ú": 73, "ọ": 74, "ồ": 75, "ặ": 76, "f": 77, "ữ": 78, "ắ": 79, "ỳ": 80, "x": 81, "ó": 82, "ã": 83, "ổ": 84, "ị": 85, "̣": 86, "z": 87, "ả": 88, "đ": 89, "è": 90, "ừ": 91, "ò": 92, "ẵ": 93, "1": 94, "ơ": 95, "k": 96, "ẫ": 97, "p": 98, "ấ": 99, "ẽ": 100, "ỉ": 101, "ớ": 102, "ẹ": 103, "ă": 104, "o": 105, "ư": 106, "5": 107, "|": 46, "<unk>": 108, "<pad>": 109}
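The vocabulary is character-level Vietnamese (accented letters, digits, "|" as the word delimiter, plus "<unk>" and "<pad>"). A hypothetical sketch (not part of the commit) of loading it together with tokenizer_config.json and special_tokens_map.json from a local clone of this repo:

```python
from transformers import Wav2Vec2CTCTokenizer

# from_pretrained picks up vocab.json, tokenizer_config.json and special_tokens_map.json
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("./")

# spaces are encoded as the "|" delimiter and restored on decode
ids = tokenizer("xin chào").input_ids
print(tokenizer.decode(ids))
```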