---
license: apache-2.0
language:
- es
- eu
library_name: nemo
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- Conformer
- NeMo
- pytorch
- Transformer
model-index:
- name: stt_eu_conformer_transducer_large
  results:
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Mozilla Common Voice 18.1 EU
      type: mozilla-foundation/common_voice_18_1
      config: eu
      split: test
      args:
        language: eu
    metrics:
    - name: Test WER
      type: wer
      value: 7.22
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Mozilla Common Voice 18.1 ES
      type: mozilla-foundation/common_voice_18_1
      config: es
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: 14.52
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Basque Parliament EU
      type: gttsehu/basque_parliament_1
      config: eu
      split: test
      args:
        language: eu
    metrics:
    - name: Test WER
      type: wer
      value: 3.8
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Basque Parliament ES
      type: gttsehu/basque_parliament_1
      config: es
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: 2.18
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Basque Parliament BI
      type: gttsehu/basque_parliament_1
      config: bi
      split: test
      args:
        language: bi
    metrics:
    - name: Test WER
      type: wer
      value: 4.51
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Multilingual LibriSpeech ES
      type: mls
      config: es
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: 7.84
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Facebook VoxPopuli ES
      type: facebook/voxpopuli
      config: es
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: 10.29
---

# HiTZ/Aholab's Bilingual Basque Spanish Speech-to-Text model Conformer-Transducer for IBERSPEECH 2024's BBS-S2TC

## Model Description

| [![Model architecture](https://img.shields.io/badge/Model_Arch-Conformer--Transducer-lightgrey#model-badge)](#model-architecture) | [![Model size](https://img.shields.io/badge/Params-119M-lightgrey#model-badge)](#model-architecture) | [![Language](https://img.shields.io/badge/Language-eu-lightgrey#model-badge)](#datasets) | [![Language](https://img.shields.io/badge/Language-es-lightgrey#model-badge)](#datasets) |

This model was specifically designed for a submission to the [BBS-S2TC (Bilingual Basque Spanish Speech to Text Challenge)](https://www.isca-archive.org/iberspeech_2024/herranz24_iberspeech.html), part of the Albayzin evaluation challenges at IBERSPEECH 2024. Training was tuned for good performance on the challenge's evaluation splits, so performance on other splits is worse.

This model transcribes speech in the lowercase Spanish alphabet, including spaces, and was trained on a composite dataset comprising 1462 hours of Spanish and Basque speech. It was fine-tuned from the pre-trained Basque [stt_eu_conformer_transducer_large](https://huggingface.co/HiTZ/stt_eu_conformer_transducer_large) model using the [Nvidia NeMo](https://github.com/NVIDIA/NeMo) toolkit. It is an autoregressive "large" variant of Conformer, with around 119 million parameters. See the [model architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.
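The Usage section below shows how to obtain the checkpoint with `git clone`; alternatively, the `.nemo` file can be fetched programmatically. A minimal sketch using the `huggingface_hub` library (the filename is assumed to match the checkpoint referenced in the Usage section):

```python
from huggingface_hub import hf_hub_download

# Download the .nemo checkpoint from the Hugging Face Hub and get its local path
NEMO_MODEL_FILEPATH = hf_hub_download(
    repo_id="HiTZ/BBS-S2TC_conformer_transducer_large",
    filename="BBS-S2TC_conformer_transducer_large.nemo",  # assumed checkpoint name
)
```

The returned path can be passed directly to `restore_from` in the transcription example below.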
## Usage

To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you have installed the latest PyTorch version.

```bash
pip install nemo_toolkit['all']
```

### Transcribing using Python

Clone the repository to download the model:

```bash
git clone https://huggingface.co/HiTZ/BBS-S2TC_conformer_transducer_large
```

Given that `NEMO_MODEL_FILEPATH` is the path to the downloaded `BBS-S2TC_conformer_transducer_large.nemo` file:

```python
import nemo.collections.asr as nemo_asr

# Load the model
asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(NEMO_MODEL_FILEPATH)

# Create a list pointing to the audio files
audio = ["audio_1.wav", "audio_2.wav", ..., "audio_n.wav"]

# Set the batch size to whatever number suits your purpose
batch_size = 8

# Transcribe the audio files
transcriptions = asr_model.transcribe(audio=audio, batch_size=batch_size)

# Visualize the transcriptions
print(transcriptions)
```

## Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
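If your recordings use a different sample rate or channel count, convert them before transcription. A minimal sketch assuming `torchaudio` is installed (any equivalent resampling tool works just as well):

```python
import torchaudio

def to_model_input(in_path: str, out_path: str) -> None:
    """Convert an audio file to the 16 kHz mono WAV format the model expects."""
    waveform, sample_rate = torchaudio.load(in_path)
    # Downmix multi-channel audio to a single channel
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # Resample to 16 kHz if needed
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    torchaudio.save(out_path, waveform, 16000)

to_model_input("recording_44k_stereo.wav", "audio_1.wav")
```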
## Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

The Conformer-Transducer model is an autoregressive variant of the Conformer model [1] for Automatic Speech Recognition which uses Transducer loss/decoding instead of CTC loss. More details on this model can be found here: [Conformer-Transducer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer).

## Training

### Data preparation

This model has been trained on the bilingual dataset [basque_parliament](https://huggingface.co/datasets/gttsehu/basque_parliament_1), comprising 1462 hours of Spanish and Basque speech from the Basque Parliament's sessions.

### Training procedure

This model was trained starting from the pre-trained Basque model [stt_eu_conformer_transducer_large](https://huggingface.co/HiTZ/stt_eu_conformer_transducer_large) over several hundred epochs on a GPU device, using the NeMo toolkit [3]. The tokenizer for this model was built using the text transcripts of the train dataset with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py), with a total of 128 Spanish and Basque language tokens.

## Performance

The performance of the ASR model is reported in terms of Word Error Rate (WER%) with greedy decoding in the following table.

| Tokenizer | Vocabulary Size | MCV 18.1 Test ES | MCV 18.1 Test EU | Basque Parliament Test ES | Basque Parliament Test EU | Basque Parliament Test BI | MLS Test ES | VoxPopuli ES | Train Dataset |
|-----------------------|-----------------|------------------|------------------|---------------------------|---------------------------|---------------------------|-------------|--------------|----------------------------|
| SentencePiece Unigram | 128 | 14.52 | 7.22 | 2.18 | 3.80 | 4.51 | 7.84 | 10.29 | Basque Parliament (1462 h) |
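WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal self-contained sketch for scoring this model's lowercase, punctuation-free transcriptions against reference texts (for large test sets, a dedicated library such as `jiwer` is more convenient):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)

# One substitution in a four-word reference -> WER = 0.25
print(word_error_rate("buenos días a todos", "buenos días a todas"))
```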
## Limitations

Since this model was trained on almost exclusively publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

# Additional Information

## Author

HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.

## Copyright

Copyright (c) 2024 HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.

## Licensing Information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Funding

This project, with reference 2022/TL22/00215335, has been partially funded by the Ministerio de Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia – funded by the European Union – NextGenerationEU through [ILENIA](https://proyectoilenia.es/), and by the [IkerGaitu](https://www.hitz.eus/iker-gaitu/) project funded by the Basque Government. This model was trained at [Hyperion](https://scc.dipc.org/docs/systems/hyperion/overview/), one of the high-performance computing (HPC) systems hosted by the DIPC Supercomputing Center.

## References

- [1] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
- [2] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)
- [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including those regarding the use of Artificial Intelligence. In no event shall the owner and creator of the models (HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU) be liable for any results arising from the use made by third parties of these models.