Whisper_bsc_large_v3_cat

Model Description
Intended Uses and Limitations
How to Get Started with the Model
Training Details
Citation
Additional Information

Model Description

"whisper-bsc-large-v3-cat" is an acoustic model for Automatic Speech Recognition (ASR) in Catalan. It is the result of fine-tuning the model "openai/whisper-large-v3" on 4,700 hours of Catalan data released by the Projecte AINA from Barcelona, Spain.

Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan. It is intended to transcribe Catalan audio files into plain text without punctuation.
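
Because the output carries no punctuation, reference transcripts usually need similar treatment before they are compared with it. The following is a minimal sketch of such a cleanup, not the normalizer used by the authors. The helper name strip_punctuation is hypothetical, lowercasing is an assumption that can be switched off, and apostrophes, hyphens, and the Catalan middle dot (as in "col·legi") are kept when they sit inside a word.

```python
import re
import unicodedata

# Symbols kept when they appear inside a word: apostrophe, hyphen, middle dot.
_KEEP_INSIDE_WORDS = "'-·"


def strip_punctuation(text: str, lowercase: bool = True) -> str:
    """Hypothetical helper: remove punctuation so text resembles unpunctuated ASR output."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("’", "'")  # unify typographic apostrophes
    if lowercase:  # assumption: case is ignored when comparing transcripts
        text = text.lower()
    cleaned = []
    for ch in text:
        if ch.isalnum() or ch.isspace() or ch in _KEEP_INSIDE_WORDS:
            cleaned.append(ch)
        else:
            cleaned.append(" ")  # any other punctuation becomes a space
    # Drop kept symbols that are not attached to a word on both sides.
    out = re.sub(r"(?<!\w)['\-·]+|['\-·]+(?!\w)", " ", "".join(cleaned))
    return " ".join(out.split())  # squeeze repeated whitespace


print(strip_punctuation("Bon dia, col·legues! L'Anna se'n va; anem-hi."))
# bon dia col·legues l'anna se'n va anem-hi
```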

Installation

To use this model, install the datasets and transformers packages:

Create a virtual environment:

python -m venv /path/to/venv

Activate the environment:

source /path/to/venv/bin/activate

Install the modules:

pip install datasets transformers
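
To confirm that the installation worked, one option is to check from inside the activated environment that both packages can be found. This is a small sketch using only the standard library; the function name missing_packages is illustrative.

```python
from importlib import util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["datasets", "transformers"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```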

For Inference

To transcribe audio in Catalan using this model, you can follow this example:

#Install prerequisites (run these commands in a shell, not in Python)
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer

#This code requires a CUDA-capable GPU.

#Note (November 2024): load_metric is no longer part of datasets,
#so the WER is computed with evaluate's load instead.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

#Load the processor and model.
MODEL_NAME="BSC-LT/whisper-bsc-large-v3-cat"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

#Load the dataset
from datasets import load_dataset, Audio
ds=load_dataset("projecte-aina/parlament_parla",split='test')

#Downsample to 16 kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

#Process the dataset
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch['normalized_text'])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    
    return batch
    
#Do the evaluation
result = ds.map(map_to_pred)

#Compute the overall WER now.
from evaluate import load

wer = load("wer")
WER=100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
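
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between reference and prediction, divided by the number of reference words. The sketch below illustrates that definition in plain Python. It is not the implementation behind evaluate's "wer" metric, and the function names are hypothetical. Errors are summed over all sentence pairs before dividing, which is an assumption about how a corpus-level score is aggregated.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, predictions):
    """WER in percent: total word errors divided by total reference words."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / total_words


# One deleted word out of four reference words.
print(corpus_wer(["bon dia a tothom"], ["bon dia tothom"]))  # 25.0
```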

Training Details

Training data

The specific datasets used to create the model are:

3CatParla (soon to be published)
commonvoice_benchmark_catalan_accents
corts_valencianes (Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version.)
parlament_parla_v3 (Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version.)
IB3 (soon to be published)

Training procedure

This model is the result of fine-tuning the model "openai/whisper-large-v3" by following a tutorial provided by the Language Technologies Laboratory (soon to be published).

Training Hyperparameters

language: Catalan
hours of training audio: 4700
learning rate: 1e-04
sample rate: 16000
train batch size: 16 (x4 GPUs)
eval batch size: 16
num_train_epochs: 10
weight_decay: 1e-4
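
The train batch size entry reads as a per-GPU value, so the number of samples consumed per optimizer step is larger. A short sketch of that arithmetic follows, assuming plain data parallelism and no gradient accumulation (the card does not mention any). The class and field names are illustrative and do not come from the training script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    per_device_batch_size: int = 16       # "train batch size" listed above
    num_gpus: int = 4                     # "(x4 GPUs)" listed above
    gradient_accumulation_steps: int = 1  # assumption: not stated in the card

    @property
    def global_batch_size(self) -> int:
        """Samples seen per optimizer step across all devices."""
        return (self.per_device_batch_size
                * self.num_gpus
                * self.gradient_accumulation_steps)


setup = TrainingSetup()
print(setup.global_batch_size)  # 64
```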

Citation

If this model contributes to your research, please cite the work:

@misc{takanori2025whisperbsclarge3cat,
      title={Acoustic Model in Catalan: Whisper_bsc_large_v3_cat},
      author={Sanchez Shiromizu, Lucas Takanori and Hernandez Mena, Carlos Daniel and Messaoudi, Abir and España i Bonet, Cristina and Cortada Garcia, Marti},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/langtech-veu/whisper-bsc-large-v3-cat},
      year={2025}
}

Additional Information

Author

The fine-tuning was carried out in spring 2025 at the Language Technologies Laboratory of the Barcelona Supercomputing Center by Lucas Takanori Sanchez Shiromizu.

Contact

For further information, please send an email to [email protected].

Copyright

License

Apache-2.0

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337.

The training of the model was made possible by the computing time provided by the Barcelona Supercomputing Center through MareNostrum 5.

Model size: 2B params
Tensor type: F16 (Safetensors)

Evaluation results

Self-reported word error rate (WER, %) per test set:

3CatParla (Test): 4.801
CV Benchmark Catalan Accents (Balearic fem): 5.314
CV Benchmark Catalan Accents (Balearic male): 4.310
CV Benchmark Catalan Accents (Central fem): 3.294
CV Benchmark Catalan Accents (Central male): 3.602
CV Benchmark Catalan Accents (Northern fem): 3.189
CV Benchmark Catalan Accents (Northern male): 3.378
CV Benchmark Catalan Accents (Northwestern fem): 3.217
CV Benchmark Catalan Accents (Northwestern male): 3.949
CV Benchmark Catalan Accents (Valencian fem): 3.581
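
To compare accent groups at a glance, the self-reported figures in this section can be loaded into a dictionary and summarised. This sketch uses only the nine CV Benchmark Catalan Accents scores shown here, and the list may be incomplete, so the mean is an unweighted average over those entries and not an official result. The function name summarise is illustrative.

```python
from statistics import mean

# Self-reported WER (%) on CV Benchmark Catalan Accents, copied from this card.
accent_wer = {
    ("Balearic", "fem"): 5.314,
    ("Balearic", "male"): 4.310,
    ("Central", "fem"): 3.294,
    ("Central", "male"): 3.602,
    ("Northern", "fem"): 3.189,
    ("Northern", "male"): 3.378,
    ("Northwestern", "fem"): 3.217,
    ("Northwestern", "male"): 3.949,
    ("Valencian", "fem"): 3.581,
}


def summarise(scores):
    """Return the lowest-WER key, the highest-WER key, and the unweighted mean."""
    best = min(scores, key=scores.get)
    worst = max(scores, key=scores.get)
    return {"best": best, "worst": worst, "mean": round(mean(scores.values()), 3)}


summary = summarise(accent_wer)
print(summary)
```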
