🎙️ Tawasul STT V0 (Supports all Arabic Dialects with more focus on Egyptian Arz dialect)

This model transcribes Arabic speech with punctuation support. It is the "large" variant of the FastConformer hybrid Transducer-CTC model (around 115M parameters) and is trained with two losses: Transducer (default) and CTC. See the Model Architecture section and the NeMo documentation for complete architecture details. The model outputs Arabic text without diacritical marks and supports periods, Arabic commas and Arabic question marks.

This model is ready for commercial and non-commercial use. ✅

🏗️ Model Architecture

FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with hybrid Transducer decoder (RNNT) and Connectionist Temporal Classification (CTC) loss. You may find more information on the details of FastConformer here: Fast-Conformer Model.

The model uses a Google SentencePiece tokenizer [2] with a vocabulary size of 1024.

📥 Input

Input Type: Audio
Input Format(s): .wav files
Other Properties Related to Input: 16000 Hz Mono-channel Audio, Pre-Processing Not Needed
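
Since the model expects 16 kHz mono .wav input, it can help to verify a file before transcribing it. The following is a minimal sketch using only Python's standard `wave` module. The helper name `check_wav_format` is hypothetical and not part of NeMo, and it only handles uncompressed PCM WAV files.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, as stated in the input requirements
EXPECTED_CHANNELS = 1         # mono


def check_wav_format(path):
    """Return (ok, sample_rate, channels) for a PCM .wav file.

    Hypothetical helper: `ok` is True only when the file is 16 kHz mono.
    """
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    ok = sample_rate == EXPECTED_SAMPLE_RATE and channels == EXPECTED_CHANNELS
    return ok, sample_rate, channels
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.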

📤 Output

This model provides transcribed speech as a string for a given audio sample.

Output Type: Text
Output Format: String
Output Parameters: One Dimensional (1D)
Other Properties Related to Output: May Need Inverse Text Normalization; Does Not Handle Special Characters; Outputs text in Arabic without diacritical marks
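
Because the model outputs Arabic text without diacritical marks, reference transcripts that contain them are usually stripped before comparison. A minimal sketch, assuming the diacritics of interest are the Arabic harakat in U+064B to U+0652 plus the superscript alef U+0670. The function name `strip_arabic_diacritics` is illustrative, and this is not necessarily the normalization used to train or evaluate the model.

```python
import re

# Assumption: harakat (fathatan through sukun) and the superscript alef.
_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_arabic_diacritics(text):
    """Remove Arabic diacritical marks, leaving letters and punctuation intact."""
    return _ARABIC_DIACRITICS.sub("", text)
```

Arabic punctuation such as the comma (U+060C) and question mark (U+061F) falls outside this range and is left untouched.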

⚠️ Limitations

The model is non-streaming and outputs the speech as a string without diacritical marks.
The model is not recommended for word-for-word transcription or exact punctuation, since accuracy varies with the characteristics of the input audio (unrecognized words, accent, noise, speech type, and context of speech).

🚀 How to download and use the model

🔧 Installations

$ apt-get update && apt-get install -y libsndfile1 ffmpeg
$ pip -q install soundfile librosa sentencepiece Cython packaging
$ pip -q install "nemo_toolkit[asr]"

📥 Download the model

$ curl -L -o path/to/tawasul_egy_stt.nemo https://huggingface.co/TawasulAI/tawasul-egy-stt/resolve/main/tawasul_egy_stt_wer0.3543.nemo

🐍 Imports and usage

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from(
    "path/to/tawasul_egy_stt.nemo",
)
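
Because the checkpoint is a hybrid Transducer-CTC model with the Transducer decoder as the default, decoding can be switched to the CTC head after loading. A minimal sketch, assuming the restored model is NeMo's hybrid RNNT-CTC class, which exposes `change_decoding_strategy(decoder_type=...)`. The wrapper `select_decoder` is a hypothetical convenience and not part of NeMo.

```python
def select_decoder(asr_model, decoder_type="ctc"):
    """Switch a hybrid model between its "rnnt" and "ctc" decoders."""
    if decoder_type not in ("rnnt", "ctc"):
        raise ValueError(f"decoder_type must be 'rnnt' or 'ctc', got {decoder_type!r}")
    # Assumption: asr_model is a NeMo hybrid Transducer-CTC model.
    asr_model.change_decoding_strategy(decoder_type=decoder_type)
    return asr_model
```

For example, `asr_model = select_decoder(asr_model, "ctc")` would select the CTC head, and passing `"rnnt"` would switch back to the default.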

🎯 Transcribing using Python

Simply do:

prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])
# transcribe() returns one result per input file, so index the first one
print(prediction[0].text)

You can also pass more than one audio file for batch inference:

asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav'])
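
For a whole directory of recordings, the batch call above can be wrapped in a small helper. A minimal sketch: `transcribe_directory` is a hypothetical name, `batch_size` is assumed to be accepted by `transcribe()` as in current NeMo releases, and the result handling covers tuples, plain strings and objects with a `.text` attribute, since the return type differs across NeMo versions.

```python
from pathlib import Path


def transcribe_directory(asr_model, audio_dir, batch_size=4):
    """Transcribe every .wav file in audio_dir and return {file name: text}."""
    paths = sorted(str(p) for p in Path(audio_dir).glob("*.wav"))
    if not paths:
        return {}
    results = asr_model.transcribe(paths, batch_size=batch_size)
    # Some NeMo versions return (best_hypotheses, all_hypotheses) for Transducer models.
    if isinstance(results, tuple):
        results = results[0]
    # Entries may be plain strings or Hypothesis-like objects with a .text field.
    texts = [r if isinstance(r, str) else r.text for r in results]
    return {Path(p).name: t for p, t in zip(paths, texts)}
```

A smaller `batch_size` lowers peak memory use at the cost of throughput.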

📊 Training, and Testing Datasets

🏋️ Training Datasets

The base model was trained by NVIDIA on a composite dataset comprising around 760 hours of Arabic speech:

Massive Arabic Speech Corpus (MASC) [690h]
- Data Collection Method: Automated
- Labeling Method: Automated
Mozilla Common Voice 17.0 Arabic [65h]
- Data Collection Method: by Human
- Labeling Method: by Human
Google Fleurs Arabic [5h]
- Data Collection Method: by Human
- Labeling Method: by Human

The model was then further fine-tuned on around 100 hours of private Egyptian-dialect Arabic speech.

This second-stage Egyptian training data is private, and there is no intention to open-source it.

🧪 Test Benchmark datasets

Test Set	Num. Dialects	Duration (h)
SADA	10	10.7
Common Voice 18.0	25	12.6
MASC (Clean-Test)	7	10.5
MASC (Noisy-Test)	8	14.9
MGB-2	Unspecified	9.6
Casablanca	8	7.7

📈 Test Benchmark results

CommonVoice
- WER:
- CER:
MASC
- Clean
  - WER:
  - CER:
- Noisy
  - WER:
  - CER:
MGB-2
- WER:
- CER:
Casablanca
- WER:
- CER:
SADA
- WER:
- CER:
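
WER and CER are both edit-distance ratios: the number of substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. A minimal pure-Python sketch for illustration. The function names are hypothetical, and benchmark scores typically apply text normalization first, which is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words and gives 0.5.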

🔧 Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Jetson
NVIDIA Hopper
NVIDIA Lovelace
NVIDIA Pascal
NVIDIA Turing
NVIDIA Volta

⚙️ Runtime Engine

NeMo 2.0.0

🖥️ Preferred Operating System

Linux

🔍 Explainability

High-Level Application and Domain: Automatic Speech Recognition
- Describe how this model works: The model transcribes audio input into text for the Arabic language
Verified to have met prescribed quality standards: Yes
Performance Metrics: Word Error Rate (WER), Character Error Rate (CER), Real-Time Factor
Potential Known Risks: Transcripts may not be 100% accurate. Accuracy varies based on the characteristics of the input audio (domain, use case, accent, noise, speech type, context of speech, etc.).

⚖️ Bias

Was the model trained with a specific accent? The model was trained on a general mix of Arabic dialects and then further fine-tuned on the Egyptian dialect (Arz).
Have any special measures been taken to mitigate unwanted bias? No

🔒 Safety & Security

Use Case Restrictions:

Non-streaming ASR model
Model outputs text in Arabic without diacritical marks
Output text requires Inverse Text Normalization
The model is noise-sensitive
The model is further fine-tuned on the Egyptian dialect

📄 License

Use of this model is covered by the CC-BY-4.0 license. By downloading the public release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.

📚 References

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[2] Google Sentencepiece Tokenizer

[3] NVIDIA NeMo Toolkit

[4] Open Universal Arabic ASR Leaderboard