🎙️ Tawasul STT V0 (supports all Arabic dialects, with a focus on the Egyptian dialect, arz)

This model transcribes Arabic speech with punctuation support. It is the "large" version of the FastConformer Transducer-CTC model (around 115M parameters) and is trained with two losses: Transducer (the default) and CTC. See the Model Architecture section and the NeMo documentation for complete architecture details. The output is Arabic text without diacritical marks; the supported punctuation marks are the period, the Arabic comma, and the Arabic question mark.

This model is ready for commercial and non-commercial use. ✅

🏗️ Model Architecture

FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with a hybrid Transducer (RNNT) decoder and a Connectionist Temporal Classification (CTC) loss. More details on FastConformer are available here: Fast-Conformer Model.
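To make the 8x downsampling concrete, the arithmetic below estimates how many encoder frames a clip produces. It is a rough sketch that assumes a 10 ms feature hop in front of the encoder, which is a common preprocessor setting but is not stated in this card; `encoder_frames` is a hypothetical helper, and real frame counts may differ slightly because of padding.

```python
FEATURE_HOP_MS = 10      # assumption: 10 ms feature hop, not confirmed by this card
SUBSAMPLING_FACTOR = 8   # from the card: 8x convolutional downsampling


def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def encoder_frames(duration_ms: int) -> int:
    """Approximate number of encoder output frames for a clip of this length."""
    feature_frames = ceil_div(duration_ms, FEATURE_HOP_MS)
    return ceil_div(feature_frames, SUBSAMPLING_FACTOR)


if __name__ == "__main__":
    # Under these assumptions each encoder frame covers about 80 ms of audio.
    print(encoder_frames(10_000))
```

Under these assumptions, a 10-second clip is reduced from 1000 feature frames to 125 encoder frames, which is where most of the speed-up over a standard Conformer comes from.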

The model uses a Google SentencePiece tokenizer [2] with a vocabulary size of 1024.

📥 Input

  • Input Type: Audio
  • Input Format(s): .wav files
  • Other Properties Related to Input: 16000 Hz Mono-channel Audio, Pre-Processing Not Needed
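The input requirements above can be checked before transcription. The sketch below uses only the Python standard library; `check_wav` is a hypothetical helper, and the `wave` module reads plain PCM WAV files only, so other encodings would need a different reader.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # from the card: 16000 Hz
EXPECTED_CHANNELS = 1         # from the card: mono-channel audio


def check_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file matches the card."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"found {channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

Files that fail the check would typically be resampled and downmixed (for example with ffmpeg) before being passed to the model.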

📤 Output

This model provides transcribed speech as a string for a given audio sample.

  • Output Type: Text
  • Output Format: String
  • Output Parameters: One Dimensional (1D)
  • Other Properties Related to Output: May Need Inverse Text Normalization; Does Not Handle Special Characters; Outputs text in Arabic without diacritical marks
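Because the output carries no diacritical marks, reference transcripts usually need the same normalization before they are compared with it. A minimal sketch, assuming the marks of interest are the common harakat (U+064B to U+0652) plus the superscript alef (U+0670); `strip_diacritics` is a hypothetical helper and is not part of NeMo.

```python
import re

# Assumption: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670)
# cover the diacritics that need removing; extend the class if your data differs.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritical marks, leaving base letters and punctuation."""
    return _DIACRITICS.sub("", text)


if __name__ == "__main__":
    # A fully vocalized word is reduced to its bare letters.
    print(strip_diacritics("\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627"))
```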

⚠️ Limitations

  • The model is non-streaming and outputs the speech as a string without diacritical marks.
  • Not recommended for word-for-word transcription or punctuation, as accuracy varies with the characteristics of the input audio (unrecognized words, accent, noise, speech type, and context of speech).

🚀 How to download and use the model

🔧 Installations

$ apt-get update && apt-get install -y libsndfile1 ffmpeg
$ pip -q install soundfile librosa sentencepiece Cython packaging
$ pip -q install "nemo_toolkit[asr]"

📥 Download the model

$ curl -L -o path/to/tawasul_egy_stt.nemo https://huggingface.co/TawasulAI/tawasul-egy-stt/resolve/main/tawasul_egy_stt_wer0.3543.nemo

🐍 Imports and usage

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from(
    "path/to/tawasul_egy_stt.nemo",
)

🎯 Transcribing using Python

Simply do:

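predictions = asr_model.transcribe(['sample_audio_to_transcribe.wav'])
# transcribe() returns one result per input file: Hypothesis objects with a
# .text attribute in recent NeMo releases, plain strings in older ones.
print(getattr(predictions[0], "text", predictions[0]))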

You can also pass more than one audio file for batch inference:

asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav'])
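The shape of the value returned by `transcribe` has varied between NeMo releases, so downstream code may want to normalize it. The sketch below is a hypothetical helper (`extract_texts` is not a NeMo function) that assumes three possible shapes: a list of strings, a list of objects with a `.text` attribute, or a tuple whose first element is one of those lists.

```python
def extract_texts(result):
    """Flatten a transcribe() result into a list of plain strings."""
    # Assumption: some older transducer models return (best, all_hypotheses).
    if isinstance(result, tuple):
        result = result[0]
    # Items are either plain strings or objects exposing a .text attribute.
    return [item if isinstance(item, str) else item.text for item in result]
```

For example, `extract_texts(asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav']))` would yield one string per input file, in the same order as the inputs.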

📊 Training and Testing Datasets

🏋️ Training Datasets

The base model was trained by NVIDIA on a composite dataset of around 760 hours of Arabic speech.

It was then further fine-tuned on around 100 hours of private Egyptian-dialect Arabic speech.

  • The Egyptian data used in this second training stage is private, and there is no intention to open-source it.

🧪 Test Benchmark datasets

| Test Set | Num Dialects | Test (h) |
|---|---|---|
| SADA | 10 | 10.7 |
| Common Voice 18.0 | 25 | 12.6 |
| MASC (Clean-Test) | 7 | 10.5 |
| MASC (Noisy-Test) | 8 | 14.9 |
| MGB-2 | Unspecified | 9.6 |
| Casablanca | 8 | 7.7 |

📈 Test Benchmark results

  • CommonVoice
    • WER:
    • CER:
  • MASC
    • Clean
      • WER:
      • CER:
    • Noisy
      • WER:
      • CER:
  • MGB-2
    • WER:
    • CER:
  • Casablanca
    • WER:
    • CER:
  • SADA
    • WER:
    • CER:
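WER and CER are both edit-distance ratios: the number of word-level or character-level substitutions, insertions, and deletions divided by the length of the reference. The sketch below shows the basic computation in plain Python. It assumes non-empty references and applies no text normalization; the benchmarks listed above may normalize text differently, so scores from this sketch are not directly comparable with published numbers.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes the reference contains at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes the reference is not empty."""
    return edit_distance(reference, hypothesis) / len(reference)
```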

🔧 Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

⚙️ Runtime Engine

  • NeMo 2.0.0

🖥️ Preferred Operating System

  • Linux

🔍 Explainability

  • High-Level Application and Domain: Automatic Speech Recognition
    • Describe how this model works: The model transcribes audio input into text for the Arabic language
  • Verified to have met prescribed quality standards: Yes
  • Performance Metrics: Word Error Rate (WER), Character Error Rate (CER), Real-Time Factor
  • Potential Known Risks: Transcripts may not be 100% accurate. Accuracy varies with the characteristics of the input audio (domain, use case, accent, noise, speech type, context of speech, etc.).
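Of the metrics listed above, the real-time factor is the simplest to measure locally. The sketch below uses the common definition of processing time divided by audio duration, so values below 1.0 mean faster than real time; some reports use the inverse ratio instead. `real_time_factor` is a hypothetical helper, and `transcribe_fn` stands for any callable that accepts a list of file paths, such as the model's `transcribe` method.

```python
import time


def real_time_factor(transcribe_fn, audio_path: str, audio_seconds: float) -> float:
    """Wall-clock processing time divided by the duration of the audio."""
    start = time.perf_counter()
    transcribe_fn([audio_path])
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

A single measurement includes one-off costs such as model warm-up, so averaging over several files after a warm-up call gives a more representative figure.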

⚖️ Bias

  • Was the model trained with a specific accent? The model was trained on general Arabic dialects and then further fine-tuned on the Egyptian dialect (arz)
  • Have any special measures been taken to mitigate unwanted bias? No

🔒 Safety & Security

Use Case Restrictions:

  • Non-streaming ASR model
  • Model outputs text in Arabic without diacritical marks
  • Output text requires Inverse Text Normalization
  • The model is noise-sensitive
  • The model is further fine-tuned on the Egyptian dialect

📄 License

Use of this model is covered by the CC-BY-4.0 license. By downloading the public release of the model, you accept the terms and conditions of the CC-BY-4.0 license.

📚 References

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[2] Google Sentencepiece Tokenizer

[3] NVIDIA NeMo Toolkit

[4] Open Universal Arabic ASR Leaderboard
