recitation-segmenter-v2: Quranic Recitation Segmenter
This model is a fine-tuned version of facebook/w2v-bert-2.0 for segmenting Holy Quran recitations based on pause points (waqf). It was presented in the paper Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning.
Project Page: https://obadx.github.io/prepare-quran-dataset/
GitHub Repository: https://github.com/obadx/recitations-segmenter
It achieves the following results on the evaluation set:
- Accuracy: 0.9958
- F1: 0.9964
- Loss: 0.0132
- Precision: 0.9976
- Recall: 0.9951
Model description
The `recitation-segmenter-v2` model is an enhanced AI model capable of segmenting Holy Quran recitations based on pause points (waqf) with high accuracy. It is built upon a fine-tuned Wav2Vec2Bert model, performing sequence frame-level classification with a 20-millisecond resolution. This model and its accompanying Python library are designed for high-performance processing of any number and length of Quranic recitations, from a few seconds to several hours, without performance degradation.
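For intuition about what the 20-millisecond frame resolution means in practice, the sketch below shows one way per-frame speech/silence labels can be converted back into sample-level intervals at 16 kHz. The function name `frames_to_intervals` and the binary labeling convention are illustrative assumptions, not the library's internal API.

```python
import torch

FRAME_MS = 20          # frame-level classification resolution
SAMPLE_RATE = 16_000   # expected input sample rate

def frames_to_intervals(frame_labels: torch.Tensor) -> list[list[int]]:
    """Convert per-frame labels (1 = speech, 0 = silence) into
    [start, end] intervals expressed in samples. Illustrative only."""
    samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples per 20 ms frame
    intervals, start = [], None
    for i, label in enumerate(frame_labels.tolist()):
        if label == 1 and start is None:
            start = i
        elif label == 0 and start is not None:
            intervals.append([start * samples_per_frame, i * samples_per_frame])
            start = None
    if start is not None:
        intervals.append([start * samples_per_frame, len(frame_labels) * samples_per_frame])
    return intervals

# Example: 5 silence frames, 10 speech frames, 5 silence frames
labels = torch.tensor([0] * 5 + [1] * 10 + [0] * 5)
print(frames_to_intervals(labels))  # [[1600, 4800]]
```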
Key Features:
- Segments Quranic recitations according to waqf (pause rules).
- Specifically trained for Quranic recitations.
- High accuracy, with 20-millisecond precision.
- Requires only ~3 GB of GPU memory.
- Capable of processing recitations of any duration without performance loss.
The model is part of a larger effort described in the associated paper, aiming to bridge gaps in assessing spoken language for the Holy Quran. This includes an automated pipeline to produce high-quality Quranic datasets and a novel ASR-based approach for pronunciation error detection using a custom Quran Phonetic Script (QPS).
Intended uses & limitations
This model is primarily intended for:
- Automatic segmentation of Holy Quran recitations for educational purposes or content analysis.
- Building high-quality Quranic audio databases.
- As a foundational component for larger systems focused on pronunciation error detection and correction for Quran learners.
Limitations:
- The segmenter currently considers sakt (a very short pause without breath) as a full waqf (stop), which may be a limitation for advanced Tajweed analysis.
- The model is specifically trained and optimized for Quranic recitations and might not generalize well to other forms of spoken Arabic.
Training and evaluation data
The model was fine-tuned on a meticulously collected dataset of Quranic recitations. The data collection process, described in the associated paper, involved a 98% automated pipeline including collection from expert reciters, segmentation at pause points (waqf) using a fine-tuned wav2vec2-BERT model, transcription of segments, and transcript verification via a novel Tasmeea algorithm. The dataset comprises over 850 hours of audio (~300K annotated utterances).
The data preparation involved:
- Downloading Quranic recitations and converting them to the Hugging Face Audio Dataset format at a 16000 Hz sample rate.
- Pre-segmenting verses based on pauses using silero-vad-v4 from everyayah.com.
- Applying post-processing (e.g., `min_silence_duration_ms`, `min_speech_duration_ms`, `pad_duration_ms`) to refine segments, along with manual verification for high-quality divisions.
- Applying data augmentation techniques, including time stretching (speeding up/slowing down 40% of recitations) and various audio effects (Aliasing, AddGaussianNoise, BandPassFilter, PitchShift, RoomSimulator, etc.) using the `audiomentations` library (see the sketch after this list).
- Normalizing audio segments to 16000 Hz and chunking them, with a maximum length of 20 seconds, using a sliding window approach for longer segments.
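As a rough illustration of the augmentation step, a pipeline built with the `audiomentations` library might look like the sketch below. The specific transforms, parameter ranges, and probabilities are assumptions for demonstration, not the published recipe used to build the training set.

```python
import numpy as np
from audiomentations import AddGaussianNoise, BandPassFilter, Compose, PitchShift, TimeStretch

# Illustrative augmentation chain; parameter ranges and probabilities are
# assumptions, not the exact values used for recitation-segmenter-v2.
augment = Compose([
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.4),                   # speed up / slow down
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.3),
    BandPassFilter(p=0.2),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.2),
])

# Stand-in for one second of mono 16 kHz audio.
samples = np.random.uniform(-0.5, 0.5, 16_000).astype(np.float32)
augmented = augment(samples=samples, sample_rate=16_000)
```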
The training dataset and its augmented version are available on Hugging Face.
Usage
You can use this model with its accompanying Python library, `recitations-segmenter`, which integrates with Hugging Face `transformers`.
First, ensure `ffmpeg` and `libsndfile` are installed system-wide.
Requirements
Install `ffmpeg` and `libsndfile` system-wide.
Linux
```bash
sudo apt-get update
sudo apt-get install -y ffmpeg libsndfile1 portaudio19-dev
```
Windows & Mac
You can create an `anaconda` environment and then install these libraries:
```bash
conda create -n segment python=3.12
conda activate segment
conda install -c conda-forge ffmpeg libsndfile
```
Via pip
```bash
pip install recitations-segmenter
```
Sample usage (Python API)
Here's a complete example of using the library in Python. A Google Colab example is also available.
```python
from pathlib import Path

from recitations_segmenter import segment_recitations, read_audio, clean_speech_intervals
from transformers import AutoFeatureExtractor, AutoModelForAudioFrameClassification
import torch

if __name__ == '__main__':
    device = torch.device('cuda')
    dtype = torch.bfloat16

    processor = AutoFeatureExtractor.from_pretrained(
        "obadx/recitation-segmenter-v2")
    model = AutoModelForAudioFrameClassification.from_pretrained(
        "obadx/recitation-segmenter-v2",
    )
    model.to(device, dtype=dtype)

    # Change these to the file paths of your Holy Quran recitations.
    file_paths = [
        './assets/dussary_002282.mp3',
        './assets/hussary_053001.mp3',
    ]

    waves = [read_audio(p) for p in file_paths]

    # Extract speech intervals in samples at a 16000 Hz sample rate.
    sampled_outputs = segment_recitations(
        waves,
        model,
        processor,
        device=device,
        dtype=dtype,
        batch_size=8,
    )

    for out, path in zip(sampled_outputs, file_paths):
        # Clean the speech intervals by:
        # * merging short silence durations
        # * removing short speech durations
        # * adding padding to each speech interval
        # Raises:
        # * NoSpeechIntervals: if the wave is complete silence
        # * TooHighMinSpeechDruation: if `min_speech_duration` is too high,
        #   which results in deleting all speech intervals
        clean_out = clean_speech_intervals(
            out.speech_intervals,
            out.is_complete,
            min_silence_duration_ms=30,
            min_speech_duration_ms=30,
            pad_duration_ms=30,
            return_seconds=True,
        )

        print(f'Speech Intervals of: {Path(path).name}: ')
        print(clean_out.clean_speech_intervals)
        print(f'Is Recitation Complete: {clean_out.is_complete}')
        print('-' * 40)
```
Training procedure
The model was trained using `Wav2Vec2BertForAudioFrameClassification` from the `transformers` library. More detailed motivations, methodology, and setup can be found in the GitHub repository's "تفاصيل التدريب" (Training Details) section.
Training hyperparameters
The following hyperparameters were used during training (a `TrainingArguments` sketch mirroring them follows the list):
- learning_rate: 5e-05
- train_batch_size: 50
- eval_batch_size: 64
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1
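As a minimal sketch, these values map onto the standard `transformers` `TrainingArguments` roughly as shown below; the output directory and the surrounding Trainer/dataset wiring are placeholders, not the author's training setup.

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above; output_dir and the
# Trainer / dataset wiring are placeholders, not the published training script.
training_args = TrainingArguments(
    output_dir="./recitation-segmenter-v2",
    learning_rate=5e-5,
    per_device_train_batch_size=50,
    per_device_eval_batch_size=64,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="constant",
    warmup_ratio=0.2,
    num_train_epochs=1,
)
```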
Training results
| Training Loss | Epoch  | Step | Accuracy | F1     | Validation Loss | Precision | Recall |
|---------------|--------|------|----------|--------|-----------------|-----------|--------|
| 0.0701        | 0.2507 | 275  | 0.9953   | 0.9959 | 0.0249          | 0.9947    | 0.9971 |
| 0.0234        | 0.5014 | 550  | 0.9953   | 0.9959 | 0.0185          | 0.9940    | 0.9977 |
| 0.0186        | 0.7521 | 825  | 0.9958   | 0.9964 | 0.0132          | 0.9976    | 0.9951 |
Framework versions
- Transformers 4.51.3
- Pytorch 2.2.1+cu121
- Datasets 3.5.0
- Tokenizers 0.21.1
Citation
If you find our work helpful or inspiring, please feel free to cite it.
```bibtex
@article{ibrahim2025automatic,
  title={Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning},
  author={Abdullah Abdelfattah and Mahmoud I. Khalil and Hazem M. Abbas},
  journal={arXiv preprint arXiv:2509.00094},
  year={2025}
}
```