---
language:
- en
license: mit
tags:
- whisper
- automatic-speech-recognition
- speech
- audio
- transcription
- phone-calls
- conversational
pipeline_tag: automatic-speech-recognition
---

<div align="center">
<img src="https://olib.ai/logo.png" alt="Olib AI Logo" width="200"/>

# Whisper to Oliver

**Fine-tuned Whisper for Real-World Conversational Audio**

[Model on Hugging Face](https://huggingface.co/olib-ai/whisper-to-oliver)
[License: MIT](https://opensource.org/licenses/MIT)
[Olib AI](https://www.olib.ai)
</div>

## 🎯 Model Description

**Whisper to Oliver** is a fine-tuned version of OpenAI's `whisper-large-v3-turbo` model, optimized for real-world conversational audio recorded under challenging acoustic conditions. It is designed for transcribing phone calls and conversations where audio quality may be compromised.

### ✨ Key Features

- 🎙️ **Enhanced Performance on Poor Quality Audio**: Fine-tuned on 170K conversational audio samples ranging from slightly degraded to poor quality
- 📞 **Phone Call Optimized**: Trained specifically on short conversational segments typical of phone calls
- 🚀 **Turbo Performance**: Inherits the speed advantages of `whisper-large-v3-turbo`
- 💼 **Enterprise Ready**: Developed by [Olib AI](https://www.olib.ai) for business applications
- 🔧 **FP32 Precision**: Weights are released in full precision (FP32)

## 📊 Training Details

- **Base Model**: [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
- **Training Dataset**: 170,000 conversational audio samples
- **Audio Characteristics**: Recordings ranging from slightly degraded to poor quality
- **Focus**: Short conversational segments typical of phone interactions
- **Developer**: [Olib AI](https://www.olib.ai) - Building AI Services for Businesses

## 🚀 Usage

### Using the Transformers Library

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available; otherwise run on CPU in FP32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "olib-ai/whisper-to-oliver"

# Note: the checkpoint is stored in FP32; torch_dtype casts the weights at load time
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Build an ASR pipeline from the model, tokenizer, and feature extractor
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe audio
result = pipe("audio.mp3")
print(result["text"])
```
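
The dtype selection in the snippet above trades precision for memory. As a rough, back-of-the-envelope sketch of what that means for the weights alone, the code below multiplies a parameter count by the bytes used per parameter. The figure of 809M parameters is the size OpenAI publishes for the base `whisper-large-v3-turbo` model and is assumed here (not confirmed by this card) to carry over to the fine-tuned checkpoint. `estimate_weight_memory_gb` is a hypothetical helper, and the estimate ignores activations and decoding buffers.

```python
# Assumption: parameter count published for the base whisper-large-v3-turbo model
ASSUMED_PARAM_COUNT = 809_000_000

# Storage size of a single parameter for each dtype
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def estimate_weight_memory_gb(param_count: int, dtype: str) -> float:
    """Return the approximate weight memory in gigabytes (10**9 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16"):
    gb = estimate_weight_memory_gb(ASSUMED_PARAM_COUNT, dtype)
    print(f"{dtype}: ~{gb:.1f} GB of weights")
```

Under these assumptions, loading in FP16 roughly halves the weight footprint (about 1.6 GB versus 3.2 GB), which is why the example only falls back to FP32 on CPU.
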

### Advanced Usage with Parameters

```python
# For better results with phone calls or poor quality audio
# (reuses the `pipe` object created in the previous example)
result = pipe(
    "phone_call.mp3",
    chunk_length_s=30,       # split long recordings into 30-second chunks
    batch_size=16,           # number of chunks decoded per batch
    return_timestamps=True,  # also return segment-level timestamps
)
print(result["text"])
```
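
With `return_timestamps=True`, the Transformers ASR pipeline also returns a `chunks` list in which each entry holds a `text` string and a `timestamp` tuple of start and end times in seconds. The sketch below turns such a result into timestamped lines. `format_timestamp` and `format_chunks` are hypothetical helpers, the sample `result` dictionary is hand-written rather than real model output, and the code assumes an end time may be missing (`None`), as can happen for the final segment.

```python
def format_timestamp(seconds):
    """Format seconds as MM:SS.ss, or '??:??' when the value is missing."""
    if seconds is None:
        return "??:??"
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def format_chunks(result):
    """Build one '[start -> end] text' line per chunk in a pipeline result."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text}")
    return lines


# Hand-written sample in the pipeline's output shape (not real model output)
result = {
    "text": " Hello, thanks for calling. How can I help you today?",
    "chunks": [
        {"timestamp": (0.0, 2.4), "text": " Hello, thanks for calling."},
        {"timestamp": (2.4, None), "text": " How can I help you today?"},
    ],
}

for line in format_chunks(result):
    print(line)
```

To use it with real output, pass the dictionary returned by `pipe(...)` in place of the sample.
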

## 📈 Performance

Whisper to Oliver shows significant improvements over the base model when dealing with:
- 📞 Phone call recordings
- 🎙️ Low-quality microphone inputs
- 🔊 Conversational speech with background noise
- 💬 Short dialogue segments
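
This card does not publish benchmark figures, so one way to check these claims on your own data is to compute the word error rate (WER) of both the base and the fine-tuned model against reference transcripts. The sketch below is a minimal pure-Python WER based on word-level edit distance. `word_error_rate` is a hypothetical helper, the normalization (lowercasing and whitespace splitting) is a simplifying assumption, and the example strings are invented for illustration rather than taken from model output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Invented example strings, for illustration only
reference = "please hold while I transfer your call"
hypothesis_a = "please hold while I transfer your call"
hypothesis_b = "please hold why I transfer the call"

print(word_error_rate(reference, hypothesis_a))
print(word_error_rate(reference, hypothesis_b))
```

In practice you would transcribe the same held-out recordings with `openai/whisper-large-v3-turbo` and with this model, then compare the averaged scores. Libraries such as `jiwer` offer more thorough text normalization than this sketch.
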

## 🎯 Intended Use

This model is designed for:
- Customer service call transcription
- Meeting transcription with variable audio quality
- Voice assistant applications
- Real-time conversation analysis
- Accessibility applications for hearing-impaired users

## ⚠️ Limitations and Ethical Considerations

Following the ethical guidelines of the base Whisper model, this model:
- Should not be used to transcribe recordings without consent
- Is not recommended for "subjective classification" tasks
- Should undergo robust evaluation before deployment in high-risk contexts
- May show performance variations across different languages and demographics

## 📄 License

This model is released under the **MIT License**, which allows commercial and non-commercial use with proper attribution.

## 📚 Citation

If you use this model in your research or applications, please cite both our work and the original Whisper paper:

```bibtex
@misc{whisper-to-oliver,
  author = {{Olib AI}},
  title = {Whisper to Oliver: Fine-tuned Whisper for Real-World Conversational Audio},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/olib-ai/whisper-to-oliver}},
}

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

## 👥 About Olib AI

[Olib AI](https://www.olib.ai) specializes in building AI services for businesses. Our team focuses on creating practical AI solutions that solve real-world problems.

**Contact Us:**
- 🌐 Website: [www.olib.ai](https://www.olib.ai)
- 📧 Akram H. Sharkar: [email protected]
- 📧 Maya M. Sharkar: [email protected]
- 💻 GitHub: [https://github.com/Olib-AI](https://github.com/Olib-AI)

---

<div align="center">
<strong>Built with ❤️ by Olib AI</strong>
</div>