Model Description
Whisper to Oliver FP16 is the half-precision version of our specialized fine-tuned Whisper model, optimized for real-world conversational audio with challenging acoustic conditions. This FP16 version offers faster inference and reduced memory usage while maintaining excellent transcription quality.
Key Features
- Enhanced Performance on Poor Quality Audio: Fine-tuned on 170,000 conversational audio samples with minor to poor audio quality
- Phone Call Optimized: Specifically trained on short conversational segments typical of phone calls
- Turbo Performance: Inherits the speed advantages of whisper-large-v3-turbo
- Enterprise Ready: Developed by Olib AI for business applications
- FP16 Optimized: Half-precision format for roughly 2x faster inference and about 50% lower memory usage
- Production Ready: Balances speed and accuracy for deployment
Training Details
- Base Model: openai/whisper-large-v3-turbo
- Training Dataset: 170,000 conversational audio samples
- Audio Characteristics: Minor to poor quality recordings
- Focus: Short conversational segments typical of phone interactions
- Precision: FP16 (converted from FP32 original)
- Developer: Olib AI - Building AI Services for Businesses
Usage
Using the Transformers Library
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available; otherwise fall back to CPU in FP32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "olib-ai/whisper-to-oliver-fp16"

# Load the model weights and move them to the selected device
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe audio
result = pipe("audio.mp3")
print(result["text"])
Advanced Usage with Parameters
# For better results with phone calls or poor quality audio
result = pipe(
    "phone_call.mp3",
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
)
print(result["text"])
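With `return_timestamps=True`, the pipeline result also carries a `chunks` list alongside `text`, where each entry holds a `(start, end)` timestamp tuple in seconds and the text of that chunk. The sketch below shows one way to turn that list into readable lines. `format_chunks` is a hypothetical helper, and the sample dictionary is made-up data shaped like the pipeline output, not a real transcription.

```python
def format_chunks(result):
    """Render timestamped chunks as '[start -> end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end timestamp can be None when the last segment is cut off.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s -> {end_label}] {chunk['text'].strip()}")
    return lines


# Made-up sample shaped like the output of pipe(..., return_timestamps=True)
sample = {
    "text": " Hello, thanks for calling. How can I help?",
    "chunks": [
        {"timestamp": (0.0, 1.8), "text": " Hello, thanks for calling."},
        {"timestamp": (1.8, None), "text": " How can I help?"},
    ],
}

for line in format_chunks(sample):
    print(line)
```

In practice, `format_chunks(result)` could be called directly on the `result` returned by the pipeline call above.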
FP16 vs FP32 Comparison
| Metric | FP32 Version | FP16 Version |
|---|---|---|
| Model Size | ~1.5GB | ~760MB |
| Inference Speed | 1x | ~2x faster |
| Memory Usage | 1x | ~50% less |
| Accuracy | Baseline | ~99.9% retained |
For maximum accuracy, consider using our FP32 version.
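The roughly halved size and memory figures follow from storage width: FP32 stores each parameter in 4 bytes and FP16 in 2. The sketch below works through that arithmetic. `weight_size_bytes` is a hypothetical helper and the parameter count is an arbitrary illustrative figure, not a measurement of this model. It covers weights only, so activations and runtime buffers are not included.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_bytes(num_params, precision):
    """Estimate the storage needed for model weights alone."""
    return num_params * BYTES_PER_PARAM[precision]


# Illustrative parameter count (an assumption, not this model's actual size)
example_params = 100_000_000

fp32 = weight_size_bytes(example_params, "fp32")
fp16 = weight_size_bytes(example_params, "fp16")

print(f"FP32: {fp32 / 1e6:.0f} MB")
print(f"FP16: {fp16 / 1e6:.0f} MB")
print(f"FP16 / FP32: {fp16 / fp32:.0%}")
```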
Performance
Whisper to Oliver shows significant improvements over the base model when dealing with:
- Phone call recordings
- Low-quality microphone inputs
- Conversational speech with background noise
- Short dialogue segments
Intended Use
This model is designed for:
- Customer service call transcription
- Meeting transcription with variable audio quality
- Voice assistant applications
- Real-time conversation analysis
- Accessibility applications for hearing-impaired users
- Edge deployment where memory and speed are critical
Limitations and Ethical Considerations
Following the ethical guidelines of the base Whisper model:
- Should not be used to transcribe recordings without consent
- Not recommended for "subjective classification" tasks
- Should undergo robust evaluation before deployment in high-risk contexts
- May show performance variations across different languages and demographics
- FP16 precision may slightly change transcriptions on edge cases compared with FP32
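For the evaluation recommended above, a common transcription metric is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version. `word_error_rate` is a hypothetical helper, the lowercase-and-split normalization is a simplifying assumption, and the sentences are invented examples.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one of four reference words is dropped
print(word_error_rate("please hold the line", "please hold line"))
```

Running the same recordings through the FP16 and FP32 versions and comparing each against human-checked transcripts is one way to see whether the precision difference matters for a given deployment.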
License
This model is released under the MIT License, allowing for commercial and non-commercial use with proper attribution.
Citation
If you use this model in your research or applications, please cite both our work and the original Whisper paper:
@misc{whisper-to-oliver,
author = {{Olib AI}},
title = {Whisper to Oliver: Fine-tuned Whisper for Real-World Conversational Audio},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/olib-ai/whisper-to-oliver-fp16}},
}
@misc{radford2022whisper,
doi = {10.48550/ARXIV.2212.04356},
url = {https://arxiv.org/abs/2212.04356},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
About Olib AI
Olib AI specializes in building AI services for businesses. Our team focuses on creating practical AI solutions that solve real-world problems.
Contact Us:
- Website: www.olib.ai
- Akram H. Sharkar: [email protected]
- Maya M. Sharkar: [email protected]
- GitHub: https://github.com/Olib-AI