Speech Recognition AI: Fine-Tuned Whisper and Wav2Vec2 for Real-Time Audio
This project fine-tunes OpenAI's Whisper (whisper-small) and Facebook's Wav2Vec2 (wav2vec2-base-960h) models for real-time speech recognition using live audio recordings. It's designed for dynamic environments where low-latency transcription is key, such as live conversations or streaming audio.
Model Description
Fine-tuned Whisper and Wav2Vec2 models for real-time speech recognition on live audio.
Features
- Real-time audio recording: Captures live 16kHz mono audio via microphone input (see the recording sketch after this list).
- Continuous fine-tuning: Updates model weights incrementally during live sessions.
- Speech-to-text transcription: Converts audio to text with high accuracy.
- Model saving/loading: Automatically saves fine-tuned models with timestamps.
- Dual model support: Choose between Whisper and Wav2Vec2 architectures.
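The recording step can be sketched roughly as follows, assuming the sounddevice package listed in the requirements. record_clip is a hypothetical helper used in the sketches below; the project's actual recording code lives in dataset.py.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000  # 16kHz mono, as expected by Whisper and Wav2Vec2

def record_clip(seconds=5.0):
    """Record a mono float32 clip from the default microphone."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording is finished
    return audio.squeeze()  # 1-D array of samples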
Usage
Start Fine-Tuning
Fine-tune the model on live audio:
# For Whisper model
python main.py --model_type whisper
# For Wav2Vec2 model
python main.py --model_type wav2vec2
This records audio in real time and updates the model continuously. Press Ctrl+C to stop training; the fine-tuned model is saved automatically.
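As a rough sketch, one incremental update for the Whisper variant could look like this, assuming the record_clip helper above and a reference transcript for each clip. The Adam optimizer and 1e-5 learning rate follow the Model Details section; the real loop lives in main.py and train.py.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def fine_tune_step(audio, transcript):
    # Convert raw audio to log-mel features and the transcript to label ids
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    outputs = model(input_features=inputs.input_features, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()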
Transcription
Test the fine-tuned model:
# For Whisper model
python test_transcription.py --model_type whisper
# For Wav2Vec2 model
python test_transcription.py --model_type wav2vec2
Records 5 seconds of audio (configurable in code) and generates a transcription.
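A minimal version of this test could look like the following, assuming the hypothetical record_clip helper from the recording sketch above; the real entry point is test_transcription.py.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.eval()

audio = record_clip(seconds=5.0)  # 5-second clip, configurable
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])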
Model Storage
Models are saved by default to:
models/speech_recognition_ai_fine_tune_[model_type]_[timestamp]
Example: models/speech_recognition_ai_fine_tune_whisper_20250225
To customize the save path:
export MODEL_SAVE_PATH="/your/custom/path"
python main.py --model_type [whisper|wav2vec2]
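The path resolution can be sketched like this; resolve_save_dir is hypothetical, the exact naming is defined in main.py, and the date format matches the example above.
import os
from datetime import datetime

def resolve_save_dir(model_type):
    base = os.environ.get("MODEL_SAVE_PATH", "models")
    stamp = datetime.now().strftime("%Y%m%d")  # e.g. 20250225
    return os.path.join(base, f"speech_recognition_ai_fine_tune_{model_type}_{stamp}")

save_dir = resolve_save_dir("whisper")
# With a loaded transformers model/processor pair:
# model.save_pretrained(save_dir)
# processor.save_pretrained(save_dir)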
Requirements
- Python 3.8+
- PyTorch (torch==2.0.1 recommended)
- Transformers (transformers==4.35.0 recommended)
- Sounddevice (sounddevice==0.4.6)
- Torchaudio (torchaudio==2.0.1)
A GPU is recommended for faster fine-tuning. See requirements.txt for the full list.
Model Details
- Task: Automatic Speech Recognition (ASR)
- Base Models:
- Whisper: openai/whisper-small
- Wav2Vec2: facebook/wav2vec2-base-960h
- Fine-tuning: Trained on live 16kHz mono audio recordings with a batch size of 8, using the Adam optimizer (learning rate 1e-5).
- Input: 16kHz mono audio
- Output: Text transcription
- Language: English
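For the Wav2Vec2 variant, the update step differs because it is a CTC model rather than an encoder-decoder. A counterpart sketch to the Whisper one above, under the same assumptions; the .upper() call reflects the uppercase vocabulary of wav2vec2-base-960h.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def fine_tune_step(audio, transcript):
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript.upper(), return_tensors="pt").input_ids
    outputs = model(input_values=inputs.input_values, labels=labels)  # CTC loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()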
Loading the Model (Hugging Face)
To load the models from Hugging Face:
from transformers import WhisperForConditionalGeneration, WhisperProcessor
model = WhisperForConditionalGeneration.from_pretrained("harpertoken/harpertokenASR")
processor = WhisperProcessor.from_pretrained("harpertoken/harpertokenASR")
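If the checkpoint you load is the Wav2Vec2 variant, decoding uses a CTC argmax instead of generate. A sketch against the base checkpoint, since the exact Hub layout of a Wav2Vec2 export is an assumption; audio is a 16kHz mono float32 array, e.g. from the recording sketch above.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(input_values=inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
print(processor.batch_decode(pred_ids)[0])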
Repository Structure
speech-model/
├── dataset.py # Audio recording and preprocessing
├── train.py # Training pipeline
├── test_transcription.py # Transcription testing
├── main.py # Main script for fine-tuning
├── README.md # This file
└── requirements.txt # Dependencies
Training Data
The models are fine-tuned on live audio recordings collected during runtime. No pre-existing dataset is required; users generate their own data via microphone input.
Evaluation Results
Future updates will include WER (Word Error Rate) metrics compared to base models.
License
Licensed under the MIT License.