Speech Recognition AI: Fine-Tuned Whisper and Wav2Vec2 for Real-Time Audio

This project fine-tunes OpenAI's Whisper (whisper-small) and Facebook's Wav2Vec2 (wav2vec2-base-960h) models for real-time speech recognition using live audio recordings. It’s designed for dynamic environments where low-latency transcription is key, such as live conversations or streaming audio.

Model Description

Fine-tuned Whisper and Wav2Vec2 models for real-time speech recognition on live audio.

Features

  • Real-time audio recording: Captures live 16kHz mono audio via microphone input (see the capture sketch after this list).
  • Continuous fine-tuning: Updates model weights incrementally during live sessions.
  • Speech-to-text transcription: Converts recorded audio to text with the selected model.
  • Model saving/loading: Automatically saves fine-tuned models with timestamps.
  • Dual model support: Choose between Whisper and Wav2Vec2 architectures.
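
As a rough sketch of the capture step referenced above, the snippet below streams 16kHz mono audio from the default microphone with sounddevice (listed in the requirements). The block size and the one-second window are illustrative, not the project's actual values.

import queue
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000   # both models expect 16kHz mono input
BLOCK = 1024          # frames per callback; illustrative

chunks = queue.Queue()

def _callback(indata, frames, time, status):
    # Runs on the audio thread for every captured block
    chunks.put(indata.copy())

# Capture roughly one second of live audio from the default microphone
with sd.InputStream(samplerate=SAMPLE_RATE, blocksize=BLOCK,
                    channels=1, dtype="float32", callback=_callback):
    blocks = [chunks.get() for _ in range(SAMPLE_RATE // BLOCK)]
audio = np.concatenate(blocks).squeeze()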

Usage

Start Fine-Tuning

Fine-tune the model on live audio:

# For Whisper model
python main.py --model_type whisper

# For Wav2Vec2 model
python main.py --model_type wav2vec2

Records audio in real-time and updates the model continuously. Press Ctrl+C to stop training and save the model automatically.
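
The loop itself lives in main.py; the outline below only shows its general shape, with record_chunk, train_step, and save_model as stand-ins for the real helpers:

try:
    while True:
        audio = record_chunk()           # capture the next live segment
        loss = train_step(model, audio)  # one incremental weight update
except KeyboardInterrupt:
    save_model(model)                    # Ctrl+C triggers the automatic save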

Transcription

Test the fine-tuned model:

# For Whisper model
python test_transcription.py --model_type whisper

# For Wav2Vec2 model
python test_transcription.py --model_type wav2vec2

Records 5 seconds of audio (configurable in code) and generates a transcription.
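
Equivalently, the Whisper path can be reproduced in a few lines. This sketch assumes the base checkpoint and the default 5-second window:

import sounddevice as sd
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

SAMPLE_RATE = 16000
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Record 5 seconds of 16kHz mono audio
audio = sd.rec(int(5 * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, dtype="float32")
sd.wait()

# Convert the waveform to log-mel features and decode
inputs = processor(audio.squeeze(), sampling_rate=SAMPLE_RATE,
                   return_tensors="pt")
with torch.no_grad():
    ids = model.generate(inputs.input_features)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])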

Model Storage

Models are saved by default to:

models/speech_recognition_ai_fine_tune_[model_type]_[timestamp]

Example: models/speech_recognition_ai_fine_tune_whisper_20250225

To customize the save path:

export MODEL_SAVE_PATH="/your/custom/path"
python main.py --model_type [whisper|wav2vec2]
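
A plausible reconstruction of the path logic (the exact code is in main.py; the "models" fallback here is an assumption):

import os
from datetime import datetime

base = os.environ.get("MODEL_SAVE_PATH", "models")  # assumed default
stamp = datetime.now().strftime("%Y%m%d")
save_dir = os.path.join(base, f"speech_recognition_ai_fine_tune_whisper_{stamp}")
# e.g. models/speech_recognition_ai_fine_tune_whisper_20250225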

Requirements

  • Python 3.8+
  • PyTorch (torch==2.0.1 recommended)
  • Transformers (transformers==4.35.0 recommended)
  • Sounddevice (sounddevice==0.4.6)
  • Torchaudio (torchaudio==2.0.1)

A GPU is recommended for faster fine-tuning. See requirements.txt for the full list.
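The pinned versions above can be installed directly, or via pip install -r requirements.txt:

pip install torch==2.0.1 torchaudio==2.0.1 transformers==4.35.0 sounddevice==0.4.6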

Model Details

  • Task: Automatic Speech Recognition (ASR)
  • Base Models:
    • Whisper: openai/whisper-small
    • Wav2Vec2: facebook/wav2vec2-base-960h
  • Fine-tuning: Trained on live 16kHz mono audio recordings with a batch size of 8, using the Adam optimizer (learning rate 1e-5); see the sketch after this list.
  • Input: 16kHz mono audio
  • Output: Text transcription
  • Language: English
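
For reference, a single Whisper update step under the hyperparameters above might look like the following; the batching and label handling are illustrative, not the project's exact training code.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # lr from above

def train_step(audio_batch, texts):
    # audio_batch: list of 16kHz float32 arrays (batch size 8 above)
    inputs = processor(audio_batch, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(texts, return_tensors="pt",
                                 padding=True).input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding
    out = model(input_features=inputs.input_features, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()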

Loading the Model (Hugging Face)

To load the fine-tuned Whisper variant from Hugging Face:

from transformers import WhisperForConditionalGeneration, WhisperProcessor
model = WhisperForConditionalGeneration.from_pretrained("harpertoken/harpertokenASR")
processor = WhisperProcessor.from_pretrained("harpertoken/harpertokenASR")
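
A short usage example follows; sample.wav is a placeholder file name, and torchaudio (already in the requirements) handles loading and resampling:

import torch
import torchaudio

# Load a clip and resample to the 16kHz the model expects
waveform, sr = torchaudio.load("sample.wav")  # placeholder file
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(inputs.input_features)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])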

Repository Structure

speech-model/
├── dataset.py              # Audio recording and preprocessing
├── train.py                # Training pipeline
├── test_transcription.py   # Transcription testing
├── main.py                 # Main script for fine-tuning
├── README.md               # This file
└── requirements.txt        # Dependencies

Training Data

The models are fine-tuned on live audio recordings collected during runtime. No pre-existing dataset is required; users generate their own data via microphone input.

Evaluation Results

Evaluation metrics are not yet available. Future updates will include word error rate (WER) comparisons against the base models.

License

Licensed under the MIT License.
