Arthemis TTS: A Small Voice Model

A lightweight transformer-based text-to-speech model that generates high-quality speech with a natural American woman's voice. Despite having only 21-22 million parameters, this model delivers surprisingly clear synthesis.

Model Description

This is a compact implementation of the Transformer TTS architecture, specifically trained to produce speech that sounds like an American woman speaker. The model strikes an excellent balance between quality and efficiency, making it well suited for applications where you need good speech synthesis without the computational overhead of larger models.

Key Features:

  • 🎯 Compact Size: Only 21-22M parameters (355MB model file; see the quick check below)
  • ⚡ Fast Inference: Optimized for quick speech generation
  • 🎵 High Quality: Clear, natural-sounding speech output
  • 💻 CPU Friendly: Works well even without GPU acceleration

Quick Start

pip install arthemis-tts

import arthemis_tts

# Generate speech with the American woman's voice
audio = arthemis_tts.text_to_speech(
    "Hello, world!",
    model_path="arthemis_final.pt",
    output_path="test_speech.wav"
)
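
Since the output is a standard WAV file, you can sanity-check it with torchaudio (a minimal sketch; the expected sample rate comes from the technical specifications below):

import torchaudio

# Load the generated file and confirm its sample rate and duration
waveform, sample_rate = torchaudio.load("test_speech.wav")
print(f"Sample rate: {sample_rate} Hz")                      # expected: 22050
print(f"Duration: {waveform.shape[1] / sample_rate:.2f} s")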

Model Architecture

This model uses a simplified Transformer TTS architecture with:

  • Text Encoder: Converts text to embeddings using convolutional pre-processing and transformer blocks
  • Mel Decoder: Generates mel spectrograms using cross-attention with the text representation
  • PostNet: Refines the mel spectrograms for better audio quality
  • Griffin-Lim Vocoder: Converts mel spectrograms to waveforms

The architecture is intentionally kept lightweight while maintaining the core transformer mechanisms that make modern TTS systems effective.
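
For orientation, here is a minimal PyTorch sketch of that pipeline. Layer counts, hidden sizes, and kernel sizes are illustrative assumptions, not the released model's exact configuration:

import torch
import torch.nn as nn

class SketchTransformerTTS(nn.Module):
    def __init__(self, vocab_size=256, d_model=256, n_mels=128):
        super().__init__()
        # Text encoder: convolutional pre-processing + transformer blocks
        self.embed = nn.Embedding(vocab_size, d_model)
        self.prenet_conv = nn.Conv1d(d_model, d_model, kernel_size=5, padding=2)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=3,
        )
        # Mel decoder: cross-attends over the encoded text representation
        self.mel_in = nn.Linear(n_mels, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=3,
        )
        self.mel_out = nn.Linear(d_model, n_mels)
        # PostNet: residual refinement of the predicted mel spectrogram
        self.postnet = nn.Sequential(
            nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2),
            nn.Tanh(),
            nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, tokens, mel_prev):
        x = self.embed(tokens)                           # (B, T_text, d_model)
        x = self.prenet_conv(x.transpose(1, 2)).transpose(1, 2)
        memory = self.encoder(x)
        y = self.decoder(self.mel_in(mel_prev), memory)  # cross-attention
        mel = self.mel_out(y)                            # (B, T_mel, n_mels)
        mel = mel + self.postnet(mel.transpose(1, 2)).transpose(1, 2)
        return mel

A real implementation would also need positional encodings, attention masks, autoregressive decoding, and a stop-token predictor; those are omitted here for brevity.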

Training Details

  • Dataset: Trained on LJ Speech dataset (single speaker, American English)
  • Training Duration: Approximately 12 hours on an A100 GPU
  • Optimization: Focused on maintaining quality while reducing model size
  • Voice Characteristics: Optimized to produce a warm American woman's voice

Limitations

  • Accent: Primarily optimized for American English pronunciation
  • Voice: Single speaker (American woman) - not suitable for multi-speaker applications
  • Vocoder: Uses Griffin-Lim reconstruction, which may introduce some artifacts compared to neural vocoders
  • Voice Quality: The model may struggle to pronounce some words

Use Cases

This model is particularly well-suited for:

  • Accessibility Applications: Screen readers, audio books
  • Educational Content: E-learning platforms, language learning apps
  • Voice Assistants: Lightweight voice synthesis for IoT devices
  • Content Creation: Podcast intros, video narration
  • Prototyping: Quick TTS integration for demos and MVPs

Technical Specifications

  • Sample Rate: 22,050 Hz
  • Mel Frequency Bins: 128
  • Max Sequence Length: 1024 time steps
  • Input: Raw text (English)
  • Output: 16-bit WAV audio files
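
These settings map directly onto torchaudio. Below is a minimal sketch of Griffin-Lim reconstruction consistent with the specifications above; n_fft, hop_length, and n_iter are assumptions, not confirmed values from the release:

import torch
import torchaudio

SAMPLE_RATE = 22050   # from the specs above
N_MELS = 128          # mel frequency bins
N_FFT = 1024          # assumption: a common choice at 22.05 kHz
HOP_LENGTH = 256      # assumption

# Invert the mel filterbank back to a linear-frequency spectrogram, then
# reconstruct phase iteratively with Griffin-Lim
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=N_FFT // 2 + 1, n_mels=N_MELS, sample_rate=SAMPLE_RATE
)
griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=N_FFT, hop_length=HOP_LENGTH, n_iter=32
)

mel = torch.rand(N_MELS, 400)             # stand-in for a model-predicted mel
waveform = griffin_lim(inverse_mel(mel))  # mel -> linear -> waveform
torchaudio.save("reconstructed.wav", waveform.unsqueeze(0), SAMPLE_RATE)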

Installation and Usage

For detailed usage examples and advanced configuration options, please refer to the Arthemis TTS library documentation.

License

This model is released under the MIT License. Feel free to use it in both commercial and non-commercial projects.


Acknowledgments

  • Based on the paper "Neural Speech Synthesis with Transformer Network"
  • Inspired by the original SimpleTransformerTTS implementation
  • Uses PyTorch and torchaudio for audio processing

Note: This model represents a good starting point for English TTS applications. While it may not match the quality of much larger models (100M+ parameters), it offers an excellent trade-off between size, speed, and quality for most practical applications.

Happy synthesizing! 🎤
