A Swedish text-to-speech (TTS) model fine-tuned from F5-TTS on approximately 200 hours of speech from the Common Voice dataset and parliamentary recordings from the RixVox dataset. Training was conducted locally on an RTX 4080.
Dataset preparation scripts can be found at https://github.com/ChiliOlavi/F5-TTS/tree/swedish-tts
## Training Configuration
| Flag | Value |
|------|-------|
| `--exp_name` | `F5TTS_v1_Base` |
| `--learning_rate` | `0.0001` |
| `--batch_size_per_gpu` | `2000` |
| `--batch_size_type` | `frame` |
| `--max_samples` | `96` |
| `--grad_accumulation_steps` | `16` |
| `--max_grad_norm` | `0.3` |
| `--epochs` | `100` |
| `--num_warmup_updates` | `3000` |
| `--save_per_updates` | `10000` |
| `--keep_last_n_checkpoints` | `-1` |
| `--last_per_updates` | `5000` |
| `--tokenizer` | `pinyin` |
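These flag names match the arguments of the upstream F5-TTS fine-tuning entry point (`src/f5_tts/train/finetune_cli.py`), so the run could plausibly be reproduced with a command along the following lines. This is a sketch, not the exact command used: the dataset name, the `--finetune` switch, and the `accelerate launch` wrapper are assumptions not stated in this card.

```bash
# Sketch of a fine-tuning invocation, assuming the upstream
# F5-TTS finetune_cli.py entry point. The dataset name
# "swedish_tts" is a hypothetical placeholder.
accelerate launch src/f5_tts/train/finetune_cli.py \
  --exp_name F5TTS_v1_Base \
  --dataset_name swedish_tts \
  --finetune \
  --learning_rate 0.0001 \
  --batch_size_per_gpu 2000 \
  --batch_size_type frame \
  --max_samples 96 \
  --grad_accumulation_steps 16 \
  --max_grad_norm 0.3 \
  --epochs 100 \
  --num_warmup_updates 3000 \
  --save_per_updates 10000 \
  --keep_last_n_checkpoints -1 \
  --last_per_updates 5000 \
  --tokenizer pinyin
```

Note that with `--batch_size_type frame`, the `--batch_size_per_gpu` value of 2000 counts mel frames rather than utterances.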
## Inference Parameters

The model uses the standard `F5TTS_v1_Base` architecture configuration, which must also be supplied when loading the checkpoint for inference:

`dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4`
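This configuration is what the upstream F5-TTS inference CLI selects via `--model F5TTS_v1_Base`, so the checkpoint can be used with a custom-checkpoint invocation along these lines. The local file paths, the reference audio/transcript, and the Swedish sample text are hypothetical placeholders, not files confirmed by this card:

```bash
# Sketch: synthesize Swedish speech with the upstream f5-tts_infer-cli,
# pointing it at a locally downloaded checkpoint of this model.
# All paths and texts below are illustrative placeholders.
# The sample text reads: "Hi! This is a test of the Swedish voice."
f5-tts_infer-cli \
  --model F5TTS_v1_Base \
  --ckpt_file ./f5-tts-swedish/model.pt \
  --vocab_file ./f5-tts-swedish/vocab.txt \
  --ref_audio ./ref_sv.wav \
  --ref_text "Transkription av referensljudet." \
  --gen_text "Hej! Det här är ett test av den svenska rösten." \
  --output_dir ./out
```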
## Thanks
Special thanks to Amos Wallgren for quality assurance.
## Model tree

Base model: [SWivid/F5-TTS](https://huggingface.co/SWivid/F5-TTS)