---
license: apache-2.0
title: audio2kineticvid
sdk: gradio
emoji: 🚀
colorFrom: red
colorTo: yellow
---

# Audio2KineticVid

Audio2KineticVid is a comprehensive tool that converts an audio track (e.g., a song) into a dynamic music video with AI-generated scenes and synchronized kinetic typography (animated subtitles). Everything runs locally using open-source models – no external APIs or paid services required.

## ✨ Features

- **🎤 Whisper Transcription:** Choose from multiple Whisper models (tiny to large) for audio transcription with word-level timestamps.
- **🧠 Adaptive Lyric Segmentation:** Splits lyrics into segments at natural pause points to align scene changes with the song.
- **🎨 Customizable Scene Generation:** Use various LLM models to generate scene descriptions for each lyric segment, with customizable system prompts and word limits.
- **🤖 Multiple AI Models:** Select from a variety of text-to-image models (SDXL, SD 1.5, etc.) and video generation models.
- **🎬 Style Consistency Options:** Choose between independent scene generation or img2img-based style consistency for a more cohesive visual experience.
- **🔍 Preview & Inspection:** Preview scenes before full generation and inspect all generated images in a gallery view.
- **🔄 Seamless Transitions:** Configurable crossfade transitions between scene clips.
- **🎪 Kinetic Subtitles:** PyCaps renders styled animated subtitles that appear in sync with the original audio.
- **🔒 Fully Local & Open-Source:** All models are open-license and run on a local GPU.

## 💻 System Requirements

### Hardware Requirements

- **GPU**: NVIDIA GPU with 8GB+ VRAM (recommended: RTX 3080/4070 or better)
- **RAM**: 16GB+ system RAM
- **Storage**: SSD recommended for faster model loading and video processing
- **CPU**: Modern multi-core processor

### Software Requirements

- **Operating System**: Linux, Windows, or macOS
- **Python**: 3.8 or higher
- **CUDA**: NVIDIA CUDA toolkit (for GPU acceleration)
- **FFmpeg**: For audio/video processing

## 🚀 Quick Start (Gradio Web UI)

### 1. Install Dependencies

Ensure you have a suitable GPU (NVIDIA T4/A10 or better) with CUDA installed. Then install the required Python packages:

```bash
pip install -r requirements.txt
```

### 2. Launch the Web Interface

```bash
python app.py
```

This will start a Gradio web interface accessible at `http://localhost:7860`.

### 3. Using the Interface

1. **Upload Audio**: Choose an audio file (MP3, WAV, M4A, etc.)
2. **Select Quality Preset**: Choose from Fast, Balanced, or High Quality
3. **Configure Models**: Optionally adjust AI models in the "AI Models" tab
4. **Customize Style**: Modify scene prompts and visual style in other tabs
5. **Preview**: Click "Preview First Scene" to test settings quickly
6. **Generate**: Click "Generate Complete Music Video" to create the full video
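Under the hood, the first two stages of the pipeline are transcription (`utils/transcribe.py`) and pause-based segmentation (`utils/segment.py`). The sketch below shows the general idea, assuming the `openai-whisper` package; the `split_at_pauses` helper, its `min_gap`/`max_words` thresholds, and the file name `song.mp3` are illustrative, not the exact implementation:

```python
import whisper

# Transcribe with word-level timestamps (supported by openai-whisper).
model = whisper.load_model("base")  # "tiny" ... "large", as in the UI dropdown
result = model.transcribe("song.mp3", word_timestamps=True)

# Flatten the per-segment word lists into one list of
# {"word": ..., "start": ..., "end": ...} dicts.
words = [w for seg in result["segments"] for w in seg["words"]]

def split_at_pauses(words, min_gap=0.6, max_words=12):
    """Start a new lyric segment wherever the inter-word silence exceeds
    min_gap seconds, or the current segment grows past max_words."""
    segments, current = [], []
    for i, w in enumerate(words):
        current.append(w)
        last = i + 1 == len(words)
        gap = None if last else words[i + 1]["start"] - w["end"]
        if last or gap >= min_gap or len(current) >= max_words:
            segments.append({
                "text": "".join(x["word"] for x in current).strip(),
                "start": current[0]["start"],
                "end": current[-1]["end"],
            })
            current = []
    return segments

for seg in split_at_pauses(words):
    print(f"{seg['start']:6.2f}-{seg['end']:6.2f}  {seg['text']}")
```

Each resulting segment then drives one scene: its text is sent to the LLM for a scene description, and its start/end times determine the clip's duration.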
## 📝 Usage Tips

### Audio Selection

- **Format**: MP3, WAV, M4A, FLAC, OGG supported
- **Quality**: Clear vocals work best for transcription
- **Length**: 30 seconds to 3 minutes recommended for testing
- **Content**: Songs with distinct lyrics produce better results

### Performance Optimization

- **Fast Generation**: Use 512x288 resolution with the "tiny" Whisper model
- **Best Quality**: Use 1280x720 with the "large" Whisper model (requires more VRAM)
- **Memory Issues**: Lower the resolution, use smaller models, or reduce max segments

### Style Customization

- **Visual Style Keywords**: Add style terms like "cinematic, vibrant, neon" to influence all scenes
- **Prompt Template**: Customize how the AI interprets lyrics into visual scenes
- **Consistency Mode**: Use "Consistent (Img2Img)" for a coherent visual style across scenes

## 🛠️ Advanced Usage

### Command Line Interface

For batch processing or automation, you can use the smoke test script:

```bash
bash scripts/smoke_test.sh your_audio.mp3
```

### Custom Templates

Create custom subtitle styles by adding new templates in the `templates/` directory:

1. Create a new folder: `templates/your_style/`
2. Add `pycaps.template.json` with animation definitions
3. Add `styles.css` with visual styling
4. The template will appear in the interface dropdown

### Model Configuration

Supported models are defined in the utility modules:

- **Whisper**: `utils/transcribe.py` - Add new Whisper model names
- **LLM**: `utils/prompt_gen.py` - Add new language models
- **Image**: `utils/video_gen.py` - Add new Stable Diffusion variants
- **Video**: `utils/video_gen.py` - Add new video diffusion models

## 🧪 Testing

Run the basic functionality test:

```bash
python test_basic.py
```

For a complete end-to-end test with a sample audio file:

```bash
python test.py
```

## 📁 Project Structure

```
Audio2KineticVid/
├── app.py               # Main Gradio web interface
├── requirements.txt     # Python dependencies
├── utils/               # Core processing modules
│   ├── transcribe.py    # Whisper audio transcription
│   ├── segment.py       # Intelligent lyric segmentation
│   ├── prompt_gen.py    # LLM scene description generation
│   ├── video_gen.py     # Image and video generation
│   └── glue.py          # Video stitching and subtitle overlay
├── templates/           # Subtitle animation templates
│   ├── minimalist/      # Clean, simple subtitle style
│   └── dynamic/         # Dynamic animations
├── scripts/             # Utility scripts
│   └── smoke_test.sh    # End-to-end testing script
└── test_basic.py        # Component testing
```

## 🎬 Output

The application generates:

- **Final Video**: MP4 file with synchronized audio, visuals, and animated subtitles
- **Scene Images**: Individual AI-generated images for each lyric segment
- **Scene Descriptions**: Text prompts used for image generation
- **Segmentation Data**: Analyzed lyric segments with timing information
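The stitching stage (`utils/glue.py`) joins the per-scene clips with the configurable crossfade before the subtitles are overlaid. Conceptually this corresponds to FFmpeg's `xfade` filter; below is a minimal sketch for two clips, assuming the first clip is 5 seconds long and a 0.5 s crossfade (file names are illustrative):

```bash
# Crossfade two scene clips. The offset is the first clip's duration minus
# the fade duration, so scene2 starts fading in 0.5 s before scene1 ends.
# (The original song audio is muxed in at a later step.)
ffmpeg -i scene1.mp4 -i scene2.mp4 \
  -filter_complex "[0:v][1:v]xfade=transition=fade:duration=0.5:offset=4.5[v]" \
  -map "[v]" -c:v libx264 -pix_fmt yuv420p stitched.mp4
```

As noted under Troubleshooting below, setting the crossfade duration to 0 in the UI produces hard cuts instead, which encode faster.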
## 🔧 Troubleshooting

### Common Issues

**GPU Memory Errors**
- Reduce the video resolution (use 512x288 instead of 1280x720)
- Use smaller models (tiny/base Whisper, SD 1.5 instead of SDXL)
- Close other GPU-intensive applications

**Audio Processing Fails**
- Ensure FFmpeg is installed and accessible
- Try converting the audio to WAV format first
- Check that the audio file is not corrupted

**Model Loading Issues**
- Check your internet connection (models download on first use)
- Verify sufficient disk space for model files
- Clear the Hugging Face cache if models are corrupted

**Slow Generation**
- Use the "Fast" quality preset for testing
- Reduce the crossfade duration to 0 for hard cuts
- Use dynamic FPS instead of a fixed high FPS

### Performance Monitoring

Monitor system resources during generation:

- **GPU Usage**: Should be near 100% during image/video generation
- **RAM Usage**: Peaks during model loading and video processing
- **Disk I/O**: High during model downloads and video encoding

## 🤝 Contributing

Contributions are welcome! Areas for improvement:

- Additional subtitle animation templates
- Support for more AI models
- Performance optimizations
- Additional audio/video formats
- Batch processing capabilities

## 📄 License

This project uses open-source models and libraries. Please check individual model licenses for usage rights.

## 🙏 Acknowledgments

- **OpenAI Whisper** for speech recognition
- **Stability AI** for Stable Diffusion models
- **Hugging Face** for model hosting and transformers
- **PyCaps** for kinetic subtitle rendering
- **Gradio** for the web interface