---
title: audio2kineticvid
emoji: π
colorFrom: red
colorTo: yellow
sdk: gradio
license: apache-2.0
---
# Audio2KineticVid
Audio2KineticVid converts an audio track (e.g., a song) into a dynamic music video with AI-generated scenes and synchronized kinetic typography (animated subtitles). Everything runs locally using open-source models; no external APIs or paid services are required.
## Features

- **Whisper Transcription**: Choose from multiple Whisper models (tiny to large) for audio transcription with word-level timestamps.
- **Adaptive Lyric Segmentation**: Splits lyrics into segments at natural pause points to align scene changes with the song (see the sketch after this list).
- **Customizable Scene Generation**: Use various LLM models to generate scene descriptions for each lyric segment, with customizable system prompts and word limits.
- **Multiple AI Models**: Select from a variety of text-to-image models (SDXL, SD 1.5, etc.) and video generation models.
- **Style Consistency Options**: Choose between independent scene generation or img2img-based style consistency for a more cohesive visual experience.
- **Preview & Inspection**: Preview scenes before full generation and inspect all generated images in a gallery view.
- **Seamless Transitions**: Configurable crossfade transitions between scene clips.
- **Kinetic Subtitles**: PyCaps renders styled animated subtitles that appear in sync with the original audio.
- **Fully Local & Open-Source**: All models are open-license and run on a local GPU.
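To make the pause-based segmentation idea concrete, here is a minimal sketch that splits word-level timestamps wherever the silence between consecutive words exceeds a threshold. The function name and threshold are illustrative, not the exact logic in `utils/segment.py`:

```python
# Hypothetical sketch of pause-based segmentation; the real logic
# lives in utils/segment.py and may differ in detail.
def split_on_pauses(words, max_gap=0.6):
    """words: list of dicts like {"word": str, "start": float, "end": float},
    the shape Whisper word-level timestamps typically take."""
    segments, current = [], []
    for word in words:
        # Start a new segment when the silence before this word is long enough.
        if current and word["start"] - current[-1]["end"] > max_gap:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments

words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 2.1, "end": 2.6},  # 1.2 s pause before this word
]
print([[w["word"] for w in seg] for seg in split_on_pauses(words)])
# -> [['Hello', 'world'], ['again']]
```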
## System Requirements

### Hardware Requirements
- **GPU**: NVIDIA GPU with 8 GB+ VRAM (recommended: RTX 3080/4070 or better)
- **RAM**: 16 GB+ system RAM
- **Storage**: SSD recommended for faster model loading and video processing
- **CPU**: Modern multi-core processor

### Software Requirements
- **Operating System**: Linux, Windows, or macOS
- **Python**: 3.8 or higher
- **CUDA**: NVIDIA CUDA toolkit (for GPU acceleration)
- **FFmpeg**: For audio/video processing
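Before launching, it can help to confirm that FFmpeg and CUDA are actually visible. These are standard checks, not project-specific commands:

```bash
# Confirm FFmpeg is installed and on PATH
ffmpeg -version

# Confirm PyTorch can see the GPU (run after installing requirements)
python -c "import torch; print(torch.cuda.is_available())"
```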
## Quick Start (Gradio Web UI)

### 1. Install Dependencies

Ensure you have a suitable GPU (NVIDIA T4/A10 or better) with CUDA installed, then install the required Python packages:

```bash
pip install -r requirements.txt
```

### 2. Launch the Web Interface

```bash
python app.py
```
This starts a Gradio web interface accessible at http://localhost:7860.
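If port 7860 is already in use, Gradio reads the standard `GRADIO_SERVER_PORT` environment variable, so a different port can be chosen at launch:

```bash
# Launch on an alternative port (GRADIO_SERVER_PORT is honored by Gradio)
GRADIO_SERVER_PORT=7861 python app.py
```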
### 3. Using the Interface

- **Upload Audio**: Choose an audio file (MP3, WAV, M4A, etc.)
- **Select Quality Preset**: Choose from Fast, Balanced, or High Quality
- **Configure Models**: Optionally adjust AI models in the "AI Models" tab
- **Customize Style**: Modify scene prompts and visual style in the other tabs
- **Preview**: Click "Preview First Scene" to test settings quickly
- **Generate**: Click "Generate Complete Music Video" to create the full video
## Usage Tips

### Audio Selection
- **Format**: MP3, WAV, M4A, FLAC, and OGG are supported
- **Quality**: Clear vocals work best for transcription
- **Length**: 30 seconds to 3 minutes is recommended for testing
- **Content**: Songs with distinct lyrics produce better results

### Performance Optimization
- **Fast Generation**: Use 512x288 resolution with the "tiny" Whisper model
- **Best Quality**: Use 1280x720 with the "large" Whisper model (requires more VRAM)
- **Memory Issues**: Lower the resolution, use smaller models, or reduce the maximum number of segments
### Style Customization
- **Visual Style Keywords**: Add style terms like "cinematic, vibrant, neon" to influence all scenes
- **Prompt Template**: Customize how the AI interprets lyrics into visual scenes (see the sketch after this list)
- **Consistency Mode**: Use "Consistent (Img2Img)" for a coherent visual style across scenes
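To illustrate what a prompt template might look like, here is a hypothetical example in the spirit of what the interface exposes; the placeholder names are illustrative, not the ones the app actually uses:

```python
# Hypothetical prompt template; placeholder names are illustrative.
TEMPLATE = (
    "Describe a single vivid scene for a music video, in at most "
    "{max_words} words, matching these lyrics: \"{lyrics}\". "
    "Overall visual style: {style}."
)

prompt = TEMPLATE.format(
    max_words=40,
    lyrics="we danced all night under electric skies",
    style="cinematic, vibrant, neon",
)
print(prompt)
```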
## Advanced Usage

### Command Line Interface

For batch processing or automation, you can use the smoke test script:

```bash
bash scripts/smoke_test.sh your_audio.mp3
```
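To run it over a whole directory of songs, a plain shell loop works (assuming the script takes one audio file per invocation, as above):

```bash
# Generate a video for every MP3 in the current directory
for f in *.mp3; do
    bash scripts/smoke_test.sh "$f"
done
```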
### Custom Templates

Create custom subtitle styles by adding new templates in the `templates/` directory:

- Create a new folder: `templates/your_style/`
- Add `pycaps.template.json` with the animation definitions
- Add `styles.css` with the visual styling
- The template will then appear in the interface dropdown
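Put together, a new template folder mirrors the shipped templates (same tree notation as the project structure below):

```
templates/your_style/
├── pycaps.template.json   # animation definitions
└── styles.css             # visual styling
```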
### Model Configuration

Supported models are defined in the utility modules:

- **Whisper** (`utils/transcribe.py`): add new Whisper model names here
- **LLM** (`utils/prompt_gen.py`): add new language models here
- **Image** (`utils/video_gen.py`): add new Stable Diffusion variants here
- **Video** (`utils/video_gen.py`): add new video diffusion models here
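The README doesn't spell out how models are registered inside those files, but if they use a name-to-checkpoint mapping (a common pattern), adding one would look roughly like this hypothetical sketch; the dict name is illustrative, not taken from the code:

```python
# Hypothetical sketch; the actual structure in utils/video_gen.py may differ.
IMAGE_MODELS = {
    "SDXL": "stabilityai/stable-diffusion-xl-base-1.0",
    "SD 1.5": "runwayml/stable-diffusion-v1-5",
}

# Register a new variant so it can be selected by name in the interface:
IMAGE_MODELS["SD 2.1"] = "stabilityai/stable-diffusion-2-1"
```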
## Testing

Run the basic functionality test:

```bash
python test_basic.py
```

For a complete end-to-end test with a sample audio file:

```bash
python test.py
```
## Project Structure

```
Audio2KineticVid/
├── app.py               # Main Gradio web interface
├── requirements.txt     # Python dependencies
├── utils/               # Core processing modules
│   ├── transcribe.py    # Whisper audio transcription
│   ├── segment.py       # Intelligent lyric segmentation
│   ├── prompt_gen.py    # LLM scene description generation
│   ├── video_gen.py     # Image and video generation
│   └── glue.py          # Video stitching and subtitle overlay
├── templates/           # Subtitle animation templates
│   ├── minimalist/      # Clean, simple subtitle style
│   └── dynamic/         # Dynamic animations
├── scripts/             # Utility scripts
│   └── smoke_test.sh    # End-to-end testing script
└── test_basic.py        # Component testing
```
## Output

The application generates:

- **Final Video**: an MP4 file with synchronized audio, visuals, and animated subtitles
- **Scene Images**: the individual AI-generated images for each lyric segment
- **Scene Descriptions**: the text prompts used for image generation
- **Segmentation Data**: the analyzed lyric segments with timing information
## Troubleshooting

### Common Issues

#### GPU Memory Errors
- Reduce the video resolution (use 512x288 instead of 1280x720)
- Use smaller models (tiny/base Whisper, SD 1.5 instead of SDXL)
- Close other GPU-intensive applications
#### Audio Processing Fails
- Ensure FFmpeg is installed and accessible
- Try converting the audio to WAV first (see the command below)
- Check that the audio file is not corrupted
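FFmpeg itself can do the conversion; this standard invocation writes a 44.1 kHz stereo WAV:

```bash
# Re-encode any supported input as a plain stereo WAV
ffmpeg -i input.m4a -ar 44100 -ac 2 output.wav
```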
#### Model Loading Issues
- Check your internet connection (models are downloaded on first use)
- Verify there is sufficient disk space for the model files
- Clear the Hugging Face cache if downloaded models are corrupted (see below)
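By default, Hugging Face caches models under `~/.cache/huggingface/hub`; removing that directory forces a clean re-download on the next run:

```bash
# Deletes all cached models; they will re-download on next use
rm -rf ~/.cache/huggingface/hub
```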
#### Slow Generation
- Use the "Fast" quality preset for testing
- Reduce the crossfade duration to 0 for hard cuts
- Use dynamic FPS instead of a fixed high FPS
### Performance Monitoring

Monitor system resources during generation:

- **GPU usage**: should be near 100% during image/video generation
- **RAM usage**: peaks during model loading and video processing
- **Disk I/O**: high during model downloads and video encoding
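On NVIDIA systems, `nvidia-smi` provides a live view of GPU utilization and VRAM:

```bash
# Refresh GPU utilization and memory figures every second
watch -n 1 nvidia-smi
```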
## Contributing

Contributions are welcome! Areas for improvement:

- Additional subtitle animation templates
- Support for more AI models
- Performance optimizations
- Support for additional audio/video formats
- Batch processing capabilities
## License

The project itself is licensed under Apache 2.0 (see the metadata above). It also builds on open-source models and libraries; please check the individual model licenses for usage rights.
## Acknowledgments

- OpenAI Whisper for speech recognition
- Stability AI for the Stable Diffusion models
- Hugging Face for model hosting and transformers
- PyCaps for kinetic subtitle rendering
- Gradio for the web interface