---
license: apache-2.0
title: audio2kineticvid
sdk: gradio
emoji: 🎵
colorFrom: red
colorTo: yellow
---
# Audio2KineticVid
Audio2KineticVid is a comprehensive tool that converts an audio track (e.g., a song) into a dynamic music video with AI-generated scenes and synchronized kinetic typography (animated subtitles). Everything runs locally using open-source models, with no external APIs or paid services required.
## ✨ Features
- **🎤 Whisper Transcription:** Choose from multiple Whisper models (tiny to large) for audio transcription with word-level timestamps.
- **🧠 Adaptive Lyric Segmentation:** Splits lyrics into segments at natural pause points to align scene changes with the song.
- **🎨 Customizable Scene Generation:** Use various LLMs to generate scene descriptions for each lyric segment, with customizable system prompts and word limits.
- **🤖 Multiple AI Models:** Select from a variety of text-to-image models (SDXL, SD 1.5, etc.) and video generation models.
- **🎬 Style Consistency Options:** Choose between independent scene generation or img2img-based style consistency for a more cohesive visual experience.
- **🔍 Preview & Inspection:** Preview scenes before full generation and inspect all generated images in a gallery view.
- **🔄 Seamless Transitions:** Configurable crossfade transitions between scene clips.
- **💪 Kinetic Subtitles:** PyCaps renders styled animated subtitles that appear in sync with the original audio.
- **🔒 Fully Local & Open-Source:** All models are open-license and run on your local GPU.
## 💻 System Requirements
### Hardware Requirements
- **GPU**: NVIDIA GPU with 8GB+ VRAM (recommended: RTX 3080/4070 or better)
- **RAM**: 16GB+ system RAM
- **Storage**: SSD recommended for faster model loading and video processing
- **CPU**: Modern multi-core processor
### Software Requirements
- **Operating System**: Linux, Windows, or macOS
- **Python**: 3.8 or higher
- **CUDA**: NVIDIA CUDA toolkit (for GPU acceleration)
- **FFmpeg**: For audio/video processing
## 🚀 Quick Start (Gradio Web UI)
### 1. Install Dependencies
Ensure you have a suitable GPU (NVIDIA T4/A10 or better) with CUDA installed. Then install the required Python packages:
```bash
pip install -r requirements.txt
```
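Before launching, it's worth confirming that PyTorch can see your GPU (a quick sanity check; PyTorch is pulled in via `requirements.txt` for the diffusion and Whisper models):

```python
# Sanity check: confirm PyTorch detects the GPU before launching the app.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Total VRAM in GiB; the larger presets need roughly 8 GiB or more.
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"VRAM: {total / 2**30:.1f} GiB")
```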
### 2. Launch the Web Interface
```bash
python app.py
```
This will start a Gradio web interface accessible at `http://localhost:7860`.
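If port 7860 is busy or you need access from another machine, Gradio's standard `launch()` options apply. A minimal sketch, assuming `app.py` builds its UI as a Blocks object named `demo` (the name is an assumption about the app's internals):

```python
# Sketch of standard Gradio launch options; `demo` is an assumed name
# for the Blocks object that app.py actually builds.
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("The real UI definition lives in app.py")

demo.launch(
    server_name="0.0.0.0",  # listen on all interfaces for LAN access
    server_port=7861,       # choose a free port if 7860 is taken
)
```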
### 3. Using the Interface
1. **Upload Audio**: Choose an audio file (MP3, WAV, M4A, etc.)
2. **Select Quality Preset**: Choose from Fast, Balanced, or High Quality
3. **Configure Models**: Optionally adjust AI models in the "AI Models" tab
4. **Customize Style**: Modify scene prompts and visual style in other tabs
5. **Preview**: Click "Preview First Scene" to test settings quickly
6. **Generate**: Click "Generate Complete Music Video" to create the full video
## 📝 Usage Tips
### Audio Selection
- **Format**: MP3, WAV, M4A, FLAC, OGG supported
- **Quality**: Clear vocals work best for transcription
- **Length**: 30 seconds to 3 minutes recommended for testing
- **Content**: Songs with distinct lyrics produce better results
### Performance Optimization
- **Fast Generation**: Use 512x288 resolution with "tiny" Whisper model
- **Best Quality**: Use 1280x720 with "large" Whisper model (requires more VRAM)
- **Memory Issues**: Lower resolution, use smaller models, or reduce max segments
### Style Customization
- **Visual Style Keywords**: Add style terms like "cinematic, vibrant, neon" to influence all scenes
- **Prompt Template**: Customize how the AI interprets lyrics into visual scenes (see the sketch after this list)
- **Consistency Mode**: Use "Consistent (Img2Img)" for coherent visual style across scenes
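To make the prompt template concrete: it is a string with placeholders that get filled once per lyric segment. A minimal sketch, where the placeholder names `{lyrics}` and `{style}` are illustrative rather than the app's actual keys:

```python
# Illustrative only: how a template might combine one lyric segment
# with the global style keywords. Placeholder names are assumptions.
TEMPLATE = "A cinematic scene of {lyrics}, {style}, highly detailed"

def build_prompt(lyric_segment: str, style_keywords: str) -> str:
    """Fill the template for a single lyric segment."""
    return TEMPLATE.format(lyrics=lyric_segment, style=style_keywords)

print(build_prompt("dancing under city lights", "cinematic, vibrant, neon"))
```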
## 🛠️ Advanced Usage
### Command Line Interface
For batch processing or automation, you can use the smoke test script:
```bash
bash scripts/smoke_test.sh your_audio.mp3
```
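To batch over a whole folder of tracks, a small wrapper can invoke the same script once per file. A sketch, assuming the script takes a single audio path as above (the `songs/` folder name is illustrative):

```python
# Minimal batch wrapper around the smoke test script.
import subprocess
from pathlib import Path

for audio in sorted(Path("songs").glob("*.mp3")):
    print(f"Processing {audio} ...")
    subprocess.run(["bash", "scripts/smoke_test.sh", str(audio)], check=True)
```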
### Custom Templates
Create custom subtitle styles by adding new templates in the `templates/` directory:
1. Create a new folder: `templates/your_style/`
2. Add `pycaps.template.json` with animation definitions
3. Add `styles.css` with visual styling
4. The template will appear in the interface dropdown (a scaffold sketch follows these steps)
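A scaffold for the required files can be created as below; the JSON body is left as a placeholder since the actual animation schema comes from the PyCaps documentation:

```python
# Scaffold a new subtitle template directory. The JSON content is a
# placeholder; fill in animation definitions per the PyCaps docs.
from pathlib import Path

style_dir = Path("templates/your_style")
style_dir.mkdir(parents=True, exist_ok=True)
(style_dir / "pycaps.template.json").write_text("{}\n")
(style_dir / "styles.css").write_text("/* visual styling for the subtitles */\n")
```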
### Model Configuration
Supported models are defined in the utility modules (an example registry is sketched after this list):
- **Whisper**: `utils/transcribe.py` - Add new Whisper model names
- **LLM**: `utils/prompt_gen.py` - Add new language models
- **Image**: `utils/video_gen.py` - Add new Stable Diffusion variants
- **Video**: `utils/video_gen.py` - Add new video diffusion models
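As an illustration, such a module typically keeps a registry mapping display names to checkpoint IDs, so adding a model is one new entry. The variable name below is hypothetical (check `utils/video_gen.py` for the real one), though the two checkpoint IDs are real Hugging Face repos:

```python
# Hypothetical registry shape; the actual variable lives in utils/video_gen.py.
IMAGE_MODELS = {
    "SDXL": "stabilityai/stable-diffusion-xl-base-1.0",
    "SD 1.5": "runwayml/stable-diffusion-v1-5",
    # Add an entry here to expose another checkpoint in the UI:
    # "My Model": "your-org/your-model-id",
}
```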
## 🧪 Testing
Run the basic functionality test:
```bash
python test_basic.py
```
For a complete end-to-end test with a sample audio file:
```bash
python test.py
```
## 📁 Project Structure
```
Audio2KineticVid/
├── app.py               # Main Gradio web interface
├── requirements.txt     # Python dependencies
├── utils/               # Core processing modules
│   ├── transcribe.py    # Whisper audio transcription
│   ├── segment.py       # Intelligent lyric segmentation
│   ├── prompt_gen.py    # LLM scene description generation
│   ├── video_gen.py     # Image and video generation
│   └── glue.py          # Video stitching and subtitle overlay
├── templates/           # Subtitle animation templates
│   ├── minimalist/      # Clean, simple subtitle style
│   └── dynamic/         # Dynamic animations
├── scripts/             # Utility scripts
│   └── smoke_test.sh    # End-to-end testing script
└── test_basic.py        # Component testing
```
## 🎬 Output
The application generates:
- **Final Video**: MP4 file with synchronized audio, visuals, and animated subtitles
- **Scene Images**: Individual AI-generated images for each lyric segment
- **Scene Descriptions**: Text prompts used for image generation
- **Segmentation Data**: Analyzed lyric segments with timing information
## 🔧 Troubleshooting
### Common Issues
**GPU Memory Errors**
- Reduce video resolution (use 512x288 instead of 1280x720)
- Use smaller models (tiny/base Whisper, SD 1.5 instead of SDXL)
- Close other GPU-intensive applications (a quick VRAM check is sketched below)
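To see how much VRAM is actually in use between runs, standard PyTorch calls suffice:

```python
# Inspect current VRAM usage and release cached, unused blocks.
import torch

print(f"Allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
torch.cuda.empty_cache()  # hand cached blocks back to the driver
```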
**Audio Processing Fails**
- Ensure FFmpeg is installed and accessible
- Try converting the audio to WAV format first (a conversion one-liner is sketched below)
- Check that the audio file is not corrupted
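A conversion to 16-bit PCM WAV with standard FFmpeg flags (file names are illustrative):

```python
# Re-encode an audio file to WAV before retrying; requires ffmpeg on PATH.
import subprocess

subprocess.run(
    ["ffmpeg", "-y", "-i", "song.mp3", "-ar", "44100", "-c:a", "pcm_s16le", "song.wav"],
    check=True,
)
```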
**Model Loading Issues**
- Check internet connection (models download on first use)
- Verify sufficient disk space for model files
- Clear the Hugging Face cache if models are corrupted (an inspection sketch follows)
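The Hugging Face cache can be inspected from Python with `huggingface_hub` (installed alongside the model loaders), which helps spot oversized or stale downloads before deleting anything:

```python
# List cached model repos and their on-disk sizes.
from huggingface_hub import scan_cache_dir

cache = scan_cache_dir()
print(f"Total cache size: {cache.size_on_disk / 2**30:.2f} GiB")
for repo in cache.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 2**30:.2f} GiB")
```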
**Slow Generation**
- Use "Fast" quality preset for testing
- Reduce crossfade duration to 0 for hard cuts
- Use dynamic FPS instead of fixed high FPS
### Performance Monitoring
Monitor system resources during generation (a polling sketch follows this list):
- **GPU Usage**: Should be near 100% during image/video generation
- **RAM Usage**: Peak during model loading and video processing
- **Disk I/O**: High during model downloads and video encoding
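A small NVML polling loop can log these numbers during a run; it needs the `nvidia-ml-py` package (imported as `pynvml`), which is an extra dependency beyond `requirements.txt`:

```python
# Poll GPU utilization and VRAM once per second until interrupted.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {util.gpu}% | VRAM {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```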
## 🤝 Contributing
Contributions are welcome! Areas for improvement:
- Additional subtitle animation templates
- Support for more AI models
- Performance optimizations
- Additional audio/video formats
- Batch processing capabilities
## 📄 License
This project uses open-source models and libraries. Please check individual model licenses for usage rights.
## 🙏 Acknowledgments
- **OpenAI Whisper** for speech recognition
- **Stability AI** for Stable Diffusion models
- **Hugging Face** for model hosting and transformers
- **PyCaps** for kinetic subtitle rendering
- **Gradio** for the web interface