---
license: apache-2.0
title: audio2kineticvid
sdk: gradio
emoji: πŸš€
colorFrom: red
colorTo: yellow
---
# Audio2KineticVid

Audio2KineticVid converts an audio track (e.g., a song) into a dynamic music video with AI-generated scenes and synchronized kinetic typography (animated subtitles). Everything runs locally with open-source models; no external APIs or paid services are required.

## ✨ Features

- **🎀 Whisper Transcription:** Choose from multiple Whisper models (tiny to large) for audio transcription with word-level timestamps.
- **🧠 Adaptive Lyric Segmentation:** Splits lyrics into segments at natural pause points to align scene changes with the song.
- **🎨 Customizable Scene Generation:** Use various LLM models to generate scene descriptions for each lyric segment, with customizable system prompts and word limits.
- **πŸ€– Multiple AI Models:** Select from a variety of text-to-image models (SDXL, SD 1.5, etc.) and video generation models.
- **🎬 Style Consistency Options:** Choose between independent scene generation or img2img-based style consistency for a more cohesive visual experience.
- **πŸ” Preview & Inspection:** Preview scenes before full generation and inspect all generated images in a gallery view.
- **πŸ”„ Seamless Transitions:** Configurable crossfade transitions between scene clips.
- **πŸŽͺ Kinetic Subtitles:** PyCaps renders styled animated subtitles that appear in sync with the original audio.
- **πŸ”’ Fully Local & Open-Source:** All models are open-license and run on a local GPU.

## πŸ’» System Requirements

### Hardware Requirements
- **GPU**: NVIDIA GPU with 8GB+ VRAM (recommended: RTX 3080/4070 or better)
- **RAM**: 16GB+ system RAM 
- **Storage**: SSD recommended for faster model loading and video processing
- **CPU**: Modern multi-core processor

### Software Requirements
- **Operating System**: Linux, Windows, or macOS
- **Python**: 3.8 or higher
- **CUDA**: NVIDIA CUDA toolkit (for GPU acceleration)
- **FFmpeg**: For audio/video processing

## πŸš€ Quick Start (Gradio Web UI)

### 1. Install Dependencies

Ensure CUDA is set up on a suitable NVIDIA GPU (e.g., T4/A10 or better; see System Requirements above), then install the required Python packages:

```bash
pip install -r requirements.txt
```
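
Assuming `requirements.txt` pulls in PyTorch (which both Whisper and Stable Diffusion need), a quick sanity check that CUDA is visible:

```python
import torch

# Expect True plus your GPU's name; otherwise install a CUDA-enabled torch build.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```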

### 2. Launch the Web Interface

```bash
python app.py
```

This will start a Gradio web interface accessible at `http://localhost:7860`.
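
Gradio listens on port 7860 by default. If that port is busy or you want LAN access, the standard `launch()` keywords apply; a minimal sketch, assuming you edit the launch call in `app.py` (the tiny `demo` below is only a stand-in for the app's real interface):

```python
import gradio as gr

# Stand-in for app.py's actual interface; only the launch() knobs matter here.
with gr.Blocks() as demo:
    gr.Markdown("Audio2KineticVid")

demo.launch(server_name="0.0.0.0", server_port=7861)  # reachable on the LAN
```

Gradio also honors the `GRADIO_SERVER_PORT` environment variable if you prefer not to touch the code.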

### 3. Using the Interface

1. **Upload Audio**: Choose an audio file (MP3, WAV, M4A, etc.)
2. **Select Quality Preset**: Choose from Fast, Balanced, or High Quality
3. **Configure Models**: Optionally adjust AI models in the "AI Models" tab
4. **Customize Style**: Modify scene prompts and visual style in other tabs  
5. **Preview**: Click "Preview First Scene" to test settings quickly
6. **Generate**: Click "Generate Complete Music Video" to create the full video

## πŸ“ Usage Tips

### Audio Selection
- **Format**: MP3, WAV, M4A, FLAC, OGG supported
- **Quality**: Clear vocals work best for transcription
- **Length**: 30 seconds to 3 minutes recommended for testing
- **Content**: Songs with distinct lyrics produce better results

### Performance Optimization
- **Fast Generation**: Use 512x288 resolution with "tiny" Whisper model
- **Best Quality**: Use 1280x720 with "large" Whisper model (requires more VRAM)
- **Memory Issues**: Lower resolution, use smaller models, or reduce max segments

### Style Customization
- **Visual Style Keywords**: Add style terms like "cinematic, vibrant, neon" to influence all scenes
- **Prompt Template**: Customize how the AI interprets lyrics into visual scenes (example below)
- **Consistency Mode**: Use "Consistent (Img2Img)" for coherent visual style across scenes
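
As a rough illustration, a prompt template reduces to a format string over each lyric segment; the placeholder names below are hypothetical (the real variables live in `utils/prompt_gen.py`):

```python
# Hypothetical template; the actual variable names live in utils/prompt_gen.py.
PROMPT_TEMPLATE = (
    "Describe a single vivid film scene, in at most {max_words} words, "
    "matching these lyrics: \"{lyrics}\". Style: {style_keywords}."
)

print(PROMPT_TEMPLATE.format(
    max_words=30,
    lyrics="we drove all night under neon rain",
    style_keywords="cinematic, vibrant, neon",
))
```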

## πŸ› οΈ Advanced Usage

### Command Line Interface

For batch processing or automation, you can use the smoke test script:

```bash
bash scripts/smoke_test.sh your_audio.mp3
```
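
Building on that, a minimal batch driver might loop the script over a folder of audio files (the `songs/` directory here is hypothetical):

```python
import subprocess
from pathlib import Path

# Run the end-to-end smoke test for every MP3 in a (hypothetical) songs/ folder.
for audio in sorted(Path("songs").glob("*.mp3")):
    subprocess.run(["bash", "scripts/smoke_test.sh", str(audio)], check=True)
```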

### Custom Templates

Create custom subtitle styles by adding new templates in the `templates/` directory (a scaffolding sketch follows these steps):

1. Create a new folder: `templates/your_style/`
2. Add `pycaps.template.json` with animation definitions
3. Add `styles.css` with visual styling
4. The template will appear in the interface dropdown
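
A minimal sketch of the scaffolding (file contents omitted, since the `pycaps.template.json` schema is defined by PyCaps itself; consult the bundled `templates/minimalist/` for a working reference):

```python
from pathlib import Path

# Create the folder layout the app expects for a new subtitle template.
style = Path("templates") / "your_style"
style.mkdir(parents=True, exist_ok=True)
(style / "pycaps.template.json").touch()  # animation definitions (PyCaps schema)
(style / "styles.css").touch()            # visual styling
```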

### Model Configuration

Supported models are defined in the utility modules (a sketch of the registry pattern follows the list):
- **Whisper**: `utils/transcribe.py` - Add new Whisper model names
- **LLM**: `utils/prompt_gen.py` - Add new language models  
- **Image**: `utils/video_gen.py` - Add new Stable Diffusion variants
- **Video**: `utils/video_gen.py` - Add new video diffusion models
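
A hypothetical sketch of the pattern (the real structures live in the modules above and may be shaped differently; the Hugging Face repo IDs shown are real model IDs):

```python
# Hypothetical registry; the actual lists live in utils/video_gen.py.
IMAGE_MODELS = {
    "SDXL": "stabilityai/stable-diffusion-xl-base-1.0",
    "SD 1.5": "runwayml/stable-diffusion-v1-5",  # historic ID; mirrors exist if removed
}

# Adding an entry like this would surface a new choice in the interface dropdown.
IMAGE_MODELS["SD 2.1"] = "stabilityai/stable-diffusion-2-1"
```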

## πŸ§ͺ Testing

Run the basic functionality test:

```bash
python test_basic.py
```

For a complete end-to-end test with a sample audio file:

```bash
python test.py
```

## πŸ“ Project Structure

```
Audio2KineticVid/
β”œβ”€β”€ app.py                  # Main Gradio web interface
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ utils/                  # Core processing modules
β”‚   β”œβ”€β”€ transcribe.py      # Whisper audio transcription
β”‚   β”œβ”€β”€ segment.py         # Intelligent lyric segmentation  
β”‚   β”œβ”€β”€ prompt_gen.py      # LLM scene description generation
β”‚   β”œβ”€β”€ video_gen.py       # Image and video generation
β”‚   └── glue.py           # Video stitching and subtitle overlay
β”œβ”€β”€ templates/             # Subtitle animation templates
β”‚   β”œβ”€β”€ minimalist/       # Clean, simple subtitle style
β”‚   └── dynamic/          # Dynamic animations
β”œβ”€β”€ scripts/              # Utility scripts
β”‚   └── smoke_test.sh     # End-to-end testing script
└── test_basic.py         # Component testing
```

## 🎬 Output

The application generates:
- **Final Video**: MP4 file with synchronized audio, visuals, and animated subtitles
- **Scene Images**: Individual AI-generated images for each lyric segment
- **Scene Descriptions**: Text prompts used for image generation
- **Segmentation Data**: Analyzed lyric segments with timing information

## πŸ”§ Troubleshooting

### Common Issues

**GPU Memory Errors**
- Reduce video resolution (use 512x288 instead of 1280x720)
- Use smaller models (tiny/base Whisper, SD 1.5 instead of SDXL); see the diffusers sketch below
- Close other GPU-intensive applications
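
If the errors persist and you are comfortable editing `utils/video_gen.py`, diffusers exposes standard memory-saving switches; a minimal sketch, assuming a stock Stable Diffusion pipeline (the project may already set some of these):

```python
import torch
from diffusers import StableDiffusionPipeline

# SD 2.1 here is just an example of a checkpoint lighter than SDXL.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()      # lower peak VRAM at some speed cost
# Or keep weights on the CPU and stream them to the GPU per module
# (instead of .to("cuda"); requires the `accelerate` package):
# pipe.enable_model_cpu_offload()
```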

**Audio Processing Fails**
- Ensure FFmpeg is installed and accessible on your PATH
- Try converting audio to WAV format first (snippet below)
- Check that the audio file is not corrupted
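
For the WAV conversion, one call through FFmpeg works (16 kHz mono matches what Whisper resamples to internally; the filenames are placeholders):

```python
import subprocess

# Re-encode any input to 16 kHz mono WAV before retrying transcription.
subprocess.run(
    ["ffmpeg", "-y", "-i", "song.m4a", "-ar", "16000", "-ac", "1", "song.wav"],
    check=True,
)
```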

**Model Loading Issues**
- Check your internet connection (models download on first use)
- Verify you have sufficient disk space for model files
- Clear the Hugging Face cache if models appear corrupted (inspection snippet below)
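
Before deleting anything, `huggingface_hub` can report what is cached and how large each repo is:

```python
from huggingface_hub import scan_cache_dir

# List cached repos by size so you delete only what you need to.
cache = scan_cache_dir()
print(f"Total: {cache.size_on_disk / 1e9:.1f} GB")
for repo in sorted(cache.repos, key=lambda r: r.size_on_disk, reverse=True):
    print(f"{repo.size_on_disk / 1e9:6.2f} GB  {repo.repo_id}")
```

The same package also ships an interactive `huggingface-cli delete-cache` command for cleanup.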

**Slow Generation**
- Use "Fast" quality preset for testing
- Reduce crossfade duration to 0 for hard cuts
- Use dynamic FPS instead of fixed high FPS

### Performance Monitoring

Monitor system resources during generation:
- **GPU Usage**: Should be near 100% during image/video generation
- **RAM Usage**: Peak during model loading and video processing
- **Disk I/O**: High during model downloads and video encoding

## 🀝 Contributing

Contributions are welcome! Areas for improvement:
- Additional subtitle animation templates
- Support for more AI models
- Performance optimizations
- Additional audio/video formats
- Batch processing capabilities

## πŸ“„ License

The application code is released under the Apache-2.0 license (see the metadata above). The models and libraries it relies on are open-source but carry their own licenses; please check individual model licenses for usage rights.

## πŸ™ Acknowledgments

- **OpenAI Whisper** for speech recognition
- **Stability AI** for Stable Diffusion models  
- **Hugging Face** for model hosting and transformers
- **PyCaps** for kinetic subtitle rendering
- **Gradio** for the web interface