---
license: apache-2.0
title: audio2kineticvid
sdk: gradio
emoji: πŸš€
colorFrom: red
colorTo: yellow
---

# Audio2KineticVid

Audio2KineticVid is a comprehensive tool that converts an audio track (e.g., a song) into a dynamic music video with AI-generated scenes and synchronized kinetic typography (animated subtitles). Everything runs locally using open-source models; no external APIs or paid services are required.

## ✨ Features

- 🎀 **Whisper Transcription**: Choose from multiple Whisper models (tiny to large) for audio transcription with word-level timestamps (see the sketch below).
- 🧠 **Adaptive Lyric Segmentation**: Splits lyrics into segments at natural pause points so scene changes align with the song.
- 🎨 **Customizable Scene Generation**: Use various LLM models to generate scene descriptions for each lyric segment, with customizable system prompts and word limits.
- πŸ€– **Multiple AI Models**: Select from a variety of text-to-image models (SDXL, SD 1.5, etc.) and video generation models.
- 🎬 **Style Consistency Options**: Choose between independent scene generation and img2img-based style consistency for a more cohesive visual experience.
- πŸ” **Preview & Inspection**: Preview scenes before full generation and inspect all generated images in a gallery view.
- πŸ”„ **Seamless Transitions**: Configurable crossfade transitions between scene clips.
- πŸŽͺ **Kinetic Subtitles**: PyCaps renders styled animated subtitles in sync with the original audio.
- πŸ”’ **Fully Local & Open-Source**: All models are open-license and run on a local GPU.
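
As a minimal sketch of the word-level timestamps the pipeline builds on, here is how they can be obtained with the openai-whisper package (the project's own wrapper lives in `utils/transcribe.py`, which may differ in detail):

```python
# Minimal sketch of word-level transcription with openai-whisper.
# Assumes `pip install openai-whisper` and an audio file at song.mp3.
import whisper

model = whisper.load_model("base")  # "tiny" ... "large", as in the UI
result = model.transcribe("song.mp3", word_timestamps=True)

# Each segment carries a list of words with start/end times in seconds.
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["start"]:6.2f}s  {word["word"]}')
```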

## πŸ’» System Requirements

### Hardware Requirements

- **GPU**: NVIDIA GPU with 8GB+ VRAM (recommended: RTX 3080/4070 or better)
- **RAM**: 16GB+ system RAM
- **Storage**: SSD recommended for faster model loading and video processing
- **CPU**: Modern multi-core processor

### Software Requirements

- **Operating System**: Linux, Windows, or macOS
- **Python**: 3.8 or higher
- **CUDA**: NVIDIA CUDA toolkit (for GPU acceleration)
- **FFmpeg**: For audio/video processing

## πŸš€ Quick Start (Gradio Web UI)

### 1. Install Dependencies

Ensure you have a suitable GPU (NVIDIA T4/A10 or better) with CUDA installed. Then install the required Python packages:

```bash
pip install -r requirements.txt
```
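
To confirm that PyTorch can see your GPU before launching, a quick sanity check (assuming PyTorch was installed by `requirements.txt`):

```python
# Verify that CUDA is visible to PyTorch before running the app.
import torch

print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3080"
```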

### 2. Launch the Web Interface

```bash
python app.py
```

This will start a Gradio web interface accessible at http://localhost:7860.
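
If port 7860 is already taken, Gradio honors the `GRADIO_SERVER_PORT` environment variable (standard Gradio behavior, not specific to this project):

```bash
# Run the UI on a different port, e.g. 8080
GRADIO_SERVER_PORT=8080 python app.py
```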

### 3. Using the Interface

1. **Upload Audio**: Choose an audio file (MP3, WAV, M4A, etc.)
2. **Select Quality Preset**: Choose from Fast, Balanced, or High Quality
3. **Configure Models**: Optionally adjust AI models in the "AI Models" tab
4. **Customize Style**: Modify scene prompts and visual style in other tabs
5. **Preview**: Click "Preview First Scene" to test settings quickly
6. **Generate**: Click "Generate Complete Music Video" to create the full video

πŸ“ Usage Tips

Audio Selection

  • Format: MP3, WAV, M4A, FLAC, OGG supported
  • Quality: Clear vocals work best for transcription
  • Length: 30 seconds to 3 minutes recommended for testing
  • Content: Songs with distinct lyrics produce better results

Performance Optimization

  • Fast Generation: Use 512x288 resolution with "tiny" Whisper model
  • Best Quality: Use 1280x720 with "large" Whisper model (requires more VRAM)
  • Memory Issues: Lower resolution, use smaller models, or reduce max segments

Style Customization

  • Visual Style Keywords: Add style terms like "cinematic, vibrant, neon" to influence all scenes
  • Prompt Template: Customize how the AI interprets lyrics into visual scenes
  • Consistency Mode: Use "Consistent (Img2Img)" for coherent visual style across scenes

πŸ› οΈ Advanced Usage

Command Line Interface

For batch processing or automation, you can use the smoke test script:

bash scripts/smoke_test.sh your_audio.mp3
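
To process a whole directory of tracks, a simple shell loop works (assuming each invocation takes one audio file, as shown above):

```bash
# Run the smoke-test pipeline over every MP3 in ./songs
for f in songs/*.mp3; do
    bash scripts/smoke_test.sh "$f"
done
```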

### Custom Templates

Create custom subtitle styles by adding new templates in the `templates/` directory:

1. Create a new folder: `templates/your_style/`
2. Add `pycaps.template.json` with animation definitions
3. Add `styles.css` with visual styling
4. The template will appear in the interface dropdown
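
The quickest way to start is to copy one of the bundled templates and edit it; for example (paths taken from the project structure below):

```bash
# Scaffold a new template from the bundled "minimalist" style
mkdir -p templates/your_style
cp templates/minimalist/pycaps.template.json templates/your_style/
cp templates/minimalist/styles.css templates/your_style/
# then edit both files to define your animations and styling
```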

### Model Configuration

Supported models are defined in the utility modules:

- **Whisper**: `utils/transcribe.py` - add new Whisper model names
- **LLM**: `utils/prompt_gen.py` - add new language models
- **Image**: `utils/video_gen.py` - add new Stable Diffusion variants
- **Video**: `utils/video_gen.py` - add new video diffusion models
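
Registering a model typically means adding a Hugging Face model ID to a mapping in the relevant module. A hypothetical illustration only (the variable names here are invented; check `utils/video_gen.py` for the actual structure before editing):

```python
# Hypothetical example - the real variable names and layout live in
# utils/video_gen.py; only the Hugging Face model IDs below are real.
IMAGE_MODELS = {
    "SDXL": "stabilityai/stable-diffusion-xl-base-1.0",
    "SD 1.5": "runwayml/stable-diffusion-v1-5",
    # Add a new entry: display name -> Hugging Face model ID
    "SDXL Turbo": "stabilityai/sdxl-turbo",
}
```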

## πŸ§ͺ Testing

Run the basic functionality test:

```bash
python test_basic.py
```

For a complete end-to-end test with a sample audio file:

```bash
python test.py
```

πŸ“ Project Structure

Audio2KineticVid/
β”œβ”€β”€ app.py                  # Main Gradio web interface
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ utils/                  # Core processing modules
β”‚   β”œβ”€β”€ transcribe.py      # Whisper audio transcription
β”‚   β”œβ”€β”€ segment.py         # Intelligent lyric segmentation  
β”‚   β”œβ”€β”€ prompt_gen.py      # LLM scene description generation
β”‚   β”œβ”€β”€ video_gen.py       # Image and video generation
β”‚   └── glue.py           # Video stitching and subtitle overlay
β”œβ”€β”€ templates/             # Subtitle animation templates
β”‚   β”œβ”€β”€ minimalist/       # Clean, simple subtitle style
β”‚   └── dynamic/          # Dynamic animations
β”œβ”€β”€ scripts/              # Utility scripts
β”‚   └── smoke_test.sh     # End-to-end testing script
└── test_basic.py         # Component testing

## 🎬 Output

The application generates:

- **Final Video**: MP4 file with synchronized audio, visuals, and animated subtitles
- **Scene Images**: Individual AI-generated images for each lyric segment
- **Scene Descriptions**: Text prompts used for image generation
- **Segmentation Data**: Analyzed lyric segments with timing information

## πŸ”§ Troubleshooting

### Common Issues

#### GPU Memory Errors

- Reduce the video resolution (use 512x288 instead of 1280x720)
- Use smaller models (tiny/base Whisper, SD 1.5 instead of SDXL)
- Close other GPU-intensive applications
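
If you are experimenting with the utility modules directly, releasing PyTorch's cached allocations between pipeline stages can also help (standard PyTorch calls, not project-specific):

```python
# Free cached GPU memory between pipeline stages (standard PyTorch).
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
```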

#### Audio Processing Fails

- Ensure FFmpeg is installed and accessible on your PATH
- Try converting the audio to WAV format first (see below)
- Check that the audio file is not corrupted
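
A standard FFmpeg invocation covers the WAV conversion:

```bash
# Re-encode to 16-bit PCM WAV, which virtually every tool accepts
ffmpeg -i input.mp3 -ar 44100 -ac 2 -c:a pcm_s16le output.wav
```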

#### Model Loading Issues

- Check your internet connection (models are downloaded on first use)
- Verify there is sufficient disk space for model files
- Clear the Hugging Face cache if models are corrupted (see below)
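
The Hugging Face cache can be pruned selectively with the `huggingface_hub` CLI, or removed wholesale. The path shown is the default cache location; adjust it if you have set `HF_HOME`:

```bash
# Interactive, selective cleanup (requires: pip install "huggingface_hub[cli]")
huggingface-cli delete-cache

# Or remove the entire model cache (models re-download on next use)
rm -rf ~/.cache/huggingface/hub
```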

#### Slow Generation

- Use the "Fast" quality preset for testing
- Reduce the crossfade duration to 0 for hard cuts
- Use dynamic FPS instead of a fixed high FPS

### Performance Monitoring

Monitor system resources during generation:

- **GPU Usage**: Should be near 100% during image/video generation
- **RAM Usage**: Peaks during model loading and video processing
- **Disk I/O**: High during model downloads and video encoding
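
On NVIDIA systems, the simplest live view of GPU utilization and VRAM is `nvidia-smi`:

```bash
# Refresh GPU utilization and memory usage every second
watch -n 1 nvidia-smi
```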

## 🀝 Contributing

Contributions are welcome! Areas for improvement:

- Additional subtitle animation templates
- Support for more AI models
- Performance optimizations
- Additional audio/video formats
- Batch processing capabilities

## πŸ“„ License

This project is released under the Apache 2.0 license (see the metadata above). It uses open-source models and libraries; please check individual model licenses for usage rights.

πŸ™ Acknowledgments

  • OpenAI Whisper for speech recognition
  • Stability AI for Stable Diffusion models
  • Hugging Face for model hosting and transformers
  • PyCaps for kinetic subtitle rendering
  • Gradio for the web interface