---
license: apache-2.0
title: audio2kineticvid
sdk: gradio
colorFrom: red
colorTo: yellow
---
# Audio2KineticVid

Audio2KineticVid is a comprehensive tool that converts an audio track (e.g., a song) into a dynamic music video with AI-generated scenes and synchronized kinetic typography (animated subtitles). Everything runs locally using open-source models; no external APIs or paid services are required.
## Features

- **Whisper Transcription:** Choose from multiple Whisper models (tiny to large) for audio transcription with word-level timestamps.
- **Adaptive Lyric Segmentation:** Splits lyrics into segments at natural pause points to align scene changes with the song (see the sketch after this list).
- **Customizable Scene Generation:** Use various LLM models to generate scene descriptions for each lyric segment, with customizable system prompts and word limits.
- **Multiple AI Models:** Select from a variety of text-to-image models (SDXL, SD 1.5, etc.) and video generation models.
- **Style Consistency Options:** Choose between independent scene generation or img2img-based style consistency for a more cohesive visual experience.
- **Preview & Inspection:** Preview scenes before full generation and inspect all generated images in a gallery view.
- **Seamless Transitions:** Configurable crossfade transitions between scene clips.
- **Kinetic Subtitles:** PyCaps renders styled animated subtitles that appear in sync with the original audio.
- **Fully Local & Open-Source:** All models are openly licensed and run on a local GPU.
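The segmentation step works on the word-level timestamps from Whisper. A minimal sketch of the pause-based idea (the real logic lives in `utils/segment.py`; the function name and threshold below are illustrative assumptions, not the project's API):

```python
# Illustrative pause-based segmentation; NOT the project's actual code.
# Assumes Whisper-style word dicts: {"word": str, "start": float, "end": float}.

def split_on_pauses(words, min_gap=0.6):
    """Start a new segment wherever the silence between consecutive
    words exceeds min_gap seconds."""
    segments, current = [], []
    for word in words:
        if current and word["start"] - current[-1]["end"] >= min_gap:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments

words = [
    {"word": "neon", "start": 0.0, "end": 0.4},
    {"word": "lights", "start": 0.5, "end": 0.9},
    {"word": "fade", "start": 2.1, "end": 2.5},  # 1.2 s gap -> new segment
]
print([[w["word"] for w in seg] for seg in split_on_pauses(words)])
# [['neon', 'lights'], ['fade']]
```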
## System Requirements

### Hardware Requirements

- **GPU**: NVIDIA GPU with 8GB+ VRAM (recommended: RTX 3080/4070 or better)
- **RAM**: 16GB+ system RAM
- **Storage**: SSD recommended for faster model loading and video processing
- **CPU**: Modern multi-core processor

### Software Requirements

- **Operating System**: Linux, Windows, or macOS
- **Python**: 3.8 or higher
- **CUDA**: NVIDIA CUDA toolkit (for GPU acceleration)
- **FFmpeg**: For audio/video processing (a quick prerequisite check is sketched below)
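Once PyTorch is installed, you can verify the GPU and FFmpeg prerequisites with a few generic calls (this snippet is not part of the project):

```python
import shutil
import torch

# Check the two external prerequisites: a CUDA-capable GPU and FFmpeg.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
print("FFmpeg:", shutil.which("ffmpeg") or "NOT found on PATH")
```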
## Quick Start (Gradio Web UI)

### 1. Install Dependencies

Ensure you have a suitable GPU (NVIDIA T4/A10 or better) with CUDA installed. Then install the required Python packages:

```bash
pip install -r requirements.txt
```

### 2. Launch the Web Interface

```bash
python app.py
```

This will start a Gradio web interface accessible at `http://localhost:7860`.
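If you need a different port or want to reach the UI from another machine, Gradio's standard launch options apply. A minimal sketch, assuming `app.py` uses the usual `demo.launch()` pattern (edit the real call in `app.py`; the interface below is only a stand-in):

```python
import gradio as gr

# Stand-in interface; only the launch() options matter here.
demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
demo.launch(
    server_name="0.0.0.0",  # listen on all interfaces, not just localhost
    server_port=7860,       # change if 7860 is already taken
    share=False,            # True creates a temporary public gradio.live URL
)
```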
### 3. Using the Interface

1. **Upload Audio**: Choose an audio file (MP3, WAV, M4A, etc.)
2. **Select Quality Preset**: Choose from Fast, Balanced, or High Quality
3. **Configure Models**: Optionally adjust AI models in the "AI Models" tab
4. **Customize Style**: Modify scene prompts and visual style in other tabs
5. **Preview**: Click "Preview First Scene" to test settings quickly
6. **Generate**: Click "Generate Complete Music Video" to create the full video
## Usage Tips

### Audio Selection

- **Format**: MP3, WAV, M4A, FLAC, and OGG are supported
- **Quality**: Clear vocals work best for transcription
- **Length**: 30 seconds to 3 minutes recommended for testing
- **Content**: Songs with distinct lyrics produce better results

### Performance Optimization

- **Fast generation**: Use 512x288 resolution with the "tiny" Whisper model
- **Best quality**: Use 1280x720 with the "large" Whisper model (requires more VRAM)
- **Memory issues**: Lower the resolution, use smaller models, or reduce the maximum number of segments (see the sketch after this list)
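When VRAM is tight, the standard diffusers memory levers are worth trying first. A hedged sketch, assuming `utils/video_gen.py` builds an ordinary diffusers pipeline you can reach (the checkpoint ID is just the public SDXL base model; `enable_model_cpu_offload` needs the `accelerate` package):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe.enable_attention_slicing()   # compute attention in chunks to cut peak VRAM
pipe.enable_model_cpu_offload()   # keep idle submodules in system RAM
```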
### Style Customization

- **Visual Style Keywords**: Add style terms like "cinematic, vibrant, neon" to influence all scenes
- **Prompt Template**: Customize how the AI interprets lyrics into visual scenes (illustrated below)
- **Consistency Mode**: Use "Consistent (Img2Img)" for a coherent visual style across scenes
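As a toy illustration of how style keywords and a prompt template combine (the placeholder names are hypothetical, not the app's actual template variables):

```python
# Hypothetical template; the app's real template variables may differ.
TEMPLATE = "A scene of {lyric}, {style}, highly detailed"
print(TEMPLATE.format(
    lyric="city lights blur past the window",
    style="cinematic, vibrant, neon",
))
# A scene of city lights blur past the window, cinematic, vibrant, neon, highly detailed
```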
## Advanced Usage

### Command Line Interface

For batch processing or automation, you can use the smoke test script:

```bash
bash scripts/smoke_test.sh your_audio.mp3
```
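For unattended runs over a folder of tracks, a small driver loop is one option (a generic sketch; it assumes the script accepts a single audio path, as in the call above):

```python
import pathlib
import subprocess

# Run the smoke-test script once per MP3 in songs/.
for audio in sorted(pathlib.Path("songs").glob("*.mp3")):
    print(f"Processing {audio} ...")
    subprocess.run(["bash", "scripts/smoke_test.sh", str(audio)], check=True)
```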
### Custom Templates

Create custom subtitle styles by adding new templates in the `templates/` directory:

1. Create a new folder: `templates/your_style/`
2. Add `pycaps.template.json` with animation definitions
3. Add `styles.css` with visual styling
4. The template will appear in the interface dropdown
### Model Configuration

Supported models are defined in the utility modules (an illustrative registry sketch follows this list):

- **Whisper**: `utils/transcribe.py` - add new Whisper model names
- **LLM**: `utils/prompt_gen.py` - add new language models
- **Image**: `utils/video_gen.py` - add new Stable Diffusion variants
- **Video**: `utils/video_gen.py` - add new video diffusion models
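The exact structures vary by module, but the pattern is a simple name-to-checkpoint mapping. A hypothetical example (the variable names are illustrative; the Hugging Face repo IDs are real public checkpoints):

```python
# Hypothetical registry shapes; the actual structures in utils/*.py may differ.
WHISPER_MODELS = ["tiny", "base", "small", "medium", "large"]
IMAGE_MODELS = {
    "SDXL": "stabilityai/stable-diffusion-xl-base-1.0",
    "SD 2.1": "stabilityai/stable-diffusion-2-1",
}
VIDEO_MODELS = {
    "SVD-XT": "stabilityai/stable-video-diffusion-img2vid-xt",
}
```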
## Testing

Run the basic functionality test:

```bash
python test_basic.py
```

For a complete end-to-end test with a sample audio file:

```bash
python test.py
```
## Project Structure

```
Audio2KineticVid/
├── app.py                 # Main Gradio web interface
├── requirements.txt       # Python dependencies
├── utils/                 # Core processing modules
│   ├── transcribe.py      # Whisper audio transcription
│   ├── segment.py         # Intelligent lyric segmentation
│   ├── prompt_gen.py      # LLM scene description generation
│   ├── video_gen.py       # Image and video generation
│   └── glue.py            # Video stitching and subtitle overlay
├── templates/             # Subtitle animation templates
│   ├── minimalist/        # Clean, simple subtitle style
│   └── dynamic/           # Dynamic animations
├── scripts/               # Utility scripts
│   └── smoke_test.sh      # End-to-end testing script
└── test_basic.py          # Component testing
```
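At a high level, the modules compose into a linear pipeline. A hedged sketch with hypothetical function names (the real entry points exported by `utils/` may be named differently):

```python
# Pipeline sketch; the run() names are stand-ins, not the real exports.
from utils import transcribe, segment, prompt_gen, video_gen, glue

words = transcribe.run("song.mp3")                # word-level timestamps
segments = segment.run(words)                     # pause-aligned lyric segments
prompts = [prompt_gen.run(s) for s in segments]   # LLM scene descriptions
clips = [video_gen.run(p) for p in prompts]       # image -> video per segment
glue.run(clips, audio="song.mp3", subtitles=segments, out="final.mp4")
```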
## Output

The application generates:

- **Final Video**: MP4 file with synchronized audio, visuals, and animated subtitles
- **Scene Images**: Individual AI-generated images for each lyric segment
- **Scene Descriptions**: Text prompts used for image generation
- **Segmentation Data**: Analyzed lyric segments with timing information
## Troubleshooting

### Common Issues

**GPU Memory Errors**

- Reduce video resolution (use 512x288 instead of 1280x720)
- Use smaller models (tiny/base Whisper, SD 1.5 instead of SDXL)
- Close other GPU-intensive applications

**Audio Processing Fails**

- Ensure FFmpeg is installed and on your PATH
- Try converting the audio to WAV first (see the snippet after this list)
- Check that the audio file is not corrupted
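Converting to WAV with FFmpeg directly is a reliable first step; these are standard FFmpeg flags, nothing project-specific:

```python
import subprocess

# Re-encode to 16-bit 44.1 kHz stereo WAV, a safe input for transcription.
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.mp3",
     "-ar", "44100", "-ac", "2", "-c:a", "pcm_s16le", "output.wav"],
    check=True,
)
```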
**Model Loading Issues**

- Check your internet connection (models download on first use)
- Verify sufficient disk space for model files
- Clear the Hugging Face cache if models are corrupted

**Slow Generation**

- Use the "Fast" quality preset for testing
- Reduce crossfade duration to 0 for hard cuts
- Use dynamic FPS instead of a fixed high FPS
### Performance Monitoring

Monitor system resources during generation (a VRAM logging snippet follows this list):

- **GPU Usage**: Should be near 100% during image/video generation
- **RAM Usage**: Peaks during model loading and video processing
- **Disk I/O**: High during model downloads and video encoding
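If you prefer to log VRAM pressure from Python rather than watch `nvidia-smi`, PyTorch exposes its allocator counters (generic PyTorch calls, not an app hook):

```python
import torch

# Current and peak VRAM use on GPU 0, as seen by the PyTorch allocator.
if torch.cuda.is_available():
    used = torch.cuda.memory_allocated(0) / 1024**3
    peak = torch.cuda.max_memory_allocated(0) / 1024**3
    print(f"VRAM in use: {used:.2f} GB (peak {peak:.2f} GB)")
```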
## Contributing

Contributions are welcome! Areas for improvement:

- Additional subtitle animation templates
- Support for more AI models
- Performance optimizations
- Additional audio/video formats
- Batch processing capabilities
## License

This project is released under the Apache-2.0 license and builds on open-source models and libraries. Please check individual model licenses for usage rights.
## Acknowledgments

- **OpenAI Whisper** for speech recognition
- **Stability AI** for Stable Diffusion models
- **Hugging Face** for model hosting and Transformers
- **PyCaps** for kinetic subtitle rendering
- **Gradio** for the web interface