---
title: ClipQuery
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
---

# ClipQuery – Ask Questions of Any Podcast / Video and Hear the Answer

ClipQuery turns *any* local audio or video file into a searchable, conversational experience.
It automatically transcribes the media, indexes each sentence with embeddings, and lets you
ask natural-language questions. For each question it returns:

1. A **30-second audio clip** from the exact place in the media where the answer occurs.
2. The **timestamp** of the clip.
3. A live **LangChain debug log** so you can inspect what happened behind the scenes.

---

## How It Works

```
┌───────────────┐  transcribe & segment   ┌───────────────┐
│  audio / mp4  │ ───────────────────────▶│  transcripts  │
└───────────────┘                         └───────────────┘
        │                                         │
        │ build embeddings (SBERT)                │ metadata: {start, end}
        ▼                                         ▼
┌───────────────────────┐   store vectors   ┌────────────────────┐
│ HuggingFace           │ ─────────────────▶│ FAISS VectorStore  │
│ Sentence-Transformer  │                   └────────────────────┘
└───────────────────────┘                            ▲
                                                     │ retrieve top-k
                                                     ▼
                                           ┌────────────────────┐
                                           │ ChatOllama (phi3)  │
                                           │ RetrievalQA chain  │
                                           └────────────────────┘
```

1. **Transcription** – `index_builder.py` uses `faster-whisper` to generate
   word-level timestamps, saved as `segments.json`.
2. **Embedding + Index** – Sentence-Transformer (MiniLM) embeddings are
   stored in a **FAISS** index (`data/*`).
3. **Question Answering** – a local LLM (Ollama `phi3`) is wrapped in
   `RetrievalQAWithSourcesChain` to pull the most relevant transcript
   chunks and generate an answer (steps 1–3 are sketched in code below).
4. **Clip Extraction** – `clipper.py` calls `ffmpeg` to cut a 30 s MP3
   between the `start` and `end` timestamps (extended to 30 s if shorter).
5. **Debug Logging** – a custom `JSONLCallbackHandler` dumps every
   LangChain event to `langchain_debug.jsonl`; the Gradio UI streams it
   live in the **Debug Log** tab.
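
The snippet below condenses steps 1–3 into one illustrative sketch. It is not the
exact code from `index_builder.py` / `qa_engine.py`; file names and the `k` value
are assumptions, and it presumes `faster-whisper` plus LangChain's community
integrations are installed.

```python
# Illustrative end-to-end sketch: transcribe -> index -> QA chain.
from faster_whisper import WhisperModel
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQAWithSourcesChain

# 1. Transcribe with word-level timestamps.
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, _info = model.transcribe("downloads/audio.mp3", word_timestamps=True)

# 2. Embed each chunk and store it in FAISS, keeping timestamps as metadata.
texts, metadatas = [], []
for seg in segments:
    texts.append(seg.text)
    # The chain's default prompt expects a "source" key in the metadata.
    metadatas.append({"start": seg.start, "end": seg.end, "source": f"{seg.start:.1f}s"})

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
store = FAISS.from_texts(texts, embeddings, metadatas=metadatas)

# 3. Wrap a local LLM in a retrieval chain over the index.
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOllama(model="phi3"),
    retriever=store.as_retriever(search_kwargs={"k": 4}),
)
answer = chain.invoke({"question": "What is discussed in the opening minutes?"})
```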

---

## Installation

### Prerequisites

- Python 3.9+ (3.10 recommended)
- FFmpeg (for audio processing)
- CUDA-compatible GPU for acceleration (optional but recommended)

### Quick Start (CPU/Spaces Mode)

```bash
# 1. Set up a virtual environment
python -m venv .venv && source .venv/bin/activate  # Linux/macOS
# OR on Windows: .venv\Scripts\activate

pip install -r requirements.txt

# 2. Run the app (uses flan-t5-base by default)
python app.py
```

### Local GPU Setup (Optional)

For better performance with local models:

1. **Install Ollama**
   ```bash
   # macOS/Linux
   curl -fsSL https://ollama.com/install.sh | sh

   # Windows: download from https://ollama.com/download
   ```

2. **Download Models** (pick one)
   ```bash
   # Small & fast (4 GB+ VRAM)
   ollama pull phi3

   # Larger & more capable (8 GB+ VRAM)
   ollama pull mistral

   # Start Ollama in the background
   ollama serve &
   ```

3. **Run with Local Model**
   ```bash
   # The app automatically detects Ollama if it is running
   python app.py
   ```
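
For reference, here is one minimal way such a detection check could work; this is
an assumption for illustration, not necessarily how `app.py` detects Ollama:

```python
# Probe the default Ollama endpoint (http://localhost:11434 answers
# "Ollama is running" when the server is up).
import urllib.request

def ollama_available(url: str = "http://localhost:11434") -> bool:
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

print("Ollama detected" if ollama_available() else "Falling back to flan-t5-base")
```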

### FFmpeg Setup

```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows (with Chocolatey)
choco install ffmpeg
```

---

## Usage

1. **Launch the App**
   ```bash
   python app.py
   ```
   This starts a local web server at http://127.0.0.1:7860

2. **First Run**
   - The first time you run with a new model, it will download the necessary files (about 1 GB for `flan-t5-base`; larger models need more).
   - Subsequent starts will be faster as files are cached.

3. **Using the App**
   1. **Upload** any audio/video file (mp3, mp4, etc.)
   2. Select a model:
      - For CPU/Spaces: `flan-t5-base`
      - For local GPU: `phi3` or `tinyllama` (requires Ollama)
   3. Ask a question in the **Ask** tab
   4. The app will:
      - Transcribe the media (first time only)
      - Find the most relevant 30-second clip
      - Play the audio and show the timestamp

4. **Debugging**
   - Check the terminal for transcription progress
   - View the **Debug Log** tab for detailed LLM interactions
   - Logs are saved to `langchain_debug.jsonl`

---

## Project Layout

```
├── app.py             # Gradio UI and orchestration
├── clipper.py         # ffmpeg clip extraction helper
├── index_builder.py   # transcription + FAISS index builder
├── qa_engine.py       # load index, build RetrievalQA chain, JSONL logging
├── logging_config.py  # basic logger
├── requirements.txt
└── README.md
```

Generated artifacts:

* `downloads/audio.mp3` – copy of uploaded audio
* `data/faiss_index*` – FAISS vector store
* `data/segments.json` – transcript chunks with timestamps (see below)
* `langchain_debug.jsonl` – streaming debug log
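
`data/segments.json` holds one entry per transcript chunk. The exact field names
below are an assumption based on the metadata described above:

```python
# Peek at the stored transcript chunks (field names are assumed).
import json

with open("data/segments.json") as f:
    segments = json.load(f)

print(segments[0])  # e.g. {"text": "Welcome to the show ...", "start": 0.0, "end": 7.4}
```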

---

## Customising

* **Change minimum clip length** – modify the `MIN_CLIP_SEC` logic in
  `app.py` (currently hard-coded to 30 s; sketched below).
* **Use a different LLM** – change the `ChatOllama(model=...)` argument
  in `qa_engine.py` (any Ollama-served model works).
* **Prompt template** – supply `chain_type_kwargs={"prompt": custom_prompt}`
  when calling `RetrievalQAWithSourcesChain.from_chain_type` (example after
  the clip sketch below).
* **Rotate / clean logs** – delete `langchain_debug.jsonl`; it will be
  recreated on the next query.
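
Here is how the 30 s minimum plays out in clip extraction, as a sketch (assumes
`ffmpeg` is on PATH; `cut_clip` is an illustrative name, not necessarily the
function in `clipper.py`):

```python
# Cut a clip between two timestamps, extending short hits to MIN_CLIP_SEC.
import subprocess

MIN_CLIP_SEC = 30  # minimum clip length enforced by app.py

def cut_clip(src: str, start: float, end: float, dst: str = "clip.mp3") -> str:
    duration = max(end - start, MIN_CLIP_SEC)  # extend short answers to 30 s
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration),
         "-i", src, "-q:a", "2", dst],
        check=True,
    )
    return dst
```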
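
And a custom prompt example for the **Prompt template** option above (the
`summaries` / `question` variable names are what `RetrievalQAWithSourcesChain`
expects; the template wording itself is just an illustration):

```python
# A stricter prompt that keeps answers grounded in the transcript.
from langchain.prompts import PromptTemplate

custom_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Answer using only the transcript excerpts below. "
        "If the excerpts do not contain the answer, say so.\n\n"
        "{summaries}\n\nQuestion: {question}\nAnswer:"
    ),
)
```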

---

## Troubleshooting

### Common Issues

| Issue | Solution |
|-------|----------|
| **Ollama not detected** | Run `ollama serve` in a separate terminal |
| **CUDA out of memory** | Use a smaller model (`phi3` instead of `mistral`) or reduce the context length (see Advanced below) |
| **FFmpeg not found** | Install FFmpeg and ensure it's in your PATH |
| **Slow performance on CPU** | Use `phi3` or `tinyllama` with Ollama for GPU acceleration |
| **Model download errors** | Check internet connection and disk space |

### Advanced

- **Reducing VRAM Usage**:
  ```python
  # In app.py, reduce the context length
  llm = ChatOllama(model="phi3", num_ctx=2048)  # default is 4096
  ```

- **Faster Transcriptions**:
  ```bash
  # Pre-convert to 16 kHz mono WAV
  ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le input_16k.wav
  ```

- **Debug Logs**:
  - Check `langchain_debug.jsonl` for detailed traces (the handler is sketched below)
  - Set `LOG_LEVEL=DEBUG` for verbose output
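
A minimal handler in the spirit of `JSONLCallbackHandler` (an illustrative
sketch; the real handler in `qa_engine.py` may log different events and fields):

```python
# Append one JSON object per LangChain event to langchain_debug.jsonl.
import json
import time

from langchain.callbacks.base import BaseCallbackHandler

class JSONLCallbackHandler(BaseCallbackHandler):
    def __init__(self, path: str = "langchain_debug.jsonl"):
        self.path = path

    def _write(self, event: str, payload: dict) -> None:
        record = {"ts": time.time(), "event": event, **payload}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._write("llm_start", {"prompts": prompts})

    def on_llm_end(self, response, **kwargs):
        self._write("llm_end", {"output": str(response.generations)})
```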

## Local Development

### Environment Variables

```bash
# For HuggingFace models (required for Spaces)
export HUGGINGFACEHUB_API_TOKEN="your_token_here"

# For debugging
export LOG_LEVEL=DEBUG
```

### Running Tests

```bash
# Install test dependencies
pip install pytest pytest-mock

# Run tests
pytest tests/
```
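
If you add tests, here is an illustrative example of the style that fits this
project (hypothetical file and helper; the real suite under `tests/` may differ):

```python
# tests/test_clip_length.py - illustrative only.

def extend_to_minimum(start: float, end: float, min_sec: float = 30.0) -> float:
    """Mirror of the 30 s minimum-clip rule described in this README."""
    return max(end - start, min_sec)

def test_short_hits_are_extended_to_30s():
    assert extend_to_minimum(10.0, 15.0) == 30.0

def test_long_hits_keep_their_length():
    assert extend_to_minimum(0.0, 45.0) == 45.0
```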

### Building for Production

```bash
# Create a standalone executable (using PyInstaller)
pip install pyinstaller
pyinstaller --onefile app.py
```

---

## License

MIT – do what you want, no warranty.