---
title: ClipQuery
emoji: πŸ”
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
---
# ClipQuery – Ask Questions of Any Podcast / Video and Hear the Answer
ClipQuery turns *any* local audio or video file into a searchable, conversational experience.
It automatically transcribes the media, indexes each sentence with embeddings, and lets you
ask natural-language questions. For each question it returns:
1. A **30-second audio clip** from the exact place in the media where the answer occurs.
2. The **timestamp** of the clip.
3. A live **LangChain debug log** so you can inspect what happened behind the scenes.
---
## How It Works
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           audio / mp4            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚ transcribe & segment (faster-whisper)
                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ transcripts + {start, end} times β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚ build embeddings (HuggingFace Sentence-Transformers, SBERT MiniLM)
                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚        FAISS VectorStore         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚ retrieve top-k chunks
                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚        ChatOllama (phi3)         β”‚
β”‚        RetrievalQA chain         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
1. **Transcription** – `index_builder.py` uses `faster-whisper` to generate
word-level timestamps, saved as `segments.json`.
2. **Embedding + Index** – Sentence-Transformers (MiniLM) embeddings are
   stored in a **FAISS** index (`data/*`).
3. **Question Answering** – A local LLM (Ollama `phi3`) is wrapped in
   `RetrievalQAWithSourcesChain` to pull the most relevant transcript
   chunks and generate an answer (steps 1–3 are sketched in code below).
4. **Clip Extraction** – `clipper.py` calls `ffmpeg` to cut a 30 s MP3
   between the `start` and `end` timestamps (extended to 30 s if shorter);
   see the `ffmpeg` sketch below.
5. **Debug Logging** – A custom `JSONLCallbackHandler` dumps every
LangChain event to `langchain_debug.jsonl`; the Gradio UI streams it
live in the **Debug Log** tab.
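
The pieces above fit together roughly as in the following sketch. It is illustrative only: the variable names, file paths, and defaults (Whisper model size, `k`, index location) are assumptions, not the actual code in `index_builder.py` and `qa_engine.py`.
```python
# Illustrative end-to-end sketch; the real modules differ in structure and defaults.
from faster_whisper import WhisperModel
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_core.documents import Document
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Transcribe with segment-level timestamps (faster-whisper)
whisper = WhisperModel("base", device="cpu", compute_type="int8")
segments, _info = whisper.transcribe("downloads/audio.mp3")
docs = [
    Document(
        page_content=seg.text,
        metadata={"start": seg.start, "end": seg.end, "source": f"{seg.start:.1f}s"},
    )
    for seg in segments
]

# 2. Embed every chunk and store the vectors in a FAISS index
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.from_documents(docs, embeddings)
index.save_local("data/faiss_index")

# 3. Wrap a local Ollama model in a retrieval QA chain
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOllama(model="phi3"),
    retriever=index.as_retriever(search_kwargs={"k": 4}),
)
result = chain.invoke({"question": "What does the speaker say about pricing?"})
print(result["answer"], result["sources"])
```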
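
Step 4 boils down to an `ffmpeg` seek-and-cut. A minimal sketch of the idea (the function name and flags are assumptions, not the actual implementation of `clipper.py`):
```python
import subprocess

def cut_clip(src: str, start: float, end: float, out: str = "clip.mp3",
             min_len: float = 30.0) -> str:
    """Cut an MP3 starting at `start`, extending the clip to at least `min_len` seconds."""
    duration = max(end - start, min_len)
    subprocess.run(
        ["ffmpeg", "-y",           # overwrite the output file if it exists
         "-ss", f"{start:.2f}",    # seek before decoding (fast)
         "-i", src,
         "-t", f"{duration:.2f}",  # clip length
         "-q:a", "2",              # decent VBR MP3 quality
         out],
        check=True,
    )
    return out

cut_clip("downloads/audio.mp3", start=123.4, end=131.0)
```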
---
## Installation
### Prerequisites
- Python 3.9+ (3.10 recommended)
- FFmpeg (for audio processing)
- For GPU acceleration: CUDA-compatible GPU (optional but recommended)
### Quick Start (CPU/Spaces Mode)
```bash
# 1. Clone the repo, then create a virtual environment and install dependencies
python -m venv .venv && source .venv/bin/activate # Linux/macOS
# OR on Windows: .venv\Scripts\activate
pip install -r requirements.txt
# 2. Run the app (uses flan-t5-base by default)
python app.py
```
### Local GPU Setup (Optional)
For better performance with local models:
1. **Install Ollama**
```bash
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download from https://ollama.com/download
```
2. **Download Models** (pick one)
```bash
# Small & fast (4GB VRAM+)
ollama pull phi3
# Larger & more capable (8GB VRAM+)
ollama pull mistral
# Start Ollama in the background
ollama serve &
```
3. **Run with Local Model**
```bash
# The app will automatically detect Ollama if running
python app.py
```
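
The detection can be as simple as a health check: Ollama's HTTP API listens on `localhost:11434` by default. The snippet below is a sketch of that idea, not the actual logic in `app.py`; the fallback model name is only an example.
```python
import requests

def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama server responds on its default port."""
    try:
        return requests.get(f"{base_url}/api/tags", timeout=2).status_code == 200
    except requests.RequestException:
        return False

# Fall back to the CPU-friendly HuggingFace model when Ollama is not running.
model_name = "phi3" if ollama_available() else "google/flan-t5-base"
print(f"Selected model: {model_name}")
```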
### FFmpeg Setup
`clipper.py` shells out to FFmpeg, so it must be installed and on your `PATH`:
```bash
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# Windows (with Chocolatey)
choco install ffmpeg
```
---
## Usage
1. **Launch the App**
```bash
python app.py
```
This starts a local web server at http://127.0.0.1:7860
2. **First Run**
- The first time you run with a new model, it will download the necessary files (roughly 1 GB for `flan-t5-base`).
- Subsequent starts will be faster as files are cached.
3. **Using the App**
1. **Upload** any audio/video file (mp3, mp4, etc.)
2. Select a model:
- For CPU/Spaces: `flan-t5-base`
- For local GPU: `phi3` or `tinyllama` (requires Ollama)
3. Ask a question in the **Ask** tab
4. The app will:
- Transcribe the media (first time only)
- Find the most relevant 30-second clip
- Play the audio and show the timestamp
4. **Debugging**
- Check the terminal for transcription progress
- View the **Debug Log** tab for detailed LLM interactions
- Logs are saved to `langchain_debug.jsonl`
---
## Project Layout
```
β”œβ”€β”€ app.py # Gradio UI and orchestration
β”œβ”€β”€ clipper.py # ffmpeg clip extraction helper
β”œβ”€β”€ index_builder.py # transcription + FAISS index builder
β”œβ”€β”€ qa_engine.py # load index, build RetrievalQA chain, JSONL logging
β”œβ”€β”€ logging_config.py # basic logger
β”œβ”€β”€ requirements.txt
└── README.md
```
Generated artifacts:
* `downloads/audio.mp3` – copy of uploaded audio
* `data/faiss_index*` – FAISS vector store
* `data/segments.json` – transcript chunks with timestamps
* `langchain_debug.jsonl` – streaming debug log
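
If you want to inspect the transcript outside the app, `data/segments.json` is plain JSON. The field names below are an assumption based on the metadata described above; check the file itself to confirm.
```python
import json

with open("data/segments.json") as f:
    segments = json.load(f)

# Print the first few chunks with their (assumed) start/end timestamps.
for seg in segments[:5]:
    print(f'{seg.get("start", "?")}s-{seg.get("end", "?")}s  {seg.get("text", "")[:60]}')
```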
---
## Customising
* **Change minimum clip length** – Modify `MIN_CLIP_SEC` logic in
`app.py` (currently hard-coded to 30 s).
* **Use a different LLM** – Change the `ChatOllama(model=...)` argument
in `qa_engine.py` (any Ollama-served model works).
* **Prompt template** – Supply `chain_type_kwargs={"prompt": custom_prompt}`
  when calling `RetrievalQAWithSourcesChain.from_chain_type` (sketched below).
* **Rotate / clean logs** – Delete `langchain_debug.jsonl`; it will be
recreated on the next query.
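
A minimal sketch of the prompt-template customisation. It assumes the `stuff` chain type and an index saved at `data/faiss_index`; the `allow_dangerous_deserialization` flag is only needed on newer LangChain versions.
```python
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.load_local("data/faiss_index", embeddings,
                         allow_dangerous_deserialization=True)  # required by newer LangChain

# The "stuff" variant of the sources chain expects `summaries` and `question` variables.
custom_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Answer using only the transcript excerpts below.\n\n"
        "{summaries}\n\n"
        "Question: {question}\nAnswer:"
    ),
)

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOllama(model="phi3"),
    retriever=index.as_retriever(),
    chain_type="stuff",
    chain_type_kwargs={"prompt": custom_prompt},
)
```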
---
## Troubleshooting
### Common Issues
| Issue | Solution |
|-------|----------|
| **Ollama not detected** | Run `ollama serve` in a separate terminal |
| **CUDA Out of Memory** | Use a smaller model (`phi3` instead of `mistral`) or reduce batch size |
| **FFmpeg not found** | Install FFmpeg and ensure it's in your PATH |
| **Slow performance on CPU** | Use `phi3` or `tinyllama` with Ollama for GPU acceleration |
| **Model download errors** | Check internet connection and disk space |
### Advanced
- **Reducing VRAM Usage**:
```python
# In app.py, reduce context length
llm = ChatOllama(model="phi3", num_ctx=2048) # default is 4096
```
- **Faster Transcriptions**:
```bash
# Pre-convert to 16kHz mono WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le input_16k.wav
```
- **Debug Logs**:
  - Check `langchain_debug.jsonl` for detailed traces (see the snippet below)
- Set `LOG_LEVEL=DEBUG` for verbose output
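
Each line of `langchain_debug.jsonl` is an independent JSON object, so it is easy to inspect from a script, for example:
```python
import json

# Pretty-print every logged LangChain event, truncating very long payloads.
with open("langchain_debug.jsonl") as f:
    for line in f:
        event = json.loads(line)
        print(json.dumps(event, indent=2)[:500])
```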
## Local Development
### Environment Variables
```bash
# For HuggingFace models (required for Spaces)
export HUGGINGFACEHUB_API_TOKEN="your_token_here"
# For debugging
export LOG_LEVEL=DEBUG
```
### Running Tests
```bash
# Install test dependencies
pip install pytest pytest-mock
# Run tests
pytest tests/
```
### Building for Production
```bash
# Create a standalone executable (using PyInstaller)
pip install pyinstaller
pyinstaller --onefile app.py
```
---
## License
MIT – do what you want, no warranty.