---
title: ClipQuery
emoji: πŸ”
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
---

# ClipQuery – Ask Questions of Any Podcast / Video and Hear the Answer

ClipQuery turns *any* local audio or video file into a searchable, conversational experience.
It automatically transcribes the media, indexes each sentence with embeddings, and lets you
ask natural-language questions. For every question it returns:

1. A **30-second audio clip** from the exact place in the media where the answer occurs.
2. The **timestamp** of the clip.
3. A live **LangChain debug log** so you can inspect what happened behind the scenes.

---
## How It Works

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   transcribe & segment   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  audio / mp4  β”‚ ───────────────────────▢ β”‚  transcripts  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚                                           β”‚
        β”‚ build embeddings (SBERT)                  β”‚  metadata: {start, end}
        β–Ό                                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  store vectors  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HuggingFace          │────────────────▢│ FAISS VectorStore  β”‚
β”‚ Sentence-Transformer β”‚                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                           β–²
                                                    β”‚ retrieve top-k
                                                    β–Ό
                                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                         β”‚ ChatOllama (phi3)  β”‚
                                         β”‚  RetrievalQA chain β”‚
                                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

1. **Transcription** – `index_builder.py` uses `faster-whisper` to generate
   word-level timestamps, saved as `segments.json`.
2. **Embedding + Index** – Sentence-Transformer (MiniLM) embeddings are
   stored in a **FAISS** index (`data/*`); a minimal sketch of steps 1–2
   follows this list.
3. **Question Answering** – A local LLM (Ollama `phi3`) is wrapped in
   `RetrievalQAWithSourcesChain` to pull the most relevant transcript
   chunks and generate an answer.
4. **Clip Extraction** – `clipper.py` calls `ffmpeg` to cut a 30 s MP3
   between the `start` and `end` timestamps (extended to 30 s if shorter);
   see the second sketch below.
5. **Debug Logging** – A custom `JSONLCallbackHandler` dumps every
   LangChain event to `langchain_debug.jsonl`; the Gradio UI streams it
   live in the **Debug Log** tab.
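
The indexing half of the pipeline (steps 1–2) can be sketched roughly as follows. This is a minimal illustration, not the project's actual `index_builder.py`; the function name, Whisper model size, and the LangChain import paths (which vary by version) are assumptions.

```python
# Minimal sketch of the transcribe-and-index step (illustrative only; names
# and paths here are assumptions, not the actual index_builder.py API).
import json

from faster_whisper import WhisperModel
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


def build_index(audio_path: str, index_dir: str = "data") -> FAISS:
    # 1. Transcribe with segment-level timestamps.
    whisper = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _info = whisper.transcribe(audio_path)
    chunks = [
        {"text": seg.text.strip(), "start": seg.start, "end": seg.end}
        for seg in segments
    ]
    with open(f"{index_dir}/segments.json", "w") as f:
        json.dump(chunks, f)

    # 2. Embed each chunk (MiniLM) and store it in FAISS, keeping the
    #    timestamps as metadata so clips can be cut later.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    store = FAISS.from_texts(
        texts=[c["text"] for c in chunks],
        embedding=embeddings,
        metadatas=[{"start": c["start"], "end": c["end"]} for c in chunks],
    )
    store.save_local(f"{index_dir}/faiss_index")
    return store
```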

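Step 4 boils down to a single `ffmpeg` call. A rough sketch (the real `clipper.py` may differ; the function name and encoding flags are illustrative):

```python
# Rough sketch of the clip-extraction helper; the actual clipper.py may
# differ, but the ffmpeg invocation is the core idea.
import subprocess

MIN_CLIP_SEC = 30  # minimum clip length the app plays back


def extract_clip(media_path: str, start: float, end: float,
                 out_path: str = "clip.mp3") -> str:
    # Pad the matched segment out to MIN_CLIP_SEC if it is shorter.
    duration = max(end - start, MIN_CLIP_SEC)
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start),        # seek to the answer's start timestamp
            "-t", str(duration),      # keep `duration` seconds
            "-i", media_path,
            "-vn",                    # drop any video stream
            "-acodec", "libmp3lame",  # encode the clip as MP3
            out_path,
        ],
        check=True,
        capture_output=True,
    )
    return out_path
```
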
---
## Installation

### Prerequisites
- Python 3.9+ (3.10 recommended)
- FFmpeg (for audio processing)
- For GPU acceleration: CUDA-compatible GPU (optional but recommended)

### Quick Start (CPU/Spaces Mode)
```bash
# 1. Clone and set up
python -m venv .venv && source .venv/bin/activate  # Linux/macOS
# OR on Windows: .venv\Scripts\activate

pip install -r requirements.txt

# 2. Run the app (uses flan-t5-base by default)
python app.py
```

### Local GPU Setup (Optional)
For better performance with local models:

1. **Install Ollama**
   ```bash
   # macOS/Linux
   curl -fsSL https://ollama.com/install.sh | sh
   
   # Windows: Download from https://ollama.com/download
   ```

2. **Download Models** (pick one)
   ```bash
   # Small & fast (4GB VRAM+)
   ollama pull phi3
   
   # Larger & more capable (8GB VRAM+)
   ollama pull mistral
   
   # Start Ollama in the background
   ollama serve &
   ```

3. **Run with Local Model**
   ```bash
   # The app will automatically detect Ollama if running
   python app.py
   ```

### FFmpeg Setup
`clipper.py` relies on FFmpeg to cut the answer clips, so make sure it is installed and on your PATH:
```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows (with Chocolatey)
choco install ffmpeg
```

---
## Usage

1. **Launch the App**
   ```bash
   python app.py
   ```
   This starts a local web server at http://127.0.0.1:7860

2. **First Run**
   - The first time you run with a new model, the app downloads the model weights (about 1 GB for `flan-t5-base`; larger models need more).
   - Subsequent starts will be faster as files are cached.

3. **Using the App**
   1. **Upload** any audio/video file (mp3, mp4, etc.)
   2. Select a model:
      - For CPU/Spaces: `flan-t5-base`
      - For local GPU: `phi3` or `tinyllama` (requires Ollama)
   3. Ask a question in the **Ask** tab
   4. The app will:
      - Transcribe the media (first time only)
      - Find the most relevant 30-second clip
      - Play the audio and show the timestamp

4. **Debugging**
   - Check the terminal for transcription progress
   - View the **Debug Log** tab for detailed LLM interactions
   - Logs are saved to `langchain_debug.jsonl`

---
## Project Layout

```
β”œβ”€β”€ app.py              # Gradio UI and orchestration
β”œβ”€β”€ clipper.py          # ffmpeg clip extraction helper
β”œβ”€β”€ index_builder.py    # transcription + FAISS index builder
β”œβ”€β”€ qa_engine.py        # load index, build RetrievalQA chain, JSONL logging
β”œβ”€β”€ logging_config.py   # basic logger
β”œβ”€β”€ requirements.txt
└── README.md
```

Generated artifacts:

* `downloads/audio.mp3`   – copy of uploaded audio
* `data/faiss_index*`     – FAISS vector store
* `data/segments.json`    – transcript chunks with timestamps
* `langchain_debug.jsonl` – streaming debug log

---
## Customising

* **Change minimum clip length** – Modify `MIN_CLIP_SEC` logic in
  `app.py` (currently hard-coded to 30 s).
* **Use a different LLM** – Change the `ChatOllama(model=...)` argument
  in `qa_engine.py` (any Ollama-served model works).
* **Prompt template** – Supply `chain_type_kwargs={"prompt": custom_prompt}`
  when calling `RetrievalQAWithSourcesChain.from_chain_type`; a sketch follows
  this list.
* **Rotate / clean logs** – Delete `langchain_debug.jsonl`; it will be
  recreated on the next query.
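
A hedged sketch of that last customisation, assuming the chain is built roughly as described in "How It Works". The prompt text, retriever settings, and LangChain import paths are assumptions and vary by version; `allow_dangerous_deserialization` only exists in newer `langchain_community` releases.

```python
# Illustrative only: wiring a custom prompt into the QA chain.
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Reload the FAISS index built earlier (kwarg availability depends on version).
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
store = FAISS.load_local(
    "data/faiss_index", embeddings, allow_dangerous_deserialization=True
)

# The "stuff" chain used by RetrievalQAWithSourcesChain expects the variables
# {summaries} and {question} in its prompt.
custom_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Answer the question using only the transcript excerpts below.\n\n"
        "{summaries}\n\nQuestion: {question}\nAnswer:"
    ),
)

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOllama(model="phi3"),
    chain_type="stuff",
    retriever=store.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": custom_prompt},
)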

---
## Troubleshooting

### Common Issues

| Issue | Solution |
|-------|----------|
| **Ollama not detected** | Run `ollama serve` in a separate terminal |
| **CUDA Out of Memory** | Use a smaller model (`phi3` instead of `mistral`) or reduce batch size |
| **FFmpeg not found** | Install FFmpeg and ensure it's in your PATH |
| **Slow performance on CPU** | Use `phi3` or `tinyllama` with Ollama for GPU acceleration |
| **Model download errors** | Check internet connection and disk space |

### Advanced
- **Reducing VRAM Usage**:
  ```python
  # In app.py, reduce context length
  llm = ChatOllama(model="phi3", num_ctx=2048)  # default is 4096
  ```

- **Faster Transcriptions**:
  ```bash
  # Pre-convert to 16kHz mono WAV
  ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le input_16k.wav
  ```

- **Debug Logs**:
  - Check `langchain_debug.jsonl` for detailed traces (a sketch of the handler that writes it appears below)
  - Set `LOG_LEVEL=DEBUG` for verbose output
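
For reference, a `JSONLCallbackHandler` like the one described in "How It Works" can be approximated with a standard LangChain callback handler. The sketch below is an assumption about its shape, not the project's actual implementation, and the `BaseCallbackHandler` import path depends on your LangChain version.

```python
# Hedged sketch of a JSONL debug logger (not the project's actual handler).
import json
import time

from langchain_core.callbacks import BaseCallbackHandler


class JSONLCallbackHandler(BaseCallbackHandler):
    def __init__(self, path: str = "langchain_debug.jsonl"):
        self.path = path

    def _write(self, event: str, payload: dict) -> None:
        # Append one JSON object per line so the UI can tail the file.
        record = {"ts": time.time(), "event": event, **payload}
        with open(self.path, "a") as f:
            f.write(json.dumps(record, default=str) + "\n")

    def on_chain_start(self, serialized, inputs, **kwargs):
        self._write("chain_start", {"inputs": inputs})

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._write("llm_start", {"prompts": prompts})

    def on_llm_end(self, response, **kwargs):
        self._write("llm_end", {"response": str(response)})

    def on_chain_end(self, outputs, **kwargs):
        self._write("chain_end", {"outputs": outputs})
```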

## Local Development

### Environment Variables
```bash
# For HuggingFace models (required for Spaces)
export HUGGINGFACEHUB_API_TOKEN="your_token_here"

# For debugging
export LOG_LEVEL=DEBUG
```

### Running Tests
```bash
# Install test dependencies
pip install pytest pytest-mock

# Run tests
pytest tests/
```

### Building for Production
```bash
# Create a standalone executable (using PyInstaller)
pip install pyinstaller
pyinstaller --onefile app.py
```

---
## License

MIT – do what you want, no warranty.