---
title: ClipQuery
emoji: πŸ”
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
---

ClipQuery – Ask Questions of Any Podcast / Video and Hear the Answer

ClipQuery turns any local audio or video file into a searchable, conversational experience. It automatically transcribes the media, indexes each sentence with embeddings, and lets you ask natural-language questions. It returns:

  1. A 30-second audio clip from the exact place in the media where the answer occurs.
  2. The timestamp of the clip.
  3. A live LangChain debug log so you can inspect what happened behind the scenes.

How It Works

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   transcribe & segment   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  audio / mp4  β”‚ ───────────────────────▢ β”‚  transcripts  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚                                           β”‚
        β”‚ build embeddings (SBERT)                  β”‚  metadata: {start, end}
        β–Ό                                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    store vectors    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HuggingFace          │───────────────────▢│ FAISS VectorStore  β”‚
β”‚ Sentence-Transformer β”‚                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                               β–²
                                                       β”‚ retrieve top-k
                                                       β–Ό
                                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                            β”‚ ChatOllama (phi3)  β”‚
                                            β”‚  RetrievalQA chain β”‚
                                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  1. Transcription – index_builder.py uses faster-whisper to generate word-level timestamps, saved as segments.json.
  2. Embedding + Index – Sentence-Transformer (MiniLM) embeddings are stored in a FAISS index (data/*); see the sketch after this list.
  3. Question Answering – A local LLM (Ollama phi3) is wrapped in RetrievalQAWithSourcesChain to pull the most relevant transcript chunks and generate an answer.
  4. Clip Extraction – clipper.py calls ffmpeg to cut a 30 s MP3 between the start and end timestamps (extended to 30 s if the match is shorter; also sketched below).
  5. Debug Logging – A custom JSONLCallbackHandler dumps every LangChain event to langchain_debug.jsonl; the Gradio UI streams it live in the Debug Log tab.
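
For orientation, here is a minimal, hypothetical sketch of steps 1–2 using the same libraries. The real logic lives in index_builder.py, and the exact import paths vary with library versions:

from faster_whisper import WhisperModel
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Step 1: transcribe with word-level timestamps.
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, _ = model.transcribe("downloads/audio.mp3", word_timestamps=True)

texts, metadatas = [], []
for seg in segments:
    texts.append(seg.text.strip())
    metadatas.append({"start": seg.start, "end": seg.end})  # kept as metadata

# Step 2: embed each chunk and store it, with its timestamps, in a FAISS index.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
FAISS.from_texts(texts, embeddings, metadatas=metadatas).save_local("data")

Step 4 then reduces to one ffmpeg call per answer, roughly (with <start> taken from the matched chunk's metadata):

ffmpeg -ss <start> -t 30 -i downloads/audio.mp3 -q:a 2 clip.mp3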

Installation

Prerequisites

  • Python 3.9+ (3.10 recommended)
  • FFmpeg (for audio processing)
  • CUDA-compatible GPU for local-model acceleration (optional but recommended)

Quick Start (CPU/Spaces Mode)

# 1. Set up a virtual environment
python -m venv .venv && source .venv/bin/activate  # Linux/macOS
# OR on Windows: .venv\Scripts\activate

pip install -r requirements.txt

# 2. Run the app (uses flan-t5-base by default)
python app.py

Local GPU Setup (Optional)

For better performance with local models:

  1. Install Ollama

    # macOS/Linux
    curl -fsSL https://ollama.com/install.sh | sh
    
    # Windows: Download from https://ollama.com/download
    
  2. Download Models (pick one)

    # Small & fast (4GB VRAM+)
    ollama pull phi3
    
    # Larger & more capable (8GB VRAM+)
    ollama pull mistral
    
    # Start Ollama in the background
    ollama serve &
    
  3. Run with Local Model

    # The app will automatically detect Ollama if running
    python app.py
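
If you want to verify the detection yourself, a minimal probe can hit Ollama's local HTTP API (hypothetical snippet; the app's own check lives in app.py):

import requests

try:
    # Ollama's HTTP API lists locally pulled models at /api/tags.
    tags = requests.get("http://localhost:11434/api/tags", timeout=2)
    tags.raise_for_status()
    print("Ollama is up:", [m["name"] for m in tags.json().get("models", [])])
except requests.RequestException:
    print("Ollama not reachable - the app falls back to flan-t5-base.")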
    

FFmpeg Setup

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows (with Chocolatey)
choco install ffmpeg

Usage

  1. Launch the App

    python app.py
    

    This starts a local web server at http://127.0.0.1:7860

  2. First Run

    • The first time you run with a new model, it will download the necessary model files (roughly 1 GB for flan-t5-base; larger models take more).
    • Subsequent starts will be faster as files are cached.
  3. Using the App

    1. Upload any audio/video file (mp3, mp4, etc.)
    2. Select a model:
      • For CPU/Spaces: flan-t5-base
      • For local GPU: phi3 or tinyllama (requires Ollama)
    3. Ask a question in the Ask tab
    4. The app will:
      • Transcribe the media (first time only)
      • Find the most relevant 30-second clip
      • Play the audio and show the timestamp
  4. Debugging

    • Check the terminal for transcription progress
    • View the Debug Log tab for detailed LLM interactions
    • Logs are saved to langchain_debug.jsonl

Project Layout

β”œβ”€β”€ app.py              # Gradio UI and orchestration
β”œβ”€β”€ clipper.py          # ffmpeg clip extraction helper
β”œβ”€β”€ index_builder.py    # transcription + FAISS index builder
β”œβ”€β”€ qa_engine.py        # load index, build RetrievalQA chain, JSONL logging
β”œβ”€β”€ logging_config.py   # basic logger
β”œβ”€β”€ requirements.txt
└── README.md

Generated artifacts:

  • downloads/audio.mp3 – copy of uploaded audio
  • data/faiss_index* – FAISS vector store
  • data/segments.json – transcript chunks with timestamps
  • langchain_debug.jsonl – streaming debug log

Customising

  • Change minimum clip length – Modify MIN_CLIP_SEC logic in app.py (currently hard-coded to 30 s).
  • Use a different LLM – Change the ChatOllama(model=...) argument in qa_engine.py (any Ollama-served model works); see the sketch after this list.
  • Prompt template – Supply chain_type_kwargs={"prompt": custom_prompt} when calling RetrievalQAWithSourcesChain.from_chain_type, as sketched below.
  • Rotate / clean logs – Delete langchain_debug.jsonl; it will be recreated on the next query.
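
A combined sketch of the last two customisations (hypothetical; LangChain import paths vary by version, and `store` stands for the FAISS index loaded from data/):

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama

# Swap in any Ollama-served model, e.g. mistral instead of phi3.
llm = ChatOllama(model="mistral")

# The with-sources "stuff" chain fills {summaries} with the retrieved chunks.
custom_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Answer using only the transcript excerpts below.\n\n"
        "{summaries}\n\nQuestion: {question}\nAnswer:"
    ),
)

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm,
    chain_type="stuff",
    retriever=store.as_retriever(),
    chain_type_kwargs={"prompt": custom_prompt},
)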

Troubleshooting

Common Issues

Issue                     Solution
------------------------  ------------------------------------------------------------------
Ollama not detected       Run ollama serve in a separate terminal
CUDA out of memory        Use a smaller model (phi3 instead of mistral) or reduce batch size
FFmpeg not found          Install FFmpeg and ensure it is on your PATH
Slow performance on CPU   Use phi3 or tinyllama with Ollama for GPU acceleration
Model download errors     Check your internet connection and free disk space

Advanced

  • Reducing VRAM Usage:

    # In app.py, reduce context length
    llm = ChatOllama(model="phi3", num_ctx=2048)  # default is 4096
    
  • Faster Transcriptions:

    # Pre-convert to 16kHz mono WAV
    ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le input_16k.wav
    
  • Debug Logs:

    • Check langchain_debug.jsonl for detailed traces
    • Set LOG_LEVEL=DEBUG for verbose output
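
To skim the raw log outside the UI, a few lines of Python suffice (the exact keys depend on what JSONLCallbackHandler emits, so treat the field names as illustrative):

import json

with open("langchain_debug.jsonl") as log:
    for line in log:
        event = json.loads(line)
        # Print a compact one-line summary per LangChain callback event.
        print(event.get("event", "?"), list(event.keys()))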

Local Development

Environment Variables

# For HuggingFace models (required for Spaces)
export HUGGINGFACEHUB_API_TOKEN="your_token_here"

# For debugging
export LOG_LEVEL=DEBUG
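
For reference, a minimal setup that honours LOG_LEVEL might look like this (hypothetical; the real configuration lives in logging_config.py):

import logging
import os

# Fall back to INFO if LOG_LEVEL is unset or not a valid level name.
level = getattr(logging, os.getenv("LOG_LEVEL", "INFO").upper(), logging.INFO)
logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(name)s: %(message)s")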

Running Tests

# Install test dependencies
pip install pytest pytest-mock

# Run tests
pytest tests/

Building for Production

# Create a standalone executable (using PyInstaller)
pip install pyinstaller
pyinstaller --onefile app.py

License

MIT – do what you want, no warranty.