---
title: ClipQuery
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
---
# ClipQuery – Ask Questions of Any Podcast / Video and Hear the Answer
ClipQuery turns any local audio or video file into a searchable, conversational experience. It automatically transcribes the media, indexes each sentence with embeddings, and lets you ask natural-language questions. It returns:
- A 30-second audio clip from the exact place in the media where the answer occurs.
- The timestamp of the clip.
- A live LangChain debug log so you can inspect what happened behind the scenes.
## How It Works
```text
┌─────────────┐   transcribe & segment   ┌─────────────┐
│ audio / mp4 │ ────────────────────────▶│ transcripts │
└─────────────┘                          └─────────────┘
       │                                        │
       │ build embeddings (SBERT)               │ metadata: {start, end}
       ▼                                        ▼
┌───────────────────────┐   store vectors   ┌────────────────────┐
│ HuggingFace           │ ─────────────────▶│ FAISS VectorStore  │
│ Sentence-Transformer  │                   └────────────────────┘
└───────────────────────┘                             ▲
                                                      │ retrieve top-k
                                                      ▼
                                            ┌────────────────────┐
                                            │ ChatOllama (phi3)  │
                                            │ RetrievalQA chain  │
                                            └────────────────────┘
```
- Transcription – `index_builder.py` uses `faster-whisper` to generate word-level timestamps, saved as `segments.json`.
- Embedding + Index – Sentence-Transformer (MiniLM) embeddings are stored in a FAISS index (`data/*`).
- Question Answering – A local LLM (Ollama `phi3`) is wrapped in `RetrievalQAWithSourcesChain` to pull the most relevant transcript chunks and generate an answer (see the pipeline sketch after this list).
- Clip Extraction – `clipper.py` calls `ffmpeg` to cut a 30 s MP3 between the `start` and `end` timestamps (extended to 30 s if shorter).
- Debug Logging – A custom `JSONLCallbackHandler` dumps every LangChain event to `langchain_debug.jsonl`; the Gradio UI streams it live in the Debug Log tab.
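A condensed sketch of that pipeline is below. It assumes `segments.json` entries of the form `{text, start, end}` and the model names mentioned above; the real `index_builder.py` and `qa_engine.py` differ in structure and error handling.

```python
# Hedged sketch of the transcribe → embed → retrieve → answer flow.
# Paths, field names, and model choices are assumptions, not the app's exact code.
import json
import os

from faster_whisper import WhisperModel
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain.chains import RetrievalQAWithSourcesChain

os.makedirs("data", exist_ok=True)

# 1. Transcribe with faster-whisper and keep per-segment timestamps
model = WhisperModel("base", compute_type="int8")
segments, _info = model.transcribe("downloads/audio.mp3", word_timestamps=True)
segs = [{"text": s.text, "start": s.start, "end": s.end} for s in segments]
with open("data/segments.json", "w") as f:
    json.dump(segs, f)

# 2. Embed each segment (Sentence-Transformer MiniLM) and index with FAISS
docs = [
    Document(
        page_content=s["text"],
        metadata={"start": s["start"], "end": s["end"], "source": f"{s['start']:.1f}s"},
    )
    for s in segs
]
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_documents(docs, embeddings)
store.save_local("data/faiss_index")

# 3. Answer questions with a local Ollama model over the top-k retrieved chunks
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOllama(model="phi3"),
    retriever=store.as_retriever(search_kwargs={"k": 4}),
)
result = chain.invoke({"question": "What does the speaker say about pricing?"})
print(result["answer"], result["sources"])
```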
## Installation

### Prerequisites

- Python 3.9+ (3.10 recommended)
- FFmpeg (for audio processing)
- Optional but recommended: a CUDA-compatible GPU for acceleration
### Quick Start (CPU/Spaces Mode)

```bash
# 1. Clone and set up
python -m venv .venv && source .venv/bin/activate   # Linux/macOS
# OR on Windows: .venv\Scripts\activate
pip install -r requirements.txt

# 2. Run the app (uses flan-t5-base by default)
python app.py
```
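On the CPU/Spaces path the default model is `flan-t5-base`. As an illustration only (not the app's exact wiring), a Hugging Face model can be wrapped as a LangChain LLM roughly like this; the `pipeline_kwargs` values are assumptions:

```python
# Sketch of a CPU-friendly fallback LLM built from flan-t5-base.
from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-base",
    task="text2text-generation",
    pipeline_kwargs={"max_new_tokens": 256},  # assumed setting, not from the repo
)
print(llm.invoke("Summarise: ClipQuery answers questions about podcasts."))
```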
### Local GPU Setup (Optional)

For better performance with local models:

#### Install Ollama

```bash
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download from https://ollama.com/download
```

#### Download Models (pick one)

```bash
# Small & fast (4GB VRAM+)
ollama pull phi3
# Larger & more capable (8GB VRAM+)
ollama pull mistral

# Start Ollama in the background
ollama serve &
```

#### Run with Local Model

```bash
# The app will automatically detect Ollama if running
python app.py
```
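Automatic detection presumably comes down to checking whether the Ollama server answers on its default port (11434). A minimal check along those lines, with the fallback model name assumed:

```python
# Sketch: pick a local Ollama model if the server is reachable, else fall back.
import urllib.request

def ollama_available(url: str = "http://localhost:11434") -> bool:
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200  # Ollama replies "Ollama is running"
    except OSError:
        return False

model = "phi3" if ollama_available() else "flan-t5-base"
print(f"Using model: {model}")
```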
### FFmpeg Setup

`clipper.py` calls FFmpeg directly, so the binary must be installed and on your `PATH`:

```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows (with Chocolatey)
choco install ffmpeg
```
## Quick Start

### Launch the App

```bash
python app.py
```

This starts a local web server at http://127.0.0.1:7860.
### First Run

- The first time you run with a new model, it will download the necessary files (1-5 GB for `flan-t5-base`).
- Subsequent starts will be faster as files are cached.
### Using the App

- Upload any audio/video file (mp3, mp4, etc.)
- Select a model:
  - For CPU/Spaces: `flan-t5-base`
  - For local GPU: `phi3` or `tinyllama` (requires Ollama)
- Ask a question in the Ask tab
- The app will:
  - Transcribe the media (first time only)
  - Find the most relevant 30-second clip
  - Play the audio and show the timestamp
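For orientation, here is a heavily stripped-down sketch of how such a Gradio UI can be wired; the real `app.py` adds model selection, the Debug Log tab, and the actual transcription and retrieval calls:

```python
# Hypothetical minimal UI wiring: upload + question in, answer + clip + timestamp out.
import gradio as gr

def ask(media_path, question):
    # Placeholder: the real app transcribes, retrieves, and cuts a 30 s clip here.
    answer_text = f"(answer to: {question})"
    clip_path = None          # the real app returns an MP3 produced by clipper.py
    timestamp = "00:01:23"
    return answer_text, clip_path, timestamp

with gr.Blocks() as demo:
    media = gr.File(label="Audio / video file")
    question = gr.Textbox(label="Your question")
    answer = gr.Textbox(label="Answer")
    clip = gr.Audio(label="30-second clip")
    ts = gr.Textbox(label="Timestamp")
    gr.Button("Ask").click(ask, inputs=[media, question], outputs=[answer, clip, ts])

demo.launch()  # serves on http://127.0.0.1:7860 by default
```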
### Debugging

- Check the terminal for transcription progress
- View the Debug Log tab for detailed LLM interactions
- Logs are saved to `langchain_debug.jsonl`
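That log is written by the custom `JSONLCallbackHandler` mentioned under How It Works. A minimal sketch of such a handler, assuming LangChain's `BaseCallbackHandler` API; the real implementation in `qa_engine.py` may record different events and fields:

```python
# Sketch of a JSONL-writing LangChain callback; event names and payloads are assumptions.
import json
import time

from langchain_core.callbacks import BaseCallbackHandler

class JSONLCallbackHandler(BaseCallbackHandler):
    def __init__(self, path: str = "langchain_debug.jsonl"):
        self.path = path

    def _write(self, event: str, payload: dict) -> None:
        # One JSON object per line (JSON Lines), appended as events arrive.
        with open(self.path, "a") as f:
            f.write(json.dumps({"ts": time.time(), "event": event, **payload}) + "\n")

    def on_chain_start(self, serialized, inputs, **kwargs):
        self._write("chain_start", {"inputs": {k: str(v) for k, v in inputs.items()}})

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._write("llm_start", {"prompts": prompts})

    def on_llm_end(self, response, **kwargs):
        self._write("llm_end", {"generations": str(response.generations)})
```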
## Project Layout

```text
├── app.py             # Gradio UI and orchestration
├── clipper.py         # ffmpeg clip extraction helper
├── index_builder.py   # transcription + FAISS index builder
├── qa_engine.py       # load index, build RetrievalQA chain, JSONL logging
├── logging_config.py  # basic logger
├── requirements.txt
└── README.md
```
Generated artifacts:

- `downloads/audio.mp3` – copy of the uploaded audio
- `data/faiss_index*` – FAISS vector store
- `data/segments.json` – transcript chunks with timestamps
- `langchain_debug.jsonl` – streaming debug log
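Each `data/segments.json` entry carries `start`/`end` timestamps, which is all the clip extraction needs. A rough sketch of what `clipper.py` does conceptually; the function name and signature are assumptions:

```python
# Cut an MP3 clip between start and end, padded to at least 30 s, via ffmpeg.
import subprocess

MIN_CLIP_SEC = 30

def cut_clip(src: str, start: float, end: float, out: str = "clip.mp3") -> str:
    duration = max(end - start, MIN_CLIP_SEC)   # extend short spans to 30 s
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start),        # seek to the answer's start timestamp
            "-i", src,
            "-t", str(duration),      # clip length
            "-vn",                    # drop any video stream
            "-acodec", "libmp3lame",  # encode as MP3
            out,
        ],
        check=True,
    )
    return out
```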
## Customising

- Change minimum clip length – Modify the `MIN_CLIP_SEC` logic in `app.py` (currently hard-coded to 30 s).
- Use a different LLM – Change the `ChatOllama(model=...)` argument in `qa_engine.py` (any Ollama-served model works).
- Prompt template – Supply `chain_type_kwargs={"prompt": custom_prompt}` when calling `RetrievalQAWithSourcesChain.from_chain_type` (see the sketch after this list).
- Rotate / clean logs – Delete `langchain_debug.jsonl`; it will be recreated on the next query.
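For the prompt-template option, a hedged example: it reloads the saved index and assumes the default "stuff" chain, whose prompt expects `summaries` and `question` variables. The index path and model name follow the defaults above.

```python
# Sketch of passing a custom prompt into the retrieval chain.
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQAWithSourcesChain

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Newer langchain_community versions require opting in to pickle deserialization.
store = FAISS.load_local("data/faiss_index", embeddings, allow_dangerous_deserialization=True)

custom_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Answer the question using only the transcript excerpts below.\n\n"
        "{summaries}\n\nQuestion: {question}\nAnswer:"
    ),
)

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOllama(model="phi3"),
    retriever=store.as_retriever(),
    chain_type_kwargs={"prompt": custom_prompt},
)
```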
## Troubleshooting

### Common Issues

| Issue | Solution |
|---|---|
| Ollama not detected | Run `ollama serve` in a separate terminal |
| CUDA Out of Memory | Use a smaller model (`phi3` instead of `mistral`) or reduce batch size |
| FFmpeg not found | Install FFmpeg and ensure it's in your `PATH` |
| Slow performance on CPU | Use `phi3` or `tinyllama` with Ollama for GPU acceleration |
| Model download errors | Check internet connection and disk space |
### Advanced

Reducing VRAM Usage:

```python
# In app.py, reduce the context length
llm = ChatOllama(model="phi3", num_ctx=2048)  # default is 4096
```

Faster Transcriptions:

```bash
# Pre-convert to 16 kHz mono WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le input_16k.wav
```

Debug Logs:

- Check `langchain_debug.jsonl` for detailed traces
- Set `LOG_LEVEL=DEBUG` for verbose output
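The `LOG_LEVEL` switch is presumably read in `logging_config.py`; one possible shape for such a module (the real one may differ):

```python
# Sketch of a LOG_LEVEL-aware logger factory.
import logging
import os

def get_logger(name: str = "clipquery") -> logging.Logger:
    level = os.environ.get("LOG_LEVEL", "INFO").upper()
    logging.basicConfig(
        level=getattr(logging, level, logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger(name)
```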
## Local Development

### Environment Variables

```bash
# For HuggingFace models (required for Spaces)
export HUGGINGFACEHUB_API_TOKEN="your_token_here"

# For debugging
export LOG_LEVEL=DEBUG
```
### Running Tests

```bash
# Install test dependencies
pip install pytest pytest-mock

# Run tests
pytest tests/
```
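The contents of `tests/` are not documented here, but a hypothetical test of the clip-padding behaviour could look like this, using the `mocker` fixture from `pytest-mock` and assuming a `clipper.cut_clip(src, start, end, out)` helper like the sketch above:

```python
# tests/test_clipper.py – hypothetical example, not the repo's actual test suite.
import clipper

def test_short_span_padded_to_30_seconds(tmp_path, mocker):
    fake_run = mocker.patch("clipper.subprocess.run")  # avoid invoking real ffmpeg
    clipper.cut_clip("episode.mp3", start=10.0, end=15.0, out=str(tmp_path / "clip.mp3"))
    cmd = fake_run.call_args[0][0]
    duration = float(cmd[cmd.index("-t") + 1])
    assert duration == 30.0  # a 5 s span is padded to the 30 s minimum
```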
### Building for Production

```bash
# Create a standalone executable (using PyInstaller)
pip install pyinstaller
pyinstaller --onefile app.py
```
## License

MIT – do what you want, no warranty.