---
title: ClipQuery
emoji: πŸ”
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
---

# ClipQuery – Ask Questions of Any Podcast / Video and Hear the Answer

ClipQuery turns *any* local audio or video file into a searchable, conversational experience. It automatically transcribes the media, indexes each sentence with embeddings, and lets you ask natural-language questions. It returns:

1. A **30-second audio clip** from the exact place in the media where the answer occurs.
2. The **timestamp** of the clip.
3. A live **LangChain debug log** so you can inspect what happened behind the scenes.

---

## How It Works

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   transcribe & segment  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  audio / mp4  β”‚ ──────────────────────▢ β”‚  transcripts  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚                                         β”‚
        β”‚ build embeddings (SBERT)                β”‚ metadata: {start, end}
        β–Ό                                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  store vectors β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HuggingFace          │───────────────▢│ FAISS VectorStore  β”‚
β”‚ Sentence-Transformer β”‚                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                          β–²
                                                  β”‚ retrieve top-k
                                                  β–Ό
                                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                        β”‚ ChatOllama (phi3)  β”‚
                                        β”‚ RetrievalQA chain  β”‚
                                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

1. **Transcription** – `index_builder.py` uses `faster-whisper` to transcribe the media with word-level timestamps; the segments are saved as `segments.json`.
2. **Embedding + Index** – Sentence-Transformer (MiniLM) embeddings are stored in a **FAISS** index (`data/*`).
3. **Question Answering** – A local LLM (Ollama `phi3`) is wrapped in `RetrievalQAWithSourcesChain` to pull the most relevant transcript chunks and generate an answer.
4. **Clip Extraction** – `clipper.py` calls `ffmpeg` to cut a 30 s MP3 between the `start` and `end` timestamps (extended to 30 s if shorter).
5. **Debug Logging** – A custom `JSONLCallbackHandler` dumps every LangChain event to `langchain_debug.jsonl`; the Gradio UI streams it live in the **Debug Log** tab.
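
To make steps 1–2 concrete, the sketch below shows one way transcription and indexing could be wired together. It is illustrative only: the function name `build_index`, the `"base"` Whisper model size, and the exact paths are assumptions, not the actual `index_builder.py` implementation.

```python
# Illustrative sketch of the transcribe-and-index steps (not the real index_builder.py).
import json
from pathlib import Path

from faster_whisper import WhisperModel
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


def build_index(audio_path: str, data_dir: str = "data") -> None:
    # 1. Transcribe with timestamps (the "base" model size is an assumption).
    model = WhisperModel("base", compute_type="int8")
    segments, _info = model.transcribe(audio_path, word_timestamps=True)

    records = [
        {"text": seg.text.strip(), "start": seg.start, "end": seg.end}
        for seg in segments
    ]

    # 2. Save the raw segments so the clip-extraction step can use the timestamps later.
    Path(data_dir).mkdir(exist_ok=True)
    with open(f"{data_dir}/segments.json", "w") as fh:
        json.dump(records, fh)

    # 3. Embed each segment and persist a FAISS index alongside it.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    store = FAISS.from_texts(
        [r["text"] for r in records],
        embeddings,
        metadatas=[{"start": r["start"], "end": r["end"]} for r in records],
    )
    store.save_local(f"{data_dir}/faiss_index")
```

Keeping `{start, end}` in the FAISS metadata is what lets an answer be mapped back to a 30-second clip afterwards.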
---

## Installation

### Prerequisites

- Python 3.9+ (3.10 recommended)
- FFmpeg (for audio processing)
- For GPU acceleration: a CUDA-compatible GPU (optional but recommended)

### Quick Start (CPU/Spaces Mode)

```bash
# 1. Set up the environment (after cloning the repo)
python -m venv .venv && source .venv/bin/activate  # Linux/macOS
# OR on Windows: .venv\Scripts\activate
pip install -r requirements.txt

# 2. Run the app (uses flan-t5-base by default)
python app.py
```

### Local GPU Setup (Optional)

For better performance with local models:

1. **Install Ollama**

   ```bash
   # macOS/Linux
   curl -fsSL https://ollama.com/install.sh | sh

   # Windows: Download from https://ollama.com/download
   ```

2. **Download Models** (pick one)

   ```bash
   # Small & fast (4GB VRAM+)
   ollama pull phi3

   # Larger & more capable (8GB VRAM+)
   ollama pull mistral

   # Start Ollama in the background
   ollama serve &
   ```

3. **Run with Local Model**

   ```bash
   # The app automatically detects Ollama if it is running
   python app.py
   ```

### FFmpeg Setup

`clipper.py` requires FFmpeg for clip extraction, so make sure it is installed and on your PATH:

```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows (with Chocolatey)
choco install ffmpeg
```

---

## Usage

1. **Launch the App**

   ```bash
   python app.py
   ```

   This starts a local web server at http://127.0.0.1:7860.

2. **First Run**
   - The first time you run with a new model, it will download the necessary files (1-5GB for `flan-t5-base`).
   - Subsequent starts will be faster as files are cached.

3. **Using the App**
   1. **Upload** any audio/video file (mp3, mp4, etc.)
   2. Select a model:
      - For CPU/Spaces: `flan-t5-base`
      - For local GPU: `phi3` or `tinyllama` (requires Ollama)
   3. Ask a question in the **Ask** tab
   4. The app will:
      - Transcribe the media (first time only)
      - Find the most relevant 30-second clip
      - Play the audio and show the timestamp

4. **Debugging**
   - Check the terminal for transcription progress
   - View the **Debug Log** tab for detailed LLM interactions
   - Logs are saved to `langchain_debug.jsonl`

---

## Project Layout

```
β”œβ”€β”€ app.py            # Gradio UI and orchestration
β”œβ”€β”€ clipper.py        # ffmpeg clip extraction helper
β”œβ”€β”€ index_builder.py  # transcription + FAISS index builder
β”œβ”€β”€ qa_engine.py      # load index, build RetrievalQA chain, JSONL logging
β”œβ”€β”€ logging_config.py # basic logger
β”œβ”€β”€ requirements.txt
└── README.md
```

Generated artifacts:

* `downloads/audio.mp3` – copy of the uploaded audio
* `data/faiss_index*` – FAISS vector store
* `data/segments.json` – transcript chunks with timestamps
* `langchain_debug.jsonl` – streaming debug log

---

## Customising

* **Change minimum clip length** – Modify the `MIN_CLIP_SEC` logic in `app.py` (currently hard-coded to 30 s).
* **Use a different LLM** – Change the `ChatOllama(model=...)` argument in `qa_engine.py` (any Ollama-served model works).
* **Prompt template** – Supply `chain_type_kwargs={"prompt": custom_prompt}` when calling `RetrievalQAWithSourcesChain.from_chain_type` (see the sketch in the appendix at the end of this README).
* **Rotate / clean logs** – Delete `langchain_debug.jsonl`; it will be recreated on the next query.

---

## Troubleshooting

### Common Issues

| Issue | Solution |
|-------|----------|
| **Ollama not detected** | Run `ollama serve` in a separate terminal |
| **CUDA Out of Memory** | Use a smaller model (`phi3` instead of `mistral`) or reduce the context length (see **Advanced** below) |
| **FFmpeg not found** | Install FFmpeg and ensure it's in your PATH |
| **Slow performance on CPU** | Use `phi3` or `tinyllama` with Ollama for GPU acceleration |
| **Model download errors** | Check internet connection and free disk space |

### Advanced

- **Reducing VRAM Usage**:

  ```python
  # In qa_engine.py, reduce the context length
  llm = ChatOllama(model="phi3", num_ctx=2048)  # default is 4096
  ```

- **Faster Transcriptions**:

  ```bash
  # Pre-convert to 16 kHz mono WAV
  ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le input_16k.wav
  ```

- **Debug Logs**:
  - Check `langchain_debug.jsonl` for detailed traces
  - Set `LOG_LEVEL=DEBUG` for verbose output

## Local Development

### Environment Variables

```bash
# For HuggingFace models (required for Spaces)
export HUGGINGFACEHUB_API_TOKEN="your_token_here"

# For debugging
export LOG_LEVEL=DEBUG
```

### Running Tests

```bash
# Install test dependencies
pip install pytest pytest-mock

# Run tests
pytest tests/
```

### Building for Production

```bash
# Create a standalone executable (using PyInstaller)
pip install pyinstaller
pyinstaller --onefile app.py
```

---

## License

MIT – do what you want, no warranty.
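
## Appendix: Custom Prompt Sketch

The **Customising** section mentions passing a custom prompt via `chain_type_kwargs`. The snippet below is a minimal, hedged sketch of how that could look against a saved index; the prompt wording, the `k=4` retriever setting, the index path, and the `allow_dangerous_deserialization` flag (required by recent `langchain-community` releases when loading a pickled FAISS docstore) are assumptions, not the actual `qa_engine.py` code.

```python
# Sketch only: wiring a custom prompt into RetrievalQAWithSourcesChain.
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# The "stuff" chain injects the retrieved chunks as {summaries}.
custom_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Answer the question using only the transcript excerpts below.\n\n"
        "Transcript excerpts:\n{summaries}\n\n"
        "Question: {question}\nAnswer:"
    ),
)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.load_local(
    "data/faiss_index", embeddings, allow_dangerous_deserialization=True
)

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOllama(model="phi3"),
    chain_type="stuff",
    retriever=store.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": custom_prompt},
)

result = chain.invoke({"question": "What topic opens the episode?"})
print(result["answer"], result["sources"])
```

A custom prompt for the `stuff` chain must keep both the `{summaries}` and `{question}` placeholders, since those are the variables the chain fills in at query time.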