---
title: ClipQuery
emoji: πŸ”
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
---
# ClipQuery – Ask Questions of Any Podcast / Video and Hear the Answer
ClipQuery turns *any* local audio or video file into a searchable, conversational experience.
It automatically transcribes the media, indexes each sentence with embeddings, and lets you
ask natural-language questions. For each question it returns:
1. A **30-second audio clip** from the exact place in the media where the answer occurs.
2. The **timestamp** of the clip.
3. A live **LangChain debug log** so you can inspect what happened behind the scenes.
---
## How It Works
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           audio / mp4            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚ transcribe & segment (faster-whisper)
                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ transcripts + {start, end} times β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚ build embeddings (HuggingFace Sentence-Transformers, SBERT MiniLM)
                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚        FAISS VectorStore         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚ retrieve top-k chunks
                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚        ChatOllama (phi3)         β”‚
β”‚        RetrievalQA chain         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
1. **Transcription** – `index_builder.py` uses `faster-whisper` to generate
word-level timestamps, saved as `segments.json`.
2. **Embedding + Index** – Sentence-Transformers (MiniLM) embeddings are
   stored in a **FAISS** index (`data/*`).
3. **Question Answering** – A local LLM (Ollama `phi3`) is wrapped in
   `RetrievalQAWithSourcesChain` to pull the most relevant transcript
   chunks and generate an answer (steps 1–3 are sketched in code below).
4. **Clip Extraction** – `clipper.py` calls `ffmpeg` to cut a 30 s MP3
   between the `start` and `end` timestamps (extended to 30 s if shorter);
   see the `ffmpeg` sketch below.
5. **Debug Logging** – A custom `JSONLCallbackHandler` dumps every
LangChain event to `langchain_debug.jsonl`; the Gradio UI streams it
live in the **Debug Log** tab.
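
The pieces above fit together roughly as in the following sketch. It is illustrative only: the variable names, file paths, and defaults (Whisper model size, `k`, index location) are assumptions, not the actual code in `index_builder.py` and `qa_engine.py`.
```python
# Illustrative end-to-end sketch; the real modules differ in structure and defaults.
from faster_whisper import WhisperModel
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_core.documents import Document
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Transcribe with segment-level timestamps (faster-whisper)
whisper = WhisperModel("base", device="cpu", compute_type="int8")
segments, _info = whisper.transcribe("downloads/audio.mp3")
docs = [
    Document(
        page_content=seg.text,
        metadata={"start": seg.start, "end": seg.end, "source": f"{seg.start:.1f}s"},
    )
    for seg in segments
]

# 2. Embed every chunk and store the vectors in a FAISS index
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.from_documents(docs, embeddings)
index.save_local("data/faiss_index")

# 3. Wrap a local Ollama model in a retrieval QA chain
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOllama(model="phi3"),
    retriever=index.as_retriever(search_kwargs={"k": 4}),
)
result = chain.invoke({"question": "What does the speaker say about pricing?"})
print(result["answer"], result["sources"])
```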
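
Step 4 boils down to an `ffmpeg` seek-and-cut. A minimal sketch of the idea (the function name and flags are assumptions, not the actual implementation of `clipper.py`):
```python
import subprocess

def cut_clip(src: str, start: float, end: float, out: str = "clip.mp3",
             min_len: float = 30.0) -> str:
    """Cut an MP3 starting at `start`, extending the clip to at least `min_len` seconds."""
    duration = max(end - start, min_len)
    subprocess.run(
        ["ffmpeg", "-y",           # overwrite the output file if it exists
         "-ss", f"{start:.2f}",    # seek before decoding (fast)
         "-i", src,
         "-t", f"{duration:.2f}",  # clip length
         "-q:a", "2",              # decent VBR MP3 quality
         out],
        check=True,
    )
    return out

cut_clip("downloads/audio.mp3", start=123.4, end=131.0)
```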
---
## Installation
### Prerequisites
- Python 3.9+ (3.10 recommended)
- FFmpeg (for audio processing)
- For GPU acceleration: CUDA-compatible GPU (optional but recommended)
### Quick Start (CPU/Spaces Mode)
```bash
# 1. Clone the repo, then create a virtual environment and install dependencies
python -m venv .venv && source .venv/bin/activate # Linux/macOS
# OR on Windows: .venv\Scripts\activate
pip install -r requirements.txt
# 2. Run the app (uses flan-t5-base by default)
python app.py
```
### Local GPU Setup (Optional)
For better performance with local models:
1. **Install Ollama**
```bash
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download from https://ollama.com/download
```
2. **Download Models** (pick one)
```bash
# Small & fast (4GB VRAM+)
ollama pull phi3
# Larger & more capable (8GB VRAM+)
ollama pull mistral
# Start Ollama in the background
ollama serve &
```
3. **Run with Local Model**
```bash
# The app will automatically detect Ollama if running
python app.py
```
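
The detection can be as simple as a health check: Ollama's HTTP API listens on `localhost:11434` by default. The snippet below is a sketch of that idea, not the actual logic in `app.py`; the fallback model name is only an example.
```python
import requests

def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama server responds on its default port."""
    try:
        return requests.get(f"{base_url}/api/tags", timeout=2).status_code == 200
    except requests.RequestException:
        return False

# Fall back to the CPU-friendly HuggingFace model when Ollama is not running.
model_name = "phi3" if ollama_available() else "google/flan-t5-base"
print(f"Selected model: {model_name}")
```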
### FFmpeg Setup
`clipper.py` shells out to FFmpeg, so it must be installed and on your `PATH`:
```bash
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# Windows (with Chocolatey)
choco install ffmpeg
```
---
## Usage
1. **Launch the App**
```bash
python app.py
```
This starts a local web server at http://127.0.0.1:7860
2. **First Run**
- The first time you run with a new model, it will download the necessary files (roughly 1 GB for `flan-t5-base`).
- Subsequent starts will be faster as files are cached.
3. **Using the App**
1. **Upload** any audio/video file (mp3, mp4, etc.)
2. Select a model:
- For CPU/Spaces: `flan-t5-base`
- For local GPU: `phi3` or `tinyllama` (requires Ollama)
3. Ask a question in the **Ask** tab
4. The app will:
- Transcribe the media (first time only)
- Find the most relevant 30-second clip
- Play the audio and show the timestamp
4. **Debugging**
- Check the terminal for transcription progress
- View the **Debug Log** tab for detailed LLM interactions
- Logs are saved to `langchain_debug.jsonl`
---
## Project Layout
```
β”œβ”€β”€ app.py # Gradio UI and orchestration
β”œβ”€β”€ clipper.py # ffmpeg clip extraction helper
β”œβ”€β”€ index_builder.py # transcription + FAISS index builder
β”œβ”€β”€ qa_engine.py # load index, build RetrievalQA chain, JSONL logging
β”œβ”€β”€ logging_config.py # basic logger
β”œβ”€β”€ requirements.txt
└── README.md
```
Generated artifacts:
* `downloads/audio.mp3` – copy of uploaded audio
* `data/faiss_index*` – FAISS vector store
* `data/segments.json` – transcript chunks with timestamps
* `langchain_debug.jsonl` – streaming debug log
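
If you want to inspect the transcript outside the app, `data/segments.json` is plain JSON. The field names below are an assumption based on the metadata described above; check the file itself to confirm.
```python
import json

with open("data/segments.json") as f:
    segments = json.load(f)

# Print the first few chunks with their (assumed) start/end timestamps.
for seg in segments[:5]:
    print(f'{seg.get("start", "?")}s-{seg.get("end", "?")}s  {seg.get("text", "")[:60]}')
```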
---
## Customising
* **Change minimum clip length** – Modify `MIN_CLIP_SEC` logic in
`app.py` (currently hard-coded to 30 s).
* **Use a different LLM** – Change the `ChatOllama(model=...)` argument
in `qa_engine.py` (any Ollama-served model works).
* **Prompt template** – Supply `chain_type_kwargs={"prompt": custom_prompt}`
  when calling `RetrievalQAWithSourcesChain.from_chain_type` (sketched below).
* **Rotate / clean logs** – Delete `langchain_debug.jsonl`; it will be
recreated on the next query.
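
A minimal sketch of the prompt-template customisation. It assumes the `stuff` chain type and an index saved at `data/faiss_index`; the `allow_dangerous_deserialization` flag is only needed on newer LangChain versions.
```python
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.load_local("data/faiss_index", embeddings,
                         allow_dangerous_deserialization=True)  # required by newer LangChain

# The "stuff" variant of the sources chain expects `summaries` and `question` variables.
custom_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Answer using only the transcript excerpts below.\n\n"
        "{summaries}\n\n"
        "Question: {question}\nAnswer:"
    ),
)

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOllama(model="phi3"),
    retriever=index.as_retriever(),
    chain_type="stuff",
    chain_type_kwargs={"prompt": custom_prompt},
)
```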
---
## Troubleshooting
### Common Issues
| Issue | Solution |
|-------|----------|
| **Ollama not detected** | Run `ollama serve` in a separate terminal |
| **CUDA Out of Memory** | Use a smaller model (`phi3` instead of `mistral`) or reduce batch size |
| **FFmpeg not found** | Install FFmpeg and ensure it's in your PATH |
| **Slow performance on CPU** | Use `phi3` or `tinyllama` with Ollama for GPU acceleration |
| **Model download errors** | Check internet connection and disk space |
### Advanced
- **Reducing VRAM Usage**:
```python
# In app.py, reduce context length
llm = ChatOllama(model="phi3", num_ctx=2048) # default is 4096
```
- **Faster Transcriptions**:
```bash
# Pre-convert to 16kHz mono WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le input_16k.wav
```
- **Debug Logs**:
  - Check `langchain_debug.jsonl` for detailed traces (see the snippet below)
- Set `LOG_LEVEL=DEBUG` for verbose output
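
Each line of `langchain_debug.jsonl` is an independent JSON object, so it is easy to inspect from a script, for example:
```python
import json

# Pretty-print every logged LangChain event, truncating very long payloads.
with open("langchain_debug.jsonl") as f:
    for line in f:
        event = json.loads(line)
        print(json.dumps(event, indent=2)[:500])
```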
## Local Development
### Environment Variables
```bash
# For HuggingFace models (required for Spaces)
export HUGGINGFACEHUB_API_TOKEN="your_token_here"
# For debugging
export LOG_LEVEL=DEBUG
```
### Running Tests
```bash
# Install test dependencies
pip install pytest pytest-mock
# Run tests
pytest tests/
```
### Building for Production
```bash
# Create a standalone executable (using PyInstaller)
pip install pyinstaller
pyinstaller --onefile app.py
```
---
## License
MIT – do what you want, no warranty.