---
title: Lets Talk
emoji: 🐨
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---

# Welcome to TheDataGuy Chat! 👋

This is a Q&A chatbot powered by posts from [TheDataGuy blog](https://thedataguy.pro/blog/). Ask questions about topics covered in the blog, such as:

- RAGAS and RAG evaluation
- Building research agents
- Metric-driven development
- Data science best practices

## How it works

Under the hood, this application uses:

1. **Snowflake Arctic Embeddings**: To convert text into vector representations
   - Base model: `Snowflake/snowflake-arctic-embed-l`
   - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)
2. **Qdrant Vector Database**: To store and search for similar content
   - Efficiently indexes blog post content for fast semantic search
   - Supports real-time updates when new blog posts are published
3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
   - Primary model: OpenAI `gpt-4o-mini` for production inference
   - Evaluation model: OpenAI `gpt-4.1` for complex tasks, including synthetic data generation and evaluation
4. **LangChain**: For building the RAG workflow
   - Orchestrates the retrieval and generation components
   - Provides flexible components for LLM application development
   - Structured for easy maintenance and future enhancements
5. 
   **Chainlit**: For the chat interface
   - Offers an interactive UI with message threading
   - Supports file uploads and custom components

## Technology Stack

### Core Components

- **Vector Database**: Qdrant (stores embeddings via `pipeline.py`)
- **Embedding Model**: Snowflake Arctic Embeddings
- **LLM**: OpenAI GPT-4o-mini
- **Framework**: LangChain + Chainlit
- **Development Language**: Python 3.13

### Advanced Features

- **Evaluation**: Ragas metrics for evaluating RAG performance:
  - Faithfulness
  - Context Relevancy
  - Answer Relevancy
  - Topic Adherence
- **Synthetic Data Generation**: For training and testing
- **Vector Store Updates**: Automated pipeline to update when new blog content is published
- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content

## Project Structure

```
lets-talk/
├── data/                 # Raw blog post content
├── py-src/               # Python source code
│   ├── lets_talk/        # Core application modules
│   │   ├── agent.py      # Agent implementation
│   │   ├── config.py     # Configuration settings
│   │   ├── models.py     # Data models
│   │   ├── prompts.py    # LLM prompt templates
│   │   ├── rag.py        # RAG implementation
│   │   ├── rss_tool.py   # RSS feed integration
│   │   └── tools.py      # Tool implementations
│   ├── app.py            # Main application entry point
│   └── pipeline.py       # Data processing pipeline
├── db/                   # Vector database storage
├── evals/                # Evaluation datasets and results
└── notebooks/            # Jupyter notebooks for analysis
```

## Environment Setup

The application requires the following environment variables:

```
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```

Additional configuration options for vector database creation:

```
# Vector Database Creation Configuration
FORCE_RECREATE=False    # Whether to force recreation of the vector store
OUTPUT_DIR=./stats      # Directory to save stats and artifacts
USE_CHUNKING=True       # Whether to split documents into chunks
SHOULD_SAVE_STATS=True  # Whether to save statistics about the documents
```

## Running Locally

### Using Docker

```bash
docker build -t lets-talk .
docker run -p 7860:7860 \
  --env-file ./.env \
  lets-talk
```

### Using Python

```bash
# Install dependencies
uv sync

# Run the application
chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```

## Deployment

The application is designed to be deployed on:

- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk))
- **Production**: Azure Container Apps (planned)

## Evaluation

This project includes extensive evaluation capabilities using the Ragas framework:

- **Synthetic Data Generation**: For creating test datasets
- **Metric Evaluation**: Measuring faithfulness, relevance, and more
- **Fine-tuning Analysis**: Comparing different embedding models

## Future Enhancements

- **Agentic Reasoning**: Adding more sophisticated agent capabilities
- **Web UI Integration**: Custom Svelte component for the blog
- **CI/CD**: GitHub Actions workflow for automated deployment
- **Monitoring**: LangSmith integration for observability

## License

This project is available under the MIT License.

## Acknowledgements

- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content
- [Ragas](https://docs.ragas.io/) for the evaluation framework
- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for RAG components
- [Chainlit](https://docs.chainlit.io/) for the chat interface

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
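## Appendix: RAG flow in miniature

The retrieve-then-generate flow described under "How it works" can be sketched in plain Python. This is a toy illustration, not the app's actual code: character-count vectors stand in for the Arctic embedding model, an in-memory list stands in for Qdrant, and a stubbed function stands in for the GPT-4o-mini call.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for the embedding model: a bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    # Similarity measure used to rank stored chunks against the query.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# "Vector store": blog chunks indexed by their embeddings (Qdrant stand-in).
chunks = [
    "RAGAS provides metrics for evaluating RAG pipelines.",
    "Qdrant is a vector database for semantic search.",
    "Chainlit builds chat interfaces for LLM apps.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank all stored chunks by similarity to the query and keep the top k.
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

def answer(query: str) -> str:
    # Stubbed "LLM": in the real app, GPT-4o-mini receives the retrieved
    # context plus the question via a LangChain prompt template.
    context = " ".join(retrieve(query))
    return f"Based on the blog: {context}"
```

In the actual application, LangChain orchestrates these same three steps (embed the query, search Qdrant, prompt the model with the retrieved context) inside `rag.py`.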