---
title: AI-Powered Data Analyst Assistant
emoji: 📊
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
---
# 📊 AI-Powered Data Analyst Assistant
### Empowering Business Users with Self-Service Analytics via Conversational AI
---
## 📋 Project Overview
The **AI-Powered Data Analyst Assistant** is a cutting-edge conversational AI application designed to revolutionize how business users interact with complex sales data. Instead of relying on technical teams for custom reports, users can simply ask questions in natural language, and the assistant will retrieve precise answers, either by querying a database or providing definitions from a business knowledge base.
This project showcases a sophisticated multi-agent architecture leveraging **Large Language Models (LLMs)**, **Retrieval Augmented Generation (RAG)**, and **fine-tuning** to deliver accurate, context-aware, and actionable insights, bridging the gap between raw data and business understanding.
Link to Project Demo: https://huggingface.co/spaces/DakshC/Data_Analyst_Assistant
Link to Project Demo Video: https://www.dropbox.com/scl/fi/7lo9j75h36ic5xipqt7gq/AI_Data_Analyst_Assistant_Demo.mp4?rlkey=2suvkn076goi52nsrokz407eq&st=zmtyyg3u&dl=0
## 🗄️ Database Schema & Data Scale
The application operates over a transactional sales database, meticulously designed and populated with realistic dummy data to simulate a diverse business environment. This ensures the agent's analytical capabilities are tested against varied scenarios.
**Data Scale:**
* **Sales Records:** Over **15,000** individual sales transactions.
* **Customers:** **20** unique customer profiles.
* **Products:** **13** distinct products across **4** categories (Electronics, Furniture, Stationery, Apparel).
* **Regions:** **5** geographical sales regions (North, South, East, West, Central).
* **Temporal Coverage:** Sales data spans the **last 2 years**, with a bias towards recent activity and seasonal trends for realistic time-series analysis.
**Core Tables Overview:**
### `regions` Table
Stores information about geographical sales regions.
| region_id | region_name |
| :-------- | :---------- |
| 1 | North |
| 2 | South |
| 3 | East |
| 4 | West |
| 5 | Central |
### `products` Table
Contains details about products sold.
| product_id | product_name | category | price |
| :--------- | :-------------------------- | :---------- | :------- |
| 101 | Laptop Basic | Electronics | 800.00 |
| 102 | Laptop Pro | Electronics | 1500.00 |
| 103 | Smartphone X | Electronics | 700.00 |
| 104 | Smartwatch | Electronics | 250.00 |
| 201 | Office Desk Standard | Furniture | 200.00 |
| ... | | | |
### `customers` Table
Stores customer information and their assigned region.
| customer_id | customer_name | email                      | region_id |
| :---------- | :------------ | :------------------------- | :-------- |
| 1           | Alice Smith   | alice.smith@example.com    | 1         |
| 2           | Bob Johnson   | bob.johnson@example.com    | 3         |
| 3           | Charlie Brown | charlie.brown@example.com  | 1         |
| 4           | Diana Prince  | diana.prince@example.com   | 5         |
| 5           | Eve Adams     | eve.adams@example.com      | 4         |
| ...         |               |                            |           |
### `sales` Table
Records individual sales transactions, linking products, customers, and regions by ID.
| sale_id | product_id | customer_id | region_id | sale_date | quantity | amount |
| :------ | :--------- | :---------- | :-------- | :--------- | :------- | :-------- |
| 1 | 401 | 16 | 2 | 2025-06-21 | 2 | 1600.00 |
| 2 | 103 | 1 | 1 | 2025-06-15 | 1 | 700.00 |
| 3 | 201 | 10 | 1 | 2025-05-30 | 3 | 600.00 |
| 4 | 302 | 12 | 4 | 2025-06-01 | 5 | 150.00 |
| 5 | 405 | 15 | 5 | 2025-06-25 | 1 | 110.00 |
| ... | | | | | | |
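To make these relationships concrete, the query below joins `sales` back to the `regions` dimension table to compute revenue per region. It assumes the database file produced by `src/setup_database.py` (see Setup below); this is an illustrative sketch, not code from the project itself.

```python
import sqlite3

# Total revenue per region, joining sales to the regions dimension table.
sql = """
SELECT r.region_name, SUM(s.amount) AS total_revenue
FROM sales s
JOIN regions r ON s.region_id = r.region_id
GROUP BY r.region_name
ORDER BY total_revenue DESC;
"""
with sqlite3.connect("data/sales_database.db") as conn:
    for region, revenue in conn.execute(sql):
        print(f"{region}: {revenue:,.2f}")
```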
## ✨ Practical Use Cases & Impact
* **Self-Service Business Intelligence (BI):** Empowers non-technical users to access and analyze data independently, reducing reliance on data teams.
* **Faster Decision Making:** Get real-time answers to data-related questions, enabling quicker, data-driven decisions.
* **Improved Data Accessibility:** Makes complex database schemas and business metrics understandable and queryable for a wider audience.
* **Enhanced Productivity:** Automates routine data retrieval tasks, freeing up data analysts for more strategic work.
* **Democratization of Data:** Fosters a data-literate culture by making data exploration intuitive and conversational.
## 🛠️ Technical Proficiencies & Highlights
This project demonstrates expertise across a range of advanced AI and data engineering concepts:
* **Multi-Agent Architecture:**
    * **Orchestrator Agent:** Uses the **LlamaIndex ReAct Agent** framework (Reasoning and Acting) to route user queries to the appropriate specialized sub-agent (NL-to-SQL or KPI Explainer) based on intent detection, showcasing complex agentic workflow design. A minimal routing sketch appears after this list.
    * **Specialized Agents:** Distinct agents for distinct tasks (data querying, KPI explanation), ensuring modularity and focused expertise.
* **Natural Language to SQL (NL2SQL):**
    * **Fine-tuning (LoRA on Nebius AI):** A **Meta-Llama-3.1-8B-Instruct** model was fine-tuned with LoRA on a custom dataset (`nl_sql_finetune_dataset.jsonl`) of ~70 natural language questions paired with their corresponding **SQLite** queries, demonstrating specialized model adaptation for domain-specific accuracy. An illustrative record format is sketched after this list.
    * **Synthetic Data Generation:** The NL-to-SQL dataset itself was partially generated using conversational AI tools such as **ChatGPT**.
    * **Robust SQL Execution:** The `SqlExecutorTool` securely executes only `SELECT` queries against the SQLite database, returning results in a structured format (CSV) for LLM interpretation.
* **Retrieval Augmented Generation (RAG):**
    * **Schema RAG:** The `SchemaRetrieverTool` performs RAG over `data_dictionary.md` to supply real-time, relevant database schema context (tables, columns, relationships, descriptions) to the NL-to-SQL agent, keeping generated SQL accurate and schema-aware even for complex joins.
    * **KPI RAG:** A dedicated RAG component over `kpi_definitions.md` lets the KPI Answering Agent provide precise definitions and calculation methods for key business metrics.
    * **Vector Database:** **ChromaDB** serves as the persistent vector store for indexing and retrieving relevant information by semantic similarity.
    * **Embedding Models:** Leverages **Nebius AI Embeddings** (`BAAI/bge-en-icl`) to convert natural language and knowledge-base content into high-dimensional vectors for semantic search. An indexing sketch appears after this list.
* **Prompt Engineering:**
    * Extensive custom system prompts guide each agent's reasoning, tool usage, and response generation, enforcing `Thought-Action-Observation` loops and specific output requirements.
* **Data Management & Orchestration:**
    * **SQLite Database:** Backend for the sales data, demonstrating proficiency in database interaction.
    * **Pandas:** Used for efficient data manipulation and for preparing results for LLM consumption.
    * **Dynamic Data Generation:** A robust `setup_database.py` script generates realistic dummy sales data with temporal, regional, and product-specific biases, crucial for comprehensive testing of analytical queries.
* **Python Development:** Clean, modular, well-structured Python codebase following best practices for agent development and dependency management (`requirements.txt`, `.env`).
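As a concrete illustration of the routing described above, here is a minimal sketch of wrapping a `SELECT`-only SQL tool and handing it to a LlamaIndex ReAct agent. The helper names (`run_select_query`, `build_llm`) are illustrative, not the project's actual identifiers; `build_llm` stands in for the centralized initialization in `src/agents/agent_models/models.py`.

```python
import sqlite3

import pandas as pd
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

def run_select_query(sql: str) -> str:
    """Execute a read-only SELECT against the sales database and return CSV."""
    if not sql.strip().lower().startswith("select"):
        return "Error: only SELECT statements are allowed."
    with sqlite3.connect("data/sales_database.db") as conn:
        return pd.read_sql_query(sql, conn).to_csv(index=False)

sql_tool = FunctionTool.from_defaults(
    fn=run_select_query,
    name="sql_executor",
    description="Runs a single SQLite SELECT query and returns CSV results.",
)

llm = build_llm()  # hypothetical helper: the Nebius-hosted fine-tuned Llama model
agent = ReActAgent.from_tools([sql_tool], llm=llm, verbose=True)
print(agent.chat("What is the total number of sales?"))
```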
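The exact layout of `nl_sql_finetune_dataset.jsonl` is not reproduced in this README. Assuming a simple prompt/completion record format (an assumption; the real field names may differ), one training example might look like this:

```python
import json

# Hypothetical training record; the dataset's actual field names may differ.
record = {
    "prompt": "Which customer has the most sales in the past one month?",
    "completion": (
        "SELECT c.customer_name, COUNT(*) AS num_sales "
        "FROM sales s JOIN customers c ON s.customer_id = c.customer_id "
        "WHERE s.sale_date >= DATE('now', '-1 month') "
        "GROUP BY c.customer_name ORDER BY num_sales DESC LIMIT 1;"
    ),
}
print(json.dumps(record))  # one line of the .jsonl file
```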
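Finally, a sketch of indexing the schema knowledge base into the persistent ChromaDB store and querying it. The collection name and the `build_embed_model` helper are assumptions, not the project's actual identifiers; `build_embed_model` stands in for initializing the Nebius `BAAI/bge-en-icl` embedding model.

```python
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persist embeddings for the schema dictionary in ChromaDB.
client = chromadb.PersistentClient(path="chroma_db_schema")
collection = client.get_or_create_collection("schema_docs")  # name is illustrative
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

docs = SimpleDirectoryReader(
    input_files=["knowledge_base/schema/data_dictionary.md"]
).load_data()
embed_model = build_embed_model()  # hypothetical helper for Nebius embeddings
index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, embed_model=embed_model
)

# Retrieve the schema snippets most relevant to a user question.
retriever = index.as_retriever(similarity_top_k=3)
for hit in retriever.retrieve("Which columns link sales to customers?"):
    print(hit.get_content()[:200])
```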
## 📁 Project Structure
```
AI_DATA_ANALYST_ASSISTANT/
├── chroma_db_schema/                  # Persistent ChromaDB for schema context
├── chroma_db_kpi/                     # Persistent ChromaDB for KPI definitions
├── data/
│   └── sales_database.db              # SQLite database for sales data
├── fine-tuning/
│   ├── model_checkpoints/             # Saved model states from fine-tuning
│   ├── finetune_script.py             # Script for fine-tuning NL-to-SQL LLM
│   ├── finetuned_model_deployment.py  # Script to deploy fine-tuned model
│   └── nl_sql_finetune_dataset.jsonl  # Dataset for NL-to-SQL fine-tuning
├── knowledge_base/
│   ├── business_glossary/
│   │   └── kpi_definitions.md         # KPI definitions for RAG
│   └── schema/
│       ├── data_dictionary.md         # Human-readable schema descriptions for RAG
│       └── sales_schema.sql           # SQL DDL for database creation
├── src/
│   ├── agents/
│   │   ├── agent_models/
│   │   │   └── models.py              # Centralized LLM initialization
│   │   ├── agent_tools/
│   │   │   ├── schema_retriever_tool.py  # Tool for retrieving schema context
│   │   │   └── sql_executor_tool.py      # Tool for executing SQL queries
│   │   ├── nl_sql_agent.py            # Core Natural Language to SQL Agent
│   │   └── orchestrator_agent.py      # Routes queries to specialized agents
│   ├── rag_index.py                   # Script to build and persist RAG indexes
│   └── setup_database.py              # Script to set up and populate the database
├── venv/                              # Python virtual environment
├── .env                               # Environment variables (e.g., API keys)
├── .gitignore                         # Git ignore file
├── app.py                             # Main application entry point (e.g., API)
└── requirements.txt                   # Python dependencies
```
## ⚙️ Setup & Installation
To get the AI Data Analyst Assistant up and running locally:
1. **Clone the Repository:**
```bash
git clone https://github.com/YourGitHubUsername/AI_Data_Analyst_Assistant.git
cd AI_Data_Analyst_Assistant
```
2. **Create and Activate Virtual Environment:**
```bash
python -m venv venv
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```
3. **Install Dependencies:**
```bash
pip install -r requirements.txt
pip install tqdm # For progress bars during data generation
```
4. **Set Up Environment Variables:**
* Create a `.env` file in the project root.
* Add your Nebius AI API Key (a minimal loading snippet is shown after these steps):
```
NEBIUS_API_KEY="your_nebius_ai_api_key_here"
```
5. **Initialize the Database:**
This script will create `sales_database.db` and populate it with rich dummy data.
```bash
python src/setup_database.py
```
6. **Build RAG Indexes:**
This will create the `chroma_db_schema` and `chroma_db_kpi` directories with your vectorized knowledge bases.
```bash
python src/rag_index.py
```
7. **Fine-tune (Optional - for advanced development):**
If you're developing the fine-tuned model, run your fine-tuning script. This project assumes you have already fine-tuned and deployed your `meta-llama/Meta-Llama-3.1-8B-Instruct-LoRa:nl-to-sql-finetuned-jbkN` model on Nebius AI as configured in `src/agents/agent_models/models.py`.
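Assuming the scripts read the key with `python-dotenv` (a common pattern; the project's actual loading code may differ), a minimal sanity check looks like:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads NEBIUS_API_KEY from .env in the project root
assert os.getenv("NEBIUS_API_KEY"), "NEBIUS_API_KEY is not set"
```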
## 🚀 How to Use (Demonstration)
You can interact directly with the `NLSQLAgent` for testing purposes:
1. **Start the Agent:**
```bash
python src/agents/nl_sql_agent.py
```
2. **Ask Questions!** Try some of these examples:
* "What is the total number of sales?"
* "What are the names of customers in the North region?"
* "How much revenue did we generate from Electronics products?"
* "Which customer has the most sales in the past one month?"
* "Show me the total revenue for each month over the last six months, ordered by month."
* "Which are our top 5 products by total quantity sold across all time?"
* "What is the average amount per sale for each product category?"
* "How many unique customers have made a purchase in each region over the last year?"
* "What is the total revenue generated by customers from each region?"
* "Which products have not been sold in the last 3 months?" (This tests an advanced SQL pattern and might require further fine-tuning for specific dialect) |