# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a **production-ready GAIA benchmark AI agent** that reports 85% accuracy using a multi-agent architecture. The codebase has been **refactored** into a modular, maintainable package, with the legacy monolithic solver kept alongside it. The agent specializes in complex question answering across multimedia, research, file processing, chess analysis, and mathematical reasoning.

## Development Commands

### Setup and Installation
```bash
# Install dependencies
pip install -r requirements.txt

# Test API key configuration
python test_api_keys.py

# Verify core functionality
python -c "from main import GAIASolver; print('✅ Core GAIASolver available')"
```

### Running the System
```bash
# Run legacy monolithic solver
python main.py

# Run refactored modular solver (recommended)
python main_refactored.py

# Run Gradio web interface
python app.py
```

### Testing Commands
```bash
# Comprehensive async testing
python async_complete_test.py

# Test question classification
python test_improved_classification.py
python final_classification_test.py

# Test YouTube functionality
python direct_youtube_test.py
python simple_youtube_test.py
python test_youtube_question.py

# Test individual components
python -c "from gaia_tools import GAIA_TOOLS; print(f'Available tools: {len(GAIA_TOOLS)}')"
python -c "from question_classifier import QuestionClassifier; c = QuestionClassifier(); print('✅ Classifier ready')"
```

## Architecture Overview

### Dual Architecture Design

This project maintains both **legacy monolithic** and **refactored modular** architectures:

**Legacy Architecture (main.py):**
- Monolithic 1285-line solver with all functionality integrated
- Comprehensive tool collection in gaia_tools.py (4887 lines)
- Single-file approach for rapid development and deployment

**Refactored Architecture (gaia/ package):**
```
gaia/
├── core/           # Main solver logic
│   ├── solver.py           # GAIASolver main class
│   ├── answer_extractor.py # Specialized answer extraction classes
│   └── question_processor.py # Question classification and processing
├── tools/          # Tool implementations  
│   ├── base.py            # Abstract tool interface and registry
│   ├── registry.py        # Tool discovery and management
│   └── [specialized tool modules]
├── models/         # Model providers and management
│   ├── manager.py         # ModelManager with fallback chains
│   └── providers.py       # LiteLLM, Gemini, Kluster providers
├── config/         # Configuration management
│   └── settings.py        # Config, ModelConfig classes
└── utils/          # Utilities and helpers
    ├── exceptions.py      # Custom exception hierarchy
    └── logging.py         # Logging configuration
```
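
As a rough illustration of the "abstract tool interface and registry" idea noted for `gaia/tools/base.py`, the sketch below shows one common way to structure it. The names `BaseTool`, `ToolRegistry`, `EchoTool`, and the `run` method are hypothetical and are not taken from the repository; check the actual module before relying on any of them.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseTool(ABC):
    """Hypothetical tool interface (not the repository's actual class)."""

    name: str = ""

    @abstractmethod
    def run(self, query: str) -> str:
        """Execute the tool and return a text result."""


class ToolRegistry:
    """Hypothetical registry that maps tool names to tool instances."""

    def __init__(self) -> None:
        self._tools: Dict[str, BaseTool] = {}

    def register(self, tool: BaseTool) -> None:
        if not tool.name:
            raise ValueError("tool must define a non-empty name")
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool name: {tool.name}")
        self._tools[tool.name] = tool

    def get(self, name: str) -> BaseTool:
        return self._tools[name]

    def names(self) -> List[str]:
        return sorted(self._tools)


class EchoTool(BaseTool):
    """Trivial example tool used only to exercise the registry."""

    name = "echo"

    def run(self, query: str) -> str:
        return query


registry = ToolRegistry()
registry.register(EchoTool())
print(registry.names())
```

Keeping lookup by name in one place lets a solver receive the registry through its constructor instead of importing tools directly, which is the usual motivation for pairing a registry with dependency injection.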

### Core Components

- **GAIASolver (main.py):** Legacy monolithic solver with 1000+ lines of sophisticated processing logic
- **GAIASolver (gaia/core/solver.py):** Refactored main orchestrator using dependency injection
- **QuestionClassifier:** LLM-based intelligent routing with pattern-based fallbacks
- **GAIA_TOOLS:** 42 specialized tools including enhanced Wikipedia research, chess analysis, Excel processing, and multimedia analysis
- **ModelManager:** Handles model initialization, fallback chains (Kluster.ai → Gemini → Qwen), and lifecycle management

### Question Type Specialization

**Research Questions (92% accuracy):**
- Enhanced Wikipedia tools with date-specific searches and Featured Articles integration
- Multi-step research coordination with cross-validation
- Anti-hallucination safeguards to prevent fabrication

**Chess Questions (100% accuracy):**
- Universal FEN correction system handling any vision error pattern
- Multi-tool consensus system for maximum accuracy
- Perfect algebraic notation extraction
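
A FEN correction step usually needs some way to tell whether a candidate string is structurally sound. The sketch below checks only the piece-placement field of a FEN string; it is an illustrative validity check, not the logic in `universal_fen_correction.py`, and the function name `is_valid_placement` is an assumption.

```python
def is_valid_placement(fen: str) -> bool:
    """Check the piece-placement field of a FEN string for structural validity."""
    fields = fen.split()
    if not fields:
        return False
    placement = fields[0]
    ranks = placement.split("/")
    if len(ranks) != 8:
        return False
    for rank in ranks:
        squares = 0
        previous_was_digit = False
        for ch in rank:
            if ch in "12345678":
                if previous_was_digit:
                    return False  # adjacent digits such as "44" are not valid FEN
                squares += int(ch)
                previous_was_digit = True
            elif ch in "pnbrqkPNBRQK":
                squares += 1
                previous_was_digit = False
            else:
                return False  # unknown symbol, e.g. a vision misread
        if squares != 8:
            return False
    # Exactly one king per side.
    return placement.count("K") == 1 and placement.count("k") == 1


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(is_valid_placement(start))
```

A check like this can serve as the accept/reject test when trying candidate corrections of a vision-extracted position.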

**YouTube/Multimedia Questions:**
- Enhanced URL detection with multiple regex patterns
- Forced classification override for YouTube content
- Specialized prompts with explicit tool usage instructions
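
The URL detection and forced override described above could look roughly like the following. The patterns, the `multimedia` label, and the function names `find_youtube_id` and `classify` are assumptions made for illustration; the project's actual regexes and category names may differ.

```python
import re
from typing import Optional

# Illustrative patterns only; the repository's real patterns are not shown here.
YOUTUBE_PATTERNS = [
    re.compile(r"(?:https?://)?(?:www\.|m\.)?youtube\.com/watch\?(?:[^\s#]*&)?v=([\w-]{11})"),
    re.compile(r"(?:https?://)?youtu\.be/([\w-]{11})"),
    re.compile(r"(?:https?://)?(?:www\.)?youtube\.com/(?:shorts|embed)/([\w-]{11})"),
]


def find_youtube_id(text: str) -> Optional[str]:
    """Return the first 11-character video ID found in text, or None."""
    for pattern in YOUTUBE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def classify(question: str, llm_label: str) -> str:
    """Override the LLM's label whenever the question contains a YouTube URL."""
    return "multimedia" if find_youtube_id(question) else llm_label


print(classify("What is said in https://youtu.be/dQw4w9WgXcQ ?", "research"))
```

Running the override after the LLM classifier means a misclassified video question is still routed to the video tools.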

**File Processing (100% accuracy):**
- Format-specific tools for Excel (.xlsx/.xls), Python (.py), text files
- Deterministic Python execution with sandboxed environment
- Financial calculation specialization with proper currency formatting
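
For the currency formatting mentioned above, one reasonable approach is to round with `Decimal` rather than binary floats. The sketch assumes US dollars with two decimal places and half-up rounding; the helper name `format_usd` is hypothetical and not taken from the repository.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_usd(amount) -> str:
    """Format a number as USD with thousands separators and two decimals."""
    # Going through str() avoids carrying binary float noise into the Decimal.
    value = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"${value:,.2f}"


print(format_usd(1234.5))  # $1,234.50
```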

## Environment Configuration

### Required API Keys (set in .env)
- `GEMINI_API_KEY` - Primary model (Gemini 2.0 Flash)
- `HUGGINGFACE_TOKEN` - Fallback model and classification
- `KLUSTER_API_KEY` - Optional premium model access
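
A quick way to see which of these keys are missing is sketched below. This is not the contents of `test_api_keys.py`; it assumes the variables are already exported into the process environment (loading `.env` is left to whatever loader the project uses), and `missing_keys` is a hypothetical helper.

```python
import os

REQUIRED_KEYS = ("GEMINI_API_KEY", "HUGGINGFACE_TOKEN")
OPTIONAL_KEYS = ("KLUSTER_API_KEY",)


def missing_keys(env=None):
    """Return the required keys that are unset or empty in env (default: os.environ)."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing required keys:", ", ".join(missing))
    for key in OPTIONAL_KEYS:
        if not os.environ.get(key):
            print(f"Optional key not set: {key}")
```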

### Model Fallback Chain
1. **Kluster.ai** (Qwen3-235B, Gemma3-27B) - Premium option
2. **Gemini 2.0 Flash** - Primary production model
3. **Qwen 2.5-72B** - Reliable fallback via HuggingFace
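
The chain above can be sketched as "try each provider in order and return the first success". The code below is a minimal illustration with stub providers, not the `ModelManager` implementation; `complete_with_fallback` and the provider callables are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Callable[[str], str]


def complete_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Return (provider_name, response) from the first provider that succeeds."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure moves on to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real API clients.
def kluster_stub(prompt: str) -> str:
    raise ConnectionError("no API key configured")


def gemini_stub(prompt: str) -> str:
    return f"answer to: {prompt}"


print(complete_with_fallback("2 + 2?", [("kluster", kluster_stub), ("gemini", gemini_stub)]))
```

Collecting the per-provider errors makes the final failure message useful when every provider in the chain is unavailable.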

## Key Design Patterns

### Anti-Hallucination Architecture
- **Tool result prioritization**: Always uses exact tool outputs over internal reasoning
- **Cross-validation**: Multiple verification methods for critical information  
- **Source attribution**: Clear tracking and validation of information sources
- **Validation rules**: Type-specific answer extraction and verification

### Performance Optimizations
- **Fresh agent creation** for each question to avoid token accumulation
- **Concurrent processing** support with async operations  
- **15-minute web cache** for improved response times
- **Exponential backoff** for API rate limiting
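
Two of the optimizations above, exponential backoff and a 15-minute cache, are sketched here in generic form. The function and class names, the jitter amount, and the default delays are assumptions for illustration and do not describe the repository's actual implementation.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (capped) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so concurrent workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds (default 15 minutes)."""

    def __init__(self, ttl_seconds=15 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)
```

Passing `sleep` and `clock` as parameters keeps both helpers easy to test without real waiting.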

## File Organization

### Core Files
- `main.py` - Legacy monolithic solver (1285 lines)
- `main_refactored.py` - Entry point for refactored architecture
- `gaia_tools.py` - 42 specialized tools with robust error handling (4887 lines)
- `question_classifier.py` - LLM + pattern-based classification system
- `app.py` - Production Gradio interface with comprehensive error handling

### Supporting Files
- `async_complete_test.py` - Comprehensive async testing infrastructure
- `enhanced_wikipedia_tools.py` - Advanced Wikipedia research capabilities
- `universal_fen_correction.py` - Chess-specific FEN notation correction
- `wikipedia_featured_articles_by_date.py` - Date-specific Wikipedia searches

## Local Configuration Notes
- The Hugging Face token (`HUGGINGFACE_TOKEN`) can be read from the secrets stored in `.env`.