Spaces:
Running
Running
File size: 6,282 Bytes
fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e 30709ab fb96d1e b0fb5c7 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 |
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a **production-ready GAIA benchmark AI agent** achieving 85% accuracy through a sophisticated multi-agent architecture. The system has been **fully refactored** into a modular, maintainable architecture that specializes in complex question answering across multimedia, research, file processing, chess analysis, and mathematical reasoning domains.
## Development Commands
### Setup and Installation
```bash
# Install dependencies
pip install -r requirements.txt
# Test API key configuration
python test_api_keys.py
# Verify core functionality
python -c "from main import GAIASolver; print('β
Core GAIASolver available')"
```
### Running the System
```bash
# Run legacy monolithic solver
python main.py
# Run refactored modular solver (recommended)
python main_refactored.py
# Run Gradio web interface
python app.py
```
### Testing Commands
```bash
# Comprehensive async testing
python async_complete_test.py
# Test question classification
python test_improved_classification.py
python final_classification_test.py
# Test YouTube functionality
python direct_youtube_test.py
python simple_youtube_test.py
python test_youtube_question.py
# Test individual components
python -c "from gaia_tools import GAIA_TOOLS; print(f'Available tools: {len(GAIA_TOOLS)}')"
python -c "from question_classifier import QuestionClassifier; c = QuestionClassifier(); print('β
Classifier ready')"
```
## Architecture Overview
### Dual Architecture Design
This project maintains both **legacy monolithic** and **refactored modular** architectures:
**Legacy Architecture (main.py):**
- Monolithic 1285-line solver with all functionality integrated
- Comprehensive tool collection in gaia_tools.py (4887 lines)
- Single-file approach for rapid development and deployment
**Refactored Architecture (gaia/ package):**
```
gaia/
βββ core/ # Main solver logic
β βββ solver.py # GAIASolver main class
β βββ answer_extractor.py # Specialized answer extraction classes
β βββ question_processor.py # Question classification and processing
βββ tools/ # Tool implementations
β βββ base.py # Abstract tool interface and registry
β βββ registry.py # Tool discovery and management
β βββ [specialized tool modules]
βββ models/ # Model providers and management
β βββ manager.py # ModelManager with fallback chains
β βββ providers.py # LiteLLM, Gemini, Kluster providers
βββ config/ # Configuration management
β βββ settings.py # Config, ModelConfig classes
βββ utils/ # Utilities and helpers
βββ exceptions.py # Custom exception hierarchy
βββ logging.py # Logging configuration
```
### Core Components
**GAIASolver (main.py):** Legacy monolithic solver with 1000+ lines of sophisticated processing logic
**GAIASolver (gaia/core/solver.py):** Refactored main orchestrator using dependency injection
**QuestionClassifier:** LLM-based intelligent routing with pattern-based fallbacks
**GAIA_TOOLS:** 42 specialized tools including enhanced Wikipedia research, chess analysis, Excel processing, and multimedia analysis
**ModelManager:** Handles model initialization, fallback chains (Kluster.ai β Gemini β Qwen), and lifecycle management
### Question Type Specialization
**Research Questions (92% accuracy):**
- Enhanced Wikipedia tools with date-specific searches and Featured Articles integration
- Multi-step research coordination with cross-validation
- Anti-hallucination safeguards to prevent fabrication
**Chess Questions (100% accuracy):**
- Universal FEN correction system handling any vision error pattern
- Multi-tool consensus system for maximum accuracy
- Perfect algebraic notation extraction
**YouTube/Multimedia Questions:**
- Enhanced URL detection with multiple regex patterns
- Forced classification override for YouTube content
- Specialized prompts with explicit tool usage instructions
**File Processing (100% accuracy):**
- Format-specific tools for Excel (.xlsx/.xls), Python (.py), text files
- Deterministic Python execution with sandboxed environment
- Financial calculation specialization with proper currency formatting
## Environment Configuration
### Required API Keys (set in .env)
- `GEMINI_API_KEY` - Primary model (Gemini Flash 2.0)
- `HUGGINGFACE_TOKEN` - Fallback model and classification
- `KLUSTER_API_KEY` - Optional premium model access
### Model Fallback Chain
1. **Kluster.ai** (Qwen3-235B, Gemma3-27B) - Premium option
2. **Gemini Flash 2.0** - Primary production model
3. **Qwen 2.5-72B** - Reliable fallback via HuggingFace
## Key Design Patterns
### Anti-Hallucination Architecture
- **Tool result prioritization**: Always uses exact tool outputs over internal reasoning
- **Cross-validation**: Multiple verification methods for critical information
- **Source attribution**: Clear tracking and validation of information sources
- **Validation rules**: Type-specific answer extraction and verification
### Performance Optimizations
- **Fresh agent creation** for each question to avoid token accumulation
- **Concurrent processing** support with async operations
- **15-minute web cache** for improved response times
- **Exponential backoff** for API rate limiting
## File Organization
### Core Files
- `main.py` - Legacy monolithic solver (1285 lines)
- `main_refactored.py` - Entry point for refactored architecture
- `gaia_tools.py` - 42 specialized tools with robust error handling (4887 lines)
- `question_classifier.py` - LLM + pattern-based classification system
- `app.py` - Production Gradio interface with comprehensive error handling
### Supporting Files
- `async_complete_test.py` - Comprehensive async testing infrastructure
- `enhanced_wikipedia_tools.py` - Advanced Wikipedia research capabilities
- `universal_fen_correction.py` - Chess-specific FEN notation correction
- `wikipedia_featured_articles_by_date.py` - Date-specific Wikipedia searches
## Local Configuration Notes
- huggingface token can get from secrets in .env |