---
title: InferenceProviderTestingBackend
emoji: π
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
# Inference Provider Testing Dashboard
A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
## Setup
### Prerequisites
- Python 3.8+
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face
### Installation
1. Clone or navigate to this repository:
```bash
cd InferenceProviderTestingBackend
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up your Hugging Face token as an environment variable:
```bash
export HF_TOKEN="your_huggingface_token_here"
```
**Important**: Your HF_TOKEN must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization
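You can sanity-check the token before starting the app. A minimal check with `huggingface_hub` (this only verifies that the token resolves to an account, not the specific permissions above):

```python
import os
from huggingface_hub import whoami

# Fails with an error if HF_TOKEN is unset or invalid.
info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated as: {info['name']}")
```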
## Usage
### Starting the Dashboard
Run the Gradio app:
```bash
python app.py
```
### Initialize Models and Providers
1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.
2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:
```
meta-llama/Llama-3.2-3B-Instruct fireworks-ai
meta-llama/Llama-3.2-3B-Instruct together-ai
Qwen/Qwen2.5-7B-Instruct fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3 together-ai
```
Format: `model_name provider_name` (separated by spaces or tabs)
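As an illustration, a config file in this format can be parsed with a few lines of Python (a sketch only; the actual parsing lives in the `utils` package and may differ):

```python
def parse_models_providers(path: str = "models_providers.txt"):
    """Return (model, provider) pairs from the config file."""
    pairs = []
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            model, provider = line.split()[:2]  # whitespace-separated columns
            pairs.append((model, provider))
    return pairs
```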
### Launching Jobs
1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**
The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
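Conceptually, the launch step is a loop over the parsed combinations. The sketch below uses hypothetical helper names (`parse_models_providers`, `launch_evaluation_job`); the real helpers live in `utils/io.py` and `utils/jobs.py` and may be named differently:

```python
def launch_all_jobs(tasks: str, config_path: str = "models_providers.txt"):
    # Hypothetical helpers for illustration; see the utils/ package for the real ones.
    for model, provider in parse_models_providers(config_path):
        job_id = launch_evaluation_job(model=model, provider=provider, tasks=tasks)
        print(f"Launched job {job_id}: {model} via {provider}")
```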
### Monitoring Jobs
The **Job Results** table displays all jobs with:
- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job Id**: The ID of the most recent job; plug it into https://huggingface.co/jobs/NAMESPACE/JOBID to inspect the job
The table auto-refreshes every 30 seconds, or you can click **"Refresh Results"** for a manual update.
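For illustration, the auto-refresh can be wired up with Gradio's `gr.Timer` (a sketch, assuming a `get_results_dataframe` helper that builds the table; the real app's wiring may differ):

```python
import gradio as gr
import pandas as pd

def get_results_dataframe() -> pd.DataFrame:
    # Placeholder: the real app builds this from the monitored job results.
    columns = ["Model", "Provider", "Last Run", "Status",
               "Current Score", "Previous Score", "Latest Job Id"]
    return pd.DataFrame(columns=columns)

with gr.Blocks() as demo:
    results_table = gr.Dataframe(label="Job Results")
    refresh_btn = gr.Button("Refresh Results")
    timer = gr.Timer(30)  # fires every 30 seconds

    # Both the timer tick and the manual button repopulate the table.
    timer.tick(get_results_dataframe, outputs=results_table)
    refresh_btn.click(get_results_dataframe, outputs=results_table)
```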
## Configuration
### Tasks Format
The tasks parameter follows the lighteval task format. Example:
- `lighteval|mmlu|0|0` - MMLU benchmark (zero-shot)
### Daily Checkpoint
The system automatically saves all results to the HuggingFace dataset at **00:00 (midnight)** every day.
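With APScheduler, a midnight cron job looks roughly like this (a sketch, with `save_checkpoint` standing in for the real persistence function):

```python
from apscheduler.schedulers.background import BackgroundScheduler

def save_checkpoint():
    # Placeholder: the real function pushes job results to the HF dataset.
    print("Saving daily checkpoint...")

scheduler = BackgroundScheduler()
scheduler.add_job(save_checkpoint, "cron", hour=0, minute=0)  # every day at 00:00
scheduler.start()
```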
### Data Persistence
All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
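For example, the results can be loaded in a script with the `datasets` library (the split name below is an assumption):

```python
from datasets import load_dataset

# Requires a token with read access to the IPTesting organization.
results = load_dataset("IPTesting/inference-provider-test-results", split="train")
print(results.to_pandas().head())
```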
## Architecture
- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
- **Thread-safe**: Uses locks to prevent concurrent access issues when reading or updating `job_results`
- **HuggingFace Dataset Storage**: Persists results to `IPTesting/inference-provider-test-results` dataset
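A simplified view of the monitor thread's locking pattern (a sketch; the actual implementation in `app.py` differs in detail):

```python
import threading
import time

job_results = {}                 # shared between the UI and the monitor thread
results_lock = threading.Lock()  # guards every access to job_results

def monitor_loop():
    while True:
        with results_lock:
            for job_id, entry in job_results.items():
                # Placeholder: the real code checks job status here and
                # extracts scores once a job completes.
                pass
        time.sleep(30)  # poll every 30 seconds

threading.Thread(target=monitor_loop, daemon=True).start()
```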
## Troubleshooting
### Jobs Not Launching
- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages
### Scores Not Appearing
- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores
- Example table format:
```
| Task | Version | Metric | Value | Stderr |
| extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
| lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
```
- If scores don't appear, check console output for extraction errors or parsing issues
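A minimal sketch of this kind of parsing, matched against the table format above (the real extraction in `utils/jobs.py` may be more involved):

```python
import re
from typing import Optional

def extract_average_score(log_text: str) -> Optional[float]:
    """Average the first score found for each task in a lighteval results table."""
    # Rows look like: | extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
    row = re.compile(r"^\|\s*([\w:@.\-]+)\s*\|[^|]*\|[^|]*\|\s*([0-9.]+)\s*\|")
    scores = {}
    for line in log_text.splitlines():
        match = row.match(line.strip())
        if match and match.group(1) not in scores:  # keep the first row per task
            scores[match.group(1)] = float(match.group(2))
    return sum(scores.values()) / len(scores) if scores else None
```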
## Files
- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
- [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
- [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file