---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
# Inference Provider Testing Dashboard
A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using the Hugging Face Jobs API.
## Setup
### Prerequisites
- Python 3.8+
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face
### Installation
1. Clone or navigate to this repository:
```bash
cd InferenceProviderTestingBackend
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up your Hugging Face token as an environment variable:
```bash
export HF_TOKEN="your_huggingface_token_here"
```
**Important**: Your `HF_TOKEN` must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization
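You can sanity-check that the token is set and accepted by the Hub with a short `huggingface_hub` snippet (this only confirms authentication, not the fine-grained permissions listed above):
```python
# Quick sanity check: is HF_TOKEN set and accepted by the Hub?
# Note: this verifies authentication only, not inference or write permissions.
import os
from huggingface_hub import whoami

token = os.environ.get("HF_TOKEN")
if not token:
    raise SystemExit("HF_TOKEN is not set")

info = whoami(token=token)
print(f"Authenticated as: {info['name']}")
print("Organizations:", [org["name"] for org in info.get("orgs", [])])
```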
## Usage
### Starting the Dashboard
Run the Gradio app:
```bash
python app.py
```
### Initialize Models and Providers
1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.
2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:
```
meta-llama/Llama-3.2-3B-Instruct fireworks-ai
meta-llama/Llama-3.2-3B-Instruct together-ai
Qwen/Qwen2.5-7B-Instruct fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3 together-ai
```
Format: `model_name provider_name`, one combination per line, with the two fields separated by spaces or tabs.
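A minimal sketch of how a file in this format can be parsed (the actual parsing lives in `utils/io.py`; this is illustrative only, and assumes blank lines and `#` comments should be skipped):
```python
# Illustrative parser for models_providers.txt:
# one "model_name provider_name" pair per line, whitespace-separated.
def read_model_provider_pairs(path="models_providers.txt"):
    pairs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):  # skip blanks and comments (assumption)
                continue
            model, provider = line.split()[:2]
            pairs.append((model, provider))
    return pairs
```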
### Launching Jobs
1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**
The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
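Conceptually, the launch step iterates over the parsed combinations and submits one evaluation job per pair. The sketch below uses a hypothetical `launch_job` helper as a stand-in for the real submission logic in `utils/jobs.py`:
```python
# Hypothetical launch loop: launch_job stands in for the helper in utils/jobs.py
# that submits an evaluation job through the Hugging Face Jobs API.
from datetime import datetime, timezone

def launch_all(pairs, tasks="lighteval|mmlu|0|0"):
    results = {}
    for model, provider in pairs:
        job_id = launch_job(model=model, provider=provider, tasks=tasks)  # hypothetical helper
        results[(model, provider)] = {
            "job_id": job_id,
            "status": "running",
            "last_run": datetime.now(timezone.utc).isoformat(),
        }
        print(f"Launched {model} on {provider}: job {job_id}")
    return results
```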
### Monitoring Jobs
The **Job Results** table displays all jobs with:
- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job ID**: The most recent job ID; open `https://huggingface.co/jobs/NAMESPACE/JOBID` to inspect the job
The table auto-refreshes every 30 seconds; you can also click **"Refresh Results"** for a manual update.
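The auto-refresh can be expressed with a Gradio timer; a minimal sketch, assuming a `get_results_dataframe` helper that returns the current table as a DataFrame:
```python
# Minimal refresh wiring: a gr.Timer ticks every 30 seconds, and a button
# triggers a manual refresh. get_results_dataframe is a hypothetical helper.
import gradio as gr

with gr.Blocks() as demo:
    table = gr.Dataframe(label="Job Results")
    refresh_btn = gr.Button("Refresh Results")
    timer = gr.Timer(30)

    timer.tick(fn=get_results_dataframe, outputs=table)
    refresh_btn.click(fn=get_results_dataframe, outputs=table)

demo.launch()
```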
## Configuration
### Tasks Format
The tasks parameter follows the lighteval task format, for example:
- `lighteval|mmlu|0|0` - the MMLU benchmark
### Daily Checkpoint
The system automatically saves all results to the Hugging Face dataset at **00:00 (midnight)** every day.
### Data Persistence
All job results are stored in a Hugging Face dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
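For example, the results can be loaded outside the app with the `datasets` library (a minimal sketch, assuming the results live in the default `train` split):
```python
# Load the persisted results dataset for offline analysis.
from datasets import load_dataset

ds = load_dataset("IPTesting/inference-provider-test-results", split="train")  # split name is an assumption
print(ds.column_names)
print(ds[0])
```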
## Architecture
- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
- **Thread-safe**: Uses locks to prevent concurrent-access issues when reading and updating `job_results`
- **Hugging Face Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset
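A rough sketch of how these pieces typically fit together (illustrative only, not the exact code in `app.py`; `update_job_statuses` and `save_checkpoint` are hypothetical stand-ins):
```python
# Illustrative wiring: a daemon monitor thread guarded by a lock,
# plus an APScheduler cron job for the midnight checkpoint.
import threading
import time
from apscheduler.schedulers.background import BackgroundScheduler

job_results = {}                      # shared state read by the Gradio UI
job_results_lock = threading.Lock()

def monitor_loop():
    while True:
        with job_results_lock:
            update_job_statuses(job_results)  # hypothetical: poll jobs, extract scores
        time.sleep(30)

threading.Thread(target=monitor_loop, daemon=True).start()

scheduler = BackgroundScheduler()
scheduler.add_job(save_checkpoint, "cron", hour=0, minute=0)  # hypothetical checkpoint function
scheduler.start()
```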
## Troubleshooting
### Jobs Not Launching
- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages
### Scores Not Appearing
- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores
- Example table format:
```
| Task | Version | Metric | Value | Stderr |
| extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
| lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
```
- If scores don't appear, check console output for extraction errors or parsing issues
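A simplified sketch of this kind of log-table parsing (the real extraction lives in `utils/jobs.py`; this is illustrative only):
```python
# Simplified score extraction: keep the first row seen for each task name
# and average the "Value" column across tasks.
def extract_average_score(log_text):
    scores = {}
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Expected columns: Task | Version | Metric | Value | Stderr
        if len(cells) < 5 or cells[0] in ("", "Task"):
            continue
        task = cells[0]
        if task not in scores:  # first row per task wins
            try:
                scores[task] = float(cells[3])
            except ValueError:
                continue
    return sum(scores.values()) / len(scores) if scores else None
```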
## Files
- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
- [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
- [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file