---
title: InferenceProviderTestingBackend
emoji: πŸ“ˆ
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

# Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using the Hugging Face Jobs API.

## Setup

### Prerequisites

- Python 3.10+ (required by Gradio 5)
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face

### Installation

1. Clone or navigate to this repository:
```bash
cd InferenceProviderTestingBackend
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up your Hugging Face token as an environment variable:
```bash
export HF_TOKEN="your_huggingface_token_here"
```

**Important**: Your HF_TOKEN must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization
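
You can sanity-check the token before going further. A minimal sketch using `huggingface_hub` (the namespace listing is only a rough access check and is not part of the app):

```python
import os

from huggingface_hub import HfApi

# Assumes HF_TOKEN is exported in the environment as shown above.
api = HfApi(token=os.environ["HF_TOKEN"])

# whoami() fails fast if the token is missing or invalid.
identity = api.whoami()
print(f"Authenticated as: {identity['name']}")

# Listing datasets under IPTesting is a rough proxy for access; it does not
# prove write permission, which the evaluation jobs themselves need.
print([d.id for d in api.list_datasets(author="IPTesting", limit=5)])
```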

## Usage

### Starting the Dashboard

Run the Gradio app:
```bash
python app.py
```

### Initialize Models and Providers

1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:
```
meta-llama/Llama-3.2-3B-Instruct  fireworks-ai
meta-llama/Llama-3.2-3B-Instruct  together-ai
Qwen/Qwen2.5-7B-Instruct  fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3  together-ai
```

Format: `model_name  provider_name` (separated by spaces or tabs)
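
A quick way to check that the file parses as expected (a standalone sketch; the app's own file handling lives in `utils/io.py`):

```python
def read_model_provider_pairs(path: str = "models_providers.txt") -> list[tuple[str, str]]:
    """Return (model, provider) pairs, skipping blank lines and flagging malformed ones."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for lineno, raw in enumerate(f, start=1):
            line = raw.strip()
            if not line:
                continue
            parts = line.split()  # any run of spaces or tabs
            if len(parts) != 2:
                print(f"Skipping malformed line {lineno}: {raw!r}")
                continue
            pairs.append((parts[0], parts[1]))
    return pairs


print(read_model_provider_pairs())
```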

### Launching Jobs

1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**

The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
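
In outline, the launch loop looks roughly like the sketch below. The `launch_evaluation_job` helper is a hypothetical stand-in; the real launching and monitoring logic lives in `utils/jobs.py`:

```python
from datetime import datetime, timezone


def launch_evaluation_job(model: str, provider: str, tasks: str) -> str:
    """Hypothetical stand-in for the real Jobs API call made in utils/jobs.py."""
    return f"fake-job-{model.split('/')[-1]}-{provider}"


def launch_all(tasks: str, config_path: str = "models_providers.txt") -> dict:
    """Launch one evaluation job per (model, provider) pair in the config file."""
    job_results = {}
    with open(config_path, encoding="utf-8") as f:
        pairs = [line.split(maxsplit=1) for line in f if line.strip()]
    for model, provider in pairs:
        provider = provider.strip()
        job_id = launch_evaluation_job(model=model, provider=provider, tasks=tasks)
        print(f"Launched {job_id} for {model} on {provider}")
        job_results[(model, provider)] = {
            "status": "running",
            "last_run": datetime.now(timezone.utc).isoformat(),
            "job_id": job_id,
        }
    return job_results
```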

### Monitoring Jobs

The **Job Results** table displays all jobs with:
- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job Id**: The most recent job ID; substitute it into https://huggingface.co/jobs/NAMESPACE/JOBID to inspect the job

The table auto-refreshes every 30 seconds, or you can click **"Refresh Results"** for a manual update.
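
One way to wire up such a 30-second auto-refresh in Gradio 5 is a `gr.Timer`; this is a generic sketch rather than the exact wiring in `app.py`:

```python
import gradio as gr
import pandas as pd

COLUMNS = ["Model", "Provider", "Last Run", "Status",
           "Current Score", "Previous Score", "Latest Job Id"]


def refresh_results() -> pd.DataFrame:
    # Placeholder: the real app builds this table from its shared job_results state.
    return pd.DataFrame(columns=COLUMNS)


with gr.Blocks() as demo:
    results_table = gr.Dataframe(label="Job Results", headers=COLUMNS)
    refresh_btn = gr.Button("Refresh Results")
    timer = gr.Timer(30)  # fires every 30 seconds

    refresh_btn.click(refresh_results, outputs=results_table)
    timer.tick(refresh_results, outputs=results_table)

demo.launch()
```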

## Configuration

### Tasks Format

The tasks parameter follows the lighteval format. Examples:
- `lighteval|mmlu|0|0` - MMLU benchmark (zero-shot)

### Daily Checkpoint

The system automatically saves all results to the Hugging Face dataset (see Data Persistence below) at **00:00 (midnight)** every day.
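
Under the hood this is a standard APScheduler cron job; a minimal sketch (the `save_results_to_dataset` name is illustrative):

```python
from apscheduler.schedulers.background import BackgroundScheduler


def save_results_to_dataset() -> None:
    """Placeholder: the app pushes the accumulated results to the HF dataset here."""


scheduler = BackgroundScheduler()
# Cron trigger: run every day at 00:00 (midnight).
scheduler.add_job(save_results_to_dataset, "cron", hour=0, minute=0)
scheduler.start()
```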

### Data Persistence

All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
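
For example, to pull the results into a script or notebook with the `datasets` library (the `train` split name and column layout are assumptions, not guaranteed by this README):

```python
from datasets import load_dataset

# Requires a token with read access to the dataset; the split name is assumed.
results = load_dataset("IPTesting/inference-provider-test-results", split="train")
print(results.column_names)
print(results.to_pandas().head())
```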

## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
- **Thread-safe**: Uses locks to prevent concurrent-access issues when reading and updating `job_results` (see the sketch after this list)
- **HuggingFace Dataset Storage**: Persists results to `IPTesting/inference-provider-test-results` dataset
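
For illustration, the monitor-thread pattern described above looks roughly like this (names and structure are a sketch, not the actual `app.py` internals):

```python
import threading
import time

job_results: dict = {}
results_lock = threading.Lock()


def monitor_loop(poll_interval: int = 30) -> None:
    """Poll job statuses and update the shared job_results dict under a lock."""
    while True:
        with results_lock:
            for entry in job_results.values():
                if entry.get("status") == "running":
                    # Placeholder: the real code queries the Jobs API and, once a
                    # job completes, extracts scores from its logs.
                    pass
        time.sleep(poll_interval)


monitor_thread = threading.Thread(target=monitor_loop, daemon=True)
monitor_thread.start()
```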

## Troubleshooting

### Jobs Not Launching

- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages

### Scores Not Appearing

- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores (a simplified extraction sketch follows this list)
- Example table format:
  ```
  | Task                    | Version | Metric                | Value  | Stderr |
  | extended:ifeval:0       |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
  | lighteval:gpqa:diamond:0 |        | gpqa_pass@k_with_k     | 0.5000 | 0.0503 |
  ```
- If scores don't appear, check console output for extraction errors or parsing issues
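
A simplified sketch of that extraction, based on the pipe-delimited table format shown above (the real parser in `utils/jobs.py` may work differently):

```python
def extract_average_score(log_text: str, tasks: list[str]) -> float | None:
    """Average the first reported Value for each task found in the results table."""
    scores = []
    for task in tasks:
        for line in log_text.splitlines():
            # Only look at pipe-delimited table rows that mention the task name.
            if line.strip().startswith("|") and task in line:
                cells = [c.strip() for c in line.strip().strip("|").split("|")]
                # Expected columns: Task | Version | Metric | Value | Stderr
                try:
                    scores.append(float(cells[3]))
                except (IndexError, ValueError):
                    continue
                break  # take the first row where the task name appears
    return sum(scores) / len(scores) if scores else None


# Example with the table shown above:
# extract_average_score(job_log, ["extended:ifeval:0", "lighteval:gpqa:diamond:0"])
```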

## Files

- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
  - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
  - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file