---
title: InferenceProviderTestingBackend
emoji: πŸ“ˆ
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

# Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using the Hugging Face Jobs API.

## Setup

### Prerequisites

- Python 3.10+ (required by Gradio 5)
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face

### Installation

1. Clone or navigate to this repository:
```bash
cd InferenceProviderTestingBackend
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up your Hugging Face token as an environment variable:
```bash
export HF_TOKEN="your_huggingface_token_here"
```

**Important**: Your HF_TOKEN must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization
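
You can sanity-check the token before going further. A minimal sketch using `huggingface_hub` (the namespace listing is only a rough access check and is not part of the app):

```python
import os

from huggingface_hub import HfApi

# Assumes HF_TOKEN is exported in the environment as shown above.
api = HfApi(token=os.environ["HF_TOKEN"])

# whoami() fails fast if the token is missing or invalid.
identity = api.whoami()
print(f"Authenticated as: {identity['name']}")

# Listing datasets under IPTesting is a rough proxy for access; it does not
# prove write permission, which the evaluation jobs themselves need.
print([d.id for d in api.list_datasets(author="IPTesting", limit=5)])
```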

## Usage

### Starting the Dashboard

Run the Gradio app:
```bash
python app.py
```

### Initialize Models and Providers

1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:
```
meta-llama/Llama-3.2-3B-Instruct  fireworks-ai
meta-llama/Llama-3.2-3B-Instruct  together-ai
Qwen/Qwen2.5-7B-Instruct  fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3  together-ai
```

Format: `model_name  provider_name` (separated by spaces or tabs)
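
A quick way to check that the file parses as expected (a standalone sketch; the app's own file handling lives in `utils/io.py`):

```python
def read_model_provider_pairs(path: str = "models_providers.txt") -> list[tuple[str, str]]:
    """Return (model, provider) pairs, skipping blank lines and flagging malformed ones."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for lineno, raw in enumerate(f, start=1):
            line = raw.strip()
            if not line:
                continue
            parts = line.split()  # any run of spaces or tabs
            if len(parts) != 2:
                print(f"Skipping malformed line {lineno}: {raw!r}")
                continue
            pairs.append((parts[0], parts[1]))
    return pairs


print(read_model_provider_pairs())
```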

### Launching Jobs

1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**

The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
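
In outline, the launch loop looks roughly like the sketch below. The `launch_evaluation_job` helper is a hypothetical stand-in; the real launching and monitoring logic lives in `utils/jobs.py`:

```python
from datetime import datetime, timezone


def launch_evaluation_job(model: str, provider: str, tasks: str) -> str:
    """Hypothetical stand-in for the real Jobs API call made in utils/jobs.py."""
    return f"fake-job-{model.split('/')[-1]}-{provider}"


def launch_all(tasks: str, config_path: str = "models_providers.txt") -> dict:
    """Launch one evaluation job per (model, provider) pair in the config file."""
    job_results = {}
    with open(config_path, encoding="utf-8") as f:
        pairs = [line.split(maxsplit=1) for line in f if line.strip()]
    for model, provider in pairs:
        provider = provider.strip()
        job_id = launch_evaluation_job(model=model, provider=provider, tasks=tasks)
        print(f"Launched {job_id} for {model} on {provider}")
        job_results[(model, provider)] = {
            "status": "running",
            "last_run": datetime.now(timezone.utc).isoformat(),
            "job_id": job_id,
        }
    return job_results
```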

### Monitoring Jobs

The **Job Results** table displays all jobs with:
- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job Id**: The most recent job ID; substitute it into https://huggingface.co/jobs/NAMESPACE/JOBID to inspect the job

The table auto-refreshes every 30 seconds, or you can click **"Refresh Results"** for a manual update.
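
One way to wire up such a 30-second auto-refresh in Gradio 5 is a `gr.Timer`; this is a generic sketch rather than the exact wiring in `app.py`:

```python
import gradio as gr
import pandas as pd

COLUMNS = ["Model", "Provider", "Last Run", "Status",
           "Current Score", "Previous Score", "Latest Job Id"]


def refresh_results() -> pd.DataFrame:
    # Placeholder: the real app builds this table from its shared job_results state.
    return pd.DataFrame(columns=COLUMNS)


with gr.Blocks() as demo:
    results_table = gr.Dataframe(label="Job Results", headers=COLUMNS)
    refresh_btn = gr.Button("Refresh Results")
    timer = gr.Timer(30)  # fires every 30 seconds

    refresh_btn.click(refresh_results, outputs=results_table)
    timer.tick(refresh_results, outputs=results_table)

demo.launch()
```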

## Configuration

### Tasks Format

The tasks parameter follows the lighteval format. Examples:
- `lighteval|mmlu|0|0` - MMLU benchmark (zero-shot)

### Daily Checkpoint

The system automatically saves all results to the Hugging Face dataset (see Data Persistence below) at **00:00 (midnight)** every day.
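
Under the hood this is a standard APScheduler cron job; a minimal sketch (the `save_results_to_dataset` name is illustrative):

```python
from apscheduler.schedulers.background import BackgroundScheduler


def save_results_to_dataset() -> None:
    """Placeholder: the app pushes the accumulated results to the HF dataset here."""


scheduler = BackgroundScheduler()
# Cron trigger: run every day at 00:00 (midnight).
scheduler.add_job(save_results_to_dataset, "cron", hour=0, minute=0)
scheduler.start()
```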

### Data Persistence

All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
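
For example, to pull the results into a script or notebook with the `datasets` library (the `train` split name and column layout are assumptions, not guaranteed by this README):

```python
from datasets import load_dataset

# Requires a token with read access to the dataset; the split name is assumed.
results = load_dataset("IPTesting/inference-provider-test-results", split="train")
print(results.column_names)
print(results.to_pandas().head())
```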

## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
- **Thread-safe**: Uses locks to prevent concurrent-access issues when reading and updating `job_results` (see the sketch after this list)
- **HuggingFace Dataset Storage**: Persists results to `IPTesting/inference-provider-test-results` dataset
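
For illustration, the monitor-thread pattern described above looks roughly like this (names and structure are a sketch, not the actual `app.py` internals):

```python
import threading
import time

job_results: dict = {}
results_lock = threading.Lock()


def monitor_loop(poll_interval: int = 30) -> None:
    """Poll job statuses and update the shared job_results dict under a lock."""
    while True:
        with results_lock:
            for entry in job_results.values():
                if entry.get("status") == "running":
                    # Placeholder: the real code queries the Jobs API and, once a
                    # job completes, extracts scores from its logs.
                    pass
        time.sleep(poll_interval)


monitor_thread = threading.Thread(target=monitor_loop, daemon=True)
monitor_thread.start()
```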

## Troubleshooting

### Jobs Not Launching

- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages

### Scores Not Appearing

- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores (a simplified extraction sketch follows this list)
- Example table format:
  ```
  | Task                    | Version | Metric                | Value  | Stderr |
  | extended:ifeval:0       |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
  | lighteval:gpqa:diamond:0 |        | gpqa_pass@k_with_k     | 0.5000 | 0.0503 |
  ```
- If scores don't appear, check console output for extraction errors or parsing issues
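
A simplified sketch of that extraction, based on the pipe-delimited table format shown above (the real parser in `utils/jobs.py` may work differently):

```python
def extract_average_score(log_text: str, tasks: list[str]) -> float | None:
    """Average the first reported Value for each task found in the results table."""
    scores = []
    for task in tasks:
        for line in log_text.splitlines():
            # Only look at pipe-delimited table rows that mention the task name.
            if line.strip().startswith("|") and task in line:
                cells = [c.strip() for c in line.strip().strip("|").split("|")]
                # Expected columns: Task | Version | Metric | Value | Stderr
                try:
                    scores.append(float(cells[3]))
                except (IndexError, ValueError):
                    continue
                break  # take the first row where the task name appears
    return sum(scores) / len(scores) if scores else None


# Example with the table shown above:
# extract_average_score(job_log, ["extended:ifeval:0", "lighteval:gpqa:diamond:0"])
```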

## Files

- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
  - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
  - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file