Clémentine committed · Commit 7f5506e · 1 Parent(s): ec29d6f

wip
Browse files:
- README.md +188 -1
- app.py +135 -0
- globals.py +21 -0
- model_providers.txt +9 -0
- requirements.txt +5 -0
- utils/__init__.py +0 -0
- utils/io.py +149 -0
- utils/jobs.py +198 -0
README.md
CHANGED
@@ -9,4 +9,191 @@ app_file: app.py
pinned: false
---

# Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers, using Hugging Face's job API.

## Features

- **Automatic Model Discovery**: Fetches popular text-generation models with inference providers from the Hugging Face Hub
- **Batch Job Launching**: Runs evaluation jobs for multiple model-provider combinations from a configuration file
- **Results Table Dashboard**: Shows all jobs with model, provider, last run, status, current score, and previous score
- **Score Tracking**: Automatically extracts average scores from completed jobs and tracks their history
- **Persistent Storage**: Results are saved to a Hugging Face dataset and persist across restarts
- **Individual Job Relaunch**: Easily relaunch specific model-provider combinations
- **Real-time Monitoring**: Auto-refreshes the results table every 30 seconds
- **Daily Checkpoint**: Automatic save at midnight to preserve state

## Setup

### Prerequisites

- Python 3.8+
- A Hugging Face account with an API token
- Access to the `IPTesting` namespace on Hugging Face

### Installation

1. Clone or navigate to this repository:
```bash
cd InferenceProviderTestingBackend
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up your Hugging Face token as an environment variable:
```bash
export HF_TOKEN="your_huggingface_token_here"
```

**Important**: Your `HF_TOKEN` must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization

## Usage

### Starting the Dashboard

Run the Gradio app:
```bash
python app.py
```

The dashboard will be available at `http://localhost:7860`.

### Initialize Models and Providers

1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `model_providers.txt` file with popular models and their available inference providers.

2. Alternatively, manually edit `model_providers.txt` with your desired model-provider combinations:
```
meta-llama/Llama-3.2-3B-Instruct fireworks-ai
meta-llama/Llama-3.2-3B-Instruct together-ai
Qwen/Qwen2.5-7B-Instruct fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3 together-ai
```

Format: `model_name provider_name` (separated by spaces or tabs)
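
For reference, the loader in `utils/io.py` whitespace-splits each non-comment line into a `(model, provider)` pair; `parse_config` below is a minimal sketch of that logic under a hypothetical name:

```python
# Minimal sketch of the config parsing in utils/io.py:
# whitespace-split each non-empty, non-comment line into (model, provider).
from typing import List, Tuple

def parse_config(text: str) -> List[Tuple[str, str]]:
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            parts = line.split()
            if len(parts) >= 2:
                pairs.append((parts[0], parts[1]))
    return pairs
```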

### Launching Jobs

1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `model_providers.txt`)
3. Click **"Launch Jobs"**

The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically

### Monitoring Jobs

The **Job Results** table displays all jobs with:
- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)

The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for a manual update.

### Relaunching Individual Jobs

To rerun a specific model-provider combination:
1. Enter the model name (e.g., `meta-llama/Llama-3.2-3B-Instruct`)
2. Enter the provider name (e.g., `fireworks-ai`)
3. Optionally modify the tasks
4. Click "Relaunch Job"

When relaunching, the current score automatically moves to the previous score for comparison.

## Configuration

### Tasks Format

The tasks parameter follows the lighteval format. Examples:
- `lighteval|mmlu|0|0` - MMLU benchmark
- `lighteval|hellaswag|0|0` - HellaSwag benchmark
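
Multiple tasks can be chained with commas. For reference, the dashboard's default task string (the `TASKS` constant in `globals.py`) is:
```
extended|ifeval|0,lighteval|gsm_plus|0,lighteval|gpqa:diamond|0
```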

### Daily Checkpoint

The system automatically saves all results to the Hugging Face dataset at **00:00 (midnight)** every day, ensuring data persistence and preventing data loss during long-running sessions.
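
The checkpoint is registered with APScheduler's cron trigger, mirroring the setup in `app.py`:

```python
# Daily checkpoint registration, as in app.py.
from apscheduler.schedulers.background import BackgroundScheduler

def daily_checkpoint() -> None:
    # In the real app this calls save_results() to push state to the dataset.
    print("Daily checkpoint - saving current state")

scheduler = BackgroundScheduler()
scheduler.add_job(daily_checkpoint, 'cron', hour=0, minute=0)  # fires at 00:00
scheduler.start()
```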

### Data Persistence

All job results are stored in a Hugging Face dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
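
For example, to pull the results table locally (assuming you have access to the dataset and the `datasets` library installed):

```python
# Load the persisted results and view them as a pandas DataFrame.
from datasets import load_dataset

results = load_dataset("IPTesting/inference-provider-test-results", split="train")
print(results.to_pandas().head())
```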

## Job Command Details

Each job runs with the following configuration:
- **Image**: `hf.co/spaces/OpenEvals/EvalsOnTheHub`
- **Command**: `lighteval endpoint inference-providers`
- **Namespace**: `IPTesting`
- **Flags**: `--push-to-hub --save-details --results-org IPTesting`

Results are automatically pushed to the `IPTesting` organization on the Hugging Face Hub.
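
Concretely, each launch is a `huggingface_hub.run_job` call like the one in `utils/jobs.py`; the sketch below fills in an example model, provider, and task string:

```python
# Simplified sketch of the job launch in utils/jobs.py.
import os
from huggingface_hub import run_job

model, provider = "deepseek-ai/DeepSeek-R1", "fireworks-ai"  # example combination
tasks = "lighteval|mmlu|0|0"                                 # example task string

job = run_job(
    image="hf.co/spaces/OpenEvals/EvalsOnTheHub",
    command=[
        "lighteval", "endpoint", "inference-providers",
        f"model_name={model},provider={provider}",
        tasks,
        "--push-to-hub", "--save-details",
        "--results-org", "IPTesting",
    ],
    namespace="IPTesting",
    secrets={"HF_TOKEN": os.getenv("HF_TOKEN")},
    token=os.getenv("HF_TOKEN"),
)
print(f"Launched job {job.id}")
```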

## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles the daily checkpoint save at midnight (cron-based)
- **Thread-safe Operations**: Uses locks to prevent race conditions when accessing `job_results`
- **HuggingFace Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset
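
All shared state lives in `globals.py` as a dict guarded by a `threading.Lock`; `set_status` below is a hypothetical helper that illustrates the access pattern:

```python
# Lock-guarded shared state, as in globals.py.
import threading

job_results: dict = {}           # keyed by "model||provider"
results_lock = threading.Lock()  # guards every read/write of job_results

def set_status(key: str, status: str) -> None:
    # Both the Gradio thread and the monitor thread go through the lock,
    # so concurrent updates cannot interleave.
    with results_lock:
        if key in job_results:
            job_results[key]["status"] = status
```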

## Troubleshooting

### Jobs Not Launching

- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and that you have access
- Review the logs for specific error messages

### Empty Models List

- Ensure you have internet connectivity
- The Hugging Face Hub API must be accessible
- Try running the initialization again

### Job Status Not Updating

- Check your internet connection
- Verify the job IDs are valid
- Check the console output for API errors

### Scores Not Appearing

- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in the job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores
- Example table format:
```
| Task                     | Version | Metric                  | Value  | Stderr |
| extended:ifeval:0        |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
| lighteval:gpqa:diamond:0 |         | gpqa_pass@k_with_k      | 0.5000 | 0.0503 |
```
- If scores still don't appear, check the console output for extraction or parsing errors
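
The extraction in `utils/jobs.py` splits each table row on `|` and takes the first numeric value for every row with a non-empty task name; `extract_scores` below is a condensed sketch of that logic:

```python
# Condensed sketch of the score extraction in utils/jobs.py.
import re
from typing import List

def extract_scores(log_lines: List[str]) -> List[float]:
    scores = []
    for line in log_lines:
        if '|' not in line:
            continue
        parts = [p.strip() for p in line.split('|')]
        # parts[1] is the task column, parts[4] the value column;
        # skip headers ("Task") and separator rows ("---").
        if len(parts) >= 5 and parts[1] and parts[1] != "Task" and not parts[1].startswith("-"):
            match = re.match(r'([0-9]+\.?[0-9]*)', parts[4])
            if match:
                scores.append(float(match.group(1)))
    return scores
```

The final score shown in the dashboard is the mean of this list.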

## Files

- [app.py](app.py) - Main Gradio application with the UI and job management
- [utils/](utils/) - Utility package with helper modules:
  - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
  - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
  - [utils/__init__.py](utils/__init__.py) - Package marker (currently empty)
- [model_providers.txt](model_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file

## License

This project is provided as-is for evaluation testing purposes.
app.py
ADDED
@@ -0,0 +1,135 @@
import gradio as gr
import time
from apscheduler.schedulers.background import BackgroundScheduler
import threading
import globals
from globals import TASKS, LOCAL_CONFIG_FILE
from utils.io import initialize_models_providers_file, save_results, load_results, load_models_providers, get_results_table
from utils.jobs import run_single_job, launch_jobs, update_job_statuses
from typing import Optional


def status_monitor() -> None:
    """Background thread to monitor job statuses."""
    while True:
        update_job_statuses()
        time.sleep(30)  # Check every 30 seconds


def daily_checkpoint() -> None:
    """Daily checkpoint - save current state."""
    print("Daily checkpoint - saving current state")
    save_results()


# Create Gradio interface
def create_app() -> gr.Blocks:
    with gr.Blocks(title="Inference Provider Testing Dashboard") as demo:
        gr.Markdown("# Inference Provider Testing Dashboard")
        gr.Markdown("Launch and monitor evaluation jobs for multiple models and providers.")

        output = gr.Textbox(label="Logs and status", interactive=False)

        with gr.Row():
            with gr.Column():
                gr.Markdown("## Initialize Config File")
                init_btn = gr.Button("Fetch and Initialize Models/Providers", variant="secondary")

        with gr.Row():
            with gr.Column():
                gr.Markdown("## Launch Jobs")
                launch_btn = gr.Button("Launch All Jobs", variant="primary")

        with gr.Row():
            with gr.Column():
                gr.Markdown("## Job Results")
                results_table = gr.Dataframe(
                    headers=["Model", "Provider", "Last Run", "Status", "Current Score", "Previous Score"],
                    value=get_results_table(),
                    interactive=False,
                    wrap=True
                )
                refresh_btn = gr.Button("Refresh Results")

        with gr.Row():
            with gr.Column():
                gr.Markdown("## Relaunch Individual Job")

                # Load model-provider combinations
                models_providers = load_models_providers(LOCAL_CONFIG_FILE)
                model_choices = sorted(set(mp[0] for mp in models_providers))

                relaunch_model = gr.Dropdown(
                    label="Model",
                    choices=model_choices,
                    interactive=True
                )
                relaunch_provider = gr.Dropdown(
                    label="Provider",
                    choices=[],
                    interactive=True
                )
                relaunch_btn = gr.Button("Relaunch Job", variant="secondary")

        def update_provider_choices(model: Optional[str]) -> dict:
            """Update the provider dropdown based on the selected model."""
            if not model:
                return gr.update(choices=[])

            # Get providers for the selected model from the config file
            models_providers = load_models_providers(LOCAL_CONFIG_FILE)
            providers = [mp[1] for mp in models_providers if mp[0] == model]

            return gr.update(choices=providers, value=providers[0] if providers else None)

        # Event handlers
        init_btn.click(
            fn=initialize_models_providers_file,
            outputs=output
        )

        launch_btn.click(
            fn=launch_jobs,
            outputs=output
        )

        refresh_btn.click(
            fn=get_results_table,
            outputs=results_table
        )

        # Update provider dropdown when a model is selected
        relaunch_model.change(
            fn=update_provider_choices,
            inputs=relaunch_model,
            outputs=relaunch_provider
        )

        # run_single_job also expects a tasks string; relaunch with the default task list
        relaunch_btn.click(
            fn=lambda model, provider: run_single_job(model, provider, TASKS),
            inputs=[relaunch_model, relaunch_provider],
            outputs=output
        )

    return demo


if __name__ == "__main__":
    # Load previous results
    load_results()
    print("Starting Inference Provider Testing Dashboard")

    # Start status monitor thread
    monitor_thread = threading.Thread(target=status_monitor, daemon=True)
    monitor_thread.start()
    print("Job status monitor started")

    # Start APScheduler for daily checkpoint
    scheduler = BackgroundScheduler()
    scheduler.add_job(daily_checkpoint, 'cron', hour=0, minute=0)  # Run at midnight
    scheduler.start()
    print("Daily checkpoint scheduler started (saves at 00:00)")

    # Create and launch the Gradio interface
    demo = create_app()
    demo.launch(server_name="0.0.0.0", server_port=7860)
globals.py
ADDED
@@ -0,0 +1,21 @@
"""Global variables and configuration for the Inference Provider Testing Dashboard."""

import threading
from typing import Dict, Any

# Type definition for job result entries
JobResult = Dict[str, Any]  # {model, provider, last_run, status, current_score, previous_score, job_id}

# Global variables to track jobs
job_results: Dict[str, JobResult] = {}  # {model_provider_key: JobResult}
results_lock: threading.Lock = threading.Lock()

# Configuration
RESULTS_DATASET_NAME: str = "IPTesting/inference-provider-test-results"
LOCAL_CONFIG_FILE: str = "model_providers.txt"
TASKS: str = "extended|ifeval|0,lighteval|gsm_plus|0,lighteval|gpqa:diamond|0"


def get_model_provider_key(model: str, provider: str) -> str:
    """Create a unique key for a model-provider combination."""
    return f"{model}||{provider}"
model_providers.txt
ADDED
@@ -0,0 +1,9 @@
# Models and Providers Configuration
# Format: model_name provider_name
# Auto-generated on 2025-10-10 15:38:11

deepseek-ai/DeepSeek-R1 fireworks-ai
deepseek-ai/DeepSeek-R1 hyperbolic
deepseek-ai/DeepSeek-R1 novita
deepseek-ai/DeepSeek-R1 together
deepseek-ai/DeepSeek-R1 sambanova
requirements.txt
ADDED
@@ -0,0 +1,5 @@
gradio>=4.0.0
huggingface-hub>=0.20.0
apscheduler>=3.10.0
datasets>=2.14.0
pandas>=1.5.0
utils/__init__.py
ADDED
File without changes
utils/io.py
ADDED
@@ -0,0 +1,149 @@
from huggingface_hub import list_models
from datetime import datetime
from datasets import Dataset, load_dataset
import pandas as pd
import os
import globals
from typing import List, Tuple


def get_models_providers() -> List[Tuple[str, List[str]]]:
    """Get a list of popular text-generation models and their providers from the Hugging Face Hub."""
    # NOTE: limit=1 currently fetches only the single most-liked model (wip).
    models = list_models(
        filter="text-generation",
        sort="likes",
        direction=-1,
        limit=1,
        expand="inferenceProviderMapping"
    )

    model_providers = [
        (model.id, [p.provider for p in model.inference_provider_mapping])
        for model in models
        if hasattr(model, 'inference_provider_mapping') and model.inference_provider_mapping
    ]
    return model_providers


def initialize_models_providers_file(file_path: str = globals.LOCAL_CONFIG_FILE) -> str:
    """Initialize the config file with popular models and their providers."""
    model_to_providers = get_models_providers()

    with open(file_path, 'w') as f:
        f.write("# Models and Providers Configuration\n")
        f.write("# Format: model_name provider_name\n")
        f.write(f"# Auto-generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")

        count = 0
        for (model_id, providers) in model_to_providers:
            try:
                for provider in providers:
                    f.write(f"{model_id} {provider}\n")
                    count += 1
            except Exception as e:
                print(f"Error processing model {model_id}: {e}")
                continue

    print(f"Successfully wrote {count} model-provider combinations to {file_path}")
    return f"Initialized {count} model-provider combinations"


def load_models_providers(file_path: str = globals.LOCAL_CONFIG_FILE) -> List[Tuple[str, str]]:
    """Load models and providers from the config file."""
    models_providers = []
    try:
        with open(file_path, 'r') as f:
            for line in f:
                line = line.strip()
                # Skip empty lines and comments
                if line and not line.startswith('#'):
                    parts = line.split()
                    if len(parts) >= 2:
                        model = parts[0]
                        provider = parts[1]
                        models_providers.append((model, provider))
    except Exception as e:
        print(f"Error loading {file_path}: {str(e)}")
    return models_providers


def save_results() -> None:
    """Persist job results to the HuggingFace dataset."""
    try:
        with globals.results_lock:
            if not globals.job_results:
                print("No results to save")
                return

            # Snapshot under the lock, then push outside it so the lock
            # is not held during network I/O.
            records = list(globals.job_results.values())

        df = pd.DataFrame(records)
        dataset = Dataset.from_pandas(df)

        # Push to HuggingFace Hub
        dataset.push_to_hub(
            globals.RESULTS_DATASET_NAME,
            token=os.getenv("HF_TOKEN"),
            private=False
        )
        print(f"Saved {len(records)} results to dataset")

    except Exception as e:
        print(f"Error saving results to dataset: {e}")


def load_results() -> None:
    """Load job results from the HuggingFace dataset."""
    try:
        # Try to load the existing dataset
        dataset = load_dataset(
            globals.RESULTS_DATASET_NAME,
            split="train",
            token=os.getenv("HF_TOKEN")
        )

        # Convert the dataset rows to the job_results dict
        for row in dataset:
            key = globals.get_model_provider_key(row["model"], row["provider"])
            globals.job_results[key] = {
                "model": row["model"],
                "provider": row["provider"],
                "last_run": row["last_run"],
                "status": row["status"],
                "current_score": row["current_score"],
                "previous_score": row["previous_score"],
                "job_id": row["job_id"]
            }

        print(f"Loaded {len(globals.job_results)} results from dataset")

    except Exception as e:
        print(f"No existing dataset found or error loading: {e}")
        print("Starting with empty results")


def get_results_table() -> List[List[str]]:
    """Return job results as a list of rows for the Gradio DataFrame."""
    with globals.results_lock:
        if not globals.job_results:
            return []

        table_data = []
        for key, info in globals.job_results.items():
            current_score = info.get("current_score", "N/A")
            if current_score is not None and isinstance(current_score, (int, float)):
                current_score = f"{current_score:.4f}"

            previous_score = info.get("previous_score", "N/A")
            if previous_score is not None and isinstance(previous_score, (int, float)):
                previous_score = f"{previous_score:.4f}"

            table_data.append([
                info["model"],
                info["provider"],
                info["last_run"],
                info["status"],
                current_score,
                previous_score
            ])

        return table_data
utils/jobs.py
ADDED
@@ -0,0 +1,198 @@
from huggingface_hub import run_job, inspect_job
import os
import re
import time
from datetime import datetime
import globals
from utils.io import save_results, load_models_providers
from typing import Optional


def extract_score_from_job(job_id: str) -> Optional[float]:
    """Extract the average score from completed job logs.

    Parses the results table and calculates the average of the main metric
    for each task (the metric on the same line as the task name).
    """
    try:
        # Inspect the job to get details and logs
        job_info = inspect_job(job_id=job_id)

        # Get the logs from the job
        if hasattr(job_info, 'logs') and job_info.logs:
            logs = job_info.logs
            lines = logs.split('\n')

            # Find the results table: lines matching |task_name|version|metric|value|...|
            # and extract the score (value) from lines where the task name is not empty.
            scores = []

            for line in lines:
                # Check if we're in a table (contains pipe separators)
                if '|' in line:
                    parts = [p.strip() for p in line.split('|')]

                    # Skip header and separator lines
                    # Table format: | Task | Version | Metric | Value | Stderr |
                    if len(parts) >= 5:
                        task = parts[1] if len(parts) > 1 else ""
                        metric = parts[3] if len(parts) > 3 else ""
                        value = parts[4] if len(parts) > 4 else ""

                        # Only keep lines with a non-empty task name (the main metric
                        # for that task); skip "Task" headers and "---" separators.
                        if task and task not in ["Task", ""] and not task.startswith("-"):
                            # Extract the numeric part (may have a ± symbol after it)
                            value_clean = value.strip()
                            try:
                                score_match = re.match(r'([0-9]+\.?[0-9]*)', value_clean)
                                if score_match:
                                    score = float(score_match.group(1))
                                    scores.append(score)
                                    print(f"Extracted score {score} for task '{task}' metric '{metric}'")
                            except (ValueError, AttributeError):
                                continue

            # Calculate the average of all task scores
            if scores:
                average_score = sum(scores) / len(scores)
                print(f"Calculated average score: {average_score:.4f} from {len(scores)} tasks")
                return average_score
            else:
                print("No scores found in job logs")

        return None

    except Exception as e:
        print(f"Error extracting score for job {job_id}: {e}")
        import traceback
        traceback.print_exc()
        return None


def run_single_job(model: str, provider: str, tasks: str) -> Optional[str]:
    """Run a single job for a model-provider combination. Returns the job ID, or None on failure."""

    if not model or not provider:
        print("Missing model or provider")
        return None

    # Verify the model-provider combination exists in the config
    models_providers = load_models_providers(globals.LOCAL_CONFIG_FILE)
    if (model, provider) not in models_providers:
        print(f"Error: {model} with {provider} not found in {globals.LOCAL_CONFIG_FILE}")
        return None

    # Check if a job is already running for this combination
    key = globals.get_model_provider_key(model, provider)
    with globals.results_lock:
        if key in globals.job_results:
            current_status = globals.job_results[key].get("status")
            # Compare case-insensitively: locally-set values ("running") and
            # API stage names (e.g. "RUNNING") differ in case.
            if current_status and current_status.lower() == "running":
                print(f"Job for {model} on {provider} is already running. Please wait for it to complete.")
                return None

    print(f"Starting job for model={model}, provider={provider}")

    job = run_job(
        image="hf.co/spaces/OpenEvals/EvalsOnTheHub",
        command=[
            "lighteval", "endpoint", "inference-providers",
            f"model_name={model},provider={provider}",
            tasks,
            "--push-to-hub", "--save-details",
            "--results-org", "IPTesting",
            "--max-samples", "10"
        ],
        namespace="clefourrier",  # NOTE: the README documents the IPTesting namespace
        secrets={"HF_TOKEN": os.getenv("HF_TOKEN")},
        token=os.getenv("HF_TOKEN")
    )

    job_id = job.id

    with globals.results_lock:
        # Move the current score to the previous score if it exists (relaunching)
        previous_score = None
        if key in globals.job_results and globals.job_results[key].get("current_score"):
            previous_score = globals.job_results[key]["current_score"]

        globals.job_results[key] = {
            "model": model,
            "provider": provider,
            "last_run": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "status": "running",
            "current_score": None,
            "previous_score": previous_score,
            "job_id": job_id
        }

    save_results()
    print(f"Job launched: ID={job_id}, model={model}, provider={provider}")
    return job_id


def launch_jobs(tasks: str = globals.TASKS, config_file: str = globals.LOCAL_CONFIG_FILE) -> str:
    """Launch jobs for all models and providers."""
    models_providers = load_models_providers(config_file)

    if not models_providers:
        print("No valid model-provider combinations found in config file")
        return "No valid model-provider combinations found"

    print(f"Found {len(models_providers)} model-provider combinations")

    launched_count = 0
    for model, provider in models_providers:
        job_id = run_single_job(model, provider, tasks)
        if job_id is not None:
            launched_count += 1
        # Small delay between launches to avoid rate limiting
        time.sleep(2)

    print(f"Launched {launched_count}/{len(models_providers)} jobs successfully")
    return f"Launched {launched_count} jobs"


def update_job_statuses() -> None:
    """Check and update the status of active jobs."""
    try:
        with globals.results_lock:
            keys = list(globals.job_results.keys())

        for key in keys:
            try:
                with globals.results_lock:
                    # Skip jobs already in a terminal state; compare case-insensitively
                    # since API stage names (e.g. "COMPLETED", "ERROR") are uppercase.
                    if globals.job_results[key]["status"].lower() in ["complete", "completed", "failed", "error", "cancelled"]:
                        continue

                    job_id = globals.job_results[key]["job_id"]

                job_info = inspect_job(job_id=job_id)
                new_status = job_info.status.stage

                with globals.results_lock:
                    old_status = globals.job_results[key]["status"]

                    if old_status != new_status:
                        globals.job_results[key]["status"] = new_status
                        print(f"Job {job_id} status changed: {old_status} -> {new_status}")

                        # If the job completed, try to extract its score
                        if new_status == "COMPLETED":
                            score = extract_score_from_job(job_id)
                            if score is not None:
                                globals.job_results[key]["current_score"] = score

            except Exception as e:
                print(f"Error checking job: {str(e)}")

        save_results()

    except Exception as e:
        print(f"Error in update_job_statuses: {str(e)}")