Clémentine committed on
Commit 7f5506e · 1 Parent(s): ec29d6f
Files changed (8)
  1. README.md +188 -1
  2. app.py +135 -0
  3. globals.py +21 -0
  4. model_providers.txt +9 -0
  5. requirements.txt +5 -0
  6. utils/__init__.py +0 -0
  7. utils/io.py +149 -0
  8. utils/jobs.py +198 -0
README.md CHANGED
@@ -9,4 +9,191 @@ app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Inference Provider Testing Dashboard
+
+ A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
+
+ ## Features
+
+ - **Automatic Model Discovery**: Fetch popular text-generation models with inference providers from Hugging Face Hub
+ - **Batch Job Launching**: Run evaluation jobs for multiple model-provider combinations from a configuration file
+ - **Results Table Dashboard**: View all jobs with model, provider, last run, status, current score, and previous score
+ - **Score Tracking**: Automatically extracts average scores from completed jobs and tracks history
+ - **Persistent Storage**: Results saved to a Hugging Face dataset for persistence across restarts
+ - **Individual Job Relaunch**: Easily relaunch specific model-provider combinations
+ - **Real-time Monitoring**: Auto-refresh results table every 30 seconds
+ - **Daily Checkpoint**: Automatic daily save at midnight to preserve state
+
+ ## Setup
+
+ ### Prerequisites
+
+ - Python 3.8+
+ - Hugging Face account with API token
+ - Access to the `IPTesting` namespace on Hugging Face
+
+ ### Installation
+
+ 1. Clone or navigate to this repository:
+ ```bash
+ cd InferenceProviderTestingBackend
+ ```
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Set up your Hugging Face token as an environment variable:
+ ```bash
+ export HF_TOKEN="your_huggingface_token_here"
+ ```
+
+ **Important**: Your `HF_TOKEN` must have:
+ - Permission to call inference providers
+ - Write access to the `IPTesting` organization
+
+ ## Usage
+
+ ### Starting the Dashboard
+
+ Run the Gradio app:
+ ```bash
+ python app.py
+ ```
+
+ The dashboard will be available at `http://localhost:7860`.
+
+ ### Initialize Models and Providers
+
+ 1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `model_providers.txt` file with popular models and their available inference providers.
+
+ 2. Alternatively, manually edit `model_providers.txt` with your desired model-provider combinations:
+ ```
+ meta-llama/Llama-3.2-3B-Instruct fireworks-ai
+ meta-llama/Llama-3.2-3B-Instruct together-ai
+ Qwen/Qwen2.5-7B-Instruct fireworks-ai
+ mistralai/Mistral-7B-Instruct-v0.3 together-ai
+ ```
+
+ Format: `model_name provider_name` (separated by spaces or tabs)
+
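+ As a reference, here is a minimal standalone sketch of how each non-comment line is parsed into a `(model, provider)` pair; it mirrors the `load_models_providers` helper in [utils/io.py](utils/io.py):
+
+ ```python
+ from typing import List, Tuple
+
+ def parse_models_providers(path: str = "model_providers.txt") -> List[Tuple[str, str]]:
+     """Sketch of the config parser: one 'model provider' pair per non-comment line."""
+     pairs = []
+     with open(path) as f:
+         for line in f:
+             line = line.strip()
+             if not line or line.startswith("#"):
+                 continue  # skip blank lines and comments
+             parts = line.split()  # any run of spaces or tabs separates the two fields
+             if len(parts) >= 2:
+                 pairs.append((parts[0], parts[1]))
+     return pairs
+ ```
+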
+ ### Launching Jobs
+
+ 1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
+ 2. Verify the config file path (default: `model_providers.txt`)
+ 3. Click **"Launch All Jobs"**
+
+ The system will:
+ - Read all model-provider combinations from the config file
+ - Launch a separate evaluation job for each combination
+ - Log the job ID and status
+ - Monitor job progress automatically
+
+ ### Monitoring Jobs
+
+ The **Job Results** table displays all jobs with:
+ - **Model**: The model being tested
+ - **Provider**: The inference provider
+ - **Last Run**: Timestamp of when the job was last launched
+ - **Status**: Current status (running/complete/failed/cancelled)
+ - **Current Score**: Average score from the most recent run
+ - **Previous Score**: Average score from the prior run (for comparison)
+
+ The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
+
+ ### Relaunching Individual Jobs
+
+ To rerun a specific model-provider combination:
+ 1. Select the model from the dropdown (e.g., `meta-llama/Llama-3.2-3B-Instruct`)
+ 2. Select the provider from the dropdown (e.g., `fireworks-ai`)
+ 3. Optionally modify the tasks
+ 4. Click "Relaunch Job"
+
+ When relaunching, the current score automatically moves to the **Previous Score** column for comparison.
+
+ ## Configuration
+
+ ### Tasks Format
+
+ The tasks parameter follows the lighteval format. Examples:
+ - `lighteval|mmlu|0|0` - MMLU benchmark
+ - `lighteval|hellaswag|0|0` - HellaSwag benchmark
+
+ ### Daily Checkpoint
+
+ The system automatically saves all results to the Hugging Face dataset at **00:00 (midnight)** every day. This ensures data persistence and prevents data loss from long-running sessions.
+
+ ### Data Persistence
+
+ All job results are stored in a Hugging Face dataset (`IPTesting/inference-provider-test-results`), which means:
+ - Results persist across app restarts
+ - Historical score comparisons are maintained
+ - Data can be accessed programmatically via the HF datasets library (see the snippet below)
+
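+ For example, a short snippet for reading the stored results back into a pandas DataFrame (this mirrors how `load_results` in [utils/io.py](utils/io.py) reads the dataset; the column names are the fields this app stores):
+
+ ```python
+ from datasets import load_dataset
+
+ # The app pushes this dataset publicly (private=False), so no token is needed to read it
+ results = load_dataset("IPTesting/inference-provider-test-results", split="train")
+ df = results.to_pandas()
+ print(df[["model", "provider", "status", "current_score"]].head())
+ ```
+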
+ ## Job Command Details
+
+ Each job runs with the following configuration:
+ - **Image**: `hf.co/spaces/OpenEvals/EvalsOnTheHub`
+ - **Command**: `lighteval endpoint inference-providers`
+ - **Namespace**: `IPTesting`
+ - **Flags**: `--push-to-hub --save-details --results-org IPTesting`
+
+ Results are automatically pushed to the `IPTesting` organization on Hugging Face Hub.
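+
+ For reference, a trimmed sketch of the underlying `run_job` call (the full version in [utils/jobs.py](utils/jobs.py) also passes `--max-samples` and an `HF_TOKEN` secret); the model, provider, and task below are example values taken from this repository:
+
+ ```python
+ import os
+ from huggingface_hub import run_job
+
+ # Launch one evaluation job for a single model-provider pair (sketch only)
+ job = run_job(
+     image="hf.co/spaces/OpenEvals/EvalsOnTheHub",
+     command=[
+         "lighteval", "endpoint", "inference-providers",
+         "model_name=deepseek-ai/DeepSeek-R1,provider=fireworks-ai",
+         "extended|ifeval|0",
+         "--push-to-hub", "--save-details",
+         "--results-org", "IPTesting",
+     ],
+     namespace="IPTesting",  # namespace as documented above
+     token=os.getenv("HF_TOKEN"),
+ )
+ print(job.id)
+ ```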
+
+ ## Architecture
+
+ - **Main Thread**: Runs the Gradio interface
+ - **Monitor Thread**: Updates job statuses on a fixed interval (every 240 seconds in `app.py`) and extracts scores from completed jobs
+ - **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
+ - **Thread-safe Operations**: Uses locks to prevent race conditions when accessing `job_results` (see the sketch below)
+ - **Hugging Face Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset
+
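+ A minimal sketch of that locking pattern (the actual shared dict and lock are defined in [globals.py](globals.py)):
+
+ ```python
+ import threading
+
+ job_results = {}                 # shared state: {"model||provider": result dict}
+ results_lock = threading.Lock()  # guards every read/write of job_results
+
+ def set_status(key: str, status: str) -> None:
+     # Both the monitor thread and the Gradio handlers touch job_results,
+     # so every access happens inside the lock.
+     with results_lock:
+         if key in job_results:
+             job_results[key]["status"] = status
+ ```
+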
+ ## Troubleshooting
+
+ ### Jobs Not Launching
+
+ - Verify your `HF_TOKEN` is set and has the required permissions
+ - Check that the `IPTesting` namespace exists and you have access
+ - Review logs for specific error messages
+
+ ### Empty Models List
+
+ - Ensure you have internet connectivity
+ - The Hugging Face Hub API must be accessible
+ - Try running the initialization again
+
+ ### Job Status Not Updating
+
+ - Check your internet connection
+ - Verify the job IDs are valid
+ - Check console output for API errors
+
+ ### Scores Not Appearing
+
+ - Scores are extracted from job logs after completion
+ - The extraction parses the results table that appears in job logs
+ - It extracts the score for each task (from the first row where the task name appears)
+ - The final score is the average of all task scores
+ - Example table format:
+ ```
+ | Task | Version | Metric | Value | Stderr |
+ | extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
+ | lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
+ ```
+ - For the example above, the reported score would be the average: (0.9100 + 0.5000) / 2 = 0.7050
+ - If scores don't appear, check console output for extraction errors or parsing issues
+
+ ## Files
+
+ - [app.py](app.py) - Main Gradio application with UI and job management
+ - [utils/](utils/) - Utility package with helper modules:
+   - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
+   - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
+   - [utils/__init__.py](utils/__init__.py) - Package initialization and exports
+ - [model_providers.txt](model_providers.txt) - Configuration file with model-provider combinations
+ - [requirements.txt](requirements.txt) - Python dependencies
+ - [README.md](README.md) - This file
+
+ ## License
+
+ This project is provided as-is for evaluation testing purposes.
app.py ADDED
@@ -0,0 +1,135 @@
+ import gradio as gr
+ import time
+ from apscheduler.schedulers.background import BackgroundScheduler
+ import threading
+ import globals
+ from globals import TASKS, LOCAL_CONFIG_FILE
+ from utils.io import initialize_models_providers_file, save_results, load_results, load_models_providers, get_results_table
+ from utils.jobs import run_single_job, launch_jobs, update_job_statuses
+ from typing import List, Optional
+
+
+ def status_monitor() -> None:
+     """Background thread to monitor job statuses."""
+     while True:
+         update_job_statuses()
+         time.sleep(240)  # Check every 4 minutes
+
+
+ def daily_checkpoint() -> None:
+     """Daily checkpoint - save current state."""
+     print("Daily checkpoint - saving current state")
+     save_results()
+
+
+ # Create Gradio interface
+ def create_app() -> gr.Blocks:
+     with gr.Blocks(title="Inference Provider Testing Dashboard") as demo:
+         gr.Markdown("# Inference Provider Testing Dashboard")
+         gr.Markdown("Launch and monitor evaluation jobs for multiple models and providers.")
+
+         output = gr.Textbox(label="Logs and status", interactive=False)
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("## Initialize Config File")
+                 init_btn = gr.Button("Fetch and Initialize Models/Providers", variant="secondary")
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("## Launch Jobs")
+                 launch_btn = gr.Button("Launch All Jobs", variant="primary")
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("## Job Results")
+                 results_table = gr.Dataframe(
+                     headers=["Model", "Provider", "Last Run", "Status", "Current Score", "Previous Score"],
+                     value=get_results_table(),
+                     interactive=False,
+                     wrap=True
+                 )
+                 refresh_btn = gr.Button("Refresh Results")
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("## Relaunch Individual Job")
+
+                 # Load model-provider combinations
+                 models_providers = load_models_providers(LOCAL_CONFIG_FILE)
+                 model_choices = sorted(list(set([mp[0] for mp in models_providers])))
+
+                 relaunch_model = gr.Dropdown(
+                     label="Model",
+                     choices=model_choices,
+                     interactive=True
+                 )
+                 relaunch_provider = gr.Dropdown(
+                     label="Provider",
+                     choices=[],
+                     interactive=True
+                 )
+                 relaunch_btn = gr.Button("Relaunch Job", variant="secondary")
+
+         def update_provider_choices(model: Optional[str]) -> gr.update:
+             """Update provider dropdown based on selected model."""
+             if not model:
+                 return gr.update(choices=[])
+
+             # Get providers for the selected model from the config file
+             models_providers = load_models_providers(LOCAL_CONFIG_FILE)
+             providers = [mp[1] for mp in models_providers if mp[0] == model]
+
+             return gr.update(choices=providers, value=providers[0] if providers else None)
+
+         # Event handlers
+         init_btn.click(
+             fn=initialize_models_providers_file,
+             outputs=output
+         )
+
+         launch_btn.click(
+             fn=launch_jobs,
+             outputs=output
+         )
+
+         refresh_btn.click(
+             fn=get_results_table,
+             outputs=results_table
+         )
+
+         # Update provider dropdown when model is selected
+         relaunch_model.change(
+             fn=update_provider_choices,
+             inputs=relaunch_model,
+             outputs=relaunch_provider
+         )
+
+         relaunch_btn.click(
+             fn=run_single_job,
+             inputs=[relaunch_model, relaunch_provider],
+             outputs=output
+         )
+
+     return demo
+
+
+ if __name__ == "__main__":
+     # Load previous results
+     load_results()
+     print("Starting Inference Provider Testing Dashboard")
+
+     # Start status monitor thread
+     monitor_thread = threading.Thread(target=status_monitor, daemon=True)
+     monitor_thread.start()
+     print("Job status monitor started")
+
+     # Start APScheduler for daily checkpoint
+     scheduler = BackgroundScheduler()
+     scheduler.add_job(daily_checkpoint, 'cron', hour=0, minute=0)  # Run at midnight
+     scheduler.start()
+     print("Daily checkpoint scheduler started (saves at 00:00)")
+
+     # Create and launch the Gradio interface
+     demo = create_app()
+     demo.launch(server_name="0.0.0.0", server_port=7860)
globals.py ADDED
@@ -0,0 +1,21 @@
+ """Global variables and configuration for the Inference Provider Testing Dashboard."""
+
+ import threading
+ from typing import Dict, Any, Optional
+
+ # Type definition for job result entries
+ JobResult = Dict[str, Any]  # {model, provider, last_run, status, current_score, previous_score, job_id}
+
+ # Global variables to track jobs
+ job_results: Dict[str, JobResult] = {}  # {model_provider_key: JobResult}
+ results_lock: threading.Lock = threading.Lock()
+
+ # Configuration
+ RESULTS_DATASET_NAME: str = "IPTesting/inference-provider-test-results"
+ LOCAL_CONFIG_FILE: str = "model_providers.txt"
+ TASKS: str = "extended|ifeval|0,lighteval|gsm_plus|0,lighteval|gpqa:diamond|0"
+
+
+ def get_model_provider_key(model: str, provider: str) -> str:
+     """Create a unique key for model-provider combination."""
+     return f"{model}||{provider}"
model_providers.txt ADDED
@@ -0,0 +1,9 @@
+ # Models and Providers Configuration
+ # Format: model_name provider_name
+ # Auto-generated on 2025-10-10 15:38:11
+
+ deepseek-ai/DeepSeek-R1 fireworks-ai
+ deepseek-ai/DeepSeek-R1 hyperbolic
+ deepseek-ai/DeepSeek-R1 novita
+ deepseek-ai/DeepSeek-R1 together
+ deepseek-ai/DeepSeek-R1 sambanova
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ gradio>=4.0.0
+ huggingface-hub>=0.20.0
+ apscheduler>=3.10.0
+ datasets>=2.14.0
+ pandas>=1.5.0
utils/__init__.py ADDED
File without changes
utils/io.py ADDED
@@ -0,0 +1,149 @@
+ from huggingface_hub import list_models, model_info
+ from datetime import datetime
+ from datasets import Dataset, load_dataset
+ import pandas as pd
+ import os
+ import globals
+ from typing import List, Tuple
+
+
+ def get_models_providers() -> List[Tuple[str, List[str]]]:
+     """Get list of popular text generation models and associated providers from Hugging Face."""
+     models = list_models(
+         filter="text-generation",
+         sort="likes",
+         direction=-1,
+         limit=1,  # currently fetches only the single most-liked model
+         expand="inferenceProviderMapping"
+     )
+
+     model_providers = [
+         (model.id, [p.provider for p in model.inference_provider_mapping])
+         for model in models
+         if hasattr(model, 'inference_provider_mapping') and model.inference_provider_mapping
+     ]
+     return model_providers
+
+
+ def initialize_models_providers_file(file_path: str = globals.LOCAL_CONFIG_FILE) -> str:
+     """Initialize the model-provider config file with popular models and their providers."""
+     model_to_providers = get_models_providers()
+
+     with open(file_path, 'w') as f:
+         f.write("# Models and Providers Configuration\n")
+         f.write("# Format: model_name provider_name\n")
+         f.write(f"# Auto-generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
+
+         count = 0
+         for (model_id, providers) in model_to_providers:
+             try:
+                 for provider in providers:
+                     f.write(f"{model_id} {provider}\n")
+                     count += 1
+             except Exception as e:
+                 print(f"Error processing model {model_id}: {e}")
+                 continue
+
+     print(f"Successfully wrote {count} model-provider combinations to {file_path}")
+     return f"Initialized {count} model-provider combinations"
+
+
+ def load_models_providers(file_path: str = globals.LOCAL_CONFIG_FILE) -> List[Tuple[str, str]]:
+     """Load models and providers from text file."""
+     models_providers = []
+     try:
+         with open(file_path, 'r') as f:
+             for line in f:
+                 line = line.strip()
+                 # Skip empty lines and comments
+                 if line and not line.startswith('#'):
+                     parts = line.split()
+                     if len(parts) >= 2:
+                         model = parts[0]
+                         provider = parts[1]
+                         models_providers.append((model, provider))
+     except Exception as e:
+         print(f"Error loading {file_path}: {e}")
+     return models_providers
+
+
+ def save_results() -> None:
+     """Persist job results to HuggingFace dataset."""
+     try:
+         with globals.results_lock:
+             if not globals.job_results:
+                 print("No results to save")
+                 return
+
+             records = list(globals.job_results.values())
+             df = pd.DataFrame(records)
+             dataset = Dataset.from_pandas(df)
+
+             # Push to HuggingFace Hub
+             dataset.push_to_hub(
+                 globals.RESULTS_DATASET_NAME,
+                 token=os.getenv("HF_TOKEN"),
+                 private=False
+             )
+             print(f"Saved {len(records)} results to dataset")
+
+     except Exception as e:
+         print(f"Error saving results to dataset: {e}")
+
+
+ def load_results() -> None:
+     """Load job results from HuggingFace dataset."""
+     try:
+         # Try to load existing dataset
+         dataset = load_dataset(
+             globals.RESULTS_DATASET_NAME,
+             split="train",
+             token=os.getenv("HF_TOKEN")
+         )
+
+         # Convert dataset to job_results dict
+         for row in dataset:
+             key = globals.get_model_provider_key(row["model"], row["provider"])
+             globals.job_results[key] = {
+                 "model": row["model"],
+                 "provider": row["provider"],
+                 "last_run": row["last_run"],
+                 "status": row["status"],
+                 "current_score": row["current_score"],
+                 "previous_score": row["previous_score"],
+                 "job_id": row["job_id"]
+             }
+
+         print(f"Loaded {len(globals.job_results)} results from dataset")
+
+     except Exception as e:
+         print(f"No existing dataset found or error loading: {e}")
+         print("Starting with empty results")
+
+ def get_results_table() -> List[List[str]]:
+     """Return job results as a list for Gradio DataFrame."""
+     with globals.results_lock:
+         if not globals.job_results:
+             return []
+
+         table_data = []
+         for key, info in globals.job_results.items():
+             current_score = info.get("current_score", "N/A")
+             if current_score is not None and isinstance(current_score, (int, float)):
+                 current_score = f"{current_score:.4f}"
+
+             previous_score = info.get("previous_score", "N/A")
+             if previous_score is not None and isinstance(previous_score, (int, float)):
+                 previous_score = f"{previous_score:.4f}"
+
+             table_data.append([
+                 info["model"],
+                 info["provider"],
+                 info["last_run"],
+                 info["status"],
+                 current_score,
+                 previous_score
+             ])
+
+         return table_data
+
utils/jobs.py ADDED
@@ -0,0 +1,198 @@
+ from huggingface_hub import run_job, inspect_job
+ import os
+ import re
+ import time
+ from datetime import datetime
+ import globals
+ from utils.io import save_results, load_models_providers
+ from typing import Optional
+
+
+ def extract_score_from_job(job_id: str) -> Optional[float]:
+     """Extract average score from completed job logs.
+
+     Parses the results table and calculates the average of the main metric
+     for each task (the metric on the same line as the task name).
+     """
+     try:
+         # Inspect the job to get details and logs
+         job_info = inspect_job(job_id=job_id)
+
+         # Get the logs from the job
+         if hasattr(job_info, 'logs') and job_info.logs:
+             logs = job_info.logs
+             lines = logs.split('\n')
+
+             # Find the results table
+             # Look for lines that match the pattern: |task_name|version|metric|value|...|
+             # We want to extract the score (value) from lines where the task name is not empty
+
+             scores = []
+
+             for line in lines:
+                 # Check if we're in a table (contains pipe separators)
+                 if '|' in line:
+                     parts = [p.strip() for p in line.split('|')]
+
+                     # Skip header and separator lines
+                     # Table format: | Task | Version | Metric | Value | | Stderr |
+                     if len(parts) >= 5:
+                         task = parts[1] if len(parts) > 1 else ""
+                         metric = parts[3] if len(parts) > 3 else ""
+                         value = parts[4] if len(parts) > 4 else ""
+
+                         # We only want lines where the task name is not empty (main metric for that task)
+                         # Skip lines with "Task", "---", or empty task names
+                         if task and task not in ["Task", ""] and not task.startswith("-"):
+                             # Try to extract numeric value
+                             # Remove any extra characters and convert to float
+                             value_clean = value.strip()
+                             try:
+                                 # Extract the numeric part (may have ± symbol after)
+                                 score_match = re.match(r'([0-9]+\.?[0-9]*)', value_clean)
+                                 if score_match:
+                                     score = float(score_match.group(1))
+                                     scores.append(score)
+                                     print(f"Extracted score {score} for task '{task}' metric '{metric}'")
+                             except (ValueError, AttributeError):
+                                 continue
+
+             # Calculate average of all task scores
+             if scores:
+                 average_score = sum(scores) / len(scores)
+                 print(f"Calculated average score: {average_score:.4f} from {len(scores)} tasks")
+                 return average_score
+             else:
+                 print("No scores found in job logs")
+
+         return None
+
+     except Exception as e:
+         print(f"Error extracting score for job {job_id}: {e}")
+         import traceback
+         traceback.print_exc()
+         return None
+
+
+ def run_single_job(model: str, provider: str, tasks: str = globals.TASKS) -> Optional[str]:
+     """Run a single job for a model-provider combination."""
+
+     if not model or not provider:
+         print("Missing model or provider")
+         return -1
+
+     # Verify the model-provider combination exists in the config
+     models_providers = load_models_providers(globals.LOCAL_CONFIG_FILE)
+     if (model, provider) not in models_providers:
+         print(f"Error: {model} with {provider} not found in {globals.LOCAL_CONFIG_FILE}")
+         return -1
+
+     # Check if job is already running
+     key = globals.get_model_provider_key(model, provider)
+     with globals.results_lock:
+         if key in globals.job_results:
+             current_status = globals.job_results[key].get("status")
+             if current_status == "running":
+                 print(f"Job for {model} on {provider} is already running. Please wait for it to complete.")
+                 return -1
+
+     print(f"Starting job for model={model}, provider={provider}")
+
+     job = run_job(
+         image="hf.co/spaces/OpenEvals/EvalsOnTheHub",
+         command=[
+             "lighteval", "endpoint", "inference-providers",
+             f"model_name={model},provider={provider}",
+             tasks,
+             "--push-to-hub", "--save-details",
+             "--results-org", "IPTesting",
+             "--max-samples", "10"
+         ],
+         namespace="clefourrier",
+         secrets={"HF_TOKEN": os.getenv("HF_TOKEN")},
+         token=os.getenv("HF_TOKEN")
+     )
+
+     job_id = job.id
+     key = globals.get_model_provider_key(model, provider)
+
+     with globals.results_lock:
+         # Move current score to previous score if it exists (relaunching)
+         previous_score = None
+         if key in globals.job_results and globals.job_results[key].get("current_score"):
+             previous_score = globals.job_results[key]["current_score"]
+
+         globals.job_results[key] = {
+             "model": model,
+             "provider": provider,
+             "last_run": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+             "status": "running",
+             "current_score": None,
+             "previous_score": previous_score,
+             "job_id": job_id
+         }
+
+     save_results()
+     print(f"Job launched: ID={job_id}, model={model}, provider={provider}")
+     return job_id
+
+
+ def launch_jobs(tasks: str = globals.TASKS, config_file: str = globals.LOCAL_CONFIG_FILE):
+     """Launch jobs for all models and providers."""
+     models_providers = load_models_providers(config_file)
+
+     if not models_providers:
+         print("No valid model-provider combinations found in config file")
+         return "No valid model-provider combinations found"
+
+     print(f"Found {len(models_providers)} model-provider combinations")
+
+     launched_count = 0
+     for model, provider in models_providers:
+         job_id = run_single_job(model, provider, tasks)
+         if job_id != -1:
+             launched_count += 1
+         # Small delay between launches to avoid rate limiting
+         time.sleep(2)
+
+     print(f"Launched {launched_count}/{len(models_providers)} jobs successfully")
+     return f"Launched {launched_count} jobs"
+
+
+ def update_job_statuses() -> None:
+     """Check and update the status of active jobs."""
+     try:
+         with globals.results_lock:
+             keys = list(globals.job_results.keys())
+
+         for key in keys:
+             try:
+                 with globals.results_lock:
+                     if globals.job_results[key]["status"] in ["complete", "failed", "cancelled", "COMPLETED"]:
+                         continue  # Skip already finished jobs
+
+                     job_id = globals.job_results[key]["job_id"]
+
+                 job_info = inspect_job(job_id=job_id)
+                 new_status = job_info.status.stage
+
+                 with globals.results_lock:
+                     old_status = globals.job_results[key]["status"]
+
+                     if old_status != new_status:
+                         globals.job_results[key]["status"] = new_status
+                         print(f"Job {job_id} status changed: {old_status} -> {new_status}")
+
+                         # If job completed, try to extract score
+                         if new_status == "COMPLETED":
+                             score = extract_score_from_job(job_id)
+                             if score is not None:
+                                 globals.job_results[key]["current_score"] = score
+
+             except Exception as e:
+                 print(f"Error checking job: {str(e)}")
+
+         save_results()
+
+     except Exception as e:
+         print(f"Error in update_job_statuses: {str(e)}")