Clémentine committed on
Commit 7f5506e · 1 Parent(s): ec29d6f
Files changed (8)
  1. README.md +188 -1
  2. app.py +135 -0
  3. globals.py +21 -0
  4. model_providers.txt +9 -0
  5. requirements.txt +5 -0
  6. utils/__init__.py +0 -0
  7. utils/io.py +149 -0
  8. utils/jobs.py +198 -0
README.md CHANGED
@@ -9,4 +9,191 @@ app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Inference Provider Testing Dashboard
+
+ A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
+
+ ## Features
+
+ - **Automatic Model Discovery**: Fetch popular text-generation models with inference providers from Hugging Face Hub
+ - **Batch Job Launching**: Run evaluation jobs for multiple model-provider combinations from a configuration file
+ - **Results Table Dashboard**: View all jobs with model, provider, last run, status, current score, and previous score
+ - **Score Tracking**: Automatically extracts average scores from completed jobs and tracks history
+ - **Persistent Storage**: Results saved to a Hugging Face dataset for persistence across restarts
+ - **Individual Job Relaunch**: Easily relaunch specific model-provider combinations
+ - **Real-time Monitoring**: Auto-refresh results table every 30 seconds
+ - **Daily Checkpoint**: Automatic daily save at midnight to preserve state
+
+ ## Setup
+
+ ### Prerequisites
+
+ - Python 3.8+
+ - Hugging Face account with API token
+ - Access to the `IPTesting` namespace on Hugging Face
+
+ ### Installation
+
+ 1. Clone or navigate to this repository:
+ ```bash
+ cd InferenceProviderTestingBackend
+ ```
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Set up your Hugging Face token as an environment variable:
+ ```bash
+ export HF_TOKEN="your_huggingface_token_here"
+ ```
+
+ **Important**: Your `HF_TOKEN` must have:
+ - Permission to call inference providers
+ - Write access to the `IPTesting` organization
+
+ ## Usage
+
+ ### Starting the Dashboard
+
+ Run the Gradio app:
+ ```bash
+ python app.py
+ ```
+
+ The dashboard will be available at `http://localhost:7860`.
+
+ ### Initialize Models and Providers
+
+ 1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `model_providers.txt` file with popular models and their available inference providers.
+
+ 2. Alternatively, manually edit `model_providers.txt` with your desired model-provider combinations:
+ ```
+ meta-llama/Llama-3.2-3B-Instruct fireworks-ai
+ meta-llama/Llama-3.2-3B-Instruct together-ai
+ Qwen/Qwen2.5-7B-Instruct fireworks-ai
+ mistralai/Mistral-7B-Instruct-v0.3 together-ai
+ ```
+
+ Format: `model_name provider_name` (separated by spaces or tabs)
+
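+ As a reference, here is a minimal standalone sketch of how each non-comment line is parsed into a `(model, provider)` pair; it mirrors the `load_models_providers` helper in [utils/io.py](utils/io.py):
+
+ ```python
+ from typing import List, Tuple
+
+ def parse_models_providers(path: str = "model_providers.txt") -> List[Tuple[str, str]]:
+     """Sketch of the config parser: one 'model provider' pair per non-comment line."""
+     pairs = []
+     with open(path) as f:
+         for line in f:
+             line = line.strip()
+             if not line or line.startswith("#"):
+                 continue  # skip blank lines and comments
+             parts = line.split()  # any run of spaces or tabs separates the two fields
+             if len(parts) >= 2:
+                 pairs.append((parts[0], parts[1]))
+     return pairs
+ ```
+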
+ ### Launching Jobs
+
+ 1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
+ 2. Verify the config file path (default: `model_providers.txt`)
+ 3. Click **"Launch All Jobs"**
+
+ The system will:
+ - Read all model-provider combinations from the config file
+ - Launch a separate evaluation job for each combination
+ - Log the job ID and status
+ - Monitor job progress automatically
+
+ ### Monitoring Jobs
+
+ The **Job Results** table displays all jobs with:
+ - **Model**: The model being tested
+ - **Provider**: The inference provider
+ - **Last Run**: Timestamp of when the job was last launched
+ - **Status**: Current status (running/complete/failed/cancelled)
+ - **Current Score**: Average score from the most recent run
+ - **Previous Score**: Average score from the prior run (for comparison)
+
+ The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
+
+ ### Relaunching Individual Jobs
+
+ To rerun a specific model-provider combination:
+ 1. Select the model from the dropdown (e.g., `meta-llama/Llama-3.2-3B-Instruct`)
+ 2. Select the provider from the dropdown (e.g., `fireworks-ai`)
+ 3. Optionally modify the tasks
+ 4. Click "Relaunch Job"
+
+ When relaunching, the current score automatically moves to the **Previous Score** column for comparison.
+
+ ## Configuration
+
+ ### Tasks Format
+
+ The tasks parameter follows the lighteval format. Examples:
+ - `lighteval|mmlu|0|0` - MMLU benchmark
+ - `lighteval|hellaswag|0|0` - HellaSwag benchmark
+
+ ### Daily Checkpoint
+
+ The system automatically saves all results to the Hugging Face dataset at **00:00 (midnight)** every day. This ensures data persistence and prevents data loss from long-running sessions.
+
+ ### Data Persistence
+
+ All job results are stored in a Hugging Face dataset (`IPTesting/inference-provider-test-results`), which means:
+ - Results persist across app restarts
+ - Historical score comparisons are maintained
+ - Data can be accessed programmatically via the HF datasets library (see the snippet below)
+
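+ For example, a short snippet for reading the stored results back into a pandas DataFrame (this mirrors how `load_results` in [utils/io.py](utils/io.py) reads the dataset; the column names are the fields this app stores):
+
+ ```python
+ from datasets import load_dataset
+
+ # The app pushes this dataset publicly (private=False), so no token is needed to read it
+ results = load_dataset("IPTesting/inference-provider-test-results", split="train")
+ df = results.to_pandas()
+ print(df[["model", "provider", "status", "current_score"]].head())
+ ```
+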
+ ## Job Command Details
+
+ Each job runs with the following configuration:
+ - **Image**: `hf.co/spaces/OpenEvals/EvalsOnTheHub`
+ - **Command**: `lighteval endpoint inference-providers`
+ - **Namespace**: `IPTesting`
+ - **Flags**: `--push-to-hub --save-details --results-org IPTesting`
+
+ Results are automatically pushed to the `IPTesting` organization on Hugging Face Hub.
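+
+ For reference, a trimmed sketch of the underlying `run_job` call (the full version in [utils/jobs.py](utils/jobs.py) also passes `--max-samples` and an `HF_TOKEN` secret); the model, provider, and task below are example values taken from this repository:
+
+ ```python
+ import os
+ from huggingface_hub import run_job
+
+ # Launch one evaluation job for a single model-provider pair (sketch only)
+ job = run_job(
+     image="hf.co/spaces/OpenEvals/EvalsOnTheHub",
+     command=[
+         "lighteval", "endpoint", "inference-providers",
+         "model_name=deepseek-ai/DeepSeek-R1,provider=fireworks-ai",
+         "extended|ifeval|0",
+         "--push-to-hub", "--save-details",
+         "--results-org", "IPTesting",
+     ],
+     namespace="IPTesting",  # namespace as documented above
+     token=os.getenv("HF_TOKEN"),
+ )
+ print(job.id)
+ ```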
+
+ ## Architecture
+
+ - **Main Thread**: Runs the Gradio interface
+ - **Monitor Thread**: Updates job statuses on a fixed interval (every 240 seconds in `app.py`) and extracts scores from completed jobs
+ - **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
+ - **Thread-safe Operations**: Uses locks to prevent race conditions when accessing `job_results` (see the sketch below)
+ - **Hugging Face Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset
+
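+ A minimal sketch of that locking pattern (the actual shared dict and lock are defined in [globals.py](globals.py)):
+
+ ```python
+ import threading
+
+ job_results = {}                 # shared state: {"model||provider": result dict}
+ results_lock = threading.Lock()  # guards every read/write of job_results
+
+ def set_status(key: str, status: str) -> None:
+     # Both the monitor thread and the Gradio handlers touch job_results,
+     # so every access happens inside the lock.
+     with results_lock:
+         if key in job_results:
+             job_results[key]["status"] = status
+ ```
+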
+ ## Troubleshooting
+
+ ### Jobs Not Launching
+
+ - Verify your `HF_TOKEN` is set and has the required permissions
+ - Check that the `IPTesting` namespace exists and you have access
+ - Review logs for specific error messages
+
+ ### Empty Models List
+
+ - Ensure you have internet connectivity
+ - The Hugging Face Hub API must be accessible
+ - Try running the initialization again
+
+ ### Job Status Not Updating
+
+ - Check your internet connection
+ - Verify the job IDs are valid
+ - Check console output for API errors
+
+ ### Scores Not Appearing
+
+ - Scores are extracted from job logs after completion
+ - The extraction parses the results table that appears in job logs
+ - It extracts the score for each task (from the first row where the task name appears)
+ - The final score is the average of all task scores
+ - Example table format:
+ ```
+ | Task | Version | Metric | Value | Stderr |
+ | extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
+ | lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
+ ```
+ - For the example above, the reported score would be the average: (0.9100 + 0.5000) / 2 = 0.7050
+ - If scores don't appear, check console output for extraction errors or parsing issues
+
+ ## Files
+
+ - [app.py](app.py) - Main Gradio application with UI and job management
+ - [utils/](utils/) - Utility package with helper modules:
+   - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
+   - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
+   - [utils/__init__.py](utils/__init__.py) - Package initialization and exports
+ - [model_providers.txt](model_providers.txt) - Configuration file with model-provider combinations
+ - [requirements.txt](requirements.txt) - Python dependencies
+ - [README.md](README.md) - This file
+
+ ## License
+
+ This project is provided as-is for evaluation testing purposes.
app.py ADDED
@@ -0,0 +1,135 @@
+ import gradio as gr
+ import time
+ from apscheduler.schedulers.background import BackgroundScheduler
+ import threading
+ import globals
+ from globals import TASKS, LOCAL_CONFIG_FILE
+ from utils.io import initialize_models_providers_file, save_results, load_results, load_models_providers, get_results_table
+ from utils.jobs import run_single_job, launch_jobs, update_job_statuses
+ from typing import List, Optional
+
+
+ def status_monitor() -> None:
+     """Background thread to monitor job statuses."""
+     while True:
+         update_job_statuses()
+         time.sleep(240)  # Check every 4 minutes
+
+
+ def daily_checkpoint() -> None:
+     """Daily checkpoint - save current state."""
+     print("Daily checkpoint - saving current state")
+     save_results()
+
+
+ # Create Gradio interface
+ def create_app() -> gr.Blocks:
+     with gr.Blocks(title="Inference Provider Testing Dashboard") as demo:
+         gr.Markdown("# Inference Provider Testing Dashboard")
+         gr.Markdown("Launch and monitor evaluation jobs for multiple models and providers.")
+
+         output = gr.Textbox(label="Logs and status", interactive=False)
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("## Initialize Config File")
+                 init_btn = gr.Button("Fetch and Initialize Models/Providers", variant="secondary")
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("## Launch Jobs")
+                 launch_btn = gr.Button("Launch All Jobs", variant="primary")
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("## Job Results")
+                 results_table = gr.Dataframe(
+                     headers=["Model", "Provider", "Last Run", "Status", "Current Score", "Previous Score"],
+                     value=get_results_table(),
+                     interactive=False,
+                     wrap=True
+                 )
+                 refresh_btn = gr.Button("Refresh Results")
+
+         with gr.Row():
+             with gr.Column():
+                 gr.Markdown("## Relaunch Individual Job")
+
+                 # Load model-provider combinations
+                 models_providers = load_models_providers(LOCAL_CONFIG_FILE)
+                 model_choices = sorted(list(set([mp[0] for mp in models_providers])))
+
+                 relaunch_model = gr.Dropdown(
+                     label="Model",
+                     choices=model_choices,
+                     interactive=True
+                 )
+                 relaunch_provider = gr.Dropdown(
+                     label="Provider",
+                     choices=[],
+                     interactive=True
+                 )
+                 relaunch_btn = gr.Button("Relaunch Job", variant="secondary")
+
+         def update_provider_choices(model: Optional[str]) -> gr.update:
+             """Update provider dropdown based on selected model."""
+             if not model:
+                 return gr.update(choices=[])
+
+             # Get providers for the selected model from the config file
+             models_providers = load_models_providers(LOCAL_CONFIG_FILE)
+             providers = [mp[1] for mp in models_providers if mp[0] == model]
+
+             return gr.update(choices=providers, value=providers[0] if providers else None)
+
+         # Event handlers
+         init_btn.click(
+             fn=initialize_models_providers_file,
+             outputs=output
+         )
+
+         launch_btn.click(
+             fn=launch_jobs,
+             outputs=output
+         )
+
+         refresh_btn.click(
+             fn=get_results_table,
+             outputs=results_table
+         )
+
+         # Update provider dropdown when model is selected
+         relaunch_model.change(
+             fn=update_provider_choices,
+             inputs=relaunch_model,
+             outputs=relaunch_provider
+         )
+
+         relaunch_btn.click(
+             fn=run_single_job,
+             inputs=[relaunch_model, relaunch_provider],
+             outputs=output
+         )
+
+     return demo
+
+
+ if __name__ == "__main__":
+     # Load previous results
+     load_results()
+     print("Starting Inference Provider Testing Dashboard")
+
+     # Start status monitor thread
+     monitor_thread = threading.Thread(target=status_monitor, daemon=True)
+     monitor_thread.start()
+     print("Job status monitor started")
+
+     # Start APScheduler for daily checkpoint
+     scheduler = BackgroundScheduler()
+     scheduler.add_job(daily_checkpoint, 'cron', hour=0, minute=0)  # Run at midnight
+     scheduler.start()
+     print("Daily checkpoint scheduler started (saves at 00:00)")
+
+     # Create and launch the Gradio interface
+     demo = create_app()
+     demo.launch(server_name="0.0.0.0", server_port=7860)
globals.py ADDED
@@ -0,0 +1,21 @@
+ """Global variables and configuration for the Inference Provider Testing Dashboard."""
+
+ import threading
+ from typing import Dict, Any, Optional
+
+ # Type definition for job result entries
+ JobResult = Dict[str, Any]  # {model, provider, last_run, status, current_score, previous_score, job_id}
+
+ # Global variables to track jobs
+ job_results: Dict[str, JobResult] = {}  # {model_provider_key: JobResult}
+ results_lock: threading.Lock = threading.Lock()
+
+ # Configuration
+ RESULTS_DATASET_NAME: str = "IPTesting/inference-provider-test-results"
+ LOCAL_CONFIG_FILE: str = "model_providers.txt"
+ TASKS: str = "extended|ifeval|0,lighteval|gsm_plus|0,lighteval|gpqa:diamond|0"
+
+
+ def get_model_provider_key(model: str, provider: str) -> str:
+     """Create a unique key for model-provider combination."""
+     return f"{model}||{provider}"
model_providers.txt ADDED
@@ -0,0 +1,9 @@
+ # Models and Providers Configuration
+ # Format: model_name provider_name
+ # Auto-generated on 2025-10-10 15:38:11
+
+ deepseek-ai/DeepSeek-R1 fireworks-ai
+ deepseek-ai/DeepSeek-R1 hyperbolic
+ deepseek-ai/DeepSeek-R1 novita
+ deepseek-ai/DeepSeek-R1 together
+ deepseek-ai/DeepSeek-R1 sambanova
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ gradio>=4.0.0
+ huggingface-hub>=0.20.0
+ apscheduler>=3.10.0
+ datasets>=2.14.0
+ pandas>=1.5.0
utils/__init__.py ADDED
File without changes
utils/io.py ADDED
@@ -0,0 +1,149 @@
+ from huggingface_hub import list_models, model_info
+ from datetime import datetime
+ from datasets import Dataset, load_dataset
+ import pandas as pd
+ import os
+ import globals
+ from typing import List, Tuple
+
+
+ def get_models_providers() -> List[Tuple[str, List[str]]]:
+     """Get list of popular text generation models and associated providers from Hugging Face."""
+     models = list_models(
+         filter="text-generation",
+         sort="likes",
+         direction=-1,
+         limit=1,  # currently fetches only the single most-liked model
+         expand="inferenceProviderMapping"
+     )
+
+     model_providers = [
+         (model.id, [p.provider for p in model.inference_provider_mapping])
+         for model in models
+         if hasattr(model, 'inference_provider_mapping') and model.inference_provider_mapping
+     ]
+     return model_providers
+
+
+ def initialize_models_providers_file(file_path: str = globals.LOCAL_CONFIG_FILE) -> str:
+     """Initialize the model-provider config file with popular models and their providers."""
+     model_to_providers = get_models_providers()
+
+     with open(file_path, 'w') as f:
+         f.write("# Models and Providers Configuration\n")
+         f.write("# Format: model_name provider_name\n")
+         f.write(f"# Auto-generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
+
+         count = 0
+         for (model_id, providers) in model_to_providers:
+             try:
+                 for provider in providers:
+                     f.write(f"{model_id} {provider}\n")
+                     count += 1
+             except Exception as e:
+                 print(f"Error processing model {model_id}: {e}")
+                 continue
+
+     print(f"Successfully wrote {count} model-provider combinations to {file_path}")
+     return f"Initialized {count} model-provider combinations"
+
+
+ def load_models_providers(file_path: str = globals.LOCAL_CONFIG_FILE) -> List[Tuple[str, str]]:
+     """Load models and providers from text file."""
+     models_providers = []
+     try:
+         with open(file_path, 'r') as f:
+             for line in f:
+                 line = line.strip()
+                 # Skip empty lines and comments
+                 if line and not line.startswith('#'):
+                     parts = line.split()
+                     if len(parts) >= 2:
+                         model = parts[0]
+                         provider = parts[1]
+                         models_providers.append((model, provider))
+     except Exception as e:
+         print(f"Error loading {file_path}: {e}")
+     return models_providers
+
+
+ def save_results() -> None:
+     """Persist job results to HuggingFace dataset."""
+     try:
+         with globals.results_lock:
+             if not globals.job_results:
+                 print("No results to save")
+                 return
+
+             records = list(globals.job_results.values())
+             df = pd.DataFrame(records)
+             dataset = Dataset.from_pandas(df)
+
+             # Push to HuggingFace Hub
+             dataset.push_to_hub(
+                 globals.RESULTS_DATASET_NAME,
+                 token=os.getenv("HF_TOKEN"),
+                 private=False
+             )
+             print(f"Saved {len(records)} results to dataset")
+
+     except Exception as e:
+         print(f"Error saving results to dataset: {e}")
+
+
+ def load_results() -> None:
+     """Load job results from HuggingFace dataset."""
+     try:
+         # Try to load existing dataset
+         dataset = load_dataset(
+             globals.RESULTS_DATASET_NAME,
+             split="train",
+             token=os.getenv("HF_TOKEN")
+         )
+
+         # Convert dataset to job_results dict
+         for row in dataset:
+             key = globals.get_model_provider_key(row["model"], row["provider"])
+             globals.job_results[key] = {
+                 "model": row["model"],
+                 "provider": row["provider"],
+                 "last_run": row["last_run"],
+                 "status": row["status"],
+                 "current_score": row["current_score"],
+                 "previous_score": row["previous_score"],
+                 "job_id": row["job_id"]
+             }
+
+         print(f"Loaded {len(globals.job_results)} results from dataset")
+
+     except Exception as e:
+         print(f"No existing dataset found or error loading: {e}")
+         print("Starting with empty results")
+
+ def get_results_table() -> List[List[str]]:
+     """Return job results as a list for Gradio DataFrame."""
+     with globals.results_lock:
+         if not globals.job_results:
+             return []
+
+         table_data = []
+         for key, info in globals.job_results.items():
+             current_score = info.get("current_score", "N/A")
+             if current_score is not None and isinstance(current_score, (int, float)):
+                 current_score = f"{current_score:.4f}"
+
+             previous_score = info.get("previous_score", "N/A")
+             if previous_score is not None and isinstance(previous_score, (int, float)):
+                 previous_score = f"{previous_score:.4f}"
+
+             table_data.append([
+                 info["model"],
+                 info["provider"],
+                 info["last_run"],
+                 info["status"],
+                 current_score,
+                 previous_score
+             ])
+
+         return table_data
+
utils/jobs.py ADDED
@@ -0,0 +1,198 @@
+ from huggingface_hub import run_job, inspect_job
+ import os
+ import re
+ import time
+ from datetime import datetime
+ import globals
+ from utils.io import save_results, load_models_providers
+ from typing import Optional
+
+
+ def extract_score_from_job(job_id: str) -> Optional[float]:
+     """Extract average score from completed job logs.
+
+     Parses the results table and calculates the average of the main metric
+     for each task (the metric on the same line as the task name).
+     """
+     try:
+         # Inspect the job to get details and logs
+         job_info = inspect_job(job_id=job_id)
+
+         # Get the logs from the job
+         if hasattr(job_info, 'logs') and job_info.logs:
+             logs = job_info.logs
+             lines = logs.split('\n')
+
+             # Find the results table
+             # Look for lines that match the pattern: |task_name|version|metric|value|...|
+             # We want to extract the score (value) from lines where the task name is not empty
+
+             scores = []
+
+             for line in lines:
+                 # Check if we're in a table (contains pipe separators)
+                 if '|' in line:
+                     parts = [p.strip() for p in line.split('|')]
+
+                     # Skip header and separator lines
+                     # Table format: | Task | Version | Metric | Value | | Stderr |
+                     if len(parts) >= 5:
+                         task = parts[1] if len(parts) > 1 else ""
+                         metric = parts[3] if len(parts) > 3 else ""
+                         value = parts[4] if len(parts) > 4 else ""
+
+                         # We only want lines where the task name is not empty (main metric for that task)
+                         # Skip lines with "Task", "---", or empty task names
+                         if task and task not in ["Task", ""] and not task.startswith("-"):
+                             # Try to extract numeric value
+                             # Remove any extra characters and convert to float
+                             value_clean = value.strip()
+                             try:
+                                 # Extract the numeric part (may have ± symbol after)
+                                 score_match = re.match(r'([0-9]+\.?[0-9]*)', value_clean)
+                                 if score_match:
+                                     score = float(score_match.group(1))
+                                     scores.append(score)
+                                     print(f"Extracted score {score} for task '{task}' metric '{metric}'")
+                             except (ValueError, AttributeError):
+                                 continue
+
+             # Calculate average of all task scores
+             if scores:
+                 average_score = sum(scores) / len(scores)
+                 print(f"Calculated average score: {average_score:.4f} from {len(scores)} tasks")
+                 return average_score
+             else:
+                 print("No scores found in job logs")
+
+         return None
+
+     except Exception as e:
+         print(f"Error extracting score for job {job_id}: {e}")
+         import traceback
+         traceback.print_exc()
+         return None
+
+
+ def run_single_job(model: str, provider: str, tasks: str = globals.TASKS) -> Optional[str]:
+     """Run a single job for a model-provider combination."""
+
+     if not model or not provider:
+         print("Missing model or provider")
+         return -1
+
+     # Verify the model-provider combination exists in the config
+     models_providers = load_models_providers(globals.LOCAL_CONFIG_FILE)
+     if (model, provider) not in models_providers:
+         print(f"Error: {model} with {provider} not found in {globals.LOCAL_CONFIG_FILE}")
+         return -1
+
+     # Check if job is already running
+     key = globals.get_model_provider_key(model, provider)
+     with globals.results_lock:
+         if key in globals.job_results:
+             current_status = globals.job_results[key].get("status")
+             if current_status == "running":
+                 print(f"Job for {model} on {provider} is already running. Please wait for it to complete.")
+                 return -1
+
+     print(f"Starting job for model={model}, provider={provider}")
+
+     job = run_job(
+         image="hf.co/spaces/OpenEvals/EvalsOnTheHub",
+         command=[
+             "lighteval", "endpoint", "inference-providers",
+             f"model_name={model},provider={provider}",
+             tasks,
+             "--push-to-hub", "--save-details",
+             "--results-org", "IPTesting",
+             "--max-samples", "10"
+         ],
+         namespace="clefourrier",
+         secrets={"HF_TOKEN": os.getenv("HF_TOKEN")},
+         token=os.getenv("HF_TOKEN")
+     )
+
+     job_id = job.id
+     key = globals.get_model_provider_key(model, provider)
+
+     with globals.results_lock:
+         # Move current score to previous score if it exists (relaunching)
+         previous_score = None
+         if key in globals.job_results and globals.job_results[key].get("current_score"):
+             previous_score = globals.job_results[key]["current_score"]
+
+         globals.job_results[key] = {
+             "model": model,
+             "provider": provider,
+             "last_run": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+             "status": "running",
+             "current_score": None,
+             "previous_score": previous_score,
+             "job_id": job_id
+         }
+
+     save_results()
+     print(f"Job launched: ID={job_id}, model={model}, provider={provider}")
+     return job_id
+
+
+ def launch_jobs(tasks: str = globals.TASKS, config_file: str = globals.LOCAL_CONFIG_FILE):
+     """Launch jobs for all models and providers."""
+     models_providers = load_models_providers(config_file)
+
+     if not models_providers:
+         print("No valid model-provider combinations found in config file")
+         return "No valid model-provider combinations found"
+
+     print(f"Found {len(models_providers)} model-provider combinations")
+
+     launched_count = 0
+     for model, provider in models_providers:
+         job_id = run_single_job(model, provider, tasks)
+         if job_id != -1:
+             launched_count += 1
+         # Small delay between launches to avoid rate limiting
+         time.sleep(2)
+
+     print(f"Launched {launched_count}/{len(models_providers)} jobs successfully")
+     return f"Launched {launched_count} jobs"
+
+
+ def update_job_statuses() -> None:
+     """Check and update the status of active jobs."""
+     try:
+         with globals.results_lock:
+             keys = list(globals.job_results.keys())
+
+         for key in keys:
+             try:
+                 with globals.results_lock:
+                     if globals.job_results[key]["status"] in ["complete", "failed", "cancelled", "COMPLETED"]:
+                         continue  # Skip already finished jobs
+
+                     job_id = globals.job_results[key]["job_id"]
+
+                 job_info = inspect_job(job_id=job_id)
+                 new_status = job_info.status.stage
+
+                 with globals.results_lock:
+                     old_status = globals.job_results[key]["status"]
+
+                     if old_status != new_status:
+                         globals.job_results[key]["status"] = new_status
+                         print(f"Job {job_id} status changed: {old_status} -> {new_status}")
+
+                         # If job completed, try to extract score
+                         if new_status == "COMPLETED":
+                             score = extract_score_from_job(job_id)
+                             if score is not None:
+                                 globals.job_results[key]["current_score"] = score
+
+             except Exception as e:
+                 print(f"Error checking job: {str(e)}")
+
+         save_results()
+
+     except Exception as e:
+         print(f"Error in update_job_statuses: {str(e)}")