---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
# Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.

## Setup

### Prerequisites

- Python 3.8+
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face

### Installation

1. Clone or navigate to this repository:

```bash
cd InferenceProviderTestingBackend
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Set up your Hugging Face token as an environment variable:

```bash
export HF_TOKEN="your_huggingface_token_here"
```

**Important**: Your `HF_TOKEN` must have:

- Permission to call inference providers
- Write access to the `IPTesting` organization
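
To confirm the token is set and valid before launching anything, a quick check with `huggingface_hub.whoami` is handy (a minimal sketch; the fine-grained permissions above still have to be granted on the token settings page):

```python
import os

from huggingface_hub import whoami

# Raises if HF_TOKEN is unset or invalid; otherwise shows the account it maps to.
info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated as: {info['name']}")
```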
## Usage

### Starting the Dashboard

Run the Gradio app:

```bash
python app.py
```

### Initialize Models and Providers

1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.
2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:

```
meta-llama/Llama-3.2-3B-Instruct fireworks-ai
meta-llama/Llama-3.2-3B-Instruct together-ai
Qwen/Qwen2.5-7B-Instruct fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3 together-ai
```

Format: `model_name provider_name` (separated by spaces or tabs)
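
For reference, a minimal sketch of how this format can be parsed (`parse_models_providers` is an illustrative name, not necessarily the helper used in `utils/io.py`):

```python
def parse_models_providers(path):
    """Return (model, provider) pairs, skipping blank or malformed lines."""
    pairs = []
    with open(path) as f:
        for line in f:
            parts = line.split()  # splits on any run of spaces or tabs
            if len(parts) >= 2:
                pairs.append((parts[0], parts[1]))
    return pairs

print(parse_models_providers("models_providers.txt"))
```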
### Launching Jobs

1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**

The system will:

- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
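
As a rough sketch of the per-combination launch step: recent `huggingface_hub` releases expose a `run_job` helper for the jobs API (the image and command below are placeholders rather than the values used by `utils/jobs.py`, and the exact signature may vary by library version):

```python
import os

from huggingface_hub import run_job

# Placeholder job: the real evaluation image/command are defined in utils/jobs.py.
job = run_job(
    image="python:3.12",
    command=["python", "-c", "print('evaluation placeholder')"],
    namespace="IPTesting",  # jobs run under this organization
    token=os.environ["HF_TOKEN"],
)
print(f"Launched job {job.id}")
```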
### Monitoring Jobs

The **Job Results** table displays all jobs with:

- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job Id**: The most recent job ID; open `https://huggingface.co/jobs/NAMESPACE/JOBID` (substituting your namespace and this ID) to inspect the job

The table auto-refreshes every 30 seconds, or you can click **"Refresh Results"** for a manual update.
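
The auto-refresh maps naturally onto Gradio's `gr.Timer`; a minimal sketch of the pattern (`fetch_results` is a stand-in for the app's real refresh logic):

```python
import gradio as gr

def fetch_results():
    # Stand-in: the real app reads the shared job-results state here.
    return [["meta-llama/Llama-3.2-3B-Instruct", "fireworks-ai",
             "2025-01-01 00:00", "complete", 0.71, 0.70, "job-id"]]

with gr.Blocks() as demo:
    table = gr.Dataframe(headers=["Model", "Provider", "Last Run", "Status",
                                  "Current Score", "Previous Score", "Latest Job Id"])
    gr.Button("Refresh Results").click(fetch_results, outputs=table)
    gr.Timer(30).tick(fetch_results, outputs=table)  # fires every 30 seconds

demo.launch()
```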
## Configuration

### Tasks Format

The tasks parameter follows the lighteval format, for example:

- `lighteval|mmlu|0` - MMLU benchmark
### Daily Checkpoint

The system automatically saves all results to the HuggingFace dataset at **00:00 (midnight)** every day.

### Data Persistence

All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`), which means:

- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
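
For example, a sketch of pulling the stored results with the `datasets` library (the `train` split name is an assumption; check the dataset page for the actual layout):

```python
from datasets import load_dataset

# Needs a token with read access if the dataset is private.
ds = load_dataset("IPTesting/inference-provider-test-results", split="train")
print(ds.column_names)
print(ds[0])
```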
## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based; see the sketch below)
- **Thread-safe**: Uses locks to prevent concurrent-access issues when reading or updating `job_results`
- **HuggingFace Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset
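
The midnight checkpoint corresponds to a cron-style APScheduler job along these lines (a sketch; `save_checkpoint` stands in for the app's actual persistence function):

```python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

def save_checkpoint():
    # Stand-in: the real app pushes job results to the HF dataset here.
    print("Saving results to IPTesting/inference-provider-test-results")

scheduler = BackgroundScheduler()
scheduler.add_job(save_checkpoint, CronTrigger(hour=0, minute=0))  # daily at 00:00
scheduler.start()
```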
## Troubleshooting

### Jobs Not Launching

- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages

### Scores Not Appearing

- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores
- Example table format:

```
| Task | Version | Metric | Value | Stderr |
| extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
| lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
```

- If scores don't appear, check console output for extraction errors or parsing issues
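
A minimal sketch of that extraction logic applied to the format above (the real parser lives in `utils/jobs.py` and may differ in detail):

```python
def extract_average_score(log_text):
    """Average the first 'Value' seen for each task in a results table."""
    scores = {}
    for line in log_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 5 or cells[0] in ("", "Task") or cells[0] in scores:
            continue  # skip headers, separators, and repeated task rows
        try:
            scores[cells[0]] = float(cells[3])  # the 'Value' column
        except ValueError:
            continue  # non-numeric rows
    return sum(scores.values()) / len(scores) if scores else float("nan")
```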
## Files

- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
  - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
  - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file