---
title: MTSAIR SBS Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
sdk_version: 5.19.0
---
# Run the benchmark locally
You can run the benchmark locally on any open dataset in the expected format using the code from this repository. Two steps are required:
1) Generate model answers: generate answers from the candidate model (and from the baseline model, if it is not GPT-4o or GPT-3.5-turbo). Use the script src/evaluate/generate_answers.py for this step.
By default, the script assumes a server compatible with the OpenAI API (such as one deployed with vLLM). If your setup is different, add your own APIModelBase subclass in src/evaluate/util.py.
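The actual base-class interface is defined in src/evaluate/util.py; the sketch below is purely illustrative, and the constructor arguments and method name are assumptions rather than the repository's real signatures:
```python
# Illustrative sketch only: the real APIModelBase interface lives in
# src/evaluate/util.py, and the method name/signature here are assumptions.
from src.evaluate.util import APIModelBase


class MyCustomModel(APIModelBase):
    """Adapter for a custom, non-OpenAI-compatible inference server."""

    def __init__(self, hostname: str, model_name: str):
        self.hostname = hostname
        self.model_name = model_name

    def generate(self, prompt: str) -> str:
        # Call your own server here and return the completion text.
        # Check util.py for the actual abstract methods to override.
        raise NotImplementedError
```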
Example commands for generating answers:
For asynchronous generation:
```
export OPENAI_API_KEY=<your_token>
python3 -m src.evaluate.generate_answers --hostname https://api.openai.com/v1 --model-name gpt-3.5-turbo --model-openai gpt-4o --chunk-size 100
```
For synchronous generation:
```
python3 -m src.evaluate.generate_answers --hostname https://api.openai.com/v1 --model-name gpt-3.5-turbo --model-openai gpt-4o --sync
```
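If the candidate model is served locally, any OpenAI-compatible endpoint works in place of https://api.openai.com/v1. For example, with a recent vLLM release that provides the `vllm serve` entrypoint (the model name below is a placeholder):
```
vllm serve <your_model> --port 8000
python3 -m src.evaluate.generate_answers --hostname http://localhost:8000/v1 --model-name <your_model> --model-openai gpt-4o --chunk-size 100
```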
2) Evaluate answers: score the generated answers with your LLM-as-a-judge. Here, too, specify the address of an OpenAI-API-compatible server. Example command:
```
python3 -m src.evaluate.evaluate_answers --hostname https://api.fireworks.ai/inference/v1 --eval-model accounts/fireworks/models/deepseek-r1-distill-llama-70b --chunk-size 100 --candidate-model gpt-3.5-turbo
```
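Conceptually, the judge model receives the question together with the candidate's and the baseline's answers and names a winner. Below is a minimal sketch of that pattern using the openai Python client; the prompt and result parsing actually used by src/evaluate/evaluate_answers.py will differ:
```python
# Minimal sketch of pairwise LLM-as-a-judge comparison; the actual prompt and
# parsing in src/evaluate/evaluate_answers.py will differ.
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment by default.
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1")

PROMPT = (
    "You are an impartial judge. Given a question and two answers, "
    "reply with exactly 'A', 'B', or 'tie'.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    response = client.chat.completions.create(
        model="accounts/fireworks/models/deepseek-r1-distill-llama-70b",
        messages=[{"role": "user", "content": PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```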
These two steps give you a local benchmarking setup based on this repository. Adjust the parameters to match your environment and models.