---
title: MTSAIR SBS Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
sdk_version: 5.19.0
---

## Run benchmark locally

To run the benchmark locally on any open dataset that's formatted appropriately, you can use the code from this repository. Here are the two steps you need to follow:

1. Generate model answers: generate answers from the candidate model (and from the baseline model if it is not GPT-4o or GPT-3.5-turbo). You can use the script `src/evaluate/generate_answers.py` for this task.

By default, the script assumes a server compatible with the OpenAI API (such as one deployed with vLLM). If you use a different setup, you need to add your own `APIModelBase` implementation in `src/evaluate/util.py`.
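
If your backend does not speak the OpenAI protocol, a wrapper along these lines could work. This is a minimal, hypothetical sketch, not the repository's actual interface: the real `APIModelBase` in `src/evaluate/util.py` may define different method names, signatures, and constructor arguments, and `MyCustomModel`, `generate`, and the response schema below are illustrative assumptions.

```python
# Hypothetical sketch of a custom model wrapper for src/evaluate/util.py.
# The real APIModelBase interface (method names, signatures, constructor
# arguments) may differ -- adapt this to whatever the base class defines.
import httpx

from src.evaluate.util import APIModelBase  # assumed import path for the base class


class MyCustomModel(APIModelBase):
    """Illustrative wrapper around a custom (non-OpenAI-compatible) HTTP endpoint."""

    def __init__(self, hostname: str, model_name: str, api_key: str):
        self.hostname = hostname  # e.g. "https://my-inference-server/generate"
        self.model_name = model_name
        self.headers = {"Authorization": f"Bearer {api_key}"}

    async def generate(self, messages: list[dict]) -> str:
        # `messages` is assumed to follow the OpenAI chat format:
        # a list of {"role": ..., "content": ...} dicts.
        payload = {"model": self.model_name, "messages": messages}
        async with httpx.AsyncClient(timeout=120.0) as client:
            response = await client.post(self.hostname, json=payload, headers=self.headers)
            response.raise_for_status()
        # The response schema of the custom server is also an assumption here.
        return response.json()["text"]
```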

Example command for asynchronous generation:

```bash
export OPENAI_API_KEY=<your_token>
python3 -m src.evaluate.generate_answers --hostname https://api.openai.com/v1 --model-name gpt-3.5-turbo --model-openai gpt-4o --chunk-size 100
```

For synchronous generation:

```bash
python3 -m src.evaluate.generate_answers --hostname https://api.openai.com/v1 --model-name gpt-3.5-turbo --model-openai gpt-4o --sync
```

2. Evaluate answers: evaluate the generated answers with your LLM-as-a-judge. Here, too, you need to specify the address of a server compatible with the OpenAI API. Example command:
```bash
python3 -m src.evaluate.evaluate_answers --hostname https://api.fireworks.ai/inference/v1 --eval-model accounts/fireworks/models/deepseek-r1-distill-llama-70b --chunk-size 100 --candidate-model gpt-3.5-turbo
```

These steps will help you set up local benchmarking using the provided repository. Make sure to adjust parameters and setup based on your specific environment and models.