---
title: MTSAIR SBS Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
sdk_version: 5.19.0
---
# Run the benchmark locally
You can run the benchmark locally on any open dataset in the expected format using the code from this repository. Two steps are required:
1) Generate model answers: generate answers from the candidate model (and from the baseline model, if it is not GPT-4o or GPT-3.5-turbo). Use the script src/evaluate/generate_answers.py for this step.
By default, the script assumes a server compatible with the OpenAI API (such as one deployed with vLLM). If your setup is different, add your own APIModelBase subclass in src/evaluate/util.py.
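The actual base-class interface is defined in src/evaluate/util.py; the sketch below is purely illustrative, and the constructor arguments and method name are assumptions rather than the repository's real signatures:
```python
# Illustrative sketch only: the real APIModelBase interface lives in
# src/evaluate/util.py, and the method name/signature here are assumptions.
from src.evaluate.util import APIModelBase


class MyCustomModel(APIModelBase):
    """Adapter for a custom, non-OpenAI-compatible inference server."""

    def __init__(self, hostname: str, model_name: str):
        self.hostname = hostname
        self.model_name = model_name

    def generate(self, prompt: str) -> str:
        # Call your own server here and return the completion text.
        # Check util.py for the actual abstract methods to override.
        raise NotImplementedError
```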
Example commands for generating answers:
For asynchronous generation:
```
export OPENAI_API_KEY=<your_token>
python3 -m src.evaluate.generate_answers --hostname https://api.openai.com/v1 --model-name gpt-3.5-turbo --model-openai gpt-4o --chunk-size 100
```
For synchronous generation:
```
python3 -m src.evaluate.generate_answers --hostname https://api.openai.com/v1 --model-name gpt-3.5-turbo --model-openai gpt-4o --sync
```
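If the candidate model is served locally, any OpenAI-compatible endpoint works in place of https://api.openai.com/v1. For example, with a recent vLLM release that provides the `vllm serve` entrypoint (the model name below is a placeholder):
```
vllm serve <your_model> --port 8000
python3 -m src.evaluate.generate_answers --hostname http://localhost:8000/v1 --model-name <your_model> --model-openai gpt-4o --chunk-size 100
```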
2) Evaluate answers: score the generated answers with your LLM-as-a-judge. Here, too, specify the address of an OpenAI-API-compatible server. Example command:
```
python3 -m src.evaluate.evaluate_answers --hostname https://api.fireworks.ai/inference/v1 --eval-model accounts/fireworks/models/deepseek-r1-distill-llama-70b --chunk-size 100 --candidate-model gpt-3.5-turbo
```
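Conceptually, the judge model receives the question together with the candidate's and the baseline's answers and names a winner. Below is a minimal sketch of that pattern using the openai Python client; the prompt and result parsing actually used by src/evaluate/evaluate_answers.py will differ:
```python
# Minimal sketch of pairwise LLM-as-a-judge comparison; the actual prompt and
# parsing in src/evaluate/evaluate_answers.py will differ.
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment by default.
client = OpenAI(base_url="https://api.fireworks.ai/inference/v1")

PROMPT = (
    "You are an impartial judge. Given a question and two answers, "
    "reply with exactly 'A', 'B', or 'tie'.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    response = client.chat.completions.create(
        model="accounts/fireworks/models/deepseek-r1-distill-llama-70b",
        messages=[{"role": "user", "content": PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```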
These two steps give you a local benchmarking setup based on this repository. Adjust the parameters to match your environment and models.