OpenEvals
AI & ML interests
LLM evaluation
Articles
- A small overview of our research collabs through the years
Papers
- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 241 upvotes
- Zephyr: Direct Distillation of LM Alignment
  Paper • 2310.16944 • Published • 122 upvotes
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
  Paper • 2502.02737 • Published • 249 upvotes
- Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
  Paper • 2412.03304 • Published • 21 upvotes
The original Open LLM Leaderboard (v1) evaluated 7K LLMs from Apr 2023 to Jun 2024 on ARC-c, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.
- Find a leaderboard • 116 likes • Explore and discover all leaderboards from the HF community
- YourBench • 42 likes • Generate custom evaluations from your data easily!
- Example Leaderboard Template • 16 likes • Duplicate this leaderboard to initialize your own!
- Run your LLM evaluations on the hub • Generate a command to run model evaluations
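For context, the evaluation commands generated here build on lighteval, the evaluation harness behind the leaderboard. The sketch below is illustrative only: the `accelerate` subcommand, the `suite|task|few_shot|truncate` task syntax, and the model name are assumptions that vary between lighteval releases, so check `lighteval --help` before running anything.

```python
# A minimal, hypothetical sketch of the kind of command the
# "Run your LLM evaluations on the hub" space produces.
# lighteval's CLI arguments change between versions, so treat the
# argument order and task string below as assumptions, not a recipe.
import subprocess

model_args = "pretrained=HuggingFaceH4/zephyr-7b-beta"  # placeholder model
tasks = "leaderboard|ifeval|0|0"  # assumed suite|task|few_shot|truncate syntax

subprocess.run(["lighteval", "accelerate", model_args, tasks], check=True)
```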
The current Open LLM Leaderboard (v2) has been evaluating LLMs since Jun 2024 on IFEval, MuSR, GPQA, MATH, BBH, and MMLU-Pro.
- Open-LLM performances are plateauing, let's make the leaderboard steep again • 125 likes • Explore and compare advanced language models on a new leaderboard
- Open LLM Leaderboard • 13.7k likes • Track, rank and evaluate open LLMs and chatbots
- open-llm-leaderboard/contents • Viewer • Updated • 4.58k rows • 9.99k downloads • 20 likes
- open-llm-leaderboard/results • Preview • Updated • 50.1k downloads • 15 likes
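Both datasets can be pulled directly with the datasets library. A minimal sketch, assuming the default config exposes a train split; the score column checked below is a guess based on the leaderboard UI, so inspect the real schema first.

```python
# A minimal sketch: load the leaderboard contents dataset and peek at it.
# The "train" split and the score column name are assumptions; inspect
# ds.column_names to see the real schema before relying on any field.
from datasets import load_dataset

ds = load_dataset("open-llm-leaderboard/contents", split="train")
print(ds.column_names)

# Hypothetical: rank rows by an average-score column, if one exists
# under this guessed name.
score_col = "Average ⬆️"  # guessed from the leaderboard UI
if score_col in ds.column_names:
    top = sorted(ds, key=lambda r: r[score_col] or 0, reverse=True)[:5]
    for row in top:
        print(row.get("fullname", "?"), row[score_col])
```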