Andrew Reed
andrewrreed
AI & ML interests
Applied ML, Practical AI, Inference & Deployment, LLMs, Multi-modal Models, RAG
Recent Activity
- upvoted an article 2 months ago: We Got Claude to Fine-Tune an Open Source LLM
- liked a Space 2 months ago: OpenEvals/evaluation-guidebook
- upvoted a paper 2 months ago: Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Eval Leaderboards
- 🏆 LMArena Leaderboard: Compare and rank AI model performance (4.71k likes)
- 🏆 Open LLM Leaderboard: Track, rank and evaluate open LLMs and chatbots (13.8k likes)
- 🥇 MTEB Leaderboard: Embedding Leaderboard (7.01k likes)
- 🏆 LLM-Perf Leaderboard: Compare and find the best LLM performance on different hardware (582 likes)
AI x Audio
Hallucination Detection
- vectara/hallucination_evaluation_model • Text Classification • 0.1B • Updated • 154k • 339
- notrichardren/HaluEval • Viewer • Updated • 35k • 23
- TRUE: Re-evaluating Factual Consistency Evaluation • Paper • 2204.04991 • Published • 1
- Fine-grained Hallucination Detection and Editing for Language Models • Paper • 2401.06855 • Published • 4
Small, but mighty chat models
Awesome Spaces
- 🏆 StableDesign: Transform empty room images into designed spaces using text prompts (116 likes)
- 👁 IllusionDiffusion: Generate stunning high quality illusion artwork (5.36k likes)
- 📚 InstantMesh: Create a 3D model from an image in 10 seconds! (1.57k likes)
- 🔥 Sing an idea ➡️ Music: Bring song ideas to life (184 likes)
LLM as a Judge
Curated resources that support the use of LLMs to serve as automatic evaluators of other LLM outputs.
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges • Paper • 2310.17631 • Published • 35
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models • Paper • 2310.08491 • Published • 56
- Generative Judge for Evaluating Alignment • Paper • 2310.05470 • Published • 1
- Calibrating LLM-Based Evaluator • Paper • 2309.13308 • Published • 12