Eval Leaderboards - an andrewrreed Collection

Eval Leaderboards

updated Jun 17

Running

4.56k

LMArena Leaderboard

🏆

Display LMArena Leaderboard
Running on CPU Upgrade

13.4k

Open LLM Leaderboard

🏆

Track, rank and evaluate open LLMs and chatbots
Running on CPU Upgrade

6.16k

MTEB Leaderboard

🥇

Embedding Leaderboard
Running

546

LLM-Perf Leaderboard

🏆

Explore hardware performance for LLMs
Running on CPU Upgrade

996

Open ASR Leaderboard

🏆

Request evaluation for a speech model
Running

1.39k

Big Code Models Leaderboard

📈

Search and submit code models for evaluation
Running on CPU Upgrade

143

Hallucinations Leaderboard

🔥

View and submit LLM evaluations
Runtime error

105

Enterprise Scenarios Leaderboard

🥇
Running on CPU Upgrade

93

LLM Safety Leaderboard

🥇

View and submit machine learning model evaluations
Running

223

AI2 WildBench Leaderboard (V2)

🦁

Display and explore model leaderboards and chat history
Running

164

Open Object Detection Leaderboard

🏆

Request model evaluation on COCO val 2017 dataset
Runtime error

30

Contextual Leaderboard

🐨
Running

189

Yet Another LLM Leaderboard

🌖

Run a Streamlit web app
Running on CPU Upgrade

844

Open VLM Leaderboard

🌎

VLMEvalKit Evaluation Results Collection
Running

552

Vision Arena (Testing VLMs side-by-side)

🖼

Analyze images to detect and label objects
Running

36

Leaderboard

🐠
Runtime error

423

Open Medical-LLM Leaderboard

🥇

Browse and submit LLM evaluations
Running on CPU Upgrade

56

Open CoT Leaderboard

🥇

Track, rank and evaluate open LLMs' CoT quality
Running

23

MM-UPD Leaderboard

🥇

Submit and evaluate model results for the MM-AAD leaderboard
Running

218

BigCodeBench Leaderboard

🥇

Explore and analyze code evaluation data
Running

10

MJ Bench Leaderboard

🥇

Display and filter multimodal model leaderboard results
Running

390

Reward Bench Leaderboard

📐

Display and analyze reward model evaluation results
Running on CPU Upgrade

395

Agent Leaderboard

💬

Ranking of LLMs for agentic tasks
Running

95

Find a leaderboard

🔍

Explore and discover all leaderboards from the HF community

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Paper • 2506.11763 • Published Jun 13 • 69