Evaluate on the Hub

You can evaluate AI models on the Hub in multiple ways, and this page guides you through the different options:

  • Community Leaderboards bring together the best models for a given task or domain and make them accessible to everyone by ranking them.
  • Model Cards provide a comprehensive overview of a model’s capabilities from the author’s perspective.
  • Libraries and Packages give you the tools to evaluate your models on the Hub.

Community Leaderboards

Community leaderboards show how models perform on a given task or domain. For example, there are leaderboards for question answering, reasoning, classification, vision, and audio. If you’re tackling a new task, you can use a leaderboard to see how different models perform on it.

Here are some examples of community leaderboards:

| Leaderboard | Model Type | Description |
| --- | --- | --- |
| MTEB | Embedding | The Massive Text Embedding Benchmark leaderboard compares 100+ text and image embedding models across 1000+ languages. Refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help improve zero-shot annotations, or propose other changes to the leaderboard. |
| GAIA | Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). See the paper for more details. |
| OpenVLM Leaderboard | Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
| Open ASR Leaderboard | Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
| LLM-Perf Leaderboard | LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardware, backends and optimizations using Optimum-Benchmark. |

There are many more leaderboards on the Hub. Check out all the leaderboards via this search or use this dedicated Space to find a leaderboard for your task.
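If you prefer to browse programmatically, you can also list leaderboard Spaces with the huggingface_hub client. A minimal sketch, where the search string and result limit are illustrative values rather than an official query:

```python
from huggingface_hub import HfApi

api = HfApi()

# List Spaces on the Hub that mention "leaderboard"; the query and limit
# below are illustrative, not an exhaustive or official search.
for space in api.list_spaces(search="leaderboard", limit=10):
    print(space.id)
```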

Model Cards

Model cards provide an overview of a model’s capabilities, as evaluated by the model’s author or the community. They are a great way to understand a model’s strengths and limitations.


Unlike leaderboard results, model card evaluation scores are often reported by the author rather than by the community.

For details on how to report evaluation results in a model card, see the Model Card Evaluation Results metadata documentation.
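If you want to add structured results to your own model card, the huggingface_hub library can generate the metadata for you. A minimal sketch, where the model name, dataset, and score are placeholder values:

```python
from huggingface_hub import EvalResult, ModelCard, ModelCardData

# Describe one evaluation result; the model name, dataset, and score are
# placeholders for illustration only.
card_data = ModelCardData(
    model_name="my-text-classifier",  # hypothetical model name
    eval_results=[
        EvalResult(
            task_type="text-classification",
            dataset_type="imdb",
            dataset_name="IMDB",
            metric_type="accuracy",
            metric_value=0.91,  # placeholder score
        )
    ],
)

# Render a model card whose YAML header carries the structured results,
# which the Hub can display alongside the model.
card = ModelCard.from_template(card_data)
print(card.data.to_yaml())
# card.push_to_hub("your-username/my-text-classifier")  # hypothetical repo id
```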

Libraries and packages

There are a number of open-source libraries and packages that you can use to evaluate models on the Hub. These are useful if you want to evaluate a custom model or measure performance on a custom evaluation task.

LightEval

LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable, and it is currently the most actively maintained of the evaluation libraries popular on the Hugging Face Hub. Visit the LightEval repository for more information.

🤗 Evaluate

🤗 Evaluate is a library for easily evaluating machine learning models and datasets.

With a single line of code, you get access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!). Be it on your local machine or in a distributed training setup, you can evaluate your models in a consistent and reproducible way!
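For example, loading and computing a metric takes just a couple of lines (the predictions and references below are placeholder values):

```python
import evaluate

# Load the accuracy metric from the Hub and score some placeholder predictions.
accuracy = evaluate.load("accuracy")
results = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
print(results)  # {'accuracy': 0.75}
```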

Visit the 🤗 Evaluate organization for a full list of available metrics. Each metric has a dedicated Space with an interactive demo showing how to use it, and a documentation card detailing the metric’s limitations and usage.
