---
title: "Unitxt Metric"
emoji: 📈
colorFrom: pink
colorTo: purple
sdk: static
app_file: README.md
pinned: false
---
#
[](https://pypi.org/project/unitxt/)



[](https://coveralls.io/github/IBM/unitxt)

[](https://pepy.tech/project/unitxt)
### 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking
#
## Why Unitxt?
- 🌐 **Comprehensive**: Evaluate text, tables, vision, speech, and code in one unified framework
- 💼 **Enterprise-Ready**: Battle-tested components with extensive catalog of benchmarks
- 🧠 **Model Agnostic**: Works with HuggingFace, OpenAI, WatsonX, and custom models
- 🔒 **Reproducible**: Shareable, modular components ensure consistent results
## Quick Links
- 📖 [Documentation](https://www.unitxt.ai)
- 🚀 [Getting Started](https://www.unitxt.ai)
- 📁 [Browse Catalog](https://www.unitxt.ai/en/latest/catalog/catalog.__dir__.html)
# Installation
```bash
pip install unitxt
```
# Quick Start
## Command Line Evaluation
```bash
# Simple evaluation
unitxt-evaluate \
--tasks "card=cards.mmlu_pro.engineering" \
--model cross_provider \
--model_args "model_name=llama-3-1-8b-instruct" \
--limit 10
# Multi-task evaluation
unitxt-evaluate \
--tasks "card=cards.text2sql.bird+card=cards.mmlu_pro.engineering" \
--model cross_provider \
--model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
--split test \
--limit 10 \
--output_path ./results/evaluate_cli \
--log_samples \
--apply_chat_template
# Benchmark evaluation
unitxt-evaluate \
--tasks "benchmarks.tool_calling" \
--model cross_provider \
--model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
--split test \
--limit 10 \
--output_path ./results/evaluate_cli \
--log_samples \
--apply_chat_template
```
## Loading as Dataset
Load thousands of datasets in chat API format, ready for any model:
```python
from unitxt import load_dataset
dataset = load_dataset(
card="cards.gpqa.diamond",
split="test",
format="formats.chat_api",
)
```
## 📊 Available on The Catalog





## 🚀 Interactive Dashboard
Launch the graphical user interface to explore datasets and benchmarks:
```
pip install unitxt[ui]
unitxt-explore
```
# Complete Python Example
Evaluate your own data with any model:
```python
# Import required components
from unitxt import evaluate, create_dataset
from unitxt.blocks import Task, InputOutputTemplate
from unitxt.inference import HFAutoModelInferenceEngine
# Question-answer dataset
data = [
{"question": "What is the capital of Texas?", "answer": "Austin"},
{"question": "What is the color of the sky?", "answer": "Blue"},
]
# Define the task and evaluation metric
task = Task(
input_fields={"question": str},
reference_fields={"answer": str},
prediction_type=str,
metrics=["metrics.accuracy"],
)
# Create a template to format inputs and outputs
template = InputOutputTemplate(
instruction="Answer the following question.",
input_format="{question}",
output_format="{answer}",
postprocessors=["processors.lower_case"],
)
# Prepare the dataset
dataset = create_dataset(
task=task,
template=template,
format="formats.chat_api",
test_set=data,
split="test",
)
# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
model = HFAutoModelInferenceEngine(
model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)
# Generate predictions and evaluate
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)
# Print results
print("Global Results:\n", results.global_scores.summary)
print("Instance Results:\n", results.instance_scores.summary)
```
# Contributing
Read the [contributing guide](./CONTRIBUTING.md) for details on how to contribute to Unitxt.
#
# Citation
If you use Unitxt in your research, please cite our paper:
```bib
@inproceedings{bandel-etal-2024-unitxt,
title = "Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative {AI}",
author = "Bandel, Elron and
Perlitz, Yotam and
Venezian, Elad and
Friedman, Roni and
Arviv, Ofir and
Orbach, Matan and
Don-Yehiya, Shachar and
Sheinwald, Dafna and
Gera, Ariel and
Choshen, Leshem and
Shmueli-Scheuer, Michal and
Katz, Yoav",
editor = "Chang, Kai-Wei and
Lee, Annie and
Rajani, Nazneen",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-demo.21",
pages = "207--215",
}
```