---
title: "Unitxt Metric"
emoji: 📈
colorFrom: pink
colorTo: purple
sdk: static
app_file: README.md
pinned: false
---
<div align="center">
<img src="https://www.unitxt.ai/en/latest/_static/banner.png" alt="Unitxt banner" width="100%" />
</div>
#
[![version](https://img.shields.io/pypi/v/unitxt)](https://pypi.org/project/unitxt/)
![license](https://img.shields.io/github/license/ibm/unitxt)
![python](https://img.shields.io/badge/python-3.8%20|%203.9-blue)
![tests](https://img.shields.io/github/actions/workflow/status/ibm/unitxt/library_tests.yml?branch=main&label=tests)
[![Coverage Status](https://coveralls.io/repos/github/IBM/unitxt/badge.svg)](https://coveralls.io/github/IBM/unitxt)
![Read the Docs](https://img.shields.io/readthedocs/unitxt)
[![downloads](https://static.pepy.tech/personalized-badge/unitxt?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/unitxt)
### 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking
#
## Why Unitxt?
- 🌐 **Comprehensive**: Evaluate text, tables, vision, speech, and code in one unified framework
- 💼 **Enterprise-Ready**: Battle-tested components with an extensive catalog of benchmarks
- 🧠 **Model Agnostic**: Works with HuggingFace, OpenAI, WatsonX, and custom models
- 🔒 **Reproducible**: Shareable, modular components ensure consistent results
## Quick Links
- 📖 [Documentation](https://www.unitxt.ai)
- 🚀 [Getting Started](https://www.unitxt.ai)
- 📁 [Browse Catalog](https://www.unitxt.ai/en/latest/catalog/catalog.__dir__.html)
# Installation
```bash
pip install unitxt
```
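A quick sanity check that the install succeeded (exposing `__version__` at the package root is an assumption based on recent releases):
```python
# Minimal install check; __version__ at the package root is assumed,
# based on recent unitxt releases.
import unitxt

print(unitxt.__version__)
```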
# Quick Start
## Command Line Evaluation
```bash
# Simple evaluation
unitxt-evaluate \
    --tasks "card=cards.mmlu_pro.engineering" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct" \
    --limit 10

# Multi-task evaluation
unitxt-evaluate \
    --tasks "card=cards.text2sql.bird+card=cards.mmlu_pro.engineering" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
    --split test \
    --limit 10 \
    --output_path ./results/evaluate_cli \
    --log_samples \
    --apply_chat_template

# Benchmark evaluation
unitxt-evaluate \
    --tasks "benchmarks.tool_calling" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
    --split test \
    --limit 10 \
    --output_path ./results/evaluate_cli \
    --log_samples \
    --apply_chat_template
```
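With `--output_path` and `--log_samples` set, the CLI writes scores and per-sample logs under that directory. A minimal sketch for peeking at whatever JSON it produced (the exact file names and layout are assumptions and may vary by version):
```python
# Sketch: scan the CLI output directory and peek at each JSON file's
# top-level structure. File names and layout are assumptions; they may
# differ across unitxt versions.
import json
from pathlib import Path

for path in Path("./results/evaluate_cli").glob("*.json"):
    with open(path) as f:
        payload = json.load(f)
    keys = list(payload) if isinstance(payload, dict) else f"{len(payload)} items"
    print(path.name, "->", keys)
```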
## Loading as Dataset
Load thousands of datasets in chat API format, ready for any model:
```python
from unitxt import load_dataset

dataset = load_dataset(
    card="cards.gpqa.diamond",
    split="test",
    format="formats.chat_api",
)
```
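Each instance in the returned dataset carries the chat-formatted model input alongside the gold answers. A quick peek at the first test example (the `source` and `references` field names follow Unitxt's usual instance schema; treat the exact layout as an assumption):
```python
# Inspect the first test instance; "source" holds the chat-formatted input
# and "references" the gold answers, per Unitxt's usual instance schema.
instance = dataset[0]
print(instance["source"])
print(instance["references"])
```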
## 📊 Available in the Catalog
![Tasks](https://img.shields.io/badge/Tasks-68-blue)
![Datasets](https://img.shields.io/badge/Datasets-3254-blue)
![Prompts](https://img.shields.io/badge/Prompts-357-blue)
![Benchmarks](https://img.shields.io/badge/Benchmarks-11-blue)
![Metrics](https://img.shields.io/badge/Metrics-584-blue)
## 🚀 Interactive Dashboard
Launch the graphical user interface to explore datasets and benchmarks:
```bash
pip install "unitxt[ui]"
unitxt-explore
```
# Complete Python Example
Evaluate your own data with any model:
```python
# Import required components
from unitxt import evaluate, create_dataset
from unitxt.blocks import Task, InputOutputTemplate
from unitxt.inference import HFAutoModelInferenceEngine

# Question-answer dataset
data = [
    {"question": "What is the capital of Texas?", "answer": "Austin"},
    {"question": "What is the color of the sky?", "answer": "Blue"},
]

# Define the task and evaluation metric
task = Task(
    input_fields={"question": str},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],
)

# Create a template to format inputs and outputs
template = InputOutputTemplate(
    instruction="Answer the following question.",
    input_format="{question}",
    output_format="{answer}",
    postprocessors=["processors.lower_case"],
)

# Prepare the dataset
dataset = create_dataset(
    task=task,
    template=template,
    format="formats.chat_api",
    test_set=data,
    split="test",
)

# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
model = HFAutoModelInferenceEngine(
    model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)

# Generate predictions and evaluate
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)

# Print results
print("Global Results:\n", results.global_scores.summary)
print("Instance Results:\n", results.instance_scores.summary)
```
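The same pipeline can target hosted providers by swapping the inference engine; everything else stays unchanged. A sketch using `CrossProviderInferenceEngine` (argument names follow the documented pattern, and credentials for the chosen provider are assumed to be configured in your environment):
```python
# Sketch: reuse the dataset and evaluate() from the example above, but route
# inference through a hosted provider. Provider credentials (e.g. WatsonX
# API keys) are assumed to be set in the environment.
from unitxt.inference import CrossProviderInferenceEngine

model = CrossProviderInferenceEngine(
    model="llama-3-1-8b-instruct", provider="watsonx"
)

predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)
```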
# Contributing
Read the [contributing guide](./CONTRIBUTING.md) for details on how to contribute to Unitxt.
#
# Citation
If you use Unitxt in your research, please cite our paper:
```bib
@inproceedings{bandel-etal-2024-unitxt,
    title = "Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative {AI}",
    author = "Bandel, Elron and
      Perlitz, Yotam and
      Venezian, Elad and
      Friedman, Roni and
      Arviv, Ofir and
      Orbach, Matan and
      Don-Yehiya, Shachar and
      Sheinwald, Dafna and
      Gera, Ariel and
      Choshen, Leshem and
      Shmueli-Scheuer, Michal and
      Katz, Yoav",
    editor = "Chang, Kai-Wei and
      Lee, Annie and
      Rajani, Nazneen",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-demo.21",
    pages = "207--215",
}
```