Quantization

🤗 Optimum provides an optimum.onnxruntime package that enables you to apply quantization to many models hosted on the 🤗 Hub using the ONNX Runtime quantization tool.

Creating an `ORTQuantizer`

The ORTQuantizer class is used to quantize your ONNX model. The class can be initialized using the from_pretrained() method, which supports different checkpoint formats.

Using an already initialized ORTModelForXXX class.

>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification

# Loading ONNX Model from the Hub
>>> ort_model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")

# Create a quantizer from an ORTModelForXXX
>>> quantizer = ORTQuantizer.from_pretrained(ort_model)

# Configuration
>>> ...

# Quantize the model
>>> quantizer.quantize(...)

Using a local ONNX model from a directory.

>>> from optimum.onnxruntime import ORTQuantizer

# This assumes a model.onnx exists in path/to/model
>>> quantizer = ORTQuantizer.from_pretrained("path/to/model")

# Configuration
>>> ...

# Quantize the model
>>> quantizer.quantize(...)

Dynamic Quantization example

The ORTQuantizer class can be used to dynamically quantize your ONNX model. Below you will find an easy end-to-end example on how to dynamically quantize distilbert-base-uncased-finetuned-sst-2-english.

>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig

>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Load PyTorch model and convert to ONNX
>>> onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)

# Create quantizer
>>> quantizer = ORTQuantizer.from_pretrained(onnx_model)

# Define the quantization strategy by creating the appropriate configuration 
>>> dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Quantize the model
>>> model_quantized_path = quantizer.quantize(
    save_dir="path/to/output/model",
    quantization_config=dqconfig,
)
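
To make the idea behind dynamic quantization more concrete: weights are quantized ahead of time, while the quantization parameters of activations are computed at inference time from the values actually observed. The following is a minimal conceptual sketch in pure Python, assuming 8-bit affine (scale and zero point) quantization. The helper names and the activation values are made up for illustration, and this is not how ONNX Runtime implements the operation internally.

```python
from typing import List, Tuple


def quantize_uint8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine-quantize floats to uint8, deriving scale and zero point from the data itself."""
    # The range is extended to include 0.0 so that zero stays exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when every value is zero
    zero_point = round(-lo / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Map uint8 values back to approximate floats."""
    return [(q - zero_point) * scale for q in quantized]


activations = [-1.0, 0.0, 0.5, 3.0]  # made-up activation values
q, scale, zero_point = quantize_uint8(activations)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most one quantization step (`scale`), which is the precision traded for the smaller integer representation.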

Static Quantization example

The ORTQuantizer class can be used to statically quantize your ONNX model. Below you will find an easy end-to-end example on how to statically quantize distilbert-base-uncased-finetuned-sst-2-english.

>>> from functools import partial
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig

>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the PyTorch model, convert it to ONNX, then create the quantizer and set up the config
>>> onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> quantizer = ORTQuantizer.from_pretrained(onnx_model)
>>> qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)

# Create the calibration dataset
>>> def preprocess_fn(ex, tokenizer):
    return tokenizer(ex["sentence"])

>>> calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=50,
    dataset_split="train",
)
# Create the calibration configuration containing the parameters related to calibration.
>>> calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)

# Perform the calibration step: computes the activations quantization ranges
>>> ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)

# Apply static quantization on the model
>>> model_quantized_path = quantizer.quantize(
    save_dir="path/to/output/model",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
)
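
The calibration step above produces a mapping from node names to (min, max) ranges. As a rough illustration of what a min/max calibration method computes, the sketch below tracks the running minimum and maximum of each named tensor across calibration batches. It is a conceptual example in pure Python, not Optimum's implementation; the node names and values are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Ranges = Dict[str, Tuple[float, float]]


def minmax_ranges(batches: Iterable[Dict[str, List[float]]]) -> Ranges:
    """Track the running (min, max) of every named tensor across calibration batches."""
    ranges: Ranges = {}
    for batch in batches:
        for name, values in batch.items():
            lo, hi = min(values), max(values)
            if name in ranges:
                prev_lo, prev_hi = ranges[name]
                lo, hi = min(lo, prev_lo), max(hi, prev_hi)
            ranges[name] = (lo, hi)
    return ranges


# Made-up activations recorded for two hypothetical nodes over two batches.
recorded = [
    {"attention_out": [-0.5, 0.25], "ffn_out": [0.0, 2.0]},
    {"attention_out": [-1.5, 0.75], "ffn_out": [0.5, 1.0]},
]
ranges = minmax_ranges(recorded)
print(ranges)  # {'attention_out': (-1.5, 0.75), 'ffn_out': (0.0, 2.0)}
```

Because the ranges are fixed before inference, static quantization avoids recomputing them at runtime, at the cost of depending on how representative the calibration samples are.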

Quantize Seq2Seq models

The ORTQuantizer currently doesn’t support multi-file models, like ORTModelForSeq2SeqLM. If you want to quantize a Seq2Seq model, you have to quantize each of the model’s components individually using the ORTQuantizer class. Currently, only dynamic quantization is supported for Seq2Seq models.

Load seq2seq model as ORTModelForSeq2SeqLM.

>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSeq2SeqLM
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig

# load Seq2Seq model and set model file directory
>>> model_id = "optimum/t5-small"
>>> onnx_model = ORTModelForSeq2SeqLM.from_pretrained(model_id)
>>> model_dir = onnx_model.model_save_dir

Define a quantizer for the encoder, the decoder, and the decoder with past key values

# Create encoder quantizer
>>> encoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="encoder_model.onnx")

# Create decoder quantizer
>>> decoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_model.onnx")

# Create decoder with past key values quantizer
>>> decoder_wp_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_with_past_model.onnx")

# Create Quantizer list
>>> quantizer = [encoder_quantizer, decoder_quantizer, decoder_wp_quantizer]

Quantize all models

# Define the quantization strategy by creating the appropriate configuration 
>>> dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Quantize the model
>>> [q.quantize(save_dir=".", quantization_config=dqconfig) for q in quantizer]

ORTQuantizer

class optimum.onnxruntime.ORTQuantizer

( onnx_model_path: typing.List[pathlib.Path] )

Handles the ONNX Runtime quantization process for models shared on huggingface.co/models.

compute_ranges

( )

fit

( dataset: Dataset calibration_config: CalibrationConfig onnx_augmented_model_name: str = 'augmented_model.onnx' operators_to_quantize: typing.Optional[typing.List[str]] = None batch_size: int = 1 use_external_data_format: bool = False use_gpu: bool = False force_symmetric_range: bool = False )

Parameters

dataset (Dataset) — The dataset to use when performing the calibration step.
calibration_config (CalibrationConfig) — The configuration containing the parameters related to the calibration step.
onnx_augmented_model_name (Union[str, os.PathLike]) — The path used to save the augmented model used to collect the quantization ranges.
operators_to_quantize (list, optional) — List of the operators types to quantize.
batch_size (int, defaults to 1) — The batch size to use when collecting the quantization ranges values.
use_external_data_format (bool, defaults to False) — Whether to use the external data format to store models whose size is >= 2GB.
use_gpu (bool, defaults to False) — Whether to use the GPU when collecting the quantization ranges values.
force_symmetric_range (bool, defaults to False) — Whether to make the quantization ranges symmetric.

Perform the calibration step and collect the quantization ranges.
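
As an illustration of the force_symmetric_range option, one common way to make a quantization range symmetric is to widen it to the larger of the two absolute bounds, so that it is centered on zero. The sketch below assumes that definition; the source does not spell out the exact rule used here, and the helper name is hypothetical.

```python
from typing import Tuple


def make_symmetric(value_range: Tuple[float, float]) -> Tuple[float, float]:
    """Widen (lo, hi) to (-bound, bound), where bound is the larger absolute value."""
    lo, hi = value_range
    bound = max(abs(lo), abs(hi))
    return (-bound, bound)


print(make_symmetric((-0.5, 2.0)))  # (-2.0, 2.0)
```

A symmetric range keeps the floating-point zero at the center of the integer grid, but it may leave part of that grid unused when the observed values are skewed to one side.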

from_pretrained

( model_or_path: typing.Union[str, pathlib.Path] file_name: typing.Optional[str] = None )

Parameters

model_or_path (Union[str, Path]) — Can be either:
- A path to a saved exported ONNX Intermediate Representation (IR) model, e.g., `./my_model_directory/`.
- Or an ORTModelForXX class, e.g., ORTModelForQuestionAnswering.
file_name (Union[str, List[str]], optional) — Overwrites the default model file name from “model.onnx” to file_name. This allows you to load different model files from the same repository or directory.

Instantiate an ORTQuantizer from a saved ONNX model or an ORTModelForXXX instance.

get_calibration_dataset

( dataset_name: str num_samples: int = 100 dataset_config_name: typing.Optional[str] = None dataset_split: typing.Optional[str] = None preprocess_function: typing.Optional[typing.Callable] = None preprocess_batch: bool = True seed: int = 2016 use_auth_token: bool = False )

Parameters

dataset_name (str) — The dataset repository name on the Hugging Face Hub or path to a local directory containing data files to load to use for the calibration step.
num_samples (int, defaults to 100) — The maximum number of samples composing the calibration dataset.
dataset_config_name (str, optional) — The name of the dataset configuration.
dataset_split (str, optional) — Which split of the dataset to use to perform the calibration step.
preprocess_function (Callable, optional) — Processing function to apply to each example after loading dataset.
preprocess_batch (bool, defaults to True) — Whether the preprocess_function should be batched.
seed (int, defaults to 2016) — The random seed to use when shuffling the calibration dataset.
use_auth_token (bool, defaults to False) — Whether to use the token generated when running transformers-cli login (necessary for some datasets like ImageNet).

Create the calibration dataset (a datasets.Dataset) to use for the post-training static quantization calibration step.
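
The interaction between num_samples and seed can be pictured as a seeded shuffle followed by taking at most num_samples examples. The sketch below shows that idea on a plain Python list. It is an assumption about the general approach rather than the library's code, and the helper name is hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_calibration_examples(
    examples: Sequence[T], num_samples: int = 100, seed: int = 2016
) -> List[T]:
    """Shuffle deterministically with `seed`, then keep at most `num_samples` items."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[: min(num_samples, len(shuffled))]


subset = sample_calibration_examples(range(1000), num_samples=50)
```

Using a fixed seed means the same calibration subset is selected on every run, which keeps the computed quantization ranges reproducible.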

partial_fit

Parameters

dataset (Dataset) — The dataset to use when performing the calibration step.
calibration_config (CalibrationConfig) — The configuration containing the parameters related to the calibration step.
onnx_augmented_model_name (Union[str, os.PathLike]) — The path used to save the augmented model used to collect the quantization ranges.
operators_to_quantize (list, optional) — List of the operators types to quantize.
batch_size (int, defaults to 1) — The batch size to use when collecting the quantization ranges values.
use_external_data_format (bool, defaults to False) — Whether to use the external data format to store models whose size is >= 2GB.
use_gpu (bool, defaults to False) — Whether to use the GPU when collecting the quantization ranges values.
force_symmetric_range (bool, defaults to False) — Whether to make the quantization ranges symmetric.

Perform the calibration step and collect the quantization ranges.

quantize

( quantization_config: QuantizationConfig save_dir: typing.Union[str, pathlib.Path] file_suffix: typing.Optional[str] = 'quantized' calibration_tensors_range: typing.Union[typing.Dict[str, typing.Tuple[float, float]], NoneType] = None use_external_data_format: bool = False preprocessor: typing.Optional[optimum.onnxruntime.preprocessors.quantization.QuantizationPreprocessor] = None )

Parameters

quantization_config (QuantizationConfig) — The configuration containing the parameters related to quantization.
save_dir (Union[str, Path]) — The directory where the quantized model should be saved.
file_suffix (str, optional, defaults to "quantized") — The file_suffix used to save the quantized model.
calibration_tensors_range (Dict[NodeName, Tuple[float, float]], optional) — The dictionary mapping the nodes name to their quantization ranges, used and required only when applying static quantization.
use_external_data_format (bool, defaults to False) — Whether to use the external data format to store models whose size is >= 2GB.
preprocessor (QuantizationPreprocessor, optional) — The preprocessor to use to collect the nodes to include or exclude from quantization.

Quantize a model given the optimization specifications defined in quantization_config.
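
To illustrate how save_dir and file_suffix might combine into an output file name, the sketch below assumes the suffix is appended to the original file stem with an underscore, so that model.onnx becomes model_quantized.onnx. This naming rule is an assumption and the helper is hypothetical; check the contents of save_dir after quantization to confirm the actual file names.

```python
from pathlib import Path
from typing import Union


def quantized_model_path(
    onnx_model_path: Union[str, Path],
    save_dir: Union[str, Path],
    file_suffix: str = "quantized",
) -> Path:
    """Build the assumed output path: <save_dir>/<original stem>_<file_suffix>.onnx."""
    suffix = f"_{file_suffix}" if file_suffix else ""
    return Path(save_dir) / f"{Path(onnx_model_path).stem}{suffix}.onnx"


print(quantized_model_path("model.onnx", "path/to/output/model").name)  # model_quantized.onnx
```

Under the same assumption, quantizing the three Seq2Seq files into one directory would yield three distinct outputs, since each file keeps its own stem.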
