Quantization
🤗 Optimum provides an optimum.onnxruntime package that enables you to apply quantization to many models hosted on the 🤗 Hub using the ONNX Runtime quantization tool.
Creating an ORTQuantizer
The ORTQuantizer class is used to quantize your ONNX model. The class can be initialized using the from_pretrained() method, which supports different checkpoint formats.
- Using an already initialized ORTModelForXXX class.
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
# Loading ONNX Model from the Hub
>>> ort_model = ORTModelForSequenceClassification.from_pretrained("optimum/distilbert-base-uncased-finetuned-sst-2-english")
# Create a quantizer from an ORTModelForXXX
>>> quantizer = ORTQuantizer.from_pretrained(ort_model)
# Configuration
>>> ...
# Quantize the model
>>> quantizer.quantize(...)
- Using a local ONNX model from a directory.
>>> from optimum.onnxruntime import ORTQuantizer
# This assumes a model.onnx exists in path/to/model
>>> quantizer = ORTQuantizer.from_pretrained("path/to/model")
# Configuration
>>> ...
# Quantize the model
>>> quantizer.quantize(...)
Dynamic Quantization example
The ORTQuantizer class can be used to dynamically quantize your ONNX model. Below is an end-to-end example of how to dynamically quantize distilbert-base-uncased-finetuned-sst-2-english.
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Load PyTorch model and convert to ONNX
>>> onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
# Create quantizer
>>> quantizer = ORTQuantizer.from_pretrained(onnx_model)
# Define the quantization strategy by creating the appropriate configuration
>>> dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# Quantize the model
>>> model_quantized_path = quantizer.quantize(
...     save_dir="path/to/output/model",
...     quantization_config=dqconfig,
... )
Static Quantization example
The ORTQuantizer class can be used to statically quantize your ONNX model. Below is an end-to-end example of how to statically quantize distilbert-base-uncased-finetuned-sst-2-english.
>>> from functools import partial
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the PyTorch model, convert it to ONNX, then create the quantizer and set up the config
>>> onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> quantizer = ORTQuantizer.from_pretrained(onnx_model)
>>> qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)
# Create the calibration dataset
>>> def preprocess_fn(ex, tokenizer):
...     return tokenizer(ex["sentence"])
>>> calibration_dataset = quantizer.get_calibration_dataset(
...     "glue",
...     dataset_config_name="sst2",
...     preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
...     num_samples=50,
...     dataset_split="train",
... )
# Create the calibration configuration containing the parameters related to calibration.
>>> calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
# Perform the calibration step: compute the quantization ranges of the activations
>>> ranges = quantizer.fit(
...     dataset=calibration_dataset,
...     calibration_config=calibration_config,
...     operators_to_quantize=qconfig.operators_to_quantize,
... )
# Apply static quantization on the model
>>> model_quantized_path = quantizer.quantize(
...     save_dir="path/to/output/model",
...     calibration_tensors_range=ranges,
...     quantization_config=qconfig,
... )
Quantize Seq2Seq models
The ORTQuantizer class currently doesn't support multi-file models, such as ORTModelForSeq2SeqLM. To quantize a Seq2Seq model, you have to quantize each of the model's components individually using the ORTQuantizer class. Currently, only dynamic quantization is supported for Seq2Seq models.
- Load the Seq2Seq model as an ORTModelForSeq2SeqLM.
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSeq2SeqLM
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
# load Seq2Seq model and set model file directory
>>> model_id = "optimum/t5-small"
>>> onnx_model = ORTModelForSeq2SeqLM.from_pretrained(model_id)
>>> model_dir = onnx_model.model_save_dir
- Define a quantizer for the encoder, the decoder and the decoder with past key values
# Create encoder quantizer
>>> encoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="encoder_model.onnx")
# Create decoder quantizer
>>> decoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_model.onnx")
# Create decoder with past key values quantizer
>>> decoder_wp_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_with_past_model.onnx")
# Create the list of quantizers
>>> quantizer = [encoder_quantizer, decoder_quantizer, decoder_wp_quantizer]
- Quantize all models
# Define the quantization strategy by creating the appropriate configuration
>>> dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# Quantize each model component
>>> [q.quantize(save_dir=".", quantization_config=dqconfig) for q in quantizer]
ORTQuantizer
Handles the ONNX Runtime quantization process for models shared on huggingface.co/models.
fit
( dataset: Dataset, calibration_config: CalibrationConfig, onnx_augmented_model_name: str = 'augmented_model.onnx', operators_to_quantize: typing.Optional[typing.List[str]] = None, batch_size: int = 1, use_external_data_format: bool = False, use_gpu: bool = False, force_symmetric_range: bool = False )
Parameters
- dataset (Dataset) — The dataset to use when performing the calibration step.
- calibration_config (CalibrationConfig) — The configuration containing the parameters related to the calibration step.
- onnx_augmented_model_name (Union[str, os.PathLike]) — The path used to save the augmented model used to collect the quantization ranges.
- operators_to_quantize (list, optional) — List of the operator types to quantize.
- batch_size (int, defaults to 1) — The batch size to use when collecting the quantization range values.
- use_external_data_format (bool, defaults to False) — Whether to use the external data format to store models whose size is >= 2GB.
- use_gpu (bool, defaults to False) — Whether to use the GPU when collecting the quantization range values.
- force_symmetric_range (bool, defaults to False) — Whether to make the quantization ranges symmetric.
Perform the calibration step and collect the quantization ranges.
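To make the idea of "collecting quantization ranges" concrete, here is a minimal pure-Python sketch of min-max calibration. It is not the ONNX Runtime implementation: the function name minmax_ranges and the toy activation values are assumptions made for illustration, and the symmetric option only mimics the general idea behind force_symmetric_range.

```python
from typing import Dict, Iterable, List, Tuple


def minmax_ranges(
    batches: Iterable[Dict[str, List[float]]],
    force_symmetric_range: bool = False,
) -> Dict[str, Tuple[float, float]]:
    """Track the smallest and largest value observed for each named tensor."""
    ranges: Dict[str, Tuple[float, float]] = {}
    for batch in batches:
        for name, values in batch.items():
            lo, hi = min(values), max(values)
            if name in ranges:
                # Widen the stored range with the values of the new batch.
                lo, hi = min(lo, ranges[name][0]), max(hi, ranges[name][1])
            ranges[name] = (lo, hi)
    if force_symmetric_range:
        # Centre every range on zero: (-m, m) with m = max(|lo|, |hi|).
        ranges = {
            name: (-max(abs(lo), abs(hi)), max(abs(lo), abs(hi)))
            for name, (lo, hi) in ranges.items()
        }
    return ranges


# Toy activations recorded for one tensor over two calibration batches.
activations = [
    {"hidden": [-0.5, 0.25, 1.0]},
    {"hidden": [-2.0, 0.0, 0.75]},
]
ranges = minmax_ranges(activations)
symmetric = minmax_ranges(activations, force_symmetric_range=True)
```

The resulting dictionary maps a tensor name to a (min, max) pair, which is the same shape as the calibration_tensors_range argument expected by quantize.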
from_pretrained
( model_or_path: typing.Union[str, pathlib.Path], file_name: typing.Optional[str] = None )
Parameters
- model_or_path (Union[str, Path]) — Can be either:
  - A path to a saved exported ONNX Intermediate Representation (IR) model, e.g., ./my_model_directory/.
  - Or an ORTModelForXX class, e.g., ORTModelForQuestionAnswering.
- file_name (Union[str, List[str]], optional) — Overwrites the default model file name from "model.onnx" to file_name. This allows you to load different model files from the same repository or directory.
Instantiate an ORTQuantizer from a saved ONNX model or from an ORTModelForXXX instance.
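The effect of file_name can be pictured with a small stand-alone sketch. The helper resolve_onnx_file below is hypothetical and is not part of Optimum; it only illustrates the rule stated above, namely that "model.onnx" is looked up in the directory unless file_name is given.

```python
import tempfile
from pathlib import Path
from typing import Optional

DEFAULT_ONNX_FILE_NAME = "model.onnx"  # default named in the parameter description


def resolve_onnx_file(model_dir: str, file_name: Optional[str] = None) -> Path:
    """Hypothetical helper: pick the ONNX file to load from a model directory."""
    path = Path(model_dir) / (file_name or DEFAULT_ONNX_FILE_NAME)
    if not path.is_file():
        raise FileNotFoundError(f"No ONNX file found at {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Create empty placeholder files standing in for exported models.
    for name in ("model.onnx", "encoder_model.onnx"):
        (Path(tmp) / name).touch()
    default_name = resolve_onnx_file(tmp).name
    encoder_name = resolve_onnx_file(tmp, file_name="encoder_model.onnx").name
```

This is why multi-file models can be handled one component at a time: each quantizer is pointed at a different file in the same directory.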
get_calibration_dataset
( dataset_name: str, num_samples: int = 100, dataset_config_name: typing.Optional[str] = None, dataset_split: typing.Optional[str] = None, preprocess_function: typing.Optional[typing.Callable] = None, preprocess_batch: bool = True, seed: int = 2016, use_auth_token: bool = False )
Parameters
- dataset_name (str) — The dataset repository name on the Hugging Face Hub, or the path to a local directory containing the data files to load for the calibration step.
- num_samples (int, defaults to 100) — The maximum number of samples composing the calibration dataset.
- dataset_config_name (str, optional) — The name of the dataset configuration.
- dataset_split (str, optional) — Which split of the dataset to use to perform the calibration step.
- preprocess_function (Callable, optional) — Processing function to apply to each example after loading the dataset.
- preprocess_batch (bool, defaults to True) — Whether the preprocess_function should be batched.
- seed (int, defaults to 2016) — The random seed to use when shuffling the calibration dataset.
- use_auth_token (bool, defaults to False) — Whether to use the token generated when running transformers-cli login (necessary for some datasets like ImageNet).
Create the calibration datasets.Dataset to use for the post-training static quantization calibration step.
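The parameters above suggest a shuffle, subsample, preprocess pipeline. The sketch below reproduces that idea on a plain Python list so it can run without downloading anything. The function build_calibration_set and the toy tokenizer are assumptions for illustration only; batching (preprocess_batch) is left out and each example is processed individually.

```python
import random
from typing import Callable, Dict, List, Optional


def build_calibration_set(
    examples: List[Dict[str, str]],
    num_samples: int = 100,
    preprocess_function: Optional[Callable[[Dict[str, str]], Dict]] = None,
    seed: int = 2016,
) -> List[Dict]:
    """Shuffle with a fixed seed, keep at most num_samples, then preprocess."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # same seed, same order
    subset = shuffled[:num_samples]
    if preprocess_function is None:
        return subset
    return [preprocess_function(example) for example in subset]


def toy_tokenize(example: Dict[str, str]) -> Dict[str, List[int]]:
    # Stand-in for a real tokenizer: encode each word by its length.
    return {"input_ids": [len(word) for word in example["sentence"].split()]}


corpus = [{"sentence": f"sample sentence number {i}"} for i in range(20)]
calibration_set = build_calibration_set(corpus, num_samples=5, preprocess_function=toy_tokenize)
again = build_calibration_set(corpus, num_samples=5, preprocess_function=toy_tokenize)
```

Fixing the seed makes the selected subset reproducible, so two calibration runs see the same samples.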
partial_fit
( dataset: Dataset, calibration_config: CalibrationConfig, onnx_augmented_model_name: str = 'augmented_model.onnx', operators_to_quantize: typing.Optional[typing.List[str]] = None, batch_size: int = 1, use_external_data_format: bool = False, use_gpu: bool = False, force_symmetric_range: bool = False )
Parameters
- dataset (Dataset) — The dataset to use when performing the calibration step.
- calibration_config (CalibrationConfig) — The configuration containing the parameters related to the calibration step.
- onnx_augmented_model_name (Union[str, os.PathLike]) — The path used to save the augmented model used to collect the quantization ranges.
- operators_to_quantize (list, optional) — List of the operator types to quantize.
- batch_size (int, defaults to 1) — The batch size to use when collecting the quantization range values.
- use_external_data_format (bool, defaults to False) — Whether to use the external data format to store models whose size is >= 2GB.
- use_gpu (bool, defaults to False) — Whether to use the GPU when collecting the quantization range values.
- force_symmetric_range (bool, defaults to False) — Whether to make the quantization ranges symmetric.
Perform the calibration step and collect the quantization ranges.
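The name partial_fit suggests that calibration data can be fed in several chunks rather than all at once. Assuming that reading, the sketch below shows why chunked collection works for min-max statistics: a running minimum and maximum give the same result however the samples are split. The class RunningMinMax is hypothetical and is not an Optimum or ONNX Runtime API.

```python
from typing import Dict, List, Tuple


class RunningMinMax:
    """Hypothetical helper: keeps per-tensor ranges across several collection calls."""

    def __init__(self) -> None:
        self._ranges: Dict[str, Tuple[float, float]] = {}

    def collect(self, chunk: List[Dict[str, List[float]]]) -> None:
        # Update the stored ranges with one chunk of calibration samples.
        for sample in chunk:
            for name, values in sample.items():
                lo, hi = min(values), max(values)
                if name in self._ranges:
                    old_lo, old_hi = self._ranges[name]
                    lo, hi = min(lo, old_lo), max(hi, old_hi)
                self._ranges[name] = (lo, hi)

    def ranges(self) -> Dict[str, Tuple[float, float]]:
        return dict(self._ranges)


samples = [
    {"logits": [0.1, 0.9]},
    {"logits": [-1.5, 0.3]},
    {"logits": [0.0, 2.5]},
    {"logits": [-0.2, 0.4]},
]

# Feed the samples two at a time, as one would with a dataset too large for one pass.
chunked = RunningMinMax()
for start in range(0, len(samples), 2):
    chunked.collect(samples[start:start + 2])

# Feed everything in one call for comparison.
single_pass = RunningMinMax()
single_pass.collect(samples)
```

Other calibration methods may need more state than a running minimum and maximum, so this equivalence should not be assumed for every calibration configuration.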
quantize
( quantization_config: QuantizationConfig, save_dir: typing.Union[str, pathlib.Path], file_suffix: typing.Optional[str] = 'quantized', calibration_tensors_range: typing.Union[typing.Dict[str, typing.Tuple[float, float]], NoneType] = None, use_external_data_format: bool = False, preprocessor: typing.Optional[optimum.onnxruntime.preprocessors.quantization.QuantizationPreprocessor] = None )
Parameters
- quantization_config (QuantizationConfig) — The configuration containing the parameters related to quantization.
- save_dir (Union[str, Path]) — The directory where the quantized model should be saved.
- file_suffix (str, optional, defaults to "quantized") — The suffix used in the file name of the saved quantized model.
- calibration_tensors_range (Dict[NodeName, Tuple[float, float]], optional) — The dictionary mapping the node names to their quantization ranges, used and required only when applying static quantization.
- use_external_data_format (bool, defaults to False) — Whether to use the external data format to store models whose size is >= 2GB.
- preprocessor (QuantizationPreprocessor, optional) — The preprocessor to use to collect the nodes to include or exclude from quantization.
Quantize a model given the optimization specifications defined in quantization_config.
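To show how a (min, max) range such as those in calibration_tensors_range can turn floats into 8-bit integers, here is a minimal sketch of generic affine int8 quantization. It illustrates the arithmetic only and is not ONNX Runtime's implementation: the function names are made up for this example, and the actual integer type, per-channel handling and rounding depend on the quantization configuration.

```python
from typing import List, Tuple


def compute_qparams(rmin: float, rmax: float, qmin: int = -128, qmax: int = 127) -> Tuple[float, int]:
    """Derive a scale and zero point mapping [rmin, rmax] onto [qmin, qmax]."""
    # The range is extended to contain zero so that 0.0 maps to an exact integer.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = round(qmin - rmin / scale)
    return scale, zero_point


def quantize_values(values: List[float], scale: float, zero_point: int,
                    qmin: int = -128, qmax: int = 127) -> List[int]:
    # Round to the nearest integer step, then clamp to the representable range.
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize_values(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    return [(q - zero_point) * scale for q in quantized]


# Pretend calibration produced the range (-1.0, 3.0) for some tensor.
scale, zero_point = compute_qparams(-1.0, 3.0)
quantized = quantize_values([-1.0, 0.0, 3.0, 10.0], scale, zero_point)
restored = dequantize_values(quantized, scale, zero_point)
```

Values outside the calibrated range, like 10.0 here, saturate at the largest representable integer, which is why the calibration data should resemble the inputs seen at inference time.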