# Canary-Qwen-2.5B CoreML INT8
This repository contains a community INT8 CoreML conversion of the public NVIDIA Canary-Qwen-2.5B speech recognition model for Apple Silicon workflows.
This is not an official NVIDIA release. The conversion is derived from the public base model and keeps the model split into CoreML components:
- `encoder_int8.mlpackage`: INT8 speech encoder
- `projection.mlpackage`: FP16 projection from encoder space to Qwen embedding space
- `canary_decoder_stateful_int8.mlpackage`: INT8 stateful autoregressive decoder with KV cache
- `canary_lm_head_int8.mlpackage`: INT8 LM head that maps decoder hidden states to logits
## Base Model
- Original model: nvidia/canary-qwen-2.5b
- Original license: CC-BY-4.0
- Original architecture: FastConformer encoder + Qwen decoder with projection and LoRA adaptation

Please review and comply with the original model card and license terms.
## Included Artifacts
| File | Precision | Purpose |
|---|---|---|
| `encoder_int8.mlpackage` | INT8 | Speech encoder |
| `projection.mlpackage` | FP16 | Encoder-to-LLM projection |
| `canary_decoder_stateful_int8.mlpackage` | INT8 | Stateful decoder with KV cache |
| `canary_lm_head_int8.mlpackage` | INT8 | Decoder hidden states to logits |
## Notes
- This repo contains model artifacts only.
- The decoder is separated from the LM head.
- The projection remains FP16 because it is tiny and not worth quantizing.
- The stateful decoder targets the CoreML stateful-model support introduced in macOS 15 / iOS 18.
- Long-audio chunking, prompt formatting, and transcript stitching live in the runtime layer and are not included here.
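Since the runtime layer is not included, the following numpy sketch shows how the four components chain during greedy decoding. All shapes, token ids, and stub functions are illustrative assumptions; each stub stands in for the corresponding package's `predict` call:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration; the real shapes come from
# each .mlpackage's input/output descriptions.
ENC_DIM, LLM_DIM, VOCAB = 32, 48, 100

# Random stand-ins for the four CoreML components.
W_enc = rng.standard_normal((80, ENC_DIM)) * 0.1
W_proj = rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.1
W_head = rng.standard_normal((LLM_DIM, VOCAB)) * 0.1
embed = rng.standard_normal((VOCAB, LLM_DIM)) * 0.1

def encoder(mel):            # encoder_int8.mlpackage
    return mel @ W_enc

def projection(enc_out):     # projection.mlpackage (FP16)
    return enc_out @ W_proj

def decoder_step(x, state):  # canary_decoder_stateful_int8.mlpackage
    # The real decoder keeps a KV cache in CoreML state; this stub
    # just folds each input into a running context vector.
    state["ctx"] += x.mean(axis=0)
    return np.tanh(state["ctx"])

def lm_head(h):              # canary_lm_head_int8.mlpackage
    return h @ W_head

def greedy_transcribe(mel, max_tokens=8, bos_id=1, eos_id=0):
    """Chain the components: features -> encoder -> projection, then
    autoregressive greedy decoding through decoder + LM head."""
    audio_embeds = projection(encoder(mel))
    state = {"ctx": np.zeros(LLM_DIM)}
    decoder_step(audio_embeds, state)  # prime state with audio context
    tokens, tok = [], bos_id
    for _ in range(max_tokens):
        h = decoder_step(embed[tok][None, :], state)
        tok = int(np.argmax(lm_head(h)))
        if tok == eos_id:
            break
        tokens.append(tok)
    return tokens

mel = rng.standard_normal((20, 80))  # 20 frames of 80-dim log-mel features
print(greedy_transcribe(mel))
```

The structure, not the stub math, is the point: the projection output is consumed once to prime the decoder state, after which each step feeds one token embedding through the stateful decoder and LM head.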
## Related Repos
- FP16 sibling release: phequals/canary-qwen-2.5b-coreml-fp16
- Original base model: nvidia/canary-qwen-2.5b