# Canary-Qwen-2.5B CoreML INT8
This repository contains a community INT8 CoreML conversion of the public NVIDIA Canary-Qwen-2.5B speech recognition model for Apple Silicon workflows.
This is not an official NVIDIA release. The conversion is derived from the public base model and keeps the model split into CoreML components:
- `encoder_int8.mlpackage`: INT8 speech encoder
- `projection.mlpackage`: FP16 projection from encoder space to Qwen embedding space
- `canary_decoder_stateful_int8.mlpackage`: INT8 stateful autoregressive decoder with KV cache
- `canary_lm_head_int8.mlpackage`: INT8 LM head that maps decoder hidden states to logits
## Base Model
- Original model: nvidia/canary-qwen-2.5b
- Original license: CC-BY-4.0
- Original architecture: FastConformer encoder + Qwen decoder with projection and LoRA adaptation

Please review and comply with the original model card and license terms.
## Included Artifacts
| File | Precision | Purpose |
|---|---|---|
| `encoder_int8.mlpackage` | INT8 | Speech encoder |
| `projection.mlpackage` | FP16 | Encoder-to-LLM projection |
| `canary_decoder_stateful_int8.mlpackage` | INT8 | Stateful decoder with KV cache |
| `canary_lm_head_int8.mlpackage` | INT8 | Decoder hidden states to logits |
## Notes
- This repo contains model artifacts only.
- The decoder is separated from the LM head.
- The projection remains FP16 because it is tiny and not worth quantizing.
- The stateful decoder targets the CoreML stateful-model support introduced in macOS 15 / iOS 18.
- Long-audio chunking, prompt formatting, and transcript stitching live in the runtime layer and are not included here.
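Since the runtime layer is not included, the following numpy sketch shows how the four components chain during greedy decoding. All shapes, token ids, and stub functions are illustrative assumptions; each stub stands in for the corresponding package's `predict` call:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration; the real shapes come from
# each .mlpackage's input/output descriptions.
ENC_DIM, LLM_DIM, VOCAB = 32, 48, 100

# Random stand-ins for the four CoreML components.
W_enc = rng.standard_normal((80, ENC_DIM)) * 0.1
W_proj = rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.1
W_head = rng.standard_normal((LLM_DIM, VOCAB)) * 0.1
embed = rng.standard_normal((VOCAB, LLM_DIM)) * 0.1

def encoder(mel):            # encoder_int8.mlpackage
    return mel @ W_enc

def projection(enc_out):     # projection.mlpackage (FP16)
    return enc_out @ W_proj

def decoder_step(x, state):  # canary_decoder_stateful_int8.mlpackage
    # The real decoder keeps a KV cache in CoreML state; this stub
    # just folds each input into a running context vector.
    state["ctx"] += x.mean(axis=0)
    return np.tanh(state["ctx"])

def lm_head(h):              # canary_lm_head_int8.mlpackage
    return h @ W_head

def greedy_transcribe(mel, max_tokens=8, bos_id=1, eos_id=0):
    """Chain the components: features -> encoder -> projection, then
    autoregressive greedy decoding through decoder + LM head."""
    audio_embeds = projection(encoder(mel))
    state = {"ctx": np.zeros(LLM_DIM)}
    decoder_step(audio_embeds, state)  # prime state with audio context
    tokens, tok = [], bos_id
    for _ in range(max_tokens):
        h = decoder_step(embed[tok][None, :], state)
        tok = int(np.argmax(lm_head(h)))
        if tok == eos_id:
            break
        tokens.append(tok)
    return tokens

mel = rng.standard_normal((20, 80))  # 20 frames of 80-dim log-mel features
print(greedy_transcribe(mel))
```

The structure, not the stub math, is the point: the projection output is consumed once to prime the decoder state, after which each step feeds one token embedding through the stateful decoder and LM head.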
## Related Repos
- FP16 sibling release: phequals/canary-qwen-2.5b-coreml-fp16
- Original base model: nvidia/canary-qwen-2.5b