Qwen3-TTS 12Hz 1.7B CustomVoice — CoreML conversion

CoreML conversion of the Audio Tokenizer from Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, comprising the Decoder, Encoder, and Speaker Encoder.

Converted with coremltools to run on the Apple Neural Engine (ANE) of Apple Silicon (M1 and later) for low-power, low-memory local TTS inference.

If you want to try this CoreML pipeline embedded in a finished app right away, you can use KeyVoice (a macOS voice-input app).

⚠️ This repository contains only the Audio Tokenizer (CoreML). To run the full TTS pipeline you must also load the LM weights from Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice.

File	Role	Size
`Qwen3TTSDecoder.mlpackage`	Audio Token → Mel Decoder	218 MB
`Qwen3TTSEncoder.mlpackage`	Reference audio PCM → Audio Token Encoder (Voice Cloning)	182 MB
`Qwen3TTSSpeakerEncoder.mlpackage`	Reference audio → Speaker Embedding (Voice Design)	46 MB

Model Details

Base model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Conversion: PyTorch → CoreML (coremltools 8.x)
Target compute units: .cpuAndNeuralEngine (ANE-optimized)
Sample rate: 24 kHz
Frame rate: 12.5 Hz (80 ms per frame)
License: Apache 2.0 (inherited)
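
The sample rate and frame rate together determine how many PCM samples each audio-token frame covers. A minimal Python sketch of that arithmetic follows; the constant and function names are illustrative and not part of the model's API.

```python
SAMPLE_RATE_HZ = 24_000   # sample rate listed above
FRAME_RATE_HZ = 12.5      # one audio-token frame every 80 ms

# 24000 / 12.5 = 1920 PCM samples per frame
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def frames_to_seconds(num_frames: int) -> float:
    """Duration in seconds covered by num_frames audio-token frames."""
    return num_frames / FRAME_RATE_HZ


def seconds_to_frames(seconds: float) -> int:
    """Number of whole frames that fit in the given duration."""
    # Work in integer samples to avoid floating-point edge cases.
    total_samples = int(round(seconds * SAMPLE_RATE_HZ))
    return total_samples // SAMPLES_PER_FRAME


print(SAMPLES_PER_FRAME)       # 1920
print(seconds_to_frames(1.2))  # 15
```

The same numbers explain the encoder's fixed input: 15 frames of 1920 samples each give 28800 samples, or 1.2 s.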

Input / Output Shapes

Qwen3TTSDecoder

Input: audio_tokens [1, 16, T_frames] (Int32) — 16 codebooks × T frames
Output: mel [1, 80, T_mel] (Float32)

Qwen3TTSEncoder

Input: audio [1, 1, 28800] (Float32) — fixed-length 1.2 s at 24 kHz (15 frames × 1920 samples)
Output: audio_tokens [1, 16, 15] (Int32)
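
Because the encoder input is fixed at 28800 samples, a reference clip has to be fitted to that length before inference. The sketch below shows one way to do it, assuming mono PCM at 24 kHz held in a plain Python list; `fit_reference_clip` is a hypothetical helper, not something shipped in this repository.

```python
ENCODER_SAMPLES = 15 * 1920  # 28800 samples = 1.2 s at 24 kHz


def fit_reference_clip(samples, target_len=ENCODER_SAMPLES):
    """Return exactly target_len samples: truncate long clips, zero-pad short ones."""
    clipped = list(samples[:target_len])            # keep only the first 1.2 s
    padding = [0.0] * (target_len - len(clipped))   # empty when the clip is long enough
    return clipped + padding


short_clip = fit_reference_clip([0.5] * 10_000)
long_clip = fit_reference_clip([0.5] * 50_000)
print(len(short_clip), short_clip[-1])  # 28800 0.0
print(len(long_clip), long_clip[-1])    # 28800 0.5
```

The resulting list still needs to be reshaped to [1, 1, 28800] and converted to Float32 before it is passed to the model.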

Qwen3TTSSpeakerEncoder

Input: mel [1, 80, T] (Float32)
Output: embedding [1, D] (Float32) — speaker vector for Voice Design

How to Use

Swift

import CoreML

// Compile the .mlpackage to .mlmodelc; this step is slow the first time it runs.
let url = URL(fileURLWithPath: "./Qwen3TTSDecoder.mlpackage")
let compiledURL = try await MLModel.compileModel(at: url)

// Restrict execution to the CPU and the Neural Engine (no GPU).
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
let model = try MLModel(contentsOf: compiledURL, configuration: config)

// Feed audio_tokens [1, 16, T_frames] (Int32) to obtain the mel output.

Python (inspection)

import coremltools as ct

# Load the package and print its description (input/output names, types, and shapes).
model = ct.models.MLModel("./Qwen3TTSDecoder.mlpackage")
print(model.get_spec().description)

Limitations

ANE arithmetic does not exactly match CPU/GPU, so floating-point outputs differ slightly between the CoreML and PyTorch versions (no audible impact in practice).
Shapes are fixed. The encoder assumes a 1.2-second reference clip — shorter inputs are zero-padded, longer inputs are truncated to the first 1.2 s. This is stricter than the original PyTorch model.
The first load compiles the .mlpackage to .mlmodelc, which takes about 30 seconds. Later loads can skip this step if the compiled .mlmodelc is reused.
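
To see how far the CoreML outputs drift from the PyTorch reference on your own inputs, you can compare the two numerically. This is a minimal sketch, assuming both outputs have already been flattened into equal-length Python lists of floats; `compare_outputs` is an illustrative name, and what counts as an acceptable error is left to you.

```python
import math


def compare_outputs(reference, candidate):
    """Return (max absolute error, signal-to-noise ratio in dB) for two flat float lists."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("outputs must be non-empty and have the same length")
    max_abs_err = max(abs(r - c) for r, c in zip(reference, candidate))
    signal = sum(r * r for r in reference)
    noise = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    if noise == 0:
        snr_db = math.inf      # outputs are identical
    elif signal == 0:
        snr_db = -math.inf     # all-zero reference, non-zero error
    else:
        snr_db = 10 * math.log10(signal / noise)
    return max_abs_err, snr_db


err, snr = compare_outputs([1.0, -1.0, 0.5, 0.25], [1.0, -1.0, 0.5, 0.5])
print(err)  # 0.25
```

A higher SNR means the CoreML output is closer to the reference; identical outputs give an error of 0.0 and an infinite SNR.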

Citation

@misc{qwen3-tts-12hz-1.7b-customvoice-coreml,
  author = {okayuji},
  title = {Qwen3-TTS 12Hz 1.7B CustomVoice — CoreML conversion},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/okayuji/Qwen3-TTS-12Hz-1.7B-CustomVoice-CoreML}}
}

Base model citation

@misc{qwen3-tts,
  title = {Qwen3-TTS Technical Report},
  author = {Qwen Team},
  year = {2026},
  eprint = {2601.15621},
  archivePrefix = {arXiv},
  howpublished = {\url{https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice}}
}

License

Released under the Apache License 2.0, inheriting the license of Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice.

See LICENSE for full text.

Acknowledgements

Qwen Team for releasing the original Qwen3-TTS under Apache 2.0
Apple for the coremltools framework
