Paper: Qwen3-TTS Technical Report (arXiv:2601.15621)

Apple CoreML conversion of the Audio Tokenizer (Decoder + Encoder + Speaker Encoder) from Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice.
Converted with coremltools to run on the Apple Neural Engine (ANE) of
Apple Silicon (M1 and later) for low-power, low-memory local TTS inference.
If you want to try this CoreML pipeline embedded in a finished app right away, you can use KeyVoice (a macOS voice-input app).

⚠️ This repository contains only the Audio Tokenizer (CoreML). To run the full TTS pipeline you must also load the LM weights from Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice.

| File | Role | Size |
|---|---|---|
| Qwen3TTSDecoder.mlpackage | Audio Token → Mel Decoder | 218 MB |
| Qwen3TTSEncoder.mlpackage | Reference audio PCM → Audio Token Encoder (Voice Cloning) | 182 MB |
| Qwen3TTSSpeakerEncoder.mlpackage | Reference audio → Speaker Embedding (Voice Design) | 46 MB |

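To fetch only the three packages listed in the table, a sketch using `huggingface_hub.snapshot_download` might look like the following. The repo id is taken from the citation URL in this card. `mlpackage_patterns` and `download_tokenizer` are hypothetical helper names, and the `./coreml` target directory is an arbitrary choice.

```python
# File names come from the table in this card; REPO_ID from its citation URL.
PACKAGES = [
    "Qwen3TTSDecoder.mlpackage",
    "Qwen3TTSEncoder.mlpackage",
    "Qwen3TTSSpeakerEncoder.mlpackage",
]
REPO_ID = "okayuji/Qwen3-TTS-12Hz-1.7B-CustomVoice-CoreML"


def mlpackage_patterns(packages=PACKAGES):
    """An .mlpackage is a directory, so match everything beneath each one."""
    return [f"{name}/*" for name in packages]


def download_tokenizer(local_dir="./coreml"):
    """Download the CoreML packages (needs network access and huggingface_hub)."""
    # Imported lazily so the helpers above work without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=mlpackage_patterns(),
        local_dir=local_dir,
    )
```

The LM weights needed for the full pipeline live in the separate Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice repository and would be fetched the same way with that repo id.
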
- Converted with coremltools 8.x
- Compute units: `.cpuAndNeuralEngine` (ANE-optimized)
- Decoder
  - Input: `audio_tokens` `[1, 16, T_frames]` (Int32): 16 codebooks × T frames
  - Output: `mel` `[1, 80, T_mel]` (Float32)
- Encoder
  - Input: `audio` `[1, 1, 28800]` (Float32): fixed-length 1.2 s at 24 kHz (15 frames × 1920 samples)
  - Output: `audio_tokens` `[1, 16, 15]` (Int32)
- Speaker Encoder
  - Input: `mel` `[1, 80, T]` (Float32)
  - Output: `embedding` `[1, D]` (Float32): speaker vector for Voice Design

Swift:

    import CoreML

    // Compile the .mlpackage to .mlmodelc (must run in an async context)
    let url = URL(fileURLWithPath: "./Qwen3TTSDecoder.mlpackage")
    let compiledURL = try await MLModel.compileModel(at: url)

    // Prefer the Neural Engine, falling back to CPU
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine
    let model = try MLModel(contentsOf: compiledURL, configuration: config)
    // Feed audio_tokens [1, 16, T_frames] to obtain mel output


Python (coremltools):

    import coremltools as ct

    # Load the package and print its spec (input/output names and shapes)
    model = ct.models.MLModel("./Qwen3TTSDecoder.mlpackage")
    print(model.get_spec())

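As a sketch of how the encoder's fixed-length input could be prepared from Python, the helper below splits a mono 24 kHz waveform into 28,800-sample windows (15 frames × 1,920 samples) and runs the encoder on each. `split_into_windows` and `encode_reference` are hypothetical helper names. The feature names `audio` and `audio_tokens` are taken from the specs listed earlier and should be checked against `model.get_spec()`. Zero-padding the final window is an assumption; this card does not say how a short tail should be handled. Prediction through coremltools only works on macOS.

```python
SAMPLE_RATE = 24_000        # Hz, from the encoder spec
SAMPLES_PER_FRAME = 1_920   # samples per audio-token frame
FRAMES_PER_WINDOW = 15      # frames produced per encoder call
WINDOW_SAMPLES = FRAMES_PER_WINDOW * SAMPLES_PER_FRAME  # 28,800 samples = 1.2 s


def split_into_windows(samples):
    """Split a mono waveform (sequence of floats) into fixed-size windows.

    The last window is zero-padded to WINDOW_SAMPLES (an assumption).
    """
    windows = []
    for start in range(0, len(samples), WINDOW_SAMPLES):
        chunk = list(samples[start:start + WINDOW_SAMPLES])
        chunk.extend([0.0] * (WINDOW_SAMPLES - len(chunk)))
        windows.append(chunk)
    return windows


def encode_reference(samples, model_path="./Qwen3TTSEncoder.mlpackage"):
    """Run the encoder on each window (macOS only; needs coremltools and numpy)."""
    # Imported lazily so the windowing helper works without these packages.
    import numpy as np
    import coremltools as ct

    model = ct.models.MLModel(model_path, compute_units=ct.ComputeUnit.CPU_AND_NE)
    tokens = []
    for window in split_into_windows(samples):
        audio = np.asarray(window, dtype=np.float32).reshape(1, 1, WINDOW_SAMPLES)
        out = model.predict({"audio": audio})
        tokens.append(out["audio_tokens"])  # per the spec: shape [1, 16, 15]
    return tokens
```

At 1,920 samples per frame and 24 kHz, the tokenizer emits 12.5 frames per second, which matches the "12Hz" in the model name.
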

Loading a model for the first time compiles the `.mlpackage` to `.mlmodelc`, which takes ~30 seconds.

Citation:

    @misc{qwen3-tts-12hz-1.7b-customvoice-coreml,
      author       = {okayuji},
      title        = {Qwen3-TTS 12Hz 1.7B CustomVoice - CoreML conversion},
      year         = {2026},
      publisher    = {Hugging Face},
      howpublished = {\url{https://huggingface.co/okayuji/Qwen3-TTS-12Hz-1.7B-CustomVoice-CoreML}}
    }

    @misc{qwen3-tts,
      title         = {Qwen3-TTS Technical Report},
      author        = {Qwen Team},
      year          = {2026},
      eprint        = {2601.15621},
      archivePrefix = {arXiv},
      howpublished  = {\url{https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice}}
    }

Released under the Apache License 2.0, inheriting the license of Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice.
See LICENSE for full text.

Base model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice