Qwen3-TTS 12Hz 1.7B CustomVoice - CoreML conversion

Apple CoreML conversion of the Audio Tokenizer (Decoder, Encoder, and Speaker Encoder) from Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice.

Converted with coremltools to run on the Apple Neural Engine (ANE) of Apple Silicon (M1 and later) for low-power, low-memory local TTS inference.

If you want to try this CoreML pipeline embedded in a finished app right away, you can use KeyVoice (a macOS voice-input app).

⚠️ This repository contains only the Audio Tokenizer (CoreML). To run the full TTS pipeline you must also load the LM weights from Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice.

Contents

| File | Role | Size |
|---|---|---|
| Qwen3TTSDecoder.mlpackage | Audio Token → Mel Decoder | 218 MB |
| Qwen3TTSEncoder.mlpackage | Reference audio PCM → Audio Token Encoder (Voice Cloning) | 182 MB |
| Qwen3TTSSpeakerEncoder.mlpackage | Reference audio → Speaker Embedding (Voice Design) | 46 MB |

Model Details

  • Base model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
  • Conversion: PyTorch → CoreML (coremltools 8.x)
  • Target compute units: .cpuAndNeuralEngine (ANE-optimized)
  • Sample rate: 24 kHz
  • Frame rate: 12.5 Hz (80 ms per frame)
  • License: Apache 2.0 (inherited)
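
The sample rate and frame rate above fix how many PCM samples each audio-token frame covers. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative and are not part of the converted models or of any library.

```python
# Values taken from the Model Details list above.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of PCM samples covered by one audio-token frame."""
    return round(sample_rate_hz / frame_rate_hz)


def frames_to_seconds(n_frames: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Audio duration, in seconds, represented by n_frames token frames."""
    return n_frames / frame_rate_hz


print(samples_per_frame())        # 1920 samples, i.e. 80 ms at 24 kHz
print(15 * samples_per_frame())   # 28800 samples in a 15-frame window
print(frames_to_seconds(15))      # 1.2 seconds
```

The 15-frame case reproduces the 28800-sample, 1.2 s encoder input listed under Input / Output Shapes.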

Input / Output Shapes

Qwen3TTSDecoder

  • Input: audio_tokens [1, 16, T_frames] (Int32), 16 codebooks × T frames
  • Output: mel [1, 80, T_mel] (Float32)

Qwen3TTSEncoder

  • Input: audio [1, 1, 28800] (Float32), fixed-length 1.2 s at 24 kHz (15 frames × 1920 samples)
  • Output: audio_tokens [1, 16, 15] (Int32)

Qwen3TTSSpeakerEncoder

  • Input: mel [1, 80, T] (Float32)
  • Output: embedding [1, D] (Float32), the speaker vector for Voice Design

How to Use

Swift

import CoreML

let url = URL(fileURLWithPath: "./Qwen3TTSDecoder.mlpackage")
let compiledURL = try await MLModel.compileModel(at: url)
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
let model = try MLModel(contentsOf: compiledURL, configuration: config)
// Feed audio_tokens [1, 16, T_frames] to obtain mel output

Python (inspection)

import coremltools as ct
model = ct.models.MLModel("./Qwen3TTSDecoder.mlpackage")
print(model.get_spec())
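
Printing the whole spec can be verbose. As a hedged sketch, the helper below pulls out only the input and output names and shapes, so they can be compared with the Input / Output Shapes section. The function names `summarize_features` and `describe_model` are hypothetical; `skip_model_load=True` is passed on the assumption that only the spec is needed, not a compiled model.

```python
def summarize_features(features):
    """Return (name, shape) pairs for multi-array feature descriptions."""
    return [(f.name, list(f.type.multiArrayType.shape)) for f in features]


def describe_model(path):
    """Load only the spec of an .mlpackage and summarize its inputs and outputs."""
    # Imported lazily so the pure helper above works without coremltools installed.
    import coremltools as ct

    # skip_model_load=True avoids compiling the model just to read its spec.
    spec = ct.models.MLModel(path, skip_model_load=True).get_spec()
    return {
        "inputs": summarize_features(spec.description.input),
        "outputs": summarize_features(spec.description.output),
    }


# Example call (needs coremltools and the model file on disk):
# print(describe_model("./Qwen3TTSDecoder.mlpackage"))
```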

Limitations

  • ANE arithmetic does not exactly match CPU/GPU, so floating-point outputs differ slightly between the CoreML and PyTorch versions (no audible impact in practice).
  • Shapes are fixed. The encoder assumes a 1.2-second reference clip: shorter inputs are zero-padded, and longer inputs are truncated to the first 1.2 s. This is stricter than the original PyTorch model.
  • Initial load compiles .mlpackage to .mlmodelc, which takes ~30 seconds the first time.
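
The zero-padding and truncation rule for the encoder can be sketched as follows. This is a minimal illustration, assuming the reference audio is already mono PCM at 24 kHz held in a plain Python sequence; `fit_reference_clip` and `ENCODER_SAMPLES` are hypothetical names, and reshaping the result to [1, 1, 28800] Float32 is left to the caller.

```python
from typing import List, Sequence

ENCODER_SAMPLES = 28_800  # 1.2 s at 24 kHz, the encoder's fixed input length


def fit_reference_clip(samples: Sequence[float],
                       target_len: int = ENCODER_SAMPLES) -> List[float]:
    """Truncate to the first target_len samples, or zero-pad at the end."""
    clip = list(samples[:target_len])               # keep only the first 1.2 s
    clip.extend([0.0] * (target_len - len(clip)))   # pad short clips with silence
    return clip


short = fit_reference_clip([0.5] * 10)
long_ = fit_reference_clip([1.0] * 40_000)
print(len(short), len(long_))  # 28800 28800
```

Padding at the end rather than the start is an assumption made here; the text above only states that shorter inputs are zero-padded.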

Citation

@misc{qwen3-tts-12hz-1.7b-customvoice-coreml,
  author = {okayuji},
  title = {Qwen3-TTS 12Hz 1.7B CustomVoice - CoreML conversion},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/okayuji/Qwen3-TTS-12Hz-1.7B-CustomVoice-CoreML}}
}

Base model citation

@misc{qwen3-tts,
  title = {Qwen3-TTS Technical Report},
  author = {Qwen Team},
  year = {2026},
  eprint = {2601.15621},
  archivePrefix = {arXiv},
  howpublished = {\url{https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice}}
}

License

Released under the Apache License 2.0, inheriting the license of Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice.

See LICENSE for full text.

Acknowledgements

  • Qwen Team for releasing the original Qwen3-TTS under Apache 2.0
  • Apple for the coremltools framework