## Use it from Swift

### Add the package
`Package.swift`:

```swift
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
```
Platforms: iOS 18+ / macOS 15+.
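Put together, a minimal manifest might look like the sketch below; the package name `MyApp` and the executable-target layout are placeholders, not part of this repo.

```swift
// swift-tools-version:6.0
// Minimal Package.swift sketch — only the two CoreML-LLM lines above are from
// the repo; the name and target layout here are illustrative.
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.iOS(.v18), .macOS(.v15)],
    dependencies: [
        .package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                .product(name: "CoreMLLLM", package: "CoreML-LLM"),
            ]
        ),
    ]
)
```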
### Download + chat (one call)
```swift
import CoreMLLLM

let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/gemma-4-E2B-coreml")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }
```
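Multi-turn chat is just a longer message array passed to the same call. A minimal sketch, assuming the `Message` role enum also has an `.assistant` case for feeding replies back in (only `.user` appears above):

```swift
// Hypothetical multi-turn wrapper around the generate(_:maxTokens:) call above.
// The .assistant role case is an assumption, not confirmed by this README.
var history: [CoreMLLLM.Message] = []

func ask(_ llm: CoreMLLLM, _ text: String) async throws -> String {
    history.append(CoreMLLLM.Message(role: .user, content: text))
    var reply = ""
    let stream = try await llm.generate(history, maxTokens: 256)
    for await chunk in stream { reply += chunk }
    history.append(CoreMLLLM.Message(role: .assistant, content: reply)) // assumed case
    return reply
}
```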
### Image / video / audio
```swift
// Image
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Describe this image")],
    image: cgImage
)

// Video (frames + audio extracted internally)
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "What happens in this clip?")],
    videoURL: localFileURL
)
```
Audio-only and other variants are exposed via the same `generate(_:)` overloads — see the Swift file `CoreMLLLM.swift` for the full surface.
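The image overload above takes a `CGImage`. If the picture lives on disk, a plain ImageIO helper is enough (no CoreMLLLM API involved beyond the `generate(_:image:)` call already shown):

```swift
import Foundation
import CoreGraphics
import ImageIO

// Load the first frame of an image file as a CGImage for the image overload above.
func loadCGImage(from url: URL) -> CGImage? {
    guard let source = CGImageSourceCreateWithURL(url as CFURL, nil) else { return nil }
    return CGImageSourceCreateImageAtIndex(source, 0, nil)
}
```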
# Gemma 4 E2B — Core ML (ANE multimodal)
Core ML port of google/gemma-4-E2B-it (the 2B-effective Gemma 4 / Gemma 3n decoder), optimized for Apple Neural Engine. Text + image + audio + short video, INT4 weights.
Branches:
`main` is the long-running 4-chunk text+vision+audio bundle. The default ship target for CoreMLLLMChat v1.6+ is the `n1024` branch (3-chunk merged decoder, slightly faster prefill). Both ship the same architecture — only the chunk topology and tokenizer artifacts differ. Pick whichever matches the Swift runtime you're using; if in doubt, use `n1024`.
## Files (root, `n1024` branch — recommended)
```
chunk1.mlmodelc/                      # L0–7  — INT4 palettized
chunk2_3way.mlmodelc/                 # L8–24 — merged middle (3-chunk decoder)
chunk3_3way.mlmodelc/                 # L25–34 + lm_head — multifunction
prefill_chunk{1..4}.mlmodelc/         # T=N prefill bodies (mlmodelc, weights shared
                                      #   with decode chunks via hardlink)
vision.mlmodelc/                      # SigLIP encoder, 322 MB
vision_video.mlmodelc/                # video frame encoder (64 tok/frame)
audio.mlmodelc/                       # 282 MB Whisper-style audio encoder

embed_tokens_q8.bin                   402 MB  — INT8 token embeddings (262144 × 1536)
embed_tokens_scales.bin               512 KB
embed_tokens_per_layer_q8.bin         2.19 GB — INT8 PLE
embed_tokens_per_layer_scales.bin     512 KB
per_layer_projection.bin              26 MB
per_layer_norm_weight.bin             1 KB

cos_{full,sliding}.npy                8 MB / 4 MB — precomputed RoPE cos
sin_{full,sliding}.npy                8 MB / 4 MB — precomputed RoPE sin
mel_filterbank.bin                    129 KB  — for audio path
embed_proj_weight.npy                 4.5 MB  — vision/audio → text embed projection
output_proj_{weight,bias}.npy         3 MB / 3 KB — audio output projection

model_config.json                     434 B   — runtime config (hidden=1536, layers=35, …)
audio_config.json                     402 B   — audio path config

hf_model/{tokenizer.json, tokenizer_config.json, config.json}
```
The `main` branch additionally carries the older 4-chunk topology (`chunk2.mlmodelc` + `chunk3.mlmodelc` + `chunk4.mlmodelc`) and several legacy variant directories (`sdpa/`, `sdpa-8k/`, `swa/`, `stateless/`, `stateless-ctx2048/`, `lite/`, `lite-chunks/`, `mf/`, `w8a8-8k/`, `model.mlmodelc`, `model.mlpackage`). These are research builds — only the `chunk*.mlmodelc` (or `chunk{1,2_3way,3_3way}.mlmodelc`) family is the shipping path.
## Why so many sidecars
Gemma 4 / 3n uses a per-layer embedding (PLE) bank that dwarfs the token embedding. Loading PLE through Core ML would dequant the whole 2.19 GB into the CPU heap. Instead, the raw INT8 + scale files are mmap'd in Swift and only the rows actually touched are dequantized on the fly. The chunks themselves stay ANE-resident.
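For illustration, here is the row-wise idea applied to the smaller token-embedding pair (the PLE bank works the same way, just larger). This is a sketch, not the `ChunkedEngine` code; it assumes a row-major 262144 × 1536 INT8 matrix with one Float16 scale per row, which is consistent with the 402 MB / 512 KB file sizes above but not confirmed here.

```swift
import Foundation

// Sketch: mmap the INT8 embeddings + per-row scales, dequantize only the rows
// that are actually touched. Layout (row-major, one fp16 scale per row) is an
// assumption; the real reader lives in ChunkedEngine.swift.
let dim = 1536
let q8 = try Data(contentsOf: URL(fileURLWithPath: "embed_tokens_q8.bin"),
                  options: .alwaysMapped)        // nothing is paged in until touched
let scales = try Data(contentsOf: URL(fileURLWithPath: "embed_tokens_scales.bin"),
                      options: .alwaysMapped)

func embedding(forToken id: Int) -> [Float] {
    let scale = scales.withUnsafeBytes {
        Float($0.bindMemory(to: Float16.self)[id])
    }
    return q8.withUnsafeBytes { raw -> [Float] in
        let bytes = raw.bindMemory(to: Int8.self)
        let rowStart = id * dim
        return (0..<dim).map { Float(bytes[rowStart + $0]) * scale }
    }
}
```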
cos/sin .npy are pre-baked so the Swift side doesn't ship a RoPE builder.
## Tokenizer

Already in `hf_model/`. Or pull from upstream:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")
```
## Standalone usage (Python / Mac)
```python
from huggingface_hub import snapshot_download
import coremltools as ct, json

local = snapshot_download(
    "mlboydaisuke/gemma-4-E2B-coreml", revision="n1024",
    allow_patterns=[
        "chunk1.mlmodelc/*", "chunk2_3way.mlmodelc/*", "chunk3_3way.mlmodelc/*",
        "prefill_chunk*.mlmodelc/*",
        "embed_tokens*.bin", "per_layer_*.bin",
        "cos_*.npy", "sin_*.npy",
        "model_config.json", "hf_model/*",
    ],
)

cfg = json.load(open(f"{local}/model_config.json"))
chunks = [
    ct.models.MLModel(f"{local}/chunk1.mlmodelc"),
    ct.models.MLModel(f"{local}/chunk2_3way.mlmodelc"),
    ct.models.MLModel(f"{local}/chunk3_3way.mlmodelc"),
]
```
For a working end-to-end loop (PLE dequant, vision/audio injection, KV alias plumbing), see `Sources/CoreMLLLM/ChunkedEngine.swift` — the canonical reference.
## Vision / Audio

- `vision.mlmodelc` expects `pixel_values` of shape `(1, 3, 256, 256)` fp16, outputs `(1, 256, 1536)` text-aligned tokens.
- `audio.mlmodelc` expects mel-spectrogram features (use `mel_filterbank.bin` for the front-end), outputs an audio token stream injected into the same text decoder.
- `vision_video.mlmodelc` packs 64 tokens per frame for short video.
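To poke `vision.mlmodelc` directly outside the Swift runtime, a minimal Core ML call could look like the sketch below. Only the `pixel_values` name and the shapes come from the list above; the output feature name is not documented here, so it is looked up at runtime.

```swift
import CoreML

// Sketch of a direct call into vision.mlmodelc; the CoreMLLLM runtime does this
// internally. Path and output-name handling here are illustrative.
let vision = try MLModel(contentsOf: URL(fileURLWithPath: "vision.mlmodelc"))

// (1, 3, 256, 256) fp16 pixel tensor — fill it from a resized 256×256 image.
let pixels = try MLMultiArray(shape: [1, 3, 256, 256], dataType: .float16)

let input = try MLDictionaryFeatureProvider(dictionary: ["pixel_values": pixels])
let output = try vision.prediction(from: input)

// Expect a single (1, 256, 1536) array of text-aligned image tokens.
if let name = output.featureNames.first,
   let tokens = output.featureValue(for: name)?.multiArrayValue {
    print(tokens.shape) // [1, 256, 1536]
}
```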
## iOS / Mac app

Pick **Gemma 4 E2B** in CoreMLLLMChat — it auto-downloads this repo (the picker fetches the `n1024` branch by default) and runs it via `ChunkedEngine`.
## Architecture

|  | value |
|---|---|
| `num_hidden_layers` | 35 |
| `hidden_size` | 1536 |
| `num_key_value_heads` | 1 |
| `intermediate_size` | 6144 |
| `num_kv_shared_layers` | 20 |
| KV producers (sliding/full) | L13 / L14 |
| sliding window | 512 |
| context length (shipping) | 1024 (`n1024`) / 2048 (`main`) |
| vocab | 262144 |
## License
Inherits the Gemma terms of use.