Use it from Swift

Add the package

Package.swift:

.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),

// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),

Platforms: iOS 18+ / macOS 15+.

Download + chat (one call)

import CoreMLLLM

let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/gemma-4-E2B-coreml")

let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Hello!")],
    maxTokens: 256
)
for await chunk in stream { print(chunk, terminator: "") }

Image / video / audio

// Image
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user,
                       content: "Describe this image")],
    image: cgImage)

// Video (frames + audio extracted internally)
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user,
                       content: "What happens in this clip?")],
    videoURL: localFileURL)

Audio-only and other variants are exposed via the same generate(_:) overloads; see the Swift file CoreMLLLM.swift for the full surface.
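
An audio-only call would look like the image/video calls above. The audioURL: label below is an assumption by analogy with those overloads, not a confirmed signature; check CoreMLLLM.swift for the actual argument name.

// Audio (hypothetical sketch: the audioURL: label is assumed by analogy
// with the image/video overloads; CoreMLLLM.swift has the real signature)
let stream = try await llm.generate(
    [CoreMLLLM.Message(role: .user,
                       content: "Transcribe this recording")],
    audioURL: recordingURL)
for await chunk in stream { print(chunk, terminator: "") }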

Gemma 4 E2B – Core ML (ANE multimodal)

Core ML port of google/gemma-4-E2B-it (the 2B-effective Gemma 4 / Gemma 3n decoder), optimized for Apple Neural Engine. Text + image + audio + short video, INT4 weights.

Branches: main is the long-running 4-chunk text+vision+audio bundle. The default ship target for CoreMLLLMChat v1.6+ is the n1024 branch (3-chunk merged decoder, slightly faster prefill). Both ship the same architecture; only the chunk topology and tokenizer artifacts differ. Pick whichever matches the Swift runtime you're using; if in doubt, use n1024.

Files (root, n1024 branch – recommended)

chunk1.mlmodelc/                  # L0–7    – INT4 palettized
chunk2_3way.mlmodelc/             # L8–24   – merged middle (3-chunk decoder)
chunk3_3way.mlmodelc/             # L25–34 + lm_head – multifunction
prefill_chunk{1..4}.mlmodelc/     # T=N prefill bodies (mlmodelc, weights shared
                                  # with decode chunks via hardlink)
vision.mlmodelc/                  # SigLIP encoder, 322 MB
vision_video.mlmodelc/            # video frame encoder (64 tok/frame)
audio.mlmodelc/                   # 282 MB Whisper-style audio encoder

embed_tokens_q8.bin               402 MB  – INT8 token embeddings (262144 × 1536)
embed_tokens_scales.bin           512 KB
embed_tokens_per_layer_q8.bin     2.19 GB – INT8 PLE
embed_tokens_per_layer_scales.bin 512 KB
per_layer_projection.bin          26 MB
per_layer_norm_weight.bin         1 KB
cos_{full,sliding}.npy            8 MB / 4 MB – precomputed RoPE cos
sin_{full,sliding}.npy            8 MB / 4 MB – precomputed RoPE sin
mel_filterbank.bin                129 KB – for audio path
embed_proj_weight.npy             4.5 MB – vision/audio → text embed projection
output_proj_{weight,bias}.npy     3 MB / 3 KB – audio output projection

model_config.json                 434 B  – runtime config (hidden=1536, layers=35, …)
audio_config.json                 402 B  – audio path config
hf_model/{tokenizer.json, tokenizer_config.json, config.json}

The main branch additionally carries the older 4-chunk topology (chunk2.mlmodelc + chunk3.mlmodelc + chunk4.mlmodelc) and several legacy variant directories (sdpa/, sdpa-8k/, swa/, stateless/, stateless-ctx2048/, lite/, lite-chunks/, mf/, w8a8-8k/, model.mlmodelc, model.mlpackage). These are research builds; only the chunk*.mlmodelc (or chunk{1,2_3way,3_3way}.mlmodelc) family is the shipping path.

Why so many sidecars

Gemma 4 / 3n uses a per-layer embedding (PLE) bank that dwarfs the token embedding. Loading PLE through Core ML would dequant the whole 2.19 GB into the CPU heap. Instead, the raw INT8 + scale files are mmap'd in Swift and only the rows actually touched are dequantized on the fly. The chunks themselves stay ANE-resident.
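
A minimal sketch of that access pattern, assuming a row-major [rows × cols] INT8 matrix with one Float16 scale per row (consistent with the 512 KB scale files for a 262144-row table); the authoritative layout is whatever ChunkedEngine.swift implements:

import Foundation

// Sketch of mmap'd, on-demand row dequant. Assumes a row-major INT8 matrix
// with one Float16 scale per row; the real file layout is defined by
// ChunkedEngine.swift and may differ.
struct MappedQuantTable {
    let q8: Data      // .alwaysMapped: bytes are paged in only when touched
    let scales: Data
    let cols: Int

    init(q8URL: URL, scalesURL: URL, cols: Int) throws {
        q8 = try Data(contentsOf: q8URL, options: .alwaysMapped)
        scales = try Data(contentsOf: scalesURL, options: .alwaysMapped)
        self.cols = cols
    }

    // Dequantize one row; only these `cols` bytes (plus the scale) are read.
    func row(_ i: Int) -> [Float] {
        let scale = scales.withUnsafeBytes {
            Float($0.bindMemory(to: Float16.self)[i])
        }
        return q8.withUnsafeBytes { buf in
            let q = buf.bindMemory(to: Int8.self)
            let base = i * cols
            return (0..<cols).map { Float(q[base + $0]) * scale }
        }
    }
}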

cos/sin .npy are pre-baked so the Swift side doesn't ship a RoPE builder.
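
For reference, those tables hold the standard RoPE angle grid. A sketch of what such a builder would compute; the base of 10,000 and the headDim parameter are illustrative assumptions here, and the shipped .npy files remain the source of truth:

import Foundation

// What cos_/sin_*.npy contain conceptually: cos/sin of position * invFreq.
// `base` and `headDim` are illustrative assumptions, not values read from
// this repo.
func ropeTables(positions: Int, headDim: Int, base: Double = 10_000)
    -> (cos: [[Float]], sin: [[Float]]) {
    let invFreq: [Double] = (0..<(headDim / 2)).map {
        1.0 / pow(base, Double(2 * $0) / Double(headDim))
    }
    var cosT: [[Float]] = [], sinT: [[Float]] = []
    for p in 0..<positions {
        let angles = invFreq.map { Double(p) * $0 }
        cosT.append(angles.map { Float(cos($0)) })
        sinT.append(angles.map { Float(sin($0)) })
    }
    return (cosT, sinT)
}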

Tokenizer

Already in hf_model/. Or pull from upstream:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")

Standalone usage (Python / Mac)

from huggingface_hub import snapshot_download
import coremltools as ct, json

local = snapshot_download(
    "mlboydaisuke/gemma-4-E2B-coreml", revision="n1024",
    allow_patterns=[
        "chunk1.mlmodelc/*", "chunk2_3way.mlmodelc/*", "chunk3_3way.mlmodelc/*",
        "prefill_chunk*.mlmodelc/*",
        "embed_tokens*.bin", "per_layer_*.bin",
        "cos_*.npy", "sin_*.npy",
        "model_config.json", "hf_model/*",
    ],
)
cfg = json.load(open(f"{local}/model_config.json"))
chunks = [
    # .mlmodelc bundles are compiled models: load them with CompiledMLModel
    # (ct.models.MLModel expects an uncompiled .mlmodel/.mlpackage)
    ct.models.CompiledMLModel(f"{local}/chunk1.mlmodelc"),
    ct.models.CompiledMLModel(f"{local}/chunk2_3way.mlmodelc"),
    ct.models.CompiledMLModel(f"{local}/chunk3_3way.mlmodelc"),
]

For a working end-to-end loop (PLE dequant, vision/audio injection, KV alias plumbing), see Sources/CoreMLLLM/ChunkedEngine.swift – the canonical reference.

Vision / Audio

  • vision.mlmodelc expects pixel_values (1, 3, 256, 256) fp16, outputs (1, 256, 1536) text-aligned tokens.
  • audio.mlmodelc expects mel-spectrogram features (use mel_filterbank.bin for the front-end), outputs an audio token stream injected into the same text decoder.
  • vision_video.mlmodelc packs 64 tokens per frame for short video.

iOS / Mac app

Pick Gemma 4 E2B in CoreMLLLMChat; it auto-downloads this repo (the picker fetches the n1024 branch by default) and runs it via ChunkedEngine.

Architecture

parameter                     value
num_hidden_layers             35
hidden_size                   1536
num_key_value_heads           1
intermediate_size             6144
num_kv_shared_layers          20
KV producers (sliding/full)   L13 / L14
sliding window                512
context length (shipping)     1024 (n1024) / 2048 (main)
vocab                         262144

License

Inherits the Gemma terms of use.
