LFM2.5 350M – CoreML build for Apple Neural Engine
CoreML port of LiquidAI/LFM2.5-350M for the CoreML-LLM iOS / macOS runtime. fp16, 97.8 % ANE-resident, 52 tok/s decode on iPhone 17 Pro.
Use it from Swift
1. Add the package
Package.swift:
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
Platforms: iOS 18+ / macOS 15+.
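If you're adding this to a fresh package, a minimal Package.swift could look like the sketch below (package and target names are hypothetical):
// swift-tools-version: 6.0
import PackageDescription

let package = Package(
    name: "MyChatApp", // hypothetical
    platforms: [.iOS(.v18), .macOS(.v15)],
    dependencies: [
        .package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyChatApp",
            dependencies: [
                .product(name: "CoreMLLLM", package: "CoreML-LLM"),
            ]
        ),
    ]
)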
2. Download + chat (one-turn streaming)
import CoreMLLLM
let info = ModelDownloader.ModelInfo.lfm2_5_350m
let downloader = ModelDownloader.shared
// First launch: pulls ~810 MB from this repo to
// Documents/Models/lfm2.5-350m/. Subsequent launches no-op.
if !downloader.isDownloaded(info) {
_ = try await downloader.download(info)
}
let modelDir = downloader.localModelURL(for: info)!
.deletingLastPathComponent() // bundle root (parent of model.mlmodelc)
let llm = try await CoreMLLLM.load(from: modelDir)
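// One-call alternative: download (if needed) + load in a single step.
// let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/lfm2.5-350m-coreml")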
let stream = try await llm.generate(
[CoreMLLLM.Message(role: .user, content: "Hello!")],
maxTokens: 256
)
for await chunk in stream {
print(chunk, terminator: "")
}
3. Multi-turn chat
var history: [CoreMLLLM.Message] = [
.init(role: .system, content: "You are a concise assistant."),
]
func reply(to user: String) async throws -> String {
history.append(.init(role: .user, content: user))
var out = ""
let stream = try await llm.generate(history, maxTokens: 512)
for await chunk in stream {
out += chunk
print(chunk, terminator: "")
}
history.append(.init(role: .assistant, content: out))
return out
}
llm.reset() // start a fresh conversation (clears KV + conv state)
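A hypothetical driver for the reply(to:) helper above (prompts are illustrative):
let a = try await reply(to: "Summarise RoPE in one sentence.")
let b = try await reply(to: "Now compare it to absolute position embeddings.")
// b was generated with both earlier turns in context.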
CoreMLLLM.load() honours the model's ChatML template, the
<|im_end|> / <|endoftext|> EOS tokens, and the conv-state I/O
contract automatically – you don't pass any of that yourself.
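Concretely, for the history above the serialised prompt looks something like this before tokenisation (see the template notes under Architecture notes):
<|startoftext|><|im_start|>system
You are a concise assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant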
4. Compute units
Defaults to .cpuAndNeuralEngine (the 52 tok/s number above).
Override at load time:
let llm = try await CoreMLLLM.load(
from: modelDir,
computeUnits: .cpuOnly // or .all / .cpuAndGPU
)
Or via env (only affects LFM2):
setenv("LLM_LFM2_USE_CPU", "1", 1)
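To sanity-check the configuration you actually got, here is a rough throughput probe (a sketch: it counts streamed chunks, which only approximates tokens):
let t0 = Date()
var chunks = 0
let probe = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Count from one to fifty.")],
    maxTokens: 128
)
for await _ in probe { chunks += 1 }
print("≈\(Double(chunks) / Date().timeIntervalSince(t0)) chunks/s")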
App: CoreMLLLMChat
If you just want to try it without writing code, the example app (Examples/CoreMLLLMChat) ships an LFM2.5 350M (ANE) entry in its model picker: open the project in Xcode, run on a device, tap Download.
Sideload (Mac → iPhone, no in-app download)
For development / offline use:
DEVICE=$(xcrun devicectl list devices | awk '/connected/{print $3}' | head -1)
xcrun devicectl device copy to --device "$DEVICE" \
--domain-type appDataContainer \
--domain-identifier com.example.CoreMLLLMChat \
--source ./lfm2.5-350m-coreml \
--destination Documents/Models/lfm2.5-350m \
--remove-existing-content true
Note: xcrun devicectl writes files as UID 0 / 0755, which the app
sandbox can't unlink later, so the picker's trash button will fail with
a permission error. To clear a sideloaded copy, run
scripts/uninstall_sideloaded_model.sh
from the host, or uninstall the app to wipe the container.
Files in this repo
- model.mlmodelc/ – the compiled model; load via MLModel(contentsOf:)
- model_config.json – context_length, num_hidden_layers, lfm2_conv_l_pad, …
- hf_model/ – tokenizer (ChatML, sanitised for swift-transformers)
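If you want to poke at the compiled bundle without the runtime, a minimal direct load looks like this (modelDir as in step 2):
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine // the benchmarked configuration
let raw = try MLModel(
    contentsOf: modelDir.appendingPathComponent("model.mlmodelc"),
    configuration: config
)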
Architecture notes
- Hybrid: 6 attention layers (GQA + RoPE + QK-norm) + 10 short-conv layers (depthwise causal Conv1d, kernel = 3).
- The conv-state rolling window is a regular input/output tensor, not an MLState: the M-series ANE planner rejects the dual-state combination (kv_cache_0 + conv_cache_0) at predict time (status=0x1d). See the sketch after this list.
- L_pad = conv_L_cache = 3. An earlier 16-wide padding fed enough fp16 noise into the depthwise reduction that autoregressive output collapsed to "kingkingking…" within a few tokens. Dropping the padding fixed both correctness and ANE compatibility.
- Compute precision is the default fp16; no fp32 fallback is needed once the padding is fixed.
- Chat template: ChatML (<|im_start|>role\n…<|im_end|>\n) wrapped in <|startoftext|>. EOS = <|im_end|> (id 7) and <|endoftext|> (id 2).
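A sketch of what the conv-state I/O contract implies for a hand-rolled decode loop (raw is the MLModel loaded above; feature names other than conv_cache_0 are hypothetical, KV-cache handling is omitted, and CoreMLLLM does all of this for you):
import CoreML

let generatedTokens: [Int32] = [1, 2, 3] // stand-in for your sampler's output
// The conv cache is an ordinary tensor you thread through predictions yourself.
var convCache = try MLMultiArray(
    shape: [1, 1024, 3], // [batch, channels (hypothetical), conv_L_cache = 3]
    dataType: .float16
)
for token in generatedTokens {
    let ids = try MLMultiArray(shape: [1, 1], dataType: .int32)
    ids[0] = NSNumber(value: token)
    let inputs = try MLDictionaryFeatureProvider(dictionary: [
        "input_ids": ids,          // name hypothetical
        "conv_cache_0": convCache, // rolling window in
    ])
    let out = try raw.prediction(from: inputs)
    // The rolled-forward window comes back as a regular output tensor.
    convCache = out.featureValue(for: "conv_cache_0_out")!.multiArrayValue! // name hypothetical
}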
Full conversion + drift writeup: docs/LFM2_CONVERSION_FINDINGS.md.
License
This CoreML port inherits LFM Open License v1.0 from the base model.
Important – commercial use limit: the LFM Open License grants free commercial use only up to a revenue threshold of US $10M / year. Above that threshold (and for non-501(c)(3) entities) you need a separate commercial license from Liquid AI. See the upstream LICENSE and Liquid AI commercial licensing for details.
The CoreML conversion code in this repo (the model class, conversion scripts, runtime glue) is Apache 2.0 (parent project CoreML-LLM).