LFM2.5 350M – CoreML build for Apple Neural Engine
CoreML port of LiquidAI/LFM2.5-350M for the CoreML-LLM iOS / macOS runtime. fp16, 97.8 % ANE-resident, 52 tok/s decode on iPhone 17 Pro.
Use it from Swift
1. Add the package
Package.swift:
.package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
// In your target:
.product(name: "CoreMLLLM", package: "CoreML-LLM"),
Platforms: iOS 18+ / macOS 15+.
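If you're adding this to a fresh package, a minimal Package.swift could look like the sketch below (package and target names are hypothetical):
// swift-tools-version: 6.0
import PackageDescription

let package = Package(
    name: "MyChatApp", // hypothetical
    platforms: [.iOS(.v18), .macOS(.v15)],
    dependencies: [
        .package(url: "https://github.com/john-rocky/CoreML-LLM", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyChatApp",
            dependencies: [
                .product(name: "CoreMLLLM", package: "CoreML-LLM"),
            ]
        ),
    ]
)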
2. Download + chat (one-turn streaming)
import CoreMLLLM
let info = ModelDownloader.ModelInfo.lfm2_5_350m
let downloader = ModelDownloader.shared
// First launch: pulls ~810 MB from this repo to
// Documents/Models/lfm2.5-350m/. Subsequent launches no-op.
if !downloader.isDownloaded(info) {
_ = try await downloader.download(info)
}
let modelDir = downloader.localModelURL(for: info)!
.deletingLastPathComponent() // bundle root (parent of model.mlmodelc)
let llm = try await CoreMLLLM.load(from: modelDir)
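// One-call alternative: download (if needed) + load in a single step.
// let llm = try await CoreMLLLM.load(repo: "mlboydaisuke/lfm2.5-350m-coreml")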
let stream = try await llm.generate(
[CoreMLLLM.Message(role: .user, content: "Hello!")],
maxTokens: 256
)
for await chunk in stream {
print(chunk, terminator: "")
}
3. Multi-turn chat
var history: [CoreMLLLM.Message] = [
.init(role: .system, content: "You are a concise assistant."),
]
func reply(to user: String) async throws -> String {
history.append(.init(role: .user, content: user))
var out = ""
let stream = try await llm.generate(history, maxTokens: 512)
for await chunk in stream {
out += chunk
print(chunk, terminator: "")
}
history.append(.init(role: .assistant, content: out))
return out
}
llm.reset() // start a fresh conversation (clears KV + conv state)
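A hypothetical driver for the reply(to:) helper above (prompts are illustrative):
let a = try await reply(to: "Summarise RoPE in one sentence.")
let b = try await reply(to: "Now compare it to absolute position embeddings.")
// b was generated with both earlier turns in context.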
CoreMLLLM.load() honours the model's ChatML template, the
<|im_end|> / <|endoftext|> EOS tokens, and the conv-state I/O
contract automatically – you don't pass any of that yourself.
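Concretely, for the history above the serialised prompt looks something like this before tokenisation (see the template notes under Architecture notes):
<|startoftext|><|im_start|>system
You are a concise assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant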
4. Compute units
Defaults to .cpuAndNeuralEngine (the 52 tok/s number above).
Override at load time:
let llm = try await CoreMLLLM.load(
from: modelDir,
computeUnits: .cpuOnly // or .all / .cpuAndGPU
)
Or via env (only affects LFM2):
setenv("LLM_LFM2_USE_CPU", "1", 1)
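To sanity-check the configuration you actually got, here is a rough throughput probe (a sketch: it counts streamed chunks, which only approximates tokens):
let t0 = Date()
var chunks = 0
let probe = try await llm.generate(
    [CoreMLLLM.Message(role: .user, content: "Count from one to fifty.")],
    maxTokens: 128
)
for await _ in probe { chunks += 1 }
print("≈\(Double(chunks) / Date().timeIntervalSince(t0)) chunks/s")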
App: CoreMLLLMChat
If you just want to try it without writing code, the example app (Examples/CoreMLLLMChat) ships an LFM2.5 350M (ANE) entry in its model picker: open the project in Xcode, run on a device, tap Download.
Sideload (Mac → iPhone, no in-app download)
For development / offline use:
DEVICE=$(xcrun devicectl list devices | awk '/connected/{print $3}' | head -1)
xcrun devicectl device copy to --device "$DEVICE" \
--domain-type appDataContainer \
--domain-identifier com.example.CoreMLLLMChat \
--source ./lfm2.5-350m-coreml \
--destination Documents/Models/lfm2.5-350m \
--remove-existing-content true
Note: xcrun devicectl writes files as UID 0 / 0755, which the app
sandbox can't unlink later, so the picker's trash button will fail with
a permission error. To clear a sideloaded copy, run
scripts/uninstall_sideloaded_model.sh
from the host, or uninstall the app to wipe the container.
Files in this repo
- model.mlmodelc/ – the compiled model; load via MLModel(contentsOf:)
- model_config.json – context_length, num_hidden_layers, lfm2_conv_l_pad, …
- hf_model/ – tokenizer (ChatML, sanitised for swift-transformers)
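If you want to poke at the compiled bundle without the runtime, a minimal direct load looks like this (modelDir as in step 2):
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine // the benchmarked configuration
let raw = try MLModel(
    contentsOf: modelDir.appendingPathComponent("model.mlmodelc"),
    configuration: config
)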
Architecture notes
- Hybrid: 6 attention layers (GQA + RoPE + QK-norm) + 10 short-conv layers (depthwise causal Conv1d, kernel = 3).
- The conv-state rolling window is a regular input/output tensor, not an MLState: the M-series ANE planner rejects the dual-state combination (kv_cache_0 + conv_cache_0) at predict time (status=0x1d). See the sketch after this list.
- L_pad = conv_L_cache = 3. An earlier 16-wide padding fed enough fp16 noise into the depthwise reduction that autoregressive output collapsed to "kingkingking…" within a few tokens. Dropping the padding fixed both correctness and ANE compatibility.
- Compute precision is the default fp16; no fp32 fallback is needed once the padding is fixed.
- Chat template: ChatML (<|im_start|>role\n…<|im_end|>\n) wrapped in <|startoftext|>. EOS = <|im_end|> (id 7) and <|endoftext|> (id 2).
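A sketch of what the conv-state I/O contract implies for a hand-rolled decode loop (raw is the MLModel loaded above; feature names other than conv_cache_0 are hypothetical, KV-cache handling is omitted, and CoreMLLLM does all of this for you):
import CoreML

let generatedTokens: [Int32] = [1, 2, 3] // stand-in for your sampler's output
// The conv cache is an ordinary tensor you thread through predictions yourself.
var convCache = try MLMultiArray(
    shape: [1, 1024, 3], // [batch, channels (hypothetical), conv_L_cache = 3]
    dataType: .float16
)
for token in generatedTokens {
    let ids = try MLMultiArray(shape: [1, 1], dataType: .int32)
    ids[0] = NSNumber(value: token)
    let inputs = try MLDictionaryFeatureProvider(dictionary: [
        "input_ids": ids,          // name hypothetical
        "conv_cache_0": convCache, // rolling window in
    ])
    let out = try raw.prediction(from: inputs)
    // The rolled-forward window comes back as a regular output tensor.
    convCache = out.featureValue(for: "conv_cache_0_out")!.multiArrayValue! // name hypothetical
}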
Full conversion + drift writeup: docs/LFM2_CONVERSION_FINDINGS.md.
License
This CoreML port inherits LFM Open License v1.0 from the base model.
Important – commercial use limit: the LFM Open License grants free commercial use only up to a revenue threshold of US $10M / year. Above that threshold (and for non-501(c)(3) entities) you need a separate commercial license from Liquid AI. See the upstream LICENSE and Liquid AI commercial licensing for details.
The CoreML conversion code in this repo (the model class, conversion scripts, runtime glue) is Apache 2.0 (parent project CoreML-LLM).