
Swift package for uzu, a high-performance inference engine for AI models on Apple Silicon. It lets you deploy AI directly in your app with zero latency, full data privacy, and no inference costs. You don't need an ML team or weeks of setup; one developer can handle everything in minutes. Key features:
- Simple, high-level API
- Specialized configurations with significant performance boosts for common use cases like classification and summarization
- Broad model support
- Observable model manager (see the sketch after this list)
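
Because the model manager is observable, download state can drive your UI directly. Below is a minimal SwiftUI sketch; it assumes `UzuEngine` publishes changes to SwiftUI (the exact conformance isn't spelled out here) and only uses the `downloadState(repoId:)` call shown later in this README:

```swift
import SwiftUI

// Illustrative sketch: assumes UzuEngine notifies SwiftUI when a
// model's download state changes, so this view re-renders on updates.
struct ModelStatusView: View {
    let engine: UzuEngine
    private let repoId = "Qwen/Qwen3-0.6B"

    var body: some View {
        if let state = engine.downloadState(repoId: repoId) {
            Text("Phase: \(String(describing: state.phase))")
        } else {
            Text("Model not downloaded")
        }
    }
}
```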
Set up your project on the Platform and obtain an `API_KEY`. Place the `API_KEY` in the corresponding example file, then run one of the examples:

```bash
swift run example chat
swift run example summarization
swift run example classification
```

Add the uzu-swift dependency to your `Package.swift`:
```swift
dependencies: [
    .package(url: "https://github.com/trymirai/uzu-swift.git", from: "0.1.36")
]
```
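
Then reference the library from your target. The product name below is an assumption; check uzu-swift's `Package.swift` for the exact name:

```swift
.target(
    name: "YourApp",  // your target's name
    dependencies: [
        // Product name assumed for illustration.
        .product(name: "Uzu", package: "uzu-swift")
    ]
)
```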
Create and activate the engine:

```swift
let engine = UzuEngine()
let status = try await engine.activate(apiKey: "API_KEY")
```

Check the model's download state and download it if needed:

```swift
let repoId = "Qwen/Qwen3-0.6B"
let modelDownloadState = engine.downloadState(repoId: repoId)
if modelDownloadState?.phase != .downloaded {
    let handle = try engine.downloadHandle(repoId: repoId)
    try await handle.download()
    let progressStream = handle.progress()
    while let progressUpdate = await progressStream.next() {
        print("Progress: \(progressUpdate.progress)")
    }
}
```

`Session` is the core entity used to communicate with the model:
```swift
let session = try engine.createSession(
    repoId,
    modelType: .local,
    config: Config(preset: .general)
)
```

Once loaded, the same `Session` can be reused for multiple requests until you drop it. Each model may consume a significant amount of RAM, so keep only one session loaded at a time. For iOS apps, we recommend adding the Increased Memory Capability entitlement so your app can allocate the required memory.
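
For example, when switching models, release the old session before creating the next one. A minimal sketch; `anotherRepoId` is an illustrative second model, and only APIs already shown in this README are used:

```swift
// Keep at most one session alive: dropping the reference frees the
// current model's memory before the next model is loaded.
var session: Session? = try engine.createSession(
    repoId,
    modelType: .local,
    config: Config(preset: .general)
)

session = nil  // release the old model's weights first
session = try engine.createSession(
    anotherRepoId,
    modelType: .local,
    config: Config(preset: .general)
)
```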
After creating a session, you can run it with a specific prompt or a list of messages:
```swift
let messages = [
    Message(role: .system, content: "You are a helpful assistant."),
    Message(role: .user, content: "Tell me a short, funny story about a robot."),
]
let input: Input = .messages(messages: messages)

let runConfig = RunConfig()
    .tokensLimit(1024)

let output = try session.run(
    input: input,
    config: runConfig
) { _ in
    return true
}
```

The output also includes generation metrics such as prefill duration and tokens per second. Note that you need a release build to obtain accurate metrics.
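
The examples here always return `true` from the trailing closure; a reasonable reading is that it is called as generation progresses and that returning `false` cancels the run early. A sketch under that assumption, where `partial.text` is an illustrative property name, not a confirmed API:

```swift
// Sketch: cancel generation once enough text has been produced.
let output = try session.run(
    input: input,
    config: runConfig
) { partial in
    // Returning false is assumed to stop generation; `text` is assumed
    // to hold the tokens decoded so far.
    return partial.text.count < 2_000
}
```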
In the next example, we extract a summary of the input text:
```swift
let session = try engine.createSession(
    repoId,
    modelType: .local,
    config: Config(preset: .summarization)
)

let textToSummarize =
    "A Large Language Model (LLM) is a type of AI that processes and generates text using transformer-based architectures trained on vast datasets. They power chatbots, translation, code assistants, and more."
let input: Input = .text(
    text: "Text is: \"\(textToSummarize)\". Write only summary itself.")

let runConfig = RunConfig()
    .tokensLimit(256)
    .enableThinking(false)
    .samplingPolicy(.custom(value: .greedy))

let output = try session.run(
    input: input,
    config: runConfig
) { _ in
    return true
}
```

This will generate ~34 output tokens with only ~5 model runs during the generation phase, instead of ~34 runs, since the summarization preset can accept multiple tokens per run when the output reuses spans of the source text.
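
Because the session stays loaded, a follow-up text can be summarized without reloading the model. This sketch only reuses calls already shown above:

```swift
// Reuse the loaded summarization session for another input.
let secondText =
    "Apple Silicon integrates the CPU, GPU, and Neural Engine on a single chip."
let secondInput: Input = .text(
    text: "Text is: \"\(secondText)\". Write only summary itself.")
let secondOutput = try session.run(
    input: secondInput,
    config: runConfig
) { _ in
    return true
}
```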
Let’s look at a case where you need to classify input text based on a specific feature, such as sentiment:
```swift
let feature = ClassificationFeature(
    name: "sentiment",
    values: ["Happy", "Sad", "Angry", "Fearful", "Surprised", "Disgusted"]
)
let config = Config(preset: .classification(feature: feature))
let session = try engine.createSession(repoId, modelType: .local, config: config)

let textToDetectFeature =
    "Today's been awesome! Everything just feels right, and I can't stop smiling."
let prompt =
    "Text is: \"\(textToDetectFeature)\". Choose \(feature.name) from the list: \(feature.values.joined(separator: ", ")). Answer with one word. Don't add a dot at the end."
let input: Input = .text(text: prompt)

let runConfig = RunConfig()
    .tokensLimit(32)
    .enableThinking(false)
    .samplingPolicy(.custom(value: .greedy))

let output = try session.run(
    input: input,
    config: runConfig
) { _ in
    return true
}
```

In this example, you will get the answer `Happy` immediately after the prefill step, and the actual generation phase won't even need to start.
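
The same preset works for any small, closed set of labels. For instance, a hypothetical language-detection feature built with the same calls:

```swift
// Illustrative second feature; drop any previously loaded session
// before creating a new one (see the memory note above).
let languageFeature = ClassificationFeature(
    name: "language",
    values: ["English", "Spanish", "German", "French"]
)
let languageConfig = Config(preset: .classification(feature: languageFeature))
let languageSession = try engine.createSession(
    repoId,
    modelType: .local,
    config: languageConfig
)
```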
This project is licensed under the MIT License. See the LICENSE file for details.