SharpAI/SwiftLM: ⚡ Native Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, and an iOS app.
Tags: apple, inference, openai
Source: Mastodon
SharpAI has released SwiftLM, a native Swift‑based inference server that runs large language models directly on Apple Silicon. The open‑source project leverages the MLX framework to stream models exceeding 100 billion parameters from SSD, supports mixture‑of‑experts (MoE) architectures, and introduces TurboQuant KV‑cache compression to slash memory footprints. An OpenAI‑compatible REST API makes it easy for existing tooling to switch to on‑device inference, while a companion iPhone app demonstrates real‑time generation on iOS hardware.
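Because the server speaks the OpenAI wire format, existing clients mostly need only a new base URL. Below is a minimal sketch of a chat-completion request; the host, port, and model name are placeholders for illustration, not confirmed SwiftLM defaults, and only the `/v1/chat/completions` route is assumed from the OpenAI convention:

```python
import json
import urllib.request

# Placeholder endpoint; check SwiftLM's documentation for the actual host and port.
BASE_URL = "http://localhost:8080/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = build_chat_request("local-moe-100b", "Summarize KV-cache quantization.")
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send this to a running server; the response
# JSON mirrors OpenAI's format, with the generated text in choices[0].message.content.
```

Tooling built against the OpenAI SDK can typically point at such a server by overriding its base URL, which is what makes the drop-in switch to on-device inference possible.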
The launch matters because it closes a gap that has kept high‑end LLMs largely in the cloud. Apple’s M‑series chips offer substantial unified‑memory bandwidth and matrix‑multiply throughput, yet most developers still rely on remote APIs for lack of a performant, locally runnable server. By exposing a familiar API and handling the heavy lifting of SSD streaming and cache quantisation, SwiftLM enables privacy‑preserving applications, reduces latency, and cuts operating costs for startups and research labs that can now run state‑of‑the‑art models on a MacBook or iPad. It also adds a new competitor to the emerging ecosystem of local deployment tools, such as Docker’s Model Runner (reported on 2 April) and AMD’s Lemonade server (also reported on 2 April).
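TurboQuant’s internals have not been detailed publicly, but KV‑cache quantisation schemes generally follow a common pattern: store attention keys and values in a low‑bit integer format with a per‑channel scale, and dequantise on read. A minimal per‑channel int8 sketch of that general idea (function names here are illustrative, not SwiftLM’s API):

```python
def quantize_channel(values):
    """Symmetric int8 quantization of one KV-cache channel.
    Returns the quantized integers and the per-channel scale."""
    # Scale so the largest magnitude maps to 127; guard against all-zero channels.
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    quant = [max(-127, min(127, round(v / scale))) for v in values]
    return quant, scale

def dequantize_channel(quant, scale):
    """Recover approximate floats; rounding error is at most scale / 2 per element."""
    return [q * scale for q in quant]

channel = [0.53, -1.20, 3.40, -2.75, 0.0]
quant, scale = quantize_channel(channel)
recon = dequantize_channel(quant, scale)
# int8 storage is 4x smaller than float32, plus one scale value per channel.
```

At int8 the cache shrinks roughly 4× versus float32, which is the kind of saving that makes long contexts for very large models tractable within limited unified memory.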
The next few weeks will reveal whether SwiftLM can deliver the promised throughput on real‑world workloads. Benchmarks against Docker Model Runner and other open‑source servers will be watched closely, as will community contributions that expand model support and integrate with Apple’s Core ML pipeline. Apple’s own stance on third‑party inference servers could shape the long‑term viability of on‑device LLMs, making the evolution of SwiftLM a key indicator of the broader shift toward decentralized AI.