🤯 Wow, try the latest #Ollama with experimental #MLX support on Mac! I tried qwen3.5:35b-a3b
Source: Mastodon | Original article
Ollama 0.14‑rc2, the latest release candidate of the open‑source platform for running large language models locally, rolls out experimental MLX support for Apple Silicon. The update lets users run the 35‑billion‑parameter Qwen 3.5‑a3b model quantised to MXFP8 on a Mac, with early adopters reporting a 1.7× speed boost over the previous Q8_0 quantisation. They measured inference latency using the new `ollama run --experimental` flag, which now also reports peak memory usage for the MLX engine.
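To put the figures above in perspective, here is a back-of-envelope sketch of the weight footprint and throughput implied by the article's numbers. The baseline throughput of 20 tokens/s is a hypothetical placeholder, not a measured value; only the ~8-bit quantisation width and the 1.7× factor come from the text.

```python
# Illustrative arithmetic for the quoted figures, not measurements.
params = 35e9            # 35B-parameter model (qwen3.5:35b-a3b)
bytes_per_param = 1.0    # MXFP8 and Q8_0 both store ~8 bits per weight
weights_gb = params * bytes_per_param / 1e9
print(f"approx. weight footprint: {weights_gb:.0f} GB")

old_tok_s = 20.0               # hypothetical Q8_0 baseline throughput
new_tok_s = old_tok_s * 1.7    # the reported 1.7x MLX speedup
print(f"{old_tok_s:.0f} tok/s -> {new_tok_s:.0f} tok/s")
```

The takeaway: since both quantisation formats are ~8 bits per weight, the memory footprint barely changes, so the reported gain comes from the MLX engine itself rather than from a smaller model.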
As we reported on 31 March 2026, Ollama was already previewing MLX acceleration on Apple Silicon. This release moves the feature from preview to a more usable state and adds a web‑search‑and‑fetch plugin that lets local or cloud‑hosted models pull fresh content from the internet. The same release introduces a Bash‑tooling mode, enabling LLMs to invoke shell commands and automate workflows directly on the host machine.
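A Bash-tooling mode like the one described above typically works by having the model emit a structured tool call that the host then executes. The sketch below shows one minimal, whitelisted dispatcher for such calls; the JSON shape (`name`/`arguments`/`argv`) is an assumption for illustration, not Ollama's actual wire format.

```python
import json
import subprocess

# Hypothetical tool-call dispatcher: the model emits a JSON tool call,
# the host validates it against a whitelist and runs it via subprocess.
ALLOWED = {"echo", "uname", "date"}  # limit what the model may execute

def run_tool_call(call_json: str) -> str:
    call = json.loads(call_json)
    argv = call["arguments"]["argv"]          # e.g. ["echo", "hello"]
    if argv[0] not in ALLOWED:
        raise PermissionError(f"command {argv[0]!r} not whitelisted")
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Example: a hypothetical tool call produced by the model
print(run_tool_call('{"name": "bash", "arguments": {"argv": ["echo", "hello"]}}'))
```

The whitelist is the important design choice here: letting a model run arbitrary shell commands on the host is a significant attack surface, so any real deployment would constrain which binaries and arguments are reachable.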
The development matters because it narrows the performance gap between consumer‑grade Macs and dedicated GPU rigs for large‑model inference. By leveraging Apple's Metal‑based MLX runtime, developers can prototype and deploy AI‑enhanced applications without incurring cloud costs or acquiring CUDA‑compatible hardware. Faster, memory‑aware inference also expands the feasible model size for on‑device use, a step toward more private, offline AI services.
What to watch next is whether Ollama will stabilise the MLX backend for production workloads and broaden support beyond Qwen 3.5 to other popular models such as LLaMA 2 and Claude‑style architectures. Community benchmarks, especially against NVIDIA‑accelerated setups, will show whether Apple Silicon can become a mainstream platform for heavyweight LLMs. Future releases may also integrate tighter tooling for agents, expanding the ecosystem of locally run AI assistants.