Zero-Copy GPU Inference from WebAssembly on Apple Silicon
Source: Hacker News
A team of developers has unveiled a proof‑of‑concept library that lets WebAssembly code invoke Apple‑silicon GPUs without copying data between system memory and the graphics processor. The library wires the WebGPU compute API directly to the Metal driver and exposes the resulting buffers to Wasm through a new “zero‑copy” extension, so neural‑network tensors stay resident in GPU memory while inference kernels run, cutting latency by up to 70% compared with the traditional upload‑download cycle.
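The upload‑download cycle being avoided here can already be minimised, to a degree, with standard WebGPU: allocate a tensor as a GPU‑resident storage buffer once, dispatch compute passes against it repeatedly, and map only the final output back to the CPU. The sketch below illustrates that pattern using only the public WebGPU API; the zero‑copy Wasm extension itself is not public, and the function names (`reluInPlace`, `workgroupCount`) and buffer sizes are illustrative. WebGPU globals are accessed dynamically so the sketch type‑checks outside a browser.

```typescript
// Loosely-typed handle so this compiles without browser/WebGPU type defs.
const g = globalThis as any;

// WGSL compute kernel: in-place ReLU over a storage buffer that stays
// resident in GPU memory between dispatches (no per-pass staging copies).
const SHADER = `
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  if (gid.x < arrayLength(&data)) {
    data[gid.x] = max(data[gid.x], 0.0);
  }
}`;

// Pure helper: workgroups needed to cover n elements at the given size.
function workgroupCount(n: number, size = 64): number {
  return Math.ceil(n / size);
}

// Upload once, dispatch `passes` times, read back once at the end.
async function reluInPlace(input: Float32Array, passes = 1): Promise<Float32Array | null> {
  if (!g.navigator?.gpu) return null; // no WebGPU in this environment
  const adapter = await g.navigator.gpu.requestAdapter();
  const device = await adapter.requestDevice();

  // The tensor lives in a STORAGE buffer for its whole lifetime.
  const tensor = device.createBuffer({
    size: input.byteLength,
    usage: g.GPUBufferUsage.STORAGE | g.GPUBufferUsage.COPY_SRC,
    mappedAtCreation: true,
  });
  new Float32Array(tensor.getMappedRange()).set(input);
  tensor.unmap();

  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: device.createShaderModule({ code: SHADER }), entryPoint: "main" },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer: tensor } }],
  });

  // Every pass touches only GPU memory; nothing round-trips through RAM.
  const encoder = device.createCommandEncoder();
  for (let i = 0; i < passes; i++) {
    const pass = encoder.beginComputePass();
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(workgroupCount(input.length));
    pass.end();
  }

  // Single readback after all inference passes are done.
  const readback = device.createBuffer({
    size: input.byteLength,
    usage: g.GPUBufferUsage.MAP_READ | g.GPUBufferUsage.COPY_DST,
  });
  encoder.copyBufferToBuffer(tensor, 0, readback, 0, input.byteLength);
  device.queue.submit([encoder.finish()]);
  await readback.mapAsync(g.GPUMapMode.READ);
  return new Float32Array(readback.getMappedRange().slice(0));
}
```

The difference a zero‑copy extension would make is at the two ends of this function: the initial `set` into the mapped range and the final `mapAsync` readback would disappear, because the Wasm module and the GPU kernels would share the same unified‑memory buffer.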
The breakthrough matters because it removes one of the last technical barriers to truly local‑first AI in the browser. Until now, on‑device models on M1/M2 Macs required either CPU‑only execution or a costly round‑trip that duplicated tensors in RAM before the GPU could touch them. Zero‑copy inference means web apps can deliver desktop‑class performance while keeping user data on the device, a key advantage for privacy‑sensitive workloads such as medical imaging, personal assistants, or real‑time translation. It also aligns with Apple’s broader push to expose Metal‑level capabilities through WebGPU, a move that has already seen early demos like a spinning cube in Safari and the WHLSL‑to‑MSL compiler work described on the GPUWeb wiki.
What to watch next is the standardisation path for the zero‑copy buffer API. The WebGPU Working Group is expected to discuss the extension at the upcoming GPUWeb F2F meeting in September, and Apple’s Safari team has hinted at a beta rollout in macOS 15. If the extension lands in the WebGPU specification, third‑party frameworks such as ncnn or the Llama.cpp WebGPU backend (which we covered on 18 April) could ship production‑ready models that run entirely in the browser on Apple silicon. Developers and privacy advocates should also watch updates to the WebGPU conformance test suite (CTS), which will determine whether the new path can be trusted across the diverse GPU ecosystem.