Ivan Fioravanti ᯅ (@ivanfioravanti) on X
Ivan Fioravanti, a well‑known voice in the European LLM community, posted a short video showing the MiniMax M2.7 model running at full precision on his home workstation. The clip, shared on X on 20 April, shows that the 7‑billion‑parameter model can be executed locally without resorting to cloud GPUs, a claim he backs with raw latency numbers that rival early‑stage commercial APIs.
The demonstration matters because it pushes the boundary of what hobbyist‑grade hardware can achieve. MiniMax M2.7, released by the open‑source collective behind the MiniMax line, is marketed as a “research‑grade” LLM that balances size and capability. Running it in full precision—rather than the 4‑bit or 8‑bit quantisations that dominate current local inference—shows that Apple Silicon, especially the M‑series chips, now has enough matrix‑multiply throughput and memory bandwidth to handle non‑quantised workloads. The result is higher‑fidelity output, fewer quantisation artefacts, and a more faithful benchmark for model developers.
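The memory argument can be made concrete with a back‑of‑the‑envelope calculation. The sketch below assumes the 7‑billion‑parameter figure from the article and uses standard bytes‑per‑weight values for each precision; in the local‑inference community “full precision” usually means fp16/bf16 (fp32 would double the figure again):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate storage for model weights alone, in gigabytes.

    Ignores activations, KV cache, and runtime overhead, so real
    memory use is somewhat higher.
    """
    return n_params * bits_per_weight / 8 / 1e9

N = 7e9  # 7 billion parameters, per the article

print(weight_memory_gb(N, 16))  # full precision (fp16/bf16): 14.0 GB
print(weight_memory_gb(N, 8))   # 8-bit quantisation:          7.0 GB
print(weight_memory_gb(N, 4))   # 4-bit quantisation:          3.5 GB
```

The 4x gap between fp16 and 4‑bit is why quantisation has dominated local inference so far, and why unified‑memory Macs with large RAM pools make the full‑precision path viable.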
Fioravanti’s post follows a series of community experiments that have been gathering steam. Earlier this month Simon Willison highlighted a GLM‑4.5‑Air model quantised to 4 bits running on an M4 Mac with 128 GB of RAM, while Fioravanti himself has previously warned against “magic incantations” that promise outsized performance without solid engineering. Together, these signals suggest a rapid convergence of open‑source model releases, Apple‑optimised toolchains (MPS, mlx‑community libraries), and consumer‑grade hardware capable of serious AI workloads.
What to watch next: the MiniMax team is expected to publish a quantised variant for MPS‑accelerated inference, which could lower the hardware bar even further. Nordic AI startups are likely to test the model for Finnish‑language fine‑tuning, and we may see the first benchmark suite comparing full‑precision local runs against cloud‑based endpoints. Keep an eye on Fioravanti’s feed for follow‑up performance data and on the mlx‑community repo for upcoming optimisations that could make full‑precision local inference the new baseline.