Claude Code Runs 15 Times Slower Than Expected When Self-Hosted
Source: Dev.to
Claude Code ran 15x slower than expected on a self-hosted setup. A prefix-cache patch addressing the slowdown is now upstream.
A recent report sheds light on performance problems with self-hosted Claude Code. Running Claude Code against a self-hosted vllm-mlx backend on a Mac Studio revealed significant slowdowns: cold turns took approximately 108 seconds, and follow-up turns took almost as long. This happened even though the system prompt was byte-stable, a scenario in which any reputable LLM engine should cache the prefix and make follow-up turns much faster.
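To make the prefix-caching expectation concrete, here is a minimal Python sketch. It is a toy model only, not vllm-mlx's or SimpleEngine's actual implementation: the class name `ToyPrefixCache`, the method `tokens_to_prefill`, and the use of byte counts as a stand-in for token counts are all assumptions made for illustration. It shows why a byte-stable system prompt should make follow-up turns far cheaper than the cold turn.

```python
import hashlib


class ToyPrefixCache:
    """Toy model of prefix caching (hypothetical, for illustration only)."""

    def __init__(self):
        # Hashes of system prompts whose (pretend) KV state is already stored.
        self._seen = set()

    @staticmethod
    def _key(prefix: bytes) -> str:
        # A byte-stable prefix always hashes to the same key.
        return hashlib.sha256(prefix).hexdigest()

    def tokens_to_prefill(self, system_prompt: bytes, turn: bytes) -> int:
        """Return how many bytes (standing in for tokens) need fresh prefill."""
        key = self._key(system_prompt)
        if key in self._seen:
            # Cache hit: only the new turn has to be processed.
            return len(turn)
        # Cache miss: process the whole prompt and remember the prefix.
        self._seen.add(key)
        return len(system_prompt) + len(turn)


cache = ToyPrefixCache()
system = b"x" * 10_000  # stand-in for a long, unchanging system prompt

cold = cache.tokens_to_prefill(system, b"hello")  # full prompt is processed
warm = cache.tokens_to_prefill(system, b"again")  # only the new turn is processed
print(cold, warm)
```

In this toy model the cold turn processes the full prompt while the follow-up processes only the new turn. When follow-ups take almost as long as cold turns, as in the report, that pattern suggests the prefix is being reprocessed on every turn.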
The finding that self-hosted Claude Code ran 15 times slower than expected matters because it shows how hard it can be to maintain and optimize AI-powered coding tools in-house. A slowdown of this size directly hinders developer productivity. The issue has since been addressed with the SimpleEngine prefix-cache patch, which is upstream as of May 14, 2026.
Looking ahead, developers will be watching how the patch affects the performance of self-hosted Claude Code setups. The choice between self-hosting and using a managed LLM API will also come under scrutiny as the trade-offs between control, maintenance, and cost become more apparent. With the open-source Open Design emerging as a local-first alternative to Anthropic's Claude Design, the landscape of AI coding tools continues to evolve, giving developers more options for integrating AI into their workflows.