Nvidia RTX 5080 and RTX 3090 Configuration Achieves 80 Tokens per Second on Qwen 3.6 27B Q8 Model
Source: HN
Nvidia's RTX 5080 and RTX 3090 combine for 80 tokens per second on Qwen 3.6.
A setup that pairs an RTX 5080 with an RTX 3090 has reached 80 tokens per second running Qwen 3.6 27B at Q8 quantization. The result is notable because it shows how pooling multiple GPUs can improve performance when running large language models. As we reported on June 4, when covering a 35B MoE model running on two GTX 1080 Ti cards, adding a second GPU can substantially improve speeds, and this latest result reinforces that point.
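The source does not say how the model was divided between the two cards. A common approach in local inference runtimes is to give each GPU a contiguous block of transformer layers, sized roughly in proportion to its memory; llama.cpp, for example, accepts a `--tensor-split` list of proportions. The Python sketch that follows is a hypothetical illustration of such a proportional split, not the method used in this setup. The helper names `split_layers` and `layer_ranges` are invented for the example, and the 64-layer count is an assumed value, not Qwen 3.6's actual architecture.

```python
def split_layers(num_layers: int, vram_gb: list[float]) -> list[int]:
    """Hypothetical helper: assign layer counts to GPUs in proportion to VRAM.

    Uses largest-remainder rounding so the counts always sum to num_layers.
    """
    total = sum(vram_gb)
    # Ideal fractional share of layers for each GPU.
    shares = [num_layers * v / total for v in vram_gb]
    counts = [int(s) for s in shares]
    # Give any leftover layers to the GPUs with the largest fractional remainders.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def layer_ranges(counts: list[int]) -> list[range]:
    """Turn per-GPU layer counts into contiguous layer index ranges."""
    ranges, start = [], 0
    for c in counts:
        ranges.append(range(start, start + c))
        start += c
    return ranges


if __name__ == "__main__":
    NUM_LAYERS = 64  # assumed layer count, for illustration only
    gpus = {"RTX 5080": 16, "RTX 3090": 24}  # VRAM in GB
    counts = split_layers(NUM_LAYERS, list(gpus.values()))
    for name, r in zip(gpus, layer_ranges(counts)):
        print(f"{name}: layers {r.start}-{r.stop - 1} ({len(r)} layers)")
```

With these inputs the sketch places 26 layers on the RTX 5080 and 38 on the RTX 3090. Real runtimes also reserve memory for the KV cache and scratch buffers, so hand-tuned splits often differ from a pure VRAM ratio.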
The result matters because it shows how much hardware configuration counts for demanding AI workloads. The 8-bit weights of a 27B model need more memory than either card offers on its own: the RTX 5080 has 16 GB of VRAM and the RTX 3090 has 24 GB, while the pair together provides 40 GB. As models like Qwen 3.6 grow, sufficient VRAM and compute become crucial for high performance. This setup shows that combining a newer GPU with an older but still capable one can be a cost-effective path to significant performance gains.
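A back-of-the-envelope estimate of the weight footprint shows why a single card falls short. This sketch assumes the Q8 model uses llama.cpp's Q8_0 format, which stores blocks of 32 8-bit values plus one 16-bit scale (8.5 bits per weight), and rounds the parameter count to 27 billion; the real file size will differ somewhat. It counts weights only and ignores the KV cache and runtime buffers, which need additional memory.

```python
# Back-of-the-envelope VRAM check for model weights only.
# Assumptions: ~27e9 parameters and llama.cpp's Q8_0 layout
# (32 int8 values + one fp16 scale per block = 8.5 bits per weight).
PARAMS = 27e9
BITS_PER_WEIGHT = (32 * 8 + 16) / 32  # 8.5

# Card capacities as marketed (16 GB, 24 GB), which are binary gigabytes (GiB).
VRAM_GIB = {"RTX 5080": 16, "RTX 3090": 24}


def weights_gib(params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    need = weights_gib(PARAMS, BITS_PER_WEIGHT)
    print(f"Weights: ~{need:.1f} GiB")
    for name, vram in VRAM_GIB.items():
        verdict = "fits" if need <= vram else "does not fit"
        print(f"{name} alone ({vram} GiB): {verdict}")
    total = sum(VRAM_GIB.values())
    verdict = "fits" if need <= total else "does not fit"
    print(f"Both cards ({total} GiB): {verdict}, "
          f"~{total - need:.1f} GiB left for KV cache and buffers")
```

Under these assumptions the weights come to about 26.7 GiB, leaving roughly 13 GiB across the pair for context and runtime overhead.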
Looking ahead, a natural comparison is how this setup fares against Nvidia's recently announced RTX Spark, touted as the most efficient PC chip ever built. Further speedups may also come as LM Studio continues to optimize its platform for running large language models locally. The community's ongoing experimentation with hardware configurations and software optimizations will be key to getting the most out of models like Qwen 3.6.