I run 20 AI coding agents locally on my desktop workstation at 400+ tokens/sec with MiniMax-M2. It’s a Sonnet drop-in replacement in my Cursor, Claude Code, Droid, Kilo and Cline peak at 11k tok/sec input and 433 tok/s output, can generate 1B+ tok/m.All with 196k context window. I'm running it for 6 days now with this config.
Today max performance was stable at 490.2 tokens/sec across 48 concurrent clients and MiniMax M2.
Running large language models efficiently is more than just raw GPU power. The latest guide breaks down the essential math to determine if your LLM workload is compute-bound or memory-bound.
We apply these principles to a real-world example: Qwen's 32B parameter model on the new NVIDIA RTX PRO 6000 Blackwell Edition.
In this guide, you will learn how to:
Calculate your GPU's operational intensity (Ops:Byte Ratio) Determine your model's arithmetic intensity Identify whether your workload is memory-bound or compute-bound