NVIDIA claims its GB200 NVL72 rack-scale system delivers 10 times the performance of the older Hopper setup when running Mixture of Experts (MoE) models like Kimi K2, an open-source thinking model that activates roughly 32 billion of its parameters per token. The gain came from a co-design approach that splits token batches across 72 GPUs sharing 30TB of memory, letting expert parallelism scale much further than before.
MoE models activate only a subset of their parameters (the selected experts) for each token instead of the whole network, which makes them more efficient but creates scaling bottlenecks once those experts are spread across many GPUs. Team Green tackled this with disaggregated serving through its Dynamo framework, which assigns prefill and decode tasks to different GPUs, and with the 4-bit NVFP4 format, which NVIDIA says preserves accuracy while improving speed.
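To make the routing idea concrete, here is a minimal sketch of top-k expert selection and expert-parallel dispatch in plain Python. It is an illustration under stated assumptions, not NVIDIA's or Dynamo's implementation: the expert count, the top-k value, the GPU count, and the round-robin expert placement are all hypothetical, and the function names `route_top_k` and `dispatch` are made up for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_id, weight), ...] for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def dispatch(token_logits, k, num_gpus):
    """Group token-expert assignments by the GPU hosting each expert."""
    per_gpu = {g: [] for g in range(num_gpus)}
    for token_id, logits in enumerate(token_logits):
        for expert_id, weight in route_top_k(logits, k):
            # Assumption: experts are placed round-robin across GPUs.
            gpu = expert_id % num_gpus
            per_gpu[gpu].append((token_id, expert_id, weight))
    return per_gpu


# Illustrative sizes only; real deployments are far larger.
NUM_EXPERTS, TOP_K, NUM_GPUS, NUM_TOKENS = 16, 2, 4, 8

rng = random.Random(0)
batch = [[rng.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in range(NUM_TOKENS)]
plan = dispatch(batch, TOP_K, NUM_GPUS)

for gpu, work in plan.items():
    print(f"GPU {gpu}: {len(work)} token-expert assignments")
```

Each token touches only `TOP_K` of the `NUM_EXPERTS` experts, which is where the efficiency comes from. The uneven per-GPU counts the script prints hint at the bottleneck: tokens must be shipped to whichever GPUs host their experts, so load balance and interconnect bandwidth start to matter as the expert pool is spread wider.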
The GB200 chips are already moving through supply chains for frontier AI servers, and NVIDIA looks positioned to cash in big as MoE deployments keep expanding across different environments.