Long-context inference latency dropped noticeably as NVIDIA's GB300 NVL72 outperformed the older GB200 NVL72 in LMSYS tests.
Blackwell Ultra performance jump
- NVIDIA GB300 NVL72 was stress tested on DeepSeek open models.
- LMSYS measured long context inference across the rack setup.
- Results show roughly 1.4x to 1.5x gains over GB200 NVL72.
- Latency-sensitive jobs saw about a 1.58x improvement.
- Peak output reached 226.2 tokens per second per GPU.
- Multi-Token Prediction (MTP) raised per-user generation speed by 1.87x.
- Across the tested configurations, GB300 stayed ahead of the prior generation.
- Blackwell Ultra aims squarely at agent-style workloads.
- LMSYS applied prefill-decode (PD) disaggregation during testing.
- That separates prompt processing (prefill) from token generation (decode).
- Dynamic chunking tuned performance under long context windows.
- KV capacity translation also tightened memory handling.
- NVIDIA has not detailed the total cost of ownership yet.
- Deployment expenses reportedly climbed alongside GB300.
- Hyperscalers and neoclouds are eyeing it for agent systems.
- Its long-context design targets VRAM-heavy workloads.
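
For readers unfamiliar with prefill-decode disaggregation and chunked prefill, the toy sketch below illustrates the general idea only. It is not the LMSYS or NVIDIA implementation: the `Request` class, the `pick_chunk_size` heuristic, its thresholds, and the dummy "model" arithmetic are all hypothetical and chosen for illustration. The prompt is processed in chunks by a prefill stage, the resulting cache is handed to a separate decode stage, and tokens are then generated one at a time.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request record; kv_cache is a stand-in for real KV tensors."""
    request_id: str
    prompt_tokens: list[int]
    max_new_tokens: int
    kv_cache: list[int] = field(default_factory=list)
    output_tokens: list[int] = field(default_factory=list)


def pick_chunk_size(prompt_len: int, base_chunk: int = 4) -> int:
    """Assumed heuristic: longer prompts get larger prefill chunks."""
    return base_chunk * 2 if prompt_len > 16 else base_chunk


def prefill_worker(req: Request) -> int:
    """Process the prompt chunk by chunk, filling the cache. Returns the chunk count."""
    chunk = pick_chunk_size(len(req.prompt_tokens))
    chunks = 0
    for start in range(0, len(req.prompt_tokens), chunk):
        req.kv_cache.extend(req.prompt_tokens[start:start + chunk])
        chunks += 1
    return chunks


def decode_worker(req: Request) -> None:
    """Generate tokens one at a time, reading and extending the cache."""
    for _ in range(req.max_new_tokens):
        # Dummy "model": any deterministic function of the cache will do here.
        next_token = (sum(req.kv_cache) + len(req.kv_cache)) % 1000
        req.output_tokens.append(next_token)
        req.kv_cache.append(next_token)


def serve(requests: list[Request]) -> dict[str, list[int]]:
    """Run all prefill work first, then hand each request to the decode stage."""
    prefill_queue = deque(requests)
    decode_queue = deque()
    while prefill_queue:
        req = prefill_queue.popleft()
        prefill_worker(req)
        # In a real disaggregated system the KV cache would be transferred
        # between separate prefill and decode GPUs at this point.
        decode_queue.append(req)
    results = {}
    while decode_queue:
        req = decode_queue.popleft()
        decode_worker(req)
        results[req.request_id] = req.output_tokens
    return results
```

Calling `serve([Request("demo", list(range(20)), 3)])` returns a dictionary mapping `"demo"` to three generated token IDs. Keeping the two stages in separate functions mirrors the point of the technique: prompt-heavy and generation-heavy work can be scheduled and scaled independently.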