GB300 NVL72 beats GB200 by up to 1.5x in latency benchmarks

Latency dropped noticeably as NVIDIA's GB300 NVL72 outran the older GB200 NVL72 in long-context AI inference tests.

Blackwell Ultra performance jump
  • NVIDIA GB300 NVL72 was stress tested on DeepSeek open models.
  • LMSYS measured long context inference across the rack setup.
  • Results show roughly 1.4x to 1.5x gains over GB200 NVL72.
  • Latency-sensitive jobs saw about a 1.58x improvement.
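To put those multipliers in more familiar terms, here is a minimal sketch that converts an "Nx" speedup into the equivalent reduction in time per request. It assumes each figure is a plain ratio of old time to new time, which this summary does not confirm. `time_reduction_pct` is a hypothetical helper, not part of any LMSYS or NVIDIA tooling.

```python
def time_reduction_pct(speedup: float) -> float:
    """Percent less time taken, assuming speedup = old_time / new_time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


# Multipliers quoted in the list above.
for label, factor in [("low end", 1.4), ("high end", 1.5), ("latency-sensitive", 1.58)]:
    print(f"{label}: {factor}x -> {time_reduction_pct(factor):.1f}% less time")
```

Under that reading, a 1.5x gain means each request takes about a third less time, not half.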

Throughput and user speed gains
  • Peak output reached 226.2 tokens per second per GPU.
  • Multi-Token Prediction (MTP) pushed user-level speed up 1.87x.
  • On average, GB300 stayed ahead of the prior generation across the tests.
  • Blackwell Ultra aims squarely at agent-style workloads.
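The per-GPU peak can be scaled to a rough rack-level figure. The sketch below assumes all 72 GPUs in an NVL72 rack sustain the quoted peak at the same time, which is an optimistic upper bound: when prefill and decode run on separate GPUs, not every GPU is producing output tokens. The 20 tokens/s baseline is a made-up placeholder, used only to show what a 1.87x multiplier means for one user.

```python
GPUS_PER_RACK = 72            # NVL72 = 72 GPUs in one rack
PEAK_TOK_S_PER_GPU = 226.2    # peak output quoted above
MTP_USER_SPEEDUP = 1.87       # user-level gain quoted above


def rack_peak_tokens_per_s(per_gpu: float = PEAK_TOK_S_PER_GPU,
                           gpus: int = GPUS_PER_RACK) -> float:
    """Upper-bound rack throughput, assuming every GPU hits the peak."""
    return per_gpu * gpus


def user_speed_with_mtp(baseline_tok_s: float,
                        speedup: float = MTP_USER_SPEEDUP) -> float:
    """Per-user decode speed after applying the quoted MTP multiplier."""
    return baseline_tok_s * speedup


print(f"rack upper bound: {rack_peak_tokens_per_s():.1f} tokens/s")
# 20 tokens/s is a hypothetical baseline, not a measured figure.
print(f"per-user speed: {user_speed_with_mtp(20.0):.1f} tokens/s")
```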

Infrastructure-level optimizations
  • LMSYS applied Prefill-Decode (PD) disaggregation during testing.
  • That separates prompt processing (prefill) from token generation (decode).
  • Dynamic chunking tuned performance under long context windows.
  • KV capacity translation also tightened memory handling.
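The split described above can be illustrated with a toy sketch: one stage ingests the prompt in chunks and builds a cache, then hands that cache to a second stage that generates tokens one at a time. This is not the LMSYS or SGLang implementation. `KVCache`, `pick_chunk_size`, `prefill`, and `decode` are hypothetical names, the chunk-size rule is invented for illustration, and the "model" is a placeholder that emits dummy tokens.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for a key/value cache: here just the list of tokens seen."""
    tokens: list[str] = field(default_factory=list)


def pick_chunk_size(prompt_len: int, max_chunk: int = 8, min_chunk: int = 2) -> int:
    """Hypothetical 'dynamic' rule: long prompts get smaller chunks."""
    if prompt_len <= max_chunk:
        return max(prompt_len, 1)
    return max(min_chunk, max_chunk // 2)


def prefill(prompt: list[str]) -> tuple[KVCache, int]:
    """Prefill stage: ingest the prompt chunk by chunk and build the cache."""
    cache = KVCache()
    size = pick_chunk_size(len(prompt))
    chunks = 0
    for start in range(0, len(prompt), size):
        cache.tokens.extend(prompt[start:start + size])
        chunks += 1
    return cache, chunks


def decode(cache: KVCache, max_new_tokens: int) -> list[str]:
    """Decode stage: generate one token at a time from the handed-over cache."""
    out = []
    for i in range(max_new_tokens):
        token = f"<tok{i}>"          # placeholder for a real model step
        cache.tokens.append(token)   # each new token extends the cache
        out.append(token)
    return out


prompt = [f"w{i}" for i in range(20)]
cache, n_chunks = prefill(prompt)    # would run on prefill workers
generated = decode(cache, 3)         # would run on decode workers
print(n_chunks, generated, len(cache.tokens))
```

Because the two stages only share the cache object, each can be scaled and scheduled separately, which is the general idea behind disaggregation.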

Cost and deployment questions
  • NVIDIA has not detailed the total cost of ownership yet.
  • Deployment expenses reportedly climbed alongside GB300.
  • Hyperscalers and neoclouds are eyeing it for agent systems.
  • VRAM-heavy workloads lean into its long context design.
 
