MLPerf v5 shows big gains in generative AI

MLCommons has published new results from its MLPerf Inference v5.0 benchmark suite, which measures how fast systems run AI models across a wide range of workloads. The results make clear where the industry is focusing its energy: generative AI. Recent hardware and software advances have produced dramatic performance gains over last year's results.

The suite covers both datacenter and edge systems and measures how quickly each one executes a variety of AI and machine learning models. Because the benchmarks are open source and run under common rules, they create a level playing field that pushes vendors toward better, faster, more efficient AI systems, and they help buyers select and tune the right systems for their needs. This round adds four new tests: Llama 3.1 405B, Llama 2 70B Interactive for low-latency serving, RGAT, and Automotive PointPainting for 3D object detection.

The Llama 2 70B benchmark has become the workhorse of the suite. Submissions to it grew 2.5x over the previous round, and it has overtaken ResNet-50 as the most frequently submitted test. The benchmark is built around a large, widely used, openly available generative AI model. Performance has also surged since last year: the median submitted score doubled, and the best score is 3.3 times faster than the top result in v4.0.

David Kanter of MLCommons noted that the industry is pushing hard to deploy generative AI, and that the benchmark's feedback loop is working: new generations of accelerator silicon keep arriving, paired with new software techniques, including broad hardware and software support for the FP4 data format. Together, these advances are driving record generative AI performance.
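
FP4 here refers to 4-bit floating-point number formats, which roughly halve memory and bandwidth needs versus 8-bit formats at the cost of precision. As a rough illustration (not MLCommons or vendor code), here is a minimal Python sketch of block-scaled rounding to the common E2M1 FP4 value set; the grid of representable values and the per-block scaling scheme are assumptions about one typical FP4 layout, and production implementations differ.

```python
import numpy as np

# Representable magnitudes of an E2M1 FP4 value (1 sign, 2 exponent, 1 mantissa bit).
# Assumption: this matches the common OCP-style FP4 layout; vendor formats may vary.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x):
    """Round a 1-D block of values to the nearest FP4 magnitude after per-block scaling."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / FP4_GRID[-1] if max_abs > 0 else 1.0  # map the largest magnitude to 6.0
    scaled = np.abs(x) / scale
    idx = np.argmin(np.abs(scaled[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(x) * FP4_GRID[idx], scale

def dequantize_fp4_block(q, scale):
    """Recover approximate real values from quantized magnitudes and the block scale."""
    return q * scale

weights = np.random.default_rng(0).standard_normal(16)
q, s = quantize_fp4_block(weights)
print("mean abs error:", np.abs(weights - dequantize_fp4_block(q, s)).mean())
```

The per-block scale is what lets such a coarse grid cover weights and activations with very different ranges, which is one reason block-scaled FP4 variants are attractive for inference.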

This round includes results from six processors that are newly available or about to ship: AMD Instinct MI325X, Intel Xeon 6980P "Granite Rapids," Google TPU Trillium, NVIDIA B200, NVIDIA Jetson AGX Thor 128, and NVIDIA GB200. The suite also gains two new large language model benchmarks that reflect where generative AI development is headed.

MLPerf Inference v5.0 introduces a benchmark based on Llama 3.1 405B, the largest model the suite has ever included. The model has 405 billion parameters and supports a context window of 128,000 tokens, compared with 4,096 for the older Llama 2 70B benchmark. The test exercises three distinct capabilities: general question answering, math, and code generation.

Miro Hodak, who helps lead the benchmark effort, calls this the suite's most ambitious test yet. It tracks the industry's move toward larger models that deliver higher accuracy and handle a broader range of tasks. Benchmarking a model of this size takes considerably more time and compute, but organizations are actively trying to deploy models of this class, and trusted results help them decide how best to run them.

The team also added an interactive variant of its Llama 2 70B test with tighter latency requirements. Llama 2 70B Interactive reflects how chatbots and agent systems are built today: submissions must meet stricter limits on how quickly the first token appears and on how fast subsequent tokens are generated. Mitchelle Rasquinha explains that users judge an AI chat by its responsiveness: does it start answering quickly, and how fast does the full reply arrive? The interactive test captures how language models perform under those conversational constraints.
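
The two limits the interactive benchmark tightens are usually called time to first token (TTFT) and time per output token (TPOT). The sketch below shows one illustrative way to measure both on the client side; `stream_tokens` and `fake_stream` are hypothetical stand-ins, not part of the MLPerf harness.

```python
import time

def measure_latency(stream_tokens):
    """Measure time to first token (TTFT) and time per output token (TPOT)
    for a streaming response. `stream_tokens` is any iterable that yields
    tokens as the model produces them."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
        count += 1
    end = time.perf_counter()
    ttft = (first_token_at - start) if first_token_at is not None else float("nan")
    # TPOT: average gap between successive tokens after the first one.
    tpot = ((end - first_token_at) / max(count - 1, 1)
            if first_token_at is not None else float("nan"))
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens": count}

def fake_stream(n=20, delay=0.01):
    """Stand-in for a model's token stream."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

print(measure_latency(fake_stream()))
```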

The team also added a graph neural network benchmark for data centers. Graph neural networks model relationships between entities and underpin recommendation systems, knowledge graphs, fraud detection, and similar applications. The new test runs the RGAT model on a dataset with more than 547 million nodes and 5.8 billion edges.
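
RGAT (relational graph attention) extends graph attention networks to graphs whose edges carry types: each relation gets its own projection and attention parameters, and a node aggregates its neighbors' messages with learned attention weights. The NumPy sketch below shows one simplified, single-head aggregation step for one target node; the random parameter initialization and the softmax across all relations are illustrative simplifications, not the benchmark's actual model code.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def rgat_aggregate(h_target, neighbors, dim_out, rng):
    """One simplified RGAT-style step for a single target node.

    neighbors: dict mapping relation name -> (n_r, dim_in) array of neighbor features.
    Each relation gets its own projection W_r and attention vector a_r
    (randomly initialised here; in practice these are learned parameters)."""
    dim_in = h_target.shape[0]
    messages, scores = [], []
    for rel, feats in neighbors.items():
        W_r = rng.standard_normal((dim_in, dim_out)) * 0.1   # relation-specific projection
        a_r = rng.standard_normal(2 * dim_out) * 0.1         # relation-specific attention vector
        ht = h_target @ W_r                                  # projected target node
        hn = feats @ W_r                                     # projected neighbors (n_r, dim_out)
        # Attention logit per neighbor: leaky_relu(a_r . [W_r h_i || W_r h_j])
        pair = np.concatenate([np.repeat(ht[None, :], len(hn), axis=0), hn], axis=1)
        scores.append(leaky_relu(pair @ a_r))
        messages.append(hn)
    scores = np.concatenate(scores)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # softmax over all neighbors
    return weights @ np.concatenate(messages, axis=0)        # attention-weighted message sum

rng = np.random.default_rng(0)
target = rng.standard_normal(8)
nbrs = {"cites": rng.standard_normal((3, 8)), "written_by": rng.standard_normal((2, 8))}
print(rgat_aggregate(target, nbrs, dim_out=4, rng=rng).shape)  # (4,)
```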

They also created a new edge benchmark, Automotive PointPainting, a proxy for perception workloads that might run in self-driving cars, focused on 3D object detection from camera feeds. Adding four tests in a single round is unusual, but Miro Hodak says it was necessary to keep pace with the field: machine learning is advancing rapidly, and people need current data to make sound decisions.
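
The PointPainting technique itself fuses camera and lidar data: per-pixel class scores from an image segmentation network are "painted" onto lidar points before a 3D detector consumes them. Below is a minimal sketch of that painting step under assumed inputs (a pinhole projection matrix and a precomputed score map); names and shapes are illustrative, not taken from the benchmark implementation.

```python
import numpy as np

def paint_points(points_xyz, proj_matrix, seg_scores):
    """Append per-pixel class scores from a camera segmentation map to lidar points.

    points_xyz  : (N, 3) lidar points, assumed already in the camera frame
    proj_matrix : (3, 4) camera projection matrix
    seg_scores  : (H, W, C) class scores from an image segmentation network
    Returns an (M, 3 + C) array of 'painted' points that project into the image."""
    H, W, C = seg_scores.shape
    pts_h = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # homogeneous (N, 4)
    uvw = pts_h @ proj_matrix.T                                     # pixel coords (N, 3)
    keep = uvw[:, 2] > 1e-6                                         # points in front of the camera
    uvw, pts = uvw[keep], points_xyz[keep]
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)                # inside the image bounds
    scores = seg_scores[v[inside].astype(int), u[inside].astype(int)]
    return np.hstack([pts[inside], scores])                         # lidar xyz + class scores

rng = np.random.default_rng(1)
pts = rng.uniform(-10, 10, size=(1000, 3)) + np.array([0.0, 0.0, 15.0])  # points ahead of camera
P = np.array([[500.0,   0.0, 320.0, 0.0],
              [  0.0, 500.0, 240.0, 0.0],
              [  0.0,   0.0,   1.0, 0.0]])   # toy pinhole intrinsics, no extrinsic offset
scores = rng.random((480, 640, 4))           # fake (H, W, C) segmentation scores
print(paint_points(pts, P, scores).shape)    # (M, 7): xyz plus 4 class scores
```

The appeal of this design is that the 3D detector stays unchanged; the camera network simply enriches each point with semantic context before detection.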

The latest round comprises 17,457 performance results from 23 submitting organizations. David Kanter welcomed five first-time submitters: CoreWeave, FlexAI, GATEOverflow, Lambda, and MangoBoost. The growing number of submitters underscores how important accurate performance data has become. He also highlighted Fujitsu and GATEOverflow for submitting power measurements alongside performance results, since energy use matters more than ever for AI systems.

Kanter summed up the round by noting that machine learning keeps delivering greater capability: models are growing larger, systems are responding faster, and AI is running in more places than ever. MLCommons is proud to present results across such a wide range of systems, including several brand-new processors, and its ongoing work keeps the benchmark relevant through rapid change, so that everyone has trustworthy performance data for choosing AI hardware and software.
 
