Meta recently released two new Llama 4 models: a smaller Scout model and a mid-sized Maverick model. The company claimed that Maverick outperformed GPT-4o and Gemini 2.0 Flash on many benchmarks, but it left out an important detail about how those results were obtained.
People criticized Meta for using a specially tuned AI model in public benchmarks, which made its performance claims seem misleading. After launch, Maverick quickly rose to second place on LMArena, nearly taking the top spot. LMArena lets users compare responses from two AI models side by side and vote for the one they prefer. Meta announced that Maverick scored an Elo rating of 1417, beating GPT-4o and sitting just behind Gemini 2.5 Pro.
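For context on what a figure like 1417 means, here is a minimal sketch of how Elo-style ratings are typically updated from pairwise human votes on leaderboards like LMArena. The constants, function names, and K-factor below are illustrative assumptions, not LMArena's actual implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new ratings for A and B after one head-to-head vote.

    k is an assumed update step; real leaderboards tune this (or use a
    statistical fit over all votes rather than sequential updates).
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Example: a 1417-rated model beating a 1400-rated one gains only a few points,
# since the win was already expected; an upset loss would move ratings far more.
print(update(1417, 1400, a_won=True))
```

The practical point is that a rating gap of a few dozen points reflects only a modest edge in head-to-head human preference votes, which is why the choice of which model variant gets voted on matters so much.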
The numbers looked impressive until observers noticed something strange. Meta later admitted it had submitted a different model to LMArena than the one it planned to release publicly: an experimental chat version optimized to sound better in conversation. LMArena stated that Meta should have been clearer that it was using "Llama-4-Maverick-03-26-Experimental," a version specifically tuned for human preference tests.
LMArena changed its leaderboard policies after the incident to ensure fairer rankings in the future. A Meta spokesperson responded that the company had released its open-source version for developers to customize. Meta didn't break any rules, but it wasn't transparent either, and that raised concerns that it had gamed the system with an enhanced version unavailable to regular users.
Simon Willison, an independent AI researcher, expressed disappointment: "When Llama 4 came out and hit #2, that really impressed me — I'm kicking myself for not reading the small print." He added that the score became worthless since he couldn't access the high-scoring model. Rumors spread that Meta trained its AI specifically for certain tests, but Ahmad Al-Dahle, VP of Generative AI, denied these claims. When asked about the Sunday release date, Mark Zuckerberg simply replied that it was ready that day.