Meta model swap muddles Llama 4 test scores
[QUOTE="Munyaradzi Mafaro, post: 32457, member: 636"] Meta recently released two fresh versions of its Llama 4 AI—a smaller Scout model and a mid-sized Maverick model. The company boasted that Maverick performed better than ChatGPT-4o and Gemini 2.0 Flash on many tests, but it left out some important details from testers. People criticized Meta for using a specially tuned AI model in public benchmarks, which made their performance claims seem misleading. After launch, Maverick quickly rose to second place on LMArena, almost taking the top spot. LMArena lets users compare AI responses and vote for the best answers based on accuracy. Meta announced that Maverick scored 1417 ELO points, beating GPT-4o and sitting just behind Gemini 2.5 Pro. The numbers looked impressive until observers noticed something strange. Meta later admitted they submitted a different model to LMArena than what they planned to release publicly. They entered an experimental chat version that was optimized to sound better in conversations. LMArena stated that Meta should have been clearer about using the "Llama-4-Maverick-03-26-Experimental" version specifically designed for human preference tests. LMArena changed its leaderboard policies after this incident to ensure fair future rankings. A Meta spokesperson commented that they had released their open-source version for developers to customize. The company didn't break the rules, but it wasn't transparent enough, either. This raised concerns that Meta had gamed the system by using an enhanced version not available to regular users. Simon Willison, an independent AI researcher, expressed disappointment: "When Llama 4 came out and hit #2, that really impressed me — I'm kicking myself for not reading the small print." He added that the score became worthless since he couldn't access the high-scoring model. Rumors spread that Meta trained its AI specifically for certain tests, but Ahmad Al-Dahle, VP of Generative AI, denied these claims. When asked about the Sunday release date, Mark Zuckerberg simply replied that it was ready that day. [/QUOTE]