
Meta's vanilla Maverick AI model ranks below rivals on a popular chat benchmark

Kyle Wiggers

One of Meta's newest AI models, Llama 4 Maverick, ranks below rivals on a popular chat benchmark. Meta didn't originally reveal the score.

Earlier this week, Meta faced criticism for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on the crowdsourced benchmark, LM Arena. The incident prompted LM Arena's maintainers to apologize, revise their policies, and score the unmodified, vanilla Maverick.

Poor Performance Revealed

The unmodified Maverick, "Llama-4-Maverick-17B-128E-Instruct," ranked below models like OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro as of Friday. Many of these competing models are months old.

"The release version of Llama 4 has been added to LMArena after it was found out they cheated, but you probably didn’t see it because you have to scroll down to 32nd place which is where it ranks."@pigeon__s

Why the Discrepancy?

Meta’s experimental Maverick, Llama-4-Maverick-03-26-Experimental, was "optimized for conversationality," as explained in a chart published last Saturday. These optimizations aligned well with LM Arena’s human-rater preference system.

As previously reported, LM Arena has never been the most reliable benchmark due to its subjective nature. Tailoring a model to a benchmark not only misleads but also makes it difficult for developers to gauge real-world performance.

Meta’s Response

A Meta spokesperson stated that the company experiments with "all types of custom variants."

"'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we experimented with that also performs well on LM Arena," the spokesperson said. "We have now released our open-source version and will see how developers customize Llama 4 for their own use cases."


Kyle Wiggers is TechCrunch's AI Editor.

