Meta's vanilla Maverick AI model ranks below rivals on a popular chat benchmark
One of Meta's newest AI models, Llama 4 Maverick, ranks below rivals on a popular chat benchmark. Meta didn't originally reveal the score.
Earlier this week, Meta faced criticism for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on the crowdsourced benchmark, LM Arena. The incident prompted LM Arena's maintainers to apologize, revise their policies, and score the unmodified, vanilla Maverick.
Poor Performance Revealed
The unmodified Maverick, "Llama-4-Maverick-17B-128E-Instruct," ranked below models like OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro as of Friday. Many of these competing models are months old.
"The release version of Llama 4 has been added to LMArena after it was found out they cheated, but you probably didn’t see it because you have to scroll down to 32nd place which is where it ranks." — @pigeon__s
Why the Discrepancy?
Meta's experimental Maverick, Llama-4-Maverick-03-26-Experimental, was "optimized for conversationality," according to a chart Meta published last Saturday. Those optimizations evidently played well with LM Arena, which relies on human raters comparing model outputs and voting on which they prefer.
As previously reported, LM Arena has never been the most reliable benchmark, given its subjective, preference-based nature. Tailoring a model to a benchmark is not only misleading; it also makes it harder for developers to predict how the model will perform in real-world contexts.
Meta’s Response
A Meta spokesperson stated that the company experiments with "all types of custom variants."
"'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we experimented with that also performs well on LM Arena," the spokesperson said. "We have now released our open-source version and will see how developers customize Llama 4 for their own use cases."
Kyle Wiggers is TechCrunch's AI Editor.