
Meta's vanilla Maverick AI model ranks below rivals on a popular chat benchmark

Kyle Wiggers

One of Meta's newest AI models, Llama 4 Maverick, ranks below rivals on a popular chat benchmark. Meta didn't originally reveal the score.

Earlier this week, Meta faced criticism for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on the crowdsourced benchmark, LM Arena. The incident prompted LM Arena's maintainers to apologize, revise their policies, and score the unmodified, vanilla Maverick.

Poor Performance Revealed

The unmodified Maverick, "Llama-4-Maverick-17B-128E-Instruct," ranked below models like OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro as of Friday. Many of these competing models are months old.

"The release version of Llama 4 has been added to LMArena after it was found out they cheated, but you probably didn’t see it because you have to scroll down to 32nd place which is where it ranks."@pigeon__s

Why the Discrepancy?

Meta’s experimental Maverick, Llama-4-Maverick-03-26-Experimental, was "optimized for conversationality," as explained in a chart published last Saturday. These optimizations aligned well with LM Arena’s human-rater preference system.

As previously reported, LM Arena has never been the most reliable benchmark due to its subjective nature. Tailoring a model to a benchmark not only misleads but also makes it difficult for developers to gauge real-world performance.

Meta’s Response

A Meta spokesperson stated that the company experiments with "all types of custom variants."

"'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we experimented with that also performs well on LM Arena," the spokesperson said. "We have now released our open-source version and will see how developers customize Llama 4 for their own use cases."


Kyle Wiggers is TechCrunch's AI Editor.

