Llama 4 review: Meta's MoE open model
Release date: reportedly April 2026 | Status: Active | Licence: Open weights (Llama 4 Community License)
Note on dates and figures: this review carries several numbers we could not confirm against Meta's own documentation. Meta's official announcement puts the Llama 4 launch (Scout and Maverick) at April 2025, not 2026, and the published specs differ from some figures below. Where a claim is unconfirmed, we say so plainly and keep the number visible so you can judge it yourself.
Benchmarks at a glance
| Metric | Score | Context |
|---|---|---|
| SWE-bench Pro | 50.2% (unconfirmed) | See note below |
| MMLU | ~85% | Solid for an open model |
| Context window | 256K tokens (claimed) | Meta lists 1M (Maverick) and 10M (Scout) |
| Price (input) | Weights free; hosted API paid | n/a |
| Price (output) | Weights free; hosted API paid | n/a |
| Licence | Open weights | Self-hostable, with conditions |
Meta's pitch with Llama 4 is simple enough that any business owner can follow it: download the model, run it on your own hardware, and stop paying a vendor per question. That is a genuinely different deal from the metered API world most teams live in, and it is the reason Llama matters even when it doesn't top the leaderboards.
The catch is that "open" and "free" aren't the same thing, and the marketing around this release blurs the two. The model weights are free to download. Running them is not: you either buy GPUs or rent a hosted API that charges per token. Some of the eye-catching numbers floating around about Llama 4, including its release date and a few headline benchmarks, also don't line up with Meta's own published figures. We flag those as we go.
So the honest framing is this. Llama 4 is a capable, broadly useful open model that can save a team that already has GPUs a lot of money. It is not a magic "free forever" button, and it is not the best model at any single task. For Australian teams weighing self-hosting against a paid API, that distinction is the whole decision.
The MoE architecture
Llama 4 uses a sparse Mixture-of-Experts design. Per Meta's Maverick model card, the Maverick variant has roughly 400 billion total parameters but only activates about 17 billion of them per token. That is a real break from Llama 3, which used a dense architecture where every parameter fires on every token, and it brings Meta into line with how most frontier labs now build models (Meta's Llama 4 blog calls these its first native MoE models).
The practical upshot of a sparse design: you get the knowledge capacity of a very large model without paying the full inference cost on every request, because only a slice of the network runs at a time.
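To make "only a slice of the network runs at a time" concrete, here is a toy sketch of top-k expert routing in plain Python. It is an illustration under our own assumptions, not Meta's router: the eight "experts", the random gate scores and the choice of k=2 are all hypothetical. The last two lines reuse the 400B/17B figures quoted above.

```python
import math
import random

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the mixing weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_logits, k)
    output = sum(w * experts[i](x) for i, w in routes)
    return output, [i for i, _ in routes]

# Hypothetical toy setup: 8 "experts", each just scales its input.
experts = [lambda x, s=s: x * s for s in range(1, 9)]
random.seed(0)
gate_logits = [random.gauss(0, 1) for _ in experts]  # stand-in for a learned router

y, used = moe_forward(2.0, experts, gate_logits, k=2)
print(f"experts run: {len(used)} of {len(experts)}")

# The same idea at the scale quoted in the text (rounded figures).
active_fraction = 17e9 / 400e9
print(f"active share of parameters per token: {active_fraction:.2%}")
```

The skipped experts cost no compute for that token, which is where the inference saving comes from. They still have to be stored somewhere, though, which matters for the hardware discussion later.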
Performance assessment
On coding, the picture is murky. The article's headline of 50.2% on SWE-bench Pro, framed as a 6.8-point jump over Llama 3.1's final release, is one we couldn't verify. Llama 4 doesn't appear on the SWE-bench Pro leaderboard we checked, and an independent SWE-bench Lite run put Maverick far lower, around 8%. Treat the 50.2% figure as unconfirmed. What we can say with more confidence: Llama 4 handles routine engineering work (boilerplate, simple debugging, code review) better than it handles complex multi-file changes or novel algorithmic problems. It is a useful assistant, not a senior engineer.
On general knowledge it holds up well. Independent trackers like llm-stats put Maverick's MMLU around 85%, which is strong for an open model. We couldn't confirm the article's specific comparison numbers (GPT-5.5 Instant at 84.2% and Qwen 3 at 84.6%) against any source, so read those as unverified. The broad point still stands: for Q&A, summarisation, and content generation, Llama 4 is more than adequate.
The self-hosting proposition
Because the weights are free to download, the cost of running Llama 4 yourself is infrastructure. The article suggests a single H100 can serve the Q4 quantised version with acceptable latency for internal tools, and a pair of H100s for production. Those numbers look optimistic for Maverick, because an MoE model has to keep all of its experts in memory even though only about 17 billion parameters run per token. They sit more comfortably with the smaller Scout variant, which Meta says fits on a single H100 with Int4 quantisation. We couldn't find an authoritative source confirming the exact hardware recommendations, so take them as a rough starting estimate rather than a spec.
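A back-of-envelope check helps when judging hardware claims like these. The sketch below estimates memory for the weights alone, assuming exactly 4, 8 or 16 bits per weight and an 80 GB H100. Both are our simplifications: real Q4 formats use slightly more than 4 bits per weight, and the KV cache and activations need memory on top. The function names are ours.

```python
import math

H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant

def weight_memory_gb(total_params, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9

def min_gpus(total_params, bits_per_weight, gpu_gb=H100_MEMORY_GB):
    """Fewest GPUs whose combined memory holds the weights, ignoring overhead."""
    return math.ceil(weight_memory_gb(total_params, bits_per_weight) / gpu_gb)

maverick_total = 400e9  # total parameters, per the figure quoted above

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit (Q4)", 4)]:
    gb = weight_memory_gb(maverick_total, bits)
    gpus = min_gpus(maverick_total, bits)
    print(f"{label}: ~{gb:.0f} GB of weights, at least {gpus} x 80 GB GPUs")
```

Weight memory scales with total parameters, not active ones, because any expert may be picked for the next token. Offloading experts to CPU memory or disk can shrink the GPU count, at a cost in latency.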
The economics are still the draw. Once the hardware is paid off, each additional request costs you electricity rather than per-token API fees. One thing to keep in mind: the Llama 4 Community License isn't unconditional. It restricts access to the multimodal models for some EU users and requires a separate commercial licence for companies above 700 million monthly active users. Neither condition is likely to bite most Australian businesses, but the licence is worth reading before you build on it.
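The "paid off" point in the economics above can be roughed out as a break-even calculation. Every figure in this sketch is a hypothetical placeholder in AUD, not a quote from any vendor, and the model deliberately ignores staffing, hosting, depreciation and idle time, all of which push the real break-even further out.

```python
def break_even_million_tokens(hardware_cost, power_cost_per_million, api_price_per_million):
    """Millions of tokens at which owning hardware matches paying a per-token API.

    Returns None when the API is no dearer than your own running cost,
    in which case self-hosting never catches up.
    """
    saving_per_million = api_price_per_million - power_cost_per_million
    if saving_per_million <= 0:
        return None
    return hardware_cost / saving_per_million

# Hypothetical inputs: swap in your own quotes before drawing conclusions.
millions = break_even_million_tokens(
    hardware_cost=120_000.0,         # up-front GPU server spend
    power_cost_per_million=0.10,     # electricity per million tokens served
    api_price_per_million=1.00,      # blended hosted-API price per million tokens
)
print(f"break-even after ~{millions:,.0f} million tokens")
```

If your monthly volume is small relative to that figure, a hosted API is likely the cheaper route even though the weights themselves cost nothing.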
Verdict
Llama 4 isn't the best model at any one thing, but it's good enough at most things, and you can run it on your own gear. For startups, researchers, and teams that already own GPUs, it's a sensible default to start with. Move to a paid model when you hit a specific capability wall, and treat the "completely free" framing with caution, because the free part is the weights, not the running of them.
Score: 7.8/10 (capability) | 9.5/10 (value)



