Llama 4 vs Qwen 3 vs Mistral Large 2: Open model showdown
The open-weights world has three big names worth knowing: Llama 4 from Meta, Qwen 3 from Alibaba, and Mistral Large 2 out of Paris. Each one can stand in for a closed model like GPT or Claude, but they're built for different jobs. Picking the right one comes down to what you actually need it for.
If you run an AI tool inside your business, you've probably noticed how much of the conversation assumes you're paying a US company per token. The open-weights models break that assumption. You can download the weights, run them on your own hardware, and stop sending your data to someone else's servers. That's the appeal, and it's a real one for Australian teams worried about cost or where their customer data ends up.
But "open" doesn't mean "free of decisions." Llama 4 costs nothing to license, yet you need the machines to run it. Qwen 3 is cheap to rent by the token and strong across Asian languages. Mistral Large 2 costs more, but it's a European company with European data rules baked in. Three good options, three different trade-offs.
A word of caution before the numbers below: some of the headline benchmark figures floating around for these models don't hold up against the public leaderboards. We've flagged those as we go. Treat any single "score" as a starting point for your own testing, not gospel.
Three-way benchmarks
| Metric | Llama 4 | Qwen 3 | Mistral Large 2 |
|---|---|---|---|
| SWE-bench Pro | 50.2% | 46.2% | 48.6% |
| MMLU | 84.8% | 84.6% | 85.1% |
| Context window | 256K | 128K | 256K |
| Price (input) | Free | $0.40 / 1M | $2.00 / 1M |
| Price (output) | Free | $1.20 / 1M | $6.00 / 1M |
| Licence | Open | Open | Open |
A few of these figures need a health warning. The SWE-bench Pro numbers in this table (50.2 / 46.2 / 48.6%) don't match the actual Scale AI public leaderboard, which lists Llama 4 Maverick at around 5.24% and Qwen3-Coder at 38.70%, with Mistral Large 2 not listed at all. On that leaderboard the top model sits near 59%, and even strong models cluster well below the figures shown here. So read the SWE-bench row as unconfirmed, and note that the real ordering reverses the one above: Llama 4 is the weakest of the listed pair, not the strongest.
The MMLU scores are roughly in the right ballpark. ComputingForGeeks' open-source LLM comparison puts Llama 4 Maverick near 85.5% and Qwen 3 in the low-to-mid 80s. The exact per-model figures here can't be traced to a primary source, though, and the claim that Mistral edges out Llama 4 cuts against most reports.
The context windows are also softer than the table suggests. Meta pre-trained Llama 4 at 256K, but the released Instruct models go much further: Maverick up to 1M tokens and Scout up to 10M, per Meta's own announcement. Qwen 3 commonly runs 128K, though some variants reach much higher (Qwen3-Max around 262K). And Mistral Large 2's window is widely documented as 128K, not 256K, on the official model card.
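Finally, "Open" is doing a lot of work in that licence row. All three publish their weights, but the terms differ. Llama 4 ships under the Llama 4 Community License and Mistral Large 2 under the Mistral Research License, and neither is an OSI-approved open-source licence. For Mistral Large 2, commercial self-deployment needs a separate commercial licence. Qwen 3's open-weight releases are published under Apache 2.0, which is more permissive, though you should confirm the licence for the exact variant you plan to use. Worth reading the fine print before you build a product on top of any of them.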
Llama 4: The free default
On price, Llama 4 is hard to argue with. The weights download at no cost, and Meta's hosted Llama API has been offered free as well. Meta's mixture-of-experts design holds up across coding and knowledge tasks. (The benchmark ranking claiming it's the clear leader of the three is the one we've flagged above, so don't lean on that part.)
The catch is the infrastructure. To self-host Llama 4 you need GPUs, and the real cost shows up in hardware, electricity, and the people who keep it running. If you already have GPU capacity sitting idle, the extra cost is small. If you're starting from a blank slate, renting API access to Qwen or Mistral is the simpler path, even if it isn't technically "free."
Qwen 3: The multilingual choice
Qwen 3 sits a little behind on the English coding benchmarks, but where it reportedly pulls ahead is languages. For Mandarin, Japanese, Korean, and Southeast Asian languages, Qwen is widely regarded as the strongest of the three. That reputation lines up with how the model is built and trained, though we haven't found a published head-to-head benchmark proving it beats Llama 4 and Mistral across every one of those languages, so treat it as a strong rule of thumb rather than a measured fact.
On price, the quoted $0.40 input / $1.20 output per million tokens is cheap enough that infrastructure stops being a worry. Check it against your provider, though: Qwen pricing is tiered and varies a lot by variant, and we couldn't match that flat rate to a primary source. The 128K context window is the practical ceiling for big-document work, which is fine for most jobs but tight if you're feeding it long contracts or codebases.
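If you want a quick sanity check on whether a document is likely to fit in a 128K window before you send it, something like the sketch below can help. It assumes roughly four characters per token, which is only a rule of thumb for English text and not a real tokenizer, and the `reserve_for_output` default is a hypothetical allowance for the model's reply. Treat the answer as a rough guide and use the provider's own tokenizer when the margin is tight.

```python
# Rough rule of thumb for English text. This is an assumption, not a tokenizer:
# real token counts vary by model, language, and content (code is often denser).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` by rounding up len / CHARS_PER_TOKEN."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(text: str,
                    window_tokens: int = 128_000,
                    reserve_for_output: int = 4_000) -> bool:
    """Return True if the estimated prompt plus reserved reply space fits the window."""
    return estimate_tokens(text) + reserve_for_output <= window_tokens


# Example: a 400,000-character contract is about 100,000 estimated tokens.
contract = "x" * 400_000
print(estimate_tokens(contract), fits_in_context(contract))
```

A codebase or a non-English document can tokenize very differently, so leave generous headroom if you rely on an estimate like this.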
Mistral Large 2: The European specialist
Mistral Large 2 is the priciest of the three at $2 input / $6 output per million tokens, and what you pay for is European language handling. For French, German, Spanish, Italian, and the Scandinavian languages, it's reported to come out ahead of both rivals. As with Qwen, that claim fits Mistral's reputation more than any single published benchmark we could cite.
The bigger draw for some teams is where Mistral lives. It's a Paris-based company, so for organisations with EU data-residency rules, its European headquarters and GDPR posture are genuine practical advantages. That matters less for an Australian business serving local customers, but if you operate in or sell into Europe, it's a real point in Mistral's favour.
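To see what the per-token prices quoted in this article mean for a real bill, here is a small sketch that turns them into a monthly figure. The rates are the ones from the table (which we couldn't confirm against provider price lists), and the workload of 50 million input and 10 million output tokens a month is a made-up example, so swap in your own numbers.

```python
# USD per 1M tokens, as quoted in this article's table.
# These are unverified against provider price lists; check before budgeting.
PRICES = {
    "Qwen 3": {"input": 0.40, "output": 1.20},
    "Mistral Large 2": {"input": 2.00, "output": 6.00},
}


def monthly_api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a month's usage at the quoted per-million rates."""
    rate = PRICES[model]
    return (input_tokens / 1_000_000) * rate["input"] \
        + (output_tokens / 1_000_000) * rate["output"]


# Hypothetical workload: 50M input tokens and 10M output tokens per month.
for name in PRICES:
    cost = monthly_api_cost(name, 50_000_000, 10_000_000)
    print(f"{name}: ${cost:,.2f}")
# Qwen 3: $32.00
# Mistral Large 2: $160.00
```

At that volume the gap is about $128 a month, which for many teams is small next to the language or data-residency fit. The difference grows in direct proportion to usage.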
Verdict
Pick Llama 4 if you already have GPU infrastructure and want a capable model with no licence cost. Pick Qwen 3 for Asian-language work and low per-token API pricing. Pick Mistral Large 2 for European languages and EU data-residency needs. All three are solid working models, and the right one depends on your situation more than on any leaderboard, especially since several of the leaderboard figures quoted for these models don't survive a check against the public sources. Run a short pilot on your own tasks before you commit.
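A pilot doesn't need to be elaborate. The sketch below shows one possible shape: a handful of prompts taken from your own work, each with a simple pass/fail check, run against every candidate. The function names and the stub models are hypothetical placeholders. In practice each stub would be replaced by a call to whichever API or self-hosted endpoint you are testing.

```python
from typing import Callable, Dict, List, Tuple

# A test case is a prompt plus a function that decides whether a reply passes.
Case = Tuple[str, Callable[[str], bool]]


def run_pilot(models: Dict[str, Callable[[str], str]],
              cases: List[Case]) -> Dict[str, float]:
    """Return, for each model, the fraction of test cases its replies pass."""
    if not cases:
        raise ValueError("need at least one test case")
    results = {}
    for name, ask in models.items():
        passed = sum(1 for prompt, check in cases if check(ask(prompt)))
        results[name] = passed / len(cases)
    return results


# Hypothetical stand-ins for real model calls, so the sketch runs offline.
def stub_model_a(prompt: str) -> str:
    return "Paris" if "France" in prompt else "unknown"


def stub_model_b(prompt: str) -> str:
    return "unknown"


cases = [
    ("What is the capital of France?", lambda reply: "Paris" in reply),
    ("What is the capital of Japan?", lambda reply: "Tokyo" in reply),
]

scores = run_pilot({"model-a": stub_model_a, "model-b": stub_model_b}, cases)
print(scores)  # {'model-a': 0.5, 'model-b': 0.0}
```

Twenty or thirty cases drawn from real tickets, documents, or code reviews will usually tell you more about fit than any public score.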



