Model Review

Best free models: Llama 4, Qwen 3, and self-hosted options.

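Llama 4 carries no licence fee and no per-token charge, though it ships under Meta's Community License rather than a standard open-source one. Qwen 3's managed API is priced at $0.40/$1.20, but the weights are free. We explore the best zero-cost AI options and the infrastructure you need to run them.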

Daniel Fleuren | 2026-06-15 | 11 min read | Founders and operators | Updated 2026-06-19

Written by Daniel Fleuren, Founder, AI Kick Start. 20+ years enterprise IT

Decision: Shortlist. Score tools by workflow fit, data handling, owner readiness, and cost at scale before buying seats.

Risk to watch: Shelfware. A capable tool still fails if nobody owns the workflow or checks whether it is used weekly.

Proof to collect: Pilot score. Run one real task through each shortlisted tool and record quality, time saved, and support burden.

Key takeaways

"Free" is the wrong word for an AI model: open weights carry no licence fee, but you still pay for the GPU they run on.
Truly free (Llama 4): Meta's [Llama 4](https://www.llama.com/models/llama-4/) is the closest thing here to a no-strings option, though "completely free, permanently" oversells it.
Infrastructure requirements (Llama 4): a single A100 40GB at minimum (Q4 quantisation); dual A100 80GB or a single H100 recommended (Q5 quantisation); 2x H100 or 4x A100 for concurrent production serving.
Monthly operating cost (self-hosted, dual A100): cloud rental ~$4,000-6,000; on-premise power ~$500-800; amortised on-premise hardware ~$1,500-2,500.
Free weights, paid API (Qwen 3, DeepSeek V3.5, MiniMax M3): these models hand you the weights for self-hosting but also sell a managed API.

"Free" is the wrong word for an AI model, and it trips up more budgets than almost anything else in this space. Yes, you can download the weights for Meta's Llama 4 or Alibaba's Qwen without paying a cent in licensing. But the moment you run one, you're paying for the GPU it sits on, and that bill arrives whether or not anyone uses the thing.

So when a vendor or a blog post says "free open model," what they usually mean is "no licence fee, you bring the compute." For an Australian business team weighing self-hosting against a paid API, that distinction is the whole game. Get it wrong and you'll spend $6,000 a month on rented hardware to avoid a $2,000 API bill.

This is a guide to that decision: which open-weight models are genuinely worth running yourself, what hardware they need, and the point where hosting your own actually beats paying per token. A note up front: the open-model field moves fast, several of the specific benchmark figures below come from the vendors themselves rather than independent testing, and at least one model name in the original comparison turned out not to exist. We've flagged those as we go.

Truly free: Llama 4

Meta's Llama 4 is the closest thing here to a no-strings option, though "completely free, permanently" oversells it. Meta publishes the weights and the inference code, so there's no per-token charge and you can run it as long as you like. But it ships under the Llama 4 Community License, not a standard open-source licence: companies above 700 million monthly active users have to ask Meta for permission, you're required to display "Built with Llama" attribution, and the multimodal versions are off-limits to organisations based in the EU. (Claims floating around about Meta offering "subsidised cloud hosting partnerships" as a free perk are unconfirmed, and we couldn't find anything backing them up.)

On performance, treat the headline numbers with caution. A figure of 84.8% on MMLU is plausible for Llama 4's larger variant, though we couldn't confirm it to the decimal against an official Meta page (Llama 4 guide). The coding story is weaker than the original draft suggested: a "50.2% on SWE-bench Pro" claim doesn't hold up. Independent testing puts Llama 4 Maverick closer to 8% on SWE-bench Lite and around 5% on SWE-bench Pro (LayerLens benchmark). In short, Llama 4 is a capable general model but not a strong agentic coder. Plan accordingly.

The infrastructure side is more concrete, though the figures below are reasonable estimates rather than published guarantees. Real requirements shift depending on which Llama 4 variant and quantisation you pick (Llama 4 Maverick model card).

Infrastructure requirements:

Minimum: Single A100 40GB (Q4 quantisation)
Recommended: Dual A100 80GB or single H100 (Q5 quant)
Production: 2x H100 or 4x A100 for concurrent serving
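
A quick way to sanity-check tiers like these for any open-weight model is to estimate how much memory the weights alone occupy at a given quantisation. The sketch below is a rough back-of-envelope calculation, not a sizing tool: the function names, the 70B example model, and the flat 20% overhead allowance are all assumptions made for illustration, and real quantisation formats often store slightly more than their nominal bit count.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # One billion parameters at 8 bits each is 1 GB, so scale by bits / 8.
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 gpu_gb: float, gpu_count: int = 1,
                 overhead: float = 0.2) -> bool:
    """True if weights plus a flat overhead allowance fit in total VRAM.

    The default 20% overhead (KV cache, activations, runtime) is an
    assumed placeholder, not a measured figure.
    """
    needed = weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead)
    return needed <= gpu_gb * gpu_count


if __name__ == "__main__":
    # Hypothetical 70B-parameter model at 4-bit quantisation.
    print(weight_memory_gb(70, 4))         # weights only, in GB
    print(fits_in_vram(70, 4, gpu_gb=40))  # single 40 GB card
    print(fits_in_vram(70, 4, gpu_gb=80))  # single 80 GB card
```

For mixture-of-experts models, feed in the total parameter count rather than the active-per-token count, since all experts normally have to sit in memory even though only a few are used for each token.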

Monthly operating cost (self-hosted, dual A100):

Cloud rental: ~$4,000-6,000/month
Power (on-premise): ~$500-800/month
Amortised hardware (on-premise): ~$1,500-2,500/month
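
To see how monthly figures like these are built up, here is a minimal sketch that converts an hourly GPU rental rate into a monthly bill and sums the two on-premise lines. The $3.00 per GPU-hour rate is a hypothetical input chosen only because it lands inside the cloud range above; the on-premise inputs are the range endpoints quoted in this section. Staff time, cooling, and networking are not included.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def cloud_monthly(rate_per_gpu_hour: float, gpu_count: int,
                  hours: float = HOURS_PER_MONTH) -> float:
    """Monthly rental cost for GPUs billed by the hour."""
    return rate_per_gpu_hour * gpu_count * hours


def on_prem_monthly(power: float, amortised_hardware: float) -> float:
    """Monthly on-premise cost: power plus amortised hardware only."""
    return power + amortised_hardware


if __name__ == "__main__":
    # Hypothetical rate: $3.00 per GPU-hour, two GPUs, running all month.
    print(cloud_monthly(3.00, gpu_count=2))
    # Low and high ends of the on-premise ranges quoted above.
    print(on_prem_monthly(500, 1500))
    print(on_prem_monthly(800, 2500))
```

Passing a smaller `hours` value shows how much of the rental bill disappears if the instance can be shut down outside business hours, which on-premise hardware does not allow you to claw back.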

Free weights, paid API: Qwen 3, DeepSeek V3.5, MiniMax M3

These models hand you the weights for self-hosting but also sell a managed API. The weights are permissively licensed, so once you've got them you can run them indefinitely without paying anyone.

Qwen 3 (reportedly 46.2% SWE-bench, 84.6% MMLU): the smallest of the three, and it runs comfortably on a single A100 40GB. A caveat on the name and the numbers: the Qwen open-weight family is real and Apache-licensed (Qwen3 guide), but by mid-2026 the current flagship is the Qwen 3.6 series, so "Qwen 3" is already a little dated. The specific scores quoted here don't match any official Qwen benchmark we could find and are best read as approximate. Where Qwen genuinely earns its place is Chinese and other Asian-language work; that strength is well established.

DeepSeek V3.5: worth a clear warning here. No model called "DeepSeek V3.5" was ever released. DeepSeek's actual line runs from V3 (December 2024) through V3.2 in late 2025, then V4 in April 2026, and it's V4, not any "V3.5," that carries the 1M-token context window (DeepSeek on GitHub; DeepSeek-V3.2 on Hugging Face). The "52.4% SWE-bench, 85.8% MMLU, 1M context" row in the table below appears to conflate features from several real models. If long-context self-hosting is your goal, look at DeepSeek V4 (for 1M context) or V3.2, and ignore the fabricated "V3.5" label. DeepSeek V3.2's real SWE-bench Verified score sits around 72-74%, well above the figure quoted.

MiniMax M3 (59.0% SWE-bench Pro, reportedly 86.4% MMLU, 1M context): the most capable of the bunch and the largest, needing dual H100s to run well. It launched on 1 June 2026 as a 428B-parameter mixture-of-experts model (about 23B active per token) with a 1M-token context window and native multimodality (The Decoder on MiniMax M3). Two things to keep in mind: the 59.0% SWE-bench Pro figure is company-reported on MiniMax's own setup, with independent verification still pending at launch, and the open weights hadn't actually shipped on day one (they were due within about ten days). The 86.4% MMLU number we couldn't verify against any source (DataNorth launch coverage), so treat it as unconfirmed.

The self-hosting decision matrix

Factor	Llama 4	DeepSeek V3.5	MiniMax M3
Best for	General use	Long-context	Coding
Hardware	A100 40GB+	Dual A100 / H100	Dual H100
Monthly cost*	$4K-6K	$5K-8K	$8K-12K
SWE-bench Pro	50.2%	52.4%	59.0%
MMLU	84.8%	85.8%	86.4%
1M context	No	Yes	Yes

*Cloud rental estimates. Note: the SWE-bench Pro and MMLU figures in this table read as an internally consistent set rather than independently sourced numbers. The Llama 4 SWE-bench Pro figure in particular contradicts independent testing (closer to 5%), the "DeepSeek V3.5" column refers to a model that doesn't exist (see DeepSeek V4 or V3.2 instead), and the MiniMax M3 scores are company-reported. Use these as rough orientation, not procurement data.

When self-hosting makes sense

Running your own makes financial sense in a handful of situations:

Your monthly API spend is climbing past about $5,000. That break-even is a rule of thumb, not a law: the real crossover depends on how hard you push the hardware, how you finance it, and what the API actually charges, so model it against your own usage before committing.
You have strict data residency requirements.
You need very high throughput with no rate limits.
You already own GPU infrastructure that's sitting underused.
You want to fine-tune on proprietary data.
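
The cost criterion in the list above can be modelled in a few lines. This is a minimal sketch under stated assumptions: the function names are made up for illustration, the per-million-token price is a hypothetical blended rate (not a quote from any vendor), and it treats self-hosting as a fixed monthly cost regardless of usage. The $6,000 and $2,000 figures reuse the example from the introduction.

```python
def api_monthly_cost(tokens_millions: float, price_per_million: float) -> float:
    """API bill for a month, given volume in millions of tokens."""
    return tokens_millions * price_per_million


def break_even_tokens_millions(self_host_monthly: float,
                               price_per_million: float) -> float:
    """Monthly volume (millions of tokens) at which both options cost the same."""
    if price_per_million <= 0:
        raise ValueError("price_per_million must be positive")
    return self_host_monthly / price_per_million


def cheaper_option(api_monthly: float, self_host_monthly: float) -> str:
    """Return the cheaper option on cost alone; a tie goes to the API."""
    return "self-host" if self_host_monthly < api_monthly else "api"


if __name__ == "__main__":
    # The introduction's example: $6,000 of rented hardware vs a $2,000 API bill.
    print(cheaper_option(api_monthly=2000, self_host_monthly=6000))
    # Hypothetical blended API price of $2.00 per million tokens.
    print(break_even_tokens_millions(6000, price_per_million=2.00))
```

Cost is only one of the criteria listed; data residency, throughput, or fine-tuning needs can justify self-hosting even when this calculation favours the API.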

Verdict

Llama 4 is the sensible default if you just want a free open model with the lowest hardware bar and no per-token charge, provided you can live with its Community License terms and you're not leaning on it for heavy agentic coding. If long context is what you're after, skip the mislabelled "V3.5" and go straight to DeepSeek V4 (or V3.2), which give you genuine long-context capability per dollar of infrastructure. And if you want the strongest open-weights coding model and can absorb the dual-H100 cost, MiniMax M3 is the one to watch, with the caveat that its benchmarks were still self-reported and its weights barely out the door at the time of writing.

Source trail

Primary references to keep this briefing grounded

AI and automation information changes quickly. Use these official or primary references to verify the claims, pricing, product behaviour, and compliance details before committing budget or production data.

Meta Llama documentation

What to do next

Write the job-to-be-done before looking at another product.
Score each shortlisted tool for workflow fit, data handling, cost, and owner readiness.
Run one small pilot and remove anything the team does not use weekly.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.
