Analysis
For two years, if you wanted an AI model to do your specific job well, you retrained it. You gathered examples, you ran an expensive training job, and you ended up with a model tuned to your task. A whole cottage industry grew up around that work: tools, consultants, services, all selling the same promise.
That promise is fading. The reason is almost embarrassingly simple. Models can now read far more in one sitting than they could even a year ago, and they actually remember what they read. So instead of baking your knowledge into a model's weights over a week of training, you can paste your documentation straight into the prompt and get answers that are just as good, often within seconds, for a fraction of the cost.
For an Australian business team, the practical upshot is this: the slow, costly path to a "custom" AI is no longer the obvious one. The faster path (feeding the model your manuals, your policies and your product docs at the moment you ask) has caught up and, in a lot of cases, overtaken it. Fine-tuning isn't gone. But it's stopped being the first thing you reach for.
Here's what's driving the shift, and where retraining still earns its keep.
The Context Window Revolution
The most direct pressure on fine-tuning comes from how much a model can read at once. When GPT-3 launched, its context window was about 2,000 tokens (GPT-3, Wikipedia). You couldn't fit any real task documentation into a prompt that small, so retraining the weights was the only way to make the model "know" your domain. Today, models like MiniMax M3 and Gemini 3.5 Flash offer 1-million-token contexts. That is enough room for an entire codebase, a product documentation library, or a full customer-support knowledge base, dropped straight into the prompt. (Some coverage also lists a "DeepSeek V3.5" in this group, but that model name appears to be unconfirmed; DeepSeek's 1M-context model at this point is V4.)
That changes the maths of adapting a model. Rather than spending weeks preparing training data, running a costly fine-tuning job, and grading the output, you include the relevant documentation in the prompt and get comparable or better results. It's faster and cheaper, and it bends easily: change the underlying documents and the next prompt reflects it, with no retraining required.
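As a rough illustration of what "include the relevant documentation in the prompt" can look like, here is a minimal Python sketch. It is not tied to any provider: the function names, the file names, the sample text and the four-characters-per-token estimate are all assumptions made for the example, and a real setup would use the provider's own tokenizer and API.

```python
from typing import Dict

CHARS_PER_TOKEN = 4  # rough assumption for English text, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; swap in the provider's tokenizer for real use."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def build_prompt(docs: Dict[str, str], question: str, max_tokens: int = 1_000_000) -> str:
    """Concatenate documents into one prompt, skipping any that would exceed the budget."""
    header = "Answer using only the reference documents below.\n"
    footer = f"\nQuestion: {question}\n"
    used = estimate_tokens(header) + estimate_tokens(footer)
    parts = [header]
    for name, text in docs.items():
        section = f"\n--- {name} ---\n{text}\n"
        cost = estimate_tokens(section)
        if used + cost > max_tokens:
            continue  # this document does not fit in the remaining budget
        parts.append(section)
        used += cost
    parts.append(footer)
    return "".join(parts)


# Hypothetical documents, invented for this example.
docs = {
    "returns_policy.txt": "Items can be returned within 30 days with a receipt.",
    "shipping.txt": "Standard shipping takes 3 to 5 business days.",
}
prompt = build_prompt(docs, "How long do customers have to return an item?")
print(prompt)
```

Updating the model's "knowledge" in this arrangement means editing the entries in `docs`; the next call to `build_prompt` reflects the change.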

Improved In-Context Learning
The second shift is that models have got much better at actually using the context you give them. Early models treated a long prompt a bit like packing material. They handled information near the start and end well, then lost the thread in the middle. This "lost in the middle" pattern, documented by Liu and colleagues in Lost in the Middle: How Language Models Use Long Contexts, made long-context approaches hard to trust.
Newer models have largely worked past it. MiniMax reportedly cites very high needle-in-a-haystack retrieval at 1M tokens for M3, though no specific figure is published on its official blog or model page, so treat the claim as unconfirmed rather than a benchmarked fact. Google's Gemini models show similar reach, and even models with "only" 128K-256K windows tend to perform reliably across their whole range.
What this means in practice: putting your task documentation in the prompt is now a real alternative to fine-tuning for most work. Give a model a well-built prompt with the right examples and reference material, and on many tasks it matches what a fine-tuned model would do.
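Before relying on a long prompt, a team can run its own small needle-in-a-haystack check. The Python sketch below shows the general shape of such a probe, under stated assumptions: `ask_model` is a hypothetical callable standing in for whatever model API is in use, the filler sentence and the "needle" fact are invented test data, and `fake_model` exists only so the example runs without a network call.

```python
from typing import Callable, Dict, List


def build_haystack(filler: str, needle: str, depth: float, total_sentences: int) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    sentences: List[str] = [filler] * total_sentences
    position = int(depth * total_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def run_probe(ask_model: Callable[[str], str], depths: List[float]) -> Dict[float, bool]:
    """Ask the same question with the needle at each depth; record whether it was found."""
    needle = "The maintenance code for the Brisbane warehouse is 7431."  # invented fact
    question = "What is the maintenance code for the Brisbane warehouse?"
    results: Dict[float, bool] = {}
    for depth in depths:
        context = build_haystack("The sky was clear that day.", needle, depth, 1000)
        answer = ask_model(f"{context}\n\n{question}")
        results[depth] = "7431" in answer
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real model: a plain string search, so it never misses."""
    return "7431" if "7431" in prompt else "unknown"


results = run_probe(fake_model, [0.0, 0.25, 0.5, 0.75, 1.0])
print(results)
```

With a real model in place of `fake_model`, any depth that comes back `False` marks a part of the context where the model is dropping information, which is the "lost in the middle" pattern in miniature.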
The Cost Calculation
Fine-tuning has never been cheap, and it has got dearer as models have grown. Retraining something the size of Llama 4 (400B parameters) or GLM-5.2 (753B parameters) needs serious GPU time: by most reasonable estimates, tens of thousands of dollars for a single run, on top of the engineering hours to prepare data, babysit the training, and grade the results. Those dollar figures are uncited order-of-magnitude estimates rather than published prices, so read them as ballpark.
In-context learning, by contrast, costs nothing extra to develop and adds only the inference cost of the longer prompt in production. Estimates put 100,000 tokens of context at roughly $0.01-0.03 per request on the cheaper providers, though premium models run higher (Gemini 3.5 Flash sits closer to $0.15 per 100K input tokens, per OpenRouter pricing). Either way, it's a rounding error next to a fine-tuning run.
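To put "rounding error" in numbers, the short Python sketch below works out how many 100,000-token requests it would take to spend as much as one fine-tuning run. The per-request prices are the estimates quoted above; the $20,000 fine-tuning figure is a placeholder chosen from within "tens of thousands of dollars", not a published price, and the function and label names are made up for the example.

```python
# Illustrative figures only: the per-request prices are the estimates quoted
# in the text, and the fine-tuning cost is an assumed placeholder.
FINE_TUNE_RUN_USD = 20_000.0  # assumption, not a published price

PRICE_PER_100K_REQUEST_USD = {
    "cheaper provider (low)": 0.01,
    "cheaper provider (high)": 0.03,
    "premium model": 0.15,
}


def requests_to_match(fine_tune_cost: float, cost_per_request: float) -> int:
    """Number of long-context requests whose total cost equals one fine-tuning run."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return round(fine_tune_cost / cost_per_request)


def break_even_table(fine_tune_cost: float = FINE_TUNE_RUN_USD) -> dict:
    """Break-even request count for each price tier."""
    return {
        label: requests_to_match(fine_tune_cost, price)
        for label, price in PRICE_PER_100K_REQUEST_USD.items()
    }


for label, count in break_even_table().items():
    print(f"{label}: about {count:,} requests")
```

The comparison is deliberately simple. It ignores the engineering hours around a fine-tuning run and the fine-tuned model's own inference costs, so it indicates scale rather than giving a full cost model.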
The gap widens once you account for upkeep. A fine-tuned model is frozen in time. When your product documentation changes, your knowledge base updates, or your task shifts, you retrain. An in-context setup picks up those changes the moment you edit the prompt content.
When Fine-Tuning Still Makes Sense
None of this kills fine-tuning. There are jobs where it still beats in-context learning outright. Work that demands very low latency benefits, because shorter prompts process faster. Work with rigid output formats that have to come out identical every time benefits from weight-level adaptation. And work where the training data carries subtle patterns that are hard to spell out in a prompt benefits from the model absorbing those patterns through retraining.
Kimi K2.7-Code is the usual example here. Fine-tuned on a company's own codebase, it reportedly beats the base-model-plus-context setup by a wide margin on internal coding tasks. The size of that improvement has no supporting source and looks like an illustrative estimate, so don't bank on any exact number. The general point holds: for organisations whose core business is writing code, that kind of gain can justify the fine-tuning bill.


