nanochat: From $48 GPT-2 to understanding LLMs.

How Andrej Karpathy's nanochat takes you from complete beginner to understanding every component of a large language model.

TL;DR

Karpathy's nanochat is a small, readable codebase that walks you through training a working ChatGPT-style model end to end. The headline number on the repo is "the best ChatGPT that $100 can buy," and a leaner GPT-2 tier run lands at around $48. It's not production infrastructure. It's a learning tool, and a good one.

Key takeaways

  • Nanochat is Karpathy's small, readable training stack that takes you from zero to a working GPT-2 class model, with a GPT-2 tier run costing about $48 (the repo's headline number is "$100").
  • The $48 covers roughly two hours on an 8xH100 node, not 24 hours on a single RTX 4090; the speedrun model has around 561M parameters.
  • The whole project is about 8,000 lines, mostly Python with PyTorch plus Rust for the tokeniser, and the code is written to be read as a curriculum.
  • It teaches tokenisation, embeddings, attention, training dynamics, and generation strategies, with the concepts transferring directly to larger production systems.
  • It's the capstone for Karpathy's LLM101n course via Eureka Labs; broader claims of university and corporate adoption are unconfirmed.

Briefing

The best way to understand something is to build it. nanochat, Andrej Karpathy's minimal LLM training stack, is built on that idea. It takes you from "what's a transformer?" to training your own GPT-2 class model for about $48. With roughly 55,000 GitHub stars, it has become one of the most widely used teaching projects in AI.

Analysis

For most people, large language models are a black box. You type something in, an answer comes out, and the machinery in between stays hidden. Karpathy's bet with nanochat is that the box stops being scary the moment you build a small version of it yourself.

That's the story here. A single developer, a few hours of rented GPU time, and roughly $48 gets you a complete training run for a GPT-2 class model. Not a toy that prints "hello world," but a real pipeline: raw text in, a chatting model out. The thing that used to cost tens of thousands of dollars and a research lab now fits on a hobbyist's budget.

The repo has pulled in around 55,000 stars on GitHub (source), which tells you something about the appetite. People don't just want to use AI anymore. They want to understand what's actually happening under the hood. For a business team, that matters more than it sounds: the people who can explain why a model behaves the way it does are the ones who make sensible calls about where to use it.

The Educational Arc

Nanochat is laid out as a learning path. Each part of the code maps to a concept you need to grasp:

  • Data Pipeline → How do LLMs learn from text?
  • Tokenisation → How is text converted to numbers?
  • Architecture → What are transformers and how do they work?
  • Training Loop → How do models actually learn?
  • Inference → How do trained models generate text?

The harness covers tokenisation, pretraining, finetuning, evaluation, inference, and a chat UI, with the tokeniser trained in Rust and pretraining done on the FineWeb dataset (source). When you build each piece yourself with Karpathy's guidance, you pick up an intuition that reading papers never quite gives you.

The $48 Breakdown

The $48 figure is real, but it's worth being precise about where it comes from. The README's marquee number is "the best ChatGPT that $100 can buy." The $48 is the cheaper GPT-2 tier estimate further down, and it covers roughly two hours on an 8xH100 GPU node, with spot instances bringing it closer to $15 (source).

A common retelling of the breakdown gets the details wrong. It's sometimes described as a single RTX 4090 at about $2/hour running for 24 hours on a 124M-parameter GPT-2 small. That isn't accurate. The official run uses an 8xH100 node at roughly $24/hour for about two hours, and the speedrun model is around 561M parameters, not 124M. Both versions multiply out to $48, but the hardware, the hours, and the parameter count are all different.

If you have your own multi-GPU hardware, the cost drops to electricity. Some people have suggested cheaper hobbyist paths, such as a free Colab tier, but that isn't a supported or documented route. Nanochat is designed and tested for an 8xH100 or 8xA100 node, so a single free-tier GPU would be impractical for a full run. The point of the number isn't the exact dollar amount anyway. It's that training a real LLM is now within reach of an individual.

For context, the README itself notes that the original GPT-2 cost around $43,000 to train back in 2019 (source). That's the contrast worth sitting with.
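The arithmetic behind this section is easy to check for yourself. The sketch below uses only the approximate figures quoted above; the function name `run_cost` is made up for illustration, and real rental prices will vary by provider and day.

```python
# Figures as quoted in this article; hourly rates are approximate rental prices.
def run_cost(dollars_per_hour: float, hours: float) -> float:
    """Total rental cost of a training run."""
    return dollars_per_hour * hours

official = run_cost(dollars_per_hour=24.0, hours=2.0)   # 8xH100 node, about two hours
retold = run_cost(dollars_per_hour=2.0, hours=24.0)     # the inaccurate single-4090 retelling
gpt2_2019 = 43_000.0                                    # README's figure for the original GPT-2

print(f"official run:     ${official:.0f}")
print(f"common retelling: ${retold:.0f}")
print(f"2019 cost is roughly {gpt2_2019 / official:.0f} times today's")
```

Both routes multiply out to the same total, which is probably why the wrong version spreads so easily.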

Code as Curriculum

Nanochat's code is written to be read. The whole project is about 8,000 lines, mostly Python with PyTorch, plus a little Rust for the tokeniser (source). Each file works like a lesson:

# train.py, The training loop, heavily commented
# Each section explains WHY, not just HOW

# 1. Forward pass: predict the next token
# 2. Compute loss: how wrong were we?
# 3. Backward pass: how do we improve?
# 4. Update weights: apply the learning

The comments don't stop at what the code does. They explain the concepts behind it. Reading the source feels less like decoding a repo and more like sitting next to a patient tutor who explains every step.
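To make those four steps concrete, here is a minimal sketch in plain Python. It is not nanochat's code: it assumes a toy "bigram" model (one row of scores per current character) and a hand-derived gradient, where the real project uses PyTorch and a transformer. The names `logits`, `pairs`, and `lr` are illustrative.

```python
import math

text = "abababab"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = [stoi[ch] for ch in text]
pairs = list(zip(ids[:-1], ids[1:]))      # (current token, next token)

V = len(vocab)
logits = [[0.0] * V for _ in range(V)]    # the only weights: one row per current token
lr = 0.5

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

losses = []
for step in range(50):
    grads = [[0.0] * V for _ in range(V)]
    loss = 0.0
    for cur, nxt in pairs:
        probs = softmax(logits[cur])          # 1. forward pass: predict the next token
        loss += -math.log(probs[nxt])         # 2. loss: how wrong were we?
        for j in range(V):                    # 3. backward pass: d(loss)/d(logit) = p - onehot
            grads[cur][j] += probs[j] - (1.0 if j == nxt else 0.0)
    losses.append(loss / len(pairs))
    for i in range(V):                        # 4. update weights: step against the gradient
        for j in range(V):
            logits[i][j] -= lr * grads[i][j] / len(pairs)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The loss starts at ln 2 (a coin flip between two characters) and falls as the model learns that "a" is followed by "b". A full-size run has the same shape, with the gradient computed automatically.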

What You Learn

Working through nanochat leaves you with a real grasp of:

Tokenisation: Byte-pair encoding, how a vocabulary gets built, and why it shapes model performance.
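A rough sketch of how byte-pair encoding builds a vocabulary, in plain Python. This is an illustration of the general technique, not nanochat's Rust tokeniser; the helper names (`most_frequent_pair`, `merge`, `train_bpe`) are invented here.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of token ids that occurs most often."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` merges, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for k in range(num_merges):
        if len(ids) < 2:
            break
        pair = most_frequent_pair(ids)
        new_id = 256 + k                  # new tokens are numbered after the 256 byte values
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

ids, merges = train_bpe("low lower lowest", num_merges=2)
print(len("low lower lowest"), "bytes ->", len(ids), "tokens")
```

Each merge shortens the sequence, so a bigger vocabulary means fewer tokens per sentence. That is one reason the vocabulary shapes both cost and model performance.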

Embeddings: How words turn into vectors, positional encoding, and why context matters.
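As a small illustration, the sketch below looks up a vector for each token and adds a position signal. It assumes the classic sinusoidal scheme from the original transformer paper, which may not be the scheme nanochat itself uses, and the embedding table is random rather than trained.

```python
import math
import random

def sinusoidal_position(pos, dim):
    """Classic fixed positional encoding: sin on even indices, cos on odd ones."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

random.seed(0)
vocab_size, dim = 5, 4
# Embedding table: one vector per token id (random here; learned during training).
embedding = [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

token_ids = [3, 1, 3]   # the same token (3) appears at positions 0 and 2
inputs = [
    [e + p for e, p in zip(embedding[tok], sinusoidal_position(pos, dim))]
    for pos, tok in enumerate(token_ids)
]
print(inputs[0] != inputs[2])   # same token, different position, different input vector
```

Without the position signal, the two occurrences of token 3 would look identical to the model, and word order would be lost.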

Attention: The core transformer mechanism. Self-attention, multi-head attention, and why it works as well as it does.
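Here is a minimal sketch of single-head, causal self-attention in plain Python. For brevity it assumes the queries, keys, and values are all the raw input vectors; a real transformer first passes the input through learned projection matrices, and multi-head attention runs several of these in parallel on slices of the vector.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors. Position t may only attend to positions <= t.
    """
    d = len(q[0])
    out, weights = [], []
    for t in range(len(q)):
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)             # causal mask: future positions are skipped
        ]
        w = softmax(scores)                   # attention weights sum to 1
        weights.append(w)
        out.append([sum(w[s] * v[s][j] for s in range(t + 1)) for j in range(len(v[0]))])
    return out, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_self_attention(x, x, x)   # q = k = v = x, an assumption for brevity
for t, w in enumerate(weights):
    print(t, [round(p, 3) for p in w])
```

Each output vector is a weighted average of the value vectors the position is allowed to see, with the weights decided by how well its query matches each key.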

Training Dynamics: Gradient descent, learning rate schedules, overfitting, and convergence.
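A learning rate schedule is just a function from step number to step size. The sketch below assumes one common shape, linear warmup followed by cosine decay; the specific numbers are illustrative defaults chosen for this example, not nanochat's settings.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)             # hold at min_lr after total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step):.2e}")
```

The warmup avoids huge updates while the weights are still random, and the decay lets the model settle as training converges.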

Generation Strategies: Temperature, top-k, top-p, and how each one shapes the output.
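The three knobs can be sketched in one small function. This is a generic illustration, not nanochat's sampler; `sample_next` and its parameters are names chosen for this example.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token id from raw logits using temperature, top-k and top-p."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]    # low temperature sharpens, high flattens
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]                     # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:                           # keep the smallest set covering top_p mass
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)
print([sample_next(logits, temperature=0.8, top_k=3, rng=rng) for _ in range(10)])
```

Temperature reshapes the whole distribution, while top-k and top-p cut off the unlikely tail before sampling.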

Distributed Training: How to scale across multiple GPUs when one isn't enough.
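The core idea of data-parallel training can be simulated without any GPUs. The sketch below assumes a one-weight model and two pretend workers; in PyTorch the averaging step is normally handled for you (for example by `DistributedDataParallel`), and how nanochat wires it up is not covered here.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5
num_workers = 2

# Each simulated worker sees an equal slice of the batch.
shards = [data[i::num_workers] for i in range(num_workers)]
local_grads = [grad_mse(w, shard) for shard in shards]

# "All-reduce": average the per-worker gradients so every replica applies the same update.
averaged = sum(local_grads) / num_workers
full_batch = grad_mse(w, data)

print(local_grads, averaged, full_batch)
```

With equal-sized shards, the averaged gradient matches what a single device would compute on the whole batch, so adding GPUs buys speed without changing the maths.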

Beyond the Basics

For anyone who wants to push further, nanochat touches on heavier topics. Because it runs on a multi-GPU node and uses PyTorch, distributed training and mixed precision come with the territory. The README doesn't itemise every one of these as a separate teaching module, but the foundations are there to build on:

  • Mixed precision training: Faster training with lower memory use
  • Gradient checkpointing: Trade compute for memory
  • Model parallelism: Split models across devices
  • Custom architectures: Adapt the standard transformer for specific tasks
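To see why mixed precision needs some care, the sketch below uses Python's `struct` module to round numbers to 16-bit floats. It illustrates the general idea of loss scaling under the assumption of IEEE half precision (fp16); it is not taken from nanochat, and the gradient value and scale factor are made up.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8     # a small but meaningful gradient value
scale = 1024.0       # loss-scaling factor (a power of two, so scaling itself is exact)

lost = to_fp16(tiny_grad)                     # too small for fp16: rounds to zero
kept = to_fp16(tiny_grad * scale) / scale     # scale up, store in fp16, scale back down

print("stored directly:", lost)
print("with scaling:   ", kept)
```

Half precision saves memory and time, but very small gradients vanish unless they are scaled up before being stored. Training frameworks automate this step.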

The Community Effect

The nanochat community has a distinct feel. The issue tracker and discussions tend to draw a mix of people:

  • Beginners asking fundamental questions, and getting welcomed rather than mocked
  • Experienced practitioners sharing optimisations
  • Researchers comparing architectural variants
  • Educators using the project as course material

That mix is part of what makes it work. A beginner's question often turns into clearer documentation that helps everyone who comes after.

From nanochat to Production

Nanochat never claims to be production infrastructure. It's for learning. But the ideas carry straight across:

  • The data pipeline principles still apply to billion-parameter models
  • The training loop has the same shape, just at a larger scale
  • The generation strategies are identical
  • The debugging skills are exactly what you'll need

Anecdotally, people use it as a stepping stone toward working on production LLM systems and credit it for the groundwork.

Why 55,000 Stars Matter

The star count says something about reach, not just hype. Nanochat is the capstone project for LLM101n, a course from Karpathy's company Eureka Labs that runs through the full LLM lifecycle from data prep to reinforcement learning (source). That's the documented educational backbone.

Beyond the course, it's reportedly turned up in self-study by people across the field and in research teams poking at architectural variants. You'll sometimes see claims that universities like Stanford and MIT, or corporate training programmes at big tech firms, use it directly. Those aren't confirmed, so treat them as unverified word of mouth rather than fact.

In a market where pricey courses promise to teach you AI, nanochat hands a lot of it over for free. The stars read like a thank-you from people who learned something that stuck.

AI Kick Start is an Illawarra-based AI studio in Figtree, helping businesses across Wollongong, Shellharbour and Kiama and right across Australia put AI to work.