Briefing
The best way to understand something is to build it. nanochat, Andrej Karpathy's minimal LLM training stack, is built on that idea. It takes you from "what's a transformer?" to training your own GPT-2 class model for about $48. With roughly 55,000 GitHub stars, it has become one of the most widely used teaching projects in AI.
Analysis
For most people, large language models are a black box. You type something in, an answer comes out, and the machinery in between stays hidden. Karpathy's bet with nanochat is that the box stops being scary the moment you build a small version of it yourself.
That's the story here. A single developer, a few hours of rented GPU time, and roughly $48 gets you a complete training run for a GPT-2 class model. Not a toy that prints "hello world," but a real pipeline: raw text in, a chatting model out. The thing that used to cost tens of thousands of dollars and a research lab now fits on a hobbyist's budget.
The repo has pulled in around 55,000 stars on GitHub (source), which tells you something about the appetite. People don't just want to use AI anymore. They want to understand what's actually happening under the hood. For a business team, that matters more than it sounds: the people who can explain why a model behaves the way it does are the ones who make sensible calls about where to use it.
The Educational Arc
Nanochat is laid out as a learning path. Each part of the code maps to a concept you need to grasp:
- Data Pipeline → How do LLMs learn from text?
- Tokenisation → How is text converted to numbers?
- Architecture → What are transformers and how do they work?
- Training Loop → How do models actually learn?
- Inference → How do trained models generate text?
The harness covers tokenisation, pretraining, finetuning, evaluation, inference, and a chat UI, with the tokeniser trained in Rust and pretraining done on the FineWeb dataset (source). When you build each piece yourself with Karpathy's guidance, you pick up an intuition that reading papers never quite gives you.
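To make the "text to numbers" step concrete, here is a minimal byte-pair encoding sketch in plain Python. It is not nanochat's tokeniser (which the article notes is trained in Rust); the function names `train_bpe`, `encode`, and `decode` are hypothetical and chosen for illustration. The idea: start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent pair into a new token id.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id_a, id_b) -> new_id, kept in the order the rules were learned
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids


def encode(text, merges):
    """Tokenise `text` by replaying the learned merges in order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to text by expanding each merged id into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


text = "low lower lowest low low"
merges, ids = train_bpe(text, num_merges=5)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Each merge shortens the sequence, which is why vocabulary size trades off against sequence length. Real tokenisers add details this sketch leaves out, such as pre-splitting text with a regex and handling special tokens.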
The $48 Breakdown
The $48 figure is real, but it's worth being precise about where it comes from. The README's marquee number is "the best ChatGPT that $100 can buy." The $48 is the cheaper GPT-2 tier estimate further down, and it covers roughly two hours on an 8XH100 GPU node, with spot instances bringing it closer to $15 (source).
A common retelling of the breakdown gets the details wrong. It's sometimes described as a single RTX 4090 at about $2/hour running for 24 hours on a 124M-parameter GPT-2 small. That isn't accurate. The official run uses an 8XH100 node at roughly $24/hour, and the speedrun model is around 561M parameters, not 124M. The dollar total happens to land in the same place, but the hardware, the hours, and the parameter count are all different.
If you have your own multi-GPU hardware, the cost drops to electricity. Some people have suggested cheaper hobbyist paths, such as a free Colab tier, but that isn't a supported or documented route. Nanochat is designed and tested for an 8XH100/8XA100 node, so a single free-tier GPU would be impractical for a full run. The point of the number isn't the exact dollar amount anyway. It's that training a real LLM is now within reach of an individual.
For context, the README itself notes that the original GPT-2 cost around $43,000 to train back in 2019 (source). That's the contrast worth sitting with.
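The arithmetic behind these figures is short enough to check directly. The sketch below uses only the approximate numbers quoted in this section; the implied spot rate is derived from the $15 figure and is not a quoted price.

```python
# Approximate figures quoted in this section, not live cloud prices.
NODE_RATE_USD_PER_HOUR = 24.0   # 8XH100 node, on-demand
RUN_HOURS = 2.0                 # rough wall-clock time for the GPT-2 tier run
SPOT_RUN_COST_USD = 15.0        # rough total with spot instances
GPT2_2019_COST_USD = 43_000.0   # figure the README cites for the original GPT-2

run_cost = NODE_RATE_USD_PER_HOUR * RUN_HOURS
implied_spot_rate = SPOT_RUN_COST_USD / RUN_HOURS   # derived, not quoted
ratio = GPT2_2019_COST_USD / run_cost

print(f"on-demand run cost: ${run_cost:.0f}")                    # $48
print(f"implied spot rate:  ${implied_spot_rate:.2f}/hour")       # $7.50/hour
print(f"2019 cost is roughly {ratio:.0f}x the on-demand run")     # 896x
```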
Code as Curriculum
Nanochat's code is written to be read. The whole project is about 8,000 lines, mostly Python with PyTorch, plus a little Rust for the tokeniser (source). Each file works like a lesson:
# train.py: the training loop, heavily commented
# Each section explains WHY, not just HOW
# 1. Forward pass: predict the next token
# 2. Compute loss: how wrong were we?
# 3. Backward pass: how do we improve?
# 4. Update weights: apply the learning
The comments don't stop at what the code does. They explain the concepts behind it. Reading the source feels less like decoding a repo and more like sitting next to a patient tutor who explains every step.
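The four numbered steps above can be sketched without any framework. What follows is not nanochat's training code: the project uses PyTorch and a transformer, while this illustration swaps in a tiny character-level bigram model with hand-derived gradients so it runs on the standard library alone. The names `train_bigram` and `softmax` are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(tokens, vocab_size, steps=50, lr=1.0):
    """Fit a table W where W[x][y] scores 'token y follows token x'."""
    W = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(tokens, tokens[1:]))  # (current token, next token)
    n = len(pairs)
    losses = []
    for _ in range(steps):
        grad = [[0.0] * vocab_size for _ in range(vocab_size)]
        loss = 0.0
        for x, y in pairs:
            probs = softmax(W[x])           # 1. forward pass: predict the next token
            loss -= math.log(probs[y])      # 2. loss: cross-entropy on the true next token
            for j in range(vocab_size):     # 3. backward pass: d(loss)/d(logit) = prob - one_hot
                grad[x][j] += probs[j] - (1.0 if j == y else 0.0)
        for i in range(vocab_size):         # 4. update: plain gradient descent on the mean loss
            for j in range(vocab_size):
                W[i][j] -= lr * grad[i][j] / n
        losses.append(loss / n)
    return W, losses


text = "hello world hello world"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
tokens = [stoi[ch] for ch in text]

W, losses = train_bigram(tokens, vocab_size=len(chars))
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The first loss equals log(vocab_size) because an untrained table predicts every next character with equal probability. In a framework such as PyTorch, step 3 is handled by automatic differentiation and step 4 by an optimiser, but the shape of the loop is the same.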
What You Learn
Working through nanochat leaves you with a real grasp of:
Tokenisation: Byte-pair encoding, how a vocabulary gets built, and why it shapes model performance.
Embeddings: How words turn into vectors, positional encoding, and why context matters.
Attention: The core transformer mechanism. Self-attention, multi-head attention, and why it works as well as it does.
Training Dynamics: Gradient descent, learning rate schedules, overfitting, and convergence.
Generation Strategies: Temperature, top-k, top-p, and how each one shapes the output.
Distributed Training: How to scale across multiple GPUs when one isn't enough.
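The generation strategies in the list above are easy to see in code. This is a minimal sketch, not nanochat's inference code: `sample_next` is a hypothetical helper that takes raw next-token scores (logits) and applies temperature, then top-k, then top-p filtering before drawing one token id. The filter order is one common choice and is an assumption here.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token id from `logits` with optional temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (at least one).
    if top_k is not None:
        order = order[:max(1, top_k)]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # Draw from the surviving tokens; random.choices renormalises the weights.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)
print(sample_next(logits, temperature=0.8, top_k=3, top_p=0.9, rng=rng))
```

With `top_k=1` the function always returns the most likely token (greedy decoding). Loosening any of the three knobs lets less likely tokens through, which is where the variety in generated text comes from.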
Beyond the Basics
For anyone who wants to push further, nanochat touches on heavier topics. Because it runs on a multi-GPU node and uses PyTorch, distributed training and mixed precision come with the territory. The README doesn't itemise every one of these as a separate teaching module, but the foundations are there to build on:
- Mixed precision training: Faster training with lower memory use
- Gradient checkpointing: Trade compute for memory
- Model parallelism: Split models across devices
- Custom architectures: Adapt the standard transformer for specific tasks
The Community Effect
The nanochat community has a distinct feel. The issue tracker and discussions tend to draw a mix of people:
- Beginners asking fundamental questions, and getting welcomed rather than mocked
- Experienced practitioners sharing optimisations
- Researchers comparing architectural variants
- Educators using the project as course material
That mix is part of what makes it work. A beginner's question often turns into clearer documentation that helps everyone who comes after.
From nanochat to Production
Nanochat never claims to be production infrastructure. It's for learning. But the ideas carry straight across:
- The data pipeline principles still apply to billion-parameter models
- The training loop has the same shape, just at a larger scale
- The generation strategies are identical
- The debugging skills are exactly what you'll need
People who have used it as a stepping stone toward working on production LLM systems often credit it for the groundwork.
Why 55,000 Stars Matter
The star count says something about reach, not just hype. Nanochat is the capstone project for LLM101n, a course from Karpathy's company Eureka Labs that runs through the full LLM lifecycle from data prep to reinforcement learning (source). That's the documented educational backbone.
Beyond the course, it's reportedly turned up in self-study by people across the field and in research teams poking at architectural variants. You'll sometimes see claims that universities like Stanford and MIT, or corporate training programmes at big tech firms, use it directly. Those aren't confirmed, so treat them as unverified word of mouth rather than fact.
In a market where pricey courses promise to teach you AI, nanochat hands a lot of it over for free. The stars read like a thank-you from people who learned something that stuck.


