Skip to content

Chapter 17: Putting It All Together

Code file: src/ch16_full_pipeline.py Run it: python src/ch16_full_pipeline.py Expected time: ~20-30 minutes on CPU


Theory

You've Made It

Let's take a moment to look back at what we've built:

  1. Tokenizer (Ch 3): Converts Shakespeare text to integers and back.
  2. Embeddings (Ch 4): Maps each token to a 128-dimensional vector.
  3. Self-attention (Ch 5): Each token gathers information from others.
  4. Multi-head attention (Ch 6): 4 attention heads, 4 perspectives.
  5. Feed-forward layer (Ch 7): Each token processes what it learned.
  6. Transformer block (Ch 8): One complete unit with residual connections.
  7. GPT model (Ch 9): 4 blocks stacked, plus embedding and LM head.
  8. Training loop (Ch 12): Gradient descent for 3000 steps.
  9. Generation (Ch 15): Temperature + top-k sampling.

The full pipeline runs all of these steps in sequence. This is the same architecture as GPT-2 Small (117M parameters), just about 142x smaller.

What does "same architecture" mean? GPT-2 uses the exact same building blocks you just built: character/token embeddings, self-attention with multiple heads, transformer blocks with residual connections and layer norm, and gradient descent training. The only differences are scale: GPT-2 uses 768-dimensional embeddings (we use 128), 12 transformer layers (we use 4), and it was trained on 40GB of internet text (we use 1MB of Shakespeare). The code structure is identical. If you understand what you built here, you understand GPT-2, and by extension the core of GPT-3, GPT-4, and all modern LLMs.


The Complete Pipeline

shakespeare.txt
    |
[tokenize]
    |
integer tokens  →  DataLoader (batches of 32 × 128)
    |
[model] 825K-parameter GPT
    |
[training loop] 3000 steps, Adam optimizer
    |
checkpoints/model_final.pt
    |
[generation] temperature=0.8, top_k=40
    |
Shakespeare-like text!

What "Shakespeare-like" Actually Looks Like

After 3000 training steps, the model typically produces output like:

ROMEO:
What is it thou dost say?
I cannot hold my peace. The king hath not
Yet made me know his pleasure.

JULIET:
Good night, good night! Parting is such sweet
sorrow that I shall say good night till it be morrow.

It's not perfect Shakespeare. It mixes up characters, makes grammatical mistakes, and sometimes trails off. But it: - Uses correct character name formatting - Forms mostly grammatical sentences - Uses period-appropriate vocabulary - Occasionally produces surprisingly beautiful lines

All from 825,000 numbers, trained in 20-30 minutes!


What Would Make It Better?

More steps and more parameters: - 10,000 steps → noticeably better - Larger model (n_embd=256, n_layers=6) → better still - GPT-2 Small (117M parameters) → impressively good

These would require a GPU to train in a reasonable time. But the code is essentially the same, just bigger numbers in the config.


What You've Actually Learned

By completing this tutorial, you now understand: - How language models work (next-token prediction) - What the Transformer architecture actually does, layer by layer - How training works (gradient descent, backpropagation, Adam) - What "parameters" are and how they're learned - How text generation works (autoregressive sampling)

The fundamental building blocks - embeddings, self-attention, transformer blocks, and gradient descent - are the same ones used to build GPT-4, Llama, Gemini, and every other modern LLM. Modern models add refinements on top, but these are optimizations, not different fundamentals: - Rotary positional embeddings: A better way to encode positions than the lookup table we built. Works more reliably at very long sequences. - Grouped-query attention: A faster version of multi-head attention that shares some weight matrices across heads. Reduces memory and compute. - Instruction tuning: After pretraining (which is what we did), models are fine-tuned on human-written question/answer examples to be more helpful and less likely to produce harmful content. - RLHF (Reinforcement Learning from Human Feedback): Further training that uses human preference rankings to make the model's responses more helpful and aligned with what people actually want.

The foundation - transformer blocks, self-attention, gradient descent training - is exactly what you built. Everything else is refinement.


Code

File: src/ch16_full_pipeline.py Run it: python src/ch16_full_pipeline.py

This is the single self-contained end-to-end script. It: - Downloads the data if not present - Tokenizes - Creates DataLoaders - Builds the model (prints param count) - Trains for 3000 steps (logs loss every 300 steps) - Saves a checkpoint - Generates 400 characters of Shakespeare-like text

You can also use it as a "smoke test" - a quick end-to-end check that nothing is broken. (The term comes from electronics: plug in a new circuit, if it doesn't literally smoke, you're probably fine.) If ch16_full_pipeline.py runs start-to-finish without errors, the whole tutorial is working correctly.


Key Takeaways

  • The full pipeline = tokenize → embed → 4 transformer blocks → train → generate.
  • 825K parameters, trained in ~20-30 minutes on CPU.
  • Same architecture as GPT-2, just much smaller.
  • Generation uses temperature=0.8, top_k=40 for good balance.
  • The fundamentals here are the same ones used in production LLMs.

Congratulations!

You started with an empty folder and ended with a working language model.

You didn't just run someone else's code - you built every component yourself, from character-level tokenization through to text generation, and you understand why each piece is there.

You didn't just follow a recipe. You built the kitchen, designed the recipe, and cooked a meal that generates Shakespeare. Not bad for a weekend project.

That's the real achievement.