5 Chapter 4: Frameworks and What They Hide
You built a neural network from scratch. Now rebuild it in PyTorch, and pay attention to what changes and what stays the same.
5.1 The same network, fewer lines
```python
import torch
import torch.nn as nn

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10),
        )

    def forward(self, x):
        return self.layers(x)
```

That's the whole model. Compare it to the 30+ lines from Chapter 3.
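As a quick sanity check, here is a sketch of pushing a dummy batch through the same architecture (built inline with nn.Sequential so the snippet stands alone; the batch of random tensors is a stand-in for real MNIST data):

```python
import torch
import torch.nn as nn

# Same architecture as the Network class above, built inline.
net = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(32, 784)   # a batch of 32 flattened 28x28 "images"
logits = net(x)
print(logits.shape)        # torch.Size([32, 10]) -- one score per class
```

Shape checks like this catch most wiring mistakes before you ever start training.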
5.2 What disappeared
Three things vanished:
- Weight initialization. PyTorch picks sensible defaults (Kaiming uniform for linear layers). You can override them, but the defaults work.
- Backpropagation. loss.backward() computes all gradients automatically. No manual chain rule.
- Weight updates. The optimizer handles it: optimizer.step().
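If you do want to override the default initialization, it is a one-liner per parameter. A sketch, assuming you prefer Xavier-uniform weights and zero biases (a common alternative, not the chapter's requirement):

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 128)   # already Kaiming-uniform initialized by PyTorch

# Override with Xavier-uniform weights and zero biases, in place:
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

print(layer.bias.abs().sum().item())   # 0.0 -- biases are now all zero
```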
5.3 What autograd actually does
When you call loss.backward(), PyTorch walks backward through a computation graph it built during the forward pass. Every operation (matmul, add, ReLU) recorded itself and knows how to compute its own gradient.
This is the same chain rule math you wrote by hand. PyTorch just automates it.
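You can watch the chain rule happen on a toy function. For y = (3x)^2, the hand-derived gradient is dy/dx = 2 · (3x) · 3 = 18x, and autograd reproduces it:

```python
import torch

# y = (3x)^2, so dy/dx = 18x by the chain rule.
x = torch.tensor(2.0, requires_grad=True)
y = (3 * x) ** 2
y.backward()       # autograd walks the recorded graph backward
print(x.grad)      # tensor(36.), i.e. 18 * 2 -- matches the hand derivation
```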
5.4 The training loop
```python
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):
    output = net(X_train)
    loss = loss_fn(output, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

5.5 What you're trading
Convenience costs understanding. When you use a framework:
- You can’t easily see the gradients flowing through your network
- Debugging shape mismatches becomes harder (the error is inside the framework, not your code)
- You trust the framework’s implementation is correct (it usually is, but “usually” has burned people)
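"Can't easily see" is not "can't see at all": after loss.backward(), every parameter's gradient sits in its .grad attribute. A minimal sketch, where the tiny model and random batch are stand-ins for your real network and data:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
x = torch.randn(8, 4)                # stand-in batch of 8 inputs
y = torch.randint(0, 2, (8,))        # stand-in class labels

loss = nn.CrossEntropyLoss()(net(x), y)
loss.backward()

# Each parameter's gradient is stored alongside the parameter itself:
for name, p in net.named_parameters():
    print(name, tuple(p.grad.shape))
```

Printing gradient norms per layer this way is a cheap first diagnostic for vanishing or exploding gradients.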
The point of Chapter 3 wasn’t to teach you to never use frameworks. It was to make sure that when you use one, you know what it’s doing.
5.6 When to go manual, when to use frameworks
Use frameworks for anything you ship. The optimizations (GPU kernels, mixed precision, distributed training) are not things you want to rewrite.
Go manual when you’re learning a new concept, debugging a weird training behavior, or implementing a paper that does something nonstandard.
5.7 Exercises
- Replace SGD with Adam. What changes in training dynamics?
- Add dropout between layers. Train with and without it. Compare test accuracy.
- Try torch.compile() and benchmark the speed difference.
5.8 What’s next
Chapter 5: Convolutional networks. You’ll learn why fully-connected layers are the wrong tool for images, and build a CNN that beats your MLP.