Chapter 5: Convolutions and Vision
The network from Chapter 3 treats each pixel as an independent input. It has no concept of “this pixel is next to that pixel.” That’s a massive waste of structure.
5.1 The problem with fully-connected layers for images
A 28x28 MNIST image has 784 pixels. A fully-connected layer with 128 neurons means 784 * 128 = 100,352 parameters. For a single layer.
Scale that to a 224x224 color image (150,528 inputs) and a 1024-neuron layer, and you’re at 154 million parameters before you’ve done anything useful. Most of those parameters are redundant because they’re learning the same patterns at different positions.
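The arithmetic, spelled out (weight counts only, ignoring biases):

```python
# One fully-connected layer on 28x28 grayscale MNIST, 128 neurons
mnist_fc = 28 * 28 * 128                # 100,352 weights

# One fully-connected layer on a 224x224 RGB image, 1024 neurons
imagenet_fc = 224 * 224 * 3 * 1024      # 154,140,672 weights -- ~154M

print(mnist_fc, imagenet_fc)
```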
5.2 What convolutions fix
A convolution slides a small filter (say 3x3) across the image, computing a dot product at each position. This means:
- Parameter sharing. One 3x3 filter = 9 parameters, applied everywhere. Instead of learning “detect an edge at position (5,5)” and separately “detect an edge at position (10,10),” you learn “detect an edge” once.
- Translation equivariance. A cat in the top-left is detected by the same filter as a cat in the bottom-right; shift the input and the feature map shifts with it. (The pooling layers we add later turn this into a degree of translation invariance.)
- Locality. Each neuron only sees a small patch. Deeper layers combine patches into larger patterns.
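To make the parameter-sharing point concrete, here is a back-of-the-envelope comparison between one dense layer and a bank of 32 filters (the layer sizes are illustrative, counting weights plus biases):

```python
# One fully-connected layer on MNIST: every pixel connects to every neuron
fc = 784 * 128 + 128            # 100,480 parameters

# 32 conv filters, each 3x3, on a 1-channel input: shared everywhere
conv = 32 * (3 * 3 * 1) + 32    # 320 parameters

print(fc // conv)               # the dense layer is ~300x larger
```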
5.3 The convolution operation
```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" convolution: the kernel stays fully inside the image,
    # so the output shrinks by (kernel size - 1) in each dimension.
    # (Strictly speaking this is cross-correlation -- deep learning
    # libraries skip the kernel flip, and so do we.)
    h, w = image.shape
    kh, kw = kernel.shape
    output = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            patch = image[i:i+kh, j:j+kw]
            output[i, j] = np.sum(patch * kernel)
    return output
```

That’s a convolution. Four lines of logic. The GPU-optimized version is 1000x faster, but this is what it computes.
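A quick check that this really behaves like an edge detector. The conv2d function is repeated here so the snippet runs on its own, and the image and kernel are just illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    h, w = image.shape
    kh, kw = kernel.shape
    output = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            output[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return output

# 6x6 image: dark left half, bright right half
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Sobel-style kernel that responds to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

out = conv2d(img, sobel_x)
print(out.shape)   # (4, 4) -- "valid" convolution shrinks the image
print(out[0])      # [0. 4. 4. 0.] -- strong response only at the edge
```

Every row of the output is the same: zero in the flat regions, a big positive response where the dark half meets the bright half.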
5.4 Building a CNN
```python
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),   # 28x28 -> 28x28, 32 channels
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(32, 64, 3, padding=1),  # 14x14 -> 14x14, 64 channels
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # flatten to (batch, 64*7*7)
        return self.classifier(x)
```

This network has ~420K parameters compared to the MLP’s ~110K — most of them in the first Linear layer — but it’ll hit 99%+ on MNIST. The structure matters more than the parameter count.
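A sanity check on that parameter count, layer by layer (weights plus biases, with shapes read off the code above):

```python
# Conv2d(in_ch, out_ch, k): out_ch * in_ch * k * k weights + out_ch biases
conv1 = 32 * 1 * 3 * 3 + 32          # 320
conv2 = 64 * 32 * 3 * 3 + 64         # 18,496
# Linear(in_features, out_features): out * in weights + out biases
fc1 = 128 * (64 * 7 * 7) + 128       # 401,536
fc2 = 10 * 128 + 10                  # 1,290

total = conv1 + conv2 + fc1 + fc2
print(total)                         # 421,642
```

Note that the two convolutional layers together are under 5% of the total; the flatten-to-Linear step is where the parameters live.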
5.5 What each layer learns
- Layer 1 filters learn edges: horizontal, vertical, diagonal
- Layer 2 filters learn combinations: corners, curves, simple shapes
- The classifier maps shapes to digits
You can visualize the filters after training. The patterns are surprisingly interpretable.
5.6 Exercises
- Visualize the learned filters from layer 1. Do they look like edge detectors?
- Remove the MaxPool layers and use stride=2 in the Conv layers instead. Compare.
- Train on CIFAR-10 (color images, 10 classes). How does accuracy compare?
5.7 What’s next
Chapter 6: Sequences and recurrence. Images have spatial structure. Text and time series have temporal structure. Different problem, different architecture.