Chapter 5: Convolutions and Vision
The network from Chapter 3 treats each pixel as an independent input. It has no concept of “this pixel is next to that pixel.” That’s a massive waste of structure.
5.1 The problem with fully-connected layers for images
A 28x28 MNIST image has 784 pixels. A fully-connected layer with 128 neurons means 784 * 128 = 100,352 parameters. For a single layer.
Scale that to a 224x224 color image (150,528 inputs) and a 1024-neuron layer, and you’re at 154 million parameters before you’ve done anything useful. Most of those parameters are redundant because they’re learning the same patterns at different positions.
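The arithmetic, spelled out (weight counts only, ignoring biases):

```python
# One fully-connected layer on 28x28 grayscale MNIST, 128 neurons
mnist_fc = 28 * 28 * 128                # 100,352 weights

# One fully-connected layer on a 224x224 RGB image, 1024 neurons
imagenet_fc = 224 * 224 * 3 * 1024      # 154,140,672 weights -- ~154M

print(mnist_fc, imagenet_fc)
```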
5.2 What convolutions fix
A convolution slides a small filter (say 3x3) across the image, computing a dot product at each position. This means:
- Parameter sharing. One 3x3 filter = 9 parameters, applied everywhere. Instead of learning “detect an edge at position (5,5)” and separately “detect an edge at position (10,10),” you learn “detect an edge” once.
- Translation equivariance. A cat in the top-left is detected by the same filter as a cat in the bottom-right; shift the input and the feature map shifts with it. (The pooling layers we add later turn this into a degree of translation invariance.)
- Locality. Each neuron only sees a small patch. Deeper layers combine patches into larger patterns.
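To make the parameter-sharing point concrete, here is a back-of-the-envelope comparison between one dense layer and a bank of 32 filters (the layer sizes are illustrative, counting weights plus biases):

```python
# One fully-connected layer on MNIST: every pixel connects to every neuron
fc = 784 * 128 + 128            # 100,480 parameters

# 32 conv filters, each 3x3, on a 1-channel input: shared everywhere
conv = 32 * (3 * 3 * 1) + 32    # 320 parameters

print(fc // conv)               # the dense layer is ~300x larger
```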
5.3 The convolution operation
```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" convolution: the kernel stays fully inside the image,
    # so the output shrinks by (kernel size - 1) in each dimension.
    # (Strictly speaking this is cross-correlation -- deep learning
    # libraries skip the kernel flip, and so do we.)
    h, w = image.shape
    kh, kw = kernel.shape
    output = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            patch = image[i:i+kh, j:j+kw]
            output[i, j] = np.sum(patch * kernel)
    return output
```

That’s a convolution. Four lines of logic. The GPU-optimized version is 1000x faster, but this is what it computes.
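A quick check that this really behaves like an edge detector. The conv2d function is repeated here so the snippet runs on its own, and the image and kernel are just illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    h, w = image.shape
    kh, kw = kernel.shape
    output = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            output[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return output

# 6x6 image: dark left half, bright right half
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Sobel-style kernel that responds to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

out = conv2d(img, sobel_x)
print(out.shape)   # (4, 4) -- "valid" convolution shrinks the image
print(out[0])      # [0. 4. 4. 0.] -- strong response only at the edge
```

Every row of the output is the same: zero in the flat regions, a big positive response where the dark half meets the bright half.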
5.4 Building a CNN
```python
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),   # 28x28 -> 28x28, 32 channels
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(32, 64, 3, padding=1),  # 14x14 -> 14x14, 64 channels
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # flatten to (batch, 64*7*7)
        return self.classifier(x)
```

This network has ~420K parameters compared to the MLP’s ~110K — most of them in the first Linear layer — but it’ll hit 99%+ on MNIST. The structure matters more than the parameter count.
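A sanity check on that parameter count, layer by layer (weights plus biases, with shapes read off the code above):

```python
# Conv2d(in_ch, out_ch, k): out_ch * in_ch * k * k weights + out_ch biases
conv1 = 32 * 1 * 3 * 3 + 32          # 320
conv2 = 64 * 32 * 3 * 3 + 64         # 18,496
# Linear(in_features, out_features): out * in weights + out biases
fc1 = 128 * (64 * 7 * 7) + 128       # 401,536
fc2 = 10 * 128 + 10                  # 1,290

total = conv1 + conv2 + fc1 + fc2
print(total)                         # 421,642
```

Note that the two convolutional layers together are under 5% of the total; the flatten-to-Linear step is where the parameters live.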
5.5 What each layer learns
- Layer 1 filters learn edges: horizontal, vertical, diagonal
- Layer 2 filters learn combinations: corners, curves, simple shapes
- The classifier maps shapes to digits
You can visualize the filters after training. The patterns are surprisingly interpretable.
5.6 Exercises
- Visualize the learned filters from layer 1. Do they look like edge detectors?
- Remove the MaxPool layers and use stride=2 in the Conv layers instead. Compare.
- Train on CIFAR-10 (color images, 10 classes). How does accuracy compare?
5.7 What’s next
Chapter 6: Sequences and recurrence. Images have spatial structure. Text and time series have temporal structure. Different problem, different architecture.