4 Chapter 3: Your First Neural Network
This is where you stop reading about neural networks and start building one.
4.1 What we’re building
A feedforward neural network that classifies handwritten digits (MNIST). Two hidden layers, ReLU activations, softmax output. Trained with gradient descent. No frameworks.
By the end of this chapter, your network will hit ~97% accuracy. Not state of the art, but real, and you’ll understand every line.
4.2 The architecture
Input (784 pixels) → Hidden Layer 1 (128 neurons, ReLU)
→ Hidden Layer 2 (64 neurons, ReLU)
→ Output (10 classes, Softmax)
Each arrow is a matrix multiplication plus a bias addition. That’s a “layer.”
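Concretely, the first arrow looks like this (shapes and the `0.01` scale match the network code below; the batch size of 32 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 784))                   # batch of 32 flattened 28x28 images
W1 = rng.standard_normal((784, 128)) * 0.01  # weights for the first arrow
b1 = np.zeros(128)                           # bias for the first arrow

h = x @ W1 + b1  # one "layer": matrix multiplication plus bias
print(h.shape)   # (32, 128)
```

The bias is broadcast across the batch: NumPy adds the length-128 vector `b1` to every row of the `(32, 128)` product.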
4.3 Forward pass
The forward pass computes the output given an input. It’s just matrix math:
```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class Network:
    def __init__(self):
        self.W1 = np.random.randn(784, 128) * 0.01
        self.b1 = np.zeros(128)
        self.W2 = np.random.randn(128, 64) * 0.01
        self.b2 = np.zeros(64)
        self.W3 = np.random.randn(64, 10) * 0.01
        self.b3 = np.zeros(10)

    def forward(self, x):
        self.z1 = x @ self.W1 + self.b1
        self.a1 = np.maximum(0, self.z1)   # ReLU
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = np.maximum(0, self.z2)   # ReLU
        self.z3 = self.a2 @ self.W3 + self.b3
        self.a3 = softmax(self.z3)         # probabilities
        return self.a3
```

If you understand this code, you understand what a neural network does. Everything in deep learning is a variation on this pattern.
4.4 Backpropagation
Backprop computes how much each weight contributed to the error, then adjusts it. You work backwards through the network using the chain rule.
The key insight: at each layer, you compute the gradient of the loss with respect to that layer’s weights, then pass the gradient backward to the previous layer.
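That claim about the output layer deserves a check: for softmax followed by cross-entropy loss, the gradient with respect to the pre-softmax logits collapses to `output - y`. Here is a standalone finite-difference verification (a single-example sketch, separate from the `Network` class):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, y):
    return -np.sum(y * np.log(p))

z = np.array([0.5, -1.2, 2.0])   # arbitrary logits
y = np.array([0.0, 1.0, 0.0])    # one-hot target

analytic = softmax(z) - y        # the claimed gradient

# Central finite differences on the composed loss
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(softmax(zp), y)
                  - cross_entropy(softmax(zm), y)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```

This is why the code below starts with `dz3 = output - y` rather than differentiating softmax and cross-entropy separately.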
```python
    def backward(self, x, y, output, lr=0.01):
        m = x.shape[0]
        # Output layer gradient
        dz3 = output - y                         # cross-entropy + softmax gradient
        dW3 = self.a2.T @ dz3 / m
        db3 = dz3.mean(axis=0)
        # Hidden layer 2
        dz2 = (dz3 @ self.W3.T) * (self.z2 > 0)  # ReLU derivative
        dW2 = self.a1.T @ dz2 / m
        db2 = dz2.mean(axis=0)
        # Hidden layer 1
        dz1 = (dz2 @ self.W2.T) * (self.z1 > 0)
        dW1 = x.T @ dz1 / m
        db1 = dz1.mean(axis=0)
        # Update
        self.W3 -= lr * dW3
        self.b3 -= lr * db3
        self.W2 -= lr * dW2
        self.b2 -= lr * db2
        self.W1 -= lr * dW1
        self.b1 -= lr * db1
```

4.5 Training loop
```python
for epoch in range(50):
    output = net.forward(X_train)
    loss = cross_entropy(output, y_train)
    net.backward(X_train, y_train, output)
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
```

Run it. Watch the loss drop. That's learning.
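The loop assumes a `cross_entropy` helper. One minimal batched version (a sketch, an assumption of this chapter's text, consistent with the `dz3 = output - y` gradient used in `backward`):

```python
import numpy as np

def cross_entropy(output, y):
    """Mean cross-entropy over a batch.

    output: (m, 10) softmax probabilities
    y:      (m, 10) one-hot labels
    """
    m = y.shape[0]
    # Clip to avoid log(0) on confidently wrong predictions
    return -np.sum(y * np.log(np.clip(output, 1e-12, 1.0))) / m

probs = np.array([[0.7, 0.2, 0.1]])
labels = np.array([[1.0, 0.0, 0.0]])
print(cross_entropy(probs, labels))  # -log(0.7) ≈ 0.357
```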
4.6 What you should notice
- Weight initialization matters. Too large and gradients explode; too small and they vanish. The `* 0.01` isn't arbitrary.
- ReLU is dead simple: just `max(0, x)`. Its derivative is 0 or 1. That simplicity is why it works so well.
- Batch size changes behavior. Try training one example at a time vs. the full dataset at once. Notice the tradeoff between noise and speed.
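To experiment with the batch-size tradeoff, you need a way to slice the training set. The `minibatches` helper below is an illustration, not part of the chapter's network; each epoch you would call `net.forward` and `net.backward` once per batch instead of once on all of `X_train`:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive (X_batch, y_batch) slices."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sl = idx[start:start + batch_size]
        yield X[sl], y[sl]

# Demo on a tiny fake dataset: 50 examples, batch size 16
rng = np.random.default_rng(0)
X = np.arange(100).reshape(50, 2).astype(float)
y = np.eye(10)[np.arange(50) % 10]
sizes = [len(xb) for xb, _ in minibatches(X, y, 16, rng)]
print(sizes)  # [16, 16, 16, 2]
```

Batch size 1 gives you noisy but frequent updates (stochastic gradient descent); the full dataset gives you one smooth update per epoch.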
4.7 Exercises
- Swap ReLU for sigmoid. What happens to training speed?
- Add a third hidden layer. Does accuracy improve?
- Reduce the hidden layer sizes to 32 and 16. How low can you go before accuracy drops?
4.8 What’s next
Chapter 4: you implement the same network in PyTorch to see what frameworks actually do for you (and what they hide).