4 Chapter 3: Your First Neural Network
This is where you stop reading about neural networks and start building one.
4.1 What we’re building
A feedforward neural network that classifies handwritten digits (MNIST). Two hidden layers, ReLU activations, softmax output. Trained with gradient descent. No frameworks.
By the end of this chapter, your network will hit ~97% accuracy. Not state of the art, but real, and you’ll understand every line.
4.2 The architecture
Input (784 pixels) → Hidden Layer 1 (128 neurons, ReLU)
→ Hidden Layer 2 (64 neurons, ReLU)
→ Output (10 classes, Softmax)
Each arrow is a matrix multiplication plus a bias addition. That’s a “layer.”
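Concretely, the first arrow looks like this (shapes and the `0.01` scale match the network code below; the batch size of 32 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 784))                   # batch of 32 flattened 28x28 images
W1 = rng.standard_normal((784, 128)) * 0.01  # weights for the first arrow
b1 = np.zeros(128)                           # bias for the first arrow

h = x @ W1 + b1  # one "layer": matrix multiplication plus bias
print(h.shape)   # (32, 128)
```

The bias is broadcast across the batch: NumPy adds the length-128 vector `b1` to every row of the `(32, 128)` product.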
4.3 Forward pass
The forward pass computes the output given an input. It’s just matrix math:
```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class Network:
    def __init__(self):
        self.W1 = np.random.randn(784, 128) * 0.01
        self.b1 = np.zeros(128)
        self.W2 = np.random.randn(128, 64) * 0.01
        self.b2 = np.zeros(64)
        self.W3 = np.random.randn(64, 10) * 0.01
        self.b3 = np.zeros(10)

    def forward(self, x):
        self.z1 = x @ self.W1 + self.b1
        self.a1 = np.maximum(0, self.z1)   # ReLU
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = np.maximum(0, self.z2)   # ReLU
        self.z3 = self.a2 @ self.W3 + self.b3
        self.a3 = softmax(self.z3)         # probabilities
        return self.a3
```

If you understand this code, you understand what a neural network does. Everything in deep learning is a variation on this pattern.
4.4 Backpropagation
Backprop computes how much each weight contributed to the error, then adjusts it. You work backwards through the network using the chain rule.
The key insight: at each layer, you compute the gradient of the loss with respect to that layer’s weights, then pass the gradient backward to the previous layer.
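That claim about the output layer deserves a check: for softmax followed by cross-entropy loss, the gradient with respect to the pre-softmax logits collapses to `output - y`. Here is a standalone finite-difference verification (a single-example sketch, separate from the `Network` class):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, y):
    return -np.sum(y * np.log(p))

z = np.array([0.5, -1.2, 2.0])   # arbitrary logits
y = np.array([0.0, 1.0, 0.0])    # one-hot target

analytic = softmax(z) - y        # the claimed gradient

# Central finite differences on the composed loss
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(softmax(zp), y)
                  - cross_entropy(softmax(zm), y)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```

This is why the code below starts with `dz3 = output - y` rather than differentiating softmax and cross-entropy separately.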
```python
    def backward(self, x, y, output, lr=0.01):
        m = x.shape[0]
        # Output layer gradient
        dz3 = output - y                         # cross-entropy + softmax gradient
        dW3 = self.a2.T @ dz3 / m
        db3 = dz3.mean(axis=0)
        # Hidden layer 2
        dz2 = (dz3 @ self.W3.T) * (self.z2 > 0)  # ReLU derivative
        dW2 = self.a1.T @ dz2 / m
        db2 = dz2.mean(axis=0)
        # Hidden layer 1
        dz1 = (dz2 @ self.W2.T) * (self.z1 > 0)
        dW1 = x.T @ dz1 / m
        db1 = dz1.mean(axis=0)
        # Update
        self.W3 -= lr * dW3
        self.b3 -= lr * db3
        self.W2 -= lr * dW2
        self.b2 -= lr * db2
        self.W1 -= lr * dW1
        self.b1 -= lr * db1
```

4.5 Training loop
```python
for epoch in range(50):
    output = net.forward(X_train)
    loss = cross_entropy(output, y_train)
    net.backward(X_train, y_train, output)
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
```

Run it. Watch the loss drop. That's learning.
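The loop assumes a `cross_entropy` helper. One minimal batched version (a sketch, an assumption of this chapter's text, consistent with the `dz3 = output - y` gradient used in `backward`):

```python
import numpy as np

def cross_entropy(output, y):
    """Mean cross-entropy over a batch.

    output: (m, 10) softmax probabilities
    y:      (m, 10) one-hot labels
    """
    m = y.shape[0]
    # Clip to avoid log(0) on confidently wrong predictions
    return -np.sum(y * np.log(np.clip(output, 1e-12, 1.0))) / m

probs = np.array([[0.7, 0.2, 0.1]])
labels = np.array([[1.0, 0.0, 0.0]])
print(cross_entropy(probs, labels))  # -log(0.7) ≈ 0.357
```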
4.6 What you should notice
- Weight initialization matters. Too large and gradients explode; too small and they vanish. The `* 0.01` isn't arbitrary.
- ReLU is dead simple: just `max(0, x)`. Its derivative is 0 or 1. That simplicity is why it works so well.
- Batch size changes behavior. Try training one example at a time vs. the full dataset at once. Notice the tradeoff between noise and speed.
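To experiment with the batch-size tradeoff, you need a way to slice the training set. The `minibatches` helper below is an illustration, not part of the chapter's network; each epoch you would call `net.forward` and `net.backward` once per batch instead of once on all of `X_train`:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive (X_batch, y_batch) slices."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sl = idx[start:start + batch_size]
        yield X[sl], y[sl]

# Demo on a tiny fake dataset: 50 examples, batch size 16
rng = np.random.default_rng(0)
X = np.arange(100).reshape(50, 2).astype(float)
y = np.eye(10)[np.arange(50) % 10]
sizes = [len(xb) for xb, _ in minibatches(X, y, 16, rng)]
print(sizes)  # [16, 16, 16, 2]
```

Batch size 1 gives you noisy but frequent updates (stochastic gradient descent); the full dataset gives you one smooth update per epoch.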
4.7 Exercises
- Swap ReLU for sigmoid. What happens to training speed?
- Add a third hidden layer. Does accuracy improve?
- Reduce the hidden layer sizes to 32 and 16. How low can you go before accuracy drops?
4.8 What’s next
Chapter 4: you implement the same network in PyTorch to see what frameworks actually do for you (and what they hide).